THE BOOK--Playing The Percentages In Baseball


Monday, October 27, 2008

Fielding Forecasts

By Tangotiger, 11:37 AM

Courtesy of Rally, and he does what I think we all should do: regress toward the Fans’, not zero.


#1    Dackle      (see all posts) 2008/10/28 (Tue) @ 04:14

Could I get a little bit nitpicky here, even though it has little to do with the actual article? I see this a lot...:

These are based on 5 years of data from 2004-2008, weighted 1, .8, .6, .4, .2.

... and it always strikes me as very arbitrary. If you have the data handy, I wonder what the results would be if you regressed 2004-2007 to project 2008? I have a feeling you won’t get 1, .8, .6 etc.

Just to compare, predicting next year’s point total for hockey players using the last two years, I get weights of: .66 (last year) + .28 (two years ago)

For baseball I get similar numbers—around .60-.70 for the most recent year (depending on the study’s assumptions), .20-.30 for two years ago, and you can almost throw the third year out the window without losing much.


#2    Tangotiger      (see all posts) 2008/10/28 (Tue) @ 09:34

Brian tested it, for hitting, as:
T^.7
or so

So, last year would be .7, the year before would be .5, the year before that would be .35.
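A minimal sketch of that decay, assuming "T^.7" is read as 0.7 raised to the number of years back (T = 1 for last year). That reading is an interpretation chosen because it reproduces the rounded weights quoted above; the function name is hypothetical.

```python
def decay_weights(base: float, n_years: int) -> list[float]:
    """Weight for a season T years back, read here as base ** T (T = 1 is last year)."""
    return [base ** t for t in range(1, n_years + 1)]


# Assumed reading of the comment: base 0.7, three seasons back.
weights = decay_weights(0.7, 3)
print([round(w, 2) for w in weights])
```

Under this reading the unrounded weights are 0.7, 0.49 and 0.343, which the comment rounds to .7, .5 and .35.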

If you are getting .25 for two years ago, I’d like to see the assumptions of your sample that would lead to that.


#3    Dackle      (see all posts) 2008/10/28 (Tue) @ 12:13

I don’t have any fresh baseball data handy, but the hockey study is recent and the conclusions are similar.

Predicting next year’s points from the points in the last two seasons, with no adjustment for scoring levels or for whether the player even played in both seasons, gives an equation of: .602 (last year) + .214 (two years ago). I got similar numbers for baseball, but always thought the heavy weight on the most recent year was there to account for rookies and players with very little playing time two years ago. But ... weighting the numbers by the fewest games played over the three-year period (including the forecast year, so that a player who played 80 games in two seasons and then only 5 in the forecast year doesn’t skew the results) gives .660 (last year) + .279 (two years ago). Not much different.

If you start restricting your dataset to just full-time players, then your weights start to flatten toward the conventional 5-4-3.

I’ll try and come up with some robust baseball data over the next few weeks.


#4    Tangotiger      (see all posts) 2008/10/28 (Tue) @ 12:39

I can believe the hockey, if for no other reason than ice time, and ice time on the powerplay. Those are huge parameters.

In any case, I look forward to whatever you’ve got…


#5    Dackle      (see all posts) 2008/11/02 (Sun) @ 15:24

Tango, I ran the numbers this afternoon: predicting year 4 runs created based on RC in years 1-3. The data covers all players from 1920 to 2007 with at least 1 PA in each of four consecutive years, and runs created were rebased to a league of 4.50 RC per 27 outs. There are about 25,000 data points.

For all players, year 4 RC = .629 (year 3) + .196 (year 2) + .029 (year 1).

If you restrict the data to players with at least 200 PAs in each of the last 3 years, then year 4 = .579 (year 3) + .228 (year 2) + .088 (year 1)

Upping the minimum cutoff to 500 PAs each year, the formula is .560 (year 3) + .239 (year 2) + .110 (year 1).

There’s a bit of built-in regression there as the sum of the weights is below 1.00.
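To make the "built-in regression" concrete, here is a small sketch that sums each set of reported coefficients and rescales them so the most recent year has weight 1. The coefficients are copied from the three equations above; reading "1 minus the sum" as the amount of regression is a loose, face-value interpretation, and the function names are hypothetical.

```python
# Coefficients as reported above, ordered (year 3, year 2, year 1).
FITS = {
    "all players (1+ PA)": (0.629, 0.196, 0.029),
    "200+ PA each year": (0.579, 0.228, 0.088),
    "500+ PA each year": (0.560, 0.239, 0.110),
}


def implied_regression(coefs):
    """Share of the forecast not carried by past seasons: 1 - sum(coefs)."""
    return 1.0 - sum(coefs)


def relative_weights(coefs):
    """Rescale the coefficients so the most recent season has weight 1."""
    return tuple(c / coefs[0] for c in coefs)


for label, coefs in FITS.items():
    rel = ", ".join(f"{w:.2f}" for w in relative_weights(coefs))
    print(f"{label}: sum={sum(coefs):.3f}, "
          f"implied regression={implied_regression(coefs):.1%}, "
          f"relative weights=({rel})")
```

The sums come to .854, .895 and .909, so the shortfall from 1.00 shrinks as the playing-time cutoff rises.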

If anyone else is interested in this topic, here’s a spreadsheet to play around with: http://www.editgrid.com/user/dackle/forecasting_min1. Year and age in columns D and E correspond to “y” (columns J and P), “y-1” is one year ago, “y-2” is two years ago etc. “Min” is the minimum of “y-1”, “y-2” and “y-3”.


#6    Colin Wyers      (see all posts) 2008/11/02 (Sun) @ 18:11

Dackle, try it with RC/27. That should help stabilize the amount of regression you’re applying.


#7    tangotiger      (see all posts) 2008/11/02 (Sun) @ 22:33

Great job.  My initial reaction to your first list was that since RC is both a rate and playing time stat:
RC ~= PA * (wOBA/1.15 - .18)
And so, the most recent season may be biased towards those guys with the most playing time.
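A quick sketch of why a raw RC total mixes rate and playing time, using the approximation quoted above. The wOBA value and the PA totals below are made up for illustration, and the function name is hypothetical.

```python
def approx_rc(pa: float, woba: float) -> float:
    """Approximate runs created from the formula quoted above: PA * (wOBA/1.15 - .18)."""
    return pa * (woba / 1.15 - 0.18)


# Hypothetical hitter: identical rate (wOBA), two different playing-time totals.
full_season = approx_rc(600, 0.345)
half_season = approx_rc(300, 0.345)
print(round(full_season, 1), round(half_season, 1))
```

With the rate held fixed, RC scales directly with PA, so a regression on raw RC totals is partly a regression on playing time.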

However, your third equation, forcing at least 500 PA in each of the three years, removed that issue for me.

If you add up the three coefficients (.56, .24, .11), that gives you .91, meaning that you have 9% regression toward the mean. How much should we have expected? Presuming an average of 600 PA each year, if we use weights of 100%, 80%, 64%, then we have a weighted PA of 1440, and our regression toward the mean equation (that Marcel uses) is 240 / (1440 + 240) = 14%. A little too much regression.

Furthermore, since we know the weights should be 56/24/11, that would mean: 100%, 43%, 20%.  600 PA each year means a weighted PA of 978.  A 9% regression toward the mean would imply: 100 / (100+978) = 9% regression.

This would mean that having just 100 PA means you can regress 50% toward the mean.
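The arithmetic in this comment can be checked with a short sketch of the k / (weighted PA + k) form it uses. The function names are hypothetical; the constants 240 and 100 and the year weights are taken from the text. Note that with weights of 100%, 80%, 64% the weighted PA works out to 1464 rather than the 1440 quoted (1440 would correspond to weights of 100%, 80%, 60%); either way the regression share rounds to 14%.

```python
def weighted_pa(pa_by_year, weights):
    """Sum of plate appearances, each season scaled by its year weight."""
    return sum(pa * w for pa, w in zip(pa_by_year, weights))


def regression_share(wpa, k):
    """Fraction regressed toward the mean: k / (weighted PA + k)."""
    return k / (wpa + k)


pa = [600, 600, 600]  # most recent season first, as presumed in the comment

wpa_decay = weighted_pa(pa, [1.0, 0.8, 0.64])    # weights as stated in the text
share_decay = regression_share(wpa_decay, 240)   # constant of 240 from the text

wpa_fitted = weighted_pa(pa, [1.0, 0.43, 0.20])  # weights implied by 56/24/11
share_fitted = regression_share(wpa_fitted, 100)

share_100_pa = regression_share(100, 100)        # a player with only 100 weighted PA

print(f"{wpa_decay:.0f} PA -> {share_decay:.1%}")
print(f"{wpa_fitted:.0f} PA -> {share_fitted:.1%}")
print(f"100 PA -> {share_100_pa:.0%}")
```

The last line is the point of the objection: a constant of 100 means a hitter is regressed only halfway to the mean after just 100 weighted PA.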

The numbers just don’t seem right: they run contrary to what I got when I first created the Marcels some six or so years ago, and they also seem at odds with what Brian got.

I’ll take a look at it more later in the week to confirm, or find something else.


#8    Dackle      (see all posts) 2008/11/02 (Sun) @ 23:46

Actually, I tried projecting RC/PA, weighted by either (1) minimum PAs over entire forecast period (ie y, y-1, y-2 and y-3), or (2) minimum PAs over the “known” years (y-1, y-2, y-3—when we’re forecasting we don’t know the exact PAs in year y). So, if we’ve got a player like this ...

Year  PAs  RC  RC/PA
2003  500 100  .200
2004  400  80  .200
2005  500  80  .160
2006  100  15  .150

... in approach (1) I multiplied all four RC/PA figures by 100, and in approach (2) by 400.
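The two weighting choices can be reproduced from the example table with a short sketch. It only computes the per-player weight and the RC/PA rates; how those weights enter the regression itself is not specified in the comment, so that step is left out. Variable names are hypothetical.

```python
# (year, PA, RC) rows from the example table; 2006 is the forecast year "y".
seasons = [
    (2003, 500, 100),
    (2004, 400, 80),
    (2005, 500, 80),
    (2006, 100, 15),
]
forecast_year = 2006

# RC/PA rate for each season, matching the table's last column.
rates = {year: round(rc / pa, 3) for year, pa, rc in seasons}

# Approach (1): minimum PA over all four seasons, forecast year included.
weight_all_years = min(pa for _, pa, _ in seasons)

# Approach (2): minimum PA over the "known" seasons only (y-1, y-2, y-3).
weight_known_years = min(pa for year, pa, _ in seasons if year != forecast_year)

print(rates)
print(weight_all_years, weight_known_years)
```

For this player the weight is 100 under approach (1) and 400 under approach (2), matching the multipliers described above.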

Anyway, for (1), the equation was:

.436 (y-1) + .300 (y-2) + .234 (y-3) + .082

For (2), it was:

.447 (y-1) + .305 (y-2) + .194 (y-3) + .438

I don’t know how to interpret those intercepts in terms of the implied regression.

So I think a lot of the reason why the numbers reported earlier are steeper than the traditional 5-4-3 is because the playing time forecast is mixed in with the raw RC totals. The regression on PAs alone for the dataset is: .673 (y-1) + .155 (y-2) + .021 (y-3) + 26.8.

I suppose that supports the idea of forecasting the rate and then multiplying by the expected playing time, instead of just applying weights to the raw totals.
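A minimal sketch of that two-step idea, applied to the example player above. The playing-time coefficients are the ones from the PA-only regression in this comment. The rate forecast is a made-up placeholder, since the intercepts of the rate equations are left uninterpreted here. Function and variable names are hypothetical.

```python
# Coefficients and intercept from the PA-only regression quoted above.
PA_COEFS = (0.673, 0.155, 0.021)  # (y-1, y-2, y-3)
PA_INTERCEPT = 26.8


def forecast_pa(pa_y1: float, pa_y2: float, pa_y3: float) -> float:
    """Expected plate appearances in year y from the three prior seasons."""
    c1, c2, c3 = PA_COEFS
    return c1 * pa_y1 + c2 * pa_y2 + c3 * pa_y3 + PA_INTERCEPT


def forecast_rc(rate: float, pa: float) -> float:
    """Rate forecast times expected playing time."""
    return rate * pa


# Example player: 500 PA in 2005 (y-1), 400 in 2004 (y-2), 500 in 2003 (y-3).
expected_pa = forecast_pa(500, 400, 500)

# Placeholder RC/PA forecast, NOT derived from the equations in the thread.
HYPOTHETICAL_RATE = 0.18
expected_rc = forecast_rc(HYPOTHETICAL_RATE, expected_pa)

print(round(expected_pa, 1), round(expected_rc, 1))
```

Keeping the two forecasts separate means the year weights on the rate no longer have to absorb the much steeper weights that playing time seems to need.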

