Monday, October 27, 2008
Fielding Forecasts
Courtesy of Rally, and he does what I think we all should do: regress toward the Fans’, not zero.
Brian tested it, for hitting, as:
.7^T
or so, where T is the number of years back.
So, last year would be .7, the year before would be .5, the year before that would be .35.
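The decay described above can be sketched in a few lines, reading it as .7 raised to the power T, with T = 1 for last year. The function name and defaults are only illustrative.

```python
def decay_weights(base=0.7, seasons=3):
    """Weight for each past season: base ** T, where T = 1 is last year."""
    return [base ** t for t in range(1, seasons + 1)]

weights = decay_weights()
# Roughly .7, .49 and .34, close to the .7 / .5 / .35 quoted in the comment.
print([round(w, 3) for w in weights])
```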
If you are getting .25 for two years ago, I’d like to see the assumptions of your sample that would lead to that.
I don’t have any fresh baseball data handy, but the hockey study is recent and the conclusions are similar.
Predicting next year’s points from the points in the last two seasons, with no adjustment for scoring levels or for whether the player even played in both seasons, resulted in an equation of: .602 (last year) + .214 (two years ago). I got similar numbers for baseball, but always thought the heavy weight on the most recent year was there to account for rookies and players with very little playing time two years ago. But weighting the numbers by fewest games played over the three-year period (including the forecast year, so that a player who played 80 games in two seasons and then only 5 in the forecast year doesn’t skew the results) results in .660 (last year) + .279 (two years ago). Not much different.
If you start restricting your dataset to just full-time players, then your weights start to flatten toward the conventional 5-4-3.
I’ll try and come up with some robust baseball data over the next few weeks.
I can believe the hockey, if for no other reason than ice time, and ice time on the powerplay. Those are huge parameters.
In any case, I look forward to whatever you’ve got…
Tango, I ran the numbers this afternoon, predicting year 4 runs created based on RC in years 1-3. The data covers all players from 1920 to 2007 with at least 1 PA in four consecutive years, and runs created were rebased to a league of 4.50 RC per 27 outs. There are about 25,000 data points.
For all players, year 4 RC = .629 (year 3) + .196 (year 2) + .029 (year 1).
If you restrict the data to players with at least 200 PAs in each of the last 3 years, then year 4 = .579 (year 3) + .228 (year 2) + .088 (year 1)
Upping the minimum cutoff to 500 PAs each year, the formula is .560 (year 3) + .239 (year 2) + .110 (year 1).
There’s a bit of built-in regression there as the sum of the weights is below 1.00.
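As a quick check on the built-in regression, the sketch below sums the coefficients of the three equations and treats one minus the sum as the implied regression. That reading is an assumption about how to interpret a fit with no intercept, and the labels and function name are illustrative.

```python
# Coefficients (year 3, year 2, year 1) as reported in the comment above.
EQUATIONS = {
    "all players": (0.629, 0.196, 0.029),
    "min 200 PA": (0.579, 0.228, 0.088),
    "min 500 PA": (0.560, 0.239, 0.110),
}

def implied_regression(coefs):
    """One minus the sum of the weights."""
    return 1.0 - sum(coefs)

for label, coefs in EQUATIONS.items():
    print(f"{label}: weights sum to {sum(coefs):.3f}, "
          f"implied regression {implied_regression(coefs):.1%}")
```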
If anyone else is interested in this topic, here’s a spreadsheet to play around with: http://www.editgrid.com/user/dackle/forecasting_min1. Year and age in columns D and E correspond to “y” (columns J and P), “y-1” is one year ago, “y-2” is two years ago etc. “Min” is the minimum of “y-1”, “y-2” and “y-3”.
Dackle, try it with RC/27. That should help stabilize the amount of regression you’re applying.
Great job. My initial reaction to your first list was that since RC is both a rate and playing time stat:
RC ~= PA * (wOBA/1.15 - .18)
And so, the most recent season may be biased towards those guys with the most playing time.
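To see how playing time leaks into raw RC, here is a small sketch of the approximation above. The two players and their wOBA of .340 are made-up inputs, not data from the study.

```python
def approx_rc(pa, woba):
    """Approximate runs created: PA * (wOBA / 1.15 - .18)."""
    return pa * (woba / 1.15 - 0.18)

# Same hitting rate, very different playing time (hypothetical players).
full_time = approx_rc(600, 0.340)
part_time = approx_rc(150, 0.340)

# The RC totals differ by a factor of four even though the rate is identical.
print(round(full_time, 1), round(part_time, 1))
```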
However, your third equation, forcing at least 500 PA in each of three years, removed that issue for me.
If you add up the three coefficients (.56, .24, .11), that gives you .91, meaning that you have 9% regression toward the mean. How much should we have expected? Presuming an average of 600 PA each year, if we use weights of 100%, 80%, 64%, then we have a weighted PA of 1464, and our regression toward the mean equation (that Marcel uses) is 240 / (1464 + 240) = 14%. A little too much regression.
Furthermore, since we know the weights should be 56/24/11, that would mean: 100%, 43%, 20%. 600 PA each year means a weighted PA of 978. A 9% regression toward the mean would imply: 100 / (100+978) = 9% regression.
This would mean that having just 100 PA means you can regress 50% toward the mean.
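The arithmetic in this comment can be reproduced with a short sketch. The regression share is taken as k / (k + weighted PA), with k = 240 and k = 100 as quoted above; the function names are illustrative.

```python
def weighted_pa(pa_by_year, weights):
    """Sum of plate appearances, each season discounted by its weight."""
    return sum(pa * w for pa, w in zip(pa_by_year, weights))

def regression_share(wpa, k):
    """Fraction regressed toward the mean: k / (k + weighted PA)."""
    return k / (k + wpa)

pa = [600, 600, 600]  # most recent season first

# Weights of 100%, 80%, 64% with k = 240.
geometric = weighted_pa(pa, [1.00, 0.80, 0.64])
print(round(geometric), f"{regression_share(geometric, 240):.1%}")

# Weights of 100%, 43%, 20% (from 56/24/11) with k = 100.
fitted = weighted_pa(pa, [1.00, 0.43, 0.20])
print(round(fitted), f"{regression_share(fitted, 100):.1%}")

# With k = 100, a player with only 100 PA is regressed halfway to the mean.
print(f"{regression_share(100, 100):.0%}")
```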
The numbers just don’t seem right. They run contrary to what I got when I first created the Marcels some six or so years ago, and they also seem at odds with what Brian got.
I’ll take a look at it more later in the week to confirm, or find something else.
Actually, I tried projecting RC/PA, weighted by either (1) minimum PAs over entire forecast period (ie y, y-1, y-2 and y-3), or (2) minimum PAs over the “known” years (y-1, y-2, y-3—when we’re forecasting we don’t know the exact PAs in year y). So, if we’ve got a player like this ...
Year  PAs  RC   RC/PA
2003  500  100  .200
2004  400   80  .200
2005  500   80  .160
2006  100   15  .150
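Reading 2006 as the forecast year y (an assumption, since the rows are not labeled that way), the two weighting schemes would give this player the following weights. The dictionary keys are just illustrative labels.

```python
# Plate appearances for the example player; 2006 is assumed to be year y.
pa = {"y-3": 500, "y-2": 400, "y-1": 500, "y": 100}

# (1) Minimum PAs over the entire forecast period, including year y.
weight_all_years = min(pa.values())

# (2) Minimum PAs over the known years only.
weight_known_years = min(pa[k] for k in ("y-1", "y-2", "y-3"))

print(weight_all_years, weight_known_years)
```

Under (1) the thin forecast year drags the weight down, while under (2) it has no effect.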
Anyway, for (1), the equation was:
.436 (y-1) + .300 (y-2) + .234 (y-3) + .082
For (2), it was:
.447 (y-1) + .305 (y-2) + .194 (y-3) + .438
I don’t know how to interpret those intercepts in terms of the implied regression.
So I think a lot of the reason why the numbers reported earlier are steeper than the traditional 5-4-3 is because the playing time forecast is mixed in with the raw RC totals. The regression on PAs alone for the dataset is: .673 (y-1) + .155 (y-2) + .021 (y-3) + 26.8.
I suppose that supports the idea of forecasting the rate and then multiplying by the expected playing time, instead of just applying weights to the raw totals.
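A minimal sketch of that two-step approach, under loud assumptions: playing time uses the PA regression quoted above, while the rate is a plain weighted average using the (2) weights, renormalized to sum to 1 with the intercept dropped. That renormalization is a simplification for illustration, not the fitted equation, and the function names are hypothetical.

```python
# PA regression from the comment above: (y-1, y-2, y-3) plus intercept.
PA_COEFS = (0.673, 0.155, 0.021)
PA_INTERCEPT = 26.8

# Rate weights borrowed from equation (2); the intercept is ignored here
# and the weights are renormalized, which is an assumption of this sketch.
RATE_WEIGHTS = (0.447, 0.305, 0.194)

def project_pa(pa_y1, pa_y2, pa_y3):
    """Expected plate appearances from the last three seasons."""
    seasons = (pa_y1, pa_y2, pa_y3)
    return sum(c * pa for c, pa in zip(PA_COEFS, seasons)) + PA_INTERCEPT

def project_rate(rate_y1, rate_y2, rate_y3):
    """Weighted average of RC/PA, most recent season first."""
    seasons = (rate_y1, rate_y2, rate_y3)
    total = sum(RATE_WEIGHTS)
    return sum(w * r for w, r in zip(RATE_WEIGHTS, seasons)) / total

def project_rc(pas, rates):
    """Forecast the rate, then multiply by expected playing time."""
    return project_pa(*pas) * project_rate(*rates)

# The example player from earlier, taking 2005 as y-1.
rc = project_rc(pas=(500, 400, 500), rates=(0.160, 0.200, 0.200))
print(round(project_pa(500, 400, 500), 1), round(rc, 1))
```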
Could I get a little bit nitpicky here, even though it has little to do with the actual article? I see this a lot...:
These are based on 5 years of data from 2004-2008, weighted 1, .8, .6, .4, .2.
... and it always strikes me as very arbitrary. If you have the data handy, I wonder what the results would be if you regressed 2004-2007 to project 2008? I have a feeling you won’t get 1, .8, .6 etc.
Just to compare, predicting next year’s point total for hockey players using the last two years, I get weights of: .66 (last year) + .28 (two years ago)
For baseball I get similar numbers—around .60-.70 for the most recent year (depending on the study’s assumptions), .20-.30 for two years ago, and you can almost throw the third year out the window without losing much.