Tuesday, February 15, 2011
Testing the 2007-2010 Forecasting Systems - Official Results
Marcel, Chone, Oliver, Pecota, Zips.
Background
You can skip all this if you don’t want to be bored with historic details.
A few weeks ago, Brian Cartwright, creator of the Oliver forecasting system, sent me an email. He informed me that he had collected the past published forecasts of Chone, Marcel, Pecota, and Zips for 2007-2010. He went through the arduous task of linking each forecast for each player to a common ID. And he asked for my advice on testing.
I told him that it doesn't matter what I say: because he was the one doing the analysis, it's going to be taken with a grain of salt. As it should be, because no matter how honest you intend to be, there's a potential for bias, however unintended. And publication bias demands that if you do well, you publish the results, and if you don't do well, you don't publish the results (or you publish only from particular angles).
Chone was created by Sean Smith, and he has a vested interest in it doing well. Pecota was created by Nate Silver and is now owned by Baseball Prospectus (currently run by Colin Wyers), and they have a vested interest in it doing well. Zips was created by Dan Szymborski, and he has a vested interest in it doing well. Oliver is Brian's baby as noted, and he has a vested interest in it doing well. In all cases, the creators tinker or otherwise make wholesale changes to their systems every year. They are constantly looking for an improvement.
Marcel was conceived by yours truly at Tangotiger.net. Its algorithm was published seven years ago. The first results were published seven years ago. And there have been zero changes to the algorithm since it was first published seven years ago. Though occasionally Marcel begs for Pon Farr, I don't allow it. The purpose of Marcel was to serve as a baseline, a minimum competence level against which all systems would be tested. I have no vested interest in it doing well. Indeed, quite the opposite: I want the others to beat Marcel.
So, Brian sent me his dataset. I am treating this dataset as an assumption of fact. This means that not only am I accepting that Brian properly collected everyone's forecasts, but also that he properly assigned the unique player ID that links everyone's forecast for each player. This is something that I have done many times in the past; it is a monumental effort, and I had no desire to do it once more. I am IDed-out.
I should also note that I correspond occasionally with all the creators. Other potential systems that could have been included were Bill James’, Ron Shandler’s, Mitchel Lichtman’s (published only on request), Pete Palmer’s (published only on request), with all of whom I also correspond occasionally. The only reason they are not in this test is that Brian did not collect their forecasts and/or didn’t link a player ID to their forecasts.
One other important note: while the forecasts for Chone, Marcel, Pecota, and Zips were collected at the time they were published, the Oliver "forecasts" were based on the 2011 engine, and therefore are not true forecasts. They are what the forecasts would have been had Oliver published them based on the 2011 engine. Because of this, the reader is free to discard any and all Oliver results below. I did contact the creators of the other systems and offered them the opportunity to use their current 2011 engines to recreate past forecasts. So far, none have accepted this offer.
Details
You can skip all this if you don’t want to be bored with how I set up the data.
STEP 1
The first thing I decided to do is look only at the common elements being forecasted. This means that my universe of stats was limited to:
AB, H, 2B, 3B, HR, BB
And that’s it.
STEP 2
I did have to resolve one small issue, and that was that Chone, and only in 2010, excluded IBB from his BB forecasts. I tried two different methods:
a. Simply treated everyone’s forecasted IBB for 2010 for Chone as 0, and therefore, let stand his BB forecast as if it included IBB
b. Used Marcel’s forecasted IBB for 2010 for Chone.
I am happy to say that regardless of which way it went, it conferred no advantage to Chone’s overall results. Because of this, I decided to go with option b.
STEP 3
I also had to resolve one bigger issue, and that was that not all forecasters provided forecasts for every player.
Marcel has an explicit note on its forecasting page that any player that is not in its downloadable forecasting file is treated as if he has a league average forecast. The other systems don’t have that provision. So, what to do, what to do.
Well, what I decided upon was the following course of action. I looked at what the actual results were of the players not forecasted for each system. And, to no surprise, those missing players ended up performing at a bench level performance, with an average wOBA of .310 to .320. Therefore, I decided to give each of the missing players a fixed forecast of .315 (while maintaining .335 for Marcel). I should note that it didn’t really matter what I did here. I could have made it .310 or .335, and the results barely budged. This is because each of the forecasting systems forecasted a vast majority of the players.
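For anyone coding along, the fill rule is trivial. Here's a minimal Python sketch; the names and data shapes are mine, purely illustrative:

```python
def forecast_or_default(system, player_id, forecasts):
    # Players a system did not forecast get a fixed bench-level wOBA;
    # Marcel's documented provision is league average (.335) instead.
    default = 0.335 if system == "marcel" else 0.315
    return forecasts[system].get(player_id, default)
```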
STEP 4
Then I had to decide on a metric, and that was simple enough: wOBA. wOBA is nothing more than Linear Weights, and Linear Weights is the best metric for batting. The equation I used was:
(0.7*BB + 0.9*H + 0.35*2B + 0.65*3B + 1.1*HR) / (AB + BB)
As a reminder, BB includes IBB, and excludes hit batters.
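If it helps to see that in code, here's a minimal Python sketch. One assumption on my part: I'm reading H as total hits, so the 2B/3B/HR weights act as increments on top of the base hit value (a HR is worth 0.9 + 1.1 = 2.0 in the numerator, in line with standard linear weights).

```python
def woba(ab, h, doubles, triples, hr, bb):
    # BB includes IBB and excludes hit batters, per the note above.
    num = 0.7 * bb + 0.9 * h + 0.35 * doubles + 0.65 * triples + 1.1 * hr
    return num / (ab + bb)
```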
STEP 5
Finally, I had to align everyone’s forecast to a common baseline. Suppose Marcel forecasts Ibanez for a .340 wOBA in a league of .340, and Chone forecasts the same player with a .340 wOBA in a league of .330, and Ibanez comes in at a .338 wOBA in a league of .338.
The way I describe this situation is that Marcel is its own universe. It could have forecasted Ibanez for a .350 wOBA in a league of .350, and it would make no difference. Marcel forecasted Ibanez to be a league average player. We really don’t care if we forecast what the entire league is going to do, because the league is dependent on such externalities as the weather, the strike zone, the ball, and the bat. None of which we care about for forecasting (unless we can isolate it to a particular player or park).
So, I reset each system's forecasts such that their population mean was .335, for each and every year. I also reset the actual results to a wOBA of .335 for each and every year. I did this with a simple subtraction or addition. The population mean was weighted by the actual PA in the year in question.
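In code, the re-centering is one additive shift per system per year. A sketch, assuming parallel lists of forecasts and actual PA (names are mine):

```python
def recenter(forecasts, pa, target=0.335):
    # Shift every forecast by a constant so the PA-weighted mean of the
    # population lands exactly on the target. The actual results get the
    # same treatment.
    mean = sum(f * w for f, w in zip(forecasts, pa)) / sum(pa)
    shift = target - mean
    return [f + shift for f in forecasts]
```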
Results
Don’t skip this part. It’s the reason you are here.
In the tests that follow, the metric of choice is the absolute error between the forecast and the actual, weighted by the actual number of plate appearances (PA). No player results have been discarded.
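In code, the metric is simply this (a sketch, with names of my own choosing):

```python
def weighted_mae(forecast, actual, pa):
    # PA-weighted mean absolute error between forecasted and actual wOBA.
    total = sum(pa)
    return sum(abs(f - a) * w for f, a, w in zip(forecast, actual, pa)) / total
```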
TEST #1 - OVERALL
There were 2480 hitters in my dataset, totalling 707,694 PA (AB+BB). This is their ranking:
0.0272 Chone
0.0273 Oliver
0.0277 Zips
0.0278 Marcel
0.0280 Pecota
TEST #2 - SEASONAL
This is the breakdown by season:
Season Chone Marcel Oliver Pecota Zips ALL
2007 0.0281 0.0280 0.0277 0.0277 0.0279 0.0279
2008 0.0263 0.0276 0.0269 0.0274 0.0272 0.0271
2009 0.0261 0.0275 0.0265 0.0286 0.0274 0.0272
2010 0.0283 0.0281 0.0279 0.0283 0.0285 0.0282
ALL 0.0272 0.0278 0.0273 0.0280 0.0277 0.0276
The one that stands out the most is Pecota 2009. There was a transition from Nate Silver to Clay Davenport to Colin Wyers right around that time. While obviously the results are the results, and any reader relying on the 2009 Pecota forecasts would have felt the pinch, it is possible that the results are not indicative of the overall strength of the system. If I change the results for Pecota 2009 from an error of .0286 to .0275 (to match the worst in the group for that year, Marcel), the overall average error for Pecota 2007-2010 falls to .0277, tied with Zips, and just ahead of Marcel.
Once again, the reader can choose to treat Pecota in this result as either equal to Zips or a bit worse than Marcel, however he prefers.
TEST #3 - RELIABILITY
This is where the fun starts, and this is where we’re going to learn something.
The most important part of a forecasting system is how much to weight past performance. Or in other words, how much regression toward the mean are we going to apply?
In its forecasting file, Marcel publishes exactly how much regression toward the mean it has applied, and it is directly tied in to how many past plate appearances (PA). I broke the 2480 players into 4 groups:
A. Had substantial past history (n=1165)
B. Had some past history (n=356)
C. Had very little past history (n=562)
D. Had no past history (n=397)
For you Marcelites following along, I used thresholds of r=.7 and .5 to set the boundaries. So, group A has High Reliability. Group B has Medium Reliability. Group C has Low Reliability. Group D are the pure rookies, players that Marcel set to a fixed wOBA of .335 and by definition has No Reliability.
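In code, the bucketing looks something like this sketch, where r is Marcel's published reliability. I'm assuming the pure rookies carry r = 0 in the file, which is a guess on my part:

```python
def reliability_class(r):
    # Thresholds of .7 and .5, per the text above.
    if r == 0:
        return "No"      # pure rookies: fixed .335 forecast
    if r >= 0.7:
        return "High"
    if r >= 0.5:
        return "Med"
    return "Low"
```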
First let me show you the results only for Marcel:
Rel n PA Actual Marcel Error
No 397 41911 0.319 0.335 0.042
Low 562 81066 0.313 0.326 0.037
Med 356 89475 0.326 0.325 0.030
High 1165 495242 0.342 0.338 0.025
We see that for the 1165 players (totalling 495,242 PA) with High Reliability, Marcel was, overall, fine. They actually produced a .342 wOBA, Marcel forecasted .338, and the average error for these players was .025.
We see that, as the reliability gets worse, the error goes up. With the 356 players of Medium Reliability, Marcel forecasted overall .325, they produced overall .326, and the error went up to .030.
With the 562 players of Low Reliability, Marcel tanked. Marcel forecasted .326, but they were actually only .313. The error jumped to .037.
Finally, with the 397 of No Reliability, Marcel as you will remember simply forecasted everyone at .335. They actually produced a bench-level .319, but the average error was .042. Basically, this is as bad as it gets.
Now, let’s see how each of the systems does in each reliability class.
TEST #3a - High Reliability
First up, the High Reliability players, in alphabetical order of system:
wOBA Error System
0.342 0.0000 Actual
0.343 0.0243 Chone
0.338 0.0248 Marcel
0.341 0.0245 Oliver
0.342 0.0252 Pecota
0.343 0.0248 Zips
We see that, other than Marcel, all the systems got the group average on the button, from .341 to .343, where the actual was .342. However, the average error was pretty much identical across the board, from .0243 to .0252. The basic rule of thumb here is that if you have a large history of performance for a player, it's pretty hard to find some way to beat Marcel.
TEST #3b - Medium Reliability
wOBA Error System
0.326 0.0000 Actual
0.324 0.0291 Chone
0.325 0.0296 Marcel
0.326 0.0296 Oliver
0.325 0.0296 Pecota
0.323 0.0306 Zips
Here we see there’s very little to distinguish between the systems. Marcel holds its own extremely well, even though we are talking about players with limited playing time. Generally speaking, a player with a reliability of between .50 and .70 means that he has about 300 to 700 PA over the past three seasons.
TEST #3c - Low Reliability
wOBA Error System
0.313 0.0000 Actual
0.314 0.0341 Chone
0.326 0.0371 Marcel
0.318 0.0345 Oliver
0.312 0.0359 Pecota
0.312 0.0339 Zips
Here we see that Marcel sticks out like a sore thumb. This group of hitters hit at .313, and the other four systems forecasted them to hit between .312 and .318. Marcel forecasted .326.
Why did Marcel do so terribly here? Simply put: Marcel does not look at minor league stats; these four other systems do. More importantly though: do we even need the minor league stats? Had Marcel simply regressed toward a population mean of .315 instead of .335, it would have improved considerably, and fallen right in step with the other systems. Indeed, even though it regressed toward the wrong mean, Marcel's overall average error was only .037, compared to the .034 to .036 of the other systems.
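For those who want to see what "regressing toward a population mean" means mechanically, here's a toy sketch. The 1200-PA ballast is purely illustrative, not necessarily Marcel's actual constant; swapping pop_mean from .335 to .315 is the change discussed above.

```python
def regress(observed_woba, observed_pa, pop_mean=0.335, ballast_pa=1200):
    # Blend the observed rate with the population mean, in proportion to
    # how much we have actually seen of the player.
    return (observed_woba * observed_pa + pop_mean * ballast_pa) / \
           (observed_pa + ballast_pa)
```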
TEST #3d - No Reliability
Hold on to your hats here. Ready? These are the players that are Pure Rookies. They had no prior MLB history for any system to draw from. Marcel decided to give a blanket .335 forecast for each player, while the other four systems relied on their minor league stats.
wOBA Error System
0.319 0.0000 Actual
0.306 0.0436 Chone
0.335 0.0416 Marcel
0.320 0.0414 Oliver
0.313 0.0430 Pecota
0.307 0.0439 Zips
First off, we see forecasts all over the place. While the group of Pure Rookies hit .319 wOBA, the other four systems forecasted .306 to .320. Marcel of course was exactly .335.
But look at the error term: Marcel nearly won! And Chone, which was leading in each sub-category, took a bit of a hit here. Chone, along with Zips, forecasted the overall mean too low, and their error terms were the highest. Not that any of the systems really redeemed themselves here.
Interlude - Selection Bias
Now, why is this the case? Welcome to the world of selection bias. Let’s say that you have a system that does a great job in figuring out each player’s true talent level. In some cases, you are somewhat low, and in other cases, you are somewhat high. But overall, you are pretty good.
Now, in the cases where you are somewhat low, a lot of those players will get called up to play in MLB. They will be given the playing time. And they will show that they are pretty good.
In the cases where you are somewhat high, very few of those players will get called up to play in MLB, because of scouting reports. They will be given limited playing time. And they will show that they are not that good. We have something like this:
Of players where scouting reports and statistical analysis differ:
Scouting Reports much better
.310 = forecasted wOBA
.330 = actual wOBA
Scouting Reports much worse
.310 = forecasted wOBA
.290 = actual wOBA
So, you look at the two, and figure “ok, in either case, the forecasted wOBA is .310, and the average is halfway between .290 and .330, or .310”. What’s the problem? Let me add one more number:
Scouting Reports much better
.310 = forecasted wOBA
.330 = actual wOBA
300 = actual average number of plate appearances
Scouting Reports much worse
.310 = forecasted wOBA
.290 = actual wOBA
50 = actual average number of plate appearances
The first group, the group that the team trusts more, are given more playing time. The second group will barely be given a chance beyond a September callup.
Since we weight the results by actual PA, the numbers in the first group will count for more in the weighted average.
Overall, it looks like this:
.310 = forecasted wOBA
.324 = actual wOBA
As you can see, the forecasted wOBA will be lower than the actual wOBA: the .324 is just the PA-weighted mean, (.330 x 300 + .290 x 50) / 350 = .324. (All numbers in this particular section for illustration purposes only.)
Indeed, if you want to beat a forecasting system, you simply set a floor forecast wOBA of about .270 for a middle infielder or catcher, .290 for a centerfielder or third baseman, and .310 for a 1B or corner outfielder. Why is that? Because if they are worse than that, they won't be given the chance to play. So, even if you know that a hitter is much worse than that, don't bother forecasting much lower than that.
You would do the same thing on the pitching side. If a pitcher gives up runs at worse than 125% of the league average, he will not remain in MLB. It's that simple. So, it makes no sense to forecast someone at 150% of league average. You simply will never have the chance to get your forecast proven correct. Well, it will be proven correct by the player not playing in MLB. But with IP=0, the weight=0, and so it won't count!
I know, I know, it’s terrible, right? That’s selection bias for you.
Because playing time is linked to talent, you can’t treat playing time as an independent parameter.
Remember that.
So, what do we do here? I have no idea. I’m presenting the data, and I’m breaking down the results to look for possibilities of bias. We’ve found one. I don’t know how to resolve this issue in terms of putting everyone on an equal footing.
Think about that: by KEEPING the players that Marcel has ZERO reliability on, it IMPROVES its overall ranking!
TEST #4 - Quality of players
I broke up the players based on how high or low Marcel forecasted them, using thresholds of .380, .350, .320, and .290. Here's how Marcel fared:
Class Actual Marcel Error n PA
0 0.319 0.335 0.042 397 41911
1 0.288 0.283 0.029 116 21679
2 0.310 0.308 0.028 703 162343
3 0.335 0.334 0.027 948 312960
4 0.360 0.361 0.026 246 126855
5 0.397 0.394 0.025 70 41946
Class 5 is those players that Marcel forecasted with a wOBA of at least .380. There were only 70 such players in 4 years (totalling 41,946 PA). The overall average forecast was .394, but they performed at .397. The average error was only .025.
Marcel nailed the next three groups of players, getting the overall average within 1 or 2 wOBA points, and the average error between 26 and 28 wOBA points. For the lower quality players, they performed a bit better than Marcel forecasted (.288 actual to .283 forecasted). The average error went up to 29 points.
The Class=0 players are the Pure Rookies, and we know all about them. How did each system do against these same groups of players?
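For reference, here's a sketch of the class assignment in code (thresholds from above; the rookie flag is hypothetical):

```python
def quality_class(marcel_woba, is_pure_rookie):
    # Class 0 = pure rookie; classes 1-5 by Marcel's forecast, using the
    # .290 / .320 / .350 / .380 thresholds from the text.
    if is_pure_rookie:
        return 0
    for cls, floor in ((5, 0.380), (4, 0.350), (3, 0.320), (2, 0.290)):
        if marcel_woba >= floor:
            return cls
    return 1  # under .290
```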
TEST #4a - Marcel forecast of .380+ (great hitters)
Because the classes are defined by Marcel's own forecasts, this may introduce some bias and confer an advantage to Marcel. Because of that, the focus should be on the other four forecasting systems in each of the breakdowns that follow.
wOBA Error System
0.397 0.0000 Actual
0.400 0.0251 Chone
0.394 0.0250 Marcel
0.406 0.0265 Oliver
0.396 0.0250 Pecota
0.403 0.0262 Zips
We see that Oliver was over-optimistic in this class of hitters, coming in at .406. Chone and Pecota led the way on average error, with a small advantage over the other systems. My guess is that the trailing systems don't regress enough for this quality of player.
TEST #4b - Marcel forecast of .350 - .380 (good hitters)
wOBA Error System
0.360 0.0000 Actual
0.361 0.0251 Chone
0.361 0.0255 Marcel
0.363 0.0256 Oliver
0.362 0.0246 Pecota
0.365 0.0254 Zips
Pecota does a good job in minimizing the error, but overall, all the systems hold their own well.
TEST #4c - Marcel forecast of .320 - .350 (average hitters)
wOBA Error System
0.335 0.0000 Actual
0.335 0.0262 Chone
0.334 0.0272 Marcel
0.334 0.0264 Oliver
0.335 0.0279 Pecota
0.335 0.0266 Zips
Now the tables are turned a bit as Pecota brings up the rear, but generally speaking, no standouts here.
TEST #4d - Marcel forecast of .290 - .320 (fair hitters)
wOBA Error System
0.310 0.0000 Actual
0.312 0.0268 Chone
0.308 0.0278 Marcel
0.308 0.0268 Oliver
0.310 0.0278 Pecota
0.309 0.0279 Zips
Chone continues to add to its tiny advantage, but again, not much to learn here.
TEST #5 - Marcel forecast of under .290 (poor hitters)
wOBA Error System
0.288 0.0000 Actual
0.289 0.0287 Chone
0.283 0.0288 Marcel
0.284 0.0265 Oliver
0.286 0.0281 Pecota
0.282 0.0284 Zips
Chone gets the overall mean the best, while the others seem to be too pessimistic. This result is somewhat consistent with the "best hitters" forecast: the other systems generally don't regress enough. The average error, however, favors those systems anyway! This may be due to the selection bias issues noted earlier.
TEST #6 - Full Breakdown by Reliability and Quality
I offer this breakdown with no commentary, and somewhat poor formatting. Copy / paste into your spreadsheet program. The first five system columns are the forecasted wOBA; the second five are the average errors.
Reliability PureRookie Quality Actual Chone Marcel Oliver Pecota Zips Chone Marcel Oliver Pecota Zips n PA
1_Low 1 3 0.319 0.306 0.335 0.320 0.313 0.307 0.044 0.042 0.041 0.043 0.044 397 41911
1_Low 0 1 0.289 0.296 0.287 0.296 0.289 0.288 0.024 0.022 0.019 0.021 0.023 19 1939
1_Low 0 2 0.300 0.306 0.310 0.307 0.300 0.303 0.036 0.037 0.035 0.038 0.037 223 25947
1_Low 0 3 0.318 0.317 0.332 0.321 0.316 0.314 0.034 0.036 0.035 0.036 0.033 298 45582
1_Low 0 4 0.330 0.329 0.357 0.339 0.337 0.333 0.031 0.046 0.033 0.030 0.031 22 7598
2_Med 0 1 0.284 0.286 0.282 0.282 0.284 0.282 0.032 0.035 0.029 0.032 0.031 41 6480
2_Med 0 2 0.314 0.310 0.307 0.309 0.308 0.306 0.034 0.036 0.034 0.035 0.037 155 28339
2_Med 0 3 0.333 0.331 0.332 0.333 0.334 0.330 0.028 0.027 0.028 0.028 0.028 133 40529
2_Med 0 4 0.348 0.348 0.361 0.354 0.350 0.354 0.022 0.020 0.026 0.023 0.024 26 13474
2_Med 0 5 0.377 0.378 0.394 0.393 0.397 0.378 0.001 0.018 0.016 0.020 0.001 1 653
3_High 0 1 0.289 0.290 0.283 0.282 0.286 0.282 0.028 0.027 0.026 0.027 0.028 56 13260
3_High 0 2 0.311 0.314 0.308 0.308 0.312 0.311 0.023 0.023 0.023 0.023 0.023 325 108057
3_High 0 3 0.339 0.339 0.334 0.336 0.339 0.340 0.024 0.025 0.024 0.026 0.025 517 226849
3_High 0 4 0.364 0.366 0.362 0.366 0.366 0.369 0.025 0.025 0.025 0.024 0.025 198 105783
3_High 0 5 0.398 0.401 0.394 0.406 0.396 0.403 0.025 0.025 0.027 0.025 0.027 69 41293
TEST #7 - Extremes
I counted how often a forecast came within 10 points of the actual wOBA. I called that a "great forecast". I also counted how often a forecast missed the actual wOBA by at least 40 points. I called that a "useless forecast".
Great Useless System
0.253 0.218 Chone
0.260 0.235 Marcel
0.249 0.228 Oliver
0.236 0.235 Pecota
0.249 0.232 Zips
As we can see, Marcel got the most great forecasts. In terms of the Great/Useless ratio, Chone performed the best, with Marcel coming in second. Pecota had as many great forecasts as it did useless ones.
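If you want to reproduce the counting, here's a sketch. One assumption on my part: the text above doesn't say whether these shares are PA-weighted, so I weight them here to mirror the other tests.

```python
def extreme_shares(forecast, actual, pa):
    # Share of (PA-weighted) forecasts within 10 points of actual ("great")
    # and off by 40+ points ("useless").
    total = sum(pa)
    great = sum(w for f, a, w in zip(forecast, actual, pa) if abs(f - a) <= 0.010)
    useless = sum(w for f, a, w in zip(forecast, actual, pa) if abs(f - a) >= 0.040)
    return great / total, useless / total
```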
TEST #8 - Head-to-Head
Seeing that Chone was performing the best, I decided to do a head-to-head of Chone against Marcel for the 2480 individual forecasts.
I counted a matchup as a win for a system if its forecast error was at least 10 points better than its opponent's, and if 1.5 times its error was still less than its opponent's. As an example, if Chone's forecast error for Evan Longoria was 25 points and Marcel's was 36 points, I called that a tie. If it was 15 and 26, I called that a win.
I weighted the “games” by the number of actual PA.
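In code, the win rule is a two-condition test:

```python
def matchup(err_a, err_b):
    # A win requires being at least 10 points (0.010) better AND having
    # 1.5x your error still below the opponent's; anything else is a tie.
    if err_b - err_a >= 0.010 and 1.5 * err_a < err_b:
        return "A wins"
    if err_a - err_b >= 0.010 and 1.5 * err_b < err_a:
        return "B wins"
    return "tie"
```

With the Longoria example above, matchup(0.025, 0.036) is a tie (1.5 x 25 = 37.5 points, which is not below 36), while matchup(0.015, 0.026) is a win.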
They tied in 74% of the matchups. That is, in 74% of the forecasts, the forecasts were too close to call. Counting ties as half a win, the overall win% of Marcel was .501 against Chone. That is, pretty much, the two systems were equal by this measure.
CONCLUSION
To the extent that we want to declare a winner, Chone seems to have done the best, overall and in various breakdowns.
But all four systems performed very well, both against each other and against Marcel. The strong showing of Marcel serves as a cautionary note: it would be very difficult for any one forecasting system to stand alone as the best. They are all very close, and there's little to distinguish between them.


