
Tuesday, February 15, 2011

Testing the 2007-2010 Forecasting Systems - Official Results

By Tangotiger, 05:27 PM

Marcel, Chone, Oliver, Pecota, Zips.


Background

You can skip all this if you don’t want to be bored with historic details.

A few weeks ago, Brian Cartwright, creator of the Oliver forecasting system, sent me an email.  He informed me that he had collected the past published forecasts of Chone, Marcel, Pecota, and Zips for 2007-2010.  He went through the arduous task of linking each forecast for each player to a common ID.  And he asked for my advice in terms of testing.

I told him that it doesn’t matter what I say: because he was the one doing the analysis, it’s going to be taken with a grain of salt.  As it should be, because no matter how honest you intend to be, there’s a potential for bias, however unintended.  And publication bias demands that if you do well, you publish the results, and if you don’t do well, you don’t publish the results (or you publish only from particular angles).

Chone was created by Sean Smith, and he has a vested interest in it doing well.  Pecota was created by Nate Silver and is now owned by Baseball Prospectus (currently run by Colin Wyers), and they have a vested interest in it doing well.  Zips was created by Dan Szymborski, and he has a vested interest in it doing well.  Oliver is Brian’s baby as noted, and he has a vested interest in it doing well.  In all cases, the creators tinker with or otherwise make wholesale changes to their systems every year.  They are constantly looking for an improvement.

Marcel was conceived by yours truly at Tangotiger.net.  Its algorithm was published seven years ago.  The first results were published seven years ago.  And there have been zero changes to the algorithm since it was first published seven years ago.  Though occasionally Marcel begs for Pon Farr, I don’t allow it.  The purpose of Marcel was to serve as a baseline, a minimum competence level against which all systems would be tested.  I have no vested interest in it doing well.  Indeed, quite the opposite: I want the others to beat Marcel.

So, Brian sent me his dataset.  I am treating this dataset as an assumption of fact.  This means that not only am I accepting that Brian properly collected everyone’s forecasts, but also that he properly assigned the unique player ID that serves to link everyone’s forecast for each player.  This is something that I have done many times in the past, it is a monumental effort, and I had no desire to do it once more.  I am IDed-out.

I should also note that I correspond occasionally with all the creators.  Other potential systems that could have been included were Bill James’, Ron Shandler’s, Mitchel Lichtman’s (published only on request), Pete Palmer’s (published only on request), with all of whom I also correspond occasionally.  The only reason they are not in this test is that Brian did not collect their forecasts and/or didn’t link a player ID to their forecasts.

One other important note: while the forecasts for Chone, Marcel, Pecota, and Zips were collected at the time they were published, the Oliver “forecasts” were based on the 2011 engine, and therefore are not true forecasts.  It’s what the forecast would have been had Oliver published its forecasts based on the 2011 engine.  Because of this, the reader is free to discard any and all Oliver results shown below.  I did contact the creators of the other systems and offered them the opportunity to use their current 2011 engines and recreate past forecasts.  None has accepted the offer so far.

Details

You can skip all this if you don’t want to be bored with how I set up the data.

STEP 1
The first thing I decided to do is look only at the common elements being forecasted.  This means that my universe of stats was limited to:
AB, H, 2B, 3B, HR, BB

And that’s it. 

STEP 2
I did have to resolve one small issue, and that was that Chone, and only in 2010, excluded IBB from his BB forecasts.  I tried two different methods:
a. Simply treated everyone’s forecasted IBB for 2010 for Chone as 0, and therefore, let stand his BB forecast as if it included IBB
b. Used Marcel’s forecasted IBB for 2010 for Chone.
I am happy to say that regardless of which way it went, it conferred no advantage to Chone’s overall results.  Because of this, I decided to go with option b.

STEP 3
I also had to resolve one bigger issue, and that was that not all forecasters provided forecasts for every player.

Marcel has an explicit note on its forecasting page that any player that is not in its downloadable forecasting file is treated as if he has a league average forecast.  The other systems don’t have that provision.  So, what to do, what to do.

Well, what I decided upon was the following course of action.  I looked at what the actual results were for the players not forecasted by each system.  And, to no surprise, those missing players ended up performing at a bench level, with an average wOBA of .310 to .320.  Therefore, I decided to give each of the missing players a fixed forecast of .315 (while maintaining .335 for Marcel).  I should note that it didn’t really matter what I did here.  I could have made it .310 or .335, and the results barely budged.  This is because each of the forecasting systems forecasted the vast majority of the players.
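
For the record, here is a minimal sketch of that fill-in rule (illustrative Python, not the actual code behind these results; the player IDs and function name are made up):

```python
def fill_missing(forecasts, player_ids, system):
    """Give a fixed wOBA forecast to players a system did not forecast:
    .315 for everyone, .335 for Marcel (per its league-average provision)."""
    default = 0.335 if system == "Marcel" else 0.315
    return {pid: forecasts.get(pid, default) for pid in player_ids}

ids = ["p1", "p2", "p3"]
print(fill_missing({"p1": 0.352, "p2": 0.318}, ids, "Marcel"))  # p3 -> 0.335
print(fill_missing({"p1": 0.349, "p3": 0.301}, ids, "Chone"))   # p2 -> 0.315
```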


STEP 4
Then I had to decide on a metric, and that was simple enough: wOBA.  wOBA is nothing more than Linear Weights, and Linear Weights is the best metric for batting.  The equation I used was:
wOBA = (0.7*[BB] + 0.9*[H] + 0.35*[DO] + 0.65*[TR] + 1.1*[HR]) / (AB + BB)

As a reminder, BB includes IBB, and excludes hit batters.
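
In code, with [H] being total hits (so a double ends up weighted 0.9 + 0.35 = 1.25, a triple 1.55, and a home run 2.0), the calculation is a one-liner.  A minimal sketch, with a made-up stat line:

```python
def woba(ab, h, d, t, hr, bb):
    """wOBA as used in this test: BB includes IBB and excludes HBP; h is total hits."""
    return (0.7 * bb + 0.9 * h + 0.35 * d + 0.65 * t + 1.1 * hr) / (ab + bb)

# Example line: 500 AB, 150 H (30 2B, 3 3B, 25 HR), 60 BB -> about .387
print(round(woba(500, 150, 30, 3, 25, 60), 3))
```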

STEP 5
Finally, I had to align everyone’s forecast to a common baseline.  Suppose Marcel forecasts Ibanez for a .340 wOBA in a league of .340, and Chone forecasts the same player with a .340 wOBA in a league of .330, and Ibanez comes in at a .338 wOBA in a league of .338.

The way I describe this situation is that Marcel is its own universe.  It could have forecasted Ibanez for a .350 wOBA in a league of .350, and it would make no difference.  Marcel forecasted Ibanez to be a league average player.  We really don’t care if we forecast what the entire league is going to do, because the league is dependent on such externalities as the weather, the strike zone, the ball, and the bat.  None of which we care about for forecasting (unless we can isolate it to a particular player or park).

So, I reset each system’s forecast such that their population mean was .335, for each and every year.  I reset the actual results at a wOBA of .335 for each and every year.  I did this with a simple subtraction or addition.  The weight of the wOBA was based on the actual PA in the year in question.
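
A minimal sketch of that re-centering (illustrative only; it assumes one list of forecasts and one list of actual PA for a single system in a single year):

```python
def recenter(forecasts, pas, target=0.335):
    """Shift a system's forecasts additively so their PA-weighted mean equals the target."""
    wmean = sum(f * pa for f, pa in zip(forecasts, pas)) / sum(pas)
    return [f + (target - wmean) for f in forecasts]

# Here the PA-weighted mean is .334, so every forecast gets +.001 added to it.
print([round(x, 3) for x in recenter([0.360, 0.330, 0.300], [600, 500, 400])])
```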

Results

Don’t skip this part.  It’s the reason you are here.

In the tests that follow, the metric of choice is the absolute error between the forecast and the actual, weighted by the actual number of plate appearances (PA).  No player results have been discarded.
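
That metric, in a minimal sketch (made-up numbers; errors of 10, 25, and 5 wOBA points, weighted by actual PA):

```python
def weighted_abs_error(forecast, actual, pa):
    """PA-weighted mean absolute error between forecasted and actual wOBA."""
    return sum(abs(f - a) * w for f, a, w in zip(forecast, actual, pa)) / sum(pa)

print(round(weighted_abs_error([0.330, 0.350, 0.310],
                               [0.340, 0.325, 0.305],
                               [600, 450, 150]), 4))  # 0.015
```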

TEST #1 - OVERALL
There were 2480 hitters in my dataset, totalling 707,694 PA (AB+BB).  This is their ranking:

0.0272 Chone
0.0273 Oliver
0.0277 Zips
0.0278 Marcel
0.0280 Pecota

TEST #2 - SEASONAL

This is the breakdown by season:

Season	Chone	Marcel	Oliver	Pecota	Zips	ALL
2007	 0.0281 	 0.0280 	 0.0277 	 0.0277 	 0.0279 	 0.0279 
2008	 0.0263 	 0.0276 	 0.0269 	 0.0274 	 0.0272 	 0.0271 
2009	 0.0261 	 0.0275 	 0.0265 	 0.0286 	 0.0274 	 0.0272 
2010	 0.0283 	 0.0281 	 0.0279 	 0.0283 	 0.0285 	 0.0282 
ALL	 0.0272 	 0.0278 	 0.0273 	 0.0280 	 0.0277 	 0.0276

The one that stands out the most is Pecota 2009.  There was a transition from Nate Silver to Clay Davenport to Colin Wyers right around that time.  While obviously the results are the results, and any reader relying on the 2009 Pecota forecasts would have felt the pinch, it is possible that the results are not indicative of the overall strength of the system.  If I change the results for Pecota 2009 from an error of .0286 to .0275 (to match the worst in the group for that year, Marcel), the overall average error for Pecota 2007-2010 falls to .0277, tied with Zips, and just ahead of Marcel.

Once again, the reader can choose to treat Pecota in this result as either equal to Zips or a bit worse than Marcel, however he wants to think of it.

TEST #3 - RELIABILITY

This is where the fun starts, and this is where we’re going to learn something.

The most important part of a forecasting system is how much to weight past performance.  Or in other words, how much regression toward the mean are we going to apply?

In its forecasting file, Marcel publishes exactly how much regression toward the mean it has applied, and it is directly tied to how many past plate appearances (PA) the player has.  I broke the 2480 players into 4 groups:

A. Had substantial past history (n=1165)
B. Had some past history (n=356)
C. Had very little past history (n=562)
D. Had no past history (n=397)

For you Marcelites following along, I used thresholds of r=.7 and .5 to set the boundaries.  So, group A has High Reliability.  Group B has Medium Reliability.  Group C has Low Reliability.  Group D are the pure rookies, players that Marcel set to a fixed wOBA of .335 and who by definition have No Reliability.
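
A minimal sketch of that grouping (illustrative only; it assumes each player carries the reliability figure r published in Marcel’s forecast file, and treating pure rookies as r = 0 is just my shorthand for “no past history”):

```python
def reliability_group(r):
    """Bucket a player by Marcel's published reliability figure."""
    if r == 0:
        return "No"      # pure rookie: Marcel defaults him to a .335 wOBA forecast
    if r >= 0.7:
        return "High"
    if r >= 0.5:
        return "Med"
    return "Low"

for r in (0.85, 0.62, 0.31, 0.0):
    print(r, reliability_group(r))
```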

First let me show you the results only for Marcel:

Rel	n	PA	Actual	Marcel	Error
No	397	41911	 0.319 	 0.335 	 0.042 
Low	562	81066	 0.313 	 0.326 	 0.037 
Med	356	89475	 0.326 	 0.325 	 0.030 
High	1165	495242	 0.342 	 0.338 	 0.025

We see that for the 1165 players (totalling 495,242 PA) with High Reliability, Marcel was, overall, fine.  They actually produced a .342 wOBA, Marcel forecasted .338, and the average error for these players was .025.

We see, as the reliability gets worse, the error goes up.  With the 356 players of Medium Reliability, Marcel forecasted overall .326, they produced overall .325, and the error went up to .030.

With the 562 players of Low Reliability, Marcel tanked.  Marcel forecasted .326, but they were actually only .313.  The error jumped to .037.

Finally, with the 397 players of No Reliability, Marcel as you will remember simply forecasted everyone at .335.  They actually produced a bench-level .319, and the average error was .042.  Basically, this is as bad as it gets.

Now, let’s see how each of the systems does in each reliability class.

TEST #3a - High Reliability

First up, the High Reliability players, in alphabetical order of system:

wOBA	Error	System
 0.342 	0.0000	Actual
 0.343 	 0.0243 	Chone
 0.338 	 0.0248 	Marcel
 0.341 	 0.0245 	Oliver
 0.342 	 0.0252 	Pecota
 0.343 	 0.0248 	Zips

We see that other than Marcel, all the systems got the group average on the button, from .341 to .343, where the actual was .342.  However, the average error was pretty much identical across the board, from .0243 to .0252.  The basic rule of thumb here is that if you have a large history of performance for a player, it’s pretty hard to find some way to beat Marcel.

TEST #3b - Medium Reliability

wOBA	Error	System
0.326	0.0000	Actual
0.324	0.0291	Chone
0.325	0.0296	Marcel
0.326	0.0296	Oliver
0.325	0.0296	Pecota
0.323	0.0306	Zips

Here we see there’s very little to distinguish between the systems.  Marcel holds its own extremely well, even though we are talking about players with limited playing time.  Generally speaking, a player with a reliability of between .50 and .70 means that he has about 300 to 700 PA over the past three seasons.

TEST #3c - Low Reliability

wOBA	Error	System
0.313	0.0000	Actual
0.314	0.0341	Chone
0.326	0.0371	Marcel
0.318	0.0345	Oliver
0.312	0.0359	Pecota
0.312	0.0339	Zips

Here we see that Marcel sticks out like a sore thumb.  We see that this group of hitters hit .313, and the other four systems forecasted this group of players to hit between .312 and .318.  Marcel forecasted .326.

Why did Marcel do terribly here?  Simply put: Marcel does not look at minor league stats.  These 4 other systems do.  More importantly though: do we even need the minor league stats?  Had Marcel simply regressed toward a population mean of .315 instead of .335, it would have improved its lot considerably, and fallen right in step with the other systems.  Indeed, even though it regressed toward the wrong mean, the overall average error of Marcel was only .037, compared to the .034 to .036 of the other systems.

TEST #3d - No Reliability

Hold on to your hats here.  Ready?  These are the players that are Pure Rookies.  They had no prior MLB history for any system to draw from.  Marcel decided to give a blanket .335 forecast for each player, while the other four systems relied on their minor league stats.

wOBA	Error	System
0.319	0.0000	Actual
0.306	0.0436	Chone
0.335	0.0416	Marcel
0.320	0.0414	Oliver
0.313	0.0430	Pecota
0.307	0.0439	Zips

First off, we see forecasts all over the place.  While the group of Pure Rookies hit .319 wOBA, the other four systems forecasted .306 to .320.  Marcel of course was exactly .335.

But, look at the error term: Marcel nearly won!  And Chone, which was leading in each sub-category, took a bit of a hit here.  Chone, along with Zips, forecasted the overall mean too low, and their error terms were the highest.  Not that any of the systems really redeemed themselves here.

Interlude - Selection Bias
Now, why is this the case?  Welcome to the world of selection bias.  Let’s say that you have a system that does a great job in figuring out each player’s true talent level.  In some cases, you are somewhat low, and in other cases, you are somewhat high.  But overall, you are pretty good.

Now, in the cases that you are somewhat low, a lot of those players will get called up to play in MLB.  They will be given the playing time.  And, they will show that they are pretty good.

In the cases that you are somewhat high, very few of those players will get called up to play in MLB, because of scouting reports.  They will be given limited playing time.  And, they will show that they are not that good.  We have something like this:

Of players where scouting reports and statistical analysis differ:

Scouting Reports much better
.310 = forecasted wOBA
.330 = actual wOBA

Scouting Reports much worse
.310 = forecasted wOBA
.290 = actual wOBA

So, you look at the two, and figure “ok, in either case, the forecasted wOBA is .310, and the average is halfway between .290 and .330, or .310”.  What’s the problem?  Let me add one more number:
Scouting Reports much better
.310 = forecasted wOBA
.330 = actual wOBA
300 = actual average number of plate appearances

Scouting Reports much worse
.310 = forecasted wOBA
.290 = actual wOBA
50 = actual average number of plate appearances

The first group, the group that the team trusts more, are given more playing time.  The second group will barely be given a chance beyond a September callup.

Since we weight the results by actual PA, the numbers in the first group will count for more in the weighted average.

Overall, it looks like this:
.310 = forecasted wOBA
.324 = actual wOBA

As you can see, the forecasted wOBA will be lower than the actual wOBA.  (All numbers in this particular section for illustration purposes only.)
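
To make the weighting concrete, here is a minimal sketch that reproduces the illustrative numbers above (plain Python; nothing here comes from the real dataset):

```python
# PA-weighted average of the two illustrative groups from the text.
groups = [
    {"forecast": 0.310, "actual": 0.330, "pa": 300},  # scouting reports much better
    {"forecast": 0.310, "actual": 0.290, "pa": 50},   # scouting reports much worse
]
total_pa = sum(g["pa"] for g in groups)
forecast = sum(g["forecast"] * g["pa"] for g in groups) / total_pa
actual = sum(g["actual"] * g["pa"] for g in groups) / total_pa
print(round(forecast, 3), round(actual, 3))  # .310 forecasted vs roughly .324 actual
```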

Indeed, if you want to beat a forecasting system, you simply set a floor forecast wOBA of about .270 for a middle infielder or catcher, .290 for a centerfielder or third baseman, and .310 for a 1B or corner outfielder.  Why is that?  Because if they are worse than that, they won’t be given the chance to play.  So, even if you know that a hitter is much worse than that, don’t bother forecasting much lower than that.  (All numbers in this particular section for illustration purposes only.)  You would do the same thing on the pitching side.  If a pitcher gives up runs at worse than 125% of the league average, he will not remain in MLB.  It’s that simple.  So, it makes no sense to forecast someone at 150% of league average.  You simply will never have the chance to have your forecast proven correct.  Well, it will be proven correct by the player not playing in MLB.  But with IP=0, the weight=0, and so it won’t count!
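
For what it’s worth, a minimal sketch of that floor rule (the position groupings and floor values are the illustrative ones from the paragraph above, not a recommendation, and the helper name is made up):

```python
# Players forecasted below these levels rarely get MLB playing time,
# so a lower forecast never gets "tested" with meaningful PA.
FLOORS = {"C": 0.270, "2B": 0.270, "SS": 0.270,
          "CF": 0.290, "3B": 0.290,
          "1B": 0.310, "LF": 0.310, "RF": 0.310}

def floored_forecast(woba, position):
    return max(woba, FLOORS.get(position, 0.290))

print(floored_forecast(0.245, "SS"))  # clamped up to .270
print(floored_forecast(0.325, "1B"))  # unchanged
```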

I know, I know, it’s terrible, right?  That’s selection bias for you. 

Because playing time is linked to talent, you can’t treat playing time as an independent parameter.

Remember that.

So, what do we do here?  I have no idea.  I’m presenting the data, and I’m breaking down the results to look for possibilities of bias.  We’ve found one.  I don’t know how to resolve this issue in terms of putting everyone on an equal footing.

Think about that: by KEEPING the players that Marcel has ZERO reliability on, it IMPROVES its overall ranking!

TEST #4 - Quality of players

I broke up the players based on how high or low Marcel forecasted the players.  I used thresholds of .380, .350, .320, and .290.  Here’s how Marcel fared:

Class	Actual	Marcel	Error	n	PA
0	 0.319 	 0.335 	 0.042 	397	41911
1	 0.288 	 0.283 	 0.029 	116	21679
2	 0.310 	 0.308 	 0.028 	703	162343
3	 0.335 	 0.334 	 0.027 	948	312960
4	 0.360 	 0.361 	 0.026 	246	126855
5	 0.397 	 0.394 	 0.025 	70	41946

Class 5 is those players that Marcel forecasted with a wOBA of at least .380.  There were only 70 such players in 4 years (totalling 41,946 PA).  The overall average forecast was .394, but they performed at .397. The average error was only .025.

Marcel nailed the next three groups of players, getting the overall average within 1 or 2 wOBA points, and the average error between 26 and 28 wOBA points.  For the lower quality players, they performed a bit better than Marcel forecasted (.288 actual to .283 forecasted).  The average error went up to 29 points.

The Class=0 players are the Pure Rookies, and we know all about them.  How did each system do against these same groups of players?

TEST #4a - Marcel forecast of .380+ (great hitters)

Because we are using Marcel’s forecasts to define the classes, this may introduce some bias and confer an advantage to Marcel.  Because of that, the focus should be on the other 4 forecasting systems in each of the breakdowns that follow.

wOBA	Error	System
 0.397 	0.0000	Actual
 0.400 	 0.0251 	Chone
 0.394 	 0.0250 	Marcel
 0.406 	 0.0265 	Oliver
 0.396 	 0.0250 	Pecota
 0.403 	 0.0262 	Zips

We see that Oliver was over-optimistic in this class of hitters, coming in at .406.  The average error for Chone and PECOTA led the way, with a small advantage over the other systems.  My guess is that those systems don’t regress enough for this quality of players.

TEST #4b - Marcel forecast of .350 - .380 (good hitters)

wOBA	Error	System
 0.360 	0.0000	Actual
 0.361 	 0.0251 	Chone
 0.361 	 0.0255 	Marcel
 0.363 	 0.0256 	Oliver
 0.362 	 0.0246 	Pecota
 0.365 	 0.0254 	Zips

Pecota does a good job in minimizing the error, but overall, all the systems hold their own well.

TEST #4c - Marcel forecast of .320 - .350 (average hitters)

wOBA	Error	System
 0.335 	0.0000	Actual
 0.335 	 0.0262 	Chone
 0.334 	 0.0272 	Marcel
 0.334 	 0.0264 	Oliver
 0.335 	 0.0279 	Pecota
 0.335 	 0.0266 	Zips

Now the tables are turned a bit as Pecota brings up the rear, but generally speaking, no standouts here.


TEST #4d - Marcel forecast of .290 - .320 (fair hitters)

wOBA	Error	System
 0.310 	0.0000	Actual
 0.312 	 0.0268 	Chone
 0.308 	 0.0278 	Marcel
 0.308 	 0.0268 	Oliver
 0.310 	 0.0278 	Pecota
 0.309 	 0.0279 	Zips

Chone continues to add to its tiny advantage, but again, not much to learn here.

TEST #4e - Marcel forecast of under .290 (poor hitters)

wOBA	Error	System
 0.288 	0.0000	Actual
 0.289 	 0.0287 	Chone
 0.283 	 0.0288 	Marcel
 0.284 	 0.0265 	Oliver
 0.286 	 0.0281 	Pecota
 0.282 	 0.0284 	Zips

Chone gets the overall mean the best, while the others seem to be too pessimistic.  This result is somewhat consistent with the “best hitters” forecast, and that is that the other systems generally don’t regress enough.  The average error however favors those systems anyway!  This may be due to the selection bias issues noted earlier.

TEST #6 - Full Breakdown by Reliability and Quality

I offer this breakdown with no commentary, and somewhat poor formatting.  Copy / paste into your spreadsheet program.

Reliability PureRookie Quality Actual wobaChone wobaMarcel wobaOliver wobaPecota wobaZips errChone errMarcel errOliver errPecota errZips n PA
1_Low 1 3 0.319 0.306 0.335 0.320 0.313 0.307 0.044 0.042 0.041 0.043 0.044 397 41911
1_Low 0 1 0.289 0.296 0.287 0.296 0.289 0.288 0.024 0.022 0.019 0.021 0.023 19 1939
1_Low 0 2 0.300 0.306 0.310 0.307 0.300 0.303 0.036 0.037 0.035 0.038 0.037 223 25947
1_Low 0 3 0.318 0.317 0.332 0.321 0.316 0.314 0.034 0.036 0.035 0.036 0.033 298 45582
1_Low 0 4 0.330 0.329 0.357 0.339 0.337 0.333 0.031 0.046 0.033 0.030 0.031 22 7598
2_Med 0 1 0.284 0.286 0.282 0.282 0.284 0.282 0.032 0.035 0.029 0.032 0.031 41 6480
2_Med 0 2 0.314 0.310 0.307 0.309 0.308 0.306 0.034 0.036 0.034 0.035 0.037 155 28339
2_Med 0 3 0.333 0.331 0.332 0.333 0.334 0.330 0.028 0.027 0.028 0.028 0.028 133 40529
2_Med 0 4 0.348 0.348 0.361 0.354 0.350 0.354 0.022 0.020 0.026 0.023 0.024 26 13474
2_Med 0 5 0.377 0.378 0.394 0.393 0.397 0.378 0.001 0.018 0.016 0.020 0.001 1 653
3_High 0 1 0.289 0.290 0.283 0.282 0.286 0.282 0.028 0.027 0.026 0.027 0.028 56 13260
3_High 0 2 0.311 0.314 0.308 0.308 0.312 0.311 0.023 0.023 0.023 0.023 0.023 325 108057
3_High 0 3 0.339 0.339 0.334 0.336 0.339 0.340 0.024 0.025 0.024 0.026 0.025 517 226849
3_High 0 4 0.364 0.366 0.362 0.366 0.366 0.369 0.025 0.025 0.025 0.024 0.025 198 105783
3_High 0 5 0.398 0.401 0.394 0.406 0.396 0.403 0.025 0.025 0.027 0.025 0.027 69 41293

TEST #7 - Extremes

I counted how often a forecast came within 10 points of the actual wOBA.  I called that a “great forecast”.  I also counted how often a forecast missed the actual wOBA by at least 40 points.  I called that a “useless forecast”.
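
A minimal sketch of that count (the helper name is made up; whether to weight by PA or count raw forecasts is a judgment call, and this sketch weights by PA like the other tests):

```python
def great_useless(forecast, actual, pa):
    """PA-weighted share of forecasts within 10 points ('great') or off by 40+ points ('useless')."""
    total = sum(pa)
    great = sum(w for f, a, w in zip(forecast, actual, pa) if abs(f - a) <= 0.010)
    useless = sum(w for f, a, w in zip(forecast, actual, pa) if abs(f - a) >= 0.040)
    return great / total, useless / total

print(great_useless([0.335, 0.360, 0.300], [0.338, 0.310, 0.302], [500, 400, 100]))  # (0.6, 0.4)
```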

Great	Useless	System
 0.253 	 0.218 	Chone
 0.260 	 0.235 	Marcel
 0.249 	 0.228 	Oliver
 0.236 	 0.235 	Pecota
 0.249 	 0.232 	Zips

As we can see, Marcel got the most great forecasts.  In terms of the Great/Useless ratio, Chone performed the best, with Marcel coming in second.  Pecota had as many great forecasts as it did useless ones. 

TEST #8 - Head-to-Head

Seeing that Chone was performing the best, I decided to do a head-to-head of Chone against Marcel for the 2480 individual forecasts.

I counted it as a win for a system if its forecasted error was at least 10 points better than its opponent’s, and its forecasted error times 1.5 was less than its opponent’s.  As an example, if Chone’s forecasted error for Evan Longoria was 25 points and Marcel’s was 36 points, I called that a tie.  If it was 15 and 26, I called that a win.

I weighted the “games” by the number of actual PA. 
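
A minimal sketch of those head-to-head rules (the 10-point and 1.5x thresholds as described, PA-weighted, ties counted as half a win; the function name and sample numbers are made up):

```python
def head_to_head(err_a, err_b, pa):
    """Return system A's PA-weighted win% vs system B, counting ties as half a win.
    A win needs both a 10-point edge and the winner's error times 1.5 below the loser's."""
    wins = ties = total = 0.0
    for a, b, w in zip(err_a, err_b, pa):
        total += w
        if b - a >= 0.010 and a * 1.5 < b:
            wins += w                      # win for A
        elif not (a - b >= 0.010 and b * 1.5 < a):
            ties += w                      # too close to call
    return (wins + 0.5 * ties) / total

# From the Longoria example: (.025, .036) is a tie, (.015, .026) is a win for A.
print(head_to_head([0.025, 0.015], [0.036, 0.026], [600, 600]))  # 0.75
```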

They tied in 74% of the matchups.  That is, in 74% of the forecasts, the forecasts were too close to call.  Counting ties as half a win, the overall win% of Marcel was .501 against Chone.  That is, pretty much, the two systems were equal by this measure.


CONCLUSION

To the extent that we want to declare a winner, Chone seems to have done the best, overall and in various breakdowns.

But all four systems performed very well, both against each other and against Marcel.  The strong showing of Marcel serves as a cautionary note that it would be very difficult for a forecasting system to stand alone as the best.  They are all very close, and there’s little to distinguish between them.

 

#1    J. Cross      (see all posts) 2011/02/16 (Wed) @ 02:11

Good stuff.

These really are close.  Tango, do you have the data set (better yet, do you have the dataset including individual stats like BB, H, HR…)?


#2    Xeifrank      (see all posts) 2011/02/16 (Wed) @ 02:41

Very interesting and kudos to Brian for getting the data and Tango for running the tests.  A couple of questions/comments.

1. What kind of margin of errors are we talking about on these tests?

2. As a science project it would be interesting to see how each system does on “in season” projections.  It would give you more data points for each system.

3. Using the 2010 engine on the previous seasons data is a large advantage, as pointed out.

4. Those tiny fonts in tables are hard to read.

5. Any thoughts on changing some of the Marcel constants for players with none or very little playing time or do you prefer no changes ever?

6. Pitching up next?  smile


#3    Tangotiger      (see all posts) 2011/02/16 (Wed) @ 02:47

The file I used after all the massaging is the first link here:

http://www.tangotiger.net/files/

Brian’s source file is his property.  I’ll let him decide if/how he distributes.  If he wants, he can use my site for that too.


#4    MGL      (see all posts) 2011/02/16 (Wed) @ 02:48

To show you how sensitive these tests are to the pool of players chosen and to the method used to compute wOBA (see my thread on testing the systems), here is what Tango got and here is what I got using the same projection systems!

2010   Tango   Me

Chone 0.0283   .0265
Marcel 0.0281   .0265
Oliver 0.0279   .0292
Pecota 0.0283   X
Zips   0.0285   .0267

Cairo         .0270
Steamer       .0283
MGL         .0267
BBGuru       .0274


#5    Matt Klaassen      (see all posts) 2011/02/16 (Wed) @ 02:50

Thanks, Tango, for all your work on this, and (perhaps even moreso) to Brian for completing the arduous task of matching everything up and submitting it for neutral examination.

I rarely bookmark blog posts, but this one made it.

Ah, Rally, I’d say we didn’t know what we had until it was gone, but isn’t this how many of the previous similar tests ended up, too (i.e., if you have to pick a winner, it’s CHONE)?


#6    Tangotiger      (see all posts) 2011/02/16 (Wed) @ 02:52

Xei/2:

1. I don’t know.

2. Maybe, but sample size will not be kind to you.

3. It could be.  But then again, Marcel has not changed at all ever, so I think all it gives Oliver is that it goes from being an also-ran to being in the thick of things, since it’s the newest of the systems.

4. Reads fine on my Firefox.  Click CTRL+ and see what happens.

5. I prefer no changes ever.

6. If Brian has that data…


#7    Tangotiger      (see all posts) 2011/02/16 (Wed) @ 03:06

By the way, the file in post #3 means that anyone can reproduce what I just did.  You just “group by” the necessary columns, and do a weighted average, and bam, it’s done.

So, I look forward to others also slicing/dicing the data.

For example, what if you take the average of the other 4 and compare to Marcel.  Will they beat him handily?  What if the spread among the other 4 is fairly wide: does this mean there’s a lot of uncertainty, and therefore, they may not do as well as the forecasts where the 4 agree strongly?

Plenty of questions to answer.  The data is there for all to use.


#8    (name hidden)      (see all posts) 2011/02/16 (Wed) @ 04:28

Man, I miss CHONE.

CHONE was ...

[1] Free
[2] Available
[3] Accurate
[4] Projected for ~3000 players
[5] Released all at the same time.

Zips and Cairo are still available for free, as are Marcel and Guru.  ZiPS and CAIRO project about 65 players per team, Marcel and Guru about 55 per team.

I do find Oliver projections to be higher for young players than the other systems, and I find that CAIRO combined with Oliver makes for a good group, and ZiPS is somewhat “in between” those two.

I think Guru uses a lot of Tom’s/Marcel’s stuff, so the projections are naturally “alike”, although I am not too certain how unique Guru projections are as compared to Marcel.

I’ve never purchased PECOTA projections.


#9    (name hidden)      (see all posts) 2011/02/16 (Wed) @ 05:01

I took this data and grouped the players into deciles for each season based on their actual wOBA as well as their various projected wOBAs to see how well the systems got the rank order right. The metric is the absolute difference between projected and actual wOBA decile,  weighted by PA.

Overall:

Chone 1.78
Oliver 1.79
Zips 1.79
Pecota 1.83
Marcel 2.16

The results can vary significantly by decile though:

For the Best Decile (top 10% in wOBA):
Zips: 0.89
Chone 0.90
Pecota 0.93
Oliver 0.96
Marcel 1.23

For the Worst Decile:
Chone 2.61
Pecota 2.77
Oliver 2.79
Zips 2.80
Marcel 3.25


#10    (name hidden)      (see all posts) 2011/02/16 (Wed) @ 05:04

Great work, and a great write-up as well.  This was extremely informative.  I loved so much of this; having a Great/Useless projection ratio was a good addition at the end.

Thanks for such an excellent explanation of selection bias.  Since I don’t work with projection systems, it doesn’t occur to me how much that plays a factor, and I really did learn something. A system can be really good at predicting the talent level of minor leaguers, but the few that are off redistribute the weight enough to drive the error way up.

I’d be interested in seeing PAs neutralized for players with no major league history, just a raw cut of the average wOBAs for each player against his projection.  That would eliminate the bias, though it would remove some of the functionality in projections.  In fact, it strikes me that there’s something to be learned by looking solely at the rate stats for players over a minimum threshold, just to exclude those who had one or two PAs and posted perfect batting averages (something like 20 PAs).


#11    Tangotiger      (see all posts) 2011/02/16 (Wed) @ 05:27

You can try to look at a minimum PA threshold, which is how most forecasting comparisons are done.  But that too gives you a selection bias.  After all: who gets at least 150 PA?  If you set the threshold lower, say like 50 PA, then you’re going to get wild actual results that really have no bearing on anything.  Again, after all, anything can happen in 50 PA.  And you would propose equally weighting that to the guy with 600 PA.

Alternatively, you can calculate z-scores for each player, to find out how much the actual was above the forecasted, by taking into account the number of PA.  I’ve done that in the past.

Really, there’s tons of ways of doing this, and in every single way, you’re going to get some bias of some sort.

It’s a matter of figuring out which bias you can live with.

***

tb: he’s proposing comparing the ordinal rankings of each system.  I don’t see how we’re better off.  There are 2480 players.  Clearly the ordinal ranking between player #1 and #11 will have a much wider wOBA gap than players ranked between #1401 and #1411.

So, to continue on his path, you would want to change each system’s wOBA score into a z-score (but relative to the standard deviation of the forecast).  If the forecasted SD is .030 and a player has a forecast of .395 wOBA, then he’s +.060 wOBA above the mean, or +2 SD above the mean.  Rather than using ordinal ranks, you use this.


#12    James Piette      (see all posts) 2011/02/16 (Wed) @ 06:19

Great post!  It is definitely one of those things that pops into your head every once in awhile, especially during fashion week- er, I mean, projections week.

Two things.

(1) Not sure that you’ve ever heard of his work/system, I would suggest you include the forecasts done by Brad Null.  He wrote a paper using a nested Dirichlet model to do player projections (published in JQAS).  It fared very well, at least according to his write-up; it might be interesting to incorporate his model (though, I doubt any significantly different conclusions will come from it).

(2) How did you come up with your win/tie thresholds for head-to-head?  To me, that is actually the most compelling test to run, as talk of squared error with little base of comparison makes things difficult to interpret.  In fact, you could do the head-to-head, comparing all three models.  Any thoughts on pursuing that?

Again, thanks for the great post!


#13    Tangotiger      (see all posts) 2011/02/16 (Wed) @ 06:58

James, thanks.

You are obviously new around here so welcome.

Yes, Brad is a commenter here and is part of the annual Forecasters Challenge that I run.

The dataset I used was provided by Brian, and I had NO desire to link ID to any other system.  If Brian wants to do it, then I’ll just update my program to run it with any new system.  Personally, I’d like to see MGL and Bill James data in there.

***

As for the head-to-head, yes, I had intended to do the head-to-head between all 5 systems, and report an overall win%.  Seeing how well Marcel did against Chone however made me decide against it.  I think it proved the point.  I can run it against all of them though.


#14    Tangotiger      (see all posts) 2011/02/16 (Wed) @ 07:40

By the way, one thing I won’t accept in terms of criticism is if someone says “yeah, but…”.  Why?  Because the data is there.  It’s published.  If you have an idea of how else to show results, then run it yourself and report the results!


#15    Brian Cartwright      (see all posts) 2011/02/16 (Wed) @ 07:41

Tango, feel free to post the data file I sent.

I will get a pitching file together in the next week or so.

As Tom says, please, please put id’s, preferably mlbam, into any files offered to the public. Pecota had previously listed HoweID, Oliver and Pecota for 2011 have mlbam, retro and others listed.

Oliver batting projections were first produced privately and announced in Aug 2008 at StatSpeak. The 2009 batting preseason projections were hosted at FanGraphs. 2010 batting, pitching and fielding became The Hardball Times Forecasts.

I saw mgl’s test thread for the first time last night, and I was disappointed with how Oliver 2010 performed, but I see in Tango’s test that the 2011 version, in retrospect, has shown great improvement.

The first thing I take away from reading these results is that Oliver needs a little more regression for the better players (who mostly also have the most playing time).  I ran empirical tests, finding the amounts of regression that gave the lowest RMSE, but those were for all players; I did not break them down into groups by quality as Tango did.  And, I was looking more to rein in minor league outliers when examining the results.

Many thanks to Tango for his willingness to volunteer for this project.


#16    J. Cross      (see all posts) 2011/02/16 (Wed) @ 08:18

Many thanks, Brian!


#17    (name hidden)      (see all posts) 2011/02/16 (Wed) @ 08:45

Ok, so I calculated Z Scores for each season for each system as well as the actual stats.  I then calculated the absolute difference in Z Score between real and projection and weighted by PA.

Overall,
Chone .647
Pecota .648
Oliver .654
Zips .661
Marcel .746

I also grouped the players in wOBA decile by season.

For the best decile(top 10% of players each year in wOBA)
Pecota .845
Zips .915
Oliver .941
Chone .979
Marcel 1.087

For the worst decile:
Chone 1.011
Pecota 1.020
Oliver 1.036
Zips 1.053
Marcel 1.169


#18    philosofool      (see all posts) 2011/02/16 (Wed) @ 09:13

It seems like there’s another problem for predicting young players…

Suppose you have A and B and both have a .330 true talent wOBA.  Player A arrives and hits .330 for his first 150 PA, while player B hits just .280 over his first 150. The result will be that player A sticks around while player B gets shipped back to AAA.

This is going to make the systems look too pessimistic. The result will be 300 PA from player A who has a .330 wOBA while they have 150 PA from player B who has a .280 wOBA. That’s an average result of .313.


#19    philosofool      (see all posts) 2011/02/16 (Wed) @ 09:20

At myself, #18…

I’ve now confused myself and think that the result of this sort of selection should work in favor of optimism on the part of the systems.  I feel like the sort of selection I’m thinking of should work in favor of pessimism rather than optimism.

Let me try this again.  Imagine a curve of normally distributed results centered on a mean of .330.  The guys who randomly end up on the far left side, i.e., actual performance under .330, will tend to get sent down.  The guys who play at near .330 and higher will stay.  But the normal distribution is what you would expect if you were looking at a group of .330 talents.  However, your observations would be biased because you would end up seeing more of the guys who managed to be near or above .330.

Sorry the math in my last example was not a good illustration of this point.


#20    tangotiger      (see all posts) 2011/02/16 (Wed) @ 09:22

tb: I’m pleased that you’ve found some way to show that Marcel is inferior.  It seems however that the gap is way too large.  I’m going to replicate this tomorrow and present my results.

As for grouping by OBSERVED (actual) wOBA: that’s not a good idea.  By definition, a lot of the high end players will have had a high wOBA because they had more good luck than bad luck.  If you report the mean of the top 10% for each, you’re going to show the mean actual was like .410, while the mean forecast for those players in each system is going to be .370 to .380 or something.  You can’t group on the dependent variable.


#21    tangotiger      (see all posts) 2011/02/16 (Wed) @ 09:27

Agreed, that selection bias abounds.  This is another variant of what I put in bold, that the playing time is not independent of the talent level.


#22    tangotiger      (see all posts) 2011/02/16 (Wed) @ 09:29

tb: are you using wOBA3?


#23    (name hidden)      (see all posts) 2011/02/16 (Wed) @ 09:39

yes, wOBA3

2010: avg wOBA
actual .308
Chone .322
Marcel .329
Oliver .324
Pecota .320
Zips .322

2010: sd wOBA
actual .075
Chone .029
Marcel .022
Oliver .031
Pecota .031
Zips .029

Should I weight the AVG and SD by PA ? I weighted the error, but not the “inputs” to the Z Score.


#24    J. Cross      (see all posts) 2011/02/16 (Wed) @ 10:01

Huh.

I replicated what tbwhite did b/c it seemed really odd to me that Marcel would fall so far back using this test and got (virtually) the same result:

chone   0.6462
marcel   0.7431
oliver   0.6505
pecota   0.6450
zips   0.6592

I wondered if Marcel could be getting punished for all of its null (identical) values:

system     Null
chone   125
marcel   397
oliver   2
pecota   157
zips   118

Standardizing both wOBA and all of the projections based on the 1960 non-null players:

system     non-Null
chone     0.604947378
marcel     0.687800737
oliver     0.627915605
pecota     0.605826941
zips     0.618470837


#25    MGL      (see all posts) 2011/02/16 (Wed) @ 10:12

Here is one of the big problems associated with weighting by PA, similar to what philosofool was trying to get at:

Say you have two players who are true .330 and their projections were perfect (.330 for each).

One player hits .280 in his first 150 PA and the other hits .380 (philosofool, when you give these examples, make sure that the weighted mean of all the observations is equal to true talent and make sure that all players hit their true talent in random, out of sample time periods!).

Player A gets benched or sent down and player B has 450 more PA, at .330 of course (his true talent), for a total of .3425 in 600 PA.

So, your projection evaluation would look like this:

.330-.280 weighted by 150 PA
.3425-.330 weighted by 600 PA

(.050 * 150 +  .0125 * 600)/750, or

.020 (the average absolute error for these 2 forecasts).

But what if the forecaster deliberately inflated his forecasts to .335?

We now have:

(.055 * 150 +  .0075 * 600)/750, or

.017 (the average absolute error for these 2 forecasts)!

And if he inflates to .3425?

.0125 is the average error!

So, if any forecaster either accidentally (and incorrectly) is too optimistic, or he deliberately inflates his forecasts, he is going to do better in any evaluation that weights by PA.

Scaling everyone to their own mean helps with this issue, but I’m not sure it resolves it completely.

There definitely is some merit to not weighting by PA.  However, as Tango says, if you don’t, you are going to have a lot of fluctuation in the evaluation results caused by players who have anomalous (compared to their true talent) performance in a few PA, but at least there is no bias.

And if you try and get around that by having a min PA (or IP or TBF for pitchers), then you have a similar selective sampling issue. In fact, now it is even worse, and a forecaster definitely would fare better if he inflates his projections since all your players with a min number of PA will be part of a “lucky” group…


#26    Tangotiger      (see all posts) 2011/02/16 (Wed) @ 10:16

Very interesting.

tb: yes, you should weight by PA.  You will get wOBA of .335000 for all the systems and actual.  Otherwise, you are going to weight someone with 1 PA identically to someone with 700 PA.


#27    (name hidden)      (see all posts) 2011/02/16 (Wed) @ 10:20

Rather than decile by actual wOBA, I deciled each system on its own projected wOBA.  So, how each system did on the players it said would be in the best and worst deciles.

Best decile:
Pecota 1.122
Zips 1.189
Oliver 1.205
Chone 1.211
Marcel 1.257

Worst Decile:
Oliver 1.267
Chone 1.372
Zips 1.382
Pecota 1.482
Marcel 1.539


#28    Tangotiger      (see all posts) 2011/02/16 (Wed) @ 10:21

MGL: right, I think you’ve hit upon why Marcel does so well with the pure rookies.

If you just look at the 397 Pure Rookies in my file, where Marcel gives everyone a wOBA of .335, see if you can give some insight into the results.

Look at wOBA3 which is the “Final” wOBA I use for everyone.


#29    (name hidden)      (see all posts) 2011/02/16 (Wed) @ 10:22

Weight by actual PA, or weight each system by the projected PA by the system ?


#30    Tangotiger      (see all posts) 2011/02/16 (Wed) @ 10:34

Actual PA.  3 reasons:

1. Everyone gets the same weight
2. None of the systems stand much behind the PA forecast
3. We only care about testing rate stats, not playing time stats


#31    (name hidden)      (see all posts) 2011/02/16 (Wed) @ 10:42

I sort of figured that, so I was already running it that way when you replied.

Overall:
Oliver .0406
Chone .0409
Zips .0411
Pecota .0419
Marcel .0431

Best Decile(Forecasted)
Pecota .0439
Zips .0439
Chone .0444
Oliver .0460
Marcel .0471

Worst Decile(Forecasted)
Oliver .0586
Marcel .0608
Zips .0684
Chone .0734
Pecota .0786


#32    Tangotiger      (see all posts) 2011/02/16 (Wed) @ 10:45

Here’s the SD of the forecasts:

0.030   sdChone
0.025   sdMarcel
0.032   sdOliver
0.031   sdPecota
0.032   sdZips

We see that Marcel simply has a tighter spread.  If I exclude the 397 Pure Rookies:

0.029   sdChone
0.026   sdMarcel
0.031   sdOliver
0.030   sdPecota
0.032   sdZips

Marcel simply keeps everyone more tightly bunched.

Now, if we divide each forecast by these SD, the z-score for Marcel will end up being higher for the great hitters than it would for the other systems.  Take for example Albert Pujols:

Chone has him forecasted at .438 over the 4 years and Marcel has him at .439.  But, if I turn that into a z-score, I’m going to do for Marcel:
(.439-.335)/.026

And for Chone:
(.438-.335)/.029

Now all of a sudden, Marcel’s z-score for Pujols is 3.9 while it’s only 3.5 for Chone.

Furthermore, Pujols’ observed wOBA was .449.  But the SD for the league actuals is a very high .044.  That makes Pujols’ observed wOBA’s z-score as:
(.449-.335)/.044 = 2.6

So now, where Marcel and Chone were very close (and under-estimated) Pujols’ wOBA, now they are showing as over-estimating it.  And Marcel is severely overestimating.

This I think is a process that I like to call “mathematical gyrations”.  We are simply subtracting, squaring, and dividing numbers to the point where we lose meaning as to what it is we are doing here.


#33    (name hidden)      (see all posts) 2011/02/16 (Wed) @ 10:49

Just to continue from above, all wOBA were .335, SD’s are listed below:


System (2007,2008,2009,2010)
Actuals .786,.744,.745,.752
Chone   .493,.512,.504,.487
Marcel   .439,.428,.439,.416
Oliver   .536,.542,.539,.516
Pecota   .521,.503,.569,.482
Zips   .557,.551,.560,.512


#34    Tangotiger      (see all posts) 2011/02/16 (Wed) @ 10:57

tb/33: not sure what those numbers mean.  Do you have the decimal in the right spot?

***

Here are the RMSE:

0.0363   rmseChone
0.0375   rmseMarcel
0.0364   rmseOliver
0.0374   rmsePecota
0.0369   rmseZips


#35    Tangotiger      (see all posts) 2011/02/16 (Wed) @ 11:09

Here’s something cool I did.  I took the 397 Pure Rookies, and broke them up by ACTUAL wOBA.  Now, I don’t normally suggest that.  At the same time, those guys who have a high actual wOBA should have been forecasted for at least a higher wOBA than the guys with a low actual wOBA.  We won’t get the degrees right, but we should at least get the direction.

For example there were 51 Pure Rookies with an observed wOBA of at least .380 (on 3758 PA).  There were 196 Pure Rookies with a wOBA of at most .290 (on 9469 PA).

The average actual wOBA of the two groups is .403 and .241.  (Do I need to say respectively?)  Now, we do not and cannot expect the forecasters to have forecasted this for the group.  By definition, the guys who averaged .403 got way more lucky than the .241 guys.  But, we should AT LEAST expect the forecasters to be a bit stronger in their forecasts for the guys who we observed to hit .403 than .241, correct?

Marcel we know forecasted .335 for each.

Chone? .315 and .295.  Is that a good range?
Oliver: .333 and .305.  That’s a bit bigger range.
Pecota: .331 and .303 (like Oliver)
Zips: .326 and .297 (like the other two)

So, it seems to me that Chone might be a little conservative on the Pure Rookies.  Maybe.  In fact, I really don’t know what the range SHOULD be.  Maybe the range should have been 20 points.  We don’t know.

This is the data:

n    PA2    actual    cWOBA    mWOBA    oWOBA    pWOBA    zWOBA
196    9469     0.241      0.295      0.335      0.305      0.303      0.297 
64    9723     0.308      0.307      0.335      0.317      0.308      0.305 
46    9855     0.334      0.312      0.335      0.331      0.323      0.310 
40    9106     0.364      0.305      0.335      0.324      0.312      0.309 
51    3758     0.403      0.315      0.335      0.333      0.331      0.326

#36    Tangotiger      (see all posts) 2011/02/16 (Wed) @ 11:11

Actually, we might know: look at how those Pure Rookies that I broke up into the 5 classes did in the NEXT year (sophomore).  That will give you an unbiased estimate of their talent level.


#37    (name hidden)      (see all posts) 2011/02/16 (Wed) @ 11:19

re33: That’s the SD as calculated by SAS, weighted by the variable PA2.


#38    Tangotiger      (see all posts) 2011/02/16 (Wed) @ 11:26

tb/37: Well, that’s not possible!  If the mean is .335, and every player is pretty much between .250 and .450 (i.e, +/- .100), how can you have an SD of .500?

Indeed, if it was bounded from .000 to 1.000, you STILL wouldn’t have an SD = .500.  Even a uniform distribution between 0 and 1 would give you an SD of only .29.


#39    J. Cross      (see all posts) 2011/02/16 (Wed) @ 11:29

I’m not sure whether or not it’s fair to call this mathematical gyrations.  Comparing z-scores is basically the same as taking a correlation and we see basically this same thing if we do that:

chone     0.335
marcel     0.276
oliver     0.332
pecota     0.309
zips     0.337

Why does it make more sense to use RMSE than correlation? 

Basically, this is standardizing the slope (setting the standard deviation in wOBA to be the same for each system) in addition to standardizing the intercept (setting the mean wOBA to be the same), right?


#40    J. Cross      (see all posts) 2011/02/16 (Wed) @ 11:30

I think if you take those SD’s that tbwhite has and divide them by the sqrt(average PA) you’ll get something that looks like a regular SD.


#41    J. Cross      (see all posts) 2011/02/16 (Wed) @ 11:34

Note: those correlations are *not* weighted by PA.  I need to use a different program to get correlations that are weighted by PA.


#42    Tangotiger      (see all posts) 2011/02/16 (Wed) @ 11:36

Remember, I said this:

Chone? .315 and .295.  Is that a good range?
Oliver: .333 and .305.  That’s a bit bigger range.
Pecota: .331 and .303 (like Oliver)
Zips: .326 and .297 (like the other two)

So, I looked to see how the Pure Rookies did in year 2.  Of the players who had a woba of at least .380 as a rookie, they had a woba in year 2 of .325.  Of the players with at most .290 wOBA as a rookie, they had a year 2 woba of .307.

It seems to me that the range therefore that we should presume as a true talent for these two groups is about 18 points.

Chone has 20 points.  Chone has the right range, but the overall mean estimate is too low.  It should forecast the Pure Rookies at 10 points higher.  Indeed, in my main report I showed this:

wOBA   Error   System
0.319   0.0000   Actual
0.306   0.0436   Chone

By this measure, Chone was underforecasting by 13 points (but, as we noted, there may be some selection bias issue).  Indeed, even in the way I’m doing it by looking at year 2, there’s not only selection bias issues, but upward sloping aging as well.

Basically, I think Chone’s got a really good handle on it.  It might be underforecasting by up to 10 points maybe.


#43    (name hidden)      (see all posts) 2011/02/16 (Wed) @ 11:41

#40 - Yes, I suspect the wOBA gets weighted first, then the SD is calculated off the weighted numbers. I’d have to dig thru some probably arcane SAS documentation to confirm and I don’t know if I can stomach that tonight.


#44    J. Cross      (see all posts) 2011/02/16 (Wed) @ 11:42

Correlations weighed by PA2:

with wOBA3:

chone: 0.589
marcel: 0.547
oliver: 0.593
pecota: 0.565
zips: 0.584

with marcel:

chone: 0.847
oliver: 0.877
pecota 0.823
zips: 0.855

chone, oliver and zips are winning here so it might not be surprising that…

chone/oliver: 0.916
chone/zips: 0.921
oliver/zips: 0.914


#45    Tangotiger      (see all posts) 2011/02/16 (Wed) @ 11:44

“this is standardizing the slope”

Right.  But WHY would we standardize the slope?  There’s no reason to.  I already showed the example of what happens if you standardize Marcel’s slope.  That’s why I’m against correlations, because it forces the slope.

Marcel has decided that the top-end wOBA is Pujols at .439.  By standardizing the slope (changing the SD from .026 to .030 let’s say), you are giving Marcel a de facto forecast for Pujols of .030/.026*(.439-.335)+.335 = .455.

This is why I call it mathematical gyrations.  By standardizing, you are hiding what you are in reality doing.

What you are really doing is giving each of the forecasts the same SD for its forecasts.  But hiding it by making the SD = 1.  But, make the SD = .030 for everyone, and then see what you get.  It’s obvious that Marcel gets killed here.

But Marcel CHOSE to have a more shallow slope.  You can’t then tell it to give it a steeper slope just to fit it to what you want!

Furthermore, you are also standardizing the OBSERVED!  So, Pujols’ actual .449 will get standardized down to (.449-.335)*.030/.044+.335=.413.

So, now you have Marcel’s standardized wOBA as .455 compared to his standardized actual of .413.  Even though in reality Marcel said .439 compared to Pujols’ .449.

Sorry.  This is mathematical gyrations, standardizing because that’s what is normally taught to do in stats class.  You can’t do it here.


#46    Tangotiger      (see all posts) 2011/02/16 (Wed) @ 11:51

This is the reality you have to deal with:

1. Marcel and the others forecasted the wOBA, and they did not forecast a z-score.  The slope is the slope and you can’t change it.

2. We are testing against the observed, which means true + random luck.  We are not presuming that luck is linked to true, which is what standardizing the observed implies.


#47    (name hidden)      (see all posts) 2011/02/16 (Wed) @ 11:51

re 32: This seems more like a feature than a bug to me.

How a player relates to the overall population seems like an important part of an evaluation to me. In a draft situation if you’re thinking about positional scarcity, it matters how much you think the drop off is at a position if you pass on the best current player available at that position.

I realize the necessity of regressing to a mean, but it seems like this is a natural trade off or cost of regressing too much. Variance gets extremely understated compared to reality, which would cause the over-valuation of the best players, because they are perceived to be further from the mean than they actually are.


#48    J. Cross      (see all posts) 2011/02/16 (Wed) @ 11:52

Finally, using the format I used last year and calculating z-scores for the hypothesis that each of these systems has the same true correlation with actual wOBA as marcel:

Chone 3.74
Oliver 4.44
Pecota 1.54
ZiPS 3.37

expressing those as equivalent to W-L records (against the null hypothesis that each team is a true .500 ball-club):

Chone 105-57
Oliver 109-53
Pecota 91-71
ZiPS 102-60


#49    (name hidden)      (see all posts) 2011/02/16 (Wed) @ 12:34

re 45: The Z Score was your suggestion.

My point is simply that any measurement of the accuracy of projections should be based on how they are used. I would wager that most people use systems like Pecota, Oliver, etc. for the purposes of drafting players for a Roto League or a Strat League or something like that. In that context, whether a system was off on average by only .01 wOBA or .012 wOBA really doesn’t matter. What does matter is how well it identifies or separates good players from bad players. Nothing I have seen in discussions about testing forecasting systems has addressed this disconnect between how these systems are tested and presumably calibrated and how they are used. To be clear, I’m not suggesting that there would be wholesale differences in the order that the current systems “rank” players, but how significant is .002 wOBA to the average consumer of these projections anyway ?

Please correct me if I am wrong, but Marcel probably has the shallowest slope because it employs the simplest (or heaviest) regression to the mean (it uses the MLB mean, while other systems sound like they use some sort of peer mean).  If that is correct, and if you had a set of systems with roughly equivalent RMSE, could you use the slope as a proxy for the quality of the projections?  The steeper the slope, the more powerful the projections must be if all of the systems being measured end up with negligibly different RMSE.  You need both components though, because obviously you could game the system either way: a really steep slope at some point will increase your RMSE dramatically, and if the slope is too shallow you can hardly even call it a projection (I’m not suggesting Marcel is at that point).


#50    J. Cross      (see all posts) 2011/02/16 (Wed) @ 12:37

I’m sure you’re right about this, Tango, but I’m still trying to wrap my head around it.  I’m frankly surprised that we see such a difference using RMSE vs. R as our metric.  If you create a new Chone, for instance, using the same standardized league mean and the old-Chone z-scores but Marcel’s standard deviation, it would have the same correlation as the old Chone and the same standard deviation as Marcel, but it actually has a higher RMSE than the old Chone.  That’s not making sense to me right now (although I tried this and it’s all true), but maybe I need to get to sleep.

Here are the slopes of wOBA v. pred-wOBA:

slopes  
chone   1.03
marcel   1.11
oliver   0.97
pecota   0.89
zips   0.96


#51    OB      (see all posts) 2011/02/16 (Wed) @ 12:44

On TEST #6:  Apropos of nothing, I find this to be aesthetically pleasing.  Kudos.


#52    J. Cross      (see all posts) 2011/02/16 (Wed) @ 13:09

Further sign that I’m getting old:

We’ve had this conversation before.  Tango, I do see that you’ve been consistent in rejecting correlation as a means to evaluate forecasts and a little reading reveals that average error and root mean square error appear to be the agreed upon ways to go.  Point taken.


#53    MGL      (see all posts) 2011/02/16 (Wed) @ 13:13

What is wOBA3?


#54    J. Cross      (see all posts) 2011/02/16 (Wed) @ 13:27

Re: wOBA3.  It’s from the spreadsheet.

I think it’s just adjusted so that every system has a weighted average of .335.


#55    [email hidden]      (see all posts) 2011/02/16 (Wed) @ 13:38

I’m not convinced that it’s appropriate to reset every forecast to the same league mean.  If a forecaster did a poor job at predicting (directly or indirectly) the league mean, then that seems to me to be a mark against their system.  I recognize that weather, the ball, etc. are not interesting things for a forecaster to account for.  However, things like that do not represent *all* of the differences in baseline levels.  For instance, a forecaster could handle park effects better for young ballparks, so he would have a better handle on the league mean.

So let’s go back to the example where Chone forecasted a .340 wOBA in a league of .330.  Chone still got that player right even though it got the league mean wrong.  Effectively Chone did a poor job (in this hypothetical) of forecasting the hundreds of other players, so it missed on the league mean.  When we evaluate the forecast in its entirety, Chone still gets penalized for missing the league mean on those other players but gets credit for nailing the .340 guy.

I’m not sure this would materially affect the results of the comparison, but it bothers me philosophically.


#56    James Piette      (see all posts) 2011/02/16 (Wed) @ 13:58

Touche, Tango.  I apologize if I seemed rude earlier.  I was merely trying to understand where you were coming from.

Anyway, I took you up on your offer and did the following (I hope that I’m not reposting something someone has already done in the comments; I don’t believe so, but apologies if I do):

(1) Ran the head-to-head for all projection systems with no ties.
(2) Ran the head-to-head for all projection systems with 19 thresholds, going from 1/(19*20) to 0.05.
(3) Plotted them in a (shitty) bargraph.

Here are the links to my code and graph.  If I get more time tomorrow (and people think it helpful), I’ll make a better chart.

http://stat.wharton.upenn.edu/~jpiette/win-bargraph.pdf
http://stat.wharton.upenn.edu/~jpiette/test.forecast.R

My quick thoughts: PECOTA looks much better using this method of comparison.  Then again, it says nothing about how much better it is, or whether the difference is significant…
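For anyone who wants to reproduce the head-to-head idea without the R script, a rough, unweighted sketch (file and column names are placeholders): system A “beats” system B on a player when its absolute wOBA error is smaller by more than the threshold, and everything else is a tie.

    from itertools import combinations
    import pandas as pd

    # Placeholder names -- substitute the posted file's columns.
    df = pd.read_csv("forecasts.csv")
    systems = ["marcel", "chone", "oliver", "pecota", "zips"]

    def head_to_head(a, b, threshold=0.0):
        err_a = (df[a] - df["actual_woba"]).abs()
        err_b = (df[b] - df["actual_woba"]).abs()
        wins_a = int((err_b - err_a > threshold).sum())
        wins_b = int((err_a - err_b > threshold).sum())
        return wins_a, wins_b                       # the rest are ties

    thresholds = [k / (19 * 20) for k in range(1, 20)]   # 1/380 up to 0.05, as above
    for a, b in combinations(systems, 2):
        print(a, "vs", b, head_to_head(a, b), head_to_head(a, b, thresholds[-1]))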


#57    tangotiger      (see all posts) 2011/02/16 (Wed) @ 20:20

wOBA3: right, it’s just the field you should reference on the file I posted, which handles the “null” values by setting .335 for Marcel and .315 for everyone else.

***

tb: I’m not saying you are wrong.  I’m saying that you have to define the problem and methodology much more clearly so that a solution can present itself.  I just don’t see how the solution as you’ve proposed it (or my suggestion to help out, which was wrong as it turns out) helps here.  I can explain everything I’ve been doing, justify why I’m doing it, and explain the meaning of the results in real terms.  I can “sell it”.  So far, you haven’t proposed a solution that is sellable.  At least, not one that I’m buying anyway, FWIW.

***

James, I did not take your post in any negative light.  I’m surprised you read my response to give you that indication.  Don’t worry, that’s just the way I yap.


#58    Tangotiger      (see all posts) 2011/02/16 (Wed) @ 22:16

tb: how about this as a way to test your methodology.

You care about ordinal rankings, correct?  And you want to make sure that whoever Marcel says is better is in fact going to produce more than the next-ranked player, correct?

Why not then think of it in terms of “draft order”?  What if I make 20 teams of 124 players each, where team 1 has the #1, #21, #41 ... #2461 ranked players.  Then team 2 has the #2, #22, #42 ... #2462 ranked players, and so on, down to a 20th team that has the #20, #40 ... #2480 ranked players.

The “best” forecasting system is one that has the largest slope.

And I can do a “perfect” forecasting system that simply uses the actuals in the same way.

Then we simply compare the slopes of the 20 teams for each of the forecasting systems relative to the actual slope.

Will this address your need?

I think it’s a great idea, and I’ll run this in a few minutes.
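For anyone who wants to try the draft-order test as proposed here, a rough sketch (column names are placeholders; each team’s actual wOBA is PA-weighted, and drafting off the actuals themselves gives the “perfect” benchmark slope):

    import numpy as np
    import pandas as pd

    # Placeholder names -- substitute the posted file's columns.
    df = pd.read_csv("forecasts.csv")

    def draft_slope(pred_col, n_teams=20):
        ranked = df.sort_values(pred_col, ascending=False).reset_index(drop=True)
        ranked["team"] = ranked.index % n_teams      # team 0 gets picks 1, 21, 41, ...
        team_woba = ranked.groupby("team").apply(
            lambda g: np.average(g["actual_woba"], weights=g["pa"]))
        # Slope of team wOBA across draft slots; the steeper it falls,
        # the better the system separated players
        return np.polyfit(team_woba.index, team_woba.values, 1)[0]

    for col in ["marcel", "chone", "oliver", "pecota", "zips", "actual_woba"]:
        print(col, round(draft_slope(col), 5))       # last line is the perfect-foresight slope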


#59    [email hidden]      (see all posts) 2011/02/16 (Wed) @ 22:43

I think that sounds promising.  I like the idea that there is a “right” or “perfect” answer that we can compare to, not just across systems, but also across years.  But how does playing time get handled?  It seems like it definitely needs to be weighted by playing time.


#60    Rally      (see all posts) 2011/02/16 (Wed) @ 22:50

“I’m not convinced that it’s appropriate to reset every forecast to the same league mean.  If a forecaster did a poor job at predicting (directly or indirectly) the league mean, then that seems to me to be a mark against their system.”

Mickeyg13,

Depends what you want.  If you want to see who can hit the actual stats best, then by all means pay attention to who best forecasts the league average. 

But most projection users, whether real teams or fantasy teams, are using these in a competitive situation.  What they care most about is whether you correctly identify which players are better than others.

Let’s say MLB invents an army of Strasbots to replace pitchers.  They can pitch 1458 innings per year and never break down.  Let’s say Pujols, Hanley, Votto, Zimmerman, etc. are still the very best players in the National League, and your forecaster gets their rank order right.

Do you really care if the projection has Pujols at .320/.450/.600 and he actually wins the rate triple crown with .270/.350/.475?


#61    [email hidden]      (see all posts) 2011/02/17 (Thu) @ 00:13

Rally,

Exactly. A .300 batting average is not an absolute good. It has meaning only because the ML run scoring environment hasn’t changed too dramatically over time, and so a .300 batting average is an effective shortcut to determining if a player is in the top x% of all players. The stability of ML stats over time makes it seem like it is worthwhile to project them, but really it’s fairly irrelevant. If MLB randomly changed the composition of baseballs every year, for instance sometimes putting superballs in the middle, no one would even try to project the actual stats; it would be impossible. The fact that MLB keeps things fairly constant across time doesn’t make projecting actual stats any more valuable, it just makes it possible.


#62    birdo      (see all posts) 2011/02/17 (Thu) @ 00:23

One clarification question…the actual stats are not park adjusted, correct? 

So if a projection system produces projections that are adjusted for the park where players will play in the coming year (which I believe ZIPS does), this will come out looking better than a system that either
A) just uses past data unadjusted (like Marcel) or B) adjusts all projections to a neutral park (not sure if any do this)?

Does that make sense?  Thanks in advance.


#63    evann      (see all posts) 2011/02/17 (Thu) @ 00:26

I wish this article could be rewritten with every mention of “Marcel” removed.

This just in: Marcel is not interesting.  It performs poorly.  It is not useful for anything.  The only possible use for it would be to provide a baseline for comparison.  However, there are enough projection systems out there that one can use their average performance as the benchmark.  The only instance “Marcel” provided any (seeming) value was when it made no projection at all (for rookies).  But even in that case, it wasn’t that the system performed well, because there was no system employed.  One could just as easily have said in a Marcel-less statement, “Using a generic, static projection for all rookies actually performed better than any specific projection system, suggesting that these systems do not add value for this group of players.”

No need to continue to pat “Marcel” on the back for being kind of close to, or not losing that badly to, other systems.


#64    Tangotiger      (see all posts) 2011/02/17 (Thu) @ 00:50

birdo: correct, the actuals are the actuals (just baselined to a .335 overall average).  A system that knows Adrian Gonzalez moves from Petco to Fenway would seem to have an advantage over Marcel.

***

evann: I can’t stand first-time posters who have to make their first post the way an a$$hole would. “It performs poorly.” is a false statement.  It’s been proven false.  And yet, you come here, in the face of evidence and instead show us your ignorance, and worse try to support your ignorance with bias. 

You are exactly the kind of person that this blog is not for.  If you are just a regular a$$hole, leave and never come back.  If you are a huge a$$hole, you will need to have the last word before leaving.  If you somehow had some sort of first-poster meltdown, and want to retract your post in its entirety (*) and show the world you really are not an a$$hole, then do so.

(*) Other than the part about Marcel being uninteresting which is a matter of opinion, and one I agree with anyway.


#65    mockcarr      (see all posts) 2011/02/17 (Thu) @ 03:59

That’s pretty funny, Tango. Evann’s projection board must be missing a dart.

First time visiting. Thanks for making this stuff available.


#66    Tangotiger      (see all posts) 2011/02/17 (Thu) @ 05:42

Ok, here’s what I did.  I put the top 139 ranked players in group 0, the next 139 in group 1, and so on.  I ended up with 15 groups.  I used Marcel’s ranking.  I removed the Pure Rookies.  Results below are weighted by PA:

rank     Marcel      actual     n
0     0.384      0.385     139
1     0.359      0.360     139
2     0.348      0.344     139
3     0.341      0.341     139
4     0.336      0.337     139
5     0.332      0.330     139
6     0.329      0.327     139
7     0.325      0.326     139
8     0.322      0.334     139
9     0.318      0.322     139
10     0.314      0.312     139
11     0.310      0.316     139
12     0.304      0.303     139
13     0.297      0.299     139
14     0.285      0.291     137

So, if you were drafting a team of 14 hitters based on Marcel, you would get those results.

***

I tried the other way I said it, and it doesn’t work.  That’s because everything is so close that it doesn’t matter much.  Putting the #1234th player in one group and the #1239th player in another group simply won’t matter.  There’s not enough variation.

***

Anyway, here’s what Chone looks like:

rank     Chone      actual 
0     0.390      0.386 
1     0.362      0.358 
2     0.351      0.352 
3     0.344      0.339 
4     0.337      0.336 
5     0.333      0.333 
6     0.328      0.329 
7     0.324      0.322 
8     0.318      0.318 
9     0.315      0.317 
10     0.310      0.311 
11     0.304      0.304 
12     0.298      0.302 
13     0.290      0.292 
14     0.273      0.280

***

The spread of Marcel’s actual was one SD = .023, while it was .026 for Chone.

So, Chone definitely did a better job of separating the players out.
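For anyone who wants to replicate this grouping once the file is posted, a rough sketch (file and column names are placeholders; the nullMarcel flag for the Pure Rookies is described further down the thread):

    import numpy as np
    import pandas as pd

    # Placeholder names -- substitute the posted file's columns.
    df = pd.read_csv("forecasts.csv")
    df = df[df["nullMarcel"] != 1]                   # drop the Pure Rookies

    def group_actuals(pred_col, group_size=139):
        ranked = df.sort_values(pred_col, ascending=False).reset_index(drop=True)
        ranked["grp"] = ranked.index // group_size
        # PA-weighted actual wOBA of each group of 139, ranked by this system
        return ranked.groupby("grp").apply(
            lambda g: np.average(g["actual_woba"], weights=g["pa"]))

    for col in ["marcel", "chone"]:
        print(col, "SD of group actuals:", round(float(group_actuals(col).std()), 3))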


#67    [email hidden]      (see all posts) 2011/02/17 (Thu) @ 06:09

Were the group assignments done season by season, or were all of the seasons lumped together and then put into groups?


#68    [email hidden]      (see all posts) 2011/02/17 (Thu) @ 06:14

It’s also interesting that the Chone results increase monotonically from worst group to best group, but Marcel had some trouble with the groups in the middle (group 8 was .334, while 7 was .326 and 9 was .322).


#69    Tangotiger      (see all posts) 2011/02/17 (Thu) @ 07:05

Right, those groups include a lot of the Low Reliability players, so I expected that.


#70    Tangotiger      (see all posts) 2011/02/17 (Thu) @ 07:30

I also lumped everything together.


#71    J. Cross      (see all posts) 2011/02/17 (Thu) @ 11:17

Tango,

Since Brian gave the okay, will you post the original file?  I want to see how these systems projected individual components: HR%, BB% etc.  It will be much appreciated.



#73    Tangotiger      (see all posts) 2011/02/17 (Thu) @ 22:42

Jared: the FTP software at the office has a 2MB limit.  If you email me, I’ll email it to you.  Otherwise, I’ll ftp tonight.

tom~tangotiger~net <—Wonder if Watson can parse this?


#74    Brian Cartwright      (see all posts) 2011/02/18 (Fri) @ 06:45

Tom, maybe you can post a link here; I would like to get a list of your “pure rookies” - mlbamid and season.


#75    Tangotiger      (see all posts) 2011/02/18 (Fri) @ 07:16

Brian, if you download my file, the “pure rookies” are all those with nullMarcel=1.


#76    tangotiger      (see all posts) 2011/02/18 (Fri) @ 07:29

The file is in post #3.

I call them “pure rookies”, but technically, it’s “anyone who hasn’t played as a nonpitcher in MLB in the last 3 years”.

Which, you would think, is about as “pure rookie” as you can get.  Rick Ankiel, by this definition, counts as a “pure rookie”.


#77    [email hidden]      (see all posts) 2011/02/21 (Mon) @ 01:09

I took a look at the AL from 1982-1992 and threw out 1987 due to the high scoring that year, and came up with a way to move the mean based on a player’s PA. This might correct the error Marcel has based on reliability; I’m throwing it out here for someone else’s use.

First, divide team PA by 9, and subtract the player’s PA from that. For players who switched teams, just make the league the team, dividing the league total by the number of teams to make a pretend team. I’m calling this PAgap below.

Now, we multiply the league rate by a number to get the expectation for a good player, who has about a ninth of the team’s plate appearances. Then, you divide the plate appearance gap above by a number and subtract that result from the expectation to get the new rate to use as the mean.

1B/(AB+SF-HR-SO): league * 21/20, subtract PAgap/9,800 from expectation
2B/(AB+SF-HR-SO): league * 10/9, subtract PAgap/22,000
3B/(AB+SF-HR-SO): league * 6/5, subtract PAgap/159,000
HR/(AB+SF-SO): league * 11/9, subtract PAgap/32,500
(BB-IBB)/(AB+SF+BB+HBP-IBB): league * 17/16, subtract PAgap/28,000
SO/(AB+SF): league * 19/22, add PAgap/4,600
SB/(SB+CS): league * 15/14, subtract PAgap/3,300
SF/(AB+SF-H-SO): league * 9/8, subtract PAgap/93,000
SH/(AB+BB+HBP+SH+SF-IBB): league * 11/20, add PAgap/164,000
IBB/PA: league * 23/16, subtract PAgap/103,000

HBP stay proportionate to PA-IBB-SH
SB+CS stay proportionate to 1B+BB+HBP
GDP stay proportionate to AB-H-SO
R-RBI stay proportionate to H-HR+BB+HBP
RBI-HR stay proportionate to TB-HR

I’m kind of inclined to think the complexity of this is too much for Marcel, but I’m throwing it out there. This does say that someone who didn’t bat at all last year hits like a pitcher, which obviously isn’t right, though all players who had 1-8 PA in that period went .158/.216/.229 in 705 PA, which isn’t inconsistent with the other data.
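To make the recipe concrete, a small sketch of two of the lines above (the multipliers and divisors come straight from the list; the example inputs at the bottom are invented):

    def pa_gap(team_pa, player_pa):
        # A ninth of the team's PA minus the player's PA ("PAgap" above)
        return team_pa / 9.0 - player_pa

    def hr_regression_mean(league_hr_rate, team_pa, player_pa):
        # HR/(AB+SF-SO): league * 11/9, subtract PAgap/32,500
        return league_hr_rate * 11.0 / 9.0 - pa_gap(team_pa, player_pa) / 32500.0

    def so_regression_mean(league_so_rate, team_pa, player_pa):
        # SO/(AB+SF): league * 19/22, ADD PAgap/4,600
        return league_so_rate * 19.0 / 22.0 + pa_gap(team_pa, player_pa) / 4600.0

    # Invented example: league HR rate of .040 on that denominator, 6200-PA team
    print(round(hr_regression_mean(0.040, 6200, 600), 4))   # regular:    ~.046
    print(round(hr_regression_mean(0.040, 6200, 100), 4))   # part-timer: ~.031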


#78    jinaz      (see all posts) 2011/02/21 (Mon) @ 22:23

One thing I found interesting was that the averaged projections didn’t win it all (at least, among those situations where it is reported).  If I remember right, they tended to beat everyone in many prior comparisons (as much as any system can beat another given that they’re all so close).  Here, “ALL” was just middle of the pack, behind CHONE and Oliver. 

On one hand, that makes sense—average the best systems with less good systems and you have something less good than the best.  But the advantage of averaging, I’d always thought, was that if you have a system “breaking” on a particular player, the averaged result across several systems helps you avoid going astray.  It seems like that benefit isn’t enough here.
-j


#79    [email hidden]      (see all posts) 2011/02/22 (Tue) @ 02:30

Ugh ... I should have done a little more work before posting those numbers. I need to look at the predictive value, not the observed one.


#80    Tangotiger      (see all posts) 2011/02/22 (Tue) @ 04:17

Right, Charlie.

If you want to tie the wOBA you regress toward to past PA, the range should be .310 to .350, if the league average is .335.

Basically, if you have 2000 past PA, you should regress closer to .345 or .350.  If you have 1 past PA, you should regress closer to .310.

And in-between, you pick the appropriate spot.

Marcel is not going to change to reflect this.
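One hypothetical way to implement that sliding target; the endpoints come from the comment above, but the linear shape in between is a guess:

    def regression_target(past_pa, low=0.310, high=0.350, full_pa=2000):
        # Slide the wOBA you regress toward between .310 (no track record)
        # and .350 (2000+ past PA); the straight-line interpolation is an assumption.
        frac = min(past_pa, full_pa) / full_pa
        return low + frac * (high - low)

    print(round(regression_target(1), 3))      # ~.310
    print(round(regression_target(2000), 3))   # .350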


#81    [email hidden]      (see all posts) 2011/02/25 (Fri) @ 08:05

I have a zip file ready with the input data used for the tests. There is a csv file with all the projections, and a sql file which will create a table.

C = Chone
M = Marcel
P = Pecota
Z = Zips
O = Oliver (pretest)
O2 = Oliver (with adjustments made after Tango’s tests)

If anyone would like a copy to run their own tests on, write to me by clicking on my name.


#82    Tangotiger      (see all posts) 2012/11/28 (Wed) @ 10:11

Anyone going to repeat the work I did for 2011 and 2012 seasons?

It would be lovely if you would replicate the process exactly as I laid it out.


#83    Rudy Gamble      (see all posts) 2012/11/29 (Thu) @ 00:31

I have 2012 projections for all those listed (minus CHONE of course) + a few others.  I focused on linking players who had been drafted in 12-team MLB fantasy leagues, so most (but not all) of the player linking work is done.  Whoever is taking on this test can e-mail me at rudy-at-razzball-dot-com.

My review of 2012 projections is here:  http://razzball.com/fantasy-baseball-projections-review-2012/.  WAY different methodology than above as the test is fantasy baseball-focused. 

FWIW, after making playing time estimates equal across all sources, I found very little difference between hitting projection sources (PECOTA just edged out Oliver and CAIRO), while pitching has more differentiation, with Fantistics/Steamer/CAIRO as the top 3.

