Monday, October 06, 2008
Evaluating the 2008 Forecasting Systems
David compares THT, Chone, PECOTA, and ZiPS.
The two markets for a forecasting system are MLB and Fantasy Baseball. Well, a third would be general baseball nerdiness.
There are two main components to a forecasting system: rate stats and playing time. Focusing solely on rate stats really only helps out the baseball nerdiness group. The playing time forecast is crucial, and really one where you need more manual intervention (thereby reducing the “system” to “how I interpret the scuttlebutt”).
In any case, I hope that more researchers evaluate the forecasting systems like David did, so that we can stop with the nonsense that one system is better than the other. As I’ve said in the past, it is a very dubious claim, akin to saying a team that wins 85 games is better than a team that wins 84 games. Marcel will win 83 or 84 games, the good forecasting systems will win 84 or 85, and the not so good will win 81 or 82.
Peter, wouldn’t using FIP as the evaluation measure be sort of missing the point? The ultimate point of a projection is, it seems to me, to try to predict how many runs a pitcher will surrender. It seems to me that FIP is a methodology that tries to improve such projection results by accounting for our knowledge that recent past FIP is a better predictor of a pitcher’s future performance in surrendering runs than recent past ERA. But that doesn’t make FIP the end in itself.
On a separate note, it might be interesting to test these various results against something more like a control group prediction system. Marcels might be one such control group, as it uses fewer detailed adjustments than the more sophisticated systems. Even more of a control group might be a system that simply projects a player to have the same performance as the year before, with no adjustments for regression to the mean, aging, or anything else. Using these control group studies might help demonstrate the value added of the various adjustments being made by the more sophisticated systems.
Lastly, are there measures of prediction success other than average variation that might be as or more illuminating? Is a prediction that errs by around 15% on most players more or less useful than a system that gets half its projections exactly right but misses half its projections by 30%? Would it be useful to know if some projections are erring more consistently than others on the high side (or low side)?
birtelcom - The usual method of doing projections is to try and project a player’s true talent and then adjust for the specific circumstances that a player plays in. For most projection systems that adjustment is limited to a home field park factor. If you want to accurately project a pitcher’s ERA you also have to adjust for the quality of the defense on his team. One can certainly do so, but it adds a layer of complication to the projection and includes in a pitcher’s projection a factor that is beyond the pitcher’s control. Using FIP as a measure of projection success limits what is being projected for a pitcher to only the pitcher’s abilities.
Your third paragraph has excellent suggestions. It would be enlightening to know how many hitters’ and pitchers’ performances fall within some fixed margin of error of their projections for each system. It also would be helpful to know if a projection system has a systematic bias error. It appears from the following sentence in David’s study that he has normalized the projections for each system to the year’s average OPS and ERA. I am not sure that normalizing is the correct thing to do or that David’s method of adding or subtracting a fixed amount from each player’s projection is the best method of normalizing.
I compared each player’s projected OPS to his actual OPS, adjusting the numbers so that averages were equal. So if we had three players with OPSs of .800, .850, and .900, and their projections had been .825, .875, and .925, the projections would each be adjusted down 25 points.
birtel: I am always surprised when Marcel is not one of the forecasting systems when these evaluations take place. Nate included it last year, as did Chone. Not sure why David did not include it, but David sent me his data, so I can offer my two cents later in the week after I include the Marcels.
And to preempt the issue, the official position of Marcel is that the forecast for every human being on the planet is a league average player with 200 PA or 25 IP. It is frankly a wasted effort to produce one thousand or one million identical lines of performance, which is why Marcel doesn’t bother. However, the explicit statement should suffice.
***
The analogy to ERA is R, RBI (or R+RBI-HR per… out, or PA.. out I think). The analogy to OPS is component ERA.
I did a comprehensive study for the 2007 forecast here:
http://www.insidethebook.com/ee/index.php/site/comments/community_forecast_2007_preliminary_results/#29
http://www.insidethebook.com/ee/index.php/site/comments/community_forecast_2007_pitcher_results/
I discuss how to handle the “normalization” issue, as well as how not to remove players based on playing time.
First, let me say that I am also in the camp that finds it silly to compare projection systems to see who is “better.”
For one thing, one can easily choose the methodology (average squared error, average error, correlation coefficient, etc.) that makes one system look better than the other. I am not saying that David did this, but it certainly could be done.
For another, if you want to make sure that your system rates well after the fact, there are certain tricks you can use. One is to never project someone with a very low OPS or high ERA. Even if those projections are correct (that is, around that player’s true talent), if that player happens to play around his true talent for the first 100 PA or TBF or so, he will likely get no more playing time, and those players will drop out of the samples being evaluated at the end of the season or carry very little weight (for those evaluators who use all players, but weight each comparison by playing time). IOW, there is a playing time versus performance bias.
There are other “tricks” to making sure that your projections do better at the end of the season. Here is something that David and I discussed this summer that I found fascinating. It never occurred to me until David mentioned it (the old can definitely learn from the young):
Let me see if I can even explain this correctly. It is a little difficult.
Let’s say that you are projecting minor league players using their MLE’s. What mean do you use for regressing their weighted average MLE’s (setting aside their height, weight, position, age, etc.)? I normally use the mean of all rookie players in MLB. That is not correct. If you want to estimate their true talent, all you know is that they are minor league players, so you are supposed to regress them to the mean of all minor league players (or whatever level they are at), and NOT the mean of all rookie players in MLB. That is what you want to do if you were advising, for example, a major league team, on who you thought were their best minor league players, based solely on the numbers. And your projections should be reasonably accurate if you looked at their performance next year, for all the players you projected.
But…
If you were being evaluated based only on those players who made the majors and got significant playing time in the majors, your projections would be way too conservative! These are a biased group of players on two fronts. One is that the teams and their scouts determined that these were their best minor league players even if they didn’t necessarily have the best stats in the minors. And two, they likely got lucky (or at least do not include the unlucky ones) in their first 100 PA or so in the majors to warrant more playing time.
So, if you are projecting minor leaguers in order to be more “correct” after the fact - after they do in fact play in the majors - and those are the only ones that are going to be evaluated - then you definitely want to use the rookie MLB mean as the number to regress to, even though that technically will not yield a good estimate of their true talent, regardless of whether they make the majors or not.
IOW, if you want to do well on these evaluation systems don’t ever project a player like Blake DeWitt to hit .556. It ain’t gonna happen. If he hits that or less in his first 100 PA, he is banished to the bench or to the minors.
As far as using FIP or ERA, it depends on if the forecasters are projecting ERA including team defense or not. If they are, then it is unfair to use FIP to evaluate them. If they are simply projecting context neutral pitching talent, adjusted for park only (and not team defense), then yes, you should use something like FIP. I think that most of the forecasters try and project ERA including park and team defense, but I am not sure.
MGL: right, that’s why Marcel’s official forecast is for any rookie to be exactly league average. That is closer to the truth of where their sample performance will be (the rookie league average, which is around 50 OPS points below league average) than their actual true talent.
The bias of number of PA cannot be overstated. And the lack of PA in forecast evaluations needs to be highlighted.
Well, I don’t have any problem, per se, with forecasters not forecasting PA, only performance rates. It is what it is. If I give only rates, I am essentially forecasting true talent (presumably). If I give a playing time forecast also, I guess I need to say whether my performance rate forecast incorporates the playing time and the ensuing selection bias or not, or they are two separate things. If they are two separate things, then the performance rate forecast will probably be a little low as compared to actual performance. If I am trying to incorporate selection bias, then I have to “bump up” my performance projections a little, or my methodology for my forecasts has to include this selection bias (Pecota should, for example, since it uses real live players to approximate a player’s future performance, one of the advantages, BTW, of that method).
Bottom line, as I said, is that these comparison evaluations are pretty worthless, at least with respect to similar forecasting systems. If, however, we were comparing, say Pecota or THT to Steve Phillips’ and Buster Olney’s player projections, then the results would probably be telling.
The reality is that you are not really forecasting true talent because that is something that we can never know. So, we’ll never be able to evaluate it!
MGL: Braun’s true talent is to hit 32 HR in 650 PA
Me: But he hit 37 in 663.
MGL: Well, he got a bit lucky.
Me: How can we know that?
MGL: Well, we can’t know that for Braun in particular, but we can know that for 20 hitters like him.
Me: Well, I want to know how many HR Braun will hit in 2008.
MGL: I can’t know that. I can tell you how many he should hit on average, but you can’t evaluate my statement, because god hasn’t told us how many he really should hit on average.
So, in order to be part of the discussion, we have to say how many HR he will hit, and evaluate that against how many HR he actually did hit.
Otherwise, we can say that QUATLU sells for $10 a share today because we expect it to sell for $11 one year from now, and $12.10 two years from now, and we expect a 10% annual discount rate, and then when it actually sells for $13 in 3 months, what are we going to do?
At the time you make your forecast for QUATLU it was based on the best forecast you can make given the fundamentals and technicals (its “true” value) but the sampling over 3 months (i.e., real-life stuff) shows that it got “lucky”. Or you screwed up your original evaluation.
So, I have to say that anyone who supplies forecasts must do so with the expectation that it will be evaluated to the end-of-year results. It’s the only way to evaluate them.
David provided me with the data he used. I come to a different conclusion. The way I do the analysis is the following:
1. Figure out the OPS baseline for each forecasting system. Using simple averages of players with at least 200 PA, I get:
Actual  THT    PECOTA  CHONE  ZiPS
0.760   0.776  0.769   0.769  0.762
2. Adjust the OPS of all players so that they have the same common baseline. So, I knock out 16 OPS points from THT, 9 from PECOTA and Chone, 2 from ZiPS. The reason is that you treat each system as its own universe, and therefore, we don’t care at all about the exact OPS, but rather the OPS relative to all players. Ideally, each of the systems would provide the baseline runs per game so that I can do a better job of baselining. Using the sample data to reverse-engineer the baseline is acceptable.
Also, I should weight by PA, but, I’m just doing something quick for the moment.
3. Compare the adjusted forecast to the actual OPS. Take the absolute value of this error.
David on the other hand first figured the absolute error, and then did the adjustment. I don’t agree with his method.
As a result, I get the following results:
THT PECOTA CHONE ZiPS
0.066 0.065 0.062 0.063
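The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `baseline_adjusted_mae` is made up here, the averages are unweighted (no PA weighting, as in the quick version described), and the sample numbers are the three-player illustration quoted from David's article, not the real data set.

```python
from statistics import mean

def baseline_adjusted_mae(forecast, actual):
    """Shift every forecast by the system-wide offset, then average the absolute errors."""
    # Steps 1 and 2: the system's baseline minus the actual baseline (simple averages).
    shift = mean(forecast) - mean(actual)
    # Step 3: absolute error of each baseline-adjusted forecast, averaged over players.
    return mean(abs((f - shift) - a) for f, a in zip(forecast, actual))

# Hypothetical three-player example: every forecast is 25 points too high,
# so after the baseline adjustment the remaining error is essentially zero.
actual = [0.800, 0.850, 0.900]
forecast = [0.825, 0.875, 0.925]
print(baseline_adjusted_mae(forecast, actual))
```

Note that the absolute value is taken only once, after the baseline shift has been applied to each player.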
***
We also CANNOT take the RMSE. RMSE not only accounts for the intercept (which is good, which is what I did in step 2), but also alters the slope! This is ridiculous. For example, I could take everyone’s forecasted OPS, divide by 10 (forecast Pujols with an OPS of 0.100 instead of 1.000), and run a regression. Guess what the correlation will be with actual OPS? The EXACT same thing as if I didn’t divide by 10.
To the extent that we believe that everything moves rather linearly, and we pay for things by linear difference, we cannot now start to apply a multiple on the slope to best-fit the forecasted data to the actual data. You can alter the intercept (which is what I do in step 2). You cannot alter the slope (which changes the $ per win valuation).
***
Anyway, now that David was nice enough to give me the data, I will include the Marcels (which should really be the standard in all forecast evaluations), as well as the PA and IP for weighting purposes.
I compared each of the absolute errors to each other, to see who did better in head-to-head. For example, Pujols’ OPS was off by .118 for THT and .121 for PECOTA. I called that a tie. ZiPS was off by .091. I call that a win for ZiPS v PECOTA.
I did this for the three forecasters against PECOTA, head-to-head, and this is what I got comparing 329 players:
THT v PECOTA: 90-88, 151 ties
Chone v PECOTA: 80-57, 192 ties
ZiPS v PECOTA: 77-67, 185 ties
By this measure, all three systems beat PECOTA. And Chone simply trounced PECOTA. It’s hard to tell with the absolute errors (who the f-ck knows the implication of .062 compared to .065?). But, this head-to-head, which is REALLY what we want anyway, makes things crystal clear.
That said, the “trouncing” is simply with 23 players. Giving each of them half a win for the tie, you get these win%:
0.503 THT v PECOTA
0.535 Chone v PECOTA
0.515 ZiPS v PECOTA
With 329 “games”, 1 SD is 28 points. So, even the win of Chone v PECOTA isn’t statistically significant.
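For anyone who wants to redo the arithmetic, here is a small Python sketch of the win% and SD calculation. The function names are made up for illustration, and the SD treats every comparison as a fair coin flip, which is the simplification used above (ties counted as half a win would actually shrink the spread a bit).

```python
from math import sqrt

def win_pct(wins, losses, ties):
    """Head-to-head win%, counting each tie as half a win."""
    return (wins + 0.5 * ties) / (wins + losses + ties)

def coin_flip_sd(n_games):
    """SD of win% over n_games if every game were a 50/50 coin flip."""
    return sqrt(0.25 / n_games)

chone = win_pct(80, 57, 192)   # Chone v PECOTA record from above
sd = coin_flip_sd(329)
print(round(chone, 3))               # 0.535
print(round(sd, 3))                  # 0.028
print(round((chone - 0.5) / sd, 2))  # 1.27 SD above .500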
Since Chone did the best, here’s how everyone did matched up to Chone:
win%   matchup          W   L   T    W+T/2  N
0.552  Chone v THT      81  47  201  181.5  329
0.535  Chone v PECOTA   80  57  192  176    329
0.505  Chone v ZiPS     73  70  186  166    329
Finally, ZiPS against the other 3:
win%   matchup          W   L   T    W+T/2  N
0.532  ZiPS v THT       88  67  174  175    329
0.515  ZiPS v PECOTA    77  67  185  169.5  329
0.495  ZiPS v Chone     70  73  183  161.5  326
So, based on this, the best to worst systems are:
Chone
ZiPS
PECOTA
THT
But, other than being reasonably sure that Chone is a tiny bit better than THT, the rest are all pretty close.
Remember, even though Chone is 2 SD from THT, all that means is that the actual difference between the two is non-zero. It is NOT a 52 point difference in win% as the empirical data shows.
Agreeing that Chone is the best, Chone was off in its forecast by at least .100 OPS on 54 of the players. Among those players that Chone was clueless on, THT did not have any player forecast within 50 points, and neither did ZiPS. PECOTA only had one player (Miguel Tejada).
PECOTA was off by at least .100 OPS on 61 players. Among those players,
THT was off by under 50 points on 3 players:
Chris Young
Omar Quintanilla
Michael Bourn
Chone 1:
Matt Stairs
and ZiPS 5:
Jeff Kent
Matt Stairs
Chris Young
Omar Quintanilla
Michael Bourn
Seeing the overlap in names of most of these players, this points to a possible bias in PECOTA. It’s possible that these guys don’t have good comps, and therefore, PECOTA overweights their poor comps, and spits out non-good forecasts.
Adding up all the numbers, here’s what I get for hitters:
win% system W L T
0.530 Chone 234 174 579
0.514 ZiPS 235 207 542
0.482 PECOTA 212 247 528
0.473 THT 204 257 526
Repeating the same steps for pitchers, with at least 50 IP (282 pitchers), we have this:
THT PECOTA CHONE ZiPS
0.907 0.860 0.859 0.880
Once again, Chone is on top, a smidge ahead of PECOTA. Head-to-head against Chone, with a tie being any ERA difference within 0.25 runs, we have this:
0.525 Chone v THT 60 46 176
0.473 Chone v PECOTA 44 59 179
0.535 Chone v ZiPS 80 60 142
That’s win%, W, L, T.
So, while the overall error between Chone and PECOTA was the same, PECOTA did much better than Chone, head-to-head.
Here’s PECOTA head-to-head against all:
0.535 PECOTA v THT 71 51 160
0.527 PECOTA v Chone 59 44 179
0.514 PECOTA v ZiPS 71 63 148
And:
0.504 ZiPS v THT 75 73 134
0.486 ZiPS v PECOTA 63 71 148
0.465 ZiPS v Chone 60 80 142
And:
0.465 THT v PECOTA 51 71 160
0.475 THT v Chone 46 60 176
0.496 THT v ZiPS 73 75 134
For an overall:
win% system W L T
0.525 PECOTA 201 158 487
0.511 Chone 184 165 497
0.485 ZiPS 198 224 424
0.479 THT 170 206 470
***
Adding up the hitters and the pitchers, we get:
win% system W L T
0.522 Chone 418 339 1076
0.502 PECOTA 413 405 1015
0.501 ZiPS 433 431 969
0.476 THT 374 463 996
Chone, among these 4, is the clear leader. I hope no one tries to tout their system as the best, other than maybe Chone.
I will perform this analysis for any other person who wants to provide me with the forecasts of other systems. You must, MUST, include the Retro, BDB, or BIS player ID.
Marcel will be forthcoming. I have no idea how Marcel did, so I am looking forward to seeing how well it did.
Tango, are you adjusting for how each projection system predicted the overall offensive level?
cann: see post 11, point number 2.
When you are only evaluating your forecasts for major league players, you can “cheat” by not projecting anyone to be below major league replacement level, and do a little better in these forecasts.
A real test for the systems that try and predict a huge number of minor leaguers, like Dan, Nate, and myself, is to compare the projections to the combined MLB and MLE data for these players.
Of course it’s a little tricky in that there are several MLE calculations out there, it’s not an objective standard that we can’t dispute like a major league batting line, but it would be interesting to see.
As for regression, just as you don’t want to regress a minor leaguer to the major league average, you don’t want to regress a player in low A to the same mean as a AAA player. CHONE knows what level the data is coming from, and adjusts the regression based on that.
I’m not going to brag about these results, I don’t think there’s a huge difference and they are all good systems, but if some team, or fantasy baseball outlet, or whoever wants to hire me or buy the projections for exclusive use, I’m willing to sell (email rallymonkey - numeric five- at sign - comcast dot net). Or if not, I’ll just give them to everyone for free this winter.
There’s no question that you can “cheat”, if you know beforehand that the player in the evaluation will get at least 200 PA, and if he gets less, then he’s out of the picture.
For example, there were 28 hitters with 200 to 249 PA in the study. Of those, most were expected to get limited playing time. I’ll exclude Travis Hafner and Andruw Jones. So, we have 26 hitters who received between 200 and 249 PA, we didn’t really expect them to get more, and here’s the forecast for those 26 hitters by the 4 systems:
THT PECOTA CHONE ZiPS
0.712 0.717 0.715 0.704
So, all 4 agreed that they are below-average hitters, but that they are probably decent hitters.
The actual OPS of these hitters was .660. So, these 4 systems really missed the boat.
But, if we look at the guys who had 250 to 299 PA (excluding Victor Martinez and Frank Thomas), we have 29 hitters, and this is their forecast:
THT PECOTA CHONE ZiPS
0.732 0.730 0.729 0.718
So, these systems agree that these hitters are a bit below average, and deserve to be part-time players. Their actual OPS was .717.
What happens is that if you include guys with too few PAs, their actual OPS will simply undershoot what you expected. And so, if you know that the boundary line would be say 250 PA, then you know to forecast players with at least a .680 or .700 OPS, because you know that he won’t be allowed to get that many PA if he’s hitting terribly.
However, for practical purposes, does this help? If I change all of the Chone forecasts such that the minimum OPS forecast is .675, the absolute error of all players remains at .062. If I set the minimum to .700, I get the same absolute error. Indeed, I have to change the minimum OPS to .750, to get the absolute error to .064. And even then, that’s still better than PECOTA!
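The floor test described here is easy to rerun on any set of forecasts. A minimal sketch, with made-up numbers: the function name `mae_with_floor` and the three sample players are hypothetical, and the baseline adjustment from the earlier steps is left out to keep it short.

```python
def mae_with_floor(forecast, actual, floor):
    """Average absolute error after raising every forecast below `floor` up to `floor`."""
    floored = [max(f, floor) for f in forecast]
    return sum(abs(f - a) for f, a in zip(floored, actual)) / len(actual)

# Hypothetical forecasts and actual OPS for three players (not from the real data).
forecast = [0.650, 0.720, 0.800]
actual = [0.700, 0.700, 0.850]

for floor in (0.0, 0.675, 0.700, 0.750):
    print(floor, round(mae_with_floor(forecast, actual, floor), 4))
```

Running the same loop over a real system's forecasts would show how much, if at all, a given floor moves the overall error.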
So, while I agree in principle with Rally, the problem is that we really don’t know who will get the PA, and therefore, there’s really no way to cheat. If one has doubts, one needs only to test it.
With all due respect to David G., I find it a little troubling that he uses an evaluation system that finds that THT is the best or one of the best forecasting systems, and Tango, who is unbiased, uses a system that finds that THT is the worst. While I am not suggesting that David purposely used a system to make his (THT’s) projections look better, I think the clear moral of the story is that you should ignore the results, or at least take them with a large grain of salt, when one of the forecasters evaluates the forecasters.
If nothing else, you may have “publishing bias.” Let’s say that I am thinking about writing an article about how well the various systems did. And let’s say that I have a chance to use 3 or 4 different methods to evaluate the forecasts. And let’s say that I randomly choose one or the other system (or at least I choose one for whatever reason, not knowing which of the 3 or 4 is better or worse for my forecasts). If I happen to choose a method of evaluation that makes my forecasts look good, what do you think the odds are that I end up publishing my study or writing my article versus if I happen to choose a method that makes my system look bad?
There is certainly a publishing bias. You don’t hear about guys who forecasted great things six months ago for the stock market in October, do you?
David, to his credit, gave me all of this data that he used. And I used only that data.
The lesson, above all, is that all researchers should provide the data of their research, if there is a conflict of interest. If Nate Silver, or Dan, or Rally undertake a similar exercise, it would behoove them to make the underlying data public.
Not to sound all snooty about it, but isn’t it better for Rally to post this quote:
“Chone, among these 4, is the clear leader. I hope no one tries to tout their system as the best, other than maybe Chone.—Tangotiger, Oct 7, 2008”
Than for him to post:
“The most super-deadly-accurate forecasting system. —Myself, today”
That’s why you want someone unbiased like me to do these evaluations. Even though I am trustee of a competing system (Marcel), I have publicly stated that I would PREFER that other systems beat Marcel. No one with self-interests would say that.
***
The quote is a not-so-subtle shot at the way BP handles itself in this matter. Nate is very upfront in his posts on this issue, as is Tom at Diamond-Mind when he would do his analysis. So, these guys, left to express themselves without someone hovering over them, are honest guys, who will engage honestly in this matter. They both make it very clear to take the results with a grain of salt, since they are all pretty close.
But the BP marketing arm, to actually put the words they do on their book, simply resorts to an obfuscation. Indeed, their yet-to-be released 2009 annual ALREADY has the following words on their cover:
“Featuring Deadly Accurate PECOTA Projections for More Than 1,600 Players”
Those are the same words as their 2008 Annual. Read their quote. Taken literally, it means that they have deadly accurate (i.e., nearly perfect) forecasts for 1600 players for the upcoming season. First of all, how do they know it will be accurate? They can only make the claim AFTER the season is over. Secondly, no forecast can be “deadly accurate”. Finally, they offer no evidence of the most recent season for their claim. Indeed, the current analysis shows that they are pretty ho-hum.
What they should say:
“Featuring the time-tested and well-received PECOTA forecasting system, applied to more than 1,600 players.”
That is a clear statement. The statement they trumpet on their cover is an obfuscation.
http://en.wikipedia.org/wiki/Obfuscate
Why do I feel like Bill James against Elias?
Just to be clear, I would have published the results no matter what they showed - I like to think, at least. Certainly not all the results I published were favorable to THT. Frankly, I thought what I did was mathematically equal to what Tango does - obviously not. I now see where I erred, but I don’t necessarily know that one way or another is better. Either way, I think the ultimate conclusion is that the systems are all pretty close, and either way, THT has some room for improvement.
Obviously, if I wanted to obscure something, I wouldn’t have sent the data to Tom.
Right. I think MGL only used you as an example, not necessarily trying to impugn your character.
I definitely think you are wrong. You did this:
=AVERAGE(W2:W330)-ABS(AVERAGE(S2:S330)-AVERAGE($R2:$R330))
which for people who don’t have your spreadsheet is this:
=AVERAGE(abs(forecast - actual)) - ABS(LGforecast - LGactual)
What I’m saying needs to be done is this:
=AVERAGE(abs((forecast-actual) - (LGforecast-LGactual)))
You are taking an absolute error and then subtracting an absolute error. This is wrong.
You need to first take the two differentials (player and league), then subtract the two, THEN take the absolute value. Only once can you take the absolute value.
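A quick way to see that the two formulas are not equivalent is to run both on a tiny example. This Python sketch uses made-up function names and two hypothetical players; it is not David's spreadsheet, just the two formulas as written above.

```python
from statistics import mean

def subtract_after_abs(forecast, actual):
    """Average absolute player error, minus the absolute league-level error."""
    player_err = mean(abs(f - a) for f, a in zip(forecast, actual))
    return player_err - abs(mean(forecast) - mean(actual))

def subtract_before_abs(forecast, actual):
    """Remove the league-level error from each player first, then take the absolute value once."""
    league_err = mean(forecast) - mean(actual)
    return mean(abs((f - a) - league_err) for f, a in zip(forecast, actual))

# Hypothetical: one player forecast 100 points high, one forecast exactly right.
forecast = [0.900, 0.800]
actual = [0.800, 0.800]
print(subtract_after_abs(forecast, actual))   # essentially zero
print(subtract_before_abs(forecast, actual))  # about 0.050
```

The first version reports almost no error even though, relative to the league baseline, one player is 50 points too high and the other 50 points too low.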
Right, you are correct. For whatever reason, I had thought the two equivalent.
Of course I am not trying to impugn David’s character. For the record, I know him fairly well, and he is one of the most intellectually honest, humble, and circumspect analysts out there (and I hope he stays that way).
But I stand firmly by what I said in my above post.
For the record, I do little studies all the time and if I don’t come up with certain results, I drop the issue, or I don’t complete or publish the article, for various reasons, not the least of which is that there is often nothing left to publish or what I would have to publish is not interesting! That is the essence of publishing bias. It does not have to be an intentional act on the part of the researcher to fool the public or make himself good good (or someone else bad), or some other nefarious reason or agenda. It is a natural part of research and often should be taken into consideration. Conclusions based on sample data should always be treated cautiously.
As well, it is ALWAYS better for someone neutral to do an evaluation, no matter how honest a person is. We cannot help being biased when we are evaluating ourselves. In some cases, we end up being biased against ourselves, like the Little League coach who plays his son LESS than the other kids for fear of being criticized by the parents.
We also CANNOT take the RMSE. RMSE not only accounts for the intercept (which is good, which is what I did in step 2), but also alters the slope! This is ridiculous. For example, I could take everyone’s forecasted OPS, divide by 10 (forecast Pujols with an OPS of 0.100 instead of 1.000), and run a regression. Guess what the correlation will be with actual OPS? The EXACT same thing as if I didn’t divide by 10.
I do not understand this. How about this formula:
=POWER(AVERAGE(POWER((forecast-actual) - (LGforecast-LGactual), 2)), 0.5)
Isn’t that the RMSE?
make himself good good
should be ”look good” of course…
dcj - What I think Tango is saying is that using RMSE changes the relationship of the numbers to each other.
I tried testing this on some data I had handy (RA/BsR/ERA for pitchers) though, and I can’t figure out what he’s talking about. RMSE for two sets of data divided by 10 is RMSE/10 for the two sets of data, same as Absolute Average Error like he’s using here. I don’t really have a nose for these sort of things, though, so it’s as likely (if not more so!) that I’m misunderstanding the point.
I was confusing myself with RMSE and correlation. The correlation alters the slope. RMSE, the square root of the mean of the squared differences, is what it is.
I don’t know that we want to know that above the absolute error. Why not cubic?
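The correlation-versus-RMSE distinction can be checked directly. The sketch below uses hypothetical OPS numbers and hand-rolled helpers (`pearson` and `rmse` are names chosen here): dividing every forecast by 10 leaves the correlation with actual OPS unchanged, while the RMSE gets far worse.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def rmse(forecast, actual):
    """Square root of the mean squared difference."""
    return sqrt(sum((f - a) ** 2 for f, a in zip(forecast, actual)) / len(actual))

# Hypothetical forecasts and actual OPS for four players.
actual = [0.700, 0.750, 0.800, 0.900]
forecast = [0.720, 0.740, 0.830, 0.860]
shrunk = [f / 10 for f in forecast]  # the "Pujols at 0.100" version

print(pearson(forecast, actual), pearson(shrunk, actual))  # same correlation
print(rmse(forecast, actual), rmse(shrunk, actual))        # very different RMSE
```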
I don’t really know the answer to that question, Tom. I can sorta describe the differences, but I have no idea what’s preferable.
Let’s take an example, using the following as our “errors”:
First set: 10, 10, 10
Second set: 5, 10, 15
Same average error for both, right?
The RMSE for the first is 10, same as the AAE. The RMSE for the second, however, is 10.8 - that’s because RMSE pays more attention to the largest outliers.
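Here is the same comparison in Python for the two sets of errors (10, 10, 10 and 5, 10, 15), in case anyone wants to try other numbers. The helper names are just for illustration.

```python
from math import sqrt

def aae(errors):
    """Average absolute error."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error: larger errors get more weight."""
    return sqrt(sum(e * e for e in errors) / len(errors))

first = [10, 10, 10]
second = [5, 10, 15]
print(aae(first), rmse(first))              # 10.0 10.0
print(aae(second), round(rmse(second), 1))  # 10.0 10.8
```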
That said, I have no real arguments for why the RMSE is more “correct” for the above example than AAE. I tend to use AAE myself, because I think it makes more intuitive sense; if I can’t explain why we’re squaring the numbers and then taking the root then I don’t expect my readers to believe me that it’s the correct thing to do.
Maybe somebody like Andy or Pizza can weigh in on this.
Remember what the RMSE does: it overweights each error based on its value.
Other than the relationship that RMSE has to the normal distribution, I don’t see why we want to evaluate a measuring system based on how far away the error terms are by giving the larger error values more weight.
Like I said, why not cube it? Just because we want to fit it into a normal distribution so that we can explain the amount of variance, I don’t see that as the reason to do it.
I’m sure the bright minds around here can explain the reasoning.
Well, wikipedia can’t explain why we are squaring instead of just using absolute value of error.
“Like variance, mean squared error has the disadvantage of heavily weighting outliers.[5] This is a result of the squaring of each term, which effectively weights large errors more heavily than small ones. This property, undesirable in many applications, has led researchers to use alternatives such as the mean absolute error, or those based on the median.”
From practical experience, you always want to use square errors instead of absolute errors because square errors are much easier to minimize algorithmically. Square errors result in nice, smooth functions over your parameter space, whereas absolute errors have areas with discontinuous derivatives. When you are minimizing with respect to, oh, 50 different variables, having discontinuities in the derivative will kill you.
But that’s just experiential, I don’t have a theoretical or fundamental reason.
Say we have only 2 players, Player A and Player B. We are projecting OBP.
Player A: projected .360, actual .350
Player B: projected .320, actual .340
Consider a coordinate plane, where the x-axis is Player A’s OBP and the y-axis is Player B’s OBP. The projection corresponds to the point (.360,.320). The actual performance corresponds to the point (.350,.340). The closer together the two points are, the better the projection.
The distance between the two points is
sqrt( (.360-.350)^2 + (.320-.340)^2 )
by the Pythagorean Theorem. This is the RMSE up to a normalizing factor (strictly, it is sqrt(2)*RMSE).
If there is a third player, we have to do the whole thing in three dimensions. The projection is a point in 3-dimensional space, and so is the actual performance. The RMSE is the straight-line distance between the two points, divided by sqrt(3). The sqrt(3) is a normalizing factor.
Suppose for example that the projection was (.300,.300,.300) and the actual was (.310,.310,.310). We want the RMSE to be .010. The straight-line distance is
sqrt( .010^2 + .010^2 + .010^2 ) = sqrt(3)*.010
so it is appropriate to divide by sqrt(3).
When we are dealing with 329 players, RMSE gives the normalized distance between two points in 329-dimensional space. If that makes your head hurt, go back to the 2-player example.
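For anyone who wants to check the distance interpretation numerically, here is a small sketch using the two-player and three-player examples above. The helper names `rmse` and `distance` are just labels I picked for the illustration.

```python
import math

def distance(projected, actual):
    """Straight-line (Euclidean) distance between the two points."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(projected, actual)))

def rmse(projected, actual):
    """Root mean squared error: the distance divided by sqrt(number of players)."""
    return distance(projected, actual) / math.sqrt(len(projected))

# Two-player example: projected (.360, .320), actual (.350, .340)
proj2, act2 = (.360, .320), (.350, .340)
# Three-player example: projected (.300, .300, .300), actual (.310, .310, .310)
proj3, act3 = (.300, .300, .300), (.310, .310, .310)

for proj, act in ((proj2, act2), (proj3, act3)):
    n = len(proj)
    print(f"{n} players: distance {distance(proj, act):.4f}, RMSE {rmse(proj, act):.4f}")
```

In the three-player case every projection misses by .010, and dividing the distance by sqrt(3) brings the RMSE back to .010, which is the normalization described above.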
This post is getting long, so I’ll stop here and then write part 2.
Part 2: comparison of RMSE and AAE. I’ll stay in two dimensions this time.
In the example above, Player A hit .350 and Player B hit .340. That corresponds to the point (.350,.340) in 2-dimensional space.
We projected (.360,.320). Imagine a circle having center (.350,.340) drawn through the point (.360,.320). All points inside the circle are closer to the actual performance than (.360,.320) is. All points along the circle are the same distance from the actual performance as (.360,.320) is.
If someone else comes along with another projection, we want to know if it was better or worse than ours. When we measure by RMSE, we are saying: projections that fall inside the circle are better than ours, projections that fall outside the circle are worse, and projections that fall along the edge are exactly as good.
This is because the RMSE of a projection is directly proportional to its straight-line distance from (.350,.340).
Now I’ll do the same type of thing for AAE. The projection was (.360,.320), so the AAE was .015. Here is a list of other projections that would also have an AAE of .015:
(.350,.370)
(.351,.369)
(.352,.368)
...
(.379,.341)
(.380,.340)
(.379,.339)
...
(.351,.311)
(.350,.310)
(.349,.311)
...
(.321,.339)
(.320,.340)
(.321,.341)
...
(.349,.369)
(.350,.370)
If you plot these points on the graph, you get a diamond whose four corners are (.350,.370), (.380,.340), (.350,.310), and (.320,.340). Of course, (.350,.340) is at the center and (.360,.320) is on the edge.
The diamond represents all possible projections having an AAE of .015. Points inside the diamond have an AAE of less than .015, and points outside have an AAE of more than .015.
So, let’s say we have a point P representing the actual performance, and points A,B,C,D representing different projections. Ranking the projections by RMSE is like drawing circles centered at P and going through each of A,B,C,D. (That is, one circle goes through A, another through B, etc. and all of them have P as their center.) The smallest circle is the best projection by RMSE. Ranking the projections by AAE is like drawing diamonds instead of circles.
In three dimensions (that is, with three players) the circles become spheres, and the diamonds become octahedra.
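Here is a quick sketch of the circle-versus-diamond picture for the two-player example. It takes three projections that all sit on the AAE diamond (our projection, one corner, and the midpoint of one edge) and shows that RMSE ranks them differently. The labels in `candidates` are my own.

```python
import math

def aae(projected, actual):
    """Average absolute error."""
    return sum(abs(p - a) for p, a in zip(projected, actual)) / len(projected)

def rmse(projected, actual):
    """Root mean squared error."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(projected, actual)) / len(projected))

actual = (.350, .340)
candidates = {
    "ours": (.360, .320),           # the original projection
    "corner": (.380, .340),         # one corner of the diamond
    "edge midpoint": (.365, .325),  # halfway along one edge of the diamond
}

for name, proj in candidates.items():
    print(f"{name:14s} AAE {aae(proj, actual):.4f}  RMSE {rmse(proj, actual):.4f}")
```

All three tie on AAE, but the corner (one big miss, one exact hit) has the largest RMSE and the edge midpoint (two equal misses) has the smallest, which is the diamond poking outside and dipping inside the circle.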
Conclusion: there is no right answer between RMSE and AAE. RMSE is used because it’s easy to calculate (#34) and it has this beautiful geometric interpretation in terms of distance. Also, like Colin said, it seems reasonable to give an extra penalty to distant outliers.
dcj: That’s brilliant. I have a very limited statistical background yet your explanation made perfect sense to me. Thanks.
I’ll second that response, dcj’s post helped me (a non-mathematician) understand rmse.
I think dcj is right that there’s “no right answer between RMSE and AAE.” I think in this case, using AAE makes more sense. I could be convinced otherwise, but I don’t see why we necessarily want to penalize the system that is making a few large errors more than the system that is making lots of small errors if they have the same average error. I think each player’s projection can be looked at in isolation.
RMSE seems like something that would be useful if we were trying to determine which players were similar to one another. You’d probably consider two players with a 25 point difference in both AVG & ISO to be more similar to each other than two players with the same AVG, but a 50 point difference in ISO.
But, I don’t see why you’d care one way or the other about a projection system that was off by 25 points on each of 2 players versus a system that was off by 50 on one, but exactly right on the other.
But, I don’t see why you’d care one way or the other about a projection system that was off by 25 points on each of 2 players versus a system that was off by 50 on one, but exactly right on the other.
Agree, with the caveat that if we had a way of identifying players for whom the latter projection was more likely to be right, you might mix and match. For example, imagine System A is off by 25 points for every player, but System B can exactly nail half of the players but is off by 50 points for the others. If I can identify the half of the players that System B nails (as in, say, players over age 29 or whatever), I would use System B for those players, and System A for the others.
I believe that mean squared error is the correct metric to use precisely because of its connection to the normal distribution.
As everyone here probably knows, if we believe that each player has a true talent level in a given season, then the difference between our projection and the player’s actual performance will result partly from our failure in estimating the true talent level, and partly from sampling error. The sampling error has a binomial distribution, which approximates a normal distribution. And in that binomial distribution, two small errors are more likely than one big one. Bayesian inference then tells us that the hypothesized true talent levels that would result in two small errors in our observed results are more likely than the hypothesized true talent levels that would result in one larger error.
For example, say we have four players, and Analyst A projects their batting averages to be (.330, .310, .250, .250), Analyst B projects their batting averages to be (.355, .285, .275, .225), and that their actual performances over a full season are (.305, .285, .275, .275).
So both analysts got the overall average dead on (.285) and both have a mean absolute error of 25 points (.025) - in one case off by 25 points on all four players, and in the other case exactly right on two players, but off by 50 on each of the other two.
But the conditional probability of seeing the actual results under the condition that Analyst A’s projections of true talent level were correct is greater than the conditional probability of seeing those results under the condition that Analyst B’s projections of true talent level were correct. So, if we start with the a priori belief that Analyst A and Analyst B were equally good at estimating true talent level, then, as good Bayesians, our a posteriori belief should be that Analyst A has done a better job.
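A rough numerical check of that claim, under assumptions that are mine and not part of the argument above: every player gets 500 at-bats (a made-up figure), and the binomial sampling error is replaced by its normal approximation with variance p(1-p)/AB. The function name `log_likelihood` is also just for this sketch.

```python
import math

def log_likelihood(projected, actual, at_bats):
    """Log-probability density of the actual averages if the projections were
    the true talent levels, using a normal approximation to the binomial."""
    total = 0.0
    for p, x in zip(projected, actual):
        var = p * (1 - p) / at_bats  # sampling variance of a batting average
        total += -0.5 * math.log(2 * math.pi * var) - (x - p) ** 2 / (2 * var)
    return total

AT_BATS = 500  # hypothetical playing time, assumed equal for all four players

analyst_a = (.330, .310, .250, .250)
analyst_b = (.355, .285, .275, .225)
actual = (.305, .285, .275, .275)

ll_a = log_likelihood(analyst_a, actual, AT_BATS)
ll_b = log_likelihood(analyst_b, actual, AT_BATS)
print(f"log-likelihood under A: {ll_a:.2f}")
print(f"log-likelihood under B: {ll_b:.2f}")
print("A is more consistent with the results" if ll_a > ll_b
      else "B is more consistent with the results")
```

With equal prior belief in the two analysts, the one with the higher likelihood ends up with the higher posterior. The size of the gap depends on the assumed at-bats, so treat the numbers as illustrative only.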
I think I got that right. In practice, I doubt it matters very much for these purposes.
Andeux, nice explanation which I think is exactly correct.
I started as a math major in college, but I’ve forgotten a lot since then, so this won’t be a mathematical explanation.
Now I work in mapping. We use aerial photos to create stereo (3d) images, and then estimate the elevation of the ground at various xy locations. Then the software creates a surface which estimates the elevations between our measurements.
Our measurements are not “accurate”, they are an estimate, which can be affected by the height and visual quality of the photography, the arrangement and processing of the surveyed control points we use to orient the photos, and the eyes and skill of the compiler. For a given flying height of the photography, there is a tolerance within which we are expected to be able to measure. Typically, I can measure a clearly visible point to about 0.07 feet vertically, but the surface interpolated between points is allowed to be up to 0.17 in error, as measured with rms.
Our maps are used by civil engineers to build things like highways and airports. They can accept if I’m 0.1 or 0.2 feet off vertically, high or low, but I damn well better not be 0.5 of a foot, that will screw them royally. So rms is the method prescribed by our professional association, as it is sensitive to outliers.
In baseball, we have sample errors. Eric writes today at BP that Ethier’s BABIP was .355 one year, .304 the next. The year before, and the year after, were both about .330. It’s a sampling error; players can have random fluctuations even over 162 games. If we can get the vast majority of our projections inside the size of the sampling error, then I think we’ve done well. If we have too many outliers, rms will balloon up.
That might be an idea for a secondary measure - what percent of the projections erred by more than a standard amount.
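A minimal sketch of that secondary measure, reusing the four-player batting average example from earlier in the thread. The .030 threshold is an arbitrary stand-in for "a standard amount", and the function name is hypothetical.

```python
def share_of_big_misses(projected, actual, threshold):
    """Fraction of projections whose absolute error exceeds the threshold."""
    misses = [abs(p - a) > threshold for p, a in zip(projected, actual)]
    return sum(misses) / len(misses)

THRESHOLD = .030  # hypothetical cutoff; pick whatever "standard amount" suits the stat

analyst_a = (.330, .310, .250, .250)
analyst_b = (.355, .285, .275, .225)
actual = (.305, .285, .275, .275)

print(f"Analyst A big misses: {share_of_big_misses(analyst_a, actual, THRESHOLD):.0%}")
print(f"Analyst B big misses: {share_of_big_misses(analyst_b, actual, THRESHOLD):.0%}")
```

The two analysts have the same average absolute error, but this measure separates them: none of A's projections miss by more than the cutoff, while half of B's do.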
#37 and #38, thanks!
#40, that is a great point. I never thought about it that way.
This reminds me of the distinction between ability and value stats. So far, I was thinking about measures like RMSE and AAE as “value stats” for projection systems. Neither one is inherently superior to the other. Which one you prefer depends on what you are using the projections for. (Like in #42, RMSE is chosen because you want sensitivity to outliers.)
Now, andeux is considering RMSE and AAE as “ability stats.” The projection with the lower RMSE is more likely to be based on a better algorithm, so it is more reliable going forward.
Very interesting stuff, even if it doesn’t make much difference in the end.
--
To change the subject, I would really like to see a projection system with reliable error bars. I know that PECOTA has them, and ZiPS gives the 15% and 85% projections for selected players, but I don’t trust those numbers. I am remembering Tango’s excellent arguments against the PECOTA percentiles.
This is tied into the issue of playing time that MGL was talking about earlier in the thread, and the question of whether to project true talent or actual performance. Even if we are projecting true talent, we care about both the rate of production and the fraction of the season that the player is healthy. Those are not going to be independent. Then for the actual performance, there is selection bias to consider.
Maybe the reliability ratings in Marcel are a good place to start.
Of course, you also have to specify whether you want “error bars” around your true talent estimate, around the projection, taking into consideration the random fluctuation around that true talent estimate or both.
For example, I might project a veteran player to have a true talent OPS of .800. I might say that my estimate is plus or minus 50 points with 95% confidence or whatever. Those are the error bars around my estimate of his true talent. They will be larger for players who have less historical data and smaller for players with more historical data.
In fact, if I have an unknown player that I estimate to have a true talent of “league average” (of course), my variance (error bars) around that estimate will be exactly equal to the variance of talent in whatever population I include this guy in.
Now, if I also want to include the “error bars” of his actual performance, which is I think what Pecota does, then I have to add in the random variance around his playing time. If I project a player to have a little playing time, then that variance (error bars) will be higher (larger). More playing time, lesser variance.
So, for example, say I have an unknown player that I project at .740 OPS (league average). I have to make the variance (the error bars) on that projection equal to whatever the spread of OPS true talent is in the population. Say 1 SD is 75 points. So the variance is .0056 (.075 squared) or 5.6 points.
Now, say I project that player to have 300 PA. The random variance on that, assuming, for the sake of argument, that OPS is a binomial, is 29 points squared, or .0008 or .8 points.
So the total variance for my projection of 300 PA at league average OPS for this guy is the sum of the two variances, or .0065. The square root of that is 80 points in OPS! That is one SD around my unknown league average player for 300 PA.
.740 OPS plus or minus 160 points at the 95% (2 SD) confidence level. That includes the uncertainty of my estimate of his true talent PLUS the random fluctuation around his true OPS in only 300 PA.
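The arithmetic above can be reproduced in a few lines. This sketch takes the 75-point talent SD and the 29-point sampling SD for 300 PA as given from the text, and assumes the two sources of error are independent so that their variances simply add. The variable names are mine.

```python
import math

talent_sd = 0.075    # spread of true-talent OPS in the population (from the text)
sampling_sd = 0.029  # random fluctuation in OPS over 300 PA (from the text)

talent_var = talent_sd ** 2
sampling_var = sampling_sd ** 2

# Independent sources of error: variances add, standard deviations do not.
total_var = talent_var + sampling_var
total_sd = math.sqrt(total_var)

print(f"talent variance   {talent_var:.4f}")
print(f"sampling variance {sampling_var:.4f}")
print(f"total variance    {total_var:.4f}")
print(f"total SD          {total_sd:.3f}  (2 SD = {2 * total_sd:.3f})")
```

Note how little the sampling term adds for an unknown player: almost all of the 80-point SD comes from the uncertainty about his true talent.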
So I would have to specify the variance or error bars, given a certain number of PA, if I use that methodology.
Now, if I don’t include that methodology, then I have to add further uncertainty, which is the number of PA he is going to get.
I don’t know how to add THAT variance in to the equation. I’d have to think about it.
Then there is the issue of selective sampling, which would have to be handled, in that the more the PA the better the performance is likely to be, even given the same true talent. That is much more so for very old and young players.
And/40: it seems that what you are saying is that we can infer that the guy with the lower RMSE (even though they have the same absolute error) has done the better job, simply because the RMSE handles the sampling error, because the RMSE is tied to the normal distribution.
However, if we ignore sampling error, and simply go with “bets”, then it won’t matter who had the lower RMSE, since all the errors have a linear relationship, not quadratic.
***
Now, since I give a tie for anyone within 20 OPS, I in fact am arguing FOR the RMSE, since that would tie-in closer to RMSE than the absolute error.
The safe thing to do is to report all three: the absolute error, the RMSE, and the head-to-head win%. Personally, I find the last one the only relevant one, since that’s something “real”. I have no idea how much better absolute error of .062 compares to .065. Indeed, when I gave Rally all his low OPS forecasts a blanket .725 or so (which means making ridiculous forecasts for the really bad hitters), he ended up with a .065 absolute error, tying PECOTA. That would seem to mean that there’s a huge difference between .062 and .065. But, you can’t tell that at all.
The head-to-head win% is what it is. I highly prefer metrics on that 0 to 1 scale, with a mean of .500 (or 0 to 100, mean of 50).
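Here is one way a head-to-head win% with a tie band might be computed. This is a sketch under my own reading of the rule, not necessarily the exact method used above: for each player, compare the two systems' absolute misses in OPS points, call it a tie (half a win each) if the misses are within 20 points of each other, and otherwise give the win to the system that was closer. The error lists are invented toy numbers, not real forecasts.

```python
def head_to_head_win_pct(errors_x, errors_y, tie_band=20):
    """Win% of system X against system Y.

    errors_x, errors_y: absolute misses per player, in OPS points.
    tie_band: misses within this many points of each other count as a tie.
    """
    credit = 0.0
    for ex, ey in zip(errors_x, errors_y):
        if abs(ex - ey) <= tie_band:
            credit += 0.5  # tie: split the win
        elif ex < ey:
            credit += 1.0  # X was closer
        # otherwise Y was closer and X gets nothing
    return credit / len(errors_x)

# Toy misses in OPS points for five hypothetical players
errors_x = [10, 45, 30, 5, 15]
errors_y = [25, 20, 80, 5, 60]

print(f"X vs Y: {head_to_head_win_pct(errors_x, errors_y):.3f}")
print(f"Y vs X: {head_to_head_win_pct(errors_y, errors_x):.3f}")
```

The two win percentages always sum to 1, so the result sits on the 0 to 1 scale with .500 meaning the systems are even.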
I was glad to see David break out young players and older players in his comparisons. While the differences in these areas for a single year may be due to luck, if a particular metric consistently performs better in multiple years it may indicate that the metric has found a better method for predicting MLEs or aging. I would also like to see players who stay with the same team (from 2007 to 2008 and during 2008) compared to players who switch teams. That would give an indication of which metric might be better at incorporating park factors and/or league factors.
While OPS is an imperfect measure of overall batting ability, it is probably sufficiently accurate for comparing differences between prediction systems. The same cannot be said for ERA, however. Predicting future pitching ability is tough enough without having to adjust for defensive changes that might affect a pitcher’s future ERA. Some measure of FIP would be a better choice for comparison of the different predictors.