Monday, July 14, 2008
Regression, schmegression
Regression analysis in sabermetrics is probably the worst thing that has happened to the discipline. Rather than being treated as a starting point, it’s treated as the target point. Even smart, learned men make this goof. Nowhere is it more evident than in the regression run value of a double, as Patriot shows us here:
There are a lot of numbers in there, but let me highlight the pertinent points. He shows us the best regression equation to estimate team runs scored over a fixed time period as:
Reg-4 = .552S + .645D + .993T + 1.458HR + .353W [plus other stuff]
But, if you used hits and total bases (as opposed to 1b,2b,3b,hr), the run value of the 2B would be .805. See, total bases “cheats” by forcing the gap in run value between the 1b and 2b to exactly equal the gap between the 2b and 3b. There’s no reason that this must necessarily be the case. It’s actually a very close approximation, but again, that’s cheating.
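To see why total bases forces that equal spacing, here is a minimal sketch. It assumes a regression of the form runs = hit_coef * H + tb_coef * TB; the two coefficients passed in are made-up numbers for illustration, not the values from Patriot's regression.

```python
# Sketch: in a hits + total bases model, each hit type is worth
# hit_coef + (bases * tb_coef), so every extra base adds exactly tb_coef.
# The coefficients used below are hypothetical, chosen only for illustration.

BASES = {"1B": 1, "2B": 2, "3B": 3, "HR": 4}

def implied_hit_values(hit_coef, tb_coef):
    """Run value per event implied by runs = hit_coef*H + tb_coef*TB."""
    return {event: hit_coef + bases * tb_coef for event, bases in BASES.items()}

hit_coef, tb_coef = 0.20, 0.30  # hypothetical coefficients
values = implied_hit_values(hit_coef, tb_coef)

events = ["1B", "2B", "3B", "HR"]
gaps = [values[b] - values[a] for a, b in zip(events, events[1:])]

for event in events:
    print(event, round(values[event], 3))
print("gaps between consecutive hit types:", [round(g, 3) for g in gaps])
```

Whatever two coefficients the regression lands on, the three gaps come out identical. The model has no way to say that the step from single to double is worth something different from the step from double to triple.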
Anyway, the best regression (in this dataset) says that the gap in run value between the single and double is a measly .093 runs, a far far far cry from the best estimate of .30 runs if you look at the actual millions of play-by-play records, and not the hundreds or thousands of aggregated (and biased) team data.
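The arithmetic is easy to check. This short sketch takes only the Reg-4 coefficients quoted above and prints the gap between each consecutive hit type; the variable names are mine.

```python
# Gaps between consecutive hit types, using the Reg-4 coefficients
# quoted in the post (S, D, T, HR).
reg4 = {"S": 0.552, "D": 0.645, "T": 0.993, "HR": 1.458}

order = ["S", "D", "T", "HR"]
reg4_gaps = {
    f"{b}-{a}": round(reg4[b] - reg4[a], 3)
    for a, b in zip(order, order[1:])
}

for name, gap in reg4_gaps.items():
    print(name, gap)
```

The single-to-double step comes out at .093 runs, while the next two steps are .348 and .465. The regression is claiming that the second base of a double is worth roughly a quarter of what the third base of a triple is worth.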
But, what the heck do I know… I’m not highly educated enough. I’m just some schmoe.
And then Patriot really hits it when he says:
If you are concerned about the ecological fallacy, regressions are the methods that you should worry about. The best example is the sacrifice fly. From Ruane’s data, it is apparent that at the average scoring level of 1960-2004, the sacrifice fly is a neutral play from a run expectancy standpoint (-.01 runs). When that value is converted to absolute runs, it is worth about +.15 runs.
However, regression procedures know nothing about baseball reality. They only know about the combinations of numbers you give them, and the correlation between the variables. Sacrifice flies correlate decently with runs scored (better than triples, hit batters, or steals in this sample), and each sac fly is a guaranteed run for the team. You can see that the sac fly is evaluated as being worth more than a double, which is absurd on its face. The double is also only .1 runs more valuable than a single in the regression equation.
More important than regression equations (fraught with sampling issues, like selection and bias) is logic. We can create an excellent working model of a baseball game where we can show that the run value of a double should be worth roughly .30 runs more than a single. We can use millions of play-by-play events, which avoids the horrible bias that aggregation creates (and the reduced sample size that comes with it), and again show that the gap should be around +.30 runs.
I even created a very simple one here, with the source code there for anyone to verify. This is the starting point, not regression.
Furthermore, the idea that you can create an equation based on sample data, and then “test” it against that same data is ridiculous. What kind of a test is that? All the test does is establish the best fit. You need to test it against out-of-sample data.
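Here is a rough sketch of what an out-of-sample test looks like. Everything in it is an assumption for illustration: the team seasons are randomly generated, not real, and the model is a one-variable line (runs against total bases) to keep the code short. The point is only the procedure: fit on one set of seasons, then score the equation on seasons it has never seen.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(xs, ys, intercept, slope):
    """Root mean squared error of the line on the given data."""
    errors = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(errors) / len(errors))

# Hypothetical "true" relationship used only to generate fake seasons.
TRUE_INTERCEPT, TRUE_SLOPE, NOISE_SD = -300.0, 0.47, 25.0

def make_fake_seasons(n, rng):
    """Made-up team seasons: (total bases, runs scored). Not real data."""
    xs = [rng.uniform(1900, 2500) for _ in range(n)]
    ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + rng.gauss(0, NOISE_SD) for x in xs]
    return xs, ys

rng = random.Random(2008)
train_x, train_y = make_fake_seasons(200, rng)  # seasons used to fit
test_x, test_y = make_fake_seasons(200, rng)    # seasons held out

intercept, slope = fit_line(train_x, train_y)
in_sample = rmse(train_x, train_y, intercept, slope)
out_of_sample = rmse(test_x, test_y, intercept, slope)

print("in-sample RMSE:    ", round(in_sample, 2))
print("out-of-sample RMSE:", round(out_of_sample, 2))
```

By construction, no other line can beat the fitted line on the seasons it was fitted to, so the in-sample number tells you nothing you didn't already know. The out-of-sample number is the one that can embarrass the equation.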
If you see an analyst or academic shun logic in favor of regression, you can tell him that he’s dead wrong. And tell him to watch a game, and then he’ll understand why he’s wrong.