THE BOOK--Playing The Percentages In Baseball


Monday, July 14, 2008

Regression, schmegression

By Tangotiger, 11:31 AM

Regression analysis in sabermetrics is probably the worst thing that has happened to the discipline. Rather than being treated as a starting point, it’s treated as the target point. Even smart, learned men make this goof. Nowhere is it more evident than in the regression run value of a double, as Patriot shows us here:


There are a lot of numbers in there, but let me highlight the pertinent points. He shows us the best regression equation to estimate team runs scored over a fixed time period as:

Reg-4 = .552S + .645D + .993T + 1.458HR + .353W [plus other stuff]

But, if you used hits and total bases (as opposed to 1b,2b,3b,hr), the run value of the 2B would be .805.  See, total bases “cheats” by forcing the gap in run value between the 1b and 2b to exactly equal the run value between 2b and 3b.  There’s no reason that this must necessarily be the case.  It’s actually a very close approximation, but again, that’s cheating.
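To see why a hits-plus-total-bases model forces equal gaps, here is a minimal sketch. The two coefficients are hypothetical: they were picked only so that the double comes out at the .805 quoted above, and the split between them is invented for illustration.

```python
def event_values(per_hit, per_base):
    """Implied run value of each hit type when runs = per_hit*H + per_base*TB."""
    bases = {"1B": 1, "2B": 2, "3B": 3, "HR": 4}
    return {name: per_hit + per_base * n for name, n in bases.items()}

# Hypothetical coefficients (assumption): 0.245 per hit, 0.28 per total base.
values = event_values(per_hit=0.245, per_base=0.28)

# Every step up in hit type adds exactly one total base, so every gap
# is forced to equal the per-base coefficient.
gaps = [
    values["2B"] - values["1B"],
    values["3B"] - values["2B"],
    values["HR"] - values["3B"],
]
print(values)
print(gaps)
```

Whatever coefficients the regression picks, the single-to-double gap and the double-to-triple gap cannot differ under this specification, which is the "cheat" described above.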

Anyway, the best regression (in this dataset) says that the gap in run value between the single and double is a measly .093 runs, a far far far cry from the best estimate of .30 runs if you look at the actual millions of play-by-play records, and not the hundreds or thousands of aggregated (and biased) team data.

But, what the heck do I know… I’m not highly educated enough.  I’m just some schmoe.

And then Patriot really hits it when he says:

If you are concerned about the ecological fallacy, regressions are the methods that you should worry about. The best example is the sacrifice fly. From Ruane’s data, it is apparent that at the average scoring level of 1960-2004, the sacrifice fly is a neutral play from a run expectancy standpoint (-.01 runs). When that value is converted to absolute runs, it is worth about +.15 runs.

However, regression procedures know nothing about baseball reality. They only know about the combinations of numbers you give them, and the correlation between the variables. Sacrifice flies correlate decently with runs scored (better than triples, hit batters, or steals in this sample), and each sac fly is a guaranteed run for the team. You can see that the sac fly is evaluated as being worth more than a double, which is absurd on its face. The double is also only .1 runs more valuable than a single in the regression equation.

More important than regression equations (fraught with sampling issues, like selection and bias) is logic. We can create an excellent working model of a baseball game where we can show that the run value of a double should be worth roughly .30 runs more than a single. We can also use millions of play-by-play events, which avoids the horrible bias that aggregation gives us (aggregation also reduces our sample size), and again show that the gap should be around +.30 runs.

I even created a very simple one here, with the source code there for anyone to verify.  This is the starting point, not regression.
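As a rough illustration of what such a working model can look like, here is a minimal Markov-style sketch. It is not the model linked above. The event probabilities and the station-to-station baserunning rules (every runner advances exactly as many bases as the hit is worth) are assumptions made for this sketch, so the gap it prints depends entirely on those inputs.

```python
# Assumed per-plate-appearance event probabilities (hypothetical, sum to 1).
EVENT_PROBS = {"out": 0.675, "walk": 0.085, "single": 0.155,
               "double": 0.045, "triple": 0.005, "hr": 0.035}
HIT_BASES = {"single": 1, "double": 2, "triple": 3, "hr": 4}
STATES = [(outs, bases) for outs in range(3) for bases in range(8)]

def advance(event, outs, bases):
    """Return (outs, bases, runs); bases is a 3-bit mask (bit 0 = 1st base)."""
    if event == "out":
        return outs + 1, bases, 0
    if event == "walk":
        # Only forced runners move: fill the lowest empty base.
        for bit in (0b001, 0b010, 0b100):
            if not bases & bit:
                return outs, bases | bit, 0
        return outs, 0b111, 1
    n = HIT_BASES[event]
    moved = (bases << n) | (1 << (n - 1))  # runners and batter advance n bases
    return outs, moved & 0b111, bin(moved >> 3).count("1")

def run_expectancy(iterations=500):
    """Expected runs to the end of the inning from each (outs, bases) state."""
    re = {s: 0.0 for s in STATES}
    for _ in range(iterations):
        new = {}
        for outs, bases in STATES:
            total = 0.0
            for event, p in EVENT_PROBS.items():
                o2, b2, runs = advance(event, outs, bases)
                total += p * (runs + (re[(o2, b2)] if o2 < 3 else 0.0))
            new[(outs, bases)] = total
        re = new
    return re

def state_weights(steps=500):
    """Share of plate appearances that occur in each state."""
    dist, visits = {(0, 0): 1.0}, {s: 0.0 for s in STATES}
    for _ in range(steps):
        nxt = {}
        for (outs, bases), mass in dist.items():
            visits[(outs, bases)] += mass
            for event, p in EVENT_PROBS.items():
                o2, b2, _runs = advance(event, outs, bases)
                if o2 < 3:
                    nxt[(o2, b2)] = nxt.get((o2, b2), 0.0) + mass * p
        dist = nxt
    total = sum(visits.values())
    return {s: v / total for s, v in visits.items()}

def run_values():
    """Average change in run expectancy (plus runs scored) for each event."""
    re, weights = run_expectancy(), state_weights()
    values = {}
    for event in EVENT_PROBS:
        value = 0.0
        for (outs, bases), w in weights.items():
            o2, b2, runs = advance(event, outs, bases)
            future = re[(o2, b2)] if o2 < 3 else 0.0
            value += w * (runs + future - re[(outs, bases)])
        values[event] = value
    return values

values = run_values()
gap = values["double"] - values["single"]
print(values)
print("double minus single:", gap)
```

The point of the design is that each run value comes from the structure of the game (states, transitions, runs scored), not from fitting coefficients to aggregated team totals.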

Furthermore, the idea that you can create an equation based on sample data, and then “test” it against that same data is ridiculous.  What kind of a test is that?  All the test does is establish the best-fit.  You need to test it against out-of-sample data.
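A minimal sketch of the difference, using synthetic data and a one-variable least-squares fit. All numbers here are invented for illustration. The line is fitted on one half of the sample and scored on both halves. On the fitting half it also beats the line that actually generated the data, which is why an in-sample test proves nothing.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def rmse(xs, ys, slope, intercept):
    """Root mean squared error of the line on the given points."""
    errors = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(errors) / len(errors))

# Synthetic "team seasons" (assumption): runs depend linearly on hits plus noise.
rng = random.Random(0)
TRUE_SLOPE, TRUE_INTERCEPT = 0.5, 50.0
xs = [rng.uniform(1200, 1600) for _ in range(60)]
ys = [TRUE_SLOPE * x + TRUE_INTERCEPT + rng.gauss(0, 25) for x in xs]

train_x, train_y = xs[:30], ys[:30]
test_x, test_y = xs[30:], ys[30:]

slope, intercept = fit_line(train_x, train_y)
in_sample = rmse(train_x, train_y, slope, intercept)
out_sample = rmse(test_x, test_y, slope, intercept)
truth_on_train = rmse(train_x, train_y, TRUE_SLOPE, TRUE_INTERCEPT)

print("fitted line, in-sample RMSE:     ", in_sample)
print("true line, same training data:   ", truth_on_train)
print("fitted line, out-of-sample RMSE: ", out_sample)
```

Only the last number says anything about how the equation holds up on data it has not already seen.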

If you see an analyst or academic shun logic in favor of regression, you can tell him that he’s dead wrong.  And tell him to watch a game, and then he’ll understand why he’s wrong.

#1 2008/07/14 (Mon) @ 13:35

Hmmm ... you’d expect the double and SF to come out right in the regression.  One possible explanation is small sample size ... do the confidence intervals for those values include the known, actual value?

It could also be the “other stuff”.  If outs is one of the “other stuff"s, and SF is an out, then the regression value of the SF is actually the sum of the two values. 

Also, if you use team seasons as your regression data, they all implicitly contain the same number of outs (pretty much).  That means that ... more to come, I just got called away.


#2 2008/07/14 (Mon) @ 13:39

You should check out the apbrmetrics (NBA) site.  It’s pretty much all regression all the time.


#3 Patriot 2008/07/14 (Mon) @ 14:05

Phil, I’m not a statistician, so I’m not sure how to find the confidence intervals.  Minitab didn’t spit out anything recognizable to me as a confidence interval, but I don’t really know what I’m looking for.

However, the results are similar to every long term regression on team seasonal stats I’ve ever seen.  In Curve Ball, Albert and Bennett found a very high value for the SF (I don’t have the book in front of me, but IIRC it’s around .6).


#4 dave smyth 2008/07/14 (Mon) @ 14:06

Here is the regression equation from M Schell, for all teams 1947-2003.

.52*1b + .80*2b + 1.11*3b + 1.52*HR + .32*(BB+HBP) - .111*(AB-H)

To me, this looks much better for the 2b. I’m not sure what is the difference of Schell’s technique, but he does state that “each component has been adjusted for its own ballpark effect, rather than using a single ballpark effect for runs.”

So, could the 2b problem have to do with the idea that many good HR parks are not good for 2b?


#5 Tangotiger 2008/07/14 (Mon) @ 14:15

Excel gives you the significance of each value.  Do Tools/DataAnalysis/Regression

I remember I looked at this a few years ago, and posted the results here.  It was some very wide range, like 0.66 runs +/- .20 runs 95% of the time.

You can also select your dataset carefully enough (with hundreds of lines of data) to get a run value of the triple to exceed that of the HR.  Which is more than ridiculous.

The point is that the Run Expectancy and Markov model work far far better than any regression analysis could hope for.  Aggregating data at the team name level to reduce sample size and introduce bias is not something that an academic would choose to do.. and yet, they are doing it.

The run value of a double is around .30 from a single, and any model that shows otherwise has decided to become a slave to regression, and is worthy of whatever scorn you wish to heap upon it.


#6 Patriot 2008/07/14 (Mon) @ 16:22

The standard deviation of the SF coefficient for my data is .145, and the coefficient is .715, so 2 standard deviations in either direction would be (.425, 1.005) (assuming that this is valid math).  So that would not include the empirical value in the range.

The double has a coeff of .642 and a stdev of .043, so that would be (.555, .729), with the empirical value outside the interval.
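The interval arithmetic above can be checked with a short sketch. The coefficients and standard deviations are the ones quoted in this comment. The comparison values are assumptions: +.15 for the sacrifice fly comes from the passage quoted in the post, and .80 for the double is the rough figure used elsewhere in this thread.

```python
def two_sd_interval(coef, sd):
    """Coefficient plus or minus two standard deviations."""
    return (coef - 2 * sd, coef + 2 * sd)

def contains(interval, value):
    low, high = interval
    return low <= value <= high

sf_interval = two_sd_interval(0.715, 0.145)
double_interval = two_sd_interval(0.642, 0.043)

# Comparison values are assumptions taken from elsewhere in the post and thread.
print("SF interval:", sf_interval, "contains .15?", contains(sf_interval, 0.15))
print("2B interval:", double_interval, "contains .80?", contains(double_interval, 0.80))
```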

To kind of echo what Tango is saying, regression is such a lazy way to go about looking at a problem like this, when it is relatively easy to make a good model (and even if you want to disagree with the characterization, Lindsey, Palmer, Smyth, et al have already done it for you). 

The reason I focused on regression in my post was not so much to rip it, but as to illustrate the problems with RMSE, Average error, etc. as the sole deciders of the value of a run estimator.  Since the regression by definition will be the most accurate model, the logical conclusion of the “lowest RMSE always wins” line of thinking is to become a slave to the regression.

My point is that there are a myriad of factors that are important; RMSE is just one of those (as are the +1 values, the logic of the model, the ease with which it can be applied, etc.)


#7 Dackle 2008/07/15 (Tue) @ 02:35

Tango, just to be devil’s advocate here: if the regression results in a low run value for doubles because it is “standing in” for something unknown ("factor X"), then don’t you have to stand by the regression until you know what “factor X” is? On another thread when we were talking about batter/pitcher matchups, one poster seemed to be arguing that because he “knows” that some pitchers own hitters, that a 7-for-12 lifetime batter/pitcher matchup must have meaning. But regardless of whether pitchers and hitters really do “own” each other, the point is that the only data you have to go on is the 7-for-12, and the result of looking at all prior 7-for-12 matchups indicates that there is little predictive value. Likewise, even if you “know” that the true value of a double is .805, the data you are using does not account for “factor X”, and so you have to take the available data at face value.

To put it another way, if MLB only tracked singles, doubles, triples, HRs and runs scored, then your regression values for the hits would be higher than standard linear weights (because you wouldn’t be accounting for walks, steals etc). In that situation, you would be better to use the results of the regression rather than the theoretically correct values.
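The effect described here is what statisticians call omitted-variable bias, and it can be sketched with synthetic data. Everything in this example is invented: the "true" values of .50 per hit and .30 per walk, and the assumption that teams with more hits also draw more walks. When walks are left out, the hit coefficient absorbs part of their value.

```python
import random

def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(1)
HIT_VALUE, WALK_VALUE = 0.50, 0.30  # assumed "true" run values

hits, runs = [], []
for _ in range(2000):
    h = rng.gauss(1400, 80)
    # Assumption: each extra hit comes with half an extra walk on average.
    w = 500 + 0.5 * (h - 1400) + rng.gauss(0, 40)
    hits.append(h)
    runs.append(HIT_VALUE * h + WALK_VALUE * w + rng.gauss(0, 20))

# Regress runs on hits alone, as if walks were never recorded.
hits_only_slope = ols_slope(hits, runs)
print("true hit value:", HIT_VALUE)
print("hits-only regression slope:", hits_only_slope)
```

Under these assumptions the hits-only slope should land near 0.50 + 0.30 * 0.5 = 0.65 rather than the true 0.50, because the hit is standing in for the unrecorded walk.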


#8 tangotiger 2008/07/15 (Tue) @ 07:09

Dackle (or advocate-Dackle), I would argue first that we should present two pieces of data, one as +.80 for the double, and another for -.15 for the proxy-double.

However, when we process the data at the pbp level, we know exactly the value of the double, and we get the same value when you create a full Markov chain.

The -.15 value for the proxy-double occurs ONLY after aggregating the 6000 PBP records into a single seasonal line for the team.  And only when restricting the dataset to a certain period of time.  So, the .15 represents some sort of sampling bias.  If you want to argue that you need it at the team level, to counteract the bias, ok. 

But, for anyone to argue that we need to introduce the team sampling-bias proxy at the player level?  That is to be a slave to the regression and makes no sense.  Above all else, we need to be rooted in logic.

Regression users deserve the full scorn from those who wish to say they should get their noses out of their spreadsheets.


#9 Dackle 2008/07/15 (Tue) @ 12:32

I guess I just have a feeling that if one or two additional stats were introduced into the regression, such as extra bases taken on hits by baserunners, then the double would be worth .80 (ie the theoretical value).

If that’s true, then wouldn’t it make sense to use .80 if you have the “extra bases taken on hits” stat available for players, and .65 (or whatever the regression value is) if the “extra bases” stat is not available (ie you’re just using generally available stats)?


#10 Tangotiger 2008/07/15 (Tue) @ 12:52

Running a regression on the 1878 team-seasons from 1919 to 2007, I get an r=.97, with the following coefficients:
0.56: single
0.80: double
1.50: triple (standard error = .05)
1.40: homerun
0.37: walk

The triple can be standing in for the stolen base.  The single could be standing in for reaching on error.  The walk can be standing in for the hit batter.

Who knows?  Who cares?

I’m not going to be a slave to these numbers, even if it best-fits me. 

And what if I take some other time period, say, 1969-1992 (608 team lines)?

r=.97
0.50: single
0.76: double
1.24: triple (standard error = .10)
1.43: homerun
0.36: walk

Well, which value of the single do you use?  What about the triple?

See where I’m going here?  What regression coefficients do I apply to the 1992 Expos?  Someone is going to suggest using seasonal lines within 10 years, meaning 1982 to 2002.  How does that make it any better?  Especially since baseball changed in 1993.
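To make the "which value do you use?" problem concrete, here is a small sketch that applies both coefficient sets above to one team line. The team line is hypothetical, and the comparison ignores the intercept and any other terms, which this comment does not list, so it shows only how far apart the two sets are on these five events.

```python
# Coefficients as listed above for the two samples.
COEF_1919_2007 = {"1B": 0.56, "2B": 0.80, "3B": 1.50, "HR": 1.40, "BB": 0.37}
COEF_1969_1992 = {"1B": 0.50, "2B": 0.76, "3B": 1.24, "HR": 1.43, "BB": 0.36}

# Hypothetical team season (assumption, not a real team).
team = {"1B": 1000, "2B": 250, "3B": 30, "HR": 150, "BB": 500}

def event_runs(coefficients, line):
    """Runs credited to these five events only; no intercept or out term."""
    return sum(coefficients[event] * count for event, count in line.items())

long_sample = event_runs(COEF_1919_2007, team)
short_sample = event_runs(COEF_1969_1992, team)
difference = long_sample - short_sample

print("1919-2007 coefficients:", long_sample)
print("1969-1992 coefficients:", short_sample)
print("difference:", difference)
```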

The whole thing is just so much b.s.

Regression as the final step = lazy or clueless analyst.

Someone feel free to find me any sabermetric research that proves me wrong.


#11 Rally 2008/07/15 (Tue) @ 13:31

The reason triples have such a high value going back to 1878 is because they correlate highly to reached on errors. 

I don’t know if that’s true for modern teams, but teams in the deadball era hit a lot more triples than they do today, and scored many more runs given their h, w, hr, etc. inputs because errors were so frequent.  That’s what the regression is picking up.


#12 Dave Smyth 2008/07/15 (Tue) @ 17:39

I am no statistician at all, but I don’t see that Tango’s valid points mean that regression should not be used. It is a powerful tool. It seems to be the case that most of the problem is not with regression per se, but with how it is applied. So, apply it not to the team-season data, but instead to the individual inning level. IOW, try to get rid of the inappropriate stuff that causes problems, and to leave behind the best unbiased high sample size data, and THEN apply the regression.

Done well enough, that MIGHT be the best way to find the best run values for, say, a basic estimator. To separate out the TRULY hidden information from the other stuff.

As I mentioned a few posts ago, M Schell did some sort of park adjustment which seemed to pretty much get rid of the 2b problem, even though he is still using the team-season data.


#13 tangotiger 2008/07/15 (Tue) @ 18:26

I said that regression can be a starting point, not the end point.  Too often, the analyst/academic is a slave to the regression, not even acknowledging the bias in the data, only concerned about minimizing RMSE.

As Patriot so well showed, if you want to minimize RMSE, you will end up with a ridiculous value for the double.  Or in my case, the triple.

Regression is a tool, and that’s as far as it should go.


#14 David Smyth 2008/07/15 (Tue) @ 18:59

“I said that regression can be a starting point, not the end point.”

And I was saying that, if all of the prior work, selection, and adjustments are done properly, maybe regression can be the endpoint.

You have often pointed out the difference between a stat/method, and its implementation. The regression thing seems to be mostly an implementation problem.

No?


#15 Patriot 2008/07/15 (Tue) @ 19:47

I believe that it is much more likely that, rather than finding hidden information when you do an inning-level regression, you will find that the results are similar to what we know to be true about the values of the events.  That is not to say that it is not worth taking a look at; I’m just predicting that you will find little of value.

And the “ecological fallacy” may rear its head at some point.  Even if team data (even on the inning level) implies the existence of hidden information, it is still a leap of faith to apply it to the individual.


#16 Brian 2008/07/15 (Tue) @ 20:36

I don’t think that using regression in general is the problem. It’s the modeling and interpretation that causes misleading results. Part of the problem is that there is a proliferation of software that can allow any fool (like myself) to dump a bunch of variables into a black box and get a result.

Regression is inappropriate for a deconstruction of runs in baseball. The reason doubles appear only slightly more important than singles in scoring is that teams that hit lots of doubles also hit lots of singles. Many of the same skills that lead to doubles also lead to singles. The crux of the issue is that singles, doubles, etc. are really just intermediate outcomes toward scoring. The true underlying skills are things like power, pitch selection, bat speed, coordination, and running speed. Both singles and doubles are proxies for those underlying skills.

There is a similar issue in football research, and regression is used frequently. Most football scoring models will include passing stats, running stats, *and* 1st-down-rates (or 3rd-down efficiency). First downs are intermediate results between running and passing skills and scoring and do not belong in the model.

100% of sac flies turn into runs, so a regression would be bound to provide a biased coefficient. Sac flies are really just a result, not an input. They’re a subset of the dependent variable, which makes them invalid as a regression estimator. It’s the model, not regression in general.

For a valid regression model, you’d want to start with something like hits and walks, factor in variables that advance runners to 3rd base, then add pop-fly outs. Of course, pop-fly outs rarely lead to runs so its coefficient would be near zero. But if you add an interaction variable of pop-flies*runner-advancement, you’re on your way to a logical model.

I think David Smyth makes the point more elegantly above.

Regression can be used as a ‘final step’ as long as a researcher does his homework in the preliminary steps. Say we’re trying to estimate a pitching prospect’s probability of success in the majors. We can use variables of his tools (velocity, movement...), height, health, handedness, etc, and estimate the weight of each factor based on a historical sample. Sure, more research could be done on pitching prospects using other methods, but that doesn’t make the regression result any less of a valid endpoint than any other method. Indeed, in this case it may be superior.

Pointing to the fact that regression is problematic as a run estimation method as evidence for its general inappropriateness for sabermetrics is itself a fallacy. Run estimation does not equal sabermetrics.


#17 tangotiger 2008/07/15 (Tue) @ 21:54

I think we’re on the same page here.  My basic concern is that the regression is used as a crutch, with its limitations completely ignored, allowing the analyst to become its slave.  It could be used as a great tool… but I haven’t seen much if any.

