THE BOOK--Playing The Percentages In Baseball

Thursday, October 22, 2009

Why I hate regression “techniques”

By Tangotiger, 12:04 PM

A reader made reference to a popular Moneyball paper by Sauer and Hakes (pdf). Boy, that paper makes my blood boil. Let’s look at Table 2, column 2001. This is the equation:
ln(Salary) = 3.1*SLG - 0.13*OBP + 0.003*PA + 1.1*ArbEligible + 1.68*FreeAgent + 0.07*IsCatcherOrInfielder + 10.3

This is based on 357 players (min 130 PA in year 2000).

And here is their money quote (no pun intended):

The relative valuation of on base and slugging percentage is abruptly reversed for the year 2004, despite the inertia produced by long-term contracts. The returns to slugging are similar in 2004 to prior years, but this is the first year for which the ability to reach base is statistically significant. The labor market in 2004 appears to have substantially corrected the apparent inefficiency in prior years, as the coefficient of on base percentage jumps to 3.68, and the ratio of the monetary returns to reaching base and slugging is very close to the ratio of the statistics’ contributions to team win percentage.

Now, just because a regression equation (likely poorly constructed) shows that the OBP has virtually zero impact on salary in 2001 doesn’t mean that this equation is valid. It’s mathematical gymnastics that makes no sense.

Let’s start with something basic: .400 SLG, .330 OBP, 400 PA, outfielder, free agent.  Now, how much do you think this guy should be getting (in 2001)?  According to their model, $1.75MM.  Sounds about right for that time, I guess.  What if this guy had a .280 OBP?  $1.76MM.  So, OBP is basically completely irrelevant according to this model of 2001.  Maybe that’s true, so let’s keep going.

Now, let’s give this guy 600 PA.  He’s at $3.2MM.  Sounds ok, too.  Give him a .250 SLG and .250 OBP and 600 PA: $2MM !!  Does that make any sense whatsoever?  No! This is the kind of b.s. mumbo-jumbo mathematical gymnastical regression “techniques” that drive me nuts.  Nuts! Indeed, a free agent outfielder with 600 PA in 2000, who hit like a pitcher (say .180 SLG, .160 OBP) would make $1.65MM according to this model.
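The dollar figures above can be checked directly against the quoted equation. Here is a minimal sketch in Python, assuming the 2001 coefficients exactly as quoted in this post. The function and argument names are made up for illustration and do not come from the paper.

```python
import math

def predicted_salary_2001(slg, obp, pa, arb_eligible=0, free_agent=0,
                          catcher_or_infielder=0):
    """Salary implied by the 2001 equation as quoted in the post.

    The coefficients are copied from the post; the function itself is
    only an illustrative helper, not code from the paper.
    """
    log_salary = (3.1 * slg
                  - 0.13 * obp
                  + 0.003 * pa
                  + 1.1 * arb_eligible
                  + 1.68 * free_agent
                  + 0.07 * catcher_or_infielder
                  + 10.3)
    return math.exp(log_salary)

# Free-agent outfielders from the post: (SLG, OBP, PA)
examples = [
    (0.400, 0.330, 400),
    (0.400, 0.280, 400),
    (0.400, 0.330, 600),
    (0.250, 0.250, 600),
    (0.180, 0.160, 600),
]

for slg, obp, pa in examples:
    salary = predicted_salary_2001(slg, obp, pa, free_agent=1)
    print(f"SLG {slg:.3f}  OBP {obp:.3f}  PA {pa}: ${salary / 1e6:.2f}MM")
```

Run as written, this should land on the same rounded figures quoted above ($1.75MM, $1.76MM, $3.2MM, $2MM, $1.65MM), which shows how little the hitting terms move the result compared with PA.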

Just because you can throw everything into a regression model doesn’t mean that you should.  You “hope” that it all works out, that all the relevant parameters rise to the surface, and that they all work nicely independently of each other. But they don’t.  And this is a prime example of what NOT to do.  I have no doubt that if I go through each year’s equation I will find similarly absurd results.

As I’ve said (and shown) in the past, it is extremely easy to value players in terms of runs, wins, and dollars. 


#1    J. Cross      (see all posts) 2009/10/22 (Thu) @ 12:31

Just to be clear (as someone who has only done a little regression analysis but is nonetheless teaching a couple of students how to do it) the problem is that this study is pretending things like OBP and PA can be combined linearly while that doesn’t make any sense, right?

If their equation had things that might be combined linearly to determine value (1b, 2b, 3b, BB...) they’d likely come up with an equation that made more sense (although there would still be problems like 3b acting as a stand in for “speed” more than representing the value of an actual triple).

We’ve basically been doing things like looking at how K% from years Y-1, Y-2 and Y-3 determine K% in year Y.

We’d also be interested in seeing how things like 1b, 2b, 3b etc predict R and RBI and (as I understand it) that might be okay but trying to combine things like lineup position and team offense levels in a linear equation with 1b, 2b and 3b would be problematic.

Is that basically right?


#2    Tangotiger      (see all posts) 2009/10/22 (Thu) @ 13:20

Right, treating PA and rate stats in the same equation the way they are doing is terrible.  Not to mention the arb and FA parameters as well.

The reason you create a model is to… model reality.  And this is how it works:

salary = playing time * (wins minus baseline) * FA_Arb_multiplier

That is it.  That’s all it is.  That’s the model.  Now, you just have to figure out how to best represent each of those parameters.
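To make the shape of that model concrete, here is a rough Python sketch. Everything in it beyond the three-factor product is an assumption made for illustration: the function name, the units (playing time as a share of a full season, wins per full season), and the multiplier value are hypothetical and are not taken from the WAR model itself.

```python
def model_salary(playing_time, wins_rate, baseline_rate, fa_arb_multiplier):
    """salary = playing time * (wins minus baseline) * FA_Arb_multiplier

    Hypothetical units, chosen only for illustration:
      playing_time       share of a full season (0.0 to 1.0)
      wins_rate          wins per full season at the player's level
      baseline_rate      wins per full season at the baseline level
      fa_arb_multiplier  dollars per win above baseline for the
                         player's service class (FA vs. arbitration)
    """
    return playing_time * (wins_rate - baseline_rate) * fa_arb_multiplier

# Made-up multiplier, purely to show how the product behaves.
HYPOTHETICAL_MULTIPLIER = 4_000_000

full_time = model_salary(1.0, 4.0, 2.0, HYPOTHETICAL_MULTIPLIER)
half_time = model_salary(0.5, 4.0, 2.0, HYPOTHETICAL_MULTIPLIER)
at_baseline = model_salary(1.0, 2.0, 2.0, HYPOTHETICAL_MULTIPLIER)

print(full_time, half_time, at_baseline)
```

Because the terms multiply, playing time scales whatever the player is worth above baseline. A player sitting at the baseline comes out at zero no matter how much he plays, which is the behavior the additive PA term in the paper's equation cannot reproduce.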

I’ve done it already in my WAR model, and it works pretty darn well.  It matches reality.

When I see a model like the one proposed in the paper, I see how it deviates from my model.  Mine works.  If it deviates from my model, I need to see how it does it better, either by simplifying it (which is practically impossible), or being more intelligent (which is definitely possible).

I’ve yet to see a model that does either.  Indeed, the model in the paper is so outrageous that I’m disappointed by how much traction the paper’s conclusions are given.


#3          (see all posts) 2009/10/22 (Thu) @ 13:57

Kent: Mr. Simpson, how do you respond to the charges that petty vandalism such as graffiti is down eighty percent, while heavy sack-beatings are up a shocking nine hundred percent?

Homer: Aw, people can come up with [a regression] to prove anything, Kent.  Forty percent of all people know that.


#4    Guy      (see all posts) 2009/10/22 (Thu) @ 14:24

FYI, Sauer and Hakes did a followup paper using some post-2004 data:  http://pirate.shu.edu/~rotthoku/papers/anomaly.pdf.  They go beyond SLG/OBP to look at BA, BB%, and Power separately, which is an important improvement I think.  But the salary model remains basically the same, so it will still make your head hurt.


#5          (see all posts) 2009/10/22 (Thu) @ 20:08

I’m enjoying this blog while I patiently await the start of the Yankees game tonight.

I wholeheartedly agree that practitioners of regression tend to make horrible mistakes in application and graver errors in interpretation.

One must be especially careful not to extrapolate a regression model to vectors in high dimensions that are “far” (in some mathematical sense) from the fitted data vectors. Nonsense usually emerges even for good models appropriately derived.

In this case, the salary of $1.65MM attributed by the model to the “free agent outfielder with 600 PA in 2000, who hit like a pitcher (say .180 SLG, .160 OBP)” is weird not because the model is wrong (which it very well might be) but because such outfielders don’t exist in practice. For who would let such a player bat 600 times? These players only exist in theory and as such they are only wild extrapolations from which conclusions should never be drawn.


#6          (see all posts) 2009/10/22 (Thu) @ 20:23

One thing that is so easy to forget, even for those well trained in statistics, is that you should not apply a regression model to data beyond the scope of the data used to create the model. You can never treat a regression model as something that is valid for all time and space unless it was modeled over a sample representative of all time and space. Of course, no model could ever do that.


#7    Mitch      (see all posts) 2009/10/22 (Thu) @ 20:28

While these criticisms regarding the bastardization of plate appearances are certainly valid, I concur with the point from #5 and #6 above: you can’t extend the regression beyond observable data points.  Suggesting an outfielder who hits like a pitcher would get 600 plate appearances simply isn’t plausible.  Well, unless you employ Jeff Francoeur.


#8    Brian Cartwright      (see all posts) 2009/10/22 (Thu) @ 20:35

Any player performing at or below the baseline will contribute zero, so regardless of playing time, the product is still zero, and he shouldn’t be projected to make more than minimum salary, unless he is named Francoeur.

So, it is real life that there are a few very unproductive players who are overvalued, at least for a few seasons.


#9          (see all posts) 2009/10/22 (Thu) @ 21:18

Q:  What’s the most dangerous thing a math teacher ever said?

A:  A function can be written to describe every data set in which for every X value there is one and only one Y value.


#10    Guy      (see all posts) 2009/10/22 (Thu) @ 22:36

I agree that the model’s failure to properly value nonexistent players isn’t necessarily evidence of a problem.  But in this case Tango’s example does highlight a real problem, because if you plug in the stats of a real replacement-level outfielder (something like .300/.400) you’ll get a salary of around $3 million, about 10x what it should be.

The models just don’t value either OBP or SLG nearly enough.  According to the multi-year model, gaining 150 pts of SLG and 150 pts of OBP combined only increases a player’s value by about 75%.  Is a 1.000 OPS player paid just 75% more than a .700 OPS player?  Obviously not.


#11    Tangotiger      (see all posts) 2009/10/22 (Thu) @ 22:47

Yes, my point was not that it breaks down at the very extremes.  I always say that you can’t extend an equation beyond the data it was based on (hence my disgust for using team data to test runs created, and then apply RC to players).

No, the point is how far I have to go to find an equivalent to a .340/.400 player with 400 PA, if I give the other guy 600 PA.  Basically, I can’t even find a guy so bad that he would be valued worse than the .340/.400 400 PA guy.  It’s not that it breaks down at the extremes, but it simply breaks down, period.

In WAR parlance, it would be very easy to find an equivalent for .340/.400 with 400 PA for a guy with 600 PA.  I can’t do it with the model presented in the paper (for the 2001 year).

The idea that you can add PA as an extra parameter like you would add a rate stat (!) is preposterous to say the least. 

I can find a ton of real-life absurd results from each of the equations they posted for each year.


#12    Guy      (see all posts) 2009/10/22 (Thu) @ 23:14

I’m not sure including PAs is a problem.  Because log(salary) is the dependent variable, an increase in PAs effectively multiplies the value of the hitting stats, as it should (though it’s not linear, which it should be). 

However, it’s not clear why increases in SLG or OBP should have a multiplicative impact.  100 points of OBP should have a fixed value, but in this model it’s worth a bit more coming from a high SLG player than a low SLG player. Anyone know why economists like to use log(salary)?
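The multiplicative behavior described here can be seen with a small Python sketch. It assumes the 2001 coefficients as quoted in the post and a free-agent outfielder; the helper names are hypothetical. Since the 2001 OBP coefficient is slightly negative, the sketch uses a 100-point SLG gain to show the same mechanism.

```python
import math

def predicted_salary_2001(slg, obp, pa):
    """Free-agent outfielder under the 2001 equation quoted in the post."""
    return math.exp(3.1 * slg - 0.13 * obp + 0.003 * pa + 1.68 + 10.3)

def slg_gain(slg, obp, pa, delta=0.100):
    """Return (salary ratio, dollar change) for adding `delta` of SLG."""
    before = predicted_salary_2001(slg, obp, pa)
    after = predicted_salary_2001(slg + delta, obp, pa)
    return after / before, after - before

ratio_400, dollars_400 = slg_gain(0.400, 0.330, 400)
ratio_600, dollars_600 = slg_gain(0.400, 0.330, 600)

# Same percentage bump in both cases, different dollar amounts.
print(f"400 PA: x{ratio_400:.3f}, +${dollars_400:,.0f}")
print(f"600 PA: x{ratio_600:.3f}, +${dollars_600:,.0f}")

# Each extra 100 PA multiplies salary by the same factor, so salary
# grows exponentially in PA rather than linearly.
pa_factor = predicted_salary_2001(0.400, 0.330, 500) / predicted_salary_2001(0.400, 0.330, 400)
print(f"per 100 PA: x{pa_factor:.3f}")
```

In a log-linear model, a fixed change in any input multiplies salary by a fixed factor (here exp(3.1 * 0.100), about 1.36, for 100 points of SLG), so the dollar value of the same improvement depends on everything else in the equation. That is the property being questioned above.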

