Tuesday, November 17, 2009
Going through the Sauer/Hakes Moneyball data
Sauer and Hakes were kind enough to give me their dataset. As you know, I took exception to their findings, because I don’t believe their model captures the reality of baseball. While they did a great job in identifying the parameters to study, the way they combined those parameters left a lot to be desired.
As I noted earlier, the 2004 equation implies that, of two players with equivalent hitting stats, the outfielder will be paid more than the infielder. This makes no sense of course. Positional scarcity (which comes about because the fielding talent is greater in the infield) would demand that the infielders are paid more for equivalent hitting. A baseball fan will accept this as true. A GM will pay for this as true. A model that doesn’t reflect the reality of what we know by our guts and what actually transpires on the field has shortcomings.
So, what happens in the 2004 data? Using their data (setting filters on year=2003, freeyear=2004, free=1, inf=1) we have 16 infielders who became free agents for the 2004 season. The most prominent of these players is Miguel Tejada, who we know signed a huge 6/72MM deal with the Orioles. This deal, however, was backloaded. His actual payout for the 2004 season, using the Hakes/Sauer data, is 4.8MM$. Therein lies the problem. It’s irrelevant what he was actually paid for that season, since he did sign a 6-yr deal. The simplest fix is to record his salary as 12MM$ for each year of the deal. This makes an enormous difference.
(Joe Sakic, in the prime of his career, once signed a front-loaded offer sheet from the NY Rangers for 3/21, with 17MM being payable in the first year. We’re not going to sit here and say that we should count his salaries as 17, 2, 2, are we?)
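To put numbers on the backloading point, here is a minimal sketch that spreads a contract's total evenly over its length. The function name `average_annual_value` is a label chosen for illustration, and the even spread is a simplifying assumption (it ignores signing bonuses and the time value of money). The contract terms and the 4.8MM$ figure are the ones quoted above.

```python
def average_annual_value(total_mm, years):
    """Spread a contract's total value (in MM$) evenly over its length."""
    return total_mm / years

# Contract terms as quoted in the post: (total MM$, years).
tejada_aav = average_annual_value(72, 6)   # the 6/72 deal
sakic_aav = average_annual_value(21, 3)    # the 3/21 offer sheet

# 2004 payout recorded for Tejada in the Hakes/Sauer data, per the post.
tejada_recorded_2004 = 4.8
understatement = tejada_aav - tejada_recorded_2004

print(f"Tejada: {tejada_aav:.1f}MM$ per year vs {tejada_recorded_2004}MM$ recorded")
print(f"Sakic: {sakic_aav:.1f}MM$ per year, not 17/2/2")
```

On these assumptions, the recorded 2004 salary understates Tejada's per-year value by more than half.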
There were 3 other infielders who had similar hitting stats to Tejada: Luis Castillo, Joe Randa, and Mark Grudzielanek. Let’s look at each of these guys.
Grudz, according to their data, was a free agent entering the 2004 season. In 2003, for the Cubs, he earned 5.5MM$. He was still with the Cubs in 2004, but for 2.5MM$. In 2003, he had one of his best hitting seasons, and even got a couple of MVP votes (at the low end). In 2004, he played 81 games. I can only presume that Grudz was injured, and so, the Cubs got the injury discount.
Joe Randa was 34 years old.
Luis Castillo signed a 3/16 deal starting in 2004. The Sauer/Hakes data shows him with a 4.7MM$ salary (which is not much different from the 5.3MM$ per-year average of that deal, but still works against him, and infielders). Castillo, as we know, is an onbase machine. Plus a good fielder. And yet, all he got was a 3/16 deal? His OPS+ figures in the three preceding seasons (2001, 2002, 2003) were: 81, 95, 106. But, the Sauer/Hakes model only looked at his 2003 performance (106). So, it looks like he’s as good a hitter as Tejada, when in fact he was not. Tejada’s OPS+ from 2001-2003: 109, 128, 111.
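The gap between the two hitters is easier to see when the three seasons are averaged. The sketch below uses only the OPS+ figures quoted above. The unweighted average is an assumption made for illustration; it is not how the Sauer/Hakes model (or any particular projection system) weights past seasons.

```python
# OPS+ by season (2001, 2002, 2003), as quoted in the post.
ops_plus = {
    "Castillo": [81, 95, 106],
    "Tejada": [109, 128, 111],
}

def simple_average(values):
    """Unweighted mean; an illustrative choice, not the model's method."""
    return sum(values) / len(values)

# Gap when only the most recent season (2003) is considered.
single_year_gap = ops_plus["Tejada"][-1] - ops_plus["Castillo"][-1]

# Gap when all three seasons are averaged.
three_year_gap = (
    simple_average(ops_plus["Tejada"]) - simple_average(ops_plus["Castillo"])
)

print("2003-only gap:", single_year_gap)
print("3-year average gap:", three_year_gap)
```

A model that sees only 2003 sees a 5-point gap; the three-year view shows 22 points.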
The fifth best hitting infielder was probably Todd Walker, and we know he’s a disaster as a fielder. So, if he signs a low contract, we know that teams are valuing based on that fact.
Then there’s Greg Norton, who signed a 600K deal, even though his OBP/SLG were a bit above league average and he was a 3B. How does that happen? Well, Norton played in Coors in 2003, so his performance was actually pretty bad as a hitter. And, he was a backup. Now, they do include a PA parameter, but that’s not enough to undo this damage. Their model has him at 1.2MM$, and he was paid half that (barely above the minimum level). Indeed, by their regression equation, it is virtually impossible to fit the data to get a 600K free agent salary.
I took the 6 infielders with the best hitting stats in 2003, and I was able to explain an anomaly for each one. The end-result is what the regression equation told us: infielders should get paid less than outfielders for the same offensive performance. While that, in and of itself, does not necessarily damn the study, it points to bias or a data anomaly. The Tejada salary misuse is fairly damning. We have pointed to many other issues in past threads that left us scratching our heads at what the regression equations are saying. Overall, the results simply point to claims that do not seem reasonable, and the supporting evidence raises enough questions to invalidate the study.
***
I was listening to NPR on my drive home, and I missed most of the interview, but from what I gathered, they were talking about the recently deceased James Lilley, U.S. Envoy to China during the Tiananmen crackdown. And they recounted a story about how someone told him “your data is wrong” and his response was “then I have to retract my opinion”.
Hopefully, we’ve given Sauer/Hakes enough to think about here that they will question their data and methodology.
***
UPDATE: I decided to look at the 2001 data, which is the one where the OBP parameter is negative.
I set the filter to year=2000, free=1. I expected to see all the free agents. That’s not what I found. This “free” parameter is not coded the way I expected. For example, Barry Bonds’ contract expired following the 2001 season (free agent entering 2002). But the “free” parameter is set to 1 for every year in his data set (1999, 2000, 2001, 2002, 2003). So, it can’t be used as a stand-alone filter like that. I don’t know if this impacts their results. I also have to set freeyear=2001 in order to get my list.
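The filtering problem can be sketched in a few lines. The field names `year`, `free`, and `freeyear` are the ones used in the dataset; every row below is hypothetical and exists only to show the behavior. In particular, "Player A" is invented, and storing freeyear=2002 on each Bonds row is an assumption about how the file is laid out.

```python
# Hypothetical rows for illustration only; not taken from the actual file.
# Assumption: Bonds carries free=1 and freeyear=2002 on every season row.
rows = [
    {"player": "Bonds", "year": y, "free": 1, "freeyear": 2002}
    for y in (1999, 2000, 2001, 2002, 2003)
]
# An invented player who really was a free agent entering 2001.
rows.append({"player": "Player A", "year": 2000, "free": 1, "freeyear": 2001})

def select(rows, **conditions):
    """Return the rows whose fields equal every given condition."""
    return [r for r in rows if all(r[k] == v for k, v in conditions.items())]

# free=1 alone also picks up Bonds, who was not a free agent entering 2001.
loose = select(rows, year=2000, free=1)

# Adding freeyear=2001 narrows the list to genuine free agents.
strict = select(rows, year=2000, free=1, freeyear=2001)

print([r["player"] for r in loose])
print([r["player"] for r in strict])
```

The same two-condition check applies to the 2004 filter (year=2003, freeyear=2004, free=1).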
This list is a total of only 13 players, which seems awfully small to me. Maybe something happened in 2001, I don’t know. But, most importantly, this is the year that ARod signed his contract! Talk about an outlier to end all outliers: the regression equation then has to try to fit that data point along with free agent Jeff Reboulet, who signed at close to the league minimum. Reboulet had an OBP of .325 and ARod of .420. Of course you can’t give too much weight to the OBP. How could you? Plug ARod into their 2001 equation, and you get an estimated salary of 8.4MM$. Ouch. Reboulet, he of the lower SLG than OBP (hard to do), comes in with a forecasted 762K$ salary, compared to his actual 450K$.
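To see how much a single extreme contract can move a fitted coefficient in a small sample, here is a toy one-variable least-squares sketch. All of the numbers are synthetic and invented for illustration (only the .420 OBP echoes the post). It does not reproduce the multi-variable Sauer/Hakes equation; it only shows the sensitivity of a fit to one point.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs (single predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic (OBP, salary in MM$) pairs that lie exactly on a line.
obp = [0.30, 0.32, 0.34, 0.36]
salary = [1.0, 2.0, 3.0, 4.0]

slope_without = ols_slope(obp, salary)

# Add one invented monster contract at a .420 OBP and refit.
slope_with = ols_slope(obp + [0.42], salary + [22.0])

print(f"slope without the outlier: {slope_without:.1f}")
print(f"slope with the outlier:    {slope_with:.1f}")
```

With only a handful of players in the sample, one such point dominates the fit, and the line no longer describes the ordinary players well.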
Regardless of whether I’m misreading the data regarding the counts, I do know that ARod played for the Rangers in 2001, which means that he’s part of this dataset. And since their regression equation seems out-of-whack, and “coincidentally” that year also happens to include the largest monster contract of all-time, well, that pretty much explains that, doesn’t it?
***
Here’s their dataset, which they authorized me yesterday to release, but I missed that email somehow. Now, you guys have a crack at it: moneyball-data.xls, and in stata format: lewis9903d.dta.txt (Note that after you download, rename the file to remove the .txt; it was the only way the software let me upload it.)
At the request of the authors, I want to point out that the data is sourced from the Lahman DB and the Pappas salary data.

