Thursday, February 19, 2009
Intentionally using less data?
Tim Marchman writes:
while reading the new edition of the Baseball Prospectus annual, I was a bit put off by this information on their new play-by-play defensive metric:
The best PBP systems rely on highly detailed batted-ball data—a direction for where the ball was hit, some indication of how hard, and the result of the play, with the field broken down into many, many fairly small zones. That data is typically available only for the majors. To keep the majors and minors on an even setting, we’re dealing with a reduced set of data.
As I understand the idea here, BP wants to make apples-to-apples comparisons between their minor league and major league defensive numbers, and so is artificially crippling the data set they’re using to derive the major league numbers to bring it into line with the less granular data available for the minors. I see the appeal, but it makes the topline numbers suspect, especially when the system arrives at seemingly wonky results like Bobby Abreu rating as a plus defender and Hanley Ramirez as a Gold Glove candidate last year. Of course even very good systems have outliers, but not every system intentionally deals with a reduced set of data. For now I’ll continue to rely on UZR and Plus/Minus, though I’ll be curious to see what people like Tom Tango have to say about the technical pros and cons of the new system.
Since Tim asked, I’ll respond.
I have no doubt that Clay wrote that. Why is that? Because this is the same thing he said a few years ago when defending using the non-PBP numbers in FRAA, to make the “apples-to-apples” comparison for all of baseball history. I disagreed completely then, and I disagree just as much now.
Why is that? Because the biases are not systematic. Presuming that Clay is using Dan Fox’s Simple Fielding Runs (SFR), the SFR of Darin Erstad in 2000 has no stronger relationship to the SFR of Darin Erstad in 2001 than to the UZR of Darin Erstad in 2001. Indeed, I would bet that SFR in year X has a stronger relationship with UZR in year X+1 than with SFR in year X+1! (Shades of the ERA/FIP discussion.) The only way for the year-to-year SFR relationship to be stronger is if there is a systematic bias with SFR to begin with.
As a good example, an early version of PMR (Pinto’s model) had a super love affair with Orlando Hudson, while UZR merely liked-to-loved him. Why is that? One possibility was that Orlando Hudson was a popup hog, and he was getting a benefit there every year. That’s a systematic bias.
But this doesn’t apply here with SFR and UZR. There is no question at all that you always want to use the maximum data (combined with an intelligent approach, natch), and using the same methodology does not, in and of itself, provide an apples-to-apples comparison.
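To make the year-to-year argument concrete, here is a toy simulation, not real SFR or UZR data. Everything in it is an assumption made up for illustration: each fielder has a fixed true talent, the "SFR-like" metric is talent plus a lot of random noise, the "UZR-like" metric is talent plus less noise, and bias_sd controls a hypothetical player-specific error that the SFR-like metric repeats every year (the popup-hog kind of bias). The names simulate and pearson are just helpers for this sketch.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(bias_sd, n_players=20000, seed=0):
    """Return (corr of SFR-like year X with SFR-like year X+1,
               corr of SFR-like year X with UZR-like year X+1).

    All spreads below are invented for illustration, in "runs":
    talent sd 5, SFR-like noise sd 10, UZR-like noise sd 6.
    """
    rng = random.Random(seed)
    sfr_x, sfr_next, uzr_next = [], [], []
    for _ in range(n_players):
        talent = rng.gauss(0, 5)      # true skill, assumed constant across years
        bias = rng.gauss(0, bias_sd)  # persistent per-player error in the SFR-like metric
        sfr_x.append(talent + bias + rng.gauss(0, 10))
        sfr_next.append(talent + bias + rng.gauss(0, 10))
        uzr_next.append(talent + rng.gauss(0, 6))
    return pearson(sfr_x, sfr_next), pearson(sfr_x, uzr_next)


# No systematic bias: the errors in the cruder metric are pure noise.
same_metric, cross_metric = simulate(bias_sd=0.0)
print("no bias:   SFR->SFR", round(same_metric, 3), " SFR->UZR", round(cross_metric, 3))

# Systematic bias: the cruder metric makes the same mistake on a player every year.
same_metric_b, cross_metric_b = simulate(bias_sd=8.0)
print("with bias: SFR->SFR", round(same_metric_b, 3), " SFR->UZR", round(cross_metric_b, 3))
```

Under these made-up settings, the unbiased case has the cruder metric predicting next year's finer metric better than it predicts itself, and only the persistent per-player bias flips that ordering. The sizes of the noise terms are guesses; the point is the direction of the comparison, not the specific correlations.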
(Hat tip: Repoz)


SOUNDS LIKE A JOB FOR BAYESIAN MODEL SELECTION!!!