Friday, October 12, 2007
Is MLB’s pitchf/x system accurate?
First of all, another great article and analysis by Dan Fox at BP.
As an aside, he had a great quote at the end of the article (which has nothing to do with the pitch data). People ask me all the time, “Who do you think is going to win the (insert award/series/etc.)?” Like who do you think is going to win the Indians/BoSox ALCS? Even after painfully analyzing the series for several hours, my pat (and factually correct) answer is, “I have no (bleeping) idea!” To borrow a phrase (again) from Bill James, “I am an analyst, not an oracle.” I can tell them the percentage chance of winning I think each team has (based on my model and my analysis), but I cannot tell them “who is going to win (obviously).”
Now let’s say that I had Boston at 65% (which I don’t). If they win, was I “right?” If they lose, was I wrong? What about if I had Boston at 51% (which I do) and they win? Right or wrong? Heck if I know the answers to those questions. I DO know, for example, that if I were a perfect modeler and knew the exact percentages in all baseball series, or even in every one of the 2,430 games during the regular season, and I counted a “right” every time my favored team won and a “wrong” every time it lost, I would be “wrong” a heck of a lot!
As I also like to say, “If a good - no a great - analyst isn’t wrong a heck of a lot, he is probably cheating.”
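To put a rough number on "a heck of a lot," here is a minimal sketch in Python. It assumes a hypothetical season in which the favorite's true win probability is 55% in every game (an illustrative figure, not taken from any real model) and computes how often even a perfect modeler's favored team loses. The function name `expected_wrong_rate` is made up for this example.

```python
def expected_wrong_rate(true_probs):
    """Expected share of games in which the favored team loses.

    true_probs: the true win probability of one team in each game.
    The "favorite" in a game is whichever side has probability >= 0.5.
    """
    misses = sum(1.0 - max(p, 1.0 - p) for p in true_probs)
    return misses / len(true_probs)


# Hypothetical season: 2,430 games, favorite always at a true 55%.
# (Assumed for illustration only; real games vary a lot.)
season = [0.55] * 2430
rate = expected_wrong_rate(season)

print(f"Expected 'wrong' rate: {rate:.1%}")
print(f"Expected 'wrong' picks in {len(season)} games: {rate * len(season):.0f}")

# A single series with the favorite at 51% is "wrong" almost half the time.
print(f"One series at 51%: wrong {expected_wrong_rate([0.51]):.0%} of the time")
```

Under that assumption the perfect modeler misses on roughly 45% of games, well over a thousand in a season, without ever having a single probability wrong.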
Anyway, here is what Dan wrote about the difference between probabilities (which are all an analyst can offer) and predictions (which are silly and meaningless for an analyst to make):
A subtler but related point in this vein is that some seem to think the models used to discuss events are necessarily predictions and therefore take a “told you so” approach when the end result seems improbable according to the model. But probabilities are not predictions, and so in addition to the fact that the models used to generate the probabilities are incomplete, even events that are unlikely do in fact happen. Only if you could replay the event hundreds or thousands of times could you say with confidence that the model is not useful.
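As a back-of-the-envelope illustration of the "hundreds or thousands of times" point, the sketch below estimates how many replays it would take before the observed win rate could separate a model's probability from a coin flip. It uses the normal approximation to the binomial with a roughly 95% margin of error. That is my assumption about how to frame the question, not something from Dan's article, and the function name `replays_needed` is hypothetical.

```python
import math


def replays_needed(p_model, p_alt=0.5, z=1.96):
    """Smallest n whose ~95% margin of error fits inside the gap.

    Normal approximation: margin = z * sqrt(p * (1 - p) / n).
    Solving margin <= |p_model - p_alt| for n gives the formula below.
    """
    gap = abs(p_model - p_alt)
    if gap == 0:
        raise ValueError("the two probabilities are identical")
    variance = p_model * (1.0 - p_model)
    return math.ceil(z * z * variance / (gap * gap))


for p in (0.65, 0.55, 0.51):
    print(f"favorite at {p:.0%}: about {replays_needed(p)} replays")
```

With these assumptions, a 65% favorite needs only a few dozen replays to be told apart from a coin flip, while a 51% favorite needs on the order of ten thousand. One series tells you essentially nothing about either number.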
Back to the pitchf/x data…
With respect to a pitch’s deceleration from start to finish, there is only ONE thing that determines that rate (given the same initial velocity, the same wind speed, and the same types of pitches) - the density of the air. No problem there.
Now, there appear to me to be only three determinants of air density: one, temperature; two, humidity; and three, altitude. I THINK that altitude is by far the most important, then temperature, and then humidity.
Given that, it should be easy enough to predict the approximate order of the parks with respect to pitch deceleration. Or at least which parks will be at the top and which ones at the bottom.
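Here is a minimal sketch of how that expected ordering could be computed. It uses textbook approximations: the standard-atmosphere barometric formula for pressure at altitude, the Tetens formula for saturation vapor pressure, and the ideal gas law for a mix of dry air and water vapor. The three sets of conditions are hypothetical, chosen only to show the ranking step; they are not measurements from any real park, and the function name `air_density` is mine.

```python
import math

R_DRY = 287.058    # specific gas constant of dry air, J/(kg*K)
R_VAPOR = 461.495  # specific gas constant of water vapor, J/(kg*K)


def air_density(temp_c, rel_humidity, altitude_m):
    """Approximate air density in kg/m^3.

    temp_c: air temperature in degrees Celsius
    rel_humidity: relative humidity as a fraction from 0.0 to 1.0
    altitude_m: elevation above sea level in meters
    """
    temp_k = temp_c + 273.15
    # Standard-atmosphere pressure at altitude (Pa); ignores daily weather.
    pressure = 101325.0 * (1.0 - 2.25577e-5 * altitude_m) ** 5.25588
    # Tetens approximation for saturation vapor pressure (Pa).
    saturation = 610.78 * math.exp(17.27 * temp_c / (temp_c + 237.3))
    vapor = rel_humidity * saturation
    # Sum the densities of the dry-air part and the water-vapor part.
    return (pressure - vapor) / (R_DRY * temp_k) + vapor / (R_VAPOR * temp_k)


# Hypothetical conditions, NOT real park data.
conditions = {
    "cold, sea level": (10.0, 0.5, 0.0),
    "hot and humid, sea level": (30.0, 0.8, 0.0),
    "warm, mile high": (25.0, 0.3, 1600.0),
}

# Denser air means more drag, so the densest conditions come first.
ranked = sorted(conditions, key=lambda k: air_density(*conditions[k]), reverse=True)
for name in ranked:
    print(f"{name}: {air_density(*conditions[name]):.3f} kg/m^3")
```

Since drag force is proportional to air density, the conditions at the top of this list should show the most deceleration and those at the bottom the least, all else being equal. Note that humid air is slightly less dense than dry air at the same temperature and pressure, so humidity pushes in the same direction as heat.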
If you look at Dan’s list, however, the parks do NOT seem to fall in the order you would expect, as Dan points out (e.g., Comerica Park is both cold and low in altitude). If you look at the pitch break lists you see some equally funky rankings of parks.
Either the pitchf/x data, at least with respect to pitch deceleration and break, is VERY inaccurate (and/or there are biases among parks), or there are a lot of different, unaccounted-for wind patterns in these parks.
Smooth Jimmy Apollo: Well, folks, when you’re right 52% of the time, you’re wrong 48% of the time.
Homer: Why didn’t you say that before!
***
I would hope it’s a given that all predictions are made with the understanding that the predictor expects to be right 51-60% of the time, unless otherwise noted. There can’t be that many real-life Homer Simpsons, can there?
Then again, if that’s true, and predictions are made with the expectation of being only a bit over 50% correct, why in the world do we care about individual predictions by individual people?
I also give a similar answer to MGL: I don’t know who will win, and no one else does either. All I want is a good game.