Tuesday, November 27, 2007
PITCHf/x data quality
Mike Fast reports on more data quality issues:
...Chamberlain throws Lind five straight sliders to strike him out and end the inning. Unfortunately, however, the pitch locations recorded by PITCHf/x for these pitches were mistakenly attached to the wrong pitches in the Gameday XML data.... Then the order of the other pitches is out of whack, too. The pitch labeled #1 should be #5, #2 should be #1, #4 should be #2, and #5 should be #3.
I don’t intend my notation of this example in any way to disparage the incredible work that MLBAM and Sportvision have done in creating this data set and making it available to us. For free, no less. It’s an incredibly valuable resource, and some errors are to be expected during a season in which the system was being evaluated and debugged.
I just don’t know how prevalent these kinds of errors are and when they might call into question some of my conclusions. I do know that Eric Van spotted a similar error in Josh Beckett’s data from Game 1 of the division series, as detailed in this thread at Sons of Sam Horn, post #88. The PITCHf/x data in question for that game has since been removed from the data set altogether.
As long as you get a substantial % correct in the aggregate, that should satisfy most research needs. It’s just something to keep in mind when looking at the data on an individual basis. And Mike is right that it’s not every day that we get the luxury of seeing Beta results in a production environment for the general public to consume.
Given that it is so far free, I can’t tell them what they should or should not be doing as far as accuracy, errors, quality control (QC), etc. are concerned, however…
When a large, reputable organization puts something out there, free or not free, there HAS to be some minimum level of QC. Also, even though they are not charging for it, how much money does it generate for them (through visits to their web site and through the exposure), now and in the future?
While a random error here or there is never going to be a big deal, for research purposes I would hope that there are not too many systematic errors or blocks of errors. That can be quite problematic.
Finally, how much trouble could it be, given their vast income and resources, to simply double-check ALL data with video, just to flag data which might be out of whack, not necessarily to double-check everything about the data?
If I don’t think that the data is 99% error free and especially if I think that there are “systematic errors,” I am going to be very reluctant to do any serious research using that data.