THE BOOK--Playing The Percentages In Baseball


Tuesday, November 27, 2007

PITCHf/x data quality

By Tangotiger, 01:37 PM

Mike Fast reports on more data quality issues:

...Chamberlain throws Lind five straight sliders to strike him out and end the inning. Unfortunately, however, the pitch locations recorded by PITCHf/x for these pitches were mistakenly attached to the wrong pitches in the Gameday XML data.... Then the order of the other pitches is out of whack, too. The pitch labeled #1 should be #5, #2 should be #1, #4 should be #2, and #5 should be #3.

I don’t intend my notation of this example in any way to disparage the incredible work that MLBAM and Sportvision have done in creating this data set and making it available to us. For free, no less. It’s an incredibly valuable resource, and some errors are to be expected during a season in which the system was being evaluated and debugged.

I just don’t know how prevalent these kinds of errors are and when they might call into question some of my conclusions. I do know that Eric Van spotted a similar error in Josh Beckett’s data from Game 1 of the division series, as detailed in this thread at Sons of Sam Horn, post #88. The PITCHf/x data in question for that game has since been removed from the data set altogether. 

As long as a substantial percentage of the data is correct in the aggregate, that should satisfy most research needs. It is just something to keep in mind when looking at the data on an individual basis. And Mike is right that it’s not every day that we get the luxury of seeing beta results in a production environment for the general public to consume.
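The relabeling Mike describes for the Chamberlain/Lind at-bat can be sketched in a few lines of Python. This is only an illustration, not how MLBAM or anyone else actually repairs the data: the record layout (`seq`, `loc`) and the function name `relabel_pitches` are hypothetical, and the mapping for the pitch labeled #3 is an assumption inferred by elimination, since the excerpt above does not state it.

```python
def relabel_pitches(pitches, corrections):
    """Return copies of the pitch records, relabeled and sorted by sequence.

    pitches: list of dicts with a "seq" key (the label recorded in the data).
    corrections: dict mapping a recorded label to the label it should have.
    """
    labels = sorted(p["seq"] for p in pitches)
    new_labels = sorted(corrections.get(s, s) for s in labels)
    # A valid correction only shuffles labels; it never duplicates or drops one.
    if new_labels != labels:
        raise ValueError("corrections do not form a permutation of the labels")
    fixed = [dict(p, seq=corrections.get(p["seq"], p["seq"])) for p in pitches]
    return sorted(fixed, key=lambda p: p["seq"])


# Stated in the excerpt: #1 -> #5, #2 -> #1, #4 -> #2, #5 -> #3.
# ASSUMPTION: #3 -> #4 is inferred by elimination; the excerpt does not say so.
CORRECTIONS = {1: 5, 2: 1, 4: 2, 5: 3, 3: 4}

# Hypothetical records; "loc" stands in for the recorded pitch location.
recorded = [{"seq": n, "loc": f"location_{n}"} for n in range(1, 6)]
corrected = relabel_pitches(recorded, CORRECTIONS)
print([(p["seq"], p["loc"]) for p in corrected])
```

The permutation check matters for this kind of error: if a correction list is incomplete, two location records end up claiming the same pitch number, and it is better to fail loudly than to silently keep a scrambled at-bat.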


#1    MGL      2007/11/27 (Tue) @ 20:44

Given that it is so far free, I can’t tell them what they should or should not be doing as far as accuracy, errors, quality control (QC), etc. are concerned, however…

When a large, reputable organization puts something out there, free or not free, there HAS to be some minimum level of QC.  Also, even though they are not charging for it, how much money does it generate for them (through visits to their web site and through the exposure), now and in the future?

While a random error here or there is never going to be a big deal, for research purposes I would hope that there are not too many systematic errors or blocks of errors.  That can be quite problematic.

Finally, how much trouble could it be, given their vast income and resources, to simply check ALL data against video, just to flag data which might be out of whack, not necessarily to double-check everything about the data?

If I don’t think that the data is 99% error-free, and especially if I think that there are “systematic errors,” I am going to be very reluctant to do any serious research using that data.


#2    Cory Schwartz      2007/11/29 (Thu) @ 18:28

Hi, been a long time since I posted here… in short, we found the error causing the sequence issue described here and believe it was mostly fixed during the postseason.

It was indeed a systematic problem and we made a few changes to improve it, so we expect that data errors in this regard should be no more than one or two per game during the postseason. Of course, that’s not 100%, and we’re implementing other changes over the offseason to hopefully reach that goal.

QC on this project has been extensive, but a system like this has never been implemented on this scale and we encountered new issues all season long. Once an issue was noted, that started the repair cycle: isolate the problem, determine the corrective steps, test repairs, and implement changes in production. Due to the complexity of the system that often took weeks, and many changes we wanted to implement were tabled for the offseason.

In short, we’re acutely aware of the system’s data accuracy issues, and fixing them is second only to 100% game coverage on our offseason priority list.

Thanks for everyone’s feedback here and elsewhere all season long, keep it coming!

Thanks and best regards,

Cory Schwartz
Director, Stats
MLBAM


#3    Mike Fast      2008/06/04 (Wed) @ 18:38

For the other folks out there who are looking at PITCHf/x data, I’ve started tracking problems in the data here:
http://fastballs.wordpress.com/errors-and-oddities/

If anyone else notices something, I would appreciate it if they would bring it to my attention so I can add it to the list.

Disclaimer: This list is not intended as a criticism of MLBAM or Sportvision or the quality of the data they are sharing with us.  Far from it!  Its purpose is that the PITCHf/x researchers can make sure they are working with good data by collectively identifying the problem data for culling or correction.  If MLBAM and Sportvision can use it to improve their data, that is also a plus, and I hope it can be helpful to that end, as well.
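For readers wondering what "culling" against a shared list of problem data might look like in practice, here is a minimal Python sketch. It assumes each pitch record carries a game identifier and that the problem list can be reduced to a set of such identifiers; the field name `game_id`, the function `cull_flagged`, and the sample IDs are all hypothetical and are not taken from the list linked above.

```python
def cull_flagged(pitches, flagged_game_ids):
    """Drop pitches from games on a known-problem list.

    Returns the kept pitches and the number of pitches dropped, so the
    size of the cull can be reported alongside any results.
    """
    flagged = set(flagged_game_ids)
    kept = [p for p in pitches if p["game_id"] not in flagged]
    return kept, len(pitches) - len(kept)


# Hypothetical sample data; the game IDs are made up for illustration.
pitches = [
    {"game_id": "game_A", "seq": 1},
    {"game_id": "game_A", "seq": 2},
    {"game_id": "game_B", "seq": 1},
    {"game_id": "game_C", "seq": 1},
]
kept, dropped = cull_flagged(pitches, ["game_B"])
print(len(kept), dropped)
```

Dropping whole games is the bluntest option. Where the nature of an error is known, correcting the affected records instead would keep more data.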

