Wednesday, March 10, 2010
Open Letter from Cory Schwartz
This letter is in response to comments made at this thread:
http://www.insidethebook.com/ee/index.php/site/comments/pitchf_x_tools/
***
Mike, I’d like to address some of your comments regarding the MLBAM pitch classification engine. “Crappy” is a stronger critique than I think appropriate, but we do recognize that it’s not where it should be. However, to suggest that we’ve sat on our hands with what we’ve built is misinformed and incorrect.
We’ve treated this, and always presented it, as a work in progress. Along the way we have made several changes to improve our classifications since first rolling out with a simple, two-pronged neural net, one for lefties and one for righties:
1. Added pitcher-specific scaling for velocity to better differentiate fastballs from changeups, etc.;
2. Added biasing into the classifications to better reflect pitcher-specific repertoires;
3. Implemented an entirely new and much larger set of training data, which we used to add a second hidden layer to the NN;
4. Tweaked (and continue to tweak) the input parameters of the NN to improve our differentiation of 2-seamers vs. 4-seamers, cutters vs. sliders, and other similar pitches.
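For readers who want a concrete picture, here is a minimal sketch of what items 1 and 2 could look like. It is an illustration only, not MLBAM's actual code: the function names, the z-score scaling, the multiplicative prior, and the sample numbers are all assumptions made for the example.

```python
from statistics import mean, pstdev


def scale_velocity(speeds):
    """Express each pitch speed relative to this pitcher's own mean and spread.

    Assumption: a plain z-score. The letter only says velocity is scaled
    per pitcher; it does not say how.
    """
    mu = mean(speeds)
    sigma = pstdev(speeds) or 1.0  # avoid dividing by zero when all speeds match
    return [(s - mu) / sigma for s in speeds]


def apply_repertoire_bias(scores, repertoire_prior):
    """Reweight raw classifier scores by how often the pitcher throws each pitch.

    Assumption: a simple multiplicative prior followed by renormalization.
    Pitch types missing from the prior get weight 0 (not in the repertoire).
    """
    weighted = {p: s * repertoire_prior.get(p, 0.0) for p, s in scores.items()}
    total = sum(weighted.values())
    if total == 0:
        return dict(scores)  # prior rules out everything; fall back to raw scores
    return {p: w / total for p, w in weighted.items()}


# Hypothetical pitcher: three pitch speeds in mph, and a fastball/slider repertoire.
scaled = scale_velocity([90.0, 92.0, 94.0])
biased = apply_repertoire_bias(
    {"FF": 0.5, "SL": 0.3, "CU": 0.2},  # hypothetical raw network outputs
    {"FF": 0.6, "SL": 0.4},             # hypothetical repertoire frequencies
)
```

Scaling against the pitcher's own distribution lets a slow fastball from one pitcher and a fast changeup from another land in different places even when their raw speeds match, and the prior keeps the classifier from assigning a pitch the pitcher is not believed to throw.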
At each step of the way when we’ve made changes, we’ve taken time to evaluate the results, determined next steps, then built and implemented further changes. The pace of change may not be rapid, and is admittedly slower than we would prefer, but we have never stopped working on this in the background even if the results have not always been publicly visible.
For this season, we are currently testing fully customized neural nets for each pitcher, as well as new tools to more easily correct pitcher-specific repertoires and individual pitch-by-pitch classifications on a postgame basis. Both of these should be implemented soon after Opening Day, if not sooner. Once we implement these changes we will re-classify every pitch in our database based on the new custom NNs, then evaluate the results and move forward as mentioned above.
Remember also that classifying pitches in real time, for every pitch thrown in every game, is not the only challenge we face (and it is one you recognized in post #31); we are also limited by our ability to correctly define each pitcher’s unique repertoire, and to get accurate classifications to use as training data for the neural nets. We’ve enlisted the help of all 30 clubs, as well as you and perhaps others on this thread, in collecting classification and training data, but we’re limited by the accuracy of the source data. This has been a major effort on our part and continues to this day, and we’d be eager to see the results of any community-based efforts on this front.
In addition, our responsibilities to the Pitch-f/x system go far beyond pitch classification. As you can probably imagine, this is an expensive and resource-intensive system to operate and maintain in 30 MLB ballparks, so our attention can’t always be focused on pitch classifications or any other specific issue, as much as we might like it to be. That the research community has been exploiting this data and has generated some amazing research from it is an unexpected benefit, but not one that we can allow to influence our overall objectives or priorities. As for Bloomberg Sports, they can defend their own products, but the suggestion that we’re not trying to improve our data because they are licensing from us is not only incorrect but completely counterintuitive. On the contrary, we have every incentive to improve our classifications, and ALL of our data, to make sure we are providing the best possible product to a major partner, which will in turn enable them to find better acceptance in the marketplace for their products.
I’ll leave it to Ross to address the relative strengths and weaknesses of the neural network approach, but I did want to address some of the critiques of how we’ve managed this from a business standpoint.
Thanks,
Cory Schwartz
Director, Stats
MLB.com