Wednesday, March 10, 2010
Open Letter from Cory Schwartz
This letter is in response to comments made at this thread:
http://www.insidethebook.com/ee/index.php/site/comments/pitchf_x_tools/
***
Mike, I’d like to address some of your comments regarding the MLBAM pitch classification engine. “Crappy” is a stronger critique than I think appropriate, but we do recognize that it’s not where it should be. However, to suggest that we’ve sat on our hands with what we’ve built is misinformed and incorrect.
We’ve treated this, and always presented it, as a work-in-progress. Along the way we have made several changes to improve our classifications since first rolling out with a simple, two-pronged neural net, one for lefties and one for righties:
1. Added pitcher-specific scaling for velocity to better differentiate fastballs from changeups, etc.;
2. Added biasing into the classifications to better reflect pitcher-specific repertoires;
3. Implemented an entirely new and much larger set of training data, which we used to add a second hidden layer to the NN;
4. Tweaked (and continue to tweak) the input parameters of the NN to improve our differentiation of 2-seamers vs. 4-seamers, cutters vs. sliders, and other similar pitches.
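[Editor's illustration, not part of the original letter.] Items 1 and 2 above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not MLBAM's implementation: the function names, the z-score scaling, the multiplicative bias, and the sample numbers are all hypothetical.

```python
from statistics import mean, pstdev

def scale_velocity(speeds):
    """Express each pitch speed relative to the pitcher's own mean and spread.

    Assumption: a plain z-score; the letter does not say which scaling is used.
    """
    mu = mean(speeds)
    sigma = pstdev(speeds) or 1.0  # fall back to 1.0 if every speed is identical
    return [(s - mu) / sigma for s in speeds]

def apply_repertoire_bias(scores, repertoire_weights):
    """Reweight raw classifier scores by the pitcher's known repertoire.

    Pitch types missing from the repertoire get weight 0, so they can no
    longer win. Scores are renormalized to sum to 1 afterwards.
    """
    weighted = {p: s * repertoire_weights.get(p, 0.0) for p, s in scores.items()}
    total = sum(weighted.values())
    if total == 0:
        return dict(scores)  # no overlap with the repertoire: leave scores as-is
    return {p: w / total for p, w in weighted.items()}

# Hypothetical pitcher who throws a four-seamer, slider, and changeup but no cutter.
raw_scores = {"FF": 0.40, "FC": 0.45, "SL": 0.15}
repertoire = {"FF": 0.60, "SL": 0.30, "CH": 0.10}
biased = apply_repertoire_bias(raw_scores, repertoire)
best = max(biased, key=biased.get)  # the cutter is ruled out, so "FF" wins
```

The point of the second function is that a generic classifier's top guess (here a cutter) can be overridden when the pitcher is not known to throw that pitch at all.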
At each step of the way when we’ve made changes, we’ve taken time to evaluate the results, determined next steps, then built and implemented further changes. The pace of change may not be rapid, and is admittedly slower than we would prefer as well, but we have never stopped working on this in the background even if the results have not always been publicly visible.
For this season, we are currently testing fully customized neural nets for each pitcher, as well as new tools to more easily correct pitcher-specific repertoires and individual pitch-by-pitch classifications on a postgame basis. Both of these should be implemented soon after Opening Day, if not sooner. Once we implement these changes we will re-classify every pitch in our database based on the new custom NNs, then evaluate the results and move forward as mentioned above.
Remember also that classifying pitches in real time - for every pitch thrown, every game - is not the only challenge we face (and one you recognized in post #31); we are also limited by the ability to correctly define each pitcher’s unique repertoire, and to get accurate classifications to use as training data for the neural nets. We’ve enlisted help from all 30 clubs, as well as from you and perhaps others on this thread, in collecting classification and training data, but we’re limited by the accuracy of the source data. This has been a major effort on our part and continues to this day, and we’d be eager to see the results of any community-based efforts on this front.
In addition, our responsibilities to the Pitch-f/x system go far beyond pitch classification. As you can probably imagine, this is an expensive and resource-intensive system to operate and maintain in 30 MLB ballparks, so our attention can’t always be focused on pitch classifications or any other specific issue, as much as we might like it to be. That the research community has been exploiting this data and has generated some amazing research from it is an unexpected benefit, but not one that we can allow to influence our overall objectives or priorities. As for Bloomberg Sports, they can defend their own products, but the suggestion that we’re not trying to improve our data because they are licensing from us is not only incorrect but completely counterintuitive. On the contrary, we have every incentive to improve our classifications, and ALL of our data, to make sure we provide the best possible product to a major partner, which will in turn enable them to find better acceptance in the marketplace for their products.
I’ll leave it to Ross to address the relative strengths and weaknesses of the neural network approach, but I did want to address some of the critiques of how we’ve managed this from a business standpoint.
Thanks,
Cory Schwartz
Director, Stats
MLB.com


Cory, a lot of your rebuttal (or however you want to term it) to my comments is fair.
Let me state up front that I did not mean what I said as a general critique of MLBAM or MLBAM stats. There is a reason that MLBAM is widely recognized as the most successful online presence of the major sports leagues. There is a reason that MLB.tv and Gameday simply kick butt. You guys do good work. And by that I mean MLBAM in general and you and Ross in particular.
Like my criticisms of Baseball Prospectus a week or two ago on this site, I think they were fair, but my original comments were made in a very specific context and did not include the whole of my feelings or experience. I have a very positive overall view of MLBAM and MLBAM stats. Within that context I have a very few disagreements or bones to pick or whatever.
I think your general approach to improving the quality of the data is very good, and a lot of it goes unseen. Particularly your work at improving the input of the stringers and getting well over 95% (and increasing) of the pitch data reported is very important and very good.
I appreciate your openness with the analytical community, and I think that benefits everyone, MLBAM and the clubs included.
I can probably think of more things that I like, but let me address some of my specific disagreements/issues.
The neural net is a poor choice for classification. It was a good place to start, and good for Ross for coming up with it, implementing it, and debugging a lot of issues with it. That was not a simple task, and he did an excellent job at it, and I don’t mean to suggest otherwise.
However, it was pointed out at the 2008 Summit that the neural net was going to be fundamentally limited in its accuracy as a classification tool. All the improvements you mention in #2, 3, and 4 are simply bumping up against the ceiling that is imposed by the method. Marv White did a good job of explaining at that summit why the neural net had fundamental issues. I can go into more detail if you desire.
My issue/claim is not that you’ve sat on your hands rather than improving the neural net, it’s that you’ve chosen to stick with the neural net this long.
Accurate pitch classification has huge monetary value to the clubs, and there are not very many people who are good at it. I would guess that the clubs have few people if any who are good at it, and if they have them (Dan Fox?), I doubt they’re giving that data to BAM for free. I’m not sure why anyone would give away such a valuable IP without a commensurate return.
Sure. I recognize that. That’s why I said I thought you should hire more people, not that I thought you didn’t have enough work to do and were just sitting around on your hands.
Obviously I have an interest in that, too, as I’d be happy to sell my IP to you guys. So I don’t claim to be unbiased. But I do think I’m correct.
I can’t believe that the clubs feel that way. Well, I can believe it. I know that they don’t trust the PITCHf/x data enough to use it right now. But the data could be improved to be trustworthy, and then the value to the clubs would be so high that I would think it would influence your overall objectives and priorities.
Of course, I don’t know how much of your priorities as an MLBAM organization lie with improving your product as an entertainment source to gain revenues. It would seem to me as an uninformed but interested observer that that is probably your primary focus. I don’t know where your priorities lie with regard to data collection for the use of the clubs. So maybe it’s not really your bailiwick that I’m criticizing here, but I think someone associated with MLB should do what it takes to make the PITCHf/x data useful to the clubs.
I’ll admit that my dig at Bloomberg was out of frustration at their ability to get a wow factor with nice charts while I question the ability to do meaningful analysis without much more accurate pitch classifications. There is and was no reason for me to think that this would produce an incentive for you to stagnate, and I should not have suggested such.
Btw, it’s nice to see that you still read things here.