Sunday, March 07, 2010
PITCHf/x Tools
Dave gives us his list.
Looks excellent!
Phil Hughes really added some serious velocity when he moved to the bullpen.
At some point I will add more. I was thinking of percent pitch selection vs RHB and vs LHB. I probably won’t put all the spin/break categories up there (I haven’t even grokked them yet). But I’ll select a few that are most interesting, like the horizontal break on a slider. I’m trying to give an overview for what most people want to look up for pitch fx data. Suggestions are welcome. But I definitely won’t be putting everything on there, I think that would be overkill. I’m planning on adding links to other pitch fx resources at the bottom, Dave’s article is a good source.
IMO, Fangraphs and TexasLeaguers cover the spectrum pretty well of what is useful given they are working with the crappy MLBAM pitch classifications.
I’m pretty sure Fangraphs is not using MLBAM; they are getting the data from BIS. One thing I noticed yesterday is that Fangraphs shows Tim Wakefield mixing in 10% batting practice fastballs, while MLBAM classifies almost every pitch he throws as a knuckleball.
Rally/5, it depends on which data on Fangraphs you are looking at. Their “pitch type” and “pitch type values” data on the pitcher season stats tab come from BIS. But if you click on the PITCHFX tab you get the PITCHf/x data and pitch classifications from MLBAM.
Wakefield is a good example showing the poverty of the MLBAM classification system. It says Wakefield throws 99% knucklers, which is not correct. The BIS data looks pretty close to correct for Wake. In general BIS does much better with pitch classifications than MLBAM, although BIS data has its weaknesses, too.
Mike, is there any chance for the community to help get the MLBAM pitches classified correctly? My brother would have no problems storing the corrected dataset. Maybe a player rep for each team.
Jeff/7, it’s not a trivial task and those who can do it may be reluctant to give it away for free, but in principle what you suggest is possible.
#8 - Mike, I know what it will take. I may just start knocking it out one pitcher a night.
The other dataset someone needs to put together, which I will probably end up doing, is the stadium mis-calculations, made available for everyone to use. This actually needs to be available before redefining pitches.
Mike and Jeff, I think the first step we’d need to take in setting up a community classification project, with which I’d be willing to help, would be to assemble a document containing links to reliable scouting reports detailing pitcher repertoires.
Jeremy - send me an email at wydiyd ~ hotmail ~ com
Jeremy, I think I mentioned this (eventually) on the other thread here where you asked, but I would be willing to help with a community classification project.
What’s the best format for this? Would it be a wiki? IMO, it would be nice to collect both full scouting reports as well as smaller incomplete reports, such as when a pitcher mentions he is working on a particular pitch. It might also be nice to collect links to images showing the pitcher’s grip for a pitch.
I would be glad to help with a community classification project as well. I think people should just post the pitcher name, pitch types and usage rates, as well as any special notes on the players.
I’ve already classified a handful of pitchers, going game by game, for various articles and would be glad to give out my info.
Mike, Nick and Jeremy—I have been wanting to set up a site similar to:
for every MLB pitcher. It was to mainly focus on mechanics (linked video, comments, ratings, etc) It might not be hard then to add the information that Mike is talking about.
We could just use a google doc (Jeremy is all over this) to set it up for initial research and then upgrade later.
Jeff/14,
Are we going to primarily be collecting links to others’ work? In which case a wiki style might make more sense.
Or are we going to primarily be cataloging our own findings? In which case a Google Doc may make sense.
If we do a lot of both, I’m not sure what format makes the most sense.
What I envisioned, and thought Jeremy was proposing also, was not so much PITCHf/x-related research or publications about pitchers’ repertoires, but rather reports from primary or secondary scouting sources. That is, either comments from the pitcher himself, or his catcher or coaches or opponents, or interviews with the pitcher, or reports from a scout who personally watched him pitch.
Images of grips can also be very useful as a primary record of a pitcher’s repertoire, but I would probably put that as a second tier because it seems a little harder to integrate that information.
I am thinking too much in the future, guess I better back up.
What we want is a correct database. To get that, we want the pitches classified correctly. Jeremy has done some work and has the pitches grouped for each pitcher, just not classified. I would say what is next is to find out what pitches a pitcher throws, via various methods.
A google doc (spreadsheet) would probably work best for now.
I’d like to help with this as well. A standard methodology along with a list of possible places to find the relevant information would be the first step I think. Then splitting the list of pitchers among the participants.
I would be happy to code up a searchable back and front end for the database, but while we collect I agree a google spreadsheet would work best.
Let me know!
I may be thinking about something different than what the rest of you had in mind, in which case, don’t let me derail you.
But what I would find valuable is a community-sourced repository of scouting information and references about a pitcher. This is something that can be collected over time because pitchers’ repertoires change over time.
I guess I don’t find much value in a community-sourced list of what pitches a pitcher throws. Maybe some other people do, though. At one point, maybe 1.5-2 years ago, I was working toward something like that, but I finally decided it wasn’t feasible, for several reasons. Among them are the fact that a pitcher’s repertoire changes over time, and that is not easily represented in a table/spreadsheet format, and the fact that there is some uncertainty surrounding the identification of a number of pitchers’ pitches, and it’s also difficult to capture that in a table/spreadsheet format. Then you have guys like Roy Oswalt who throws two very distinct types of curveballs, and Zack Greinke who throws a whole spectrum of curveballs. You really need to be able to classify Oswalt’s two curves separately, but I’m not sure you want to do that with Greinke. And there are all sorts of variants like that.
Now, I agree that it is valuable at some point and for quite a bit of analysis to just slap the best label you can on a group of pitches and go with that. But to me that is a secondary step after the primary step of cataloging the more detailed and expansive descriptions of what a pitcher throws, i.e., scouting reports.
Both steps are useful. I am interested in participating in the community effort for the first step, if anyone else also wants to do that. I am not particularly interested in participating in a community effort for the second step, although I certainly do not want to discourage that if others want to do so.
I think the first step, should people want to undertake it, would not be well-served by a Google spreadsheet format.
I do think that scouting reports are useful Mike. I have searched ESPN game recaps on multiple occasions looking for references to a pitch type thrown.
I think at the very least, you could have a “notes” section for each pitcher that people could add to. For example, it appears that Tommy Hanson last year threw exactly 2 twoseam fastballs, both in the same game. I couldn’t find anything to corroborate that for me, and it’s possible it was a PITCHf/x glitch, but it looked very much like he was throwing a different pitch than his usual fourseamer. They had about 6 inches more horizontal movement and 5 inches less “rise”.
I think an open sourced third party scouting report on each pitcher would be great, but I’m not sure why we should limit it to just that.
On a related note, Mike, what do you think of Sven Jenkins?
I often consult his website just to see if there are any oddities in a pitcher’s selection I should know about.
Mike/18
Here’s why I’m keen on getting a list of pitches per pitcher:
http://www.phpclasses.org/blog/post/119-Neural-Networks-in-PHP.html
Basically it is a new framework that allows for (relatively) easy neural net programming. I realize that MLB has some smart folks and we all hate their classification system (based on a neural net), but I have to think that if we were able to compare data for every pitcher against what the community has decided he throws, we could get most of the way there (~80%) for most of the pitchers based just on the neural net.
Combining efforts with http://katron.org/projects/baseball/
and his work on replicating Kaulk could lead to a much more useful pitch f/x system.
So that’s where I’m coming from.
OMG Nick that site is awesome. Thank you for the link.
Jeff, I have classified all pitches in my database.
I’m going to send out an email to those interested. Right now, I believe that is Jeff, Mike, Josh, and Nick. Let me know if you want to help on the project.
I do think that scouting reports are useful Mike. I have searched ESPN game recaps on multiple occasions looking for references to a pitch type thrown.
I think they are more than just useful. They are crucial. But I also seem to be pretty much alone in that viewpoint in the Pfx analysis community.
I think an open sourced third party scouting report on each pitcher would be great, but I’m not sure why we should limit it to just that.
I had quite a number of sources I would incorporate that I listed in post #15. Developing a good list of resources on a pitcher is not easy work.
On a related note, Mike, what do you think of Sven Jenkins?
Read my quote on his site.
Correct me if I’m wrong, but the reason that MLBAM’s pitch classification is lacking is because they have to report the pitch in real-time.
While the rest of us can change a borderline 4-seamer to a borderline 2-seamer depending when we run our data set, MLBAM has no such luxury.
Also, while the rest of us can adjust each game, MLBAM is likely taking on a lot of faith that the operators have the machines calibrated, and/or, they start fresh each game, with just the minimum of priors being used.
That is, I presume pitches in the 5th and 6th inning of a starter are better classified than those in the 1st and 2nd innings.
Am I correct?
That’s my understanding, for what it’s worth. At least the real time portion.
I still don’t think a neural net is a panacea, but I really have no expectation that folks would actually do this kind of pitcher scouting each year, so it seems like the best alternative.
Correct me if I’m wrong, but the reason that MLBAM’s pitch classification is lacking is because they have to report the pitch in real-time.
Reporting in real-time limits what you can do, but there is no reason (well, other than time+money) that the results need to be nearly as bad as BAM’s.
Also, while the rest of us can adjust each game, MLBAM is likely taking on a lot of faith that the operators have the machines calibrated, and/or, they start fresh each game, with just the minimum of priors being used.
That is, I presume pitches in the 5th and 6th inning of a starter are better classified than those in the 1st and 2nd innings.
BAM does use prior info from pitchers, although, IMO, they don’t use it well. Also, there is no reason they have to assume that the camera systems are well calibrated.
BAM does a lot of things that limit their accuracy. Almost none of them are necessary. The neural net is probably the only one that’s “necessary”, and that’s only because they have a guy who’s a wizard at web programming (Ross) doing their pitch classification, and a neural net is something easy for him to pull off. It is not the best approach.
I recently classified all of Tommy Hanson’s pitches by hand going game by game, so I am 99% confident in my classifications of him. Here is GameDay’s accuracy rate grouped by inning:
Inning  Count  Accuracy
1       288    0.9653
2       317    0.9243
3       273    0.9267
4       292    0.9075
5       310    0.9065
6       221    0.9638
7       153    0.9542
8         8    1.0000
So there appears to be no trend for Hanson at least. Hanson, however, was pretty easy to classify (GameDay is nearly 95% on him last year), so that may not be a good test.
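For anyone who wants to build the same kind of inning-by-inning breakdown for another pitcher, here is a minimal Python sketch. It assumes each pitch is stored as an (inning, hand label, GameDay label) tuple; that layout and the sample rows are made up for illustration and are not taken from anyone's actual database.

```python
from collections import defaultdict


def accuracy_by_inning(pitches):
    """pitches: iterable of (inning, hand_label, gameday_label) tuples (hypothetical layout).

    Returns {inning: (pitch_count, share_of_pitches_where_labels_agree)}.
    """
    counts = defaultdict(int)
    hits = defaultdict(int)
    for inning, hand_label, gameday_label in pitches:
        counts[inning] += 1
        if hand_label == gameday_label:
            hits[inning] += 1
    return {inning: (counts[inning], hits[inning] / counts[inning])
            for inning in sorted(counts)}


# Made-up sample rows, only to show the shape of the input.
sample = [
    (1, "FF", "FF"), (1, "SL", "SL"), (1, "CU", "SL"),
    (2, "FF", "FF"), (2, "CU", "CU"),
]
table = accuracy_by_inning(sample)
print("Inning  Count  Accuracy")
for inning, (count, accuracy) in table.items():
    print(f"{inning:<7} {count:<6} {accuracy:.4f}")
```

The hand labels are treated as the truth here, so the number is only as good as the hand classification.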
Mike,
What do you suggest? It may be easy skating for Ross, but it will be a bear for me. I’m not a “real” programmer.
Thanks.
Josh - I would suggest K-Means. It’s very intuitive and pretty easy to code, I think. If there is some way to code it so that you could classify each pitch going game by game instead of just by pitcher, that would make it *way* more accurate. The biggest problems are park effects and the fact that pitchers change the movement and velocity on their pitches from game to game.
You could use Sven’s pitch types to create the number of partitions.
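A rough sketch of what game-by-game K-means could look like in Python, using only the standard library. Everything about the data layout is an assumption: each pitch is a (game id, speed, horizontal break, vertical break) tuple, the number of pitch types k is supplied by hand, and the features are not rescaled, so velocity in mph and break in inches are weighted as-is. It is a plain Lloyd's-algorithm K-means, not anyone's production classifier.

```python
import random
from collections import defaultdict


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen pitches
    labels = None
    for _ in range(iterations):
        # Assign every pitch to its nearest centroid (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its members; leave it alone if empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


def cluster_by_game(pitches, k):
    """pitches: list of (game_id, speed, h_break, v_break) tuples (hypothetical layout).

    Runs k-means separately within each game; games with fewer than k pitches are skipped.
    """
    by_game = defaultdict(list)
    for game_id, *features in pitches:
        by_game[game_id].append(tuple(features))
    return {game: kmeans(points, k)
            for game, points in by_game.items() if len(points) >= k}


# Made-up pitches: three fastball-like and three curve-like in one game.
pitches = [
    ("g1", 92, -5, 10), ("g1", 93, -6, 9), ("g1", 91, -4, 11),
    ("g1", 78, 5, -6), ("g1", 77, 6, -7), ("g1", 79, 4, -5),
    ("g2", 90, -5, 9),
]
result = cluster_by_game(pitches, 2)
```

Clustering within a game sidesteps park calibration differences, but a pitch thrown only a handful of times in one outing may not form its own cluster, so k probably needs to vary by game as well as by pitcher.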
Josh/28, by “easy” I mean what my freshman physics prof called “straightforward but nontrivial”. I’m not a programmer, either. I’m a good copier and hacker.
The concepts of neural nets are fairly easy to grasp, and you don’t have to understand a lot about the intricacies of pitch classification to implement one. That’s why it was probably a good choice for Ross when he was evaluating things in the 2007-2008 offseason. However, I’m not sure those reasons still make it a good choice for BAM today. Nonetheless, they’re not willing to pay for anything more, and Bloomberg is willing to pay them big bucks for their crappy classifications and dress them up in pretty graphs, so why improve?
There are other classifiers that are more complex to implement, which takes some reading and some mathematical understanding. (Do a Google search on pattern classifiers, pattern recognition, machine learning, etc., to find some places to start. The Wikipedia articles on these topics are actually a good place to begin for a basic grasp.) You will also do better the more you understand about pitching, and that is a subject that takes a lot of continuing effort to learn. Then melding the two will ultimately get you somewhere productive.
I’m not sure that’s the question you were asking, so feel free to restate.
I’m critical of BAM because they haven’t done better with so much money and three years at their disposal. I’m not critical of Ross personally or what he has produced because frankly it is pretty neat. I’m also not critical of individuals who are trying this as a hobby. But a professional league should be able to do better than what one guy can hack together in three years. IMO, they should have either given Ross some additional resources to make his system much better or let Ross do what he does best and hire someone else to build a pitch classification system.
Yes, Nick has a good suggestion. K-means is definitely a better place to start than neural nets. K-means doesn’t work real-time, but for post-hoc it is a very good place to start.
I tend to get focused on the real-time classification problem, and for that I think you need a machine learning approach to be really successful, but that’s a different beast than post-hoc.
K-means it is!
And as an aside, a professional league shouldn’t stop commissioning an injury report, or fail to do any basic research on player injuries out of fear of embarrassment or upsetting the players union, either. And yet here we are. But that’s a different rant.
Combining efforts with http://katron.org/projects/baseball/
and his work on replicating Kaulk could lead to a much more useful pitch f/x system.
It’s certainly going to take a while. It would be useful to know number of pitches used for algorithmic classification. I was planning on devising an algorithm to guess the number of pitches but a crowd-sourced version is good.
Sean,
Assuming I get permission from Sven, I should have a full set of pitch numbers for all pitchers referenced by mlbID in a couple of days. I’ll post the download on blog.rotobase.com.
K-means script is working. Currently set to grab each pitcher’s game, run to find the centroids and output arrays of the pitch groups. Missing the number of pitches per pitcher but that is coming.
What I’m currently struggling with conceptually is a way to keep pitchIDs attached while pitches go through the K-means blender.
Any suggestions would be super.
Nevermind. Got it worked out.
Josh - not everyone has the same Pitch_id in their database because they are just done by auto-increment. Can you use the sv_ids instead?
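One way to keep an identifier attached through the clustering step is to pull the ids and the feature rows into two parallel lists, cluster only the features, and zip the labels back onto the ids afterwards. The Python sketch below assumes each pitch is a dict with sv_id, start_speed, pfx_x and pfx_z keys, and that the clustering function returns one label per row in input order. The stand-in clusterer and the sample values are invented for illustration.

```python
def label_by_sv_id(pitches, cluster_fn):
    """Return {sv_id: cluster_label} without ever passing ids into the clusterer.

    pitches: list of dicts with 'sv_id', 'start_speed', 'pfx_x', 'pfx_z' (assumed layout).
    cluster_fn: takes a list of feature tuples, returns one label per row, same order.
    """
    ids = [p["sv_id"] for p in pitches]
    rows = [(p["start_speed"], p["pfx_x"], p["pfx_z"]) for p in pitches]
    labels = cluster_fn(rows)
    if len(labels) != len(ids):
        raise ValueError("clusterer must return exactly one label per pitch")
    return dict(zip(ids, labels))


def speed_split(rows):
    """Stand-in for a real clusterer: splits on velocity only, for illustration."""
    return [0 if speed >= 85 else 1 for speed, _, _ in rows]


# Made-up pitches with illustrative sv_id values.
pitches = [
    {"sv_id": "090405_191530", "start_speed": 92.1, "pfx_x": -5.0, "pfx_z": 10.2},
    {"sv_id": "090405_191555", "start_speed": 77.4, "pfx_x": 5.5, "pfx_z": -6.1},
    {"sv_id": "090405_191621", "start_speed": 93.0, "pfx_x": -4.2, "pfx_z": 9.8},
]
labels_by_id = label_by_sv_id(pitches, speed_split)
```

If sv_id alone turns out not to be unique in your data, keying on a (game id, sv_id) pair instead is a small change to the same pattern.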
So RZ just posted a cluster in R tutorial at BtB
http://www.beyondtheboxscore.com/2010/3/9/1365592/advanced-graphing-techniques-part
I’ve been messing around with it for the last couple of hours, and as far as I can tell it seems to be very good. For Tommy Hanson, the K-Means classifications were 98.7% correct compared to my own game by game ones, and GameDay’s were 93% correct. For Jarrod Washburn, whose pitches blend so much that you pretty much need to go game by game to get it right, K-Means is 90% correct on a sample of 487 pitches and GameDay is 68% correct (ugh Mike). And Washburn is likely the hardest case you are going to get.
However, the biggest problem is establishing how many clusters there are for each pitcher. Sven’s website should provide a good starting point, but just looking at some pitchers, he doesn’t appear to separate fourseamers and twoseamers well. For example, for AJ Burnett, he has him throwing a fastball, changeup and curveball. However, Burnett clearly throws two types of fastballs - a fourseamer and a twoseamer (similar velocity, but drastically different movement). So I’m not sure how useful Sven’s pitch descriptions are.
So this is where the scouting/pitch quote/grip spreadsheet would be very useful.
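For comparing cluster output against hand classifications, one simple approach (not necessarily how the percentages above were computed) is to give each cluster the hand label that is most common among its pitches and then count how many pitches agree. A minimal Python sketch, with made-up labels:

```python
from collections import Counter, defaultdict


def cluster_agreement(cluster_labels, hand_labels):
    """Map each cluster to its most common hand label, then score agreement.

    Returns (mapping, share_of_pitches_whose_mapped_label_matches_the_hand_label).
    """
    by_cluster = defaultdict(Counter)
    for cluster, hand in zip(cluster_labels, hand_labels):
        by_cluster[cluster][hand] += 1
    mapping = {cluster: counts.most_common(1)[0][0]
               for cluster, counts in by_cluster.items()}
    correct = sum(mapping[cluster] == hand
                  for cluster, hand in zip(cluster_labels, hand_labels))
    return mapping, correct / len(hand_labels)


# Made-up example: six pitches, two clusters, one disagreement.
clusters = [0, 0, 0, 1, 1, 1]
hand = ["FF", "FF", "CU", "CU", "CU", "CU"]
mapping, agreement = cluster_agreement(clusters, hand)
```

This scoring is generous when the number of clusters is too high, because two clusters can both map to the same pitch type, so it says nothing about whether the cluster count itself is right.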
I’m pretty sure I’ve used K-means and it was pretty good. IIRC, using K-means you have to specify the number of clusters. Overall, it’s done well, but I have gotten some weird results sometimes. I think that’s because I did it as a whole and not on a game by game basis. I’m sure there are stadiums that aren’t calibrated correctly, so maybe game by game is better.
John, Mike, Nick,
K-means doesn’t look to be much of an option for automated pitch classification going forward sadly.
However, the link posted above by Nick shows a simple method to use R to output hand-crafted pitch classifications. If we built a database of those together as a community, I could use it to classify new data by matching pitches that fit a known profile.
If there isn’t interest I’ll do all the pitchers I have data for myself as the year progresses.
The problem with K-means is that the output must still be mapped to a pitch, since K-means or any algorithm I can construct will not do a very good job at automatically assigning pitch types to array groups.
As usual I’ll make any work I do available for download.
And Corey, if you’re out there - if you find the work useful, please by all means take it, with my blessing.
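The "match against a known profile" idea could be sketched roughly like this in Python: each cluster centroid gets the name of the nearest hand-built profile for that pitcher. The profile values, the pitch abbreviations and the (speed, horizontal break, vertical break) feature order are all invented for illustration; real profiles would come from the community database.

```python
import math


def name_clusters(centroids, profiles):
    """Give each centroid the name of the closest known pitch profile.

    centroids: list of (speed, h_break, v_break) tuples from a clustering step.
    profiles: {pitch_name: (speed, h_break, v_break)} built by hand (hypothetical values).
    """
    return [
        min(profiles, key=lambda name: math.dist(centroid, profiles[name]))
        for centroid in centroids
    ]


# Invented profile for one pitcher.
profiles = {
    "FF": (92.0, -4.0, 10.0),
    "SL": (84.0, 3.0, 1.0),
    "CU": (77.0, 5.0, -6.0),
}
centroids = [(91.5, -5.0, 9.5), (78.0, 4.0, -5.0)]
names = name_clusters(centroids, profiles)
```

Two centroids can land on the same profile name, which may be a hint that the cluster count is too high for that game or that the profile is missing a pitch.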
Time to add one. Here’s Roy Halladay’s page:
http://www.baseballprojection.com/2010/pitchfx/pitcher136880.htm