Wednesday, June 11, 2008
Technique for clustering pitches from PITCHf/x
Someone asked, so feel free to use this thread for that sole purpose…
I’ve downloaded the data, and I’ve parsed 2007, but I haven’t done any more than a few test queries attempting to validate that everything’s working and I know what I’m doing.
From what I’ve seen on various sites, discussions here, and specifically the one about how MLBAM was designing the neural net to classify pitches in real time, I would
1. Find the fastest speeds for each pitcher. This should be a fastball and be consistent with the mean horizontal movement of a fastball across all pitchers. This is the ceiling, and can go about 5 mph down.
I would like to have an algorithm that can run as part of the parsing, or as a sql query, so that my database is populated with the pitch types from the beginning.
2. Find the slowest speeds for each pitcher that have the same horizontal movement as their fastball. This should be a changeup. This is the floor, and can go about 5 mph up.
3. Find for each pitcher the slowest pitches with the most break, both vertical and horizontal. This should be the curve.
4. Find for each pitcher the pitches that have the same horizontal break as their curve, but more speed and less vertical drop. This should be a slider.
This covers four basic pitches, without going into 2-seam, 4-seam, etc. It would need tweaking for some individual pitchers (12-6 curves, for example).
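For anyone who wants to poke at these four rules, here is a rough sketch in Python. It is not a tested classifier, just one way the rules could be read. The field names (`speed`, `hmov`, `vmov`), the 3-inch movement tolerance `tol`, and the sign convention (larger `vmov` means less drop) are all assumptions made for illustration; only the roughly 5 mph band comes from the rules themselves.

```python
from math import hypot
from statistics import mean


def classify_by_rules(pitches, band=5.0, tol=3.0):
    """Label ONE pitcher's pitches with the four rules.

    pitches: list of dicts with 'speed' (mph), 'hmov' and 'vmov' (inches).
    band: speed window in mph (the "about 5 mph" from the rules).
    tol:  hypothetical tolerance for "same movement", in inches.
    Assumes a larger 'vmov' means less vertical drop.
    """
    labels = [None] * len(pitches)

    def unlabeled():
        return [i for i, lab in enumerate(labels) if lab is None]

    # Rule 1: anything within `band` mph of the top speed is a fastball.
    top = max(p["speed"] for p in pitches)
    fastballs = [i for i in unlabeled() if pitches[i]["speed"] >= top - band]
    for i in fastballs:
        labels[i] = "fastball"
    fb_h = mean(pitches[i]["hmov"] for i in fastballs)

    # Rule 2: slow pitches that move sideways like the fastball are changeups.
    like_fb = [i for i in unlabeled() if abs(pitches[i]["hmov"] - fb_h) <= tol]
    if like_fb:
        floor = min(pitches[i]["speed"] for i in like_fb)
        for i in like_fb:
            if pitches[i]["speed"] <= floor + band:
                labels[i] = "changeup"

    # Rule 3: among the slowest remaining pitches, anchor the curve on the
    # one with the most total break and take everything that moves like it.
    rest = unlabeled()
    if not rest:
        return labels
    slowest = min(pitches[i]["speed"] for i in rest)
    slow = [i for i in rest if pitches[i]["speed"] <= slowest + band]
    anchor = pitches[max(slow, key=lambda i: hypot(pitches[i]["hmov"],
                                                   pitches[i]["vmov"]))]
    curves = [i for i in slow
              if abs(pitches[i]["hmov"] - anchor["hmov"]) <= tol
              and abs(pitches[i]["vmov"] - anchor["vmov"]) <= tol]
    for i in curves:
        labels[i] = "curve"
    cu_h = mean(pitches[i]["hmov"] for i in curves)
    cu_v = mean(pitches[i]["vmov"] for i in curves)
    cu_speed = mean(pitches[i]["speed"] for i in curves)

    # Rule 4: same horizontal break as the curve, faster, less drop.
    for i in unlabeled():
        p = pitches[i]
        if (abs(p["hmov"] - cu_h) <= tol and p["speed"] > cu_speed
                and p["vmov"] > cu_v):
            labels[i] = "slider"
    return labels
```

Anything the rules do not catch stays `None`, which is probably the honest answer for pitches that need a human look.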
Up to #3, I’m with you all the way. I think this is exactly how I would do it (though, as others will point out, we need to calibrate the numbers on a game-by-game basis because of equipment setup).
For sliders and sinkers and cutters, I’d probably have to study it more.
And of course, you need to preflag the knuckle pitchers.
Actually, not necessarily horizontal movement, but possibly spin axis and rpm.
It might be best to preflag screwball pitchers as well, else they could be called curve balls in Brian’s post above. Josh’s algorithm is calling them change ups for Danny Herrera, but if you look at the graph they are clearly distinguishable.
I’m thinking it’s probably necessary to set up a many-to-many table linking pitchers with types of pitches they throw, and then in the pitching table list characteristics of the type of pitches.
I’ve been spending time staring at Josh’s pitchers cards to see the patterns that distinguish each pitch type.
Mike Fast had a really cool chart, which Dan Brooks I think then used in his wiki. Search my blog, and you’ll find it.
Here you go:
http://www.sonsofsamhorn.net/wiki/index.php/Pitchfx#It.27s_All_About_the_Break
Ha. I totally missed this thread, even after asking for it. We’re dealing with some insane floodwaters in Iowa City right now.
I’ve set up the subscribe feature for it and I’ll see if I can’t get some others who do this involved.
And, there seem to be some problems with doing this with hierarchical clustering, namely that there is so much between-pitcher variability. But, I am not an expert on that kind of method, I just know a little about it.
1. Find the fastest speeds for each pitcher. This should be a fastball and be consistent with the mean horizontal movement of a fastball across all pitchers. This is the ceiling, and can go about 5 mph down.
I would like to have an algorithm that can run as part of the parsing, or as a sql query, so that my database is populated with the pitch types from the beginning.
2. Find the slowest speeds for each pitcher that have the same horizontal movement as their fastball. This should be a changeup. This is the floor, and can go about 5 mph up.
3. Find for each pitcher the slowest pitches with the most break, both vertical and horizontal. This should be the curve.
4. Find for each pitcher the pitches that have the same horizontal break as their curve, but more speed and less vertical drop. This should be a slider.
I’m on board, and I think you could do this for a lot of pitches.
I’d actually start with a fastball and curveball - if the pitcher throws it, these will be by far the two easiest pitches to identify. Of course, if a guy throws a fastball and a slider, then that slider is going to look tantalizingly like a curveball to this type of analysis in the absence of a curveball to anchor it (if this makes sense).
But the biggest problem with rules like this is that, for the most part, it’s not the simple pitches, like fastball and curveball, that give algorithms like the MLBAM algorithm trouble. We don’t really need to improve upon the identification of those pitches. What we need is to improve upon the identification of everything else, where changeups are being called fastballs and cutters are randomly dropped into whatever category wins a coin toss and every fastball is just a fastball.
For example, identifying a cutter with rules like this is an absolute nightmare, especially when the same guy throws a slider, a curve, and 2 fastballs. Which one of the clusters is a cutter? What are the rules for identifying it? Identifying the change in the way you describe will probably also mislabel a lot of changeups as sliders or cutters (even though they are really straight-changes). And identifying the slider the way you describe will work great, except when the guy throws no curveball to compare it to.
Do we need a repertoire for each pitcher to make a good guess at what he’s throwing? I feel like we do. But even that’s tricky. Repertoires change - look at the much dissected Joba Chamberlain, who threw what really look like three 2-seam fastballs in one start and then threw three changeups the next. I’ve been doing some analysis of Jon Lester over at SoSH and he’s finally gotten back to throwing two pitches that he used to have in the minors but didn’t throw last year. If we used his old repertoire to classify his new pitches we’d be toast.
Honestly, I’m not trying to be critical or annoying, so I hope I don’t come across this way. =) I’m not trying to be a pessimist. But while the “rule based” approach (rather than the Neural Net or Cluster analysis approach) has great strengths, I also feel like it has great weaknesses. Maybe a combination of the 3 would be best.
-Dan
Dan - thanks for the comments.
I have the data and the programming skills, but no algorithm, although others do.
I threw those rules out there as a starting point. I am hoping that this thread is where we can hash out some ideas.
I am not familiar with Neural Net or Cluster Analysis. Are you able to describe how they work and would be programmed?
Brian, since I haven’t implemented an algorithm, I can’t really give you any direct pointers. But in all honesty, a good place to start with any of these algorithms is Wikipedia. They have a pretty good rundown of several clustering algorithms.
After studying the Wikipedia articles, I believe Neural Networks are still just a little bit above my level of comprehension, but on the other hand I think I have a pretty good grasp of Cluster Analysis.
The main thing I got out of the reading on Cluster Analysis is that clusters are defined in n dimensions by the centroid (center) of the cluster, and the distance each point is from the centroid, which can be represented by the mean and standard deviation.
If we use as parameters
1. Speed
2. Horizontal movement
3. Vertical movement
4. Spin
For each type of pitch (and by each pitcher) we can calculate from the known pitches already recorded the mean and variance of these four parameters. For each new pitch, see how many SDs that pitch is from each pitch type cluster. This would be very much like doing lists of comparable batters based on different component stats.
Any new clusters created by new data, or just unclassified pitches could then be reviewed manually to determine the identity (hanging curve?) and then “teach” the system accordingly.
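A minimal sketch of that mean-and-SD comparison, assuming pitches are plain dicts keyed by the four parameters. The key names, the `max_sd` cutoff of 3, and the choice to combine the per-parameter SD counts into a single Euclidean distance are my assumptions, not something settled in this thread. It also treats the four parameters as independent, which real pitch data probably is not.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical key names for the four parameters listed above.
FEATURES = ("speed", "hmov", "vmov", "spin")


def fit_clusters(labeled):
    """labeled: list of (pitch_type, pitch_dict) for ONE pitcher.

    Returns {pitch_type: {feature: (mean, sd)}}.
    Needs at least two pitches per type, since stdev() requires that.
    """
    by_type = {}
    for ptype, pitch in labeled:
        by_type.setdefault(ptype, []).append(pitch)
    return {
        ptype: {f: (mean([p[f] for p in ps]), stdev([p[f] for p in ps]))
                for f in FEATURES}
        for ptype, ps in by_type.items()
    }


def sd_distance(pitch, stats):
    """How far a pitch is from a cluster center, measured in SDs."""
    total = 0.0
    for f, (m, sd) in stats.items():
        sd = max(sd, 1e-9)  # avoid dividing by zero on a constant feature
        total += ((pitch[f] - m) / sd) ** 2
    return sqrt(total)


def classify(pitch, clusters, max_sd=3.0):
    """Return the nearest pitch type, or None if nothing is within max_sd."""
    best_type, best = min(
        ((t, sd_distance(pitch, s)) for t, s in clusters.items()),
        key=lambda pair: pair[1],
    )
    return best_type if best <= max_sd else None
```

A `None` result is the "unclassified" bucket: those are the pitches to review by hand and then add back to the labeled set before refitting.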
I think this can be done in SQL, with a table of pitch types containing a description of each, a table of pitchers, and a many-to-many join table between the two where each pitcher’s pitch type can be described.
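The three-table layout might look something like this. I am using Python's built-in `sqlite3` with an in-memory database purely so the sketch runs anywhere; the same idea should carry over to MySQL with minor syntax changes. Every table and column name here is made up for illustration.

```python
import sqlite3

# Hypothetical schema: pitch types, pitchers, and a many-to-many join
# table that holds each pitcher's per-type characteristics.
SCHEMA = """
CREATE TABLE pitch_types (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE,
    description TEXT
);
CREATE TABLE pitchers (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE pitcher_pitch_types (
    pitcher_id    INTEGER NOT NULL REFERENCES pitchers(id),
    pitch_type_id INTEGER NOT NULL REFERENCES pitch_types(id),
    mean_speed REAL, sd_speed REAL,
    mean_hmov  REAL, sd_hmov  REAL,
    mean_vmov  REAL, sd_vmov  REAL,
    PRIMARY KEY (pitcher_id, pitch_type_id)
);
"""


def build_db():
    """Create an empty in-memory database with the schema above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def repertoire(conn, pitcher_id):
    """List a pitcher's pitch types with mean speed, fastest first."""
    return conn.execute(
        """
        SELECT t.name, j.mean_speed
        FROM pitcher_pitch_types AS j
        JOIN pitch_types AS t ON t.id = j.pitch_type_id
        WHERE j.pitcher_id = ?
        ORDER BY j.mean_speed DESC
        """,
        (pitcher_id,),
    ).fetchall()
```

The composite primary key keeps one row per pitcher and pitch type, so updating a repertoire when a pitcher adds or drops a pitch is a single insert or delete.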
MLBAM’s system is set up to recognize the first pitch of the season with no knowledge of an individual pitcher, although we have nearly a full season of past data available to batch process and “train” the system for each pitcher.
On the parameters, aren’t speed, spin angle, and spin magnitude all that are needed? That pretty much captures the essence of each pitch. Release point could be looked at too, as some pitchers use different release points.
I’ve been trying to figure out where to jump in on this thread. I have a lot of different thoughts on pitch classification. The main one probably is that it is as much an art as a science if you want to do it well. The second one is that how you do pitch classification depends on what you want to do with the data.
Brian, you should read the two pitch classification presentations that were made at the PITCHf/x summit, if you haven’t already, one by Ross Paul and one by me. They can be found here:
http://www.sportvision.com/events/pfx.html
Then there’s this article that I wrote about seven months ago:
http://mvn.com/mlb-stats/2007/12/22/can-we-classify-every-pitch/
A few parts of it are outdated, but it’s still a fair summary of what I think about the subject.
Ross Paul has made some posts here on the topic. You should track down those threads.
I have a lot of thoughts on the topic. It’s basically my main area of interest and something I’ve been thinking about for the last ten months. It’s not easy (or perhaps even possible) to distill that down into a few simple rules. My suggestion would be to try manually classifying the data for one pitcher and then a handful of pitchers. That will give you a much better feel for the task, and then you will be able to ask better questions.
Also, you have not talked about what your purpose is with the data, which makes a huge difference in what approach you should use.
Those could work too. I came up with those from looking at the charts that Mike Fast and Josh Kalk have posted, and seeing which combinations best graphically define the differences between pitches. I already have all of 2007 parsed into MySQL, but haven’t done anything with it yet. As soon as I can I’ll try to get something coded and see what works best.
Hierarchical clustering? If I had time to get the PFX database, that’s what I’d use.
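For reference, a bare-bones sketch of what hierarchical (agglomerative) clustering could look like on one pitcher's pitches. This version uses centroid linkage and stops merging once the two closest clusters are more than `max_gap` apart. Both of those choices, and the idea of feeding it raw (speed, horizontal movement, vertical movement) tuples without rescaling, are assumptions for illustration. Running it one pitcher at a time is one way to sidestep the between-pitcher variability mentioned earlier in the thread.

```python
from math import dist


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def agglomerate(points, max_gap):
    """Merge the two closest clusters until the gap exceeds max_gap.

    points:  list of numeric tuples, e.g. (speed, hmov, vmov).
    max_gap: hypothetical stopping distance, in the same units as the data.
    Returns a list of clusters, each a list of the original points.
    """
    clusters = [[p] for p in points]  # start with every pitch on its own
    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are closest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_gap:
            break  # remaining clusters are all far apart; stop merging
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

This is the slow, obvious version, so it is only practical for a few hundred pitches at a time. It also says nothing about which cluster is which pitch; that naming step still needs rules or a human.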