Tuesday, June 28, 2011
Clustering and pitch/fx
Jimmy:
Let’s say you want to identify clusters in two-dimensional data. You can do this using a clustering algorithm such as k-means or soft k-means. In a nutshell, what this does is take an initial set of means (chosen however), evaluate the distance from each data point to each of the means using some distance metric, and then assign each data point to its closest mean. Then it re-evaluates each mean as the average of the points currently assigned to it and steps through the process again, until it converges and you have the data grouped into “k” clusters.
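To make the loop concrete, here is a minimal sketch of plain (hard) k-means in Python with NumPy. The function name `kmeans` and its parameters are my own choices for illustration, it assumes Euclidean distance, and it takes the initial means as an argument since they can be chosen however you like.

```python
import numpy as np

def kmeans(points, init_means, max_iter=100):
    """Plain k-means with Euclidean distance.

    points:     array of shape (n, d)
    init_means: array of shape (k, d), the starting means (chosen however)
    Returns (means, labels).
    """
    points = np.asarray(points, dtype=float)
    means = np.asarray(init_means, dtype=float).copy()
    k = len(means)
    for _ in range(max_iter):
        # Distance from every point to every mean, shape (n, k)
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        # Assign each point to its closest mean
        labels = dists.argmin(axis=1)
        # Re-evaluate each mean from its assigned points
        # (a mean with no points assigned is left where it is)
        new_means = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)
        ])
        # Stop once the means no longer move
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels
```

Keep in mind that k-means only finds a local optimum, so a poor choice of initial means can give a poor grouping. Running it from several starting points and keeping the tightest result is a common workaround.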
So this helps with grouping the data points, but let’s say you wanted to go a little bit further. What you can do is run the initial algorithm to find the means and cluster assignments, and then impose the assumption that each cluster is distributed around its mean (which you just found) according to a bivariate normal distribution. Then you use maximum likelihood (ML) to estimate the variance parameters of the bivariate normal for each cluster, which can differ from cluster to cluster. You can allow a different variance in each direction to account for clusters that aren’t spherical. Once you have those parameters, you have a quantitative measure of how spread out each cluster is.
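For a bivariate normal the ML estimates have a closed form: the mean is the sample mean of the cluster, and the covariance is the average outer product of the deviations from that mean (dividing by n, not n - 1). A rough sketch, where `fit_clusters` and the layout of the returned dictionary are just names I made up:

```python
import numpy as np

def fit_clusters(points, labels):
    """ML fit of a bivariate normal to each cluster.

    points: array of shape (n, 2)
    labels: array of shape (n,), the cluster assignment of each point
    Returns {cluster_id: {"mean": ..., "cov": ..., "n": ...}}.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    fits = {}
    for j in np.unique(labels):
        cluster = points[labels == j]
        mean = cluster.mean(axis=0)
        # bias=True divides by n, which is the ML estimate
        cov = np.cov(cluster, rowvar=False, bias=True)
        fits[int(j)] = {"mean": mean, "cov": cov, "n": len(cluster)}
    return fits
```

The diagonal of each `cov` holds the variance in each direction, and the off-diagonal term captures any tilt in the cluster. If you only want a separate variance per direction, you can ignore the off-diagonal term.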
To relate this to baseball, assume the two-dimensional data we have is horizontal and vertical pitch movement, and assume that the pitcher in question has three pitches: 4-seam FB, slider, and a curve. Presumably these three pitches will form three distinct clusters when graphed. We run the k-means algorithm to identify which pitch is which (i.e. assign clusters), and then we fit each cluster to a bivariate normal distribution by ML. Then we have the variance of each cluster. Then we can compare the variance (i.e. the consistency) of each pitch’s movement relative to the other pitches, or compare it amongst pitchers with the same type of pitch. And we can track it from game to game, season to season, etcetera, so that we can say that “oh, Erik Bedard’s control of his CB has really improved this season relative to last” with some quantitative oomph rather than with simple visual evidence.
And there are a lot of other advantages to this too besides just getting the point estimate of the variance. We can also get the variance of the point estimate itself to quantify how accurate we think our estimate of that variance is. We can use the bivariate fit in real time, with Bayesian updating to improve the accuracy of the pitch/fx system itself (in identifying pitch type). There are a lot of places to go from here.
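Here is a rough sketch of both ideas. The first function uses the standard large-sample result that, under normality, the standard error of an ML variance estimate is about the estimate times sqrt(2/n). The second applies Bayes' rule to a single new pitch: posterior probability of each pitch type is proportional to prior times bivariate normal likelihood. The names `variance_std_error` and `pitch_posterior`, and the `fits` / `priors` dictionaries, are hypothetical and only for illustration. Nothing here reflects how the pitch/fx system itself classifies pitches.

```python
import numpy as np

def variance_std_error(var_hat, n):
    """Approximate (large-sample) standard error of an ML variance estimate,
    assuming the data are normally distributed."""
    return var_hat * np.sqrt(2.0 / n)

def pitch_posterior(x, fits, priors):
    """Posterior probability of each pitch type for one observation x.

    x:      array of shape (2,), e.g. horizontal and vertical movement
    fits:   {name: {"mean": shape (2,), "cov": shape (2, 2)}}
    priors: {name: prior probability}, e.g. how often each pitch is thrown
    """
    x = np.asarray(x, dtype=float)
    names = list(fits)
    log_post = []
    for name in names:
        mean = np.asarray(fits[name]["mean"], dtype=float)
        cov = np.asarray(fits[name]["cov"], dtype=float)
        diff = x - mean
        # Squared Mahalanobis distance and log-determinant of the covariance
        maha = diff @ np.linalg.solve(cov, diff)
        _, logdet = np.linalg.slogdet(cov)
        # Log-density of a bivariate normal (dimension 2)
        log_like = -0.5 * (maha + logdet + 2.0 * np.log(2.0 * np.pi))
        log_post.append(np.log(priors[name]) + log_like)
    log_post = np.array(log_post)
    # Normalise in a numerically stable way
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return dict(zip(names, post))
```

To update over time, you could feed the posterior back in: use the running pitch-type frequencies as the priors, and refit the means and covariances as new pitches are classified.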
I also hear you on the problem with noisy data. That is a universal issue, but there exist a lot of ways to deal with it. I’ve heard of people transforming the data with principal components analysis first (which is a sort of clustering algorithm in itself… kinda) and then running the k-means on the transformed data to get better clustering fits. There are also lots of other improvements upon the plain vanilla k-means algorithm for dealing with tough data. I’m sure there is literature on this stuff somewhere… but I should really shut up because I don’t understand the pitch/fx system too well.
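As a sketch of the PCA step, the function below projects the data onto its top principal components using an SVD. The name `pca_transform` is made up for this example. One caveat: with Euclidean distance, keeping all the components just rotates the data, which does not change what k-means finds. Any benefit comes from dropping the low-variance (presumably noisy) components, so this is most useful when you start with more than two measurements per pitch.

```python
import numpy as np

def pca_transform(points, n_components):
    """Project points onto their top principal components.

    points: array of shape (n, d)
    Returns (scores, explained_var), where scores has shape
    (n, n_components) and explained_var is the variance along each
    kept component.
    """
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    # Rows of `components` are the principal directions,
    # ordered from largest to smallest variance
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ components[:n_components].T
    explained_var = singular_values[:n_components] ** 2 / len(points)
    return scores, explained_var
```

The `scores` array can then be passed to whatever k-means routine you are using in place of the raw data.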
If you’re feeling adventurous, I recommend chapters 20 and 22 of this book as an intro to the stuff I’m talking about: http://www.inference.phy.cam.ac.uk/mackay/itprnn/ps/


I was doing some research on clustering last week in search of a Master’s thesis topic and thought something like http://www4.ncsu.edu/~cdwessel/NSMC%20Talk.pdf might be interesting to use on pitch f/x data.