THE BOOK--Playing The Percentages In Baseball


Tuesday, June 28, 2011

Clustering and pitch/fx

By Tangotiger, 07:36 PM

Jimmy:

Let’s say you want to identify clusters in two-dimensional data. You can do this using a clustering algorithm such as k-means or soft k-means. In a nutshell, the algorithm takes an initial set of means (chosen however), evaluates the distance from each data point to each of the means using some distance metric, and assigns each data point to its closest mean. Then it re-evaluates the means given the current assignment and steps through the process again until it converges, at which point you have the data grouped into “k” clusters.
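
The loop described above can be sketched in a few lines of Python. This is a minimal illustration of plain (hard) k-means, not anyone's production pitch classifier. It assumes Euclidean distance and picks k random data points as the initial means; the function name `kmeans` and its parameters are made up for this example.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain (hard) k-means on 2-D points; returns (means, labels)."""
    rng = random.Random(seed)
    means = rng.sample(points, k)   # assumption: start from k random data points
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest mean (Euclidean distance).
        new_labels = [min(range(k), key=lambda j: math.dist(p, means[j]))
                      for p in points]
        if new_labels == labels:    # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: each mean becomes the centroid of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:             # keep the old mean if a cluster empties out
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, labels
```

With more than two clusters, the result can depend on the starting means, so in practice one would likely rerun from several seeds and keep the tightest grouping.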

So this helps with grouping the data points, but let’s say you wanted to go a little bit further. What you can do is run the initial algorithm to find the means and cluster assignments, and then impose the assumption that each cluster is distributed around its mean (which you just found) according to a bivariate normal distribution. Then you use maximum likelihood (ML) to find the variance parameters of the bivariate normal for each cluster, which may vary for each cluster. You can assume different variance in each direction to account for clusters that aren’t spherical. Then once you have those parameters, you have the variance of each cluster.
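
For a normal distribution the ML estimates have a closed form: the sample mean, and the sample covariance divided by n rather than n - 1. Here is a minimal sketch, assuming the cluster labels are already known. The function names are hypothetical, and the covariance term is kept so that tilted, non-spherical clusters are handled too.

```python
def fit_bivariate_normal(points):
    """ML fit of a bivariate normal to 2-D points.

    Returns ((mx, my), (sxx, sxy, syy)). The ML estimate divides by n,
    not n - 1, so it is slightly biased for small clusters.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return (mx, my), (sxx, sxy, syy)

def fit_clusters(points, labels):
    """Group points by cluster label and fit one bivariate normal per cluster."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return {lab: fit_bivariate_normal(members) for lab, members in clusters.items()}
```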

To relate this to baseball, assume the two-dimensional data we have is horizontal and vertical pitch movement, and assume that the pitcher in question has three pitches: 4-seam FB, slider, and a curve. Presumably these three pitches will form three distinct clusters when graphed. We run the k-means algorithm to identify which pitch is which (i.e. assign clusters), and then we fit each cluster to a bivariate normal distribution by ML. Then we have the variance of each cluster. Then we can compare the variance (i.e. the consistency) of each pitch’s movement relative to the other pitches, or compare it amongst pitchers with the same type of pitch. And we can track it from game to game, season to season, etcetera, so that we can say that “oh, Erik Bedard’s control of his CB has really improved this season relative to last” with some quantitative oomph rather than with simple visual evidence.
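
One way to boil each fitted covariance down to a single comparable number is sketched below. It takes the covariance as a `(sxx, sxy, syy)` tuple, which is a layout assumed for this example. The two summaries used, total variance (the trace) and generalized variance (the determinant), are standard choices, but the source does not say which measure of spread to compare.

```python
def spread_summary(cov):
    """Return (total variance, generalized variance) for a 2x2 covariance.

    cov is (sxx, sxy, syy). Total variance is the trace; generalized
    variance is the determinant, which scales with the squared area of
    the cluster's ellipse.
    """
    sxx, sxy, syy = cov
    total = sxx + syy
    generalized = sxx * syy - sxy ** 2
    return total, generalized

def relative_change(old_cov, new_cov):
    """Fractional change in total variance, e.g. last season vs. this season."""
    old_total, _ = spread_summary(old_cov)
    new_total, _ = spread_summary(new_cov)
    return (new_total - old_total) / old_total
```

A negative `relative_change` would mean the pitch's movement has become tighter. Whether that change is real or just sampling noise is a separate question.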

And there are a lot of other advantages to this too besides just getting the point estimate of the variance. We can also get the variance of the point estimate itself to quantify how accurate we think our estimate of that variance is. We can use the bivariate fit in real time, with Bayesian updating to improve the accuracy of the pitch/fx system itself (in identifying pitch type). There are a lot of places to go from here.
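
The pitch-identification idea comes down to Bayes' rule: weight each pitch type's fitted density at the new pitch by a prior, then normalize. Below is a minimal sketch of a single such step. The function names, the `(sxx, sxy, syy)` covariance layout, and the choice of priors (for instance, a pitcher's usage rates) are all assumptions made for illustration.

```python
import math

def bivariate_normal_pdf(p, mean, cov):
    """Density of a bivariate normal at point p; cov is (sxx, sxy, syy)."""
    (x, y), (mx, my), (sxx, sxy, syy) = p, mean, cov
    det = sxx * syy - sxy ** 2
    dx, dy = x - mx, y - my
    # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    m2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * m2) / (2.0 * math.pi * math.sqrt(det))

def pitch_type_posterior(p, fits, priors):
    """Posterior probability of each pitch type for one observed pitch.

    fits:   {name: (mean, cov)} from the per-cluster ML fits
    priors: {name: prior probability}
    """
    weighted = {name: priors[name] * bivariate_normal_pdf(p, *fits[name])
                for name in fits}
    total = sum(weighted.values())
    return {name: w / total for name, w in weighted.items()}
```

For real-time updating, the posterior after each pitch could feed back into the priors or the fits. That feedback step is not shown here.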

I also hear you on the problem with noisy data. That is a universal issue, but there exist a lot of ways to deal with it. I’ve heard of people transforming the data with principal components analysis first (which is a sort of clustering algorithm in itself… kinda) and then running the k-means on the transformed data to get better clustering fits. There are also lots of other improvements upon the plain vanilla k-means algorithm for dealing with tough data. I’m sure there is literature on this stuff somewhere… but I should really shut up because I don’t understand the pitch/fx system too well.

If you’re feeling adventurous, I recommend chapters 20 and 22 of this book as an intro to the stuff I’m talking about: http://www.inference.phy.cam.ac.uk/mackay/itprnn/ps/


#1    2011/06/28 (Tue) @ 21:10

I was doing some research on clustering last week in search of a Master’s thesis topic and thought something like http://www4.ncsu.edu/~cdwessel/NSMC%20Talk.pdf might be interesting to use on pitch f/x data.


#2    2011/06/29 (Wed) @ 01:04

Jimmy,

The clustering research in PITCHf/x is quite far ahead of the chapters from the MacKay book that you recommend there.  Some of that research has been publicly discussed, but not all of it.  I’ve personally tried many clustering techniques that are more advanced than those mentioned in that text, as have other researchers.  Some of the algorithms show some promise, but all of them are problematic.  Most of the problems have to do with not handling noisy data well.  There are clustering techniques that handle noisy data better, but even they have some trouble.

Ultimately, it seems that you trade accuracy for expediency with a clustering algorithm.  That works okay if you are just wanting to make some decent clusters and draw conclusions about large populations.  However, if you’re trying to measure variance of clusters and get anything meaningful from that, the performance of a clustering algorithm is going to need to be a lot better than anything I’ve seen so far.  (I’d be happy to correspond offline about some of the more promising clustering techniques I’ve looked at.)


#3    Jimmy    2011/06/29 (Wed) @ 08:07

Hi Mike,

The MacKay book is intended as an introductory look, certainly not as a definitive guide to clustering techniques. It only covers the very basics of k-means, and doesn’t even touch some other well-known ones like QT or annealing. The literature on clustering algorithms out there (took a peek last night) actually seems pretty huge as you noted, but I really am not up to date on it at all. Interesting subject though!

At any rate, my main point isn’t so much the clustering algorithm itself, since that subject is still evolving and, like you mentioned, still has problems to deal with. What I was trying to get at more directly was the fitting of distributions to the clusters, to rigorously quantify the spread. In my opinion there are tons of areas in baseball analysis like this where a simple technique from statistical theory can be brought in to change the discussion from qualitative analysis to quantitative analysis. And that means a lot.

In the end, I think for my specific example (at least) the problem might be mitigated with the help of human assistants, since a lot of times it’s easier for a person to positively identify a pitch as a slider versus a cutter, or whatnot. Then the clustering algorithm is even kind of moot and one can proceed with the rest of it.

Mike, I’d be glad to maybe look at some of the techniques you are alluding to in my spare time, just to satisfy my curiosity. Shoot me an e-mail?

gogurt at gmail

Much thanks for your input!


#4    2011/06/29 (Wed) @ 09:22

I have always thought--but unfortunately haven’t had the time to get into it--that Model-Based clustering would be a good candidate for unsupervised pitch identification.  Similar to the excerpt above, it allows for ellipsoidal and varying variances across pitches, and automatically chooses the optimal number of clusters based on BIC.  This allows the modeling of, for example, a new pitcher for whom we’re not sure of his pitches.  In this case, there is no need to choose k, and less worry about misplacement of the centroids.
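
The BIC step can be sketched without the fitting itself. Assume a Gaussian mixture with full covariance matrices has already been fit (for example by EM, not shown) for each candidate number of clusters. The maximized log-likelihood is then penalized by the number of free parameters. The function names and the sign convention (lower is better) are choices made for this example; some packages report the negated quantity and maximize it.

```python
import math

def gmm_free_parameters(k, dim=2):
    """Free parameters in a k-component Gaussian mixture, full covariances."""
    weights = k - 1                           # mixing proportions sum to 1
    means = k * dim
    covariances = k * dim * (dim + 1) // 2    # one symmetric matrix per component
    return weights + means + covariances

def bic(log_likelihood, k, n, dim=2):
    """BIC = -2 ln L + p ln n, where n is the number of pitches."""
    return -2.0 * log_likelihood + gmm_free_parameters(k, dim) * math.log(n)

def best_k(log_likelihoods, n, dim=2):
    """log_likelihoods: {k: maximized log-likelihood}. Returns the k with lowest BIC."""
    return min(log_likelihoods, key=lambda k: bic(log_likelihoods[k], k, n, dim))
```

In two dimensions each extra cluster costs six parameters, so a third or fourth cluster has to raise the log-likelihood by about 3 ln n to be selected.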


#5    2011/06/29 (Wed) @ 10:42

Millsy, model-based clustering is the technique that I have had the most success with.  It still needs a fair amount of babying.

