Friday, January 29, 2010
Clustering exposes bias
This is a good example of a place where you can start, but you can’t end there. Steve decided to take SOME traits of a hitter’s profile and create clusters based on those traits. For example:
I found it mildly amusing that if you only look at batted ball types (excluding HR/FB) that Yuniesky Betancourt and Albert Pujols fall in the same cluster.
This makes it abundantly clear that the traits you choose will drive your results. He gives another example:
The set of clusters I decided to focus on were the ones based on LD, HR/FB, BB. Here are the cluster centers for it, along with the average wOBAs of each cluster.... And here are a couple guys that stand out by having a low wOBA relative to their cluster (potential for improvement maybe?)… this all boils down to an interesting thought experiment
The bold part is the only thing I would strike from his article. After all, he just finished telling us that because of the limited number of traits in one experiment, he had Pujols and YuBet in the same cluster. He took a better set of traits, but still, that doesn’t mean there’s no bias there. Whatever it is he left out, that’s what will drive the differences.
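To make the point concrete, here is a minimal sketch of how the same hitters regroup when you change the traits. Everything in it is an assumption for illustration: the four hitters and their numbers are made up (not real player data), the trait names are my own labels, and the clustering is a bare-bones k-means with a farthest-point start, not whatever method Steve actually used.

```python
from math import dist

# Made-up rate stats (in percent) for four hypothetical hitters. Not real data.
HITTERS = {
    "slugger_a": {"ld": 20, "gb": 40, "hr_fb": 20, "bb": 15},
    "slap_b":    {"ld": 20, "gb": 41, "hr_fb": 4,  "bb": 4},
    "slugger_c": {"ld": 15, "gb": 55, "hr_fb": 19, "bb": 14},
    "slap_d":    {"ld": 14, "gb": 54, "hr_fb": 5,  "bb": 5},
}


def cluster(hitters, traits, k=2, max_iter=20):
    """Group hitters with a tiny k-means that sees only the chosen traits."""
    names = list(hitters)
    points = [tuple(float(hitters[n][t]) for t in traits) for n in names]

    # Deterministic start: first point, then the point farthest from all centers so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))

    labels = None
    for _ in range(max_iter):
        # Assign each hitter to the nearest center.
        new_labels = [min(range(k), key=lambda i: dist(p, centers[i])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Move each center to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centers[i] = tuple(sum(col) / len(members) for col in zip(*members))

    groups = {}
    for name, lab in zip(names, labels):
        groups.setdefault(lab, set()).add(name)
    return {frozenset(g) for g in groups.values()}


batted_ball_only = cluster(HITTERS, ["ld", "gb"])
with_power_and_walks = cluster(HITTERS, ["ld", "hr_fb", "bb"])
```

With batted-ball traits only, the slugger and the slap hitter who share a line-drive and ground-ball profile land together. Add HR/FB and BB and they split apart. Nothing about the players changed, only what the clustering was allowed to see.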
Ideally, you would look at ALL the parameters such that you don’t leave anything out, and therefore have no bias, other than the actual identity of the player being special (say there’s nothing in Ichiro’s or Jeter’s hitting profile that could do them justice).
And this is exactly how a forecasting system works: you identify the traits, you figure out the relationship of those traits to the hitting stats, and you come out with a forecast. You will find, after hundreds of hours on this, that:
1. You spent hundreds of hours on this
2. The difference between the point you are now at, and the point you started at, can be measured in inches
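The three steps above (pick traits, fit their relationship to a hitting stat, forecast) can be sketched in a few lines. This is a toy under loud assumptions: the two traits, the sample numbers, and the function names `fit_linear` and `forecast` are all hypothetical, the data is constructed to be exactly linear (real data never is), and a plain least-squares fit stands in for whatever a real forecasting system does.

```python
def fit_linear(rows, targets):
    """Least-squares fit of target ~ intercept + sum(coef * trait), via normal equations."""
    X = [[1.0, *row] for row in rows]
    n = len(X[0])
    # Augmented normal-equation matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def forecast(coefs, traits):
    """Apply the fitted relationship to one hitter's traits."""
    return coefs[0] + sum(c * t for c, t in zip(coefs[1:], traits))


# Step 1: identify the traits. Made-up (HR/FB %, BB %) pairs, not real players.
traits = [(5, 5), (10, 8), (15, 6), (20, 12), (8, 14)]
# Made-up wOBA values, constructed to follow an exact linear rule.
wobas = [0.285, 0.314, 0.328, 0.366, 0.324]

# Step 2: figure out the relationship of those traits to the hitting stat.
coefs = fit_linear(traits, wobas)

# Step 3: come out with a forecast for a new hypothetical hitter.
projected = forecast(coefs, (12, 10))
```

Any trait left out of `traits` gets silently folded into the intercept and the other coefficients, which is the same bias the clustering has.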

