Friday, November 21, 2008
Run-based similarity scores
Great work… with which I disagree. Pizza Cutter did similar work based on rate stats, and I left lots of comments on his thread. My key point is this:
If you are interested in looking for similar players to Vince Coleman, you may insist that the speed components (3b per 2b+3b and sb per sbOpp) be weighted much more than you otherwise would, because you are really interested in the speed players mostly.
So, in a run-based system, the speed components simply won’t have much differentiation. However, since we know that speed is strongly tied to SB, and speed is such a huge component of a player’s skillset, I would heavily overweight it when trying to find similar-style players. Same deal for HR. Perhaps this is best exemplified with the K, which is very close in run value to the typical out, but clearly, there’s a huge difference between a hitter with 180 K and one with 40 K. Basically, the more the component tells you about the player (rather than how many runs it’s worth), the more you should weight it.
I read over both the THT article and PC’s post (and the comments) and decided to see what I could do to build my own sim scores. I used the following categories:
Age
Handedness
Weight
Height
PA
$BB=BB/PA
$SO=SO/(PA-BB)
$HR=HR/(PA-BB-SO)
$H=(S+D+T)/(PA-BB-SO-HR)
$E=(D+T)/(S+D+T)
$T=T/(D+T)
Components were done as per MGL’s component regression article:
http://www.tangotiger.net/mgl/regression.pdf
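To make the component definitions concrete, here is a minimal Python sketch of the raw rates. It is only an illustration, not the code actually used: the function name and argument names are made up, the $HR denominator is taken to be PA-BB-SO (my reading of the list), and the regression step from MGL's article is not shown.

```python
def component_rates(pa, bb, so, hr, s, d, t):
    """Raw component rates for one player-season.

    Hypothetical helper: each rate removes the outcomes already
    accounted for by the earlier components from its denominator.
    No regression toward the mean is applied here.
    """
    return {
        "$BB": bb / pa,                            # walks per PA
        "$SO": so / (pa - bb),                     # strikeouts per non-walk PA
        "$HR": hr / (pa - bb - so),                # homers per contacted PA (assumed denominator)
        "$H": (s + d + t) / (pa - bb - so - hr),   # non-HR hits per ball in play
        "$E": (d + t) / (s + d + t),               # extra-base share of non-HR hits
        "$T": t / (d + t),                         # triples share of doubles plus triples
    }


# Made-up stat line, purely for illustration.
rates = component_rates(pa=600, bb=60, so=100, hr=20, s=100, d=30, t=5)
```

A player with no doubles or triples would make the $T denominator zero, so a real version would need to guard against that.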
Everything was converted into z-scores (subtract the average, then divide by the standard deviation) except for handedness, which I sorta faked. Sim scores are computed by summing the differences in z-scores across the categories. Lower is better.
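A rough Python sketch of that calculation follows, under a few assumptions that are mine rather than anything stated above: differences are taken as absolute values (so they can't cancel out), the standard deviation is the population one, handedness is left out, and the optional `weights` argument is a hypothetical hook for overweighting the categories you care about.

```python
from statistics import mean, pstdev


def z_scores(players, categories):
    """Convert each category to z-scores across the whole player pool.

    `players` maps a player name to a dict of category -> raw value.
    """
    out = {name: {} for name in players}
    for cat in categories:
        values = [players[name][cat] for name in players]
        mu, sigma = mean(values), pstdev(values)
        for name in players:
            # If everyone has the same value, the category carries no information.
            out[name][cat] = 0.0 if sigma == 0 else (players[name][cat] - mu) / sigma
    return out


def sim_score(z, a, b, weights=None):
    """Sum of (optionally weighted) absolute z-score gaps. Lower is more similar."""
    weights = weights or {}
    return sum(
        weights.get(cat, 1.0) * abs(z[a][cat] - z[b][cat])
        for cat in z[a]
    )
```

For example, passing something like `weights={"PA": 2.0, "$H": 2.0}` would count gaps in PA and BABIP twice as much as the other categories; the right multipliers are a judgment call.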
Here’s the top 100 comps for a few players:
http://www.editgrid.com/user/cwyers/sim_scores_test
I think that I need to weight PAs and BABIP more heavily. Any suggestions?