THE BOOK--Playing The Percentages In Baseball

Friday, November 21, 2008

Run-based similarity scores

By Tangotiger, 11:02 AM

Great work, though I disagree with it.  Pizza Cutter did similar work based on rate stats, and I left lots of comments on his thread.  My key point is this:

If you are interested in looking for similar players to Vince Coleman, you may insist that the speed components (3b per 2b+3b and sb per sbOpp) be weighted much more than you otherwise would, because you are really interested in the speed players mostly.

So, in a run-based system, the speed components simply won’t have much differentiation.  However, since we know that speed is strongly tied to SB, and speed is such a huge component of a player’s skillset, I would heavily overweight it when trying to find similar-style players.  Same deal for HR.  Perhaps this is best exemplified by the K, which is very close in run value to the typical out, but clearly there’s a huge difference between a hitter with 180 K and one with 40 K.  Basically, the more a component tells you about the player (rather than how many runs it’s worth), the more you should weight it.
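
The weighting idea can be pictured as a weighted distance over standardized components. This is only a minimal sketch: the component names, the z-score values, the weights, and the `weighted_distance` helper are all made up for illustration and are not taken from any of the systems discussed.

```python
def weighted_distance(a, b, weights):
    """Sum of weighted absolute differences between two players' z-scores."""
    return sum(w * abs(a[k] - b[k]) for k, w in weights.items())

# Hypothetical z-scores (illustrative values only, not real players).
speedster = {"sb_rate": 2.5, "triple_rate": 2.0, "hr_rate": -1.5, "k_rate": 0.5}
candidate_fast = {"sb_rate": 2.0, "triple_rate": 1.5, "hr_rate": -0.5, "k_rate": -0.5}
candidate_other = {"sb_rate": 1.5, "triple_rate": 1.0, "hr_rate": -1.5, "k_rate": 0.5}

# Assumed weights: equal weighting vs. overweighting the speed components.
equal = {"sb_rate": 1.0, "triple_rate": 1.0, "hr_rate": 1.0, "k_rate": 1.0}
speed_heavy = {"sb_rate": 3.0, "triple_rate": 3.0, "hr_rate": 1.0, "k_rate": 1.0}

for label, weights in (("equal", equal), ("speed_heavy", speed_heavy)):
    d_fast = weighted_distance(speedster, candidate_fast, weights)
    d_other = weighted_distance(speedster, candidate_other, weights)
    print(label, d_fast, d_other)
```

With these invented numbers, equal weights rank `candidate_other` as the closer comp, while overweighting the speed components flips the ranking toward `candidate_fast`.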


#1    Colin Wyers      (see all posts) 2008/11/23 (Sun) @ 15:48

I read over both the THT article and PC’s post (and the comments) and decided to try building my own sim scores. I used the following categories:

Age
Handedness
Weight
Height
PA
$BB=BB/PA
$SO=SO/(PA-BB)
$HR=HR/(PA-BB-SO)
$H=(S+D+T)/(PA-BB-SO-HR)
$E=(D+T)/(S+D+T)
$T=T/(D+T)
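
The rate components in the list above chain together, with each denominator dropping the events that earlier components already covered. Here is a rough sketch of that calculation, reading the $HR denominator as PA-BB-SO to match the neighboring formulas. The `component_rates` function name and the sample batting line are hypothetical.

```python
def component_rates(pa, bb, so, hr, s, d, t):
    """Return the chained rate components for one batting line.

    s, d, t are singles, doubles, and triples. Each denominator removes
    the events already accounted for by the earlier components.
    """
    non_hr_hits = s + d + t
    return {
        "$BB": bb / pa,
        "$SO": so / (pa - bb),
        "$HR": hr / (pa - bb - so),          # assumes SO in the denominator
        "$H": non_hr_hits / (pa - bb - so - hr),
        "$E": (d + t) / non_hr_hits,
        "$T": t / (d + t),
    }

# Hypothetical season line (made-up numbers, not a real player).
rates = component_rates(pa=600, bb=60, so=100, hr=20, s=100, d=30, t=5)
for name, value in rates.items():
    print(f"{name} = {value:.3f}")
```

The sketch does no guarding against zero denominators, which a real data set with small samples would need.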

Components were done as per MGL’s component regression article:

http://www.tangotiger.net/mgl/regression.pdf

Everything was converted into z-scores (subtract the average, then divide by the standard deviation) except for handedness, which I sorta faked. Sim scores are computed by summing the differences in z-scores across the categories. Lower is better.
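
A minimal sketch of that z-score approach, under a few assumptions that the comment does not spell out: the differences are taken as absolute values, the standard deviation is the population one, and all categories are weighted equally. The player pool and its numbers are invented.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize a sequence: subtract the mean, divide by the std deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def sim_score(za, zb):
    """Sum of absolute z-score differences across categories; lower is closer."""
    return sum(abs(x - y) for x, y in zip(za, zb))

# Hypothetical pool: one row per player, one column per category.
players = {
    "A": [0.10, 0.18, 0.040],
    "B": [0.11, 0.17, 0.045],
    "C": [0.05, 0.25, 0.010],
}
names = list(players)
columns = list(zip(*players.values()))       # one tuple per category
z_cols = [z_scores(col) for col in columns]  # standardize within each category
z_rows = {n: [col[i] for col in z_cols] for i, n in enumerate(names)}

# Score every other player against "A" and pick the closest comp.
scores = {n: sim_score(z_rows["A"], z_rows[n]) for n in names if n != "A"}
best = min(scores, key=scores.get)
print(best, scores)
```

Standardizing within each category keeps a component measured on a wide scale from swamping one measured on a narrow scale.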

Here’s the top 100 comps for a few players:

http://www.editgrid.com/user/cwyers/sim_scores_test

I think that I need to weight PAs and BABIP more heavily. Any suggestions?


#2    Pizza Cutter      (see all posts) 2008/11/23 (Sun) @ 18:34

I think that the answer depends on what question you’re trying to answer.  If you want to know whether players have similar run values, you should weight things by their run values.  Or just look at their total value/runs created/linear weights/etc.

If you want to look for qualitative differences between players (passably non-slow power hitters vs. slow power hitters), then don’t weight things.  This may not tell you anything useful (that’s an empirical question), but suppose we know that different “types” of players age differently.  The trick is finding measures that reliably distinguish between types.


#3          (see all posts) 2008/11/23 (Sun) @ 18:48

Pizza’s method seems much more relevant and applicable if we think about it in terms of projecting pitchers. Imagine how much more successful a projection system could be if we knew the kind of pitcher someone was. A soft-tossing minor leaguer would project differently than a fireballer if they had the same track record. I don’t think Zach’s method works as well for projections as Pizza’s would.

IOW, when scouts say a player “looks like” another player, Pizza’s method would match up with that evaluation better than Zach’s. I think.


#4    Colin Wyers      (see all posts) 2008/11/24 (Mon) @ 18:55

Yeah, it all depends on what your interest is. I’m looking at sim scores as a means to projections, with some (limited) success. (At some point I need to integrate three-year weighted averages into the process.) I’ve since added in basestealing as:

$SBA = (SB+CS)/(1B+BB+IBB+HBP)
$SB = SB/(SB+CS)
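
As a quick sketch, the two basestealing rates above might be computed like this. The `steal_rates` name and the sample totals are hypothetical, and returning `None` for a player with no attempts is an assumption of this sketch, not part of the formulas.

```python
def steal_rates(sb, cs, singles, bb, ibb, hbp):
    """Attempt rate and success rate using the denominators given above."""
    attempts = sb + cs
    opportunities = singles + bb + ibb + hbp
    return {
        "$SBA": attempts / opportunities,
        # Success rate is undefined with zero attempts (assumption: use None).
        "$SB": sb / attempts if attempts else None,
    }

# Hypothetical season totals (made-up numbers).
steal = steal_rates(sb=30, cs=10, singles=120, bb=50, ibb=5, hbp=5)
print(steal)
```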


#5    Tangotiger      (see all posts) 2008/11/25 (Tue) @ 07:55

80% of singles, 60% of walks and hit batters, and 0% of IBB should do the trick.  That’s what I use.

Those are the approx rates of how often 2b is open.
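
Those weights can be dropped into the attempt-rate denominator. The sketch below assumes that "walks" means unintentional walks only, so that IBB are counted separately at 0%. The function name and sample totals are hypothetical.

```python
def weighted_sb_opportunities(singles, ubb, hbp, ibb):
    """Estimated times on first with second base open.

    ubb is unintentional walks (assumption: IBB are not included in it).
    """
    return 0.80 * singles + 0.60 * (ubb + hbp) + 0.0 * ibb

# Hypothetical season totals (made-up numbers).
opps = weighted_sb_opportunities(singles=120, ubb=50, hbp=5, ibb=5)
attempt_rate = (30 + 10) / opps   # (SB + CS) per estimated opportunity
print(round(opps, 1), round(attempt_rate, 3))
```

Compared with counting every single, walk, and hit batter as a full opportunity, this shrinks the denominator, so attempt rates come out higher across the board.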

