Wednesday, August 22, 2007
Career DIPS numbers
The following file shows the career BABIP of all pitchers and their teammates since 1916:
BABIP, pitcher and mates, career since 1916, min 500 PA (Google Docs).
minYear: first season
maxYear: last season
careerBIP: total BIP, calculated as BFP minus (HR+BB+HBP+SO)
BABIP: (H-HR)/BIP
mateBABIP: his teammates’ BABIP, weighted by his BIP for each season
gap: difference between BABIP and mateBABIP
OneSD: sqrt(mateBABIP*(1-mateBABIP)/careerBIP)
SD: gap / OneSD
trueGap: gap * careerBIP/(careerBIP+3700)
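As a rough illustration, the column definitions above can be sketched in Python. The function names, argument names, and the example pitcher are hypothetical and not taken from the file, and the sign of gap (BABIP minus mateBABIP) is an assumption.

```python
from math import sqrt


def season_weighted_mate_babip(seasons):
    """Weight the teammates' BABIP by the pitcher's own BIP in each season.

    seasons: list of (pitcher_bip, mates_babip) pairs, one per season.
    """
    total_bip = sum(bip for bip, _ in seasons)
    return sum(bip * mates for bip, mates in seasons) / total_bip


def dips_columns(bfp, h, hr, bb, hbp, so, mate_babip, k=3700):
    """Compute the per-pitcher columns from career totals.

    k is the 3700-BIP constant used for trueGap.
    """
    career_bip = bfp - (hr + bb + hbp + so)
    babip = (h - hr) / career_bip
    gap = babip - mate_babip  # assumed sign: pitcher minus teammates
    one_sd = sqrt(mate_babip * (1 - mate_babip) / career_bip)
    return {
        "careerBIP": career_bip,
        "BABIP": babip,
        "mateBABIP": mate_babip,
        "gap": gap,
        "OneSD": one_sd,
        "SD": gap / one_sd,
        "trueGap": gap * career_bip / (career_bip + k),
    }


# Made-up pitcher: 5000 batters faced, teammates at .300
row = dips_columns(bfp=5000, h=1173, hr=100, bb=400, hbp=50, so=750,
                   mate_babip=0.300)
print(row)
```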
The standard deviation of the SD column across all 2828 pitchers is 1.35. We would expect 1.00 if the results were purely random (i.e., if pitchers had no influence on outs per BIP). Pure randomness is clearly not what we see. However, even a modicum of influence can produce a standard deviation of 1.35 if the “n” (BIP in this case) is large enough. The halfway point (r=.50) is at about 3700 BIP.
This means that if you have a pitcher with 3700 BIP, you regress his sample BABIP 50% toward his teammates’ BABIP. With 1850 BIP, you regress two-thirds of the way toward his teammates’ BABIP. With 7400 BIP, you regress one-third of the way. For your typical pitcher with 500 BIP in a season, the regression amount is 88% (i.e., r=.12, rSquared=.01). It’s on this basis that you will often hear that pitchers have little influence. What’s really true is that one season’s worth of stats is hardly indicative of a pitcher’s skill.
The regression amount is 3700/(3700+BIP). The “trueGap” is the part of the gap that is left after regressing, which is gap * BIP/(BIP+3700).
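The regression arithmetic can be checked with a short Python sketch. The function names are made up for illustration, and the 3700 constant is simply the halfway point quoted in the post.

```python
def regression_amount(bip, k=3700):
    """Fraction of the way to regress a pitcher's BABIP toward his teammates'."""
    return k / (k + bip)


def regressed_babip(babip, mate_babip, bip, k=3700):
    """Estimate after regressing the sample BABIP toward mateBABIP."""
    keep = 1 - regression_amount(bip, k)  # this is r, the share of the gap kept
    return mate_babip + keep * (babip - mate_babip)


# The BIP totals discussed in the post
for bip in (500, 1850, 3700, 7400):
    amount = regression_amount(bip)
    print(bip, round(amount, 3), round(1 - amount, 3))
```

The difference between `regressed_babip(...)` and `mate_babip` is the same quantity as trueGap, since `1 - 3700/(3700+BIP)` equals `BIP/(BIP+3700)`.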


OK, now given this data, is it possible to add in a bunch of columns for aspects of the pitching style? For example, dichotomous variables for knuckleballer, lefty, etc., as well as spectrum variables like average fastball speed, % likelihood of a pitch being a changeup, etc. I guess this is something that would have to be hand-coded (and may be quite inaccurate given things like the loss in velocity as pitchers age or get injured)… but it would seem like the last step in “solving” how pitchers are able to demonstrate the skill of inducing easier-to-field balls.
Also - shouldn’t about 5% of the players be outside 2 standard deviations? I count 333, which accounts for 11.8% of the pitchers. 143 of these pitchers are beyond 2 SD worse than average (5%), where you’d expect a little under 2.5%, I think. Is there some reason that this should be a non-normal distribution, skewed with an excess of very good and very bad talent?
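One way to look at the counts above: the 5% figure applies to a column whose standard deviation is 1.00, while the post reports 1.35 for the SD column. The Python sketch below compares the two cases. It assumes, purely for illustration, that the SD column is normally distributed, which the data may not satisfy, and the function name is hypothetical.

```python
from math import erfc, sqrt


def two_sided_tail(cutoff, sd=1.0):
    """P(|X| > cutoff) for a normal variable with mean 0 and the given sd."""
    return erfc(cutoff / (sd * sqrt(2)))


observed_both = 333 / 2828  # pitchers outside 2 SD in either direction
observed_one = 143 / 2828   # pitchers beyond 2 SD on one side only

print("observed, both sides:", round(observed_both, 3))
print("observed, one side:  ", round(observed_one, 3))
print("expected if sd=1.00: ", round(two_sided_tail(2, 1.00), 3))
print("expected if sd=1.35: ", round(two_sided_tail(2, 1.35), 3))
```

Under that assumption, a spread of 1.35 puts roughly 14% of pitchers outside 2, so the observed 11.8% does not by itself point to unusually fat tails.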