THE BOOK--Playing The Percentages In Baseball


Tuesday, December 02, 2008

Estimating BABIP

By Tangotiger, 11:05 AM

Good stuff.  Unfortunately, it is presented as a black box, but I like all the different components that were presented.


#1          (see all posts) 2008/12/02 (Tue) @ 11:14

Do you agree with the use of BB%/SO% as the Hitter eye factor, Tango? To me that seems to be a little flawed, but I wonder if better can be done without Pitch f/x data. They don’t specify if that’s unintentional walk percentage or overall walk percentage, either.


#2    Tangotiger      (see all posts) 2008/12/02 (Tue) @ 11:20

If BB% and SO% share the same denominator, then that’s the same thing as BB/SO.

Personally, I would change all those ratios into rates, so BB/(SO+BB).  Otherwise, you are saying that you can increase the denominator of your ratio by 100 times, and it won’t really affect the value as much as increasing the numerator of the ratio by 100 times.  This applies to the other ratios they noted.
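To make the ratio-versus-rate point concrete, here is a minimal sketch. The walk and strikeout counts are made up for illustration, and the function names are hypothetical. It shows that scaling BB by 100 moves the raw ratio by +99 while scaling SO by 100 moves it by less than 1, whereas the rate BB/(SO+BB) moves the same distance in either direction.

```python
def ratio(bb, so):
    """Raw ratio BB/SO: unbounded above, undefined when so == 0."""
    return bb / so


def rate(bb, so):
    """Rate BB/(SO+BB): always between 0 and 1."""
    return bb / (bb + so)


# Hypothetical counts, chosen only to show the asymmetry.
cases = {
    "baseline": (50, 50),
    "BB x100": (5000, 50),
    "SO x100": (50, 5000),
}

for label, (bb, so) in cases.items():
    print(f"{label:9s} ratio={ratio(bb, so):8.3f} rate={rate(bb, so):.4f}")
```

The same rewrite would apply to the other ratios, for example FB/(GB+FB) in place of FB/GB.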

And, it doesn’t really matter if they call it “eye” or whatever.  BB/(SO+BB) is what it is, and certainly EVERY component you can invent has SOME impact on BABIP.  It’s our job to figure out the degree of relevance.

And, IBB should likely be removed, but then again, IBB itself (with or without BB or SO) might lead to an inference of BABIP.

Also, they didn’t mention “HBP” as needing to be removed from the denominator in one of the other measures (but that is probably just an oversight).

You also have the issue of bunts.


#3    Matt Mitchell      (see all posts) 2008/12/02 (Tue) @ 12:53

I don’t think this is as much of a black box as you may think. This is written much like a lot of traditional regression model papers in stats journals, and it seems to me they give all the directions to reproduce their results. It’ll just take some work to set up the data and the ability to use a stats program like R. Of course, even if you did that, it’d be nice to see what their coefficients for the variables were.


#4    Tangotiger      (see all posts) 2008/12/02 (Tue) @ 13:24

They seem pretty open, based on their response at ballhype.  So, I expect to see the coefficients at some point.


#5          (see all posts) 2008/12/02 (Tue) @ 13:53

I wonder if this really tracks better than simply taking historical hit rate averages for different types of BIP.  Sure it’s better than adding .12 to ld% but is the increased complexity worth it rather than simply having coefficients?

Maybe I’m missing something here. . .


#6    Tangotiger      (see all posts) 2008/12/02 (Tue) @ 14:51

I would definitely start with the league hit rate for each of GB, FB, LD, Pops.

I would also need the frequency rates of each of those for each player, and regress a certain amount.

Finally, add in speed.

That I think is the minimum you should do, and would become your baseline.  Anything beyond that, as the authors are doing, is what I would test against.
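A rough sketch of that baseline might look like the following. Everything numeric here is a placeholder: the league hit rates, the league batted-ball mix, the amount of regression (`ballast`) and the additive `speed_adj` are all assumptions for illustration, not measured values, and home runs are ignored for simplicity.

```python
# Placeholder league values, NOT measured rates.
LEAGUE_HIT_RATE = {"GB": 0.24, "FB": 0.14, "LD": 0.72, "POP": 0.02}
LEAGUE_MIX = {"GB": 0.44, "FB": 0.30, "LD": 0.19, "POP": 0.07}


def regress_mix(counts, ballast):
    """Shrink a player's batted-ball mix toward the league mix by
    adding `ballast` balls in play distributed like the league."""
    total = sum(counts.get(k, 0) for k in LEAGUE_MIX)
    return {
        k: (counts.get(k, 0) + ballast * LEAGUE_MIX[k]) / (total + ballast)
        for k in LEAGUE_MIX
    }


def baseline_babip(counts, ballast=200, speed_adj=0.0):
    """League hit rate per type, weighted by the regressed mix,
    plus a hypothetical additive speed adjustment."""
    mix = regress_mix(counts, ballast)
    return sum(mix[k] * LEAGUE_HIT_RATE[k] for k in mix) + speed_adj


# Hypothetical line-drive-heavy hitter.
player = {"GB": 150, "FB": 100, "LD": 100, "POP": 10}
print(round(baseline_babip(player), 3))
```

A player with no data comes out exactly at the league figure, and more balls in play pull the estimate further toward his own mix.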


#7    Peter      (see all posts) 2008/12/02 (Tue) @ 18:18

Thanks for the comments here and on ballhype, Tango. We struggled with the issue of bunts, and to be quite honest we didn’t know how best to deal with them. I wonder, for example, if we underrated some of the speedy guys who are rather poor hitters but tend to bunt more often (like Gathright and Bourn).

Also, I believe Chris has the coefficients, and he should be able to provide them in the near future.


#8          (see all posts) 2008/12/02 (Tue) @ 19:55

Good stuff!
I think there are a few things that could be cleaned up in both the methodology and the presentation.

1. Using Pitches_perEBH as one of the independent variables is potentially problematic as it is somewhat dependent on the dependent variable (which balls in play fall for hits and which are caught).

2. Something tango already pointed out above: with a lot of these variables there is a choice of what form to use (FB/GB or GB/FB or FB/(GB+FB)) and the results you get may depend on this somewhat arbitrary choice.  I think he’s right that the latter form (rate) is best.

3. “Our regression model yields an R-squared value of .348 ... As an additional test of accuracy, we find a robust 59 percent correlation.” This is redundant, no? Note that .59^2 = .348 and .18^2 = .03.

4. It would be nice to have the final formula, but even better to have some measure of the relative importance of each of the 13 independent variables to the final answer, which might also suggest a simplified form we could use for approximate calculations. (It would also indicate how big of a potential problem the points raised above are, depending e.g. on how significant Pitches_perEBH is.)
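On point 3, the redundancy can be written out. Assuming the quoted correlation is the in-sample correlation between actual and fitted BABIP from an ordinary least squares fit with an intercept (the paper excerpt does not say so explicitly), the two numbers are the same quantity:

```latex
\[
R^2 = \bigl[\operatorname{corr}(y, \hat{y})\bigr]^2,
\qquad 0.59^2 = 0.3481 \approx 0.348,
\qquad 0.18^2 = 0.0324 \approx 0.03 .
\]
```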

Overall, nice work though.


#9    Tangotiger      (see all posts) 2008/12/02 (Tue) @ 22:09

Their article has been generously offered here:

http://tangotiger.net/tufts/


#10    Jay Gibbons      (see all posts) 2008/12/03 (Wed) @ 07:09

Maybe they should have included hr rate or hr/fb.  I’d imagine that would say something about how hard players generally hit the ball.


#11    Chris C.      (see all posts) 2008/12/03 (Wed) @ 10:12

Wow - talk about overfitting a regression. That model is far too ungainly to be taken seriously, but I agree with the general point that we can do much better than LD% + .120.


#12    CSDutton      (see all posts) 2008/12/03 (Wed) @ 11:41

Chris - Keep in mind that the purpose of this paper is not to develop a “convenient” predictive model, but rather to identify and explore the factors that truly determine BABIP - many of which had been previously overlooked.

We can easily simplify our model for the sake of usability, but that defeats the purpose of striving for accuracy.


#13    Chris C.      (see all posts) 2008/12/03 (Wed) @ 13:01

I don’t care about the ‘convenience’ - it’s exactly the ‘accuracy’ that worries me.

What happens if you try to cross-validate this? In general, if a model has more predictors than I can count on my hands and toes, then I assume it isn’t generalizable beyond the dataset and the ‘variance explained’ is meaningless.
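For anyone who wants to try it, a minimal k-fold cross-validation sketch follows. It uses a single predictor and synthetic data purely to keep the example short. The real model has many more variables, and the data, seed and function names here are all assumptions. The point is only that the error is measured on players the fit never saw.

```python
import random
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def kfold_rmse(xs, ys, k=5, seed=0):
    """Root mean squared error on held-out folds only."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        sq_errors += [(ys[i] - (intercept + slope * xs[i])) ** 2
                      for i in held_out]
    return sqrt(sum(sq_errors) / len(sq_errors))


# Synthetic data: LD% plus noise stands in for BABIP (illustration only).
rng = random.Random(1)
ld = [rng.uniform(0.15, 0.25) for _ in range(100)]
babip = [0.12 + x + rng.gauss(0, 0.02) for x in ld]
print(round(kfold_rmse(ld, babip), 4))
```

If the held-out error is much worse than the in-sample error, the ‘variance explained’ is not generalizing.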


#14          (see all posts) 2008/12/03 (Wed) @ 13:30

I’m going to use the old-fashioned method, Marcel style, predicting future BABIP by a weighted mean of past BABIP.

I’ve got to rewrite my projections query somewhat to get the output to more closely resemble theirs, so I won’t be spending hours in Excel doing copy/paste.

They had an rms of 8.0%, while LD% had an rms of 12.1% (not yet weighted for each player’s PAs). A partial analysis of my projections, not using any data from the season being tested for, has an rms of 9.2%.

I reported it as a percent because that’s the way they did in their spreadsheet. I’m going to also measure the absolute error, and use the rms of all batters compared to the league mean as a benchmark.
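As a sketch of what that could look like: the 5/4/3 season weights are the usual Marcel weights, but the league BABIP, the amount of regression (`ballast_bip`) and the reading of "rms as a percent" as relative error are assumptions on my part, not the values used in the actual query.

```python
from math import sqrt


def marcel_babip(seasons, league_babip=0.300, weights=(5, 4, 3),
                 ballast_bip=400):
    """Weighted mean of past BABIP, regressed toward the league.

    seasons: list of (hits_on_bip, bip), most recent season first.
    ballast_bip is a hypothetical amount of league-average BIP added.
    """
    w_hits = sum(w * h for w, (h, _) in zip(weights, seasons))
    w_bip = sum(w * b for w, (_, b) in zip(weights, seasons))
    return (w_hits + ballast_bip * league_babip) / (w_bip + ballast_bip)


def rms_relative_error(projected, actual):
    """RMS of (projected - actual) / actual, i.e. error as a percent."""
    errs = [((p - a) / a) ** 2 for p, a in zip(projected, actual)]
    return sqrt(sum(errs) / len(errs))


# Hypothetical three-season history, most recent first.
history = [(140, 420), (120, 400), (110, 390)]
print(round(marcel_babip(history), 3))
```

Weighting each player's error by PA, as mentioned above, would be a small change to `rms_relative_error`.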


#15    Rockytop      (see all posts) 2008/12/03 (Wed) @ 14:40

They didn’t report the actual results of the model, nor did they report the adjusted R^2, which is far more important than the R^2 when a model involves that many independent variables. Also, they need to present a correlation matrix of their independent variables. It would not at all surprise me if their model suffered from multicollinearity because a lot of those IVs seem to be derived from the same data points.


#16    CSDutton      (see all posts) 2008/12/03 (Wed) @ 14:45

Adjusted R-squared is over 33% (vs. R-squared of 34.8), so this model isn’t inflated by the number of explanatory variables.  We also ran a correlation of all independent variables and found very little multicollinearity between them.
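For reference, the standard adjustment is shown below, where n is the number of observations and p the number of explanatory variables (neither is stated in this thread, so no worked number is attempted):

```latex
\[
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
\]
```

The penalty shrinks as n grows relative to p, which is why a small gap between the two figures suggests the fit is not being inflated by the variable count.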


#17    CSDutton      (see all posts) 2008/12/03 (Wed) @ 14:48

Also, the extensive results are available in the original paper (linked above)


#18    Rick      (see all posts) 2008/12/03 (Wed) @ 23:55

Shouldn’t we be concerned about the significance of the independent variables as much as the overall fit of the model?  I have to believe there’s some correlation between those variables, and multicollinearity pretty much renders all of those coefficients (and thus the significance tests against them) completely unreliable, right?


#19    Zach      (see all posts) 2008/12/04 (Thu) @ 00:16

I’d be more interested in seeing the R-squared for the next year rather than the current year. Does this model better help in predicting a player’s future performance, or stripping the “luck” out of the current year? It would also be interesting to see how the standard LD%+.120 equation matches up in current and next year R-squared, as well.


#20    CSDutton33      (see all posts) 2008/12/04 (Thu) @ 00:23

I understand the concern of multicollinearity, because that’s something I’ve obviously had in mind throughout the analysis.  Granted I’m no stats expert, but I did spend time running correlation matrices and variance inflation factors for all the primary explanatory variables.  Based on the guidelines I’ve been given, multicollinearity doesn’t seem to be biasing the estimators.  One variable that did seem to cause some concern, however, was pitches per plate appearance, so I have removed it in more recent models.

Independent variables also showed consistent, highly significant coefficients as other variables were added or removed - another sign that multicollinearity may not be a major concern.
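For readers who want to repeat the check, here is one common way to get variance inflation factors: take the diagonal of the inverse of the predictors' correlation matrix. This is a generic sketch in plain Python, not the procedure actually used for the paper, and the two columns below are made-up numbers.

```python
from math import sqrt


def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / sqrt(sum((x - ma) ** 2 for x in a) *
                      sum((y - mb) ** 2 for y in b))


def invert(m):
    """Gauss-Jordan inverse of a small square matrix."""
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vifs(columns):
    """VIF for each predictor column: diagonal of inverse correlation matrix."""
    r = [[corr(a, b) for b in columns] for a in columns]
    inv = invert(r)
    return [inv[j][j] for j in range(len(columns))]


# Hypothetical predictor columns (illustration only).
contact = [0.70, 0.75, 0.80, 0.85, 0.90]
bb_so = [0.30, 0.45, 0.40, 0.70, 0.80]
print([round(v, 2) for v in vifs([contact, bb_so])])
```

A VIF near 1 means a predictor is nearly uncorrelated with the others, and larger values mean its coefficient is estimated less precisely.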

Thanks for the input guys, I’m hoping to continue to revise the model and work out some of the kinks over the next few weeks.


#21    Pizza Cutter      (see all posts) 2008/12/04 (Thu) @ 01:10

Could you do it stepwise?  That would knock out some of the multicollinearity problems.


#22          (see all posts) 2008/12/04 (Thu) @ 09:03

This is a very interesting read.

I never liked using only LD%. For one, it has a smaller sample than the rate it is estimating. I don’t recall anyone using regression and historical weighted means on LDs to find its “true level”.

I do think that I can do nearly as well as this method by using weighted means of past BABIP, definitely better than LD%.

What I would like to find from this research is where each player’s BABIP skill comes from - what percentage is from speed, hitting the ball hard, hitting the ball far, contact rate, etc. This would help project a player, as these things have different age curves. For example, a player who relies mostly on speed should expect an earlier decline than one whose best skill is making contact.


#23    Tangotiger      (see all posts) 2008/12/04 (Thu) @ 11:49

More comments here:
http://ballhype.com/story/batters_and_babip/


#24    dcj      (see all posts) 2008/12/13 (Sat) @ 18:32

Just running across this for the first time. I must have overlooked it before.

I like this article a lot. You all clearly put a lot of work into the research. Contrary to Chris C., I think the number of independent variables is appropriate. I’m also reassured that multicollinearity is not a big issue.

Let me go through the variables one by one (except P/PA, which you have removed).

LD% (+ effect on BABIP)—quite reasonable.
Speed score (+ effect)—ditto.
FB/GB ratio (- effect)—ditto.

All these make good sense. Like Tango said, you could add in popups. THT keeps track of infield flies per fly ball, so the data is out there.

P/XBH (- effect)—I agree with andeux that dependence on BABIP could be a problem. In fact, my first reaction is to get rid of this variable completely. Is there something else that captures “well-struck-ness” without depending on BABIP?

Quite apart from that, you’ll get issues with players having very few XBH. I think this is why Gathright’s expected BABIP is so low; he had only 4 XBH all season. At a minimum you should take the reciprocal of this variable.
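A quick sketch of why the reciprocal behaves better. The 4 XBH figure is from above, but the pitch count is a made-up number and the function names are hypothetical.

```python
def pitches_per_xbh(pitches, xbh):
    """Original form: explodes as XBH approaches zero."""
    return pitches / xbh if xbh else float("inf")


def xbh_per_pitch(pitches, xbh):
    """Reciprocal form: stays between 0 and 1, and is 0 with no XBH."""
    return xbh / pitches if pitches else 0.0


PITCHES = 1200  # hypothetical season pitch count

for xbh in (4, 2, 1, 0):
    print(xbh, pitches_per_xbh(PITCHES, xbh),
          round(xbh_per_pitch(PITCHES, xbh), 5))
```

Dropping from 4 XBH to 2 doubles P/XBH, while the reciprocal moves by a small, bounded amount, so a handful of extra-base hits cannot swing the prediction as much.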

Contact rate (- effect)—The article says that players who swing harder tend either to miss the ball completely or to hit it hard, leading to a lower contact rate and a higher BABIP. This seems right. Another way to look at it is that players who can’t make consistent contact have to compensate in other ways in order to stay in MLB. In particular, they have to get better results than average when they do strike the ball.

From this point of view, the negative correlation between contact rate and BABIP is partly an artifact of selection bias. So, it’s okay for established MLB hitters, but if there’s a rookie with low contact rate and low BABIP, maybe he just isn’t good enough to stick.

Spray (- effect)—I like this one. Could the reason for its significance be that opposite-field batted balls are more weakly hit? Or do pulled GB/LD by dead pull hitters differ systematically from pulled GB/LD by spray hitters? A combination of both factors?

BB/SO (+ effect)—Not so keen on this. Mark Reynolds and Freddy Sanchez are very different hitters, but their BB/SO are almost the same. You already have contact rate, which is going to be strongly negatively correlated with SO%. I think you could replace BB/SO with straight BB%, or if BB% is correlated with contact rate, use BB% above or below the level predicted by contact rate.
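The "BB% above or below the level predicted by contact rate" idea amounts to using a regression residual as the variable. A minimal sketch, with made-up numbers for five hypothetical hitters:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def bb_above_expected(contact_rate, bb_pct):
    """BB% minus the BB% predicted from contact rate alone."""
    slope, intercept = fit_line(contact_rate, bb_pct)
    return [y - (intercept + slope * x)
            for x, y in zip(contact_rate, bb_pct)]


# Hypothetical hitters (illustration only).
contact = [0.70, 0.75, 0.80, 0.85, 0.90]
bb_pct = [0.12, 0.08, 0.10, 0.07, 0.06]
print([round(r, 4) for r in bb_above_expected(contact, bb_pct)])
```

By construction the residual is uncorrelated with contact rate, so it would carry only the part of BB% that contact rate does not already explain.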

--

The next step is to see whether some players consistently outperform or fall short of their expected BABIP. It looks like you’ve taken steps in that direction, but before you have a result, it’s premature to make predictions with the level of certainty you displayed in the article.

Finally, I want to compliment the authors again for a very solid study as well as their openness to feedback. Good work!


#25    dcj      (see all posts) 2008/12/13 (Sat) @ 19:06

By the way, I think a next generation of UZR could make good use of this type of study. We have a batted ball to CF, and UZR uses the BIS or STATS data to estimate the ball’s chance of dropping for a hit. If we know that the batter was Jim Thome, should that change our estimate?

One approach is to look at the batter’s history of hitting balls into that zone and adjust the estimate accordingly (based on the size of the sample). But, most batters may not have a big enough sample size to be meaningful.

In that case, it would make sense to classify the batter according to a list of attributes, just like this study does, and use regression to estimate how much easier or more difficult to field are this batter’s CF fly balls than average.

Of course, the batter’s attributes (like contact rate, LD%, etc.) would themselves need to be regressed to avoid wacky results for players with 3 career AB.


#26    Tangotiger      (see all posts) 2008/12/13 (Sat) @ 19:19

That’s what I use in WOWY.  I don’t really care where Jim Thome hits the ball.  I just know how often each position on the field converts his balls in play into an out (against LHP and RHP, and whether they are FB or GB pitchers).  If the typical CF makes an out on 8% of his BIP, and Ichiro makes an out on 10% of Thome’s, then I presume that: a) Thome had the same ball in play distribution, and b) Ichiro is good.

Given a large enough sample, any of the quirks in assumption a) cancels out, leaving me with whatever I find in b).
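In code, the comparison could be sketched roughly like this. It is not the actual WOWY implementation: the record layout, the min-of-samples weighting and the counts are assumptions, and the pitcher handedness and GB/FB splits mentioned above are left out to keep it short.

```python
def wowy_out_rate_diff(records, fielder):
    """Average, over batters, of (fielder's out rate on that batter's BIP)
    minus (everyone else's out rate on that batter's BIP).

    records: iterable of (batter, fielder_name, outs, bip) for one position.
    Each batter is weighted by the smaller of the two BIP samples.
    """
    by_batter = {}
    for batter, name, outs, bip in records:
        key = "with" if name == fielder else "without"
        slot = by_batter.setdefault(
            batter, {"with": [0, 0], "without": [0, 0]})
        slot[key][0] += outs
        slot[key][1] += bip
    num = den = 0.0
    for slot in by_batter.values():
        (w_outs, w_bip), (o_outs, o_bip) = slot["with"], slot["without"]
        if w_bip == 0 or o_bip == 0:
            continue  # no comparison possible for this batter
        weight = min(w_bip, o_bip)
        num += weight * (w_outs / w_bip - o_outs / o_bip)
        den += weight
    return num / den if den else 0.0


# Counts are hypothetical, chosen to match the 10% vs 8% example.
records = [
    ("Thome", "Ichiro", 10, 100),
    ("Thome", "other CF", 80, 1000),
]
print(round(wowy_out_rate_diff(records, "Ichiro"), 3))
```

With many batters in the sample, differences in where any one batter hits the ball should tend to wash out, which is assumption a).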

And, I think it worked out pretty well in the THT08 article to show that.


#27    dcj      (see all posts) 2008/12/13 (Sat) @ 21:37

I just reread those WOWY articles. That must have been where I got the idea to control for the identity of the batter.

Let me lay out where I’m going with this. We can estimate for each BIP how likely it is to be a hit. I want to give the hitter credit for that fraction of a hit, regardless of whether the fielder converts it to an out or not. Likewise, I charge the pitcher with that fraction of a hit. (This is PZR.)

With the data we have now, this wouldn’t work. Ichiro hits a GB to short, UZR says 90% out, but he beats it out for a hit. If he can beat out that GB 60% of the time, he should get 60% of a hit.

I’d have to think about this more carefully, but I am guessing that in order to avoid circular reasoning, we need to get the 60% estimate from some other source than Ichiro’s empirical success rate on that type of GB. Speed score and handedness, maybe.

We probably need Hit F/X and Field F/X to make this super-accurate, but we may be able to get close with the data we have and some clever adjustments.


#28          (see all posts) 2008/12/15 (Mon) @ 11:50

To #24:

I agree about K/BB ratio being a poor decision to include in this model. In the feedback we have received, it has become clear that the model would be better off with K rate and BB rate as separate stats, for just the reason you mentioned: they are independent of one another. We plan on fixing this in the future.

I also agree that there is probably some kind of bias inherent in this, especially in the negative correlation between BABIP and strikeout rate. If a rookie strikes out a lot, he’s going to have to hit the ball very hard when he does make contact in order to get enough hits to stick in the majors. However, we are not using K rate as a predictor of BABIP: it’s incorrect to say that hitters with high strikeout rates tend to have high BABIPs - rather, I’d say that hitters who are successful with high strikeout rates almost HAVE to have high BABIPs, otherwise they aren’t successful.

We aren’t claiming that a high K rate leads to high BABIP, but rather a high BABIP is necessary if a hitter is going to be successful with a high K rate.

