THE BOOK--Playing The Percentages In Baseball

Wednesday, January 07, 2009

Line Drives

By Tangotiger, 12:17 PM

Brian says:

In Arlington, a batter is 18% more likely to have a batted ball coded as a LD, which may have helped Milton Bradley to have the 2nd highest LD rate in 2008 - while in Minneapolis, it’s 20% less likely. Four of the lowest six LD rates belong to Michael Bourn, Geoff Blum, Ty Wigginton and Hunter Pence, and Minute Maid Park has the second lowest LD park factor at 0.82. This is not saying that Houston batters hit fewer line drives - it’s that Houston and its opponents both have 18% fewer balls scored as liners in Houston than they do on the road.

Right.  A “line drive” is not necessarily a line drive.  If hitters are showing as hitting 20% fewer line drives in the Metrodome than away from the Metrodome, we don’t know if it’s because the Metrodome depresses LD rates, or if it’s because the scorer in Minnesota is depressing it.  Since it makes a huge difference when looking at LD and FB rates, then you need some sort of park factor to normalize the data.  If there is a systematic bias (say when it’s a FB pitcher, or when the balls are hit to RF), then you need a more sensitive normalizer.
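As a rough illustration of the kind of park factor being described, here is a minimal Python sketch. The function names, the counts, and the 50/50 home/road split are all assumptions made for the example; this is not Brian's actual calculation.

```python
def ld_park_factor(home_ld, home_bip, road_ld, road_bip):
    """Ratio of the LD rate in a team's home games to the LD rate in its road games.

    Both teams' batted balls are pooled in each sample, so the hitters and
    pitchers are roughly the same at home and away; what differs is the park
    (and its scorer).
    """
    home_rate = home_ld / home_bip
    road_rate = road_ld / road_bip
    return home_rate / road_rate


def normalize_ld_rate(observed_rate, park_factor, home_share=0.5):
    """Remove the park/scorer effect from a player's season LD rate.

    home_share is the assumed fraction of the player's batted balls hit at
    home; only that fraction is exposed to the park factor.
    """
    blended_factor = home_share * park_factor + (1 - home_share) * 1.0
    return observed_rate / blended_factor


# Made-up counts, chosen only so the factor comes out near 0.82
pf = ld_park_factor(home_ld=820, home_bip=5000, road_ld=1000, road_bip=5000)
adjusted = normalize_ld_rate(0.160, pf)
print(round(pf, 2), round(adjusted, 3))
```

A single multiplier like this only handles a bias that applies evenly to every batted ball in the park. A systematic bias tied to pitcher type or field location would need separate factors for each split.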

All to say that if you have subjective data, then you need to determine if the discrepancy in the data is random or systematic, and then apply your adjustments accordingly.

Taking a guess, I have to believe this is a scorer issue.  A line drive is really a batted ball that leaves the bat at a certain angle, at a certain velocity.  I don’t see how the park you are in would change those things, and so whether a ball is a LD, FB, or GB.  I can see how the scorer can be influenced by the positioning of the fielder (and worse, by whether the fielder caught the ball or not) when trying to assign a batted ball code.


#1          (see all posts) 2009/01/07 (Wed) @ 13:54

Wouldn’t surprise me if there’s a small bias due to park.  I think Greg from hittrackeronline once mentioned that temperature and humidity could have a +/- 10 feet effect on a home run.  Texas of course is at one far end of that spectrum, so it wouldn’t surprise me if a liner gets a few feet farther before hitting the ground.  I’d bet a lot of scorers have a rule in their mind about when a ball touches down, to distinguish between a liner and a grounder.


#2    Guy      (see all posts) 2009/01/07 (Wed) @ 14:13

I agree this is mostly scorer bias, and I think the caught/not caught factor may be a big contributor.  A quick test would be to check the correlation of LD% with the out% for FBs and GBs by park.  I think you’ll find a strong correlation, suggesting that ambiguous BIP are coded LDs if they aren’t caught, and vice-versa (by some scorers).  Or, if BIP are being coded correctly, then there should be no correlation. 

It would also be interesting to see the LD correlation with FB% and GB%.  In high-LD parks, are these LDs coming out of the FB or GB column. I’d think mostly from FBs....
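The quick test suggested above could be sketched roughly as follows. The per-park numbers are hypothetical placeholders, not real data, and `pearson_r` is just a hand-rolled helper for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-park values: LD% and the out rate on balls coded as FB
ld_pct = [0.17, 0.19, 0.21, 0.23]
fb_out_pct = [0.74, 0.77, 0.77, 0.80]

r = pearson_r(ld_pct, fb_out_pct)
print(round(r, 2))
```

If uncaught fly balls are being recoded as liners in some parks, the balls left in the FB column there would be outs more often, so a strongly positive r would point toward scorer bias. A value near zero would be consistent with the codes being assigned independently of the outcome.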


#3    weskelton      (see all posts) 2009/01/07 (Wed) @ 14:14

This doesn’t surprise me in the least.  It seems like there should probably be a rule that determines line drives by measuring (horizontal distance / time).  We won’t truly have objective data until the arrival of hit f/x.  Eventually we will discover that this whole “Derek Jeter isn’t a good fielder” stuff is nothing but a conspiracy!

Ok, well, maybe we won’t.  ;)


#4    Colin Wyers      (see all posts) 2009/01/07 (Wed) @ 15:06

The key, I think, is to see if this bias persists among scorers. Check the FB/GB/LD rates between BIS/STATS/Retro/MLB and see what park effects are shared between scorers and which aren’t.


#5    Mike Fast      (see all posts) 2009/01/07 (Wed) @ 15:28

Guy, I agree that the caught/not caught factor is very important.  If you look at a chart of the number of fly balls and line drives versus radial angle, the angles which correspond to the gaps between the fielders show an increase in the percentage of line drives and a corresponding decrease in the percentage of fly balls.


#6    Guy      (see all posts) 2009/01/07 (Wed) @ 15:33

Mike:
Great insight.  Can you post link to such a chart?


#7    Matt Mitchell      (see all posts) 2009/01/07 (Wed) @ 15:42

I think we could answer if this were a scorer or park bias if Retrosheet had a way to identify its volunteers who score/code each game. At least until hitF/X finally debuts. I thought the same thing as Tango about this being a scorer issue before I made my first comment on Brian’s post.


#8    MGL      (see all posts) 2009/01/07 (Wed) @ 15:51

I can certainly check the differences in parks between the BIS and the STATS data.  They should be using different scorers.


#9    Andy L      (see all posts) 2009/01/07 (Wed) @ 16:22

What’s a line drive? How do BIS and STATS define a line drive? What about a “scorcher”?

Once we have hitF/X, hopefully we can do away with these arbitrary terms..


#10    Tangotiger      (see all posts) 2009/01/07 (Wed) @ 16:45

I know I sound like a broken record, but a stopwatch and distance travelled is enough for me.  What we care about is how much time it takes for the ball and the fielder to intersect.
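To make the stopwatch idea concrete, here is a small Python sketch of how hang time and distance could replace a LD/FB label. The fielder speed and reaction time are assumed round numbers for illustration, not measured values.

```python
def fielder_margin(hang_time_s, fielder_distance_ft,
                   fielder_speed_ftps=27.0, reaction_s=0.5):
    """Seconds to spare (positive) or short (negative) for the fielder.

    fielder_speed_ftps and reaction_s are hypothetical defaults; a real
    model would estimate them from data.
    """
    time_needed = reaction_s + fielder_distance_ft / fielder_speed_ftps
    return hang_time_s - time_needed


# Two balls landing 54 feet from the fielder: one hangs up, one does not
margin_fly = fielder_margin(hang_time_s=4.5, fielder_distance_ft=54)
margin_liner = fielder_margin(hang_time_s=1.5, fielder_distance_ft=54)
print(margin_fly, margin_liner)
```

Under these assumptions the first ball is an easy play and the second is not, and no one had to decide whether either was a "liner" or a "fly".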


#11    Colin Wyers      (see all posts) 2009/01/07 (Wed) @ 17:46

That’d be great, MGL. I have a table at home of Retro and MLB Gameday batted ball codes indexed together that I’ll look at tonight after work (or at least I hope - I’m picking up a good deal of slack at work after this last round of layoffs).


#12    Rob      (see all posts) 2009/01/07 (Wed) @ 19:22

I couldn’t agree more with Tango in #10. If we had a stopwatch, we could avoid this ambiguity and, to paraphrase Michael Bluth, make liners and flies a complete thing of the past. No more liners and flies.

Serious question here ... how much would it cost for MLB to put someone (any baseball nut will do) in the press box with a stopwatch, diagram of Retrosheet codes, and a VCR/Tivo to double-check the tough ones? Say for half the games next year.


#13    Greg Rybarczyk      (see all posts) 2009/01/07 (Wed) @ 21:28

It would be simple to have someone time the flight of batted balls, and with 5 minutes of practice they could be generating data accurate within about 0.1 seconds, IMO (which is plenty good enough for the purpose).

I just got done tracking balls for the International High School Power Showcase at Tropicana Field, and was fortunate enough to have one of my assistants helping me.  On most hits we agreed within 0.03 seconds, and we’ve previously checked against frame counts and found we were within the precision of 1 frame (which is 1/30 of a second).  And this was with lots of hits flying into catwalks and hitting the U.S. flag hanging from the roof, etc.  In most stadiums, it would be even easier than it was at the Trop…

Long way of saying: there’s no technical barrier, someone just has to decide it’s worth it to do this.

By the way, while there I got to watch Bryce Harper hit a sequence of 6 homers in less than a minute that covered 460, 484, 485, 405, 502 and 477 feet.  OK, he used a metal bat, but on the other hand, he’s a 16 year old high school sophomore, so I think some enthusiasm for his future is warranted…


#14    studes      (see all posts) 2009/01/07 (Wed) @ 21:34

From my work in the 2006 THT Annual, there was a greater standard error in line drive rates per park than in GB or Outfield Fly rates. Not outrageously higher, but definitely higher.


#15          (see all posts) 2009/01/07 (Wed) @ 22:18

mlb.tv is $15/mo, $15 for the offseason ... watching condensed games with a stopwatch, the data could be entered into a table which links to a Retro events table


#16    john      (see all posts) 2009/01/07 (Wed) @ 22:38

It’s a shame........mlb tv used to let everyone watch condensed games.  We could have a bunch of ppl looking at the games.


#17    JD      (see all posts) 2009/01/07 (Wed) @ 23:09

Stopwatch, bad. Using video timecode, good. I wouldn’t trust a person with a stopwatch any more than I’d trust a person to eyeball what is or isn’t a line drive.


#18    Tangotiger      (see all posts) 2009/01/07 (Wed) @ 23:21

Studes, if ever you are going to reprint articles from the book to the site, start with Dudek’s article from the first annual.  It’s a must read.


#19    David Gassko      (see all posts) 2009/01/07 (Wed) @ 23:33

In my article on batted balls and park effects (http://www.hardballtimes.com/main/article/batted-balls-and-park-effects/), I found a slightly positive y-t-y correlation for line drive park factors, but small enough that the range of regressed factors was just 0.97-1.03.


#20    Colin Wyers      (see all posts) 2009/01/08 (Thu) @ 01:02

There doesn’t seem to be much of any difference between the coding of LD between Gameday and Retrosheet - AAE of only .3. (There are discrepancies on GB/FB data, largely grounders, but they don’t seem to be park specific - I think it’s due to missing batted-ball types for ROEs in the Gameday data.)

Honestly, to look at it, I’m not sure the lack of variation can be explained by anything other than the sharing of the same data source.


#21          (see all posts) 2009/01/08 (Thu) @ 06:26

And if you do post articles from older annuals, perhaps you should sell combined PDF e-books of old versions for pennies on the dollar? I’ve lived in Japan the entire time they’ve been being published, and while I love THT, $45 for shipping is a wee bit pricey.


#22    Guy      (see all posts) 2009/01/08 (Thu) @ 08:19

David:  I think there must be a discrepancy between your data and Brian’s (or, you regressed your data too much).  Based on his sample sizes, the standard deviation on LD% is about .0024, meaning 95% of parks should fall within .005 of the mean.  Translated, that means 95% of park factors should fall between 97.5 and 102.5.  But his data suggests a much larger spread.
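A back-of-the-envelope version of this kind of calculation might look like the sketch below. The 23,000 sample size is the per-park figure Guy cites in #36; the 0.19 LD rate is an illustrative guess, so the output lands near his numbers without reproducing his exact inputs.

```python
from math import sqrt


def binomial_sd(p, n):
    """Standard deviation of an observed rate when the true rate is p over n trials."""
    return sqrt(p * (1 - p) / n)


# Assumed inputs: 23,000 is the per-park sample size cited in #36;
# the 0.19 LD rate is an illustrative guess, not a figure from Brian's data.
ld_rate = 0.19
sample_size = 23000

sd = binomial_sd(ld_rate, sample_size)
half_width = 2 * sd                  # roughly 95% of parks, by chance alone
pf_low = 1 - half_width / ld_rate    # expressed as a park factor
pf_high = 1 + half_width / ld_rate
print(round(sd, 4), round(pf_low, 3), round(pf_high, 3))
```

This treats each park's LD% as a single binomial sample. A park factor is a ratio of two samples (home and road), so the true random spread would be somewhat wider than this.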


#23          (see all posts) 2009/01/08 (Thu) @ 08:36

One thing I did not do is recurse the park factors. When I did them before, I ran the procedure twice more adjusting the road parks.


#24    weskelton      (see all posts) 2009/01/08 (Thu) @ 12:04

Question for MGL…

I was obviously joking about the Derek Jeter fielding conspiracy, but it did get me thinking.  I know that UZR adjusts for park.  I’m wondering if this type of scorer bias would be absorbed and thus already covered in your park adjustments or is this potentially something else that would need to be accounted for.

I hope the question makes sense.


#25    Guy      (see all posts) 2009/01/08 (Thu) @ 12:12

Using Brian’s data (all numbers approximate), I get a ratio of observed SD (.019) to random SD (.0026) for LD% of about 7.2:1.  That suggests r=.98 and virtually no regression is needed to get the true rates in these parks.  So I’m thinking David used too much regression in his THT article (maybe regressing each year separately?), or his data showed MUCH less variance in LD%.
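One common way to get from an SD ratio to an r is to treat r as the share of observed variance that is not random noise. The sketch below assumes that is the calculation being used here; the 0.82 factor is Minute Maid's figure from the original post.

```python
def reliability(observed_sd, random_sd):
    """Share of the observed variance that is true variance rather than noise."""
    return 1 - (random_sd / observed_sd) ** 2


def regress_toward_mean(observed, r, mean=1.0):
    """Shrink an observed park factor toward the mean by (1 - r)."""
    return mean + r * (observed - mean)


r = reliability(observed_sd=0.019, random_sd=0.0026)
regressed_pf = regress_toward_mean(0.82, r)
print(round(r, 2), round(regressed_pf, 3))
```

With r this close to 1, a factor of 0.82 barely moves after regression, which is why these SDs are hard to square with a regressed range of only 0.97 to 1.03.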


#26    Tangotiger      (see all posts) 2009/01/08 (Thu) @ 12:52

I know we’re trying to be facetious with Jeter, but I think the Jeter thing was pretty conclusive using WOWY, in the 2008 Hardball Times, wasn’t it?

Are we going to think that the hitters try to hit fewer balls to Jeter than if it was another SS?  That pitchers try to pitch in such a way that they allow fewer balls hit to Jeter than if they had other SS?  That the groundskeeper makes the job harder when Jeter is in the park, than when he is not?

Certainly, given one year, these breaks don’t even out, and so, you have to be a bit careful.  But, after 50,000 balls in play, can’t we presume that the batters won’t hit differently if Jeter is there or not, and pitchers won’t pitch differently if Jeter is there or not?


#27    weskelton      (see all posts) 2009/01/08 (Thu) @ 13:26

Tom,

Yes, the WOWY article on Jeter was very convincing.  I am totally joking.  So please disregard any comments I’ve made about a Jeter conspiracy.

But I think the question to MGL still stands.  Would the park adjustments made by UZR already account for any potential scorer bias?  I guess this would assume that the scorer bias is applied to both home and road teams alike.


#28    Tangotiger      (see all posts) 2009/01/08 (Thu) @ 13:46

It would depend on how the park adjustment is made.  After all, if the systematic bias applies only to balls hit 200 to 250 feet in the air, then a general park factor might not capture that bias.

A clearer example is Fenway’s wall.  One would think that giving a general LF park factor would handle that situation.  But if there is a further non-random bias (say that Red Sox CF play differently at home because of the wall), then the park factor won’t work as well.

***

Yeah, I just wanted to use your Jeter comment to make a general point, that no matter what we try to do as scorers in terms of trying to provide better context, if you have 50,000 balls in play hit by guys who hit a million balls in play, then we don’t need to have fine-tuned FIELDf/x information.  The power of FIELDf/x will come when the breaks don’t even out, and that’s at the zero- to two-year time frame.


#29    studes      (see all posts) 2009/01/08 (Thu) @ 15:29

Believe it or not, you can get a couple of the old Hardball Times Annuals in PDF form from Wowio.  Here’s the link:

http://www.wowio.com/index.asp

Just search for “Hardball Times” when you get there.  ACTA puts those up about a year after they’re published in book form.

I’ll look into posting the original THT Annual there, too. Good thought about Robert’s article, Tango.


#30    studes      (see all posts) 2009/01/08 (Thu) @ 16:58

You can now download a PDF of the 2004 THT Annual for free from our FTP site.  The URL is:

ftp://ftp.hardballtimes.com/tht_2004.pdf

It includes Robert Dudek’s groundbreaking (can I say that?) article on hang time.


#31          (see all posts) 2009/01/08 (Thu) @ 19:11

A year late is better than never—and thank you for posting the link on your site. Have you considered putting up old versions of your annuals together with a little press kit about the current year in order to raise interest? If you have enough people talking about the old annual alongside a link to the new one, you may be able to boost sales.


#32          (see all posts) 2009/01/08 (Thu) @ 19:13

Sorry for the double-post, but I also think you may want to link to the download on THT itself, as it doesn’t seem to be anywhere on the front page. I think that there are a good amount of people reading THT who would be interested in the same.


#33    studes      (see all posts) 2009/01/08 (Thu) @ 19:17

Sal, I posted a THT Live note about it, and I will put the links on our “store” page. I’ll mention your PR idea to ACTA.  Thanks.


#34    MGL      (see all posts) 2009/01/08 (Thu) @ 21:16

Yeah, Tango is right.  To some extent the park factors adjust for scorer bias, but to some (probably large) extent they don’t.  For one, I regress the park factors by the seat of my pants, using what I know about the park.  If there is severe scorer bias in one or more parks, I am likely to regress those differences a lot (since they would have nothing to do with the park, per se).  For another, I use very basic park factors, basically a number around 1.0 that I use as a multiplier for the “catch rates” for ground and air balls (for example, in LF at Fenway, the catch rate on air balls simply gets divided by .78).  That kind of PF is not going to have much of an effect, if any, on biases like mis-classifying the type of batted ball or even distance or direction.  And in addition to bias, there could also be bad and good scorers, such that the distances and classifications of some scorers are going to be better than others.  That is going to wreak havoc on the results.


#35    David Gassko      (see all posts) 2009/01/09 (Fri) @ 03:56

Guy,

You’re doing something wrong. Maybe you’re counting line drives or batted balls as your denominator in calculating random variation? That would do it—and it’s not the right way.

Think about it this way: The y-t-y correlation between run park factors is something like 0.6—how could it be 0.98 for line drives?


#36    Guy      (see all posts) 2009/01/09 (Fri) @ 11:14

“You’re doing something wrong.”

Always a possibility!  I was looking at LD%, LDs in numerator and batted balls as the denominator (actually, my denominator was Brian’s PA number but I should be using BIP—but that won’t change results much).  I can see that with a park factor you’re looking at the ratio of two samples (home and road), which increases the random error somewhat.  How would you calculate the expected random error? 

On correlation, I was estimating the correlation based on a sample size of 23,000 (Brian’s data), so it would of course be higher than a y-t-y correlation.  However, I would still expect a y-t-y correlation for LD factor of something like .7 or higher, given the spread he’s found with 6 years of data.

Even if I’ve underestimated the random error, Brian’s results still seem completely inconsistent with your highly-regressed estimates.


#37          (see all posts) 2009/01/09 (Fri) @ 15:43

For the LD calculations, I used LD+FB as the denominator. I figured LDs were a subset of all FB, so compared the pct of all FB that were called LD.

I didn’t use any regression in the LD table, but I did list the weighted PA used in the park factor calculation. Normally I like to use samples as large as possible, with regression to handle those with smaller samples. In my batter projections, I tested various values and found that a past-year weighting exponent of 0.7 plus 150 league-average plate appearances minimized the RMS error.
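One possible reading of that weighting scheme is sketched below: each season is discounted by 0.7 per year back, and 150 PA of league-average performance are added as regression. The function, its inputs, and the sample rates are all hypothetical; the comment does not spell out the exact formula.

```python
def project_rate(seasons, league_rate, decay=0.7, regression_pa=150):
    """Weighted, regressed rate projection.

    seasons: list of (plate_appearances, rate), most recent season first.
    decay and regression_pa follow the 0.7 and 150 quoted above; how they
    combine here is an assumption.
    """
    weighted_pa = 0.0
    weighted_sum = 0.0
    for years_back, (pa, rate) in enumerate(seasons):
        weight = decay ** years_back
        weighted_pa += weight * pa
        weighted_sum += weight * pa * rate
    return (weighted_sum + regression_pa * league_rate) / (weighted_pa + regression_pa)


# Made-up two-season history and league average
projection = project_rate([(600, 0.300), (500, 0.280)], league_rate=0.260)
print(round(projection, 3))
```

The same shape would carry over to park factors: a park with many seasons of data keeps most of its observed value, while a park with a short history is pulled strongly toward the mean.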


#38    Colin Wyers      (see all posts) 2009/01/09 (Fri) @ 16:23

Sometimes a liner is confused with a grounder, not a fly ball - the difference between a liner and a shallow grounder is where it touches; if it hits the infield dirt first it’s a grounder, if it hits the outfield grass first it’s a liner.


#39    Guy      (see all posts) 2009/01/09 (Fri) @ 18:08

Using LD+FB as the denominator would certainly reduce sample size.  If your average PA is 23,000 per park, that probably means around 9500 LD+FB.  I’m not sure how to calculate sample error on that, but it still seems to me that Brian’s reported spread is highly significant, with a much larger true spread than David’s regressed results.  Not sure which is correct.

Brian:  can you check to see if there’s a correlation between LD/(LD+FB) and the non-HR FBout%?  One potential cause of a high LD% would be scorers identifying FB hits as LDs.


#40          (see all posts) 2009/01/10 (Sat) @ 12:05

I used FLD_CD (retrieving fielder), which is 0 for out of the park HRs, and also for ground rule doubles. All other hits and all outs are then coded GB, FB, LD or PU.

Here is a more full listing of the LD data
http://spreadsheets.google.com/ccc?key=pLg_vfW0QCD_9Wg6wYUtHEg&hl=en


#41    KJOK      (see all posts) 2009/01/10 (Sat) @ 17:29

I could be wrong about my understanding of how the data is collected, but it was my understanding that:

1. Retrosheet - Collected by volunteers who work only at a specific stadium.  Possibly a small group (2-3?) each year at each site.  At some stadiums, may be sitting behind home plate, but at others, may be down 1st or 3rd base lines, or even in upper deck?  Recording events as they happen, then input later?

2.  Gameday - Similar to Retrosheet, only more consistent vantage point in press box close to behind home plate.  Events recorded AND entered live.

3.  BIS - For MLB, video tape of games from all stadiums are reviewed by small group in Pennsylvania (2-3?), then input. For Minors, same as Retrosheet.

If this is true, I would expect rather high scoring bias in the Retrosheet data, but MUCH LESS scoring bias in the BIS data.


#42          (see all posts) 2009/01/10 (Sat) @ 17:34

KJ, I would say that there would be less VARIANCE instead of less BIAS - the 2 or 3 people scoring at BIS may very well be biased, but if they agree with each other there will be less variance.


#43    KJOK      (see all posts) 2009/01/10 (Sat) @ 17:43

Brian:

Yes, you are absolutely correct.  I think what would cause the most problems in analysis would be the variance in the bias.  As long as there is little variance in the bias, you should still be able to get some good analysis out of the data.

