
THE BOOK--Playing The Percentages In Baseball


Thursday, September 20, 2007

Data recorders fit square pegs into round holes

By Tangotiger, 09:44 AM

David’s fine article will serve as the impetus for my diatribe:


A “line drive” is not a “line drive”, as if it were some discrete play, separate from a GB or FB.  There is a continuum, with non-static demarcation points.  Maybe we can agree on 70% of line drives being line drives.  But a LD that lands just outside the infield dirt could easily be considered a “GB”.  And a LD off the wall could easily be a “FB”.  And two balls that land 200 feet apart should hardly be lumped into the same category, as they are with “LD”.  Obviously, overall, this effect is muted, but I would suspect that if you looked at the average distance of line drives by Juan Pierre and Albert Pujols, it would make a huge difference.  There are three parameters that we care about:
1. distance and time of ball from bat to first object (ground, base, player, wall, etc)
2. distance and time of ball from first object to last object
3. distance and time of ball from bat through the infield

Ideally, that number 2 would be “first to second”, “second to third”, .... “(n-1)th to nth”.  But, I’d be happy with the above list.  Those pretty much describe what I want (the third one is so that I know whether a ball that a player caught on the fly might, had he missed it, have landed 100 feet behind him… asking for height of ball is probably asking for too much).
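To make the three parameters concrete, here is a minimal sketch of what one batted-ball record could look like if a recorder logged only distance and time.  The class names, field names, and sample numbers are all hypothetical, made up for illustration; none of them come from STATS, BIS, or MLB.com.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """One leg of the ball's path: how far it went and how long it took."""
    distance_ft: float
    time_s: float

    @property
    def avg_speed_fps(self) -> float:
        # Average speed over the leg, in feet per second.
        return self.distance_ft / self.time_s


@dataclass
class BattedBall:
    """Hypothetical record holding only the three parameters from the list."""
    bat_to_first_object: Segment              # parameter 1
    first_to_last_object: Optional[Segment]   # parameter 2 (None if the ball stopped there)
    bat_through_infield: Optional[Segment]    # parameter 3 (None if it never left the infield)


# Two made-up balls that a stringer might both call a "line drive".
shallow = BattedBall(Segment(160.0, 1.6), Segment(30.0, 1.0), Segment(130.0, 1.3))
deep = BattedBall(Segment(360.0, 3.0), Segment(15.0, 0.5), Segment(130.0, 1.0))

for name, ball in (("shallow", shallow), ("deep", deep)):
    leg = ball.bat_to_first_object
    print(name, leg.distance_ft, "ft in", leg.time_s, "s")
```

No “LD”, “GB”, or “FB” label is stored anywhere; an analyst can derive whatever categories he wants from the raw distances and times afterwards.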

Instead, what we get are no time parameters at all.  Hang time, people.  Stopwatches.  http://www.HitTrackerOnline.com has no problems with that.  All we can do is infer time based on whether the ball was hard hit or not, and whether it was noted as a line drive or fly ball.  We can’t tell the kind of groundball.  There is just so much inference going on, and it’s ripe for bias in the data.

Distance and time.  That’s all a data recorder needs to record.  A data recorder should not analyze the data, and try to fit the square peg into a round hole.  Let data analysts like me scr-w that up for them.  It’s shocking to me how little the major outfits (STATS, BIS, MLB.com) have changed their data recording with respect to batted balls.  And they’ve all been told.

#1    Mike Green      (see all posts) 2007/09/20 (Thu) @ 12:58

I agree 100%, Tango. I e-mailed David about this, but you have covered it completely. 

Project Hang Time sounds a bit too much like something that Pinochet might have ordered.  Other catchy names would be welcome.


#2    MGL      (see all posts) 2007/09/20 (Thu) @ 13:14

With all due respect to DG, who is a fine thinker, researcher, analyzer, etc., I was not particularly fond of that article.

For one thing (and no small thing), as was already pointed out in the BallHype comments, as soon as DG reported a decent y-t-y correlation for line drive value, the first thing I thought was, “Of course.” As Tango pointed out, Pujols’ line drives are going to be harder and longer than Eckstein’s.  Not to mention the line drive home runs which are virtually only hit by power hitters.  Why DG was surprised at this result I don’t know.  Must have been a bad day for him.

And if he thought (likely incorrectly) that it was mostly due to park effects, why not look at players who switched parks from one year to another?

Another thing I did not like were his comments about Ichiro and ground balls.  One has to be very careful with drawing certain conclusions or inferences based on correlations between two sample groups of data.  For example, let’s say that all the players in your sample have about the same speed, save one (we’ll call him Ichiro).  Overall, your correlation on ground ball value is going to be near zero (it will probably still be a little positive, as the better hitters will hit harder ground balls), but that one player (the fast one) will consistently have a higher GB value than everyone else.  IOW, a correlation based on a sample of players does NOT necessarily tell you anything about an individual player in that sample, especially if there are a few players in that sample who have unusual qualities that affect the performance you are measuring.  Now, if you don’t know anything about an individual player, then you can correctly conclude that that player likely has little skill wrt the performance in question.  But if you know something about that player (like he hits lots of ground balls to the opposite side, or he is very fast going from home to first), then you can throw the correlation out the window.
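The point about one unusual player inside a low-correlation sample can be checked with a quick simulation.  This is only a sketch under made-up assumptions: 300 ordinary players with identical true ground ball value, one fast player whose true value sits four noise standard deviations higher, and random year-to-year noise.  The function names and numbers are hypothetical and are not taken from David’s data.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n_ordinary=300, fast_edge=4.0, seed=2007):
    """Two seasons of observed GB value: true value plus unit-variance noise."""
    rng = random.Random(seed)
    # Assumption: every ordinary player has the same true value (0.0);
    # the last player in the list is the fast one.
    true_value = [0.0] * n_ordinary + [fast_edge]
    year1 = [t + rng.gauss(0.0, 1.0) for t in true_value]
    year2 = [t + rng.gauss(0.0, 1.0) for t in true_value]
    return year1, year2


year1, year2 = simulate()
r = pearson(year1, year2)
print("year-to-year correlation across all players:", round(r, 3))
print("fast player, year 1 and year 2:", round(year1[-1], 2), round(year2[-1], 2))
```

Under these assumptions the sample-wide correlation comes out close to zero, yet the fast player should land well above the pack in both seasons, which is exactly the case where the correlation tells you nothing about him.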

Finally, David also missed the reason for the fairly low correlation wrt bunting for a base hit.  Game theory!  The more a player bunts, the more the defense must play for the bunt, and the less successful he will be (relative to a stable defense).  IOW, this will push the regression line towards a slope of zero, which is (at least somewhat) the case with all things that implicate game theory.  For example, if a player is only a decent bunter and not that fast, he can bunt every once in a while and probably succeed 50% or more of the time.  If he bunts any more than that, the surprise element will decrease and his success rate will probably go below the BE point.  When Pierre is at bat, the third baseman is playing so close that no matter how good a bunter he is or how fast he is, his bunt success rate is only going to be 45% or so, still greater than HIS BE rate.  So if the guy who is a poor bunter or not fast bunts once or twice a year and is successful 55% of the time and the fast, good bunter bunts 30 times a year and is only successful 45% of the time, what do you think the correlation is going to be?  In fact, I am surprised it was that high.

Tango, one of the reasons why the data collecting companies (STATS, BIS, etc.) do not use ideal techniques is that they are not yet doing it for the hard core researchers.  They are mostly doing it for (read: selling it to) the teams, which are not particularly interested in hang times, etc.  Knowing that a batted ball is a “line drive” in the “opinion” of some stringer is just fine for them.


#3    tangotiger      (see all posts) 2007/09/20 (Thu) @ 13:23

MGL (last paragraph), I’m not sure that’s the case.  At least two “insiders” have told me about the GIGO issue.  I suspect that anything bought from the Outfits comes with a certain large degree of uncertainty.  What these companies need is an independent body (like readers of this blog) to certify their data.  We’re like Eliot Ness’ Untouchables (more like that accountant, than Andy Garcia or Sean Connery).


#4    Rally      (see all posts) 2007/09/20 (Thu) @ 15:03

It was mentioned in Moneyball that the A’s have their own in-house fielding system.  I would suspect the Red Sox have something as well, and my guess is that they have some interns watching all the games on extrainnings and recording their own data, stopwatches in hand.  And they’ll never publish any of it.

Teams like that aren’t going to bother with STATS or BIS.


#5          (see all posts) 2007/09/20 (Thu) @ 23:08

Time is crucial for grounders, too.  Aside from bad hops (which are becoming uncommon), what determines the difficulty of fielding a grounder boils down to the location of the grounder, the initial location of the fielder, and the time it takes the ball to get there from home plate.  So, if we time grounders, we ought to be able to get rid of the need to qualitatively classify grounders as hard, medium or soft, and thus have a better idea of infield play as well…
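As a rough sketch of that idea, the difficulty of a grounder can be expressed as the average speed the fielder would need in order to reach the ball in time, given only his starting spot, the spot where the ball can be fielded, and the ball’s travel time.  The function name, the optional reaction-time parameter, and the sample coordinates are hypothetical, chosen only for illustration.

```python
import math


def required_speed_fps(fielder_xy, intercept_xy, ball_time_s, reaction_s=0.0):
    """Average speed (ft/s) the fielder needs to reach the interception point.

    fielder_xy and intercept_xy are (x, y) positions in feet; ball_time_s is
    the time the ball takes to travel from home plate to the interception point.
    """
    distance_ft = math.dist(fielder_xy, intercept_xy)
    available_s = ball_time_s - reaction_s
    if available_s <= 0:
        # No time left to move: only reachable if he is already standing there.
        return 0.0 if distance_ft == 0 else math.inf
    return distance_ft / available_s


# Same fielder, same spot 15 feet to his left, two different travel times.
fielder = (0.0, 0.0)
spot = (-15.0, 0.0)
print("hard grounder (1.0 s):", required_speed_fps(fielder, spot, 1.0), "ft/s")
print("soft grounder (2.0 s):", required_speed_fps(fielder, spot, 2.0), "ft/s")
```

The timed version replaces the hard/medium/soft label with a number that can be compared across fielders and plays.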

