THE BOOK--Playing The Percentages In Baseball


Thursday, April 14, 2011

Batted ball data: good or bad?

By , 12:59 AM

Here is a quote from this article entitled “Batted Balls and Home Runs” by Studes on BP.  To be fair, the article is not about trashing the integrity of batted ball data.

I’ve played with batted ball statistics for a while now, just about as long as the Hardball Times has been around. Batted ball stats, as compiled by Baseball Info Solutions, are just plain cool. Knowing how often batters hit line drives, or how often pitchers force infield flies, adds a new understanding to the game and results in metrics like xFIP and xBABIP and xWHIP—the “x stats"—as well as advanced fielding stats.

But lately, batted ball stats have been taking some hits (get it?). Colin Wyers (formerly of THT and now researching everything at Baseball Prospectus), has identified several reasons why data recorders in different parks might interpret the angle of a batted ball differently. Colin has only gotten more skeptical over time; I think he recently referred to line drives as “lie drives” somewhere (Twitter?).

It’s not that we don’t generally know what a line drive is. It’s that the definition between different batted ball types is gray, and when you start looking at small samples of batted ball stats—for individual batters, say, or pitchers—there may be some significant differences in how specific balls are classified. You may not know what you think you know.

Many people, including the authors of “Bad Hops” (the infamous Hirsch Brothers), have lambasted advanced metrics that use detailed batted ball data, citing the bad quality of the data as one of the reasons why these metrics are no good.  Of course describing the data as being of “bad quality” is not very helpful, since the quality of the data is a continuum.  As well, it is usually small minds who reduce things to an either/or, black/white, or bad/good dichotomy in order to prove or support a thesis.  Things are rarely that simple.

In any case, this article gives us some good data (which should not surprise you) which supports and evinces something I have been trumpeting for a long time:

We can start out with a defensive metric that simply gives credit for an outfielder catching a ball hit in the air or not, whether those air balls are in a particular zone assigned to that particular fielding position or whether we use the parameters of the batted ball location (again, not perfect data) to determine how often a fielder should have and does catch a particular ball or set of balls (along with any other parameters we may choose to use, like the perceived speed of ball, etc.).

That type of system would be better than a simple range factor, and would also (presumably) be better than a system which doesn’t have batted ball data but tries to infer it from more traditional data like the handedness and G/F ratio of the pitcher on the mound (and other things), like TotalZone or DRA.

Now, if we wanted to do even better than that (using location of air balls), we can try and break down those air balls into categories which approximate and reflect how long they are in the air, and thus how difficult they may be to catch, given a certain location (and the presumptive starting location of the fielder).

Now here is the important thing:  It does not matter how you break down those air balls or how much integrity the resultant data contains, as long as what you categorize as a short time in the air is indeed shorter than what you characterize as a medium time in the air, which in turn ends up being in the air for a shorter time than what you categorize as the air ball with the longest hang time.  IOW, even if there are all kinds of mistakes, even horrible ones, what you end up with is ALWAYS going to be better than not doing any categorization (of hang time) at all!
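A minimal sketch of the kind of metric described above, assuming each play is recorded as a (fielder, zone, hang-time bucket, caught) tuple. The tuple layout, the zone and bucket labels, and the function name `plus_minus` are all hypothetical, chosen only for illustration; this is not how UZR or any published system is actually computed.

```python
from collections import defaultdict

def plus_minus(plays, fielder):
    """Plays made above or below expectation for one fielder.

    plays: list of (fielder, zone, hang_bucket, caught) tuples (hypothetical layout).
    """
    # League-wide catch rate for every (zone, hang_bucket) cell.
    totals = defaultdict(lambda: [0, 0])  # cell -> [catches, chances]
    for _, zone, bucket, caught in plays:
        cell = totals[(zone, bucket)]
        cell[0] += int(caught)
        cell[1] += 1

    # Credit the fielder with (actual - expected) on each of his chances.
    score = 0.0
    for who, zone, bucket, caught in plays:
        if who == fielder:
            catches, chances = totals[(zone, bucket)]
            score += int(caught) - catches / chances
    return score

# Toy data: two fielders, one zone, two hang-time buckets.
plays = [
    ("A", "CF-deep", "long", True),
    ("B", "CF-deep", "long", True),
    ("A", "CF-deep", "short", True),
    ("B", "CF-deep", "short", False),
]
print(plus_minus(plays, "A"), plus_minus(plays, "B"))
```

Dropping the bucket from the cell key gives the location-only version, so the two variants can be compared on the same plays. Note that the league rates here include the fielder's own chances, which a real system would probably want to handle more carefully.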

For example, let’s say that I decide to split up all the air balls into 3 categories, which is typical of some of the batted ball systems - pop fly, fly ball, and line drive (BIS uses more categories, adding fliner fly and fliner line drive).  And let’s say that I am horrible at doing the categorization - I am half blind, I pay little attention while watching the game, and I can’t read my own handwriting after the session is over (and I am “strung out” half the time while “stringing”), such that 95% of my categorization is completely random (what I call a fly ball is equally likely to be any type of air ball), and only 5% are reasonably accurate.  You may think that this is a total travesty and anyone relying on a system that uses such data is out of their mind.  And you would be completely wrong.  A system that uses such data will be BETTER than a system that treats all air balls equally.  In fact, data from this irresponsible stringer might look something like this, if we had actually timed each batted ball in the air:

Type of air ball    Average time in the air, adjusted for distance

Fly balls           3.5 seconds
Pop flies           3.6 seconds
Line drives         3.4 seconds

Now, that is not data that is going to be real helpful, and if we had perfect categorization, or even a much better stringer, we might see something like:

Fly balls           3 seconds
Pop flies           4 seconds
Line drives         2 seconds

But, with our horrible stringer and data, we are still BETTER off than not using any sub-categories at all and treating all air balls alike.  This is a very important point when it comes to rebutting the critics of advanced metrics when they start attacking the metric via the integrity of the data.
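The claim about the ordering of the categories can be checked with a small simulation. This is only a sketch under stated assumptions: the three types are equally common, true hang times are normally distributed around the 2, 3, and 4 second means from the second table with a spread of 0.5 seconds (an invented figure), and the stringer's mistakes are purely random. It shows that the labels still carry some hang-time signal; it does not show how much that signal is worth to a fielding metric in a given sample.

```python
import random

# True mean hang times, taken from the "perfect stringer" table above.
TRUE_MEAN_HANG = {"line drive": 2.0, "fly ball": 3.0, "pop fly": 4.0}
TYPES = list(TRUE_MEAN_HANG)

def simulate(n_balls, accuracy, seed=0):
    """Average true hang time per *recorded* label for a noisy stringer."""
    rng = random.Random(seed)
    sums = {t: 0.0 for t in TYPES}
    counts = {t: 0 for t in TYPES}
    for _ in range(n_balls):
        true_type = rng.choice(TYPES)                     # assumption: equal frequencies
        hang = rng.gauss(TRUE_MEAN_HANG[true_type], 0.5)  # assumption: 0.5 s spread
        # With probability `accuracy` the label is right; otherwise it is a random guess.
        label = true_type if rng.random() < accuracy else rng.choice(TYPES)
        sums[label] += hang
        counts[label] += 1
    return {t: sums[t] / counts[t] for t in TYPES}

means = simulate(200_000, accuracy=0.05)
for label in TYPES:
    print(f"{label:10s} {means[label]:.3f} s")
```

With only 5% accurate labels the three averages sit very close together, so it takes a large number of balls before the ordering shows up reliably.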

By the way, here is what Studes found with respect to home runs as classified by BIS, where the height and hang time is known almost exactly through Greg R’s hit tracker web site:

Type           Tot    Apex   Dist   Ratio
Fliner liner     83     57    374    0.15
Fliner fly     1328     72    392    0.18
Fly            3151     95    400    0.24
Grand total    4562     87    397    0.22

The other thing is this: if you look at that chart, or even if the numbers were different but still in the same sequence, can you tell whether the data is good, fair, bad, excellent, or otherwise?  No.  I suppose if you could get the real distance, angle, and hang time data on every ball (and not just the HR), you could establish a baseline for “perfect data,” but even with that, or absent that, calling the data “bad” because it is far from perfect, and assailing the metric because of that, is not helpful and is in fact wrong, for the reasons mentioned above.


#1    Zac      (see all posts) 2011/04/14 (Thu) @ 08:38

On Football Outsiders, Aaron Schatz likes to say that “the best is the enemy of the better,” which is basically the same idea that you are espousing here.


#2    Tangotiger      (see all posts) 2011/04/14 (Thu) @ 09:52

I agree with everything MGL said except:

IOW, even if there are all kinds of mistakes, even horrible ones, what you end up with is ALWAYS going to be better than not doing any categorization (of hang time) at all!

The exception is with systematic biases.  If, for example, one particular stringer (but not all stringers) always marks as a flyball any 3-second ball that is caught and/or always marks as a linedrive any 3-second ball that is not caught, then we’ve got a problem.

As long as you have random noise, then what MGL is saying is true.  To further illustrate MGL’s point: let’s say you improperly marked all of the road games of all of the fielders, sometimes putting Chipper Jones on the road for the Nats, sometimes putting Zimmerman on the road for the Cubs, and so on.  (These things are done randomly.) So, half the data is junk, but you don’t know which half of the data.  You just know that half the data is pure junk.  That’s still OK!  You still have signal from half their games.  All you’ve done is increased the uncertainty of your estimate.

But, if it was systematic, if you swapped half of Chipper’s games for half of Zimm’s games, then that’s not ok.  That’s a systematic bias.

And sample size does not help systematic biases.  Indeed, (I think) sample size will make systematic biases worse!
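A short sketch of the difference between the two kinds of error, assuming each observation is the true value plus a constant bias plus normally distributed noise. The function name and all of the numbers are made up for illustration.

```python
import random
import statistics

def estimate_error(n, bias, noise_sd, true_value=0.0, seed=1):
    """Error of the sample mean when every observation carries `bias` plus noise."""
    rng = random.Random(seed)
    obs = [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
    return statistics.fmean(obs) - true_value

for n in (100, 10_000, 100_000):
    random_only = estimate_error(n, bias=0.0, noise_sd=1.0)
    with_bias = estimate_error(n, bias=0.1, noise_sd=1.0)
    print(f"n={n:>7}  random only: {random_only:+.4f}   with bias: {with_bias:+.4f}")
```

In this setup the random-only error shrinks toward zero as n grows, while the biased estimate settles near the bias itself. The bias does not grow in absolute terms, but it becomes a larger share of whatever error remains, which is one way to read the remark that sample size makes it "worse."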

The problem is that we haven’t determined the level of systematic biases present.  When we see Andruw Jones be +112 runs under one data provider and +0 runs under a second data provider, over 6 or 7 years, then we suspect that there is some sort of systematic bias.


#3    Peter Jensen      (see all posts) 2011/04/14 (Thu) @ 09:54

By the way, here is what Studes found with respect to home runs as classified by BIS, where the height and hang time is known almost exactly through Greg R’s hit tracker web site:

Height is hang time because gravity is a constant.  Greg doesn’t measure height, he computes it from hang time.
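As a rough illustration of that relationship, here is the textbook drag-free version, assuming the ball lands at the height it was hit from. Both are simplifying assumptions; real fly balls feel air resistance and home runs land at a different height, and this is not a description of how Hit Tracker itself does the computation.

```python
import math

G_FT_PER_S2 = 32.174  # standard gravity in feet per second squared

def apex_from_hang_time(hang_time_s):
    """Peak height in feet: the ball rises for half the hang time, so h = g * t^2 / 8."""
    return G_FT_PER_S2 * hang_time_s ** 2 / 8

def hang_time_from_apex(apex_ft):
    """Inverse of the above: t = sqrt(8 * h / g)."""
    return math.sqrt(8 * apex_ft / G_FT_PER_S2)

print(apex_from_hang_time(4.0))   # apex for a 4-second hang time
print(hang_time_from_apex(95.0))  # hang time for a 95-foot apex
```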


#4    MAH      (see all posts) 2011/04/14 (Thu) @ 10:16

Tango, I agree. As mentioned repeatedly throughout Wizardry, the essence of defensive evaluation is finding the least biased, most accurate predictors of expected plays at each position.  Wizardry discusses ways in which DRA and Sean Smith’s systems have biases.  I am currently working on a new version of DRA just for contemporary players that should be out next month.  ‘New’ DRA will eliminate the biases identified in Wizardry, except for park effects, which I will get to shortly.  That’s not to say new DRA will be the most _accurate_ defensive system. But for players with two years of performance, it might be.  And it will be open source and replicable.


#5          (see all posts) 2011/04/14 (Thu) @ 21:03

Reminds me of high school. I gave up a 350 foot flyball to center. The centerfielder spun in circles and failed to make the catch. We got into it in the dugout. He said it was a smoked bomb. I said it was a lazy 350 foot flyout that every competent centerfielder I had ever played with would have caught with ease. It went on and on, each one of us firmly convinced about the nature of how well the ball was struck, and the ease with which the ball should have been caught. Eventually, the coach came over and told us we both sucked and that we both needed to shut the eff up and that we were both shitty teammates for blaming each other. Coach told him it should have been caught but then told me that maybe he would have caught it if you weren’t boring us all to tears walking batter after batter and not striking them out. “You can’t expect everyone to be at full attention when you throw about 3 strikes an hour.” Then he benched both of us and made us run poles together for the rest of the game.


#6    MGL      (see all posts) 2011/04/14 (Thu) @ 21:52

Funny story, #5.  As far as DRA, I really liked the book (Wizardry) and I love DRA, however…

A batted ball system will ALWAYS (always, always, always) be better than a non-batted ball one, as long as everything else between the two systems is equal and there is no significant bias in the batted ball data that cancels out the value (we can debate all day about the bias).  There is no need to argue that point because it can be easily proven in one sentence.  Take your DRA.  Then simply find one player in all the data to whom DRA assigns X number of balls but the batted ball data knows with near 100% certainty that that player had X+1 balls hit in his vicinity.  Add that one ball and you now have a better system than DRA (obviously 99.9% of the value in the system is from DRA).  That proves my point.  It is like I always say about park factors.  Don’t like them?  Don’t trust them?  You want to use NO park factors?  I say, B.S.  I’ll see your no park factors and raise you Coors Field with a PF of 1.01 and Petco Park with a PF of .99.  Who wins? I do!

Same with DRA.  You give me your best DRA, non-batted ball system and then give me some batted ball data (location) from retrosheet, BIS, or STATS (or wherever). I will ALWAYS beat for DRA.  Always!


#7    MGL      (see all posts) 2011/04/15 (Fri) @ 01:49

"I will ALWAYS beat for DRA.”

That doesn’t sound right.  No “for” should be in that sentence…


#8    Colin Wyers      (see all posts) 2011/04/15 (Fri) @ 01:51

Okay, let’s do this. Hitting the high points:

* I should probably touch briefly on Studes’ study first. It’s diligent and I admire the spirit of inquiry behind it, but I think it’s wholly inapplicable to any of the material you present here, MGL. In short:

- The study focuses on home runs, by definition a batted ball that passes over the outfield wall. This gives everyone involved an objective point of reference to use for determining the location of the batted ball, not only in terms of depth and height but horizontal location as well. Conclusions about the accuracy of stringer-provided batted ball data are going to underestimate the amount of inaccuracy (both in terms of random and systemic error) because these are frankly the easiest air balls to score objectively.

- It’s also not a very representative set of batted balls in general - no infield flies and precious few liners/fliners/etc. The paucity of borderline classifications (given the apparent definitions here) significantly dampens the statistical power of the conclusions, and makes it hard to apply them to a broader spectrum of batted balls.

- As Studes notes,

Hit Tracker is based on people watching video, using stopwatches and then applying some advanced math. There’s some subjectivity and room for error there.

But let’s clarify here - by video we mean the commercial broadcast feeds, which are the same video feeds that BIS gets and the same feeds Studes consulted when checking over the data. In other words, we have not one but two common elements among the three observers (Studes, HitTracker and BIS): the video itself. Since the video source itself is one of our major suspected contributors to bias in batted ball data classification, comparing sources that share a common bias is going to understate the amount of error.

So I don’t think we gain any clarity in this discussion by bringing Studes’ data into it.

*

Many people, including the authors of “Bad Hops” (the infamous Hirsch Brothers), have lambasted advanced metrics that use detailed batted ball data, citing the bad quality of the data as one of the reasons why these metrics are no good.

- I haven’t read the book, but if that’s the spirit of their argument against UZR then they are of course totally correct that poor data quality is a serious problem with the metric. The fact that the problems with the data haven’t been seriously investigated or addressed is of course another.

*

Now here is the important thing:  It does not matter how you break down those air balls or how much integrity the resultant data contains, as long as what you categorize as a short time in the air is indeed shorter than what you characterize as a medium time in the air, which in turn ends up being in the air for a shorter time than what you categorize as the air ball with the longest hang time.

No.

*

And let’s say that I am horrible at doing the categorization - I am half blind, I pay little attention while watching the game, and I can’t read my own handwriting after the session is over (and I am “strung out” half the time while “stringing”), such that 95% of my categorization is completely random (what I call a fly ball is equally likely to be any type of air ball), and only 5% are reasonably accurate.  You may think that this is a total travesty and anyone relying on a system that uses such data is out of their mind.

Let’s leave aside the question of the state of mind of someone who would use this data - I frankly think that any practical gains from such data would be nonexistent, even if the 5% of non-random data was “perfect.”

But it’s a total non sequitur; the hypothetical data set you propose is, assuming I understand the terms of the hypothetical you propose, objectively better in quality than what we see in the stringer data currently being collected. What you’re basically saying here is that due to the placebo effect, giving sick people sugar pills is going to be more effective than giving them medicine - and then you’re going on to use that as a justification for treating patients with arsenic pills.

*

A batted ball system will ALWAYS (always, always, always) be better than a non-batted ball one, as long as everything else between the two systems is equal and there is no significant bias in the batted ball data that cancels out the value (we can debate all day about the bias).

Can we, please? It’s really the only interesting feature of the conversation - everything you say resembles reality only under conditions where the bias inherent in stringer-collected batted ball data has a lesser effect on the results than the “signal” you’re trying to capture. So what can we say about bias? Well, we can say that it exists. The magnitude is large enough that it can be seen clearly with very coarse analytical tools - I was able to find substantial evidence of park effects in batted ball trajectory data without having so much as home-road splits, as one example. So we have every reason to think that the bias has a significant (in a statistical sense) effect on more finely tuned analytical tools, especially since those analytical tools seem to have been designed using the faulty assumption in the quoted sentence (that there is no significant bias in the batted ball data).

We can basically break down the data into three components:

- Actual reality. This is what we are trying to measure - on its own, we think that knowing this will improve the quality of our metrics.

- Random measurement error. This is the noise. In large enough samples this is going to have little to no effect on our metrics; in smaller samples this will decrease the accuracy of our metrics.

- Systemic measurement error. This is the poison. We know it is occurring in the data, and we know that it will always (always always always) reduce the accuracy of our metrics relative to a hypothetical metric with the same amount of actual reality and random measurement error but no systemic measurement error.

Okay! So for batted ball data to improve our metrics, the effect size of the first component must exceed the combined effect size of the second and third. I’ve been begging people to present any evidence they may have for this claim for almost a year now, and have heard nothing. Does anyone care to do so now?


#9    joe arthur      (see all posts) 2011/04/15 (Fri) @ 08:19

Colin,

there is an additional problem, apparently present within the BIS data, which would not fall under systematic error. [As a clarification, to me systematic error has to do with measurement distortion, like mis-calibration in a pitchF/x system which systematically interprets a pitch as 3 inches further outside than it really was, or the pressbox parallax effect you have suggested as distorting the observation of trajectory.]

Since introducing the fliner in 2006, more batted balls each year have been identified as fliners, at the expense of “regular” line drives and flies. The “out rates” for each batted ball type have shifted in a way consistent with the explanation that the boundaries between BIS batted ball types, whether subjective or objective, are shifting.

This isn’t necessarily observation error (e.g. due to camera angles changing over time in a biased direction), it may be due to an unannounced or unrecognized change in classification criteria. This was a problem with BIS in its earlier years as well; before introducing the fliner, when they had just line drives and fly balls (like STATS), they had a great deal of year to year variation in counting line drives.

I agree with your first 6 or 7 paragraphs about the relevance of Studes’ article to the question, but otherwise I think you and Mickey are talking past each other. His thought experiment only talks about random error; much of what you raise has to do with non-random errors, which park effects and year-to-year normalizing might adequately correct for.

I have no position yet on the overall disagreement. Nothing can be settled conclusively solely by looking at aggregated data. People need to be looking at the actual individual data points in depth.  And I think that is what’s nice about Studes’ article, even though I agree that it doesn’t help in any way to settle the argument about batted ball data reliability or fielding metric reliability.


#10    antonio      (see all posts) 2011/04/15 (Fri) @ 10:22

Colin-

Tango suggested a test a while back to see if UZR predicted nFRAA(T+1) better than nFRAA as a way to go about presenting some of the evidence you are looking for:

http://www.insidethebook.com/ee/index.php/site/comments/conclusions/#86

Would you also find the results of that test to be valuable in assessing the situation at hand?


#11    MGL      (see all posts) 2011/04/15 (Fri) @ 15:40

"I agree with your first 6 or 7 paragraphs about the relevance of Studes’ article to the question, but otherwise I think you and Mickey are talking past each other. His thought experiment only talks about random error; much of what you raise has to do with non-random errors, which park effects and year-to-year normalizing might adequately correct for.”

Exactly.


#12    Colin Wyers      (see all posts) 2011/04/15 (Fri) @ 16:09

Since introducing the fliner in 2006, more batted balls each year have been identified as fliners, at the expense of “regular” line drives and flies. The “out rates” for each batted ball type have shifted in a way consistent with the explanation that the boundaries between BIS batted ball types, whether subjective or objective, are shifting.

http://www.insidethebook.com/ee/index.php/site/comments/john_dewan_and_research_assistant_speak/#3

* UZR uses multi-year samples, while Plus/Minus adjusts for year-to-year league changes.

I leave this as an exercise to the clever reader.


#13    Colin Wyers      (see all posts) 2011/04/15 (Fri) @ 19:15

I agree with your first 6 or 7 paragraphs about the relevance of Studes’ article to the question, but otherwise I think you and Mickey are talking past each other. His thought experiment only talks about random error; much of what you raise has to do with non-random errors, which park effects and year-to-year normalizing might adequately correct for.

Well, his hypothetical isn’t even correct, at least not as stridently as he puts it. So 5 BIP per every 100 BIP are classified correctly? If I take 90 outfielders (three per team) and give them each 50 BIP, I should expect… something like five of them to have 0 correctly-classified balls - in other words, all of their data will come from the random pool. I find it inconceivable to construct an argument that says that the hypothetical batted ball data is improving anyone’s estimates at that point.
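The arithmetic behind that estimate can be written out directly, assuming each ball is independently classified correctly with probability 5%. The function name is just for illustration.

```python
def expected_all_random(n_fielders=90, bip_each=50, accuracy=0.05):
    """Chance that a fielder gets zero correctly classified balls,
    and the expected number of such fielders."""
    p_zero = (1 - accuracy) ** bip_each  # every one of his balls is from the random pool
    return p_zero, n_fielders * p_zero

p_zero, expected = expected_all_random()
print(f"P(no correct balls) = {p_zero:.3f}, expected fielders = {expected:.1f}")
```

Under these assumptions the expected count comes out nearer seven than five, which is the same order of magnitude and does not change the argument.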

But even if we replace what he said with a more reasonable sentiment that is in that vein (say, that in a large enough sample, we should reasonably expect to see that sort of batted ball data as posed in his hypothetical to improve the quality of the estimate - note I’m not saying I agree with this, just saying that it’s a reasonable reformulation of MGL’s decidedly unreasonable assertion), it is still a counterfactual hypothetical that doesn’t address the issues seen in the data. And it does not seem to me apparent how park factors (either current park factors, which are almost certainly inadequate by design, or hypothetical park factors that at least attempt to address these sorts of issues) or year-to-year normalizing would possibly address problems such as range bias.


#14    Colin Wyers      (see all posts) 2011/04/18 (Mon) @ 20:05

I haven’t gone into as much detail here as perhaps I should; I will do so now.

Let’s take MGL’s preferred example of park effects, and let’s do what he proposes. I ran a simulation of every game last season, assigning a random number of RPG to each game based upon the observed frequency of RPG in the league. Then I went ahead and figured a set of park factors, using the actual RPG 5% of the time and the randomly-generated RPG the other 95% of the time. (These are home-only PFs, so roughly double the magnitude of effect you’d see on Baseball Reference or whatnot.) So, lemme ask you. Would you use this set of 2010 park factors?

TEAM PF
SEA 1.2166
KCA 1.1715
TEX 1.1536
LAN 1.1378
CHN 1.1204
ANA 1.1131
ARI 1.0876
MIL 1.0815
SFN 1.056
CHA 1.0314
SLN 1.0188
CLE 1.0097
PIT 1.0072
CIN 1.0055
FLO 0.9887
TBA 0.9751
ATL 0.9707
NYN 0.9707
DET 0.9524
NYA 0.9449
BAL 0.9333
OAK 0.9226
MIN 0.9182
COL 0.904
HOU 0.8956
BOS 0.89
WAS 0.8897
TOR 0.8895
PHI 0.8863
SDN 0.8609

Or these?

TEAM PF
ANA 1.2089
NYN 1.2053
ARI 1.178
TOR 1.1181
KCA 1.1048
OAK 1.0818
COL 1.0689
CHA 1.0588
SFN 1.0553
BOS 1.0473
PIT 1.0445
CHN 1.0355
PHI 1.0216
NYA 1.0081
SLN 0.9876
SEA 0.9799
SDN 0.9797
HOU 0.9721
FLO 0.9678
WAS 0.9609
CLE 0.9592
CIN 0.9464
BAL 0.9431
MIL 0.9327
TBA 0.9003
MIN 0.8979
ATL 0.8975
LAN 0.8971
DET 0.8466
TEX 0.8389

Both of those are based on 5% real data and 95% random data, and both of them are about as useful as a sharp kick to the head. They’re actually worse than the null assumption (that all parks are equal). Regressing to the mean doesn’t help, except to where it lowers the magnitude of the mistake you’re making in using them at all. (At a sizable enough amount of regression, you end up with no significant difference between these and the null assumption, so I mean if you regress them 98% or so I guess you’re not hurting yourself any.)
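For anyone who wants to try a version of this experiment, here is a rough sketch. It is not the simulation described above: it does not use the actual 2010 game logs, and every parameter (a league average of 8.8 combined runs per game, a per-game standard deviation of 4.5, true home-only park factors spread with a standard deviation of 0.10, normally distributed run totals) is an assumption chosen for illustration.

```python
import random
import statistics

def simulate_park_factors(n_parks=30, games=81, real_share=0.05,
                          league_rpg=8.8, game_sd=4.5, pf_sd=0.10, seed=0):
    """Return (true_pf, est_pf) when only `real_share` of games reflect the park."""
    rng = random.Random(seed)
    true_pf = [rng.gauss(1.0, pf_sd) for _ in range(n_parks)]
    park_means = []
    for pf in true_pf:
        runs = []
        for _ in range(games):
            # "Real" games reflect the park; "random" games are league-wide draws.
            mean = league_rpg * pf if rng.random() < real_share else league_rpg
            runs.append(max(0.0, rng.gauss(mean, game_sd)))
        park_means.append(statistics.fmean(runs))
    league_mean = statistics.fmean(park_means)
    est_pf = [m / league_mean for m in park_means]
    return true_pf, est_pf

def rmse(estimates, truth):
    """Root mean squared error between two equal-length lists."""
    return statistics.fmean([(e - t) ** 2 for e, t in zip(estimates, truth)]) ** 0.5

true_pf, est_pf = simulate_park_factors()
print("RMSE of noisy park factors:", rmse(est_pf, true_pf))
print("RMSE of null (all 1.00):   ", rmse([1.0] * len(true_pf), true_pf))
```

Changing `seed`, `games`, or `real_share` shows how the comparison with the null assumption moves with sample size and data quality; a single run on 30 parks is itself a noisy comparison.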

To extract something useful from this rather bizarre thought experiment, the principle at work is: when dealing with data that has measurement error, the data is valuable so long as the variance caused by what you’re trying to measure is greater than the variance caused by measurement error. This is true with or without bias (since apparently nobody cares about bias in batted ball data, I thought I’d point this out).

Now, let’s go back to classical test theory:

Var(Obs) = Var(True) + Var(Rand) + Var(Bias)

Now, in our example above, 95% of the data is totally random - so Var(True) and Var(Bias) fall out. Let’s assume no bias in the other 5% of the data, so just Var(True) and Var(Rand) apply. In smaller samples (as presented here, even a single season is a smaller sample), the magnitude of the Var(Rand) in the random data vastly outstrips the Var(Obs) of the other 5%. As you increase in sample size, Var(Rand) will start to drop as the random values begin to coalesce around the mean, and so the proportion of the variance explained by observed events goes up, as do the odds that the observed events are representative of reality. So in a large enough sample, these rather junky park factors will converge upon the sort of “Coors Field 1.01, San Diego 0.99” park factors MGL talks about. In other words, the random data will behave as “ballast,” keeping the values from getting too far from the mean.
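The sample-size behavior in that decomposition can be sketched in a few lines, assuming the three components are independent, that the random variance of an average shrinks as 1/n, and that the bias variance does not shrink at all. The function name and the numbers are hypothetical.

```python
def true_share(var_true, var_rand_per_obs, n, var_bias=0.0):
    """Share of observed variance due to the real effect:
    Var(True) / (Var(True) + Var(Rand) + Var(Bias)), with Var(Rand) falling as 1/n."""
    var_rand = var_rand_per_obs / n
    var_obs = var_true + var_rand + var_bias
    return var_true / var_obs

for n in (10, 100, 1_000, 10_000):
    print(n,
          round(true_share(1.0, 100.0, n), 3),                # random error only
          round(true_share(1.0, 100.0, n, var_bias=0.5), 3))  # with a fixed bias term
```

With random error only, the share climbs toward 1 as n grows. With a bias term it levels off at Var(True) / (Var(True) + Var(Bias)) no matter how large the sample gets.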

Now of course with real data, it’s not so neatly regimented into “good” and “bad” data, so figuring out what’s going on gets a lot messier. But the principle still holds: for the data to be valuable above the null assumption, the proportion of the variance explained by reality needs to be greater than the proportion explained by measurement error (both random and systemic). That’s comparing everything to the null assumption. MGL goes on to say:

You give me your best DRA, non-batted ball system and then give me some batted ball data (location) from retrosheet, BIS, or STATS (or wherever). I will ALWAYS beat for DRA.  Always!

We’ve already shown why this may not be true, even if the best non-batted ball system uses the null assumption. But of course the best non-batted ball system doesn’t use the null assumption. Repeating myself, so nobody can miss this point:

The best non-batted ball system will be better than the null assumption.

So even if with batted ball data you can get better than the null assumption - and that’s supposed to be a function of sample size, so long as we assume that measurement error is predominantly random (I’ll cover this in a moment) - then it doesn’t follow that you can beat the best non-batted ball estimate of batted ball distribution. The first thing to keep in mind is that as the statistical power of the batted ball data grows relative to the noise, the effects of batted ball distribution regress to the mean. So the more confident we can be in the batted ball data, the less necessary it is - over larger samples, most of the things that will maintain a persistent difference in batted ball distribution are things we can probably determine without batted ball data.

And of course we’ve only barely mentioned bias. What bias does is increase the variance explained by factors other than what we’re trying to measure, because unlike variance with a random cause, variance caused by systemic measurement error does not coalesce towards the null assumption as the sample size increases. In other words, bias makes it possible for batted ball data to never be as good as the null assumption in determining batted ball distribution.


#15    MAH      (see all posts) 2011/04/19 (Tue) @ 10:40

"[A]s the statistical power of the batted ball data grows relative to the noise, the effects of batted ball distribution regress to the mean.”

“[V]ariance caused by systemic measurement error does not coalesce towards the null assumption as the sample size increases.  In other words, bias makes it possible for batted ball data to never be as good as the null assumption in determining batted ball distribution.”

Thanks, Colin.  I’ve had similar thoughts but have never been able to express them so concisely.


#16    Tangotiger      (see all posts) 2011/04/19 (Tue) @ 10:41

I only understood some of what Colin was saying.  I was trying to see what it is that I agree with him on.  This is something like what I’ve said in the past:

So the more confident we can be in the batted ball data, the less necessary it is - over larger samples, most of the things that will maintain a persistent difference in batted ball distribution are things we can probably determine without batted ball data.

Basically, because of systematic biases, a non-hit-location based system would be preferred to a hit-location based system *at some point*.  For example, give me 15 years of Derek Jeter, and tell me who his pitchers, batters, parks, and game states are, and which fielders caught the balls, and who his fielders are, and that’s going to be about as much as I need to know.  Because with 60,000 balls in play, how disproportionate of a batted ball distribution could he have faced (after accounting for the parameters I noted)?  If we see some skew, that’s more likely a sign of systematic bias than a real difference in hit locations.

Now, for one game, the hit-location data tells you far more.  Felix for example could have a “flyball” night, even though historically he’s a GB pitcher.  Felix can give up 10 gap singles in one game, but a non-hit-location system has no choice but to assign some of that to the infielders.  Even worse, something like WOWY will give a disproportionate share of those to the infielders because it uses the knowledge that Felix is a GB pitcher.

So, somewhere between one game (hit-location preferred), and 15 years (hit-location likely not preferred), there’s an equilibrium point, the point at which the systematic biases and the extra knowledge of hit-location data are even.  I don’t know what that point is.  I’m guessing six years.
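One way to make that equilibrium point concrete is a toy model, assuming the hit-location estimate has a fixed bias plus random error that shrinks with sample size, while the non-hit-location estimate is unbiased but noisier per observation. Treating the non-location system as unbiased is a simplification, and the numbers below are invented; nothing here says where the real break-even lies.

```python
def mse_location(n, bias, sd_location):
    """Mean squared error of the hit-location estimate: bias^2 + variance / n."""
    return bias ** 2 + sd_location ** 2 / n

def mse_no_location(n, sd_no_location):
    """Mean squared error of the (assumed unbiased) non-location estimate."""
    return sd_no_location ** 2 / n

def break_even_n(sd_no_location, sd_location, bias):
    """Sample size at which the two errors are equal:
    n* = (sd_no_location^2 - sd_location^2) / bias^2."""
    return (sd_no_location ** 2 - sd_location ** 2) / bias ** 2

n_star = break_even_n(sd_no_location=2.0, sd_location=1.0, bias=0.1)
print("break-even sample size:", n_star)
```

Below n* the hit-location estimate has the smaller error in this model, and above it the non-location estimate does. A smaller bias pushes the break-even point further out.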

Now, I don’t know if Colin is saying any or all of this.  I don’t know if he concedes that in a single-game universe that hit-location data is far preferred to not knowing it at all.

As for this:

Both of those are based on 5% real data and 95% random data, and both of them are about as useful as a sharp kick to the head. They’re actually worse than the null assumption (that all parks are equal)....(At a sizable enough amount of regression, you end up with no significant difference between these and the null assumption, so I mean if you regress them 98% or so I guess you’re not hurting yourself any.)

I agree here as well.  If 95% of the data is random, and if your choice is 100% park factors or 0% park factors, then you use 0% park factors.  This is true of anything.  It’s true of DIPS, too.  But, there’s nothing stopping us from regressing 95% or whatever it should be.

So, I either agree with Colin, or I’m not following what he’s trying to say.


#17    MAH      (see all posts) 2011/04/19 (Tue) @ 11:40

I’ll of course let Colin address the points as he sees fit. But it seems to me that if one applies the DRA method to objective Retrosheet data to account for systematic shifts in batted ball opportunities, there is enough randomization in 600 batted balls per season or 1200 batted balls over two seasons (for a shortstop, say) to make DRA as good as a batted ball system, because “the effects of [non-DRA-modeled] batted ball distributions regress to the mean.” And that’s still not taking into account the systematic biases that have been identified already.

I also hasten to add that flawed as they are, batted ball systems have played a vital role in developing better non-batted ball systems and might still be the best.  It’s just hard to tell, particularly because the underlying data is not public.


#18    Colin Wyers      (see all posts) 2011/04/19 (Tue) @ 13:42

So, somewhere between one game (hit-location preferred), and 15 years (hit-location likely not preferred), there’s an equilibrium point, the point at which the systematic biases and the extra knowledge of hit-location data are even.  I don’t know what that point is.  I’m guessing six years.

Well I’m certainly not saying anything resembling this.

Ignore bias for a second (after all, everyone else is). You STILL have random measurement error in the batted ball data. Everything you’re saying here is based on the assumption that the amount of signal contained in the batted ball data is greater than the noise in very small samples, down to the single game. Your example consists in using the data to determine ground ball or not ground ball. You’ve essentially picked the most robust use case for the batted ball data, and are using that as the example. But let’s consider that most batted ball metrics are in fact using far more detailed views - something between three and six batted ball types, yes? Depending on whether or not you’re using infield flies or fliners or gliners. Stack on top of that 22 zones (or more!) and batted ball distances and soft/medium/hard… increasing the amount of things you’re trying to measure obviously increases your potential for measurement error.

When you go to that level of detail, I think you’re seriously underestimating how much noise there can be and how that can overwhelm any of the signal that’s in the batted ball data, over “small” samples. And all of these variables have a substantial effect on the estimates of expected outs, otherwise we’d expect to see a much higher level of agreement between batted-ball methods and non-batted-ball methods. And if random measurement error were small we should expect to see greater agreement between sources of batted ball data. And if methods for handling any of these concerns were obvious and effective, then we should expect to see greater agreement between metrics built upon the same batted ball data.

But we don’t.

Rather than making a WAG about when non-batted ball metrics become better than batted ball metrics, I’m pointing out that it’s very possible that THIS IS NEVER TRUE. I’ve been making this claim for long enough that I’m hoping at some point people will start believing that I’m serious about at least considering the idea.

If 95% of the data is random, and if your choice is 100% park factors or 0% park factors, then you use 0% park factors.  This is true of anything.  It’s true of DIPS, too.  But, there’s nothing stopping us from regressing 95% or whatever it should be.

Nothing’s stopping us, but that doesn’t make it useful. Let’s take the example where we have a park factor of 1.22 for Seattle - we regress that 95% to the mean, and we get 1.01. So by regressing we’ve gone from being significantly wrong to being insignificantly wrong (a park factor of 1.01 simply doesn’t make much of a practical difference). Sure, being practically indistinguishable from the raw data is probably just as good as the raw data, but you haven’t ACHIEVED anything. And of course by going to all this extra effort (which, again, has bought you no benefit whatsoever) you increase your chances of making a mistake somewhere along the way. If the park factors in the example are the best you can do, you’re better off without park factors.
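The arithmetic in that Seattle example can be checked with a couple of lines. This is only a sketch of plain regression toward a mean of 1.00; the function name is made up for illustration.

```python
def regress_to_mean(observed, mean, regression):
    """Shrink an observed value toward the mean by the given fraction (0 to 1)."""
    return mean + (1.0 - regression) * (observed - mean)

# The park factor of 1.22 and the 95% regression are the figures from the comment.
regressed = regress_to_mean(1.22, 1.00, 0.95)
print(round(regressed, 3))
```

Only 5% of the 0.22 gap survives, which is where the roughly 1.01 figure comes from.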


#19    Colin Wyers      (see all posts) 2011/04/20 (Wed) @ 14:20

And six years? Really? Six years for expected outs to stabilize? That’s just totally incredible, if by “in” you mean not and “credible” you mean able to be believed.

If I take a look at the data on Fangraphs, it looks to me like for SS you can expect the year-to-year correlation for expected outs per inning to be roughly .5 given an average of 1250 innings per season, or about 138 games. (They don’t publish expected outs for UZR anymore, so I had to make some estimates - I don’t think I’m far off the mark, but if someone wants to provide better figures I’m open to it.)

Using your shorthand method from here:

http://www.insidethebook.com/ee/index.php/site/comments/career_dips_numbers/#10

I can get r equal to .7 in just over two seasons. Where does “six” come from?
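For anyone wanting to follow the arithmetic, here is a sketch. It assumes the shorthand at that link is the usual reliability form r = n / (n + k), where k is the sample size at which r equals .5. That is an assumption, not something stated in this thread.

```python
def k_from_observed(n, r):
    """Solve r = n / (n + k) for k, given an observed r at sample size n."""
    return n * (1.0 - r) / r

def n_for_target(k, r_target):
    """Sample size needed to reach r_target under r = n / (n + k)."""
    return k * r_target / (1.0 - r_target)

INNINGS_PER_SEASON = 1250                          # from the comment
k = k_from_observed(INNINGS_PER_SEASON, 0.5)       # r of roughly .5 at one season
innings_needed = n_for_target(k, 0.7)
seasons_needed = innings_needed / INNINGS_PER_SEASON
print(round(seasons_needed, 2))
```

Under that assumed form, r reaches .7 at about two and a third seasons, in line with "just over two seasons."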


#20    Tangotiger      (see all posts) 2011/04/20 (Wed) @ 14:43

Colin/18 (first half):

You are saying that my use case (the extreme case) is unfair, because it is extreme, and so, I can’t possibly lose that particular argument.  But, I presume, you are saying that if I take a random number of games, that the measurement error of those may be larger than… what exactly? 

If we ignore systematic biases, then I don’t see how your argument can hold water.  Last night, in the Jays/Yanks game, in the final inning of a tie game, Nova gave up three deep fly balls.  On one of them, the runner from 1B had already rounded second when the OF caught the ball.  Another was also deep, and the third one was a deep double that scored the runner from 1B.

Now, do I need to know if the balls went 320 or 340 feet?  As long as the error is not a systematic bias, then, no, it’s not that important.  It’s better than not knowing that at all.  Because if I don’t know that all were deep, then I’m down to presuming they were all “average”.  And not just average FB distance, but possibly a GB or LD too.

So, I don’t see your argument.  (Or I don’t follow your argument.)

However, your argument is extremely strong if you focus on systematic biases. 

***

As for the six years, I didn’t say it took that long for expected outs to stabilize.  I said it would take (total guess) six years for a method that depends on hit-location data to be as reliable as a method that does not depend on hit-location data.

So, if I had 15 years of WOWY and 15 years of UZR, I’d lean strongly on WOWY to tell me the truth (because of the potential for systematic biases in UZR, a bias that does not exist in WOWY).

If I had 1 year of WOWY and 1 year of UZR, I’d lean strongly on UZR, because the systematic bias in UZR would be smaller than the randomness of batted ball distributions of WOWY.

“Six” years is a total guess.  I mean, if you tell me a better guess is two years, or ten years, then fine.  The number is not the selling point in my argument.  It’s just there to frame the discussion.

(All numbers for illustration purposes only.)


#21    Tangotiger      (see all posts) 2011/04/20 (Wed) @ 14:53

One way to test my WOWY v UZR is to look at 6 years of WOWY and 6 years of UZR and find out which correlates best with next year’s outs per BIP (preferably with pitchers/parks that are not part of the original 6 years).


#22    Sky      (see all posts) 2011/04/20 (Wed) @ 15:12

I get that we’re better off without park factors for Seattle (they were one of the teams that was way off) but if we look at all thirty teams, are we better off WITH park factors? Yes, you’ll make mistakes, but the gains should outweigh the losses, no?


#23    Rally      (see all posts) 2011/04/20 (Wed) @ 15:36

"Yes, you’ll make mistakes, but the gains should outweigh the losses, no?”

In that example, Colorado should wind up at 1.01 more often than other teams, and San Diego more often at .99.  To claim that such park factors are completely useless, or worse than not using any, is wrong.

But the benefit is extremely small, and probably not worth the time it would take to calculate it.


#24    Tangotiger      (see all posts) 2011/04/20 (Wed) @ 16:16

When Voros used to publish his forecasts, he would HEAVILY regress his park factors.  That’s where I got the idea to not apply park factors at all with Marcel.  The range is so small, and the random variation in next year’s 500 PA so large, that it’s not worth it for almost all cases to park adjust.

A bit of park adjustment is still better than none at all, but, yes, it’s a lot of effort for little gain.


#25    Colin Wyers      (see all posts) 2011/04/20 (Wed) @ 16:39

Well, look, I meant what I said and I said what I meant.

RMSE of those two sets of park factors with the actual park factors (using the same method, but all real data):

0.14416151
0.132703809

Standard deviation of the “real” park factors: 0.118048442

So, yes, those park factors have a greater error than if you just used 1 as the park factor for all parks.
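A note on why the standard deviation is the yardstick here: if every park is assigned a factor of 1, the error for each park is just its real factor minus 1, so the RMSE of the null assumption equals the standard deviation of the real factors whenever those factors average 1. A minimal sketch, using only the three figures quoted above plus a toy three-park list that is purely hypothetical:

```python
import math
import statistics

def rmse(estimates, actual):
    """Root mean squared error between two equal-length lists."""
    return math.sqrt(sum((e - a) ** 2 for e, a in zip(estimates, actual)) / len(actual))

# Figures quoted in the comment above.
RMSE_SET_1 = 0.14416151
RMSE_SET_2 = 0.132703809
SD_REAL = 0.118048442   # stands in for the RMSE of "all parks = 1"

worse_than_null = [err > SD_REAL for err in (RMSE_SET_1, RMSE_SET_2)]
print(worse_than_null)

# Toy check with hypothetical park factors that average 1.
toy_real = [0.90, 1.00, 1.10]
toy_null = [1.0] * len(toy_real)
print(rmse(toy_null, toy_real), statistics.pstdev(toy_real))
```

The toy check only illustrates the identity. It says nothing about what regressed versions of the two sets would score, which is the point raised in the replies above.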


#26    Tangotiger      (see all posts) 2011/04/20 (Wed) @ 16:59

Can you post the “real” park factors that you used?


#27    MAH      (see all posts) 2011/04/20 (Wed) @ 17:14

I’ll offer up another suggested summary of Colin’s message, acknowledging of course that I might not have this quite right.

First, no evidence has been presented that random, unbiased coding error in BIS and STATS data is small enough to ensure less error in estimating expected plays per player per year than a DRA system applying all publicly available Retrosheet data from 2003 to the present. 

Given the 400 to 600 or more batted balls seen by each of the seven positions per team, once you take into account whether or not the ball was hit in the air, batter hand, pitcher hand, batter’s career out distribution (AROM’s idea), runners at first, etc., etc., I find it very hard to believe that there would remain differences in such batted ball sample that current batted ball systems would reliably capture (that is, with a meaningful amount of signal cutting through the random measurement noise).  This is just my opinion, but the opposite position is, so far, just an opinion as well. 

Second, there are known, significant, and perhaps very large biases in batted ball data.  We seem to keep skipping this point and then moving back to the more theoretical and less easily resolved point about the effect of unbiased measurement error.  But this is, if anything, the more important point.  As a practical matter, what are we actually supposed to do with batted ball data defensive runs estimates given the fact that the data is secret and published results based on such data suggest large inconsistencies and biases? As I emphasize below, I do not mean to imply that we should ignore them.

The next points are my own. 

The argument based on the usefulness of batted ball data for short time frames is weak, because proponents of batted ball systems always say to look for two to three years of data anyway. I think a fair test is two full seasons. 

The basic principles of DRA have all been disclosed in my book, and I’m currently putting together numbers for current players using the 2003-10 data.  (If people read the book carefully, they can more or less figure out how to generate those 2003-10 ratings on their own.) When the numbers are out on-line, folks can get a better sense of just how far we can get without batted ball data, and then begin to assess the marginal benefit of batted ball data, taking once again into consideration that it remains proprietary, appears subject to large biases, and might not add as much precision as we thought.

And again--to be clear--I continue to believe that MGL is providing a great service by generating UZRs, and they should always be part of the discussion.


#28    tangotiger      (see all posts) 2011/04/20 (Wed) @ 19:04

The short time frame is definitely not weak, and bringing in the two-year thing is out of context. 

In the two-year thing, you ignored the condition.  Specifically, I said the reliability of UZR after 100 games was equivalent to the reliability of wOBA or RC after 50 games.  To the extent that is true, then this statement is true:

You know two years of UZR to be as certain as one year of offensive stats.

You can’t therefore just say that you need two years period.  And certainly, you can’t then use the two year statement to make the point you did.

I could have said you need 6 years of UZR to get the same reliability as 3 years of wOBA.  You can’t then say you need 6 years of UZR period.


#29    Colin Wyers      (see all posts) 2011/04/20 (Wed) @ 21:52

As for the six years, I didn’t say it took that long for expected outs to stabilize.  I said it would take (total guess) six years for a method that depends on hit-location data to be as reliable as a method that does not depend on hit-location data.

Can you explain this, please? I mean, we can break the formula for any defensive metric down to:

(Outs-ExO)*RunVal

Outs is trivial to derive with play-by-play data; run values are a bit more open to differentiation, and can involve batted ball data, but still will remain pretty stable between systems. So the biggest differentiator of any two fielding metrics is expected outs; when MGL says that batted ball data will always beat non-batted ball data for fielding estimates, what he means is that batted ball data will give you a better estimate of expected outs than not having it.
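To make concrete how much rides on the expected-outs term, here is the formula above as a tiny function. The inputs are hypothetical and chosen only to show that the same fielder, with the same outs, gets a different rating when the expected-outs estimate moves.

```python
def fielding_runs(outs, expected_outs, run_value):
    """(Outs - ExO) * RunVal, as written above."""
    return (outs - expected_outs) * run_value

# Hypothetical inputs: same outs and run value, two different expected-outs estimates.
OUTS, RUN_VALUE = 420, 0.8
system_a = fielding_runs(OUTS, 405.0, RUN_VALUE)
system_b = fielding_runs(OUTS, 415.0, RUN_VALUE)
print(system_a, system_b)
```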

There are two sources of variance in batted ball distribution:

1) Systemic factors - pitcher groundball/airball tendencies, batter handedness, park effects.

2) Randomness.

What people like MAH and myself are claiming is that you can do a very good job of estimating number one based upon factual data, without relying on batted ball data. So for batted ball data to improve the estimate of expected outs, it has to improve our ability to estimate changes in the random causes of expected outs.

So the questions that need to be resolved here:

* What’s the spread of expected outs for a fielder in MLB over a certain number of chances?

* How much of that is random, versus systemic?

* How well does batted ball data estimate the expected outs caused by randomness?

Because of random measurement error, the larger the sample size, the better the batted ball data should do at predicting the expected outs, presumably. But here’s the kicker:

Because the batted ball data is biased, while reality isn’t, the variance of batted balls in reality will decrease more quickly than the variance estimated by batted ball data.

So is there ever a point where the sample becomes large enough that batted ball data provides a useful estimate of expected outs, while not so large that bias becomes a problem? I don’t see how that circle gets squared. Because remember, for each batted ball, it’s not just how that one batted ball is scored, it’s the number of outs recorded on batted balls that are scored the same. So even if you get the “correct” information on a particular batted ball, the estimate of expected outs on that batted ball could still be deeply in error, if the balls it’s being compared to are scored incorrectly.

Now a lot of people act like batted ball fielding systems are robust against these sorts of things, but as we’ve seen the systems are actually very sensitive to their inputs and the assumptions that are made in processing them. So it seems to me very, very likely that if you don’t know the right answers to some of these questions, using batted ball data is likely to make your estimates worse, not better.

(What I used to think, and am not sure that I do anymore, is that if you DO know the answers to these questions, you can make good use of the batted ball data.)


#30    Tangotiger      (see all posts) 2011/04/20 (Wed) @ 23:30

First, a little precision in our definitions:

1) Systemic factors - pitcher groundball/airball tendencies, batter handedness, park effects.

That should be pitcher identity, batter identity, and park identity (though the third one I presume is the same thing).  In addition, you have runner identity and game conditions (inning, score, out).  And climate.  Maybe one or two others that escape me. All those things are variables in what will influence the distribution of batted balls.

Once you know all those things, and how they interact, we come up with a “true talent distribution”, or expected distribution of batted balls at that point in time.  I.e., our mean estimate.

What actually happens, our sample, will give us random variation around that mean.

***

Now, from the perspective of the fielder, we don’t care about the mean estimate (true talent distribution) at that point in time.  All we care about is the observed result (the actual ball in play) at that point in time.

If we actually had and knew the observed batted ball that is.  We presume we don’t, and therefore, how best to estimate it is the question.

***

Now, if you are suggesting that a systematic bias puts us dead in the water, then you can make a strong case in that regard.

But, it doesn’t seem that you are talking about that.  You seem to be taking an even stronger position that the random measurement error of the observed batted ball is so large that we would be better off just going with the (recorded) identity of the fielder, and inferring, based on the “true talent distribution” (mean estimate), the characteristics of that batted ball.

Your implication is that if you know that Felix threw to Vlad at Safeco with a man on 1B, of a high leverage game, at 9pm, at 20 celsius, and that Vlad hit a single and that (someone recorded that) Ichiro picked up the ball, that that will give you a better estimate than if someone recorded that the ball had a 3.5 second hang time, at +22 degrees, 270 feet from home plate.

In a nutshell, is that the implication of your argument, that even on a random single batted ball, the measurement error (i.e., someone recording those three numerical parameters) will be larger than the inferred estimate (i.e., using the identity of the entities involved, plus of course someone recording the identity of who touched the ball)?

(Again, setting aside the systematic bias of someone measuring 3.5, 22, 270.  And I’m not yet conceding that there is no, or little, measurement error if someone records that Ichiro picked up that ball.  But let’s leave that aside for the moment.)


#31          (see all posts) 2011/04/20 (Wed) @ 23:45

Tango, you’re missing a large piece here.  UZR and similar systems never consider a single batted ball on its own.  The characteristics of your batted ball from Vlad at Safeco is only part of the picture when UZR rates that batted ball.  UZR also considers all “similar” batted balls to establish its baseline for expected outs when applying a value to that single batted ball.

So it’s never about measurement error on a single batted ball.


#32    Tangotiger      (see all posts) 2011/04/21 (Thu) @ 00:10

Mike: I don’t see why this is an issue.  Again, let’s ignore (for the sake of this particular discussion) the systematic biases, which I happily concede is a major issue.

If we have a measurement as I describe it for Felix/Vlad of 3.5 seconds, 22 degrees, 270 feet, with all the other environment I noted, and we figure out, based on all the other batted balls (also with a similar measurement error), that it has an out rate of 20%, then the uncertainty level around that 20% is going to be pretty low.  It’s not as if we’re going to get 20% out rate at 270 feet, 40% at 280, 30% at 290 and 50% at 300.  It’s going to be a smoothed out function, and it’s going to make a lot of sense.

So, the baseline numbers, which frankly I wouldn’t even need to figure out based on empirical data (I could just spend time modelling it theoretically, with some reasonable estimate on jump time, speed, etc), cannot be a big issue.  It’s a tiny issue.

***

Again, the argument against hit-location system should rest largely on systematic biases.  I’ve read every single post by Colin and others, and, as best as I understand it, I am not swayed in the least by the (random) measurement error issue.

The systematic biases though?  Those are potentially big, as we saw with the Andruw Jones example I keep citing, where sUZR was close to 0 and bUZR was close to +110 runs, over a 7 year span.  Just that could be enough to shake your confidence. 

The rest?  Not at all.  In my opinion anyway.


#33    Colin Wyers      (see all posts) 2011/04/21 (Thu) @ 00:51

Tango, how can you be so confident that random measurement error has no significant effect on the accuracy of UZR and similar defensive metrics? I mean, I’m baffled here.

Let’s take your example ball of 3.5 seconds, 22 degrees, 270 feet. Let’s say - just for the sake of illustration - that the standard error on those is a half second, 5 degrees and 20 feet, plus or minus on all of them. These numbers are totally fabricated based on I like things in multiples of five.

So in reality that batted ball is somewhere between 3 and 4 seconds, between 17 and 27 degrees, and between 250 and 290 feet. I mean, I know you say you’re smoothing the curve, but you have to have differentiation between the extremes of those values, don’t you? (And in practice, nobody smooths - it’s all very coarse dropoffs between zones and the like.) I mean, I don’t see how you can argue that the specific parameters of the batted ball have zero effect on the estimate of expected outs - and I don’t think you are arguing that. I think you’d concede, if I asked you point blank, that errors in the batted ball data would lead to errors in the estimate.

My question for you is - how can we know whether or not the effects of random measurement error are significant if we don’t know the magnitude of random measurement error? I’m not asking if you think it’s occurring with current defensive metrics, or if it’s likely to happen with Field F/X or Trackman data - I’m simply asking if people can agree that it’s theoretically plausible that there is such a thing as data so riddled with errors that it’s more harm than help. If you think data matters at all, I don’t see how you can claim - as MGL very plainly does in the blog post that started this thread, so I’m not the one who chose this particular battlefield - that there is no amount of random measurement error that would invalidate a dataset.

As for whether or not it’s affecting anything in a practical sense - I don’t know. I don’t have a good idea of how much random measurement error there is in any particular dataset. I mean, if you do know, please share that knowledge and maybe I’ll be as convinced as you. But I’m really tired of having conversations on fielding metrics that start from conclusions and then MAYBE, if there’s time, work their way back to evidence. Frankly, I think if we ever find ourselves spending more time on answers than questions, we’re at risk of doing bad sabermetrics - and not only that, but BORING sabermetrics. At least, uninteresting to me.

And so I’m asking questions. I’m hoping people are interested enough in the questions to try and find answers. For instance, earlier I asked a question (sort of) about changes in rates of batted ball types over multiple seasons in BIS data. I’ll go ahead and come out with it now - if the definitions of batted ball types are changing between seasons, doesn’t that mean there could be problems caused by using more than one year to establish expected out rates? Is this an area where the Fielding Bible stats are better than UZR, contrary to what we might have expected if we didn’t have this bit of information? (I, for one, was pretty convinced for the longest time that multiyear baselines HAD to be superior.)

In terms of random measurement error - Mike raises some very good points. Here’s what I want to add.

Ignore the individual fielders for a second. After all, when it comes to evaluating player offense, first we come up with a model of how team offense works, and then we apportion out to players, right? Otherwise we have no grounding on which to call OBP a better measure of hitting prowess than batting average. We can make that claim (and convince other people of that) because we can validate it at the team level, which is what truly matters, since baseball is a team sport. So focus for a second on team defense - once you have a working model of teams, you can start on individual players. Otherwise you have a horse/cart sequencing problem and it gets messy.

So, in terms of expected outs, we take our batted ball data. How long does it take for our best estimate of expected outs to coalesce upon average, once we apply a park adjustment? And for any one batted ball, how much does the batted ball data contribute to our ability to estimate the out probability for that one ball in play?


#34    MGL      (see all posts) 2011/04/21 (Thu) @ 01:03

”...that there is no amount of random measurement error that would invalidate a dataset.”

That is correct, as long as the error is handled reasonably well.

I don’t know how to explain it any clearer or better than I already did.  If you, for example, had 3 types of balls, hard, medium, and soft, as long as your hard balls were harder to catch than your medium balls, which were in turn harder to catch than your soft balls, no matter how bad you were at classifying balls, you are better off classifying them than not classifying them (with the caveat being that the differences between the groups were significant enough, and you have SOME confidence in your classifications, such that you are reasonably sure that they are not noise).  The same goes for any other classifications - location, type, etc.  I don’t see how it can be any other way…
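That claim can be poked at with a small simulation. The sketch below is a toy under stated assumptions, not anyone's actual fielding system: the three out rates are invented, the mislabeling is purely random with no bias, and the baseline out rates are learned from one noisy sample and then scored on a second one.

```python
import random

OUT_PROB = {"hard": 0.3, "medium": 0.6, "soft": 0.9}  # hypothetical out rates
TYPES = list(OUT_PROB)

def simulate(n, error_rate, rng):
    """Return (recorded_label, is_out) pairs with unbiased random mislabeling."""
    balls = []
    for _ in range(n):
        true_type = rng.choice(TYPES)
        is_out = 1 if rng.random() < OUT_PROB[true_type] else 0
        recorded = true_type
        if rng.random() < error_rate:
            recorded = rng.choice([t for t in TYPES if t != true_type])
        balls.append((recorded, is_out))
    return balls

def out_rates(balls):
    """Observed out rate for each recorded label."""
    totals = {t: [0, 0] for t in TYPES}
    for label, is_out in balls:
        totals[label][0] += is_out
        totals[label][1] += 1
    return {t: outs / count for t, (outs, count) in totals.items()}

def mse(balls, predict):
    """Mean squared error of predicted out probability against what happened."""
    return sum((is_out - predict(label)) ** 2 for label, is_out in balls) / len(balls)

def compare(error_rate, n=20000, seed=1):
    """Return (error using recorded labels, error ignoring labels) on fresh data."""
    rng = random.Random(seed)
    train, test = simulate(n, error_rate, rng), simulate(n, error_rate, rng)
    rates = out_rates(train)
    overall = sum(is_out for _, is_out in train) / len(train)
    return mse(test, lambda label: rates[label]), mse(test, lambda label: overall)

for rate in (0.0, 0.2, 0.4):
    classified, unclassified = compare(rate)
    print(rate, round(classified, 4), round(unclassified, 4))
```

With only three well-separated categories and no bias, the noisy labels still help, and the advantage shrinks as the error rate rises. The sketch does not cover the cases argued over elsewhere in the thread: many fine-grained zones, small samples per bin, or errors that are biased.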


#35          (see all posts) 2011/04/21 (Thu) @ 07:49

Tango/32, my point is that the random and systematic biases interact.  They’re not independent sources of error.  Therefore, you can’t do your favorite trick of
Variance(total) = Variance(A) + Variance(B) + Variance(C)

We’re also lumping a lot of things into systematic biases, it seems, based upon only having two categories of error and very tightly defining random error.  There’s a whole group of things that you could call construction or math errors, such as you mentioned above with whether or how you choose to smooth the data and what Joe and Colin mentioned about multi-year baselines in the face of changing data approaches.

I won’t disagree with your assertion that the Andruw Jones example casts significant doubt on the accuracy of UZR.  But I don’t see yet how we can neatly apportion the blame for that to the systematic biases (things like range bias and catch/no-catch bias).


#36    Tangotiger      (see all posts) 2011/04/21 (Thu) @ 09:59

If we can’t have a discussion separating random errors from systematic biases, then we’re not going to agree on anything.

If random errors and systematic biases intersect, then we’re only talking about systematic biases.

The presumption of the discussion is that there are NO systematic biases.  Let’s discuss it at that level.

That is, let’s agree that the ONLY errors are of the following variety:
1. Someone does a somewhat sloppy job of estimating distance, hang time, and angle.  He does it because he’s distracted or something, or he’s not paying close attention.  He is not systematically biased.

2. Someone does a somewhat sloppy job of transcribing a true value into the data system.  He does it because he has fat fingers, or he puts things in the wrong field.  He is not systematically making the same errors.

On this basis, and this basis alone, do you, Mike, still hold that we’ve got a problem?  (i.e., the bold part in Tango/30)

Because this is what Colin is limiting his discussion to as well, that we’ve got a problem even if there is no systematic bias to speak of.

And so, when Colin says this:

I’m simply asking if people can agree that it’s theoretically plausible that there is such a thing as data so riddled with errors that it’s more harm than help. If you think data matters at all, I don’t see how you can claim ... that there is no amount of random measurement error that would invalidate a dataset

I suppose it is theoretically plausible that the quality of (measurement and/or transcription of) data is so bad that it is worse than not having it at all.  That is, the data is so much junk that I’m better off knowing it was Felix/Vlad/Safeco and Ichiro picking up the ball, than to rely on a measurement error that puts the ball in short left field.

So, the question is how messed up can the data be that I’d be better off ignoring it all.

Furthermore, even if the data is messed up to a large degree, could I still maybe use it, and regress the sh!t out of it, and rely mostly on the fact that Ichiro picked up the ball, and therefore just ignore any ball marked as being hit to left field?

***

As for systematic bias, then all bets are off, and I won’t argue, because I agree with you.  Just a question of degrees of agreement, and how to correct the data.


#37          (see all posts) 2011/04/21 (Thu) @ 10:27

We have to define our terms very carefully here if we’re going to have a productive discussion.

By random error, I believe we are talking about error that would follow a normal distribution around a mean, where the mean is a perfectly accurate description of the ball’s trajectory in time and space.

I agree that if we magically remove all systematic biases like range bias, catch/no-catch bias, and observer location bias, that we’ll be statistically closer to describing the trajectory of the ball with batted ball location and speed data from this magically unbiased observer when he is below some level of random error than we are using information about the identity of the fielder who caught or picked up the ball and batter/pitcher tendencies and game state.  Above some level of random error the other set of information will be better than the magical observer.

Where I have the problem is when you try to turn this information into a fielding system.  At that point you start comparing this batted ball to other batted balls.  Then, it’s no longer sufficient to say that you have the random error low enough that your magically unbiased observer estimate on this single batted ball is better than estimate on this single batted ball from fielder/batter/pitcher identities and game state.  You also need to know that your baseline for comparison is better than the baseline for comparison that was produced from the other method, and that your mathematical method for comparison makes appropriate use of the data input to the system.

The error from all three of those pieces (trajectory of batted ball in question, accuracy of baseline, construction for comparison) is always in play when you are talking about the accuracy of a fielding system.  You can’t divorce the first piece from the other two, even when only talking about random measurement error.

Also, the interaction becomes more complex once systematic biases are added back in.  I think it’s important to recognize and remember that random error and systematic error are going to interact between the various pieces, and it won’t be easy to isolate the effect of one from the other.  However, I’m willing to set that aside for the moment for the purposes of discussion.


#38    Tangotiger      (see all posts) 2011/04/21 (Thu) @ 10:40

Ok, setting aside the systematic bias as an issue for the purposes of discussion, then let’s discuss Mike’s two issues here:

Where I have the problem is when you try to turn this information into a fielding system.  At that point you start comparing this batted ball to other batted balls.  Then, it’s no longer sufficient to say that you have the random error low enough that your magically unbiased observer estimate on this single batted ball is better than estimate on this single batted ball from fielder/batter/pitcher identities and game state.

Right, so that’s the point I was making.  How bad of a pure measurement random error and/or pure transcription random error would you have to have in order for this to be worse than simply having the identities of the entities involved AND of course the transcription of who picked up that ball (because that too is going to be subjected to some random transcription error)?

This I think can be easily (though time-consuming) determined with some simulation.

So, I agree it is worth investigating, just so that we can know the answer.

You also need to know that your baseline for comparison is better than the baseline for comparison that was produced from the other method, and that your mathematical method for comparison makes appropriate use of the data input to the system.

This I’m not worried about.  I think we can easily create, just based on our baseball guts and limited data, a reasonable out model based on hang time, angle, and distance.  For example, you figure that the average flyball is hit say 300 feet from home plate, with one SD equal to 50 feet (and with 4 seconds of hang time, with one SD equal to 1 second), and that a CF needs say 0.25 seconds of reaction time, he needs 4 seconds to accelerate to his top speed, and he runs at 19mph coming in, 18 going to his side, and 17 going back.  You start off with a model like that, and I think you can tweak it and come up with out rates of 95% within 40 feet of the starting point, and then going out from there to say 20% out rates at balls more than 100 feet away.

All numbers for illustration purposes only, but you don’t need much data to construct your model.  You can probably watch say 20 or 30 games, and have 3 guys watching the game independently, just to get your basic seeding numbers.  You don’t need numbers that are very accurate, just a general sense of numbers so that you use say 18/17/16 mph instead of 19/17/15 or something.
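A minimal sketch of the kind of out model described above, using the illustrative reaction time, acceleration time, and top speeds from this comment. The constant-acceleration assumption, the logistic curve, and the `spread` parameter are not from the thread; they are placeholders you would tune until the out rates look right (for example toward the 95% and 20% figures mentioned above).

```python
import math

MPH_TO_FPS = 5280 / 3600  # feet per second in one mile per hour

# Top speeds (mph) by direction, from the illustrative numbers in the comment.
TOP_SPEED_MPH = {"in": 19.0, "side": 18.0, "back": 17.0}


def reachable_feet(hang_time, direction, reaction=0.25, accel_time=4.0):
    """Feet the fielder can cover before the ball comes down.

    Assumes constant acceleration from rest to top speed over accel_time
    seconds, then constant top speed (an assumption, not from the thread).
    """
    t = max(0.0, hang_time - reaction)          # time actually spent running
    v = TOP_SPEED_MPH[direction] * MPH_TO_FPS   # top speed in ft/s
    if t <= accel_time:
        return 0.5 * (v / accel_time) * t ** 2
    return 0.5 * v * accel_time + v * (t - accel_time)


def out_probability(feet_to_ball, hang_time, direction, spread=10.0):
    """Logistic curve on the margin between reach and required distance.

    spread (in feet) is a hypothetical tuning knob for how fuzzy the
    catchable/uncatchable boundary is.
    """
    margin = reachable_feet(hang_time, direction) - feet_to_ball
    return 1.0 / (1.0 + math.exp(-margin / spread))


if __name__ == "__main__":
    for feet in (20, 40, 60, 100):
        print(feet, round(out_probability(feet, 4.0, "side"), 3))
```

The point of the sketch is only that the model needs a handful of seed parameters, which is the claim being made here; whether the resulting out rates are realistic is a separate tuning exercise.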


#39          (see all posts) 2011/04/21 (Thu) @ 10:48

Tango/38, so we’re not talking any more about how random error affects UZR or Plus/Minus, we’re talking about how it affects a system you might build from FIELDf/x data or that you’ve gathered from a very labor-intensive scouting process?  Because UZR and Plus/Minus don’t construct their out model the way you have described and don’t have the data that you are using to construct yours.  Note that I specifically said that the mathematical construction of the comparison method matters a lot.

If we’re going to talk about shortcomings in the UZR construction or Plus/Minus construction, let’s do that, or if we’re talking about how random error affects UZR, let’s do that.  Or we can talk about how to gather the ideal data set in a magical world without bias and see if that has any real world application.  But don’t start using a completely different fielding system construction in your examples in order to validate how UZR deals with random error.


#40    Tangotiger      (see all posts) 2011/04/21 (Thu) @ 11:02

I wasn’t actually talking about UZR.  I was talking about the underlying data collection.

The question (or one question anyway) is how much random measurement error is there in observed data, if we limit ourselves to two sources of error:
- random observation error
- random transcription error

And, how does this error (at the game, season, career level) compare to the inferred error by limiting yourself to factual identity / environment data (and, presumably, the transcription of who picked up the batted ball hit, which also has a measurement error).

So, this is a question of data quality, and how bad that data quality has to be (due to those measurement errors noted) in order to ignore that observed data.

***

As an example, I figure about 2% of MLBAM batted ball data is pure junk.  Completely wrong.  Even if it wasn’t, I can spot that 2% easily enough to just throw it out.  I don’t need it.  I presume there’s another say 3% or 7% that is also pure junk, but that I can’t tell.  But, what if there’s 50% or 70% that is pure junk, then what?

Can we still use all that data, flag the 2% to throw out, and treat the other 98% as possibly contaminated, and heavily regress that data?

And, if we heavily regress that data, do we now end up knowing less, than if we simply relied on Felix/Vlad/Safeco?

Sure, that’s possible.

But, couldn’t we use Felix/Vlad/Safeco AND heavily regress the contaminated data?  That should probably help, but could it make it worse (overall).  I suppose that’s possible (but I doubt it).
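One way to make the "use both" idea concrete is an inverse-variance weighted average. This is a sketch under strong assumptions that are not established in the thread: both estimates are unbiased, their errors are independent, and we know each one's SD. The numbers in the example are hypothetical.

```python
def combine(est_a, sd_a, est_b, sd_b):
    """Inverse-variance weighted average of two independent, unbiased estimates.

    Returns the combined estimate and its standard deviation.
    """
    w_a = 1.0 / sd_a ** 2
    w_b = 1.0 / sd_b ** 2
    est = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    sd = (1.0 / (w_a + w_b)) ** 0.5
    return est, sd


if __name__ == "__main__":
    # Hypothetical: identities/park imply 300 ft (SD 50 ft);
    # a sloppy observer recorded 270 ft (SD 100 ft).
    print(combine(300.0, 50.0, 270.0, 100.0))
```

Under those assumptions the combined SD is never larger than the smaller of the two inputs, so adding the noisy observation cannot hurt. With these inputs the combined estimate is about 294 ft with an SD of about 45 ft. Unflagged junk or systematic bias breaks the assumptions, which is where the combined estimate could get worse.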


#41    Colin Wyers      (see all posts) 2011/04/21 (Thu) @ 11:04

MGL, how good do you have to be at counting cards in blackjack to make more money doing it than you would make by playing competent blackjack without counting cars?


#42          (see all posts) 2011/04/21 (Thu) @ 11:11

Tango/40, I don’t see how you can ignore the construction of the specific fielding system if you’re attempting to evaluate the accuracy of the resulting fielding system.  How a fielding system is designed to deal with the actual quality of the data it has is very important.

do we now end up knowing less

Knowing less about what?  About the trajectory of that batted ball, or about what that batted ball says about the performance of specific fielders?  Those are two very different things, but you seem to be freely going back and forth between them.


#43          (see all posts) 2011/04/21 (Thu) @ 11:16

As an example, I figure about 2% of MLBAM batted ball data is pure junk.  Completely wrong.  Even if it wasn’t, I can spot that 2% easily enough to just throw it out.  I don’t need it.  I presume there’s another say 3% or 7% that is also pure junk, but that I can’t tell. 

As an aside, the MLBAM batted ball data quality seems a lot better than that to me.  Do you have a recent example of a batted ball where their data is pure junk in terms of what they are trying to do (mark where the ball was fielded, not where it landed or first touched a player)?  Most of the weirdness I have observed comes from when a ball was deflected and picked up somewhere far from the deflection location.

You also mention transcription errors.  Do you have an example of any of those?  I’ve never noticed one.


#44    Tangotiger      (see all posts) 2011/04/21 (Thu) @ 11:23

About the characteristics of that batted ball.

Let’s say we have this as an illustration.  Let’s say that by using FIELDf/x or some other advanced tracking system (or god) to establish the quality of the data, we end up with the following as our random errors if we rely on human observers:

1. Measurement error is random, with 1 SD = 20 feet, hang time is 1 SD = 0.5 seconds, and angle is 1 SD = 5 degrees.

2. Transcription error is random, with 5% established as pure junk.

3. No systematic biases.

So, that’s what we have.

Now, someone else, say me, comes along, and says: “Whatever, using WOWY, meaning purely factual data, just the identity of the entities, and the environment (and who picked up the batted ball hit), I can infer the distance, angle, and hang time of each batted ball.”

The question on the table: how bad of a measurement error in the human observed data do we need in order for WOWY (or purely factual data) to be worse at the play, game, seasonal, and career level?

If the SD was 100 feet, the hang time was 3 seconds, and the angle was 20 degrees, is that worse than purely factual entity/environment (i.e., WOWY) data?  I would think, yes, that’s definitely worse at least at the seasonal level, probably at the game level, and possibly at the play level.


#45    Tangotiger      (see all posts) 2011/04/21 (Thu) @ 11:26

Do you have a recent example of a batted ball where their data is pure junk

I deal with Minors and Majors, so I’m not sure if I only looked at majors how bad that is (probably 0.5% junk).  I haven’t done anything this year, but I can look at it for last year.  So that we don’t have too many threads going on here, send me an email, and we can continue that offlist.


#46    Tangotiger      (see all posts) 2011/04/21 (Thu) @ 11:33

To further elaborate my point, we have the Felix/Vlad/Safeco/Ichiro-marked-as-picked-up batted ball.

The human observer might have marked it as 270 feet, +20 degrees, 3.5 seconds hang time.  Or he might have marked it as 150 feet, -5 degrees, 6.7 seconds hang time.  Or some other wild number.

So, I think I can do better at the play level than the human observer, if the human observer has that much of a measurement error.  I can think that Felix doesn’t give up a lot of FB, and if he does, they’re not long.  I can think that Vlad does get deep flyballs, but also a lot of high flyballs.  I can think that if Ichiro didn’t get an out, then it was probably hit in the gap somewhere.

Therefore, on a single play like this, maybe I’ll be off, using just purely factual data, by say 10 degrees, 50 feet, and 2 seconds of hang time, relative to what actually happened.

So, depending on how bad the measurement error is, I could theoretically infer more about what happened in this one instance than what the human observer said he saw.  (i.e., circumstantial evidence preferred to eye witness testimony).

Therefore, that’s kind of where we are at: how bad a human observer do you have to be, and how bad of a transcriber do you have to be for you to tell us less than us coming to a conclusion based just on factual data (that we infer from).


#47          (see all posts) 2011/04/21 (Thu) @ 11:39

I think I agree with Tango/44+46 insofar as we are only talking about the description of one batted ball and not about assigning any fielding values (expected out rates, etc.) to it.


#48    Tangotiger      (see all posts) 2011/04/21 (Thu) @ 11:50

Mike/47: Great.  But that would be the next step!

So, now, using the factual data only, we infer that that particular batted ball should have been an out 10% of the time.  That is, if you have Felix give up a FB and Ichiro is in RF, and a hit occurred, chances are that that ball was uncatchable to a great extent.

On the other hand, if we develop a decent out-model (i.e., MLB PlayStation3), we can feed it human observation data.  And in this case, that data point says 40% out rate when fed into our video game model.  But, it has a huge measurement error range, because the human recorder is noted as sloppy (high SD in all his recording numbers).  The out rate for the human recorder is 10% to 70%.

So, at the single play model, we may very well be better off ignoring the human observer.  But, what if we have 900 batted balls?  The error range decreases proportional to the square root of the number of observations.  So, if our random error was an out rate of 30%, suddenly with 900 batted balls, the out rate range is now only 1%.

Hence the power of sample size to knock out measurement errors.
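A quick simulation of the square-root claim, as a sketch. It reads the 30% figure as a per-play standard deviation of purely random, normally distributed error, which is my interpretation rather than something stated precisely above. The function name and defaults are hypothetical.

```python
import random
import statistics


def sd_of_average_error(per_play_sd, n_plays, trials=400, seed=1):
    """Simulate the SD of the average error over n_plays noisy observations."""
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        total = sum(rng.gauss(0.0, per_play_sd) for _ in range(n_plays))
        averages.append(total / n_plays)
    return statistics.pstdev(averages)


if __name__ == "__main__":
    # 0.30 per play, 900 plays: theory says 0.30 / sqrt(900) = 0.01
    print(round(sd_of_average_error(0.30, 900), 4))
```

The simulated value should land close to 0.01, matching the 30% to 1% arithmetic. Nothing here says anything about bias, which is added to every observation and so does not shrink.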


#49    Colin Wyers      (see all posts) 2011/04/21 (Thu) @ 12:31

So, at the single play model, we may very well be better off ignoring the human observer.  But, what if we have 900 batted balls?  The error range decreases proportional to the square root of the number of observations.  So, if our random error was an out rate of 30%, suddenly with 900 batted balls, the out rate range is now only 1%.

...so you agree with me that the utility of the batted ball data is a function of the sample size?


#50    Tangotiger      (see all posts) 2011/04/21 (Thu) @ 13:06

Colin/49: the utility of ANY dataset is a function of the sample size, yes.

***

Are we agreed that, systematic bias aside, that sample size is our friend with regards to random error?  And if we only had random error to worry about, we wouldn’t really worry about it at the seasonal level, unless the random error was huge (and, frankly, so huge as to be hard to believe)?

***

Because, now we can talk about systematic bias if that’s the case.  And in this case, sample size is the enemy of systematic bias.

Consider Tim Raines and his seasonal RDI (RBI-HR), runners driven in, totals.  You have a 60 here and 51 there, and you think he’s an alright hitter.  Nothing great.  But then you look at his career total and see he has driven in only 810 runners in 10,000 PA.  It’s a fairly average number, and you might scratch your head how a good hitter like Raines could have such a low number.  Maybe he hits poorly with men on base?  But then you see he actually hits better with men on base.

So, what’s left?  What’s left is a systematic bias.  He spent the majority of his career as a leadoff hitter.  And so, the opportunities to drive in runners did not present themselves.

The sample size in this case exacerbated the problem.  If you don’t account for this systematic bias (batting order), you end up with a hugely flawed metric.

So, where in one case sample size is a big friend of random errors, sample size is a big enemy of systematic biases.

The idea is that you would account for these biases in some manner, so that you can remove its effects.  That’s if you know how to look for them, and how to adjust for them.

In some cases, we simply don’t know or can’t know.
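To illustrate the contrast, here is a small sketch in which every observation carries the same random error but one observer also has a constant offset. The 20-foot SD and 15-foot bias are hypothetical numbers chosen for illustration.

```python
import random


def average_error(n, random_sd, bias, seed=7):
    """Average error over n observations, each drawn as bias + random noise."""
    rng = random.Random(seed)
    return sum(rng.gauss(bias, random_sd) for _ in range(n)) / n


if __name__ == "__main__":
    for n in (25, 400, 10000):
        unbiased = average_error(n, 20.0, 0.0)
        biased = average_error(n, 20.0, 15.0)
        print(n, round(unbiased, 2), round(biased, 2))
```

As n grows, the unbiased average drifts toward zero while the biased one settles on the bias itself. More data only makes you more confident in the wrong number.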

***

So, anyone wanting to highlight the problems with batted ball data, or metrics that rely on batted ball data, you need to put your entire focus on the issue of systematic biases.  Because that’s an argument you can win (Andruw Jones being Exhibit A… you actually couldn’t have asked for a better example).


#51    Tangotiger      (see all posts) 2011/04/21 (Thu) @ 13:13

For those coming late to the Andruw Jones thing:

http://www.insidethebook.com/ee/index.php/site/comments/suzr_v_buzr/

I matched almost all of the players from MGL’s STATS-based UZR (heretofore called sUZR) to his BIS-based UZR (bUZR).  Here are some findings, combined from 2003-2008:

Andruw Jones: He has 819 “games” in sUZR and 731 “games” in bUZR.  The way MGL counts games is that he bases it on the expected number of outs of when he plays divided by the expected number of outs for the average CF in a game.  (So, if he has say 2400 expected outs, and the average CF gets 3 outs per game, then this counts as 800 games.) Jones has an enormous difference.  The result?  In sUZR, he’s a collective -5, and in bUZR he’s the best fielder in baseball from 2003-08, at +112 runs.  Clearly, this is not some random bias here, but a systematic difference in how a STATS scorer and how a BIS scorer is scoring plays with Andruw Jones on the field.  Remember, MGL is using an identical engine.  It’s the classic GIGO.

That’s an example of a systematic bias.  Maybe because Jones positions himself more shallow, the scorer consistently (systematically) biases the data so that the actual distance is 10-20 feet from the observed distance.  Perhaps the FB/LD split is different because of his positioning.  Maybe simply making the out causes different biases.

It could be a lot of things.  What we do know is that the result was a 117 run swing in six years.  And it affected a fielder that would either be considered the best fielder ever, or an average fielder.

That’s where you hang your systematic bias hat on.


#52          (see all posts) 2011/04/21 (Thu) @ 13:14

Tango, my agreement with you ended with what I stated in #47.  You make huge leaps to get from there to where you are in #50.

The details of the out model matter a lot, for instance.  Instead, you’ve waved your hands and assumed a perfect out model.  It doesn’t work that way.


#53          (see all posts) 2011/04/21 (Thu) @ 13:18

If the question is whether I believe that systematic bias is a huge problem for the batted-ball-trajectory fielding metrics, then certainly I agree with that.

If the assertion is that random error is a minor problem for those metrics, then I don’t think that’s been shown.

I thought the question we were trying to answer was whether UZR is better in small samples than nFRAA/DRA style metrics.  I don’t think we’ve made much progress on that question.


#54    Tangotiger      (see all posts) 2011/04/21 (Thu) @ 14:02

If the assertion is that random error is a minor problem for those metrics, then I don’t think that’s been shown.

If you have random error, and only random error, you solve the problem with more data.  The amount of data you need is going to depend on the size of the random error.

As for whether it’s “minor” or not, well, you are correct, we haven’t shown that.  We’d have to know how much random error there is in the data.

My guess is that this particular issue (random error) is definitely minor.

***

The details of the out model matter a lot, for instance.  Instead, you’ve waved your hands and assumed a perfect out model.  It doesn’t work that way.

Yes, the details matter, but I see it a lot like a baseball video game.  If I were to design a realistic baseball video game, I think I could come to, if not a perfect model, then a pretty good one.  I certainly would have no concern about this being some limitation.

Basically, how hard could it be to create a model that has batted ball parameters and running player parameters to see how often a ball is going to be caught?  Sounds like a breeze to me.

***

Anyway, if the question that is (now) on the table requires us to also consider systematic bias (and, we definitely need to consider this at some point), then, yes, we’re still a long ways from answering that.

Which is why, I think, that before we bother talking about systematic bias, that we should close the book on all the other issues.  Those are solvable and resolvable.

The issue of systematic bias has many parameters to consider, and is a hard one to try to find any resolution on.

This is not to say that we shouldn’t do it, but we should put a big asterisk saying “* source data subject to occasional and possibly high levels of systematic bias”.


#55    Tangotiger      (see all posts) 2011/04/21 (Thu) @ 14:19

I should note that I jumped the gun by talking about systematic bias before everyone had a chance to have a say on the random error issue.


#56    Colin Wyers      (see all posts) 2011/04/21 (Thu) @ 14:31

Are we agreed that, systematic bias aside, that sample size is our friend with regards to random error?

Yes.

And if we only had random error to worry about, we wouldn’t really worry about it at the seasonal level, unless the random error was huge (and, frankly, so huge as to be hard to believe)?

No.

I mean, it COULD be true, depending on (off the top of my head) the amount of random measurement error, the methodology used to process the data, the precision (as opposed to accuracy) of the data, and the question you were trying to answer. Without knowing those things I don’t see the point in trying to guess at the magnitude of the effect.


#57    Tangotiger      (see all posts) 2011/04/21 (Thu) @ 14:56

In terms of the measurement error of:

1. Distance: I’m guided by Peter Jensen’s fantastic article a few years back.  It was one SD = 10 or 12 feet when comparing observer to observer.  I think we can safely say that one SD will be at most 20 between observer and true, if limited to random error.  Basically, if trying to guess on a distance of a batted ball, two-thirds of the time, we’ll get it to within 20 feet, and 95% of the time within 40 feet.

So, if you’ve got 400 outfield balls, the average error will be one foot.

No big deal at all.  Which is why just saying “short, medium, deep” is quite satisfactory.  (Again, systematic bias aside.) After a few hundred batted balls there, you get a great match between the observed estimate and the true.

2. Angle: I’ve done a bit of work, and I think 1 SD = 5 degrees is about right.  As a point of illustration, 3B to SS are positioned about 20 degrees apart, the 1B to 2B the same thing, and SS to the base and 2B to the base also about the same thing. 

So, presuming that 68% of batted balls are within 5 degrees of the estimate, and 95% are within 10 degrees also sounds about right.

Setting aside systematic bias, if you’ve got 400 groundballs, well, you can see how we can pretty much nail where a ball is sprayed.

Once again, just having those limited Retro zones do a pretty good job in giving us the actual spray, given enough groundballs.

(As always, setting aside systematic bias.)

3. Hang Time: It’s insane that it’s not recorded, and still not recorded.  The most important of all the batted ball characteristics and the easiest to record.  Not to mention that the measurement error on this one is going to be the smallest.

Presuming you are consistent in how you start and stop your stopwatch (and again, no systematic bias), I doubt you need much sample size at all to get it to 1 SD = 0.1 seconds of random error.  I think having 1 SD = 1 second is an outlandishly large estimate, so just 100 batted balls is all you need to get it to 0.1 seconds of random error.  I would guess you’d need more like 25 batted balls.

In terms of transcription error:
4. Well, let’s say that 5% are pure junk.  At the MLB level, I’m inclined to think it’s more like 1% maybe 2% (and much higher in the minors).  But, let’s say that 5% is junk.

***

So, in the face of my presumption here, my opinion is that the random error is not an issue with batted ball data.  Sample size is a godsend in wiping out what are seemingly large random errors on single batted balls.
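The arithmetic behind these figures is the standard error of a mean: per-ball SD divided by the square root of the number of balls. A short sketch, assuming independent and unbiased errors. The helper names are hypothetical, and the 0.5-second case is my reading of the "more like 25 batted balls" guess.

```python
from math import sqrt


def se_of_mean(per_ball_sd, n_balls):
    """Standard error of the average over n_balls independent observations."""
    return per_ball_sd / sqrt(n_balls)


def balls_needed(per_ball_sd, target_se):
    """Number of observations needed to bring the standard error to target_se."""
    return (per_ball_sd / target_se) ** 2


if __name__ == "__main__":
    print("distance, 20 ft SD, 400 balls:", se_of_mean(20.0, 400))
    print("angle, 5 deg SD, 400 balls:", se_of_mean(5.0, 400))
    print("hang time, 1.0 s SD, target 0.1 s:", balls_needed(1.0, 0.1))
    print("hang time, 0.5 s SD, target 0.1 s:", balls_needed(0.5, 0.1))
```

Strictly speaking, the "one foot" figure is the standard error of the average distance over 400 balls, not the typical error on any one ball, which stays at 20 feet.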


#58    Tangotiger      (see all posts) 2011/04/21 (Thu) @ 14:59

Without knowing those things I don’t see the point in trying to guess at the magnitude of the effect.

The point would be that you create a reasonable model to use reasonable data rather than discarding all data because you couldn’t come up with a decent estimate of uncertainty.

This could simply be a philosophical issue.  You don’t see the point, and I see the point.  Not to say that either one of us is wrong.  It’s just how we roll(*).

(*) Is that term already outdated?


#59    MAH      (see all posts) 2011/04/21 (Thu) @ 16:17

Tango/57:  What were the “observers” in Peter Jensen’s article observing?  Were they observers working from the same vantage point or different vantage points?  For example, were they two people interpreting the same video feed, or two stringers sitting in different parts of the ballpark?

Colin/56:  I agree it _could_ be true that purely random _measurement_ error under UZR averages out over the course of a season.  But so could _estimation_ error under a DRA system with 2003-10 data, possibly to the same degree.

The essential question, before we move on to the more important issue of systematic biases, is finding a _formula_ for estimating the _rate_ at which batted-ball and DRA systems converge toward the ‘true’ number of expected outs per position per batted ball in play.

Obviously, a randomly distorted batted ball system will provide a much better estimate of the expected outs at each position given one batted ball.  It is also clear that if the distortion is truly random, its effect should decline with the number of batted balls.

Obviously, DRA with 2003-10 data will provide a weak estimate of expected outs at each position.  However, as the number of BIP go up, the estimate of expected outs given those BIP will also converge on the right number.

If we say that the convergence is a simple function of the number of BIP, say, the square root of BIP, then since the batted ball systems starts with an advantage, it never loses it, no matter how many BIP.  Maybe that is what MGL has been trying to say.  However, I do not know whether the convergence functions for such different kinds of estimators are as simple as that.

Let’s say, just for the sake of argument, that theoretically the formulas for convergence would show that a theoretically unbiased UZR is more accurate than DRA, over any time frame.  Then the practical question is, how _much_ more accurate would the theoretically unbiased UZR be over time frames considered most relevant for fans and management--a year, two years, maybe three years--that is, the time frames used to project players under Marcel, for example.

This leads back to Tango/28.  There Tango said that he has “said the reliability of UZR after 100 games was equivalent to the reliability of wOBA or RC after 50 games.” I don’t see that he said that in this thread prior to comment 28.  I seem to recall him saying similar things over the years.  However, I also recall him stating quite emphatically, “Don’t talk to me about UZR unless you’re talking about two years of data.” It was from a thread about the best fielding first basemen over a three-year period.

Which leads to the practical question: If our confidence about UZR (which is always above zero and rises per some function) doesn’t reach a level of _practical_ reliability for two years, is there any meaningful difference between the theoretically unbiased noise left in UZR and DRA over that same time period? 

And if we can never actually know the amount of random and systematic bias in batted ball systems, shouldn’t that inform the degree to which we consult defensive runs estimates with and without batted ball data?


#60    Tangotiger      (see all posts) 2011/04/21 (Thu) @ 16:27

Again, the point about the two-years is that if you are going to make a comment about a hitter after one year, then you should make the same comment about a fielder after 2 years.  It’s a conditional statement.  Reliability goes from 0 to 1 as n goes from 0 to infinity (bias aside).

If you want to talk to me, I’d like to have one year of hitter performance, and therefore that means I’d like to have two years of fielder performance.

YMMV.  Maybe you like to talk about hitters after half a season, so in that case, you can talk about fielders after one season.
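One common way to write a reliability that runs from 0 to 1 is r = n / (n + k), where k is the sample size at which r reaches 0.5. That functional form and the k values below are assumptions for illustration, not figures from this thread. The only thing taken from the discussion is the 2-to-1 ratio (100 games of UZR being as reliable as 50 games of wOBA).

```python
def reliability(n, k):
    """Reliability of a metric after n games; k is the half-reliability point."""
    return n / (n + k)


K_HITTING = 100.0            # hypothetical half-reliability point, in games
K_FIELDING = 2 * K_HITTING   # implied by the 100-games-vs-50-games equivalence

if __name__ == "__main__":
    for games in (50, 100, 150, 300):
        print(games,
              round(reliability(games, K_HITTING), 3),
              round(reliability(games, K_FIELDING), 3))
```

Under this form, any fielding sample has the same reliability as a hitting sample half its size, whatever k actually is. That is the conditional statement being made: one year of hitting, two years of fielding.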

***

Peter’s fantastic article was here:
http://www.hardballtimes.com/main/article/is-seeing-believing/

Discussion of that article was here:
http://www.insidethebook.com/ee/index.php/site/comments/cross_checking_the_data_providers/


#61    MAH      (see all posts) 2011/04/21 (Thu) @ 17:29

Fans like to talk about player performance for all sorts of short time frames, but the record books look at one season.  Two seasons for an unbiased batted ball system sound right by me as well.

So the question for me at least is: how much more accurate at the end of two years is a randomly distorted batted ball system over DRA using current Retrosheet data?

Scanned Peter’s article.  As I understand it, because he wanted to include Gameday data, he only included _outs_.  What’s funny about that is that whenever there is an out, Retrosheet records who made it and what position they were playing.  It would be an interesting exercise to take the same events and _assign_ the batted ball location based on the usual place in the field the fielder at that position occupies, and find out how ‘wrong’ that estimate is compared to all the others Peter saw.

It also occurs to me that in the cases of outs, all the observers share a common ‘anchor’ for their estimated coordinates: the coordinates of the player standing at his ‘normal’ position.

By far the most important balls to code are those that are difficult but not impossible to field.  Half or more will go through as hits and not appear in Peter’s test.


#62    Colin Wyers      (see all posts) 2011/04/21 (Thu) @ 19:15

The point would be that you create a reasonable model to use reasonable data rather than discarding all data because you couldn’t come up with a decent estimate of uncertainty.

Discarding what data? The question was about data without bias but with random measurement error, right? I’m certainly not talking about discarding that - I mean, the data potentially exists (there’s stuff like Trackman in the world, and Matt Thomas’s photogrammetry data) but for the purposes of this discussion it’s all purely speculative. If and when I ever get enough of that sort of data to do meaningful analysis with, I’ll certainly want to know how random measurement error affects what I’m doing. But I don’t see the need to try and prejudge it until I’m in that situation. Why not keep an open mind?


#63    Tangotiger      (see all posts) 2011/04/21 (Thu) @ 22:19

I’ll certainly want to know how random measurement error affects what I’m doing. But I don’t see the need to try and prejudge it until I’m in that situation. Why not keep an open mind?

I already laid out what we can reasonably expect in terms of random error in Tango/57, and any half-decent size of sample data would make the data extremely useful.

So, I’m not prejudging it.  I analyzed it, I laid it out on what we can reasonably expect, and we don’t need to have much data to make the case that the model I presented is reasonable, and so, sample size will greatly reduce the sample error to almost nothing.

My mind was open, I thought about it, and I formed an opinion based on a reasonable model. 

Indeed, until I laid it out in Tango/57, I didn’t realize how little sample size we needed.  So, thanks to you for bringing up this issue, I’m even more convinced than ever that the random error is a huge non-factor here.

***

Systematic bias though, that’s the killer.


#64    MGL      (see all posts) 2011/04/22 (Fri) @ 04:07

I just got back in town and skimmed through the last 10 posts or so.

“MGL, how good do you have to be at counting cards in blackjack to make more money doing it than you would make by playing competent blackjack without counting cars?”

That is a very good and relevant question.  First, you can’t make any money at blackjack by counting cars.  You might be able to construct a traffic model though.

Seriously, this IS exactly the same question we are debating.  Of course you can’t get an edge in a typical BJ game without counting cards (or some other method of obtaining an advantage).  You are typically at around a .5% disadvantage playing perfect basic strategy.  If you count cards horribly, but do not make any systematic errors (bias), you will ALWAYS do better than playing perfect basic strategy and not counting cards, as long as you “regress” your strategy appropriately.  By that, I mean vary just enough from basic strategy that is appropriate to the level of accuracy in your counting.  For example, if you count cards horribly, but occasionally correctly, as long as you still play basic strategy but you vary your bets according to your really bad count, again, you will ALWAYS do better than not counting and also playing perfect basic.  That is a mathematical certainty, and anyone who counts cards and knows a little about the mathematical theory behind card counting will tell you the same thing.  The reason is simple.

If you play perfect basic strategy, your disadvantage will be around .5% for a typical 6 deck Vegas game.  That is true no matter what you bet on any given hand (as long as you are not betting more or less when the count is in or against your favor).  Your win expectancy will simply be minus .5% times your total amount bet.

If you count cards well, you will have on average, an additional edge of .5% per true count (for certain commonly used count systems - for other count systems it might be .3% or .25% per true count or whatever).  If the true count is +3, your advantage is 1.5% minus the initial .5% disadvantage, or a 1% advantage. If your true count is -3, then you have your “off the top” disadvantage of -.5% plus another disadvantage of 1.5% for a total disadvantage of 2%.  You will always have the same number of plus and minus counts if you play every hand - they occur symmetrically.  So if you bet more when you have plus counts (and an advantage) and less or nothing when you have minus counts, you can turn that flat betting disadvantage of -.5% into an overall advantage of whatever.  That is the practical theory of card counting.
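The arithmetic in this paragraph can be sketched as follows. The base edge and the per-count gain are the figures quoted above. The three-point distribution of true counts and the bet spread are toy assumptions made up for illustration.

```python
BASE_EDGE = -0.005            # player edge off the top (6-deck game, per the comment)
EDGE_PER_TRUE_COUNT = 0.005   # gain per +1 true count (per the comment)


def edge(true_count):
    """Player edge at a given true count."""
    return BASE_EDGE + EDGE_PER_TRUE_COUNT * true_count


def edge_per_unit_bet(bets, probs):
    """Expected win per unit wagered, given a bet size for each true count."""
    wagered = sum(probs[c] * bets[c] for c in probs)
    won = sum(probs[c] * bets[c] * edge(c) for c in probs)
    return won / wagered


# Toy symmetric distribution of true counts (hypothetical).
PROBS = {-3: 0.25, 0: 0.50, 3: 0.25}
FLAT = {-3: 1, 0: 1, 3: 1}
SPREAD = {-3: 1, 0: 1, 3: 8}   # hypothetical: bet 8 units at +3

if __name__ == "__main__":
    print("edge at +3:", edge(3), "edge at -3:", edge(-3))
    print("flat betting:", edge_per_unit_bet(FLAT, PROBS))
    print("bet spread:", edge_per_unit_bet(SPREAD, PROBS))
```

Flat betting leaves you at the -.5% starting point, while shifting money toward the plus counts pulls the overall edge above zero in this toy setup.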

Anyway, if you count cards with zero accuracy, but no bias, it won’t change the -.5% disadvantage.  When you raise or lower your bets because you think you have a plus or minus count, the real count is as likely to be plus as minus regardless of what you think it is, and on average it will be zero.  You think you are counting, but you are doing nothing.  Your count has zero correlation with the “real” count if someone were actually counting.

However, if you have some accuracy whatsoever, say, one out of every 100 shoes, you are actually counting for real, or even one hand out of every 100, and you raise and lower your bets according to the count on every hand and every shoe, 99% of the time, you will still have the -.5% disadvantage, but 1% of the time, you will have an advantage.  So overall you will ALWAYS do better with counting as long as you are playing basic strategy and not varying your playing strategy according to the count you think you have.  If you do that (vary your playing strategy according to the count you think you have), you will likely do worse than not counting at all and playing perfect basic strategy.  However, if you have an idea as to how bad or good you are at counting, even if you are not perfect, you can vary your basic strategy accordingly and do even better than varying your bets and playing basic strategy.
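The “some accuracy is always better than none” claim can be sketched as a simple mixture. The assumptions here are mine: a fraction `accuracy` of the time the perceived count equals the true count, the rest of the time it is unbiased noise, and the count distribution and bet ramp are hypothetical. When the perceived count is pure noise, the bet size is independent of the real edge, so the expectation is just average bet times average edge.

```python
BASE_EDGE = -0.005        # off-the-top disadvantage quoted above
EDGE_PER_COUNT = 0.005    # extra edge per true count quoted above

# Hypothetical symmetric distribution of true counts (assumption).
COUNT_DIST = {-3: 0.10, -2: 0.10, -1: 0.15, 0: 0.30, 1: 0.15, 2: 0.10, 3: 0.10}


def edge(true_count):
    return BASE_EDGE + EDGE_PER_COUNT * true_count


def bet(count):
    """Hypothetical bet ramp driven by whatever count the player believes."""
    return 1.0 if count <= 0 else 1.0 + 2.0 * count


def ev_accurate():
    """Bets follow the true count: bet size and edge move together."""
    return sum(p * bet(tc) * edge(tc) for tc, p in COUNT_DIST.items())


def ev_noise():
    """Bets follow unbiased noise: same money wagered, no correlation with edge."""
    mean_bet = sum(p * bet(tc) for tc, p in COUNT_DIST.items())
    mean_edge = sum(p * edge(tc) for tc, p in COUNT_DIST.items())
    return mean_bet * mean_edge


def ev_partly_accurate(accuracy):
    """Mixture: accurate a fraction `accuracy` of the time, noise otherwise."""
    return accuracy * ev_accurate() + (1.0 - accuracy) * ev_noise()
```

For any `accuracy` above zero, `ev_partly_accurate(accuracy)` comes out above `ev_noise()`, the result of wagering the same money with no information, because the only thing added is a slice of hands where the bigger bets line up with the better edge.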

For example, let’s say that you are counting with some accuracy but not very much (say 10% of the time you are accurate and 90% of the time your count is completely random).  If you varied your strategy according to what you thought the count was, you would probably do worse than not counting, because most of the time your play would be wrong when you varied from basic strategy.  For example, when you thought that the true count was +3 or higher, you would stand on a 12 versus a dealer 2, which is correct when the true count is actually +3 or higher.  But if you are not very accurate in your counting, when you thought the count was +3, it would really be like +.5 on the average (or whatever), and standing on a 12 versus a dealer 2 is NOT correct at an average count of +.5.

But, with your bad counting, you would be safe always standing on a 16 versus a 10 when you thought the count was plus and hitting a 12 versus a 4 when you thought the count was negative (neither of these plays is basic strategy) because the threshold numbers for these variations from basic are both zero.
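A rough sketch of the “regress your count” idea in the last few paragraphs. It assumes a simple model that is not spelled out above: with probability `accuracy` the perceived count is the true count, otherwise it is an independent draw with mean zero from the same distribution. Under that model the expected true count given a perceived count is just `accuracy * perceived`. The function names are hypothetical; the three plays and their threshold numbers are the ones mentioned above.

```python
def regressed_count(perceived, accuracy):
    """Expected true count given the perceived count (assumed mixture model)."""
    return accuracy * perceived


def stand_12_vs_2(perceived, accuracy):
    """Stand on 12 versus a dealer 2 only if the regressed count reaches +3."""
    return regressed_count(perceived, accuracy) >= 3


def stand_16_vs_10(perceived, accuracy):
    """Threshold is zero: any regressed count above zero justifies standing."""
    return regressed_count(perceived, accuracy) > 0


def hit_12_vs_4(perceived, accuracy):
    """Threshold is zero: any regressed count below zero justifies hitting."""
    return regressed_count(perceived, accuracy) < 0


# A 10%-accurate counter who thinks the count is +3 should not stand on 12 vs 2,
# but can still stand on 16 vs 10, because regression never flips the sign.
bad_counter_stands_12 = stand_12_vs_2(3, 0.10)
bad_counter_stands_16 = stand_16_vs_10(3, 0.10)
```

Regressing shrinks the count toward zero without changing its sign, which is why the zero-threshold plays stay safe at any nonzero accuracy while the +3 play needs a much larger perceived count, or a much better counter.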

Same exact thing with our bad data.  As long as you handle it correctly (like not varying from basic strategy more than you are supposed to given your estimation of how bad you are counting - and if you don’t know that, simply play basic and only vary your bets which can never be incorrect on the average), you ALWAYS do better than not using the data at all!

Tango (and everyone else), OF COURSE you can do better using non-batted ball data than bad data, and there is some point where the results are equal, but that is NOT what I have been talking about.

I am talking about ADDING the batted ball data to your best non-batted ball system.  I have mentioned that several times.  You will always do better as long as you don’t screw up how you handle the bad batted ball data (with no bias and SOME accuracy - not 100% bad). 

Mike, if you guys are going to have a discussion about this with the assumption of no bias, how does it help to keep whining about the “magical world where no bias exists?” We ALL agree that a certain amount of bias even with some accuracy can invalidate the usefulness of an entire data set.  We also ALL agree that nowhere in the universe does there exist a data set without bias.  If you don’t want to discuss the hypothetical “no bias”, then don’t.  You don’t need to keep reminding everyone that it only exists in a “magical world.” That is a childish straw man…


#65    MAH      (see all posts) 2011/04/22 (Fri) @ 07:21

Tom/63/57:  I’m not yet ready to buy the analysis on random error.

First, it elides what are probably the greatest sources of random error in _current_ batted ball systems by far: _both_ the coarseness and misclassification of trajectories of balls hit into the air, compounded by whatever randomizing or distorting impact the “hard, medium, and soft” factors add.  I agree that _if_ we had hang-time, which I have _long_ agreed with you is essential, and credit you in Wizardry with proposing, the random errors would likely become trivial in the overall model.

Second, while I appreciate that you’ve expanded the error bound on Peter’s study, that study, as explained above, would have a systematic error towards minimizing differences between estimates, because it is based solely on outs, where the ‘true’ events would be bunched together and a collective bias to ‘anchor’ the location code closest to the standard position would likely be present.  Also, were those differences solely in the outfield or averaged to include infield ground outs? 

Third, while I am inclined to agree with you that with 400 ground balls you know where the “balls were sprayed” (I think you meant the plural), it is still not clear to me that having that recorded spray chart would provide a _meaningfully_ better estimate of expected outs at a position than DRA 2003-10 using a full season of data (which I think corresponds to your 400 ground balls estimate, if you are thinking of third base). 

I would not say it is per se unreasonable to believe that moderately randomly distorted batted ball systems with no significant biases as constructed today “always” beat DRA, but still no systematic quantification or rule of thumb formula has been developed to show this.  I furthermore do not believe the case has been made that a theoretically massively randomly distorted batted ball system always beats DRA.

I suppose we’ll only be able to answer this question in the distant future when historic FIELDf/x data is released, we re-discretize it to ‘look’ like zone-like data with coarse trajectory codes, randomize the codes by varying amounts, and find out how well DRA matches those.


#66    tangotiger      (see all posts) 2011/04/22 (Fri) @ 07:43

“Tango (and everyone else), OF COURSE you can do better using non-batted ball data than bad data, and there is some point where the results are equal, but that is NOT what I have been talking about.

I am talking about ADDING the batted ball data to your best non-batted ball system.  I have mentioned that several times.  You will always do better as long as you don’t screw up how you handle the bad batted ball data (with no bias and SOME accuracy - not 100% bad).”

I think I must have noted elsewhere that there’s nothing stopping us from including the (non-systematic bias) batted ball data, in addition to all the identity / environment data.  That even if 50% of the data is pure (random) junk, we’d still want all that data, and we’d just regress a lot.
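One standard way to picture “include it and regress a lot” is inverse-variance weighting of two unbiased estimates. This is only a sketch under assumptions nobody in this thread has verified for fielding data: both estimates are unbiased, their errors are independent, and their variances are known. The function name and the numbers in the example are hypothetical.

```python
def combine(est_a, var_a, est_b, var_b):
    """Inverse-variance weighted average of two unbiased, independent estimates.

    Returns the combined estimate and its variance. The noisier estimate gets
    the smaller weight, and the combined variance is below both inputs.
    """
    weight_a = 1.0 / var_a
    weight_b = 1.0 / var_b
    total = weight_a + weight_b
    estimate = (weight_a * est_a + weight_b * est_b) / total
    variance = 1.0 / total
    return estimate, variance


# Hypothetical numbers: a decent estimate (variance 4) and a junky one (variance 36).
combined_estimate, combined_variance = combine(10.0, 4.0, 14.0, 36.0)
```

In this example the junky estimate only gets a tenth of the weight, so it barely moves the answer, but the combined variance still ends up a little under the better estimate's variance. If the errors are correlated or biased, this guarantee goes away.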

HOWEVER, I was trying to show the magnitude of the random error relative to inferring it using identity / environment data.  And if you had the choice between the two, which of the two you’d want to choose.

So, I disagree with you that “OF COURSE” you can do better in other methods than relying on a dataset riddled with bad data.  Obviously, it depends on how much bad data you actually have, but that’s the point of the exercise.  I don’t think we’re anywhere close to having a situation where the randomness to the data is so bad that we could do better with identity/environment data and inferring batted ball locations.

So given that, I think we’ve talked about the random error issue far too much.

However, MAH said that it’s not settled law yet, so I guess I’m in store for another long day.


#67    Peter Jensen      (see all posts) 2011/04/22 (Fri) @ 09:40

MGL - I think your post #64 was an elegant and persuasive description of your position, with which I happen to agree.


#68    MAH      (see all posts) 2011/04/22 (Fri) @ 11:00

“Of course you can’t get an edge in a typical BJ game without counting cards (or some other method of obtaining an advantage).  You are typically at around a .5% disadvantage playing perfect basic strategy.”

That’s not the premise I’ve been thinking of, and I don’t think it’s the premise Colin has been thinking of.  Souped-up DRA _should_ provide a ‘perfectly’ unbiased estimate (keeping in mind MGL’s sensible point that _nothing_ is perfectly unbiased).  It will, however, presumably have some noise over the course of a year or two.  It’s more like having a guy who _does_ count cards to get toward that .5% advantage, but he’s not perfect.

But let’s say he can team with _another guy_ (the batted ball data provider) who can ‘chip in’ his ‘counts’ as well.  What if the second guy is also unbiased, but worse than the first guy by himself.  Is the first guy better off consulting with the second guy?  Or is he just adding variance?

Going back to the park factors analogy, if the first guy had primitive but unbiased factors, what would be gained, in theory and in practice by adding Colin’s noisy estimates.  Perhaps in theory the mean estimate is infinitesimally better, but the resulting numbers are not useful.


#69    MGL      (see all posts) 2011/04/22 (Fri) @ 12:30

“So, I disagree with you that ‘OF COURSE’ you can do better in other methods than relying on a dataset riddled with bad data.”

I don’t think you disagree. I worded that sentence poorly.  I meant that of course there is a point where the data is so bad that the non-batted ball system will be better.

MAH, what’s with the “_” in your posts?


#70    MAH      (see all posts) 2011/04/22 (Fri) @ 13:17

MGL, have never gotten the knack of formatting block quotes so I just put “ “ around what you said.

Just let me know how to do it and I’ll follow the instructions.


#71    Tangotiger      (see all posts) 2011/04/22 (Fri) @ 13:19

[quote]This is a quote[/quote]


#72    MGL      (see all posts) 2011/04/22 (Fri) @ 15:28

Michael, I meant the underscores…


#73    studes      (see all posts) 2011/04/22 (Fri) @ 16:33

I’m surprised no one has brought up the most glaring error in MGL’s original post:

Here is a quote from this article entitled “Batted Balls and Home Runs” by Studes on BP.

BP?

And, yes, that is the most constructive thing I can add to this conversation.


#74    MAH      (see all posts) 2011/04/23 (Sat) @ 00:21

I had heard that _underscoring_ was a more polite way of emphasizing a word than CAPITALIZING it.


#75    MGL      (see all posts) 2011/04/23 (Sat) @ 03:25

Never knew that..


#76    MGL      (see all posts) 2011/05/09 (Mon) @ 02:41

Does anyone have the limited field f/x data that was made publicly available that they can send me?  I am going to do an analysis of range bias as part of my UZR presentation at the Boston conference…


#77    Colin Wyers      (see all posts) 2011/05/09 (Mon) @ 04:09

MGL,

I have it. You can e-mail me at pontifexexmachina at gmail dot com and I will try and deliver it to you.


#78    Peter Jensen      (see all posts) 2011/05/09 (Mon) @ 07:44

Colin - NDA?

MGL - Even if Colin gets you the data good luck with that.


#79    Tangotiger      (see all posts) 2011/05/09 (Mon) @ 09:02

There wasn’t any FIELDf/x made publicly available was there?  There was one month of HITf/x that was made available, so maybe MGL is conflating two things?


#80    Peter Jensen      (see all posts) 2011/05/09 (Mon) @ 09:21

Tango - No, the Field Fx data was not public but was by invitation to 11 analysts and subject to an NDA and the results only for presentation at the Summit. Perhaps Colin will be able to get permission from Sportvision to share the data with MGL.  But my experience with getting an exception to the NDA was that it took months.

Hit Fx, possibly, but Hit Fx data doesn’t help much in analyzing range bias because of the Magnus effect changing the path of the ball after the initial parameters.


#81    Colin Wyers      (see all posts) 2011/05/09 (Mon) @ 09:36

“There was one month of HITf/x that was made available, so maybe MGL is conflating two things?”

I don’t know if MGL was, but I was.

