
THE BOOK--Playing The Percentages In Baseball


Tuesday, November 08, 2011

Rates without Sample Size

By Tangotiger, 11:44 AM

I agree with Matt wholeheartedly.

***

I’ve had a minor issue with Pizza Cutter’s threshold for “stabilization”, which I’ve mentioned several times in this blog.  Basically, Pizza sets the threshold at r=.70, whereas I set the threshold at r=.50.  Why do I prefer mine?  Because with my threshold, I can tell you exactly how much to regress the stats.  It gives you extra information.  In addition, I can explain it in English.  If I set the OBP threshold at PA=210, then I can say: “If the player has 210 plate appearances, then his OBP is half real and half noise.  Regress his OBP by 50% toward the mean.”

And, if the player had 500 PA, then you would regress by 210 / (210 + 500) = 30%.

For Pizza, r=.70 would mean THE EXACT SAME THING.  But his threshold would be PA=500.  So, his threshold says: “If the player has 500 plate appearances, then his OBP is 70% real and 30% noise.  Regress his OBP by 30% toward the mean”.

So, exact same thing.  But, if the player had 400 PA, then what?  Well, in my case, you know exactly how much to regress by: 210/(210+400) = 34%.  But with Pizza’s case?  You’d have to do: 1-400/(400+.3/.7*500) = 35% (the same answer, give or take rounding in the thresholds).  That 3/7ths thing there is not very attractive to me.
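A minimal Python sketch of the arithmetic above. The function names and the .330 league OBP are illustrative assumptions, not anything from the post; the only inputs taken from the text are the 210 PA and 500 PA thresholds and the relationship r = PA / (PA + half point).

```python
def regression_fraction(pa, half_point):
    """Fraction of the observed rate to regress toward the mean.

    half_point is the PA count at which r = 0.5 (210 for OBP in the post).
    """
    return half_point / (half_point + pa)


def half_point_from_threshold(threshold_pa, r):
    """Convert a threshold quoted at correlation r into the r = 0.5 point.

    Follows from r = PA / (PA + half_point), solved for half_point.
    """
    return threshold_pa * (1 - r) / r


def regress(observed, league_mean, pa, half_point):
    """Shrink an observed rate toward the league mean."""
    w = regression_fraction(pa, half_point)
    return w * league_mean + (1 - w) * observed


# Tango's version: threshold quoted directly at r = 0.5.
tango = regression_fraction(400, 210)

# Pizza Cutter's version: threshold quoted at r = 0.7, converted first.
pizza_half_point = half_point_from_threshold(500, 0.70)
pizza = regression_fraction(400, pizza_half_point)

print(f"Tango: regress {tango:.1%}")
print(f"Pizza: half point {pizza_half_point:.1f} PA, regress {pizza:.1%}")

# Hypothetical hitter: .400 OBP over 210 PA, assumed league mean of .330.
print(f"Estimate: {regress(0.400, 0.330, 210, 210):.3f}")
```

The two conventions carry the same information; the r = 0.5 form just skips the conversion step.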

Pizza is as stubborn as I am, because we both knew exactly what the other guy meant, and still, both of us stuck to our guns on this issue.

Note: no actual pizzas were hurt in the creation of this post.

***

Derek Carty posted the 50% threshold here:
http://www.insidethebook.com/ee/index.php/site/comments/when_is_the_observed_data_half_real_and_half_noise/


#1    Lex Logan      (see all posts) 2011/11/08 (Tue) @ 12:25

The problem I have with Pizza Cutter’s thresholds is that all of those stats are proportions, and the standard error of a proportion depends only on the proportion and the sample size, if the events are independent. So the thresholds should all be EXACTLY THE SAME, or we must believe that the events are NOT independent: “I’ve been walking a lot lately, think I’ll hack away more!” “This guy’s been striking out a lot, I’ll just groove one over the middle and see if he looks at it!” In other words, if you set up a simulation, you should NOT see these varying thresholds, and if they exist in real baseball, we ought to try and understand why.

I agree completely with always reporting the PA’s with rate stats; but I’m not ready to endorse tacking on Pizza Cutter’s threshold levels.


#2          (see all posts) 2011/11/08 (Tue) @ 12:27

My big issue is not with .7 vs. .5; Tango has his reasons for wanting R to be 0.5 and Pizza has his reasons for wanting R^2 to be 0.5 (that’s the only reason we care about 0.7 in this context).  I happen to side with Tango (today at least), but that’s not my point.

My issue is the appearance that “stability” is a binary condition...that certain stats are “unstable” before the threshold and “stable” thereafter.  People want simple yes/no answers for whether or not certain stats can be trusted, but the truth is a far more nuanced answer.  Every single plate appearance gives *some* small amount of information, and these studies help us to find out just how much information that is.


#3    Tangotiger      (see all posts) 2011/11/08 (Tue) @ 12:44

Lex/1: I have no idea what you just said. 

To the extent that I think I know what you just said (and I’m focused on the “EXACTLY THE SAME” declaration), then you are wrong.

I’m interpreting that to mean that the threshold for OBP and for SB and for K should all be the same, under some condition.  Is that what you are saying?  If so, then that’s why I say you are wrong.

On the other hand, if I completely missed what you said, please try again.


#4    Lex Logan      (see all posts) 2011/11/08 (Tue) @ 12:49

Just scanned the “Solving DIPS” pdf (linked in comment #23 to the Derek Carty thread) in which Arvin presents the formula Observed Variance = Binomial Variance + True Variance (is that “True Talent Variance”?). My comment #1 above refers to the Binomial Variance, i.e., what you would get from a simulation where all events were independent. So, apparently, the discrepancy between Pizza Cutter’s thresholds and what you would expect from independent binomial (two-outcome) events is in that True Variance. But I don’t understand the mechanics of that.


#5          (see all posts) 2011/11/08 (Tue) @ 17:07

This is only an issue if you don’t include error bars on your estimates.  Of course, if you do include error bars on your estimates, you’ll notice how often the discussion is about things that have overlapping error bars.

This is also an argument for statistics that scale with significance (signal-squared over total sample) rather than rate (signal over total sample) or count (signal).


#6    Lex Logan      (see all posts) 2011/11/08 (Tue) @ 20:56

Tango, regarding r = .5 vs. .7, you say that r = .5 means you can say that a result is half real and half noise. My stats training is that an r^2 of .5 allows you to say that, which I recall is exactly why PC chose r = .7. To be specific, the Statistics books I use for tutoring say that in a linear regression, r^2 represents the percent of variation explained by the model.


#7    Lex Logan      (see all posts) 2011/11/08 (Tue) @ 21:20

Tango/3: Here’s what I understand: If I set up a simulation and assign, say, a strikeout rate of 5% to a player, I’m going to observe actual results that will vary due to chance alone. I assume that’s what Arvin called Binomial Variation. If I run the simulation without knowing the 5% (the player’s “True Talent”) I can construct a confidence interval around the observed rate, using well-known formulas. If I observe a rate of 5% on 100 PA’s, I can be 50% confident the true rate is between 3.5 and 6.5%, and 95% confident it’s between .7 and 9.3%. On 1000 PA’s, I’d be 50% confident of a rate between 4.5 and 5.5%, and 95% confident of 3.6 to 6.4%. The formula does involve the rate, so for something like BA, an observed rate of .300 would imply .269 to .331 (100 PA’s, 50% confidence.) The point is that the Binomial Variation will not differ enough to explain PC’s vastly different “stability” levels.
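The intervals quoted above can be reproduced with the usual normal approximation to the binomial. This is a sketch of that arithmetic only, assuming that is the "well-known formula" meant; the function name is illustrative, and the code describes sampling error around a rate, not what the interval says about a player's talent.

```python
from math import sqrt
from statistics import NormalDist


def binomial_interval(rate, n, confidence):
    """Normal-approximation interval for a proportion observed over n trials."""
    # Two-sided z value, e.g. about 0.674 for 50% and about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    se = sqrt(rate * (1 - rate) / n)  # standard error of a proportion
    return rate - z * se, rate + z * se


for rate, n in [(0.05, 100), (0.05, 1000), (0.300, 100)]:
    for confidence in (0.50, 0.95):
        low, high = binomial_interval(rate, n, confidence)
        print(f"rate={rate:.3f} n={n:4d} {confidence:.0%}: {low:.3f} to {high:.3f}")
```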

So, there must be more going on here, i.e., real baseball isn’t a simple simulation where the batter’s outcomes depend only on some pre-set “True Talent” level. Here are some possibilities:

(1) events are not independent
(2) the talent level of opposing pitchers varies
(3) the talent level of opposing fielders varies
(4) variations in ballparks, umpires and weather
(5) the player’s talent level varies from game to game
(6) variations in game situations lead to differences in the player’s and opponent’s approach or effort.

Yet I have a hard time believing that such effects vary so drastically from one rate stat to another to explain the huge scale differences in PC’s “stability” counts.


#8    Tangotiger      (see all posts) 2011/11/08 (Tue) @ 21:27

Lex/6: r-squared is the percent of the variance, not the “variation”.

The UNIT of variance is the square of whatever you are measuring.

What we actually care about is the unit, not its square.  We regress the unit, not the square.  Hence, r, not r-squared is what we really care about.

We have a few threads on r and r-squared.  Check them out.


#9    Lex Logan      (see all posts) 2011/11/08 (Tue) @ 21:42

Tango/8: Wikipedia confirms that, and of course it makes sense—but I would’ve sworn the textbooks we use say “percent variation.” Probably my mistake. I’ll be careful to explain that properly to my students.


#10    mickeyg13      (see all posts) 2011/11/08 (Tue) @ 23:33

Brian Burke accidentally sparked a similar mini-argument about R vs. R^2 in the comments of a recent Fangraphs post.  Click my name if you are interested.


#11    MGL      (see all posts) 2011/11/09 (Wed) @ 05:32

“My issue is the appearance that ‘stability’ is a binary condition...that certain stats are ‘unstable’ before the threshold and ‘stable’ thereafter.  People want simple yes/no answers for whether or not certain stats can be trusted, but the truth is a far more nuanced answer.  Every single plate appearance gives *some* small amount of information, and these studies help us to find out just how much information that is.”

Well said!  I absolutely HATE the notion of a stat “stabilizing” at a certain level, as if it is a binary choice, as mickeyg13 eloquently states.  It is misused so much that it makes me cringe.  People often ask me, “At what point does UZR (or some other metric) stabilize?” and I usually answer with some snarky response that pisses them off.  I blame it on the (many) people who write about metrics “stabilizing” at some magical level.  Hate it. Hate it like I hate “statistical significance”.  Actually I hate the “stabilization” thing a lot more…


#12    Kincaid      (see all posts) 2011/11/09 (Wed) @ 10:09

The True Variance in Arvin’s formula is true talent variance across the population.  That is the driving factor for the varying regression amounts for each statistic.

The “stability” that Pizza Cutter, et al are talking about is a different issue from binomial confidence intervals.  Binomial confidence intervals only deal with sampling error (i.e. if you know the probability is .05, the results will fall within a given range 95% of the time).  They do not give you the best estimate for inferring the true rate from the observations if you have additional information about the true rates of the population (unless you assume the population is uniformly distributed).

For example, say you have a hitter who goes 15 for 30, and your 50% confidence interval is .423-.577.  That doesn’t mean that you are 50% sure the hitter’s true batting average talent is within that range.  In reality, it is extremely unlikely that his true rate falls within that range, because we know that there probably are not any hitters that good in MLB.  The 25th-75th percentile for true talent of an MLB hitter who goes 15-30 is probably more like .260-.295, or something like that.  We get that by combining our 15-for-30 observation with our knowledge of the distribution of talent across the population of MLB hitters; both pieces of information are important for understanding the type of “stabilization” being referred to here.

You are right that the random binomial error doesn’t change that much for most statistics, given a certain sample size, which is why the variance of true talent across the population is the primary factor in determining the regression amount.

For MLB pitchers, something like 95% have a true talent BABIP between .280 and .310 (numbers for illustration only), so if someone allows only 20 hits on 100 balls in play, you would still regress that pretty heavily toward the league average.  For strikeouts, however, you get a much wider range of true talents (maybe 10-22% of batters faced, or something like that).  If a pitcher strikes out 20 out of 100 batters faced, you’d regress that a lot less than BABIP, because you know from the distribution of talent across the population that it is more likely his K-rate is far from the league average than it is that his BABIP is far from the league average.
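A rough sketch of how the spread of true talent drives the regression amount, built on the Observed Variance = Binomial Variance + True Variance formula mentioned in comment #4. The assumptions here are mine: each illustrative range above is read as a mean plus or minus two standard deviations, the league means are taken as the midpoints of those ranges, and the function name is hypothetical.

```python
def regression_fraction(talent_sd, mean_rate, n):
    """Share of an observed rate to regress toward the mean after n trials.

    Noise share of the observed variance: binomial / (binomial + true talent).
    """
    talent_var = talent_sd ** 2
    noise_var = mean_rate * (1 - mean_rate) / n  # binomial sampling variance
    return noise_var / (talent_var + noise_var)


# Assumption: the quoted ranges span about +/- 2 SD of true talent.
babip_sd = (0.310 - 0.280) / 4
k_sd = (0.22 - 0.10) / 4

babip = regression_fraction(babip_sd, 0.295, 100)  # 100 balls in play
k = regression_fraction(k_sd, 0.16, 100)           # 100 batters faced

print(f"BABIP after 100 BIP: regress {babip:.0%} toward the mean")
print(f"K rate after 100 BF: regress {k:.0%} toward the mean")
```

The binomial noise is similar for the two stats, so nearly all of the difference in regression comes from the talent spread.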


#13    Kincaid      (see all posts) 2011/11/09 (Wed) @ 10:13

For the purposes of regression to the mean, r tells you how much of the variance is explained by the variance of true talent across the population.  Since the true rates can’t be measured directly, only estimated by sampling, the correlation is taken between two distinct samples.  The r^2 between the two samples will tell you how much of the variance of one sample explains the variance of the other sample, while the r between the two samples will equal the r^2 between each sample and the distribution of true rates.  What we care about is how much of each sample is explained by the variance of the underlying true rates, so regression to the mean uses r instead of r^2.

For example, say you have two samples, and the r^2 between them is .25 (r=.5).  In other words, 25% of the variance of one sample is explained by the variance of the other sample.  Each sample is itself a mixture of the variance of true rates and random error around the true rates.  Let’s say that the variance of each sample is 50% explained by the variance of true rates.  Since the rest of the variance is random error, which is uncorrelated between samples, all of the 25% of variance explained between samples is from their common relation to the true rates.  As a result, 50% of the variance of one sample explains 50% of the variance of the other sample, which gives us 25% of the overall variance explained.  That’s why the r taken between two samples, and not the r^2, implies the amount of variance explained by the distribution of true rates.
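The claim can be checked with a small simulation. This is a sketch under assumed conditions: true rates and random error are both drawn from a standard normal, so true talent explains 50% of each sample's variance, and the seed and population size are arbitrary choices. Under those assumptions the r between the two samples should land near .5, matching the r^2 between either sample and the true rates.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(2011)  # fixed seed so the run is repeatable
n_players = 20000

# Equal variance for talent and noise: each sample is half real, half noise.
true_rates = [rng.gauss(0, 1) for _ in range(n_players)]
sample_a = [t + rng.gauss(0, 1) for t in true_rates]
sample_b = [t + rng.gauss(0, 1) for t in true_rates]

r_between = pearson_r(sample_a, sample_b)
r2_with_truth = pearson_r(sample_a, true_rates) ** 2

print(f"r between the two samples:      {r_between:.3f}")
print(f"r^2 between sample and truth:   {r2_with_truth:.3f}")
print(f"r^2 between the two samples:    {r_between ** 2:.3f}")
```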


#14    Lex Logan      (see all posts) 2011/11/09 (Wed) @ 10:39

Kincaid/12: Thanks, it will take me a while to digest that. I’ve already confirmed that the Stat textbooks do, in fact, state that r^2 from a linear regression represents the percent variaTION explained, not variaNCE. A simple thought experiment: if my annual earnings are the sum of my Spring and Fall earnings, each of which is variable, Tango’s comments seemed to imply I should see r = .5 when comparing either of those to the total. But that would translate to a variance of .25 for each, and leave .50 variance unexplained. Since there is no other source of income, the total variance must be 1.00, and so r^2 must average .5 for each.

One problem I have is that I can’t find any online or textbook explanation of the method of “regression to the mean”, except in sabermetric discussions! One textbook actually gives a baseball example, saying you can predict batting average by averaging a player’s past season with the league average for that season, but no reference is cited and no methodology for reaching that conclusion is given.

I followed Pizza Cutter’s empirical, split-half methodology fairly well, I just want to understand the theoretical underpinnings of such results—why do some stats show so much more variation than others? I’m going to assume that it’s because some stats depend almost entirely on a single player’s talent and others (batting average is a prime example) involve multiple sources of variation.


#15    Tangotiger      (see all posts) 2011/11/09 (Wed) @ 11:02

Lex: may I recommend the Appendix in The Book?  That was written by Andy, and his knowledge of this topic is as thorough as anyone out there.


#16    Kincaid      (see all posts) 2011/11/09 (Wed) @ 11:40

You may have better luck finding textbook information on Buhlmann credibility, which is basically the same as what we are doing with regression to the mean.  I think it tends to be taught as an actuarial concept, so textbooks geared toward that sort of thing might have something on it.  It is also a shortcut for Bayesian probability (under certain conditions, they are equivalent), which would probably be easier to find in a textbook.

Bayesian reasoning has to do with incorporating prior information into our interpretations of observed data.  How much an observation tells us depends on what we know about the population the observation comes from. 

For example, if you flip an unaltered quarter a bunch of times, no matter how many heads or tails you observe, you are still going to estimate its true rate at close to 50/50 simply because all quarters have true rates within a very tight range.  If you took 100 quarters and flipped each one 10 times, and then repeated the experiment, you’d see very little correlation between the results of the two experiments.  You would be regressing the outcomes close to 100% toward the mean (i.e. whether you observe 2 out of 10 heads or 8 out of 10 heads, you’d still estimate the true rate at about .5).

Now say you have 100 weighted coins, all of which have different true rates.  This time when you flip each one 10 times, and then repeat, you’ll get a higher correlation between the two samples.  With the fair coins, virtually all of the variation in the results was due to random variation.  With weighted coins, you still have close to the same amount of random variation, but you also have additional variation from the true rates.  The variation of true rates will be persistent, so the results from your sample will tell you something about how much the true rate likely differs from average.  You would regress these results less than the results from the fair coins, because more of the variation from one sample will persist outside of that sample.
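The two coin experiments are easy to simulate. A sketch, with a few assumptions of my own: 5,000 coins instead of 100 so the correlations are not swamped by sampling noise, weighted coins with true rates drawn uniformly between .2 and .8, and an arbitrary seed.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def count_heads(p, flips, rng):
    """Number of heads in `flips` tosses of a coin with heads probability p."""
    return sum(rng.random() < p for _ in range(flips))


def repeat_correlation(true_rates, flips, rng):
    """Flip every coin `flips` times, repeat, and correlate the two runs."""
    first = [count_heads(p, flips, rng) for p in true_rates]
    second = [count_heads(p, flips, rng) for p in true_rates]
    return pearson_r(first, second)


rng = random.Random(10)  # fixed seed so the run is repeatable
n_coins = 5000           # assumption: larger than the 100 coins in the text

fair = [0.5] * n_coins
weighted = [rng.uniform(0.2, 0.8) for _ in range(n_coins)]  # assumed spread

r_fair = repeat_correlation(fair, 10, rng)
r_weighted = repeat_correlation(weighted, 10, rng)

print(f"fair coins:     r = {r_fair:.3f}  (regress nearly 100%)")
print(f"weighted coins: r = {r_weighted:.3f}  (regress about {1 - r_weighted:.0%})")
```

The fair coins should show a correlation near zero and the weighted coins one well above it, which is the whole difference between a stat like pitcher BABIP and one like strikeout rate.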

Baseball stats essentially work the same way.  Different stats have different amounts of true talent variation across the population, so they regress differently.  Some stats, like pitcher BABIP, have relatively little variation, while others, like strikeout rate, have a lot more variation.  There can be many reasons for the differing amounts of true talent variance.  One is how many other factors are involved besides a player’s own skill.  Another is the selection criteria for the population; for example, pitchers in general can have widely varying BABIP talents, but MLB pitchers don’t because it is virtually impossible to break into MLB if you are really bad at preventing hits on balls in play.  A pitcher can have a poor strikeout rate and be an MLB quality pitcher by limiting walks and home runs, but there are not many pitchers who have particularly high true talent BABIP rates who also have enough other skills to offset that.


#17    Lex Logan      (see all posts) 2011/11/09 (Wed) @ 11:50

Thanks again. I think I’m getting it: in my thought experiment, I was assuming Spring and Fall income had no correlation. But if I gathered data for multiple tutors, their data would show some correlation between Spring and Fall—their “true talent level” (subjects taught, hours available, perceived competence, etc.) That’s more like what we’re doing with baseball analysis.

Tango, I’ll review the appendix, thanks.


#18    Tangotiger      (see all posts) 2011/11/09 (Wed) @ 12:12

The analogy to baseball analysis is student test scores.

***

Kincaid: great post.


#19          (see all posts) 2011/11/09 (Wed) @ 20:21

MGL,

Would a better way of stating it be something along the lines of:

“X stat becomes reasonably meaningful after Y number of observations. This means that it holds more signal than noise.”

That is the point of setting the threshold at r=.5, right? Preferably, a less subjective word than reasonable would be used.

fwiw, I’m pretty sure a statistician would insist on R^2>.7. At least that’s how we did things in Econometrics and the prerequisite stats class.


#20    MGL      (see all posts) 2011/11/09 (Wed) @ 21:34

"“X stat becomes reasonably meaningful after Y number of observations. This means that it holds more signal than noise.””

The second part but not the first.  The first part would imply that .45 is not “reasonably meaningful” but .5 is?  That is ridiculous. And who anointed .5 or .7 as the threshold at which something becomes “meaningful?”

And what does “meaningful” mean in this context?  As many people have already pointed out, ANY sample is “meaningful.” It just depends on how much.  .4 is certainly meaningful.  And if you are using .7 as your threshold, what, .6 is not meaningful?

The whole thing is a ridiculous way of couching the concept of regression and signal to noise ratio. The worst part is that it leads to too many misunderstandings of the data and the conclusions that the data suggest.

Simply take the observational data, do the appropriate regression, and use the result to make whatever point you want to make…

