THE BOOK--Playing The Percentages In Baseball

Thursday, March 18, 2010

Statistical Significance, or the reason that mathematician Ron Fisher is on MGL’s “On Notice” Board

By Tangotiger, 09:34 AM

Glove-slap Chuck:

Fisher’s P value eventually became the ultimate arbiter of credibility for science results of all sorts — whether testing the health effects of pollutants, the curative powers of new drugs or the effect of genes on behavior. In various forms, testing for statistical significance pervades most of scientific and medical research to this day. But in fact, there’s no logical basis for using a P value from a single study to draw any conclusion.

This article is so well-written and well-researched, but I think this is misleading:

Correctly phrased, experimental data yielding a P value of .05 means that there is only a 5 percent chance of obtaining the observed (or more extreme) result if no real effect exists (that is, if the no-difference hypothesis is correct).

There’s a gap between the main part of the sentence and the parenthetical part.  The parenthetical part is correct and what we care about.  I would say, he should have said:

Correctly phrased, experimental data yielding a P value of .05 means that there is only a 5 percent chance that the observed result occurred if no real effect exists

That is, it’s not that what we observed is an indication that this particular observation is real.  It means that what we observed is an indication that something is going on TO SOME DEGREE, from non-zero up to what we actually observed. 

If someone wants to take a stab at how to better phrase, please do so. 

The main point is that if the p value of a .330 OBP hitter who gets a .530 OBP in the clutch is .05, this ONLY means that there’s a 5% chance that he got an OBP that high by luck, and so we have to conclude that we’re 95% sure that his true OBP is somewhere above .330, though the information provided here tells us nothing about our best guess as to his true OBP in the clutch. 

It does not mean that we’re 95% sure that his clutch OBP is .530.  And this last part is pretty much how I see conclusions being made.
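
As a rough illustration of what the p value in this example measures, here is a minimal sketch that computes the chance of a true .330 OBP hitter reaching base at a .530 clip or better by luck alone. The sample sizes are assumptions (the post does not say how many clutch plate appearances are involved), and `binom_tail` is a helper written for this sketch, not something from the article.

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): the one-sided p value."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

TRUE_OBP = 0.330      # null hypothesis: clutch OBP equals overall OBP
OBSERVED_OBP = 0.530  # clutch OBP observed in the post's example

# ASSUMPTION: hypothetical numbers of clutch plate appearances (PA).
for n in (10, 20, 30, 40, 100):
    times_on_base = round(OBSERVED_OBP * n)
    p_value = binom_tail(n, times_on_base, TRUE_OBP)
    print(f"{n:4d} PA, on base {times_on_base:3d} times: p = {p_value:.4f}")
```

The same .530 figure produces very different p values depending on how many plate appearances sit behind it. In every case the number describes the sample given a .330 hitter; it does not describe the hitter's true clutch OBP.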


#1          (see all posts) 2010/03/18 (Thu) @ 10:04

I haven’t read the article yet, so I’m not sure if it’s necessary or fits, but I would go a step further and clarify exactly what you’re saying in the final few paragraphs:

“Correctly phrased, experimental data yielding a P value of .05 means that, if we assume that no effect exists, there is only a 5 percent chance that the observed result could have occurred.  We can reliably reject the null hypothesis (no effect) and assume that some effect exists, but we cannot determine the true effect without more testing.”


#2    Peter      (see all posts) 2010/03/18 (Thu) @ 10:15

Well said, and this is why I’d say confidence intervals>p-values for any situation where you care about the size of the effect and not just whether it’s non-zero, even though c.i. and p-value are based on the same underlying calculation.


#3    Tangotiger      (see all posts) 2010/03/18 (Thu) @ 11:19

"confidence intervals>p-values”

That should read:

confidence intervals >>>>>>>>>>>>>> p-values > no study


#4    wcw      (see all posts) 2010/03/18 (Thu) @ 12:12

The article isn’t bad, I guess, but it spent a whole lot of words to communicate much of what I can by telling someone, ‘google Bonferroni.’


#5    J. Cross      (see all posts) 2010/03/18 (Thu) @ 12:19

we’re 95% sure that his true OBP is somewhere above .330

ok, but, if I’m understanding you correctly, this relies on the assumption that before the experiment we thought there was a 50% chance that his clutch OBP was above .330 and a 50% chance that it was below .330 (although this seems like a fine assumption in this case) and were assigning a 0% chance to the idea that there’s no “clutch effect”.


#6    Tangotiger      (see all posts) 2010/03/18 (Thu) @ 12:23

J, you are the professor, so you can correct me.  But if I understand the process correctly, the null hypothesis is that there is no such thing as a clutch skill (so we’re starting with the opinion that there is a 100% chance of no skill).  So, we’re not assigning any probabilities.  This isn’t a Bayes issue where we’re starting with some prior.


#7    Greg Rybarczyk      (see all posts) 2010/03/18 (Thu) @ 12:32

Tom, I think I agree with the statement you’ve made here (that the 0.05 p-value only tells us there is likely a difference, but doesn’t prove some amount of difference), however, let me pose a scenario:

Suppose we were presented two sets of SAT scores, group A and group B.  We’re told nothing else about them.  Now we perform some sort of hypothesis test on the two groups (let’s say a 2-sample T-test), and we get the following results, having set our null hypothesis to be, “the two groups are drawn from the same population”:

group A: mean 420, SD 40
group B: mean 660, SD 40
p-value: 0.001 (I’m making up this number, but whatever it is, it’s small)

Now, I would interpret these results as follows:

- the likelihood that groups A and B, if truly drawn from the same population, would end up this dissimilar, is 0.001 (or whatever the actual p-value would be for these numbers).
- this likelihood is sufficiently small for me to reject the null hypothesis, and accept the alternate hypothesis that the two groups were drawn from different populations.
- now that I have concluded that the two groups are from different populations, if someone were to ask me what I considered the most likely mean value for group B, I would have to say 660. 

Because I have concluded that group B is unrelated to group A (having been persuaded by the data), it would not make sense for me to say either that a) I think the most likely mean for group B is 420, or b) that I don’t know what the most likely mean for group B is (remember, I am saying what is most likely given the data I have, I am not asserting that I know what group B’s mean is, those are two very different things).

Now, let me say that the baseball situation described here is DIFFERENT.

Here are what I think are the differences between my scenario and the “overall OBP” vs. “clutch OBP” comparison:

1.  Overall OBP and Clutch OBP overlap.  This is the most obvious one.  Some data points exist in both sets.

2.  Most reasonable people would expect Overall OBP and Clutch OBP to be correlated (i.e. not independent), based on knowledge of the two metrics, and a long history of interacting with such data.

For these two reasons, I don’t think most analysts would ever fully decide to reject the null hypothesis and regard Clutch OBP as entirely independent of Overall OBP.  Even if the simple hypothesis test in this case suggests that we reject the null hypothesis, I think most people wouldn’t.

Therefore it is proper to believe that a low p-value doesn’t mean you really know anything about the likely true value of Clutch OBP (other than it is likely larger than Overall OBP for this proposed case)…


#8    J. Cross      (see all posts) 2010/03/18 (Thu) @ 12:39

I teach chemistry (and only dabble in math) so I have absolutely no business correcting you.  If I want to be a good Bayesian I should start with the assumption that you’re almost certainly correct but here’s my understanding:

I’m saying that, of course, there’s some clutch skill - it might just be incredibly incredibly small but it *has* to exist b/c there’s no such thing as “no effect” in the physical world.  If I start with the assumption (pre-study) that there’s a 50% chance of positive clutch skill and a 50% chance of negative clutch skill, that p-value shifts my opinion to thinking that there’s a 95% chance of a positive clutch skill and only a 5% chance of a negative clutch skill.  Without knowing the population of possible clutch skills within the population, that’s as much as I can say.


#9    Greg Rybarczyk      (see all posts) 2010/03/18 (Thu) @ 12:47

No such thing as no effect?  I disagree.

What effect does the color of my T-shirt that I wear at home while watching the game have on any result in the game?  Is it zero (my position), or something very small but non-zero (your position, I think)?

Unless you’re making one of those “butterfly effect” arguments, I think there are nearly infinite numbers of factors that have no effect on a baseball game.

Now, if you took data to measure the effect of my shirt color 2,500 miles away, you would OF COURSE calculate some non-zero impact, most likely small but perhaps not, but this would OF COURSE be statistical noise and not a real effect.  Is that what you meant by “no such thing as ‘no effect’”?


#10    Sunny Mehta      (see all posts) 2010/03/18 (Thu) @ 13:36

"if the p value of a .330 OBP hitter who gets a .530 OBP in the clutch is .05, this ONLY means that there’s 5% chance that he got an OBP that high by luck, and so we have to conclude that we’re 95% sure that his true OBP is somewhere above .330

I’m pretty sure that’s incorrect, as written. Frequentist methods, by definition, do not offer insight into the probability of the parameter. They tell you the probability of the sample, given a hypothetical parameter.

If you observe a sample of a .530 OBP, no p-value or confidence interval can tell you the probability of his true OBP. They just tell you that IF his true OBP were .330 and you took an infinite number of samples (of the same size as the observed sample), 95 percent of them would be below .530.

P(A|B) is not at all the same thing as P(B|A).
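
To make the P(A|B) versus P(B|A) distinction concrete, here is a rough simulation sketch. Everything about the setup is an assumption made for illustration: 50 clutch plate appearances per hitter, true clutch OBPs spread normally around .330 with a standard deviation of .010, and 20,000 simulated hitters. None of these figures comes from the thread.

```python
import random
from math import comb

random.seed(0)

NULL_OBP = 0.330    # overall OBP from the post's example
N_PA = 50           # ASSUMPTION: clutch plate appearances per hitter
PRIOR_SD = 0.010    # ASSUMPTION: spread of true clutch OBP around .330
N_PLAYERS = 20000   # ASSUMPTION: number of simulated hitters

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Smallest on-base count whose one-sided p value under the null is <= .05.
cutoff = next(k for k in range(N_PA + 1)
              if binom_tail(N_PA, k, NULL_OBP) <= 0.05)

flagged = 0        # hitters whose sample is "significant"
flagged_above = 0  # ...and whose true clutch OBP really is above .330
for _ in range(N_PLAYERS):
    true_obp = random.gauss(NULL_OBP, PRIOR_SD)
    on_base = sum(random.random() < true_obp for _ in range(N_PA))
    if on_base >= cutoff:
        flagged += 1
        if true_obp > NULL_OBP:
            flagged_above += 1

share_above = flagged_above / flagged
print(f"cutoff {cutoff}/{N_PA}, flagged {flagged}, "
      f"share truly above .330: {share_above:.2f}")
```

Under these assumptions, half of all hitters start out truly above .330, yet the share of "significant" hitters who are truly above .330 comes out well short of 95%. How far short depends entirely on the assumed spread of true talent, which is exactly the information a p value does not contain.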


#11    Tangotiger      (see all posts) 2010/03/18 (Thu) @ 13:42

- now that I have concluded that the two groups are from different populations, if someone were to ask me what I considered the most likely mean value for group B, I would have to say 660.

That’s not correct.  The correct answer is: between 420 and 660 AND it is higher than whatever group A is.

No such thing as no effect?  I disagree.

When I use the “of course it has to exist” argument, I say it this way: it is impossible to introduce a human being to any environment and think that that person will not be affected differently relative to other humans in that environment.

So, take your 800 MLB players, put them in a blowout game.  Take the same guys and put them in a World Series game, and you are guaranteed to get some sort of Rick Ankiel and Mariano Rivera come out.  And I can say that because people are humans and not automatons.

The problem is trying to figure out who those people are before the experiment is over.  Which is why it’s a fool’s errand to focus on clutch performance: even if you find it, it’s too late, and of too little use to do anything with it.


#12    J. Cross      (see all posts) 2010/03/18 (Thu) @ 13:50

Well, okay, I think I really mean to say no such thing as “no relationship” not no such thing as “no effect.” That, at least in theory, if we were to replay that day over an infinite number of times there would be some relationship (undoubtedly incredibly small) between the instances when you wore a red shirt and the instances when your team won - maybe both were dependent on the weather to some extent.

In the case of “clutch effect” there are almost certainly differences in the kinds of pitchers (hard throwers v. soft throwers, or whatever) in these situations v. all situations and any batter no doubt is either somewhat worse or somewhat better against these pitchers… now, all of this might not amount to much.  It’s probably negligible.  But it can’t really be zero. 

I’m not saying this just to be a pest.  If we know that the null hypothesis is wrong to begin with, what does it mean to reject it? 

If we broaden the null hypothesis to “clutch hitting effect is negligible” then I think we have to know what portion of the population has “negligible” clutch skills in order to correctly interpret our p-value. 

Let’s say that only 1/1000 players have non-negligible clutch skills and I get a p-value of 0.05.  Then, it’s still (roughly) 50 times more likely that this player has negligible clutch skills than that he has real clutch ability.  I’d be a fool to reject the null hypothesis.
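
The "roughly 50 times" figure can be checked with a few lines of arithmetic. This is a sketch under two simplifying assumptions that the comment does not state: a negligible-skill player has a 5% chance of producing a result this significant, and a truly clutch player always does (the most favorable case for real skill).

```python
prior_real = 1 / 1000  # share of players with non-negligible clutch skill (from the comment)
alpha = 0.05           # ASSUMPTION: chance a negligible-skill player still shows p <= .05
power = 1.0            # ASSUMPTION: every truly clutch player shows p <= .05 (best case)

false_alarms = (1 - prior_real) * alpha  # negligible skill, yet "significant"
true_hits = prior_real * power           # real skill and "significant"

odds_negligible = false_alarms / true_hits
prob_real = true_hits / (true_hits + false_alarms)

print(f"odds of negligible vs. real skill: {odds_negligible:.2f} to 1")
print(f"probability the skill is real:     {prob_real:.3f}")
```

With a lower assumed `power`, the odds against real skill only get longer, so about 50 to 1 is the floor under these inputs.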


#13    Tangotiger      (see all posts) 2010/03/18 (Thu) @ 14:00

In that case, you are doing Bayes, since you are applying a prior.

Take for example the difference between starting and relieving.  I find a difference of 1 run per game of the same pitchers in different roles.

IF I KNEW NOTHING about baseball, I would look at the difference, conclude that there is a statistically significant non-zero difference, and come out with an interval of say 95% sure it’s between 0.1 and 0.7 runs or something.

That’s because I applied no prior.

But, I can re-run that study for every decade, and I’ll always find the 1 run difference.

Now, what if instead I apply Bayes?  Well, in this case we EXPECT there to be a difference because we KNOW a pitcher throws harder the fewer innings he’s expected to throw, and he paces himself more as a starter, etc.

And if we had a prior of 1 run difference, it’s no surprise at all that we get 1 run in our decade-by-decade sample.

So, I can look at the differences, and come up with different confidence intervals, depending on whether I introduce a prior, or I am simply testing to see if the results were random.


#14    Greg Rybarczyk      (see all posts) 2010/03/18 (Thu) @ 14:16

OK, Tom, we’re getting stuck on whether there is some unspoken but assumed commonality behind the two samples, such as “both groups are baseball players”, or “Both groups are college-bound high school students”.

Let me try another example:

I have two groups of mechanical gears: sample group A is a set of very small watch gears, with mean diameter of 0.10 inch, while sample group B are gigantic reduction gears for a large ship, with mean diameter of 8.0 feet.  Never mind the SD’s, they don’t matter.

Supposing I only had the data sets, consisting of measurements of the diameters, and didn’t know the provenance of the two samples.  If I ran a 2-sample T-test on the two data sets, I would get a vanishingly small p-value, and would no doubt conclude that the two samples came from different parent populations.  Then if asked, “what is your best guess as to the population mean of sample B”, I would say 8.0 feet.

I don’t think you are really suggesting that because a length measurement of an unrelated sample group yielded a smaller value for diameter, then my estimate of group B’s diameter should be influenced in any way by group A’s measurement.  Are you? 

If you were, you’d also believe that because I once timed a 400 meter race (average time: 1 minute) with my stopwatch, my best guess as to the typical flight time of a home run (average: 4.8 seconds) should be adjusted upwards to something between 4.8 and 60 seconds.

I think instead that you are taking into account an underlying assumption you have that the two data samples in question (Overall OBP and Clutch OBP) are drawn from a single overall population with fundamental similarities, and inside that one population (MLB players), you are looking for subgroups: people who perform differently under certain circumstances.

This is fine.  It’s probably true.  Taking that into account for baseball analysis makes sense.  It’s a related effect to regressing baseball performance parameters. So when you measure a pitcher’s fastball speed, you regress your measurements back towards some population mean before you say anything about true performance levels, and when you measure someone’s Clutch OBP, you regress it back towards Overall OBP for the league, or for whatever player subgroup you think the player belongs in.

Just recognize that it is this assumption of a common underlying population that leads you to do this.  It is not inherent in all situations (nor really in most) where you use hypothesis testing.  Reduction gear diameters have nothing to do with watch gear diameters.  Product delivery times by ship have nothing to do with delivery times by airplane (until each lands, that is).  Etc…

As for the disagreement about no effect, obviously I was disputing the very general point JCross made, and not a more specific assertion about whether a particular effect is non-zero.  Obviously some effects are non-zero.  Just as obviously, IMO, some effects ARE zero.


#15    Tangotiger      (see all posts) 2010/03/18 (Thu) @ 14:24

Yes, definitely, the assumption is that the reason you are comparing the two samples is that they are somehow drawn from the same population.  We are testing whether they were RANDOMLY drawn from the same population, or if there is a BIAS in how the selections were made.


#16    anon      (see all posts) 2010/03/18 (Thu) @ 14:27

Tango 11: “The correct answer is: between 420 and 660 AND it is higher than whatever group A is.”

I’m nit-picking a bit, but the power of regression to the mean depends on how many populations you’re looking at and what you know about them beforehand.  With only two group means, there’s typically no benefit in shrinking 660 towards 420.  If you have 3 or more, then it becomes valuable and the value increases with the number of groups. 

It would also be valuable to shrink 660 towards the SAT population mean (around 500), but that again presupposes additional information about the SAT.


#17    lincolndude      (see all posts) 2010/03/18 (Thu) @ 16:03

Tangotiger, I have to disagree with you regarding the passage you quoted.  I believe it precisely articulates the point the author is trying to get across.  If the null hypothesis is true and the p-value is .05, you would see the observed result (or a more extreme result, which is a key qualifier—there has to be a range) only 5% of the times you ran the test.

Is this equivalent to saying there is a 95% chance that the null hypothesis is false?  I’m not sure about that.

But with careful use of this idea, it’s clear that nobody should be saying there’s a 95% chance that, say for your example, clutch ability is .530.  Drawing that conclusion strikes me as simply being sloppy.


#18    Tangotiger      (see all posts) 2010/03/18 (Thu) @ 16:20

But with careful use of this idea, it’s clear that nobody should be saying there’s a 95% chance that, say for your example, clutch ability is .530.  Drawing that conclusion strikes me as simply being sloppy.

I would say that at least 50% of the time, if not far more, when I see conclusions on statistical tests, it is of the “.530 is real”, rather than “.530 shows that it’s more than .330”.


#19    J. Cross      (see all posts) 2010/03/18 (Thu) @ 16:21

My impression is that without doing Bayes, in other words without any prior (including an unspoken prior of “all outcomes are equally likely”) I can’t draw any conclusion from a p = 0.05 other than “If the null hypothesis were correct, an effect this size or larger would happen 5% of the time.”

Without a prior I can’t say anything about the chances that there is an effect and I can’t say anything about the likely size of the effect. 

I can say “this result would be unlikely with an effect less than/more than x size” but that doesn’t tell me how likely it is that there’s an effect of x size.

I think I’m trying to say what Sunny/10 said:

Frequentist methods, by definition, do not offer insight into the probability of the parameter. They tell you the probability of the sample, given a hypothetical parameter.

I’m on board with the point that we’re not choosing between .330 and .530 OBP based on a p-value but, I think, we’re also not figuring out how likely it is that the OBP is above .330.  We’re evaluating how likely finding .530 would be if it really was .330.

I also think there’s a reasonable probability that I’m missing the point of what’s being said.


#20    lincolndude      (see all posts) 2010/03/18 (Thu) @ 16:35

Tangotiger/18, Fair enough, and that’s crazy.  I’m just saying that the passage you quoted is very precise.  Reading it carefully should not lead anyone to that conclusion.

J. Cross/19, I’m with you here:

“but, I think, we’re also not figuring out how likely it is that the OBP is above .330.”

I’m curious about this.  Is it fair to reverse the logic and say that if finding that high a value given the null hypothesis is unlikely, then the chance that the null hypothesis is incorrect is likely?  And to just subtract the probability from 1 (ie. going from the 5% statement to the 95% statement)?  It seems like there’s a leap being made there, but perhaps I’m not seeing it.


#21    J. Cross      (see all posts) 2010/03/18 (Thu) @ 16:54

So,

Prob(null hypothesis given this result) = Prob(this result given null hypothesis)*Prob(null hypothesis in population)/Prob(this result)

or P(A|B) = P(B|A)*P(A)/P(B)

the p-value gives P(B|A), the probability of this result (or a more extreme one) given the null hypothesis.  This is the same as P(A|B), the probability of the null hypothesis given this result, only if P(A) and P(B), the overall probability of the null hypothesis and the overall probability of this result, are equal.

I’m really hoping I didn’t mess that up.



#23          (see all posts) 2010/03/19 (Fri) @ 09:39

I agree with lincolndude/17, that it looks like the author phrased it fine in the original.  I think he’s saying the same thing as Tango is.

And I don’t think the author is saying you conclude .530 (but I don’t think Tango is either).


#24    Greg Rybarczyk      (see all posts) 2010/03/19 (Fri) @ 09:46

Well done, Phil!


#25    Tangotiger      (see all posts) 2010/03/19 (Fri) @ 09:52

Right, he said it correctly, but I think the lay person will jump to such a conclusion.  So, he’s right parenthetically, he’s right non-parenthetically, but I don’t think he’s shown WHY he’s right.  I don’t know that the lay person would make the connection, about the implication of what I’m saying.


#26    J. Cross      (see all posts) 2010/03/19 (Fri) @ 10:13

Yes, a very clear explanation by Phil on a topic that’s tough to present in understandable terms.


#27    Scott Segrin      (see all posts) 2010/03/20 (Sat) @ 08:50

Hi Tom and everyone else here.  I apologize if I’m butting in the middle of a conversation, but there’s been an issue that’s been bugging me for years and it seems appropriate here.

When I was in college (some 30 years ago), one of my statistics professors scolded me (as best I can remember), “You baseball guys have it all wrong.  You treat a set of baseball statistics as if they are a random sample and run all of your significance tests.  Baseball statistics are not a random sample of anything - they are a full and complete set of data.  To analyze them using statistical techniques is a misuse of the science.” Being as stubborn as I was back then, I disagreed with him.  But as the years have passed, I never forgot what he told me and now in my middle-aged years have come to believe that what he said has some merit.

Take clutch hitting.  Suppose Larry bats 17 for 50 (.340) in clutch situations and 100 for 400 (.250) in non-clutch situations.  A standard statistical test will tell you that the difference between these two proportions is not significant at either a 95% or 90% confidence level, therefore you can not conclude with a high degree of certainty that Larry is a better hitter in the clutch.
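
For reference, the Larry example can be run through one common "standard statistical test", a pooled two-proportion z-test. That choice is an assumption, since the comment does not name the test; the hit and at-bat counts are the ones given above.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

clutch_hits, clutch_ab = 17, 50    # clutch situations, from the comment
other_hits, other_ab = 100, 400    # non-clutch situations, from the comment

p_clutch = clutch_hits / clutch_ab
p_other = other_hits / other_ab
pooled = (clutch_hits + other_hits) / (clutch_ab + other_ab)

# Standard error of the difference under the null of one shared average.
std_err = sqrt(pooled * (1 - pooled) * (1 / clutch_ab + 1 / other_ab))
z = (p_clutch - p_other) / std_err
p_two_sided = 2 * (1 - normal_cdf(abs(z)))

print(f"difference = {p_clutch - p_other:.3f}, z = {z:.2f}, "
      f"two-sided p = {p_two_sided:.3f}")
```

The two-sided p value comes out above 0.10, which matches the statement that the 90-point gap is not significant at either the 95% or the 90% level.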

But the fact is that Larry WAS a better hitter in clutch situations - a MUCH better hitter.  Those 17 hits he got probably won his team a few games that they otherwise would not have won.  That is very significant.  To apply a statistical test to this data and conclude that we can not prove that Larry is a good clutch hitter is to say that there are other at-bats that Larry had that we don’t know about and if you measured those, the difference in his average might not be as great.  This simply isn’t true.

Another analogy would be the U.S. Senate voting on a bill.  If 51 Senators vote in favor of the bill and 49 vote against, even though a statistical test would tell you that there is no difference between the proportion of Senators who are in favor of the bill and the proportion who oppose it, this is regardless a very significant outcome because there are no other Senators to ask.  You can say with 100% certainty that more Senators are in favor of the bill than not.  Just like you can say with 100% certainty that Larry was a better clutch hitter than not.

When we analyze random samples of data, we do so to predict that which we can not measure, or is impractical to measure, or is too expensive to measure.  If we have a machine that fills boxes with breakfast cereal, we don’t tear open every box at the end of the assembly line to be sure they are filled properly.  We only test a sample and then assume that the others have similar characteristics.  But in baseball we do measure everything.  EVERYTHING.  There is no data that is unknown.  When we apply statistical techniques to this type of data, we are only analyzing events that will never occur.  Perhaps a fool’s errand.


#28          (see all posts) 2010/03/20 (Sat) @ 09:25

The short answer is that you really *are* only looking at a sample of the player’s AB ... he has the potential to provide thousands of them, but MLB will only allow him to sample 600 or so a year.

If looking at “everything” was a deal-breaker, nobody could ever experiment.  Because, when they experiment, they look at all the data they created with the experiment (like when you give 100 patients an experimental drug and see what happens—you look at all 100 patients). 

Or the classic ball/urn case: if you draw 100 balls with replacement, you’re analyzing EVERYTHING: all 100 draws.  But it’s still a random sample of the draws that *could have* occurred.


#29          (see all posts) 2010/03/20 (Sat) @ 09:43

Phil:  In the classic ball/urn case, we use the 100 draws to make inferences about the *other* balls that we know are in the urn.  The other balls are there.  They are real.  In the case of baseball, there are no other games - no other at-bats - no other clutch situations - no other balls in the urn.  Perhaps there could have been, but there weren’t.  A 51-49 Senate vote could flip if we had 52 states.  But we don’t have 52 states - we only have 50 - so we accept the outcome of the vote as definitive.


#30          (see all posts) 2010/03/20 (Sat) @ 09:55

Okay, point taken about the urn.  You are making inferences about the balls that *could have* been drawn, but weren’t.

For a player’s AB, are you saying that no more AB *could have been* taken?  That is, if Albert Pujols had been called up for one more AB, he would have said, “nope, can’t do it, used ‘em all up”?

Same idea.  The potential AB are there, even if you can’t see them.  You’re sampling 500 AB from a pool of an infinite number of potentials.

The situation your professor might have been referring to is when you DO sample everything.  Urn contains 50 balls, you sample 50 balls without replacement.  No statistical inference on the balls in the urn—because you really HAVE sampled everything.  However, if you want to assume that the 50 balls in the urn came randomly from some distribution, then you CAN do inference on what that distribution looks like.


#31    J. Cross      (see all posts) 2010/03/20 (Sat) @ 12:00

no other balls in the urn

Scott, I’d take issue with this.  In baseball, at the end of a career, say, there are other balls in the urn, there just aren’t any more balls drawn from the urn.  A plate appearance is modeled as drawing a ball from the urn and replacing it.  The plate appearance isn’t one of the balls.  The ratio of balls in the urn (representing the true talent of the player) exists whether those balls are being drawn or not.
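
The urn picture above can be sketched as a small simulation. The .330 true talent and the 2,000 replays are assumptions chosen for illustration; the 600 plate appearances per season echoes the rough figure mentioned earlier in the thread.

```python
import random

random.seed(1)

TRUE_OBP = 0.330     # ASSUMPTION: share of "on base" balls in the urn
PA_PER_SEASON = 600  # rough season length mentioned earlier in the thread
N_SEASONS = 2000     # ASSUMPTION: hypothetical replays of the same season

def simulate_season(true_obp, n_pa):
    """Draw n_pa balls with replacement and return the observed OBP."""
    times_on_base = sum(random.random() < true_obp for _ in range(n_pa))
    return times_on_base / n_pa

seasons = [simulate_season(TRUE_OBP, PA_PER_SEASON) for _ in range(N_SEASONS)]
mean_obp = sum(seasons) / N_SEASONS

print(f"lowest {min(seasons):.3f}, average {mean_obp:.3f}, "
      f"highest {max(seasons):.3f}")
```

The urn never changes, but the observed OBP differs from one replayed season to the next. Any single season, however completely it is recorded, is one draw from that spread.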


#32    Greg Rybarczyk      (see all posts) 2010/03/20 (Sat) @ 12:31

Scott,

If the 50 clutch AB’s (in which the hitter got 17 hits) ARE the complete data set, then there can’t be any other elements in that set, and therefore any clutch AB that arises later must be from another set, right?  In which case the prior data has no relevance, right?  In which case, what value does the designation of this hitter as a clutch hitter have?  None.

Around here, people usually talk about the whole clutch hitter/pitcher debate in terms of predictability, i.e. can you look at any prior performance and use it to inform your expectations about what will come in the future.  Simply looking back in time and saying “this was a clutch hit”, or “that was a clutch strikeout”, or “he had a great year in the clutch” can of course be true (David Ortiz has had lots of clutch hits), but may or may not have any bearing on what Ortiz does in his next clutch-hitting opportunity.

You often hear it said: “There are clutch hits, but no clutch hitters.” The debate on that is not over by any means, but it’s a different debate than what you’re describing here, I think.


#33    Fargo      (see all posts) 2010/03/21 (Sun) @ 08:58

Interesting article in Science News: “Odds Are It’s Wrong: Science Fails to Face Its Statistical Shortcomings”: http://www.sciencenews.org/view/feature/id/57091/title/Odds_are,_its_wrong


#34    Tangotiger      (see all posts) 2010/03/21 (Sun) @ 10:50

Fargo, perhaps it’s not clear, but clicking on “Chuck” is the lead to this thread, which is the same as your link.


#35    Fargo      (see all posts) 2010/03/21 (Sun) @ 12:03

Thanks. My bad.


#36          (see all posts) 2010/03/24 (Wed) @ 13:07

Sorry to be jumping in on this so late, but I was away all last week and was not keeping up with the various blogs.  I have now gone back and read all of this thread as well as Phil’s blog and the comments there. 

But I take issue with a quote from Phil’s blog:

The above is my paraphrase of something Tango posted yesterday. He cites the above, and then makes an important caveat: the fact that you reject the null hypothesis—that you’re rejecting the idea that there’s no difference between clutch and non-clutch—does NOT mean that you should conclude that the actual difference is 70 points. All you can conclude, Tango argues, is that the difference isn’t zero. For all you know, it might be 30 points, or 10 points, or even 1 point. You are NOT entitled to assume that it’s 70 points, just because that’s what the actual sample showed.

I don’t want to get caught up in semantics, but I am pretty sure I don’t agree with the statement that we are not entitled to assume that the difference is 70 points (using no other information than the two measurements).  In fact, our best estimate of the difference is what was actually measured, namely 70 points.  As with all measurements in science and other areas, that number by itself has no real meaning without an error bar.  So, it is better to say that our best estimate of the difference is 70 +/- xxx.  It is the size of the xxx relative to the 70 that determines the statistical significance (the p value, if you like).  In my field, it is common to let xxx be the 1-sigma level (roughly the 68% confidence level).  The 95% confidence level is approximately 2-sigma.  Evidently, in this case 2-sigma is less than 70, since the null hypothesis is rejected at the 95% confidence limit.  But, just because we choose to quote the result that way (i.e., as rejection of the null hypothesis at the 95% confidence limit) does not mean that we have no right to say that we have a measurement.  We do have a measurement, albeit not a very precise one.  And by the way, the actual number is as likely to be greater than 70 as less than 70 (again, using nothing else other than the actual measurements).
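
The link between the error bar and the p value can be sketched as follows. The 1-sigma values are hypothetical (the comment leaves them as "xxx"), and the measurement error is assumed to be normally distributed.

```python
from math import erf, sqrt

def two_sided_p(estimate, sigma):
    """Two-sided p value for a normal measurement tested against zero."""
    z = abs(estimate) / sigma
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

measured_diff = 70  # points, from the quoted passage

# ASSUMPTION: hypothetical 1-sigma error bars standing in for "xxx".
for sigma in (70, 35, 30, 20):
    z = measured_diff / sigma
    p = two_sided_p(measured_diff, sigma)
    print(f"70 +/- {sigma}: z = {z:.2f}, p = {p:.3f}")
```

A 1-sigma error bar of 70 leaves the result well short of significance, while anything at or below about 35 (so that 2-sigma is no more than 70) pushes the p value under .05.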

But let me say that I agree with the overall thrust of Phil’s discussion.
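
To make the "70 +/- xxx" idea concrete, here is a minimal sketch that quotes a difference between two batting averages with its 1-sigma error bar, assuming each at bat is an independent binomial trial. The hit and at-bat counts are hypothetical, chosen only so that the gap comes out to 70 points; they are not taken from the thread.

```python
import math

def difference_with_error(hits_a, ab_a, hits_b, ab_b):
    """Return (difference, one-sigma error, z-score) for two batting averages."""
    avg_a = hits_a / ab_a
    avg_b = hits_b / ab_b
    diff = avg_a - avg_b
    # Binomial variance of each average, added because the samples are independent.
    sigma = math.sqrt(avg_a * (1 - avg_a) / ab_a + avg_b * (1 - avg_b) / ab_b)
    return diff, sigma, diff / sigma

# Hypothetical counts chosen only so the gap is 70 points (.340 vs .270).
diff, sigma, z = difference_with_error(170, 500, 810, 3000)
print(f"difference = {diff:.3f} +/- {sigma:.3f} (1-sigma), z = {z:.2f}")
```

With these made-up counts, 2-sigma is smaller than the 70-point gap, which is the situation described above where the null hypothesis is rejected at the 95% level.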


#37          (see all posts) 2010/03/24 (Wed) @ 13:24

Alan/36: If you have no other information than this one test, and you are willing to assume that any value is legitimately as likely to be true as any other value, then, yes, your best estimate is 70 points. 

Tango’s point was that it’s NOT true that you have no other information.  And the other information we have makes 70 points not very likely.
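
One way to picture how "other information" pulls the estimate away from 70 points is a normal-prior, normal-error shrinkage calculation. This is only a sketch: the 23-point error bar and the 10-point spread of true clutch differences are assumed values for illustration, not figures from the thread or from The Book.

```python
def shrunk_estimate(observed, sigma_obs, prior_mean, sigma_prior):
    """Posterior mean for a normal prior and a normal measurement error."""
    weight = sigma_prior**2 / (sigma_prior**2 + sigma_obs**2)
    return prior_mean + weight * (observed - prior_mean)

# Assumed numbers: 70 points observed with a 23-point error bar, and true
# clutch differences across players spread about 10 points around zero.
estimate = shrunk_estimate(0.070, 0.023, 0.0, 0.010)
print(f"estimate of true difference: {estimate:.3f}")

# With an essentially flat prior, the estimate goes back to the observed 70 points.
print(f"flat prior: {shrunk_estimate(0.070, 0.023, 0.0, 1e6):.3f}")
```

The second call matches the case where any value is taken to be as likely as any other: the best estimate is then simply what was measured.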


#38          (see all posts) 2010/03/24 (Wed) @ 13:36

Phil/37:  I agree with your point about other information.  However, I am not sure the qualifier about using additional information is in the part I quoted.  But never mind.  If you are in agreement with my point, I am happy.


#39          (see all posts) 2010/03/24 (Wed) @ 13:42

OK, I see what you’re getting at ... there are a couple of issues mixed together here.  Let me think about it a bit.


#40    Tangotiger      (see all posts) 2010/03/30 (Tue) @ 10:45

Scott/27 posted his post on his blog, and added this:

I think these guys are wrong.  I also think that their opposition to my point of view is in part their trying to justify what they’ve spent much of their lives doing.

I will say that Scott is wrong.  A player can have a total of 10,000 PA in his career, but that is still a sample of his true talent level.  The 10,000 PA may represent his “population” of ... something, but it is still a sample of his talent level.  In the same way, all 10,000 tests, quizzes and assignments I’ve ever taken from K through college still represent only a sample of my talent level.

I’m not justifying anything.


#41    Greg Rybarczyk      (see all posts) 2010/03/30 (Tue) @ 11:29

Wow, I’m surprised that of all the stuff in this thread, he chose to disagree with me!

Let me now do a little “reductio ad absurdum” on this argument:

Every major league hitter has a first at bat, and before too long most will have their first clutch at bat.  If we are not going to consider these at bats as samplings from a larger population, then here’s how our opinion of a hitter would progress:

1st AB:  Hit 1.000 “Wow, what a fantastic hitter!  He never makes an out!  Next stop, Cooperstown!”

2nd AB:  Out .500 “This kid really rakes!  He only makes an out every other at bat!”

3rd AB:  Out .333 “We could be looking at another Rod Carew here, 1 hit out of every three, keep your eye on this talent!”

4th AB:  Out .250 “He’s definitely cooled off, but still has potential to be a productive regular...”

5th AB:  Out .200 “Those early gushings by the media seem to have been misguided, what could they have seen in him that justified such praise?”

6th AB:  Hit .333 “The early prognostications were right!  This kid is bound for glory!”

7th AB:  Out .286 “While fame seems a bit too much to expect, a long career and an occasional All-Star mention could be in the cards...”

8th AB:  Out .250 “This team may eventually want to upgrade at the kid’s position...”

etc.

If you’re going to say that 17 hits in 50 AB’s is not a sampling of his hitting, but rather IS his hitting, period, then you must believe that all the way back down that player’s career path, to the day he first steps to the plate in the big leagues.  Which means that at one time, EVERY hitter in MLB *WAS* either a .000 hitter, a .500 hitter or a 1.000 hitter after his 2nd at bat, and after his first at bat, EVERY hitter in MLB *WAS* either a .000 hitter, or a 1.000 hitter.

News flash:  there are no .000 hitters in MLB, nor are there any 1.000 hitters in MLB…

Reductio ad absurdum.
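
The running averages in the sequence above can be reproduced with a short sketch. The string encoding of hits and outs ("H" and "O") is just a convenience chosen here.

```python
def running_averages(outcomes):
    """Yield the batting average after each at bat; 'H' is a hit, 'O' an out."""
    hits = 0
    for at_bats, outcome in enumerate(outcomes, start=1):
        hits += outcome == "H"
        yield hits / at_bats

# The eight at bats listed above: a hit, four outs, a hit, two outs.
sequence = "HOOOOHOO"
averages = [round(avg, 3) for avg in running_averages(sequence)]
print(averages)
```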


#42    Greg Rybarczyk      (see all posts) 2010/03/30 (Tue) @ 11:43

Incidentally, I’ve spent much more of my life analyzing test results and field reliability data than I have analyzing baseball numbers.

If at any time in my career I had told someone that because 17 out of 50 units I was looking at had failed, that meant that the failure rate for that product line was 34% (not, mind you, that the failure rate is likely some value near 34%, but that the failure rate IS. 34.  PERCENT.  PERIOD.), I would have been laughed at and fired (maybe not in that order). 

Anyone with the slightest bit of experience would know that the 50 units were a sample, and that I’d still have to reckon with the implications of the random sampling of the population that “picked” those 50.

Looking backwards, of course, say what a player did, and describe it as such (Player X hit .340 last year).  No one’s arguing that he didn’t.  We’re arguing that getting 17 hits in 50 AB’s doesn’t mean a player is TRULY a .340 hitter.  We’re trying to go beyond mere description, to say something about the likelihood of what is to come.  Why is that so hard to grasp?
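
As a rough illustration of "some value near 34%", the sketch below puts a 95% interval around 17 out of 50 using the simple normal approximation to the binomial. That approximation is a choice made here for brevity; other interval methods would give somewhat different endpoints.

```python
import math

def proportion_interval(successes, trials, z=1.96):
    """Normal-approximation confidence interval for a binomial proportion."""
    p = successes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    return p - half_width, p + half_width

low, high = proportion_interval(17, 50)
print(f"17 of 50: observed {17 / 50:.3f}, 95% interval roughly {low:.3f} to {high:.3f}")
```

The interval runs from roughly .21 to .47, which is why 17 for 50 says little about whether the underlying rate is really .340.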


#43    Tangotiger      (see all posts) 2010/03/30 (Tue) @ 11:46

And Scott is wrong here:

But the fact is that Larry WAS a better hitter in clutch situations - a MUCH better hitter.  Those 17 hits he got probably won his team a few games that they otherwise would not have won.  That is very significant.  To apply a statistical test to this data and conclude that we can not prove that Larry is a good clutch hitter is to say that there are other at-bats that Larry had that we don’t know about and if you measured those, the difference in his average might not be as great.  This simply isn’t true.

The statistical test that he talks about in the previous example is correct: the population is all his PA, and the sample is those that occurred during clutch situations.  And therefore, it’s correct to ask if those sample PA could have been randomly chosen from the population of PA.  That’s all good there.

His paragraph that I quoted here has nothing to do with his true talent level of other PA he could have had but didn’t.  His presentation is conflating two things.


#44          (see all posts) 2010/03/30 (Tue) @ 11:50

Let me jump into the fray by posing a question.  How many PA does it take to establish a batting average?  Recalling the point I raised in #36, the question by itself has no meaning unless you also specify the precision with which you want to determine the BA.  Once you specify the precision, and once you assume that every PA is just like every other PA, then the answer is just an application of binomial statistics.  IIRC, all of that was covered quite well in The Book.
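
Under the assumption stated above (every PA is just like every other PA), the binomial standard error is sqrt(p(1-p)/n), so the number of PA for a chosen precision follows directly. The .270 average and 10-point precision below are assumed inputs for illustration only.

```python
import math

def pa_needed(true_avg, sigma_target):
    """Trials needed for the binomial standard error to shrink to sigma_target."""
    return math.ceil(true_avg * (1 - true_avg) / sigma_target**2)

# Assumed inputs: a .270 hitter, measured to within 10 points (1-sigma).
print(pa_needed(0.270, 0.010))

# Halving the error bar takes about four times as many trials.
print(pa_needed(0.270, 0.005))
```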


#45          (see all posts) 2010/03/30 (Tue) @ 15:09

I’m back.

Let me start with this – I did re-post my original message here to my blog and did make the following follow up comment to it:

“I think these guys are wrong.  I also think that their opposition to my point of view is in part their trying to justify what they’ve spent much of their lives doing.”

If I was going to call somebody “wrong”, I should have done it here rather than on my blog.  I apologize for that.  Using the term “wrong” actually too strongly states how I feel about this, but that’s neither here nor there.  I said it and it was wrong to say.

The statement I made about justification was simply pointing out obvious human nature.  It is exactly what I did when this was first said to me.  I was convinced that I was right and the professor was wrong because I didn’t want to believe that what I had been doing all along was invalid.  Again, I didn’t mean to ruffle any feathers by it.

Mea culpa.

I have today posed this issue to three other people who have advanced college degrees and who use mathematical statistics extensively in their jobs, but are at best only casual fans of baseball.  I asked, “If a player bats 500 times in a season, can those 500 at bats be considered a random sample and used in statistical analyses the way any other random sample could be?” All three people said “no”, and stated reasoning very similar to what I presented here.  The college professor who first argued this with me some thirty years ago was the Chair of the Math Department at a major U.S. university.  He’s obviously no dummy.  I just think that everyone here needs to be open to the fact that there is an alternate point of view on this issue.

I’m a little bit discouraged that no one here has argued my side.  Nonetheless, I think this discussion is healthy.  This issue has been gnawing at me for some time and I’ve always wanted to present it in a forum where it could be seen and debated by people whose talents and opinions I respect, and who could shed more light on it than I could myself.  In that regard, I feel I have succeeded in what I had set out to do.

For what it’s worth, after reading all of your feedback, my own opinion is at least back in the gray.  Can 100 flips of a coin be used to determine whether a coin is fair, or what the next 100 flips of the same coin will be?  I think the answer is ‘yes’?  Isn’t this the same thing as the baseball argument?  I think maybe it is.  In hindsight, I wish I had asked my professor these same questions.
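
The coin question can be sketched as an exact binomial test: treat the 100 flips as a sample from the coin's underlying probability and ask how often a fair coin would land at least that far from an even split. The 60-heads figure is a hypothetical input.

```python
from math import comb

def fair_coin_p_value(heads, flips):
    """Exact two-sided binomial p-value against a fair coin (p = 0.5)."""
    distance = abs(heads - flips / 2)
    # Count every outcome at least as far from an even split as the one observed.
    extreme = sum(comb(flips, k) for k in range(flips + 1)
                  if abs(k - flips / 2) >= distance)
    return extreme / 2**flips

print(f"60 heads in 100 flips: p = {fair_coin_p_value(60, 100):.3f}")
```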


#46    Tangotiger      (see all posts) 2010/03/30 (Tue) @ 15:33

I think you are asking the professor the wrong thing. If you asked me that, I would agree with the professors. 

You are not asking the question WE are asking.  Ask them this:

Are the 500 times a batter comes to bat in a year a sample of his true batting talent?

Are the 100 tests I have taken in college this year a sample of my true test-taking talent?


#47    Guy      (see all posts) 2010/03/30 (Tue) @ 15:42

Scott, I’m joining this late.  But I think you’ve got a semantic or communication problem here.  It’s pretty much inconceivable that three people with a deep understanding of statistics would object to treating a player’s statistics over a season as a sample of that player’s talent at that time.* However, you asked if this data should be used “the way any other random sample could be.” Now, in most cases such samples are used to measure the property of a larger population.  For example, I’m a survey researcher, and when I interview 1000 American voters, I’m really trying to learn what ALL voters think.  Most medical research is like this, and thousands of other examples.  And no, we are not using a player’s data to stand in for any other players.  Nor is this a “sample” of some larger pool of PAs by this player in that season: as you say, once the season is over, you have the full “universe” of his performance.

BUT, it is a sample for the purpose of estimating the player’s talent level and thus predicting his FUTURE performance.  And that’s what the folks here mean by treating it as a “sample.” (Now, it may not be a perfectly random sample in some respects, but it’s close enough to do good predictions.)

As for the point that a player “really did” X, or the Senate “really” passed a law—well, sure.  Lots of baseball analysis looks at players’ actual contributions in a given season.  That’s what metrics like WAR or WPA do. 

So I don’t think there’s a serious disagreement here, just people using the word “sample” to mean different things.....


#48    Guy      (see all posts) 2010/03/30 (Tue) @ 15:44

* Unless they are economists.  :>)


#49    Greg Rybarczyk      (see all posts) 2010/03/30 (Tue) @ 15:47

Scott,

I agree with Tango, we’re talking about different things here. 

If the question is, “Tell me Player X’s batting average for 2008”, then his 2008 stats are a complete record, and you can answer that question using only those data.  No one here is suggesting that to answer that question, you ought to add any hypothetical data, or anything like that.  The thing is, around here no one ever asks that question.

What gets asked around here is, “given that Player X had these stats in 2008, what do those data tell us about his true talent?” Or, “given that player X had these stats in 2008, and these other stats in prior years, what does that tell us about how he might do in 2009?”

In each of these two cases, and in countless other situations, you do need to consider sample size, and the underlying performance distribution, in order to come up with a complete, defensible answer.

I don’t think there is necessarily any disagreement here, perhaps more of a misunderstanding of what we’re each talking about…


#50    Tangotiger      (see all posts) 2010/03/30 (Tue) @ 15:57

Right, this is a pure Bayes issue.  If you ask your professors and they say this is Bayes, then you might have asked the right question.  If they don’t see it as Bayes, then it hasn’t been explained to them well enough.

This is no different than kids taking tests in college.  You are trying to figure out the chance that a certain kid really is an A student based on the 100 tests he took.

The first thing the professor is going to ask you is: what is the mean and spread of the test scores of ALL students.  Then he’ll ask what is the mean (and spread possibly) of the test scores of THIS student.  And then he’ll give you the answer of the chance that THIS student is in fact an A student.

***

What YOU seem to be asking is: given that this student buckled down for 20 of the tests (and basically didn’t study for the other 80), what are the chances that his 90% score in the 20 tests he did study for is an indication that the studying paid off, compared to the 80% he got in the other 80 tests.

And for that, the professor is going to want to know the SD of his tests when he did and didn’t study (and possibly that of all students).
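
The first question in this comment (what is the chance that THIS student really is an A student, given the mean and spread of ALL students) can be sketched with a normal prior and normal test-to-test noise. Every number below is hypothetical, and the normal model is itself an assumption made for the sketch.

```python
import math

def prob_true_score_above(threshold, student_mean, n_tests, sd_test, pop_mean, pop_sd):
    """P(true ability > threshold) under a normal prior and normal test noise."""
    var_obs = sd_test**2 / n_tests            # noise in the student's observed mean
    weight = pop_sd**2 / (pop_sd**2 + var_obs)
    post_mean = pop_mean + weight * (student_mean - pop_mean)
    post_sd = math.sqrt(weight * var_obs)     # posterior spread of true ability
    z = (threshold - post_mean) / post_sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the normal distribution

# Hypothetical inputs: a 92 average over 100 tests, 10 points of test-to-test
# noise, all students averaging 75 with a spread of 8, and an A cutoff of 90.
p_a_student = prob_true_score_above(90, 92, 100, 10, 75, 8)
print(f"chance of being a true A student: {p_a_student:.2f}")
```

With only a handful of tests instead of 100, the same student's estimate would be pulled much further toward the average of all students.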


#51    Guy      (see all posts) 2010/03/30 (Tue) @ 16:33

Greg/Tango:  I think the confusion is more simple, and stems from the rather unusual way in which saberists employ “samples.” Usually, a sample is a small, random portion of a much larger population we care about:  products on an assembly line, blood in your body, people who might be inoculated, voters in an election.  Usually, the larger population is real, or at least may be in the future (e.g. an inoculation program).  But for saberists, our “sample” is a subset of a larger population that DOES NOT AND NEVER WILL EXIST, which is an infinite number of PAs by this player in the year 2009.  It’s sort of a weird thing to care about, except that it helps us to predict what the player will do later, and we care about that a lot.  Although the principle is the same, our application of it is so different that I can see why people have trouble thinking of it as a “random sample.”


#52    Greg Rybarczyk      (see all posts) 2010/03/30 (Tue) @ 17:09

No different than a coin flip, really.  It’s impossible to flip a coin an infinite number of times, so any finite number of flips is just a sampling of the coin’s flip outcomes.

I think people have an easier time visualizing and accepting the 50/50 probability of the coin as what governs the outcomes than they do accepting that a baseball player’s “true talent” is what governs his outcomes.

Recognizing of course that the baseball situation is much more complicated, since there are these other guys called “opponents”, and they are trying to thwart you, and they are all different, and the weather’s different, and the parks, and the game situations, etc. etc. etc.


#53    Guy      (see all posts) 2010/03/30 (Tue) @ 17:55

But who tests coins, outside of statistics textbooks?  Why would you ever do that?  My point is that there are many, many uses of random samples in the social sciences, the hard sciences, and in industry.  The vast majority of times, there is a real, observable population from which the sample is drawn.  In this case, there isn’t. 

A related difference is that we’re trying to measure an attribute in this particular player. That’s not usually the purpose of someone in a sample:  I don’t really care about the opinions of any one of my survey respondents, I care only about the 100,000 other people he represents.  I don’t care that one particular product on my assembly line was flawed, I care about all the others. 

I suppose test taking IS similar to sabermetrics, in that we really do want to measure the ability of that student.  But it’s actually a very poor example/analogy for trying to explain this idea, because most people think of tests as a true measure of skill or knowledge at a point in time.  I think even most professors would view a well-constructed final exam as an “actual” measure of a student’s command of the semester’s material.  So it’s likely not an analogy that will be clarifying for many.


#54    Greg Rybarczyk      (see all posts) 2010/03/30 (Tue) @ 18:23

I prefer the illustrative examples you get from manufacturing.  When you’re getting close to releasing a new product, you have your production line make a number of units for a final test.  These units which you test are obviously just a sample of what the production line CAN make, once you give the go-ahead, but if you want to look at it literally, you are testing every one of the units produced at that stage of your product development.  The handful of units you test at that point are meant to represent the rest of the population that you will eventually make, if the test results are acceptable…

In this example, it’s no different whether you are measuring an attribute (will it survive initial turn-on or not), or a continuous variable (its weight, speed, power consumption, etc).

I guess there probably isn’t any one ideal example or analogy to help everyone get it.

As an aside, doing market research must be quite tricky, in that not only do you have sampling error for who responds or picks up the phone to answer, but their mood may influence their answer (“no I don’t like romantic comedies!”), or they may be busy, or they may answer with what they wish was true, rather than what is (e.g. “how often do you floss your teeth?”)


#55    J. Cross      (see all posts) 2010/03/30 (Tue) @ 18:52

But who tests coins, outside of statistics textbooks?  Why would you ever do that?

An engineer named Joseph Jagger tested roulette tables (by playing many games of roulette at low stakes, of course) to find biases.  He recognized that his results were samples of the “true” probabilities of the roulette tables but he used his data to pick which numbers to play (just as I might use my calculations of true talent to pick players for my roto team) in the future.  He made $5M in today’s dollars.

These “true” distributions exist whether or not the roulette tables are ever spun, of course, even though we only get to see the results when games are actually played.  That should in no way stop us from using statistics to try to ascertain the true distributions of a table.


#56    Tangotiger      (see all posts) 2010/03/30 (Tue) @ 19:27

The test-taking would be a fair analogy if each test (game) consisted of only 2 questions, and each student got a different test that day.

