THE BOOK--Playing The Percentages In Baseball


Monday, May 18, 2009

Another nail in the Hot Hand coffin

By Tangotiger, 03:54 PM

The Baseball Analysts has now become my 1A to the 1 that is Hardball Times.  This time, it’s Sky that looks at how hitters perform after they’ve been on a 30-game hitting streak:

The “during streak” line is actually “performance from game 31 until his last hitless game” (as best as I can figure). 


#1    2009/05/19 (Tue) @ 09:39

Since my response has nothing to do with the article, I thought I’d move it here. 

MGL, null hypothesis significance testing isn’t as ridiculous as you make it out to be.  Of course you can’t be a slave to a p-value, but don’t mock the baby with the bathwater.

A couple points:
(1) NHST didn’t develop in sociology or reflexology.  It was developed in the late 19th, early 20th century in genetics by people trying to extend the then novel ideas in On the Origin of Species.

(2) The whole point of statistics is to take a large set of data and extract a lower-dimensional, meaningful summary. The point is that the myriad of observed data contain information, but we can’t see the patterns in all the noise.  A p-value is the ultimate summary, taking all the data you observed and abstracting out a single bit of information. By using *any* statistics, you are oversimplifying the problem because oversimplifications are useful.

(3) There’s a reason why you’d never get a paper published claiming an interesting finding that has a 25% chance of being wrong (and a much larger chance of having a minimal effect size). If you *really* care about getting the answer right, 75% confidence is utterly meaningless.  You can’t confidently act on it one way or the other. These conclusions are really only useful to people (like sabermetricians) who have no practical use for the information they seek. No baseball team is going to change its behavior, with millions of $$ on the line, because of a p value of 0.25.

I think this is part of the reason why much of the baseball establishment sees sabermetrics as relatively useless; studies like this are fun for the researcher, and fun for readers, but they are little more than entertainment.  If we really care about increasing our understanding and guiding behavior, studies with p values of .25 provide little to no information.


#2    Tangotiger    2009/05/19 (Tue) @ 10:51

cdm was responding to MGL’s post in the linked thread, which I will show here:

Very nice analysis. I’d like to see actual versus “expected” wOBA or OPS to see how much if any players are hurting their teams by apparently trying to continue their streak. I’d also like to see how these numbers change as the streak goes on - i.e., at what point do players and pitchers appear to change their approach to continue the streak or not, or do those changes in approach gradually increase, etc.

It is of course “sad” (for the opposing team) that these players are IBB’d so often when in fact they are likely worse hitters as they appear to alter their approach for the worse.

Sky I like how you point out differences and then merely state that they do not necessarily rise to the level of (arbitrary) statistical significance, rather than reject any differences that are not “statistically significant.” That is a bugaboo (among many) of mine - rejecting numbers that do not meet some arbitrary level of statistical significance. Stating the difference along with the standard error, as you did, should be all that a researcher needs to and should do. I don’t know where this convention of “accepting” or “rejecting” an hypothesis based on some arbitrary level of statistical significance came from, but it is a bad idea. The fact that some use 2 and some use 2.5 is a clue that something is wrong with the concept. I guess the concept comes from social science experiments where the researcher is “supposed to” do one of two things: Either accept or reject an hypothesis. As if there is no other alternative. In reality of course, the correct alternative is to simply assign a level of certainty to one hypothesis or another, or to recognize that the hypothesis or conclusion is often not an “either/or” but an effect along a continuum (from no effect to a large effect)…

I don’t understand cdm’s point #3 about 25%/75%.  MGL wasn’t arguing that, so it sounds like a made-up argument to me.

And I don’t understand how cdm says this:

“These conclusions are really only useful to people (like sabermetricians) who have no practical use for the information they seek. No baseball team is going to change its behavior, with millions of $$ on the line, because of a p value of 0.25.”

Exactly which saberist is supporting a p value of .25? 

“If we really care about increasing our understanding and guiding behavior, studies with p values of .25 provide little to no information.”

Agreed.  Who says otherwise?


#3    Tangotiger    2009/05/19 (Tue) @ 10:54

MGL’s key point is the following:

“Stating the difference along with the standard error, as you did, should be all that a researcher needs to and should do.”

And this is correct.  There is no need to turn this into English words (unless the reader is expecting English words to accompany the numbers).  The numbers stand on their own.


#4    2009/05/19 (Tue) @ 11:47

“Exactly which saberist is supporting a p value of .25?”

Sky was. The only place (that I saw) that Sky mentioned an effect size in the way MGL was applauding was here:

“Perhaps the most surprising thing is the fact that the batting average during games where a player has a streak going is 20 points lower than his usual batting average. This statistic has a high standard error (.026), so due to small sample size, it’s hard to prove anything with statistical significance, but it’s still interesting”

.020/.026 corresponds roughly to a Z score of 1, with a p-value of about .25.  I didn’t do any math; I’m just ballparking.
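The ballpark above can be checked with a few lines of Python. This is only a sketch: it assumes the 20-point difference and the .026 standard error quoted from Sky's article, and it assumes a normal approximation is good enough. The function names are illustrative and do not come from the article.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def z_and_p(diff, std_err):
    """Return (z, one-sided p, two-sided p) under a normal approximation."""
    z = diff / std_err
    one_sided = 1.0 - normal_cdf(abs(z))
    two_sided = 2.0 * one_sided
    return z, one_sided, two_sided

# Figures quoted from the article: a 20-point (.020) drop, standard error .026.
z, p_one, p_two = z_and_p(0.020, 0.026)
print(f"z = {z:.2f}, one-sided p = {p_one:.2f}, two-sided p = {p_two:.2f}")
```

Under these assumptions z comes out a little below 1 (about 0.77), so the one-sided p-value lands in the neighborhood of the .25 ballparked above, while a two-sided test gives roughly double that.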

MGL seemed to be saying that NHST is ridiculous, that there is no benefit to drawing a line in the sand, and that the t statistic (difference divided by standard error) is all a reader really needs to know.

Most people who use NHST would say that when you have a confidence of 75%, you don’t report the difference / standard error; you say nothing.  You can neither support nor reject your null. You have little to no information to add to the public understanding.

The danger in all of this is that you let your preconceptions, not the data, lead you in interpreting the results. It should be pointed out that Sky did reject a “marginal” p=0.12 in the goodness of fit test, because he had a good reason to think it’s wrong: as players’ streaks came to an end, that tended to leave better hitters, making the probability of continuing a streak higher as the streak gets longer.  That makes sense, but the data only support that conclusion if you reject a marginal p. 

This effect had a p value of .12, and yet Sky explicitly rejects it.  He does not explicitly reject an effect with a p value of .25, though.  He presents the stats with a caveat.  How do you explicitly reject p=.12, but entertain an effect with p=.25? Bias.

(btw, I don’t mean to attack Sky; I am not making any statements about the work.  I’m speaking hypothetically in defense of NHST, which has weaknesses, but doesn’t deserve quite as much disrespect as MGL implied.)


#5    Tangotiger    2009/05/19 (Tue) @ 11:59

I think that’s a very tough read.  I see a 20 point difference and I think “interesting” in an English sense, not in the statistical-significance sense.

And I see no reason to paint all saberists based on this one reading either.  Indeed, you can’t paint any group ever based on the statements made by one person in that group.

Getting back to the article: it was a good idea, he put forth the effort and reported the results, and it was interesting to read.  I don’t see it as being anything more, and I don’t see it, at all, as something you would hold up as an example of how the baseball establishment looks at saberists.

I WOULD use UZR as that example.  Some in the baseball world agree highly with it, and others disagree highly with it. 

I just don’t think cdm’s conclusions regarding saberists are warranted based on the evidence.


#6    MGL    2009/05/20 (Wed) @ 01:54

I don’t really know how to respond to cdm’s post.  I am not sure there is any disagreement between us.  Again, I say report the results along with the p value and the reader can draw whatever conclusion he wants and/or the “user” (as in a team that may or may not make a decision based on the results of the research) can respond in whatever way he deems appropriate.

I realize that there are probably certain conventions/rules/etc. in terms of publishing and what have you, but I am not commenting on those or addressing them at all.

As far as Sky’s research, there is no need for him to “accept” or “reject” an hypothesis.  Whether he did or did not in the article (I don’t remember) does not and should not matter.  He presented the data, including the means for the reader/user to determine the uncertainty and that is all that is necessary.  Not only that, but for him to go further and declare an hypothesis as true or untrue is ridiculous!  What makes a p value of 1% “true” but one of 15% “false”? That is a rhetorical question because the answer is clearly, “Nothing.” No particular p value makes an hypothesis (null or otherwise) true or untrue.  By definition the smaller (or larger) the value the more likely a certain hypothesis is true or not true.

One of the problems with this kind of research as compared with a lot of social science research is that in the social sciences we often have an hypothesis and we are trying to determine whether it is true or not - often whether an effect exists or not.  It is an either/or, a dichotomous question.  In those cases, it is more appropriate to arbitrarily choose a p value - a line in the sand - such that we either accept or reject it, and hopefully the reader is aware of the arbitrariness of that assertion and that he uses the p value to draw his own conclusion and/or make his own practical decisions.

However, in baseball research, and in plenty of other disciplines of course, we often are not interested in an either/or answer.  We are often interested in the magnitude of an effect.  In those cases, it is not particularly appropriate to couch it in terms of an either/or answer.  An example (among MANY) is clutch.  Many researchers in the past have found little evidence of a clutch skill and using the “either/or” approach, they have rejected the hypothesis (or accepted the null hypothesis I guess) that a clutch skill exists.  Well, that ain’t right, as many people, including Bill James, have said over and over.  We sort of KNOW that a clutch skill HAS to exist to SOME degree since these are not robots at the plate.  So the real question is what is the likely magnitude of the effect? 

Andy in his research for The Book found it to be likely small, but likely larger than zero.  Had he simply used the either/or paradigm, I think he would have rejected the null hypothesis and declared that, “Clutch is a skill.” So how can we have all these researchers doing good research and some concluding that, “A clutch skill exists” and others concluding the opposite?  Because using the accept/reject approach is simply not appropriate in this case, and in many if not most cases of baseball research.  Again, report the effect you find in the empirical data, do your t-tests or chi-square tests or whatever, and report those results as well.  If you (the writer/researcher) want to attach some words to those results, fine with me.  It is just that as an informed reader, I am not going to pay much attention to those words, unless of course I don’t understand the numbers (which is a real danger).  Again, how can those words mean much when one researcher finds a p-value of .05 and rejects and another one accepts?

There is also the other issue, which hardly anyone ever mentions but which I have to remind people of every once in a while: in a lot of baseball research there are prior probabilities, which render the analysis Bayesian.  Sometimes a p value not including the priors can be extremely significant and other times a p value of .01 can mean nothing.
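One way to picture the point about priors is a simple normal-normal update. The sketch below uses entirely hypothetical numbers (an observed effect of .030 with a standard error of .010, and two made-up prior spreads centered on zero). It is not a method taken from The Book or from Sky's article, just an illustration of how the same observation can lead to very different conclusions depending on the prior.

```python
import math

def posterior_normal(prior_mean, prior_sd, observed, std_err):
    """Combine a normal prior with one normal observation (conjugate update)."""
    prior_prec = 1.0 / prior_sd ** 2   # precision = 1 / variance
    data_prec = 1.0 / std_err ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * observed)
    return post_mean, math.sqrt(post_var)

# Hypothetical observation: effect of .030 with standard error .010 (z = 3).
observed, std_err = 0.030, 0.010

# Hypothetical priors, both centered on zero: one tight, one loose.
skeptical = posterior_normal(0.0, 0.005, observed, std_err)
open_minded = posterior_normal(0.0, 0.050, observed, std_err)

print(f"tight prior: mean = {skeptical[0]:.4f}, sd = {skeptical[1]:.4f}")
print(f"loose prior: mean = {open_minded[0]:.4f}, sd = {open_minded[1]:.4f}")
```

With the tight prior the estimate is pulled most of the way back toward zero even though the raw result sits three standard errors out. With the loose prior the estimate stays close to what was observed.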


#7    Tangotiger    2009/05/20 (Wed) @ 09:54

Here’s Phil on a related issue, quoting someone:
http://sabermetricresearch.blogspot.com/2009/05/dont-always-blindly-insist-on.html

“When I teach econometrics I tell my students that a sentence that begins by stating a coefficient is statistically insignificant ends with a period.” She tells her students that she never wants to see “The coefficient was insignificant, but…”

You can read the rest of Phil’s post. 

The one thing to remember is that this “teaching” makes sense only if you say “This particular study only showed significance with a p value of .20, buuuuuut, this was a poorly done study, because our prior is that we absolutely must have a p value of .00.”

As Phil notes:

Here’s an example with real data. I took all 30 major league teams for 2007, and I ran a regression to see if there was a relationship between the team’s triples and its runs scored. It turned out that there was no statistically-significant relationship: the p-value was 0.23, far above the 0.05 that’s normally regarded as the threshold.
...
And maybe that would be the case if we didn’t know anything about baseball. But, as baseball fans, we know that triples are good things, and we know that a triple does help teams score runs. That’s why we cheer our team’s players when they hit them. There is strong reason to believe there’s a connection between triples and runs.

So I don’t think it’s inappropriate at all to look at our coefficient. It turns out that the coefficient is 1.88. On average, every additional triple a team hit was associated with an increase of 1.88 runs scored.

Of course, there’s a large variance associated with that 1.88 estimate—as you’d expect, since it wasn’t statistically significant from zero. The standard deviation of the estimate was 1.53. That means a 95% confidence interval is approximately (-1.18, 4.94). Not only is the 1.88 not significantly different from zero, it’s also not significantly different from -1, or from almost +5!

But why can’t we say that? Why shouldn’t we write that we found a coefficient of 1.88 with a standard deviation of 1.53? Why can’t we discuss these numbers and the size of the real effect, if any?

Berri and his co-author would argue that it’s because we have no good evidence that the effect is different from zero. But what makes zero special? We also have no good evidence that the effect is different from 1.88, or 4.1, or -0.6. Why is it necessary to proceed as if the “real” value of the coefficient is zero, when zero is just one special case?
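The interval arithmetic in Phil's example is easy to reproduce. The sketch below takes only the two figures quoted above (coefficient 1.88, standard deviation 1.53) and assumes the usual rough rule of plus or minus two standard errors for a 95% interval. The function name and the list of candidate values are illustrative.

```python
def approx_95_ci(estimate, std_err, multiplier=2.0):
    """Approximate 95% interval: estimate +/- multiplier * standard error."""
    half_width = multiplier * std_err
    return estimate - half_width, estimate + half_width

# Figures quoted in Phil's post.
coef, se = 1.88, 1.53
low, high = approx_95_ci(coef, se)
print(f"t = {coef / se:.2f}, interval = ({low:.2f}, {high:.2f})")

# Values mentioned in the quoted passage; each one falls inside the interval.
for candidate in (0.0, -1.0, 1.88, 4.1, -0.6):
    inside = low <= candidate <= high
    print(candidate, "is inside the interval" if inside else "is outside the interval")
```

Zero is inside the interval, but so is every other value Phil lists, which is the point of his question about what makes zero special.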

If you are testing to see if there is ANY relationship, then, yes, zero is the correct case, since that’s the question you are asking: “What is the chance THAT THIS STUDY SHOWS that the observed difference is not purely random, that is, different from zero.”

We have no prior expectation of anything other than zero, so we test on that basis.

For a triple, our expectation is that it must (not should, but must) be greater than zero.  It can never be zero or below (all other things equal). 

So, if you create a regression that doesn’t show that, then what you can conclude is that the study itself was poorly constructed.  GIGO.

That’s why, as MGL notes with clutch hitting, since we’ve got people involved, it’s illogical to think that some people will not change their behaviour if they perceive (consciously or subconsciously) a different environment.  If you have a study that says that “clutch hitting doesn’t exist”, what you are really saying is “I wasn’t able to construct a study to find clutch hitting”.  Those are two different things.

Practically speaking, if you have a well-constructed study, and you can’t find the effect, then all you can say is that, the effect, while sticking to our prior that it’s not zero, is simply not much different from zero, and therefore, doesn’t have much if any of a role for practical purposes.

The problem is the researcher believing, to begin with, that his study is well constructed enough to let him come to his conclusion.  That’s a mighty assumption that needs to be stated.


#8    2009/05/20 (Wed) @ 10:15

Perhaps my reading of MGL’s post was a tough one.  One of MGL’s pet peeves is people dismissing hypotheses for not reaching .05.  My recent posts here have been motivated by my own pet peeve: people being overly critical of frequentist approaches.

Tango puts a good spin on this:  Sky’s study reported no significant effects.  He reported trends which provide us with little information. He interprets them in a certain way, as did (I presume from the responses) most of his readers. But the reason they interpreted the data in this way is because of their priors. They learned nothing new, really, but the study gave the illusion of empirical reinforcement for their pre-existing beliefs. Science is all about disproving priors, not reinforcing them; the former is useful, the latter is useless.

Frequentist statistics will *never* allow you to disprove a hypothesis.  Null results are never meaningful. People who say “There is no clutch hitting” because they got null results are wrong.

But that doesn’t mean the framework of designing a (well-constructed, as Tango says) experiment to test a hypothesis and setting an a priori threshold for a desired confidence level is wrong.  On the contrary, it is *extremely* useful (in a way Bayesian methods are not). That is why it is so widely used, despite its shortcomings.


#9    2009/05/20 (Wed) @ 11:54

Tango/7: very well said.  You hit on a different topic that I was going to get to: the practical significance.  The goal is to get to the point where you say, “the effect is important and evidence suggests it’s real,” or “the effect either does not exist or is very small anyway.”

If you wind up with “the observed effect is large but there’s not enough evidence that it’s real,” then you have indeed not constructed your study very well.


#10    Sky Andrecheck    2009/05/20 (Wed) @ 17:30

The study concerned players with long hit streaks, for which there is only a small amount of data.  There’s no harm in reporting the stats, even if there is not yet sufficient data to support statistically significant conclusions - especially if you mention several times that this is the case. 

The reason I mentioned the difference in BAV is that the data show the opposite of most people’s priors - players with long streaks have seen their BAVs go down, not up.  Is it noise?  Maybe, but that’s what the data show so far.  Thus I conclude that there is “preliminary evidence”, but that more data are needed to really tell - I think that’s fair.

