Tuesday, March 08, 2011
Does Dave Righetti influence the HR rates of his pitchers?
Looks interesting. I’d like to see this split between home and road, and vs. LHH and RHH. I just don’t necessarily trust anyone’s park adjustments.
Glove-slap: Lee
He gives us the sigmas (in the above chart, it is 1.86) for a few different subsets of pitchers. I’m not sure if they overlap (I didn’t RTFA carefully enough) or what the overall sigma is.
The reason I mention that should be obvious. Even if there is no coaching effect for HR/FB, one coach in 200 will be 2.5 sigma better than the rest, 1 coach in 40 will be 2 sigma, etc.
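As a rough check on those frequencies, here is a minimal sketch that assumes the coach-to-coach results are normally distributed when there is no real coaching effect (an assumption, nothing in the article confirms it). The function name `upper_tail` is made up for illustration.

```python
import math

def upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal variable (one-sided tail)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# How often would a no-talent coach land this far on the good side by luck alone?
for z in (2.0, 2.5):
    p = upper_tail(z)
    print(f"{z} sigma or better: p = {p:.4f}, about 1 coach in {1 / p:.0f}")
```

Under that assumption the one-sided frequencies come out in the same ballpark as the round "1 in 40" and "1 in 200" figures.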
Please don’t ever look at a bunch of things and proclaim that because you found one of them that is statistically dissimilar from the mean, that it must mean something. It generally doesn’t. That is called data mining, cherry picking, publishing bias, etc.
I am not saying this is what is being done, but if the overall effect for one coach is even 2.5 sigma, without any other analysis of the distribution among all coaches (to see if indeed there is a “HR/FB coaching talent”), then it is meaningless. It could have been Duncan, Rothschild, Mazzone, or a hundred other known or unknown coaches.
It could have been hitting coaches, pitching coaches for K rates, BB rates, etc.
Publishing bias can really rear its ugly head. I caution everyone about it. What if over the last 5 or 10 years, there were dozens of researchers looking at all kinds of “coaching” angles? Someone would come up with a coach or manager who is 2 or even 3 sigmas from the mean for something, even if it was due to random chance!
So far in the last 20 years that I am aware of, we have Mazzone with ERA and Righetti with HR/FB. How do we know that is not publishing bias and that hundreds of similar things were looked at with no luck?
This kind of thing could be nipped in the bud by insisting on discussing uncertainty. If, instead of being allowed to say the Giants had a HR/FB skill that was 1.2% better than the league, he concluded Righetti is worth a reduction in HR/FB of 0.0% to 2.4% then everyone would pretty quickly realize that you can’t really draw any conclusions from that. And no one would be allowed to use that to say he’s worth over a hundred million dollars.
Larry, my point has nothing to do with the uncertainty level, I am afraid. You are conflating two different concepts. Say I perform an experiment whereby I flip a coin 100 times, and I conduct that experiment 10000 times. It is likely that eventually I will have something like 65% heads or tails, which would be 3 SD from the mean. The uncertainty around that is meaningless. I still don’t have any evidence of a biased coin because I data mined to get that result (and then committed an egregious case of publishing bias, by not reporting the results of all my experiments).
So while the sample size and accompanying sigma of the result are important, so is the number or potential number of experiments I (or even someone else) performed. Some people say (and they would not be wrong) that any published study based on sample data suffers from publishing bias because you must include the entire body of experiments that all researchers have done over the course of history…
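The coin experiment is easy to try. This is only a sketch: the seed, the helper names, and the "65 or more heads or tails" cutoff are choices made for illustration, and every simulated coin is fair.

```python
import random
from math import comb

FLIPS = 100
EXPERIMENTS = 10_000

def heads_in_one_experiment(rng: random.Random) -> int:
    # 100 random bits stand in for 100 fair flips; 1-bits count as heads.
    return bin(rng.getrandbits(FLIPS)).count("1")

def p_extreme(threshold: int = 65) -> float:
    """Exact chance that one fair-coin experiment shows >= threshold heads or tails."""
    upper = sum(comb(FLIPS, k) for k in range(threshold, FLIPS + 1)) / 2**FLIPS
    return 2 * upper  # the heads and tails tails are symmetric and do not overlap

rng = random.Random(2011)  # fixed seed so the run is repeatable
results = [heads_in_one_experiment(rng) for _ in range(EXPERIMENTS)]

most_extreme = max(results, key=lambda h: abs(h - FLIPS // 2))
extreme_count = sum(1 for h in results if abs(h - FLIPS // 2) >= 15)
p_one = p_extreme()
p_any = 1 - (1 - p_one) ** EXPERIMENTS

print(f"most extreme result: {most_extreme} heads out of {FLIPS}")
print(f"experiments with 65+ heads or 65+ tails: {extreme_count}")
print(f"chance one fair experiment is that extreme: {p_one:.4f}")
print(f"chance at least one of {EXPERIMENTS} is: {p_any:.6f}")
```

Reporting only `most_extreme` and staying quiet about the other 9,999 results is the publishing bias being described.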
With similar discussion of Matt Cain and Jose Bautista (as a hitter), I put together a WOWY from Gameday (2005-2010) to get around the need for park factors. I haven’t yet spent enough time with the results to see if there’s anything conclusive for the pitchers, but there might well be.
I was trying different combinations of things to hold constant. I went as far as to look at all same handed pitchers facing the same batters, in the same ballparks, hitting the ball to the same field (lf,cf,rf), both ld and fb. Then I compared the observed to expected rates of hr/(ofld+lffb).
I didn’t see much difference in rates by handedness once the ball was contacted, so dropped that, and dropped individual batter identity. The list below is based on park_id, bathand_cd, battedball_cd (ld or fb) and fld_cd (7,8,9)
>=700 balls contacted
Pitcher      bc     obs    exp    ratio
Pelfrey      927    .052   .076    .681
Jurrjens     725    .057   .079    .717
Braden       729    .059   .082    .720
Buehrle      1799   .078   .106    .738
Wang         729    .056   .074    .756
...
ROrtiz       783    .107   .090   1.192
Colon        732    .096   .080   1.193
Wellemeyer   833    .096   .079   1.220
BMyers       1198   .114   .091   1.248
Chen         837    .108   .086   1.256
I can send a copy of the source data to anyone who wants to play with it.
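For anyone who wants to play with it, the obs/exp ratio can be sketched roughly like this. The records are made-up toy data, `obs_exp_ratio` is a hypothetical helper, and the "expected" rate here is simply the HR rate of all pitchers combined in each park_id / bathand_cd / battedball_cd / fld_cd bucket. That last part is an assumption about the method, and a true WOWY would leave the pitcher himself out of the baseline.

```python
from collections import defaultdict

# Toy records: (pitcher, park_id, bathand_cd, battedball_cd, fld_cd, is_hr)
records = [
    ("A", "SF", "R", "fb", 7, 1),
    ("A", "SF", "R", "fb", 7, 1),
    ("B", "SF", "R", "fb", 7, 0),
    ("B", "SF", "R", "fb", 7, 0),
    ("A", "SF", "L", "ld", 9, 0),
    ("B", "SF", "L", "ld", 9, 1),
]

def obs_exp_ratio(records, pitcher):
    # Baseline: balls and home runs per bucket, over all pitchers.
    league = defaultdict(lambda: [0, 0])
    for _, park, bat, bb, fld, is_hr in records:
        cell = league[(park, bat, bb, fld)]
        cell[0] += 1
        cell[1] += is_hr

    balls = hrs = 0
    expected_hrs = 0.0
    for name, park, bat, bb, fld, is_hr in records:
        if name != pitcher:
            continue
        n, h = league[(park, bat, bb, fld)]
        balls += 1
        hrs += is_hr
        expected_hrs += h / n  # each ball contributes its bucket's HR rate

    obs = hrs / balls
    exp = expected_hrs / balls
    return obs, exp, obs / exp

for name in ("A", "B"):
    obs, exp, ratio = obs_exp_ratio(records, name)
    print(f"{name}: obs {obs:.3f} exp {exp:.3f} ratio {ratio:.3f}")
```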
MGL:
I wasn’t responding to your point with my comment nor am I disputing that publication bias is a problem. My point was that relying on binary judgments of significance and then using the estimated value without error bars gives the results a precision that is completely unjustified. Adding in the uncertainty will impose a humility on the results. Publication bias is indeed a problem over and above this.
I would dispute that the single trial uncertainty levels are meaningless in the coin-flipping example. They tell you that instead of concluding you have a 65% weighted coin, you have a 95% chance that you have a coin weighted somewhere between 55% and 75%. You’re a lot less likely to make a really dumb error with that info than with the conclusion you have a significant result of a 65% weighted coin. The 95% is also very important. It tells you that 5% of the time, the real value is outside the interval. It’s true none of this incorporates the actual large number of trials. But understanding this and being aware of it makes it much less likely you’ll make a major category error.
You missed MGL’s example of the 65% coin. He ONLY reported the one set of 100 flips where he got 65 heads. He did NOT report the other 9999 groups of 100 coin flips because the result of those flips was less than 65 heads.
Therefore, you cannot conclude that you can possibly have a biased coin, unless you know if this set of 100 flips was done once and reported once, or done 10,000 times and reported (non-randomly) once.
I didn’t miss it. My whole second paragraph addresses it. I acknowledge explicitly that the uncertainty fails to directly account for the large number of trials. I don’t disagree with your second paragraph at all, but I argue that using confidence intervals the mistake is a lot smaller. If you want to argue that deciding there’s an effect at all is the mistake, then I concede. It’s true, that mistake will be made unless publication bias is dealt with explicitly, but I don’t think I dodged that point.
The issue is how you deal with publication/data-mining bias when you don’t know if the number of implicit trials is 1, 100, 10,000 or 100 million. My argument is to start with acknowledging the uncertainty of 1 trial. It’s pretty big! And understanding where it comes from can help appreciate the problem of what happens if you’re doing it over and over in a way that rote significance testing does not.
I don’t think you really addressed it. When you say this:
“you have a 95% chance that you have a coin weighted somewhere between 55% and 75%”
this is predicated on the presumption that you have a random sample. But, what you really have is a Bayes problem:
“What is the likelihood that someone flipping a coin 100 times will report his results if he got less than 65 heads? What is the likelihood if he got at least 65 heads?”
If the answer is not the same, then you don’t have a random sample. And if you don’t have a random sample, you can’t use the fact that you have one SD = .05.
Suppose that the answer to my question is that you are 100 times more likely to publish a result of at least 65 heads (or at most 35 heads) than you are to report a result of 36-64 heads. That is, one time in 1000 someone will report their extreme results. But only one time in 100,000 will someone report flipping 36-64 heads.
Given that prior, what is your mean estimate and uncertainty level?
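One possible way to set that question up, offered as a sketch and not as the answer: treat the reported 65-heads result as a draw from a selection model, where a result with k heads is reported with probability 1/1000 if it is extreme and 1/100,000 otherwise. The flat prior over the coin's true heads rate is purely an assumption (most real coins are fair, so a realistic prior would be far tighter around 50%), and all function names are hypothetical.

```python
from math import comb

N, OBSERVED = 100, 65
P_REPORT_EXTREME = 1 / 1_000      # results of >= 65 or <= 35 heads
P_REPORT_ORDINARY = 1 / 100_000   # results of 36-64 heads

def binom_pmf(k, p):
    return comb(N, k) * p**k * (1 - p) ** (N - k)

def report_prob(k):
    return P_REPORT_EXTREME if (k >= 65 or k <= 35) else P_REPORT_ORDINARY

def posterior(grid, adjust_for_selection):
    weights = []
    for p in grid:
        like = binom_pmf(OBSERVED, p)
        if adjust_for_selection:
            # Chance that an experiment on a coin with this p gets reported at all.
            p_reported = sum(binom_pmf(k, p) * report_prob(k) for k in range(N + 1))
            # Likelihood of seeing 65 heads, given that we only see reported results.
            like = like * report_prob(OBSERVED) / p_reported
        weights.append(like)
    total = sum(weights)
    return [w / total for w in weights]

def mean_and_sd(grid, post):
    mean = sum(p * w for p, w in zip(grid, post))
    var = sum((p - mean) ** 2 * w for p, w in zip(grid, post))
    return mean, var**0.5

grid = [i / 1000 for i in range(1, 1000)]  # flat prior over p: an assumption
naive = mean_and_sd(grid, posterior(grid, adjust_for_selection=False))
adjusted = mean_and_sd(grid, posterior(grid, adjust_for_selection=True))

print(f"ignoring selective reporting: mean {naive[0]:.3f}, sd {naive[1]:.3f}")
print(f"allowing for it:              mean {adjusted[0]:.3f}, sd {adjusted[1]:.3f}")
```

Even with the generous flat prior, allowing for selective reporting pulls the estimate back toward 50%. How far it moves depends heavily on the prior, which is the hard part of the question.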
Larry, both issues are important, however…
This statement is untrue of course:
“They tell you that instead of concluding you have a 65% weighted coin, you have a 95% chance that you have a coin weighted somewhere between 55% and 75%.”
There is not a 95% chance that you have a weighted coin since you only reported a small fraction of your experiments. In the original article about Righetti, the uncertainty level also does NOT tell us the chances of this being a random result without knowing the explicit or implicit number of experiments that were done.
I was wrong to say that within this context the uncertainty level was meaningless. I should have said that it only has meaning (in terms of the result being random) when considering the number of explicit or implicit experiments that may have been done. The uncertainty level alone does not tell you the likelihood of the result given the null hypothesis unless you are certain that this was the only experiment done, as you seem to imply.
In fact what you do when you are data mining is adjust that uncertainty level (the SD changes). The “equation” is, “Given the number of experiments I did, what are the chances that in one of the experiments, X result would occur if we accept the null hypothesis?” In your coin example, rather than 5% or 2.5% (at 2 sigma), if I did 1,000 experiments, it is more like close to 100% (or whatever it is).
Tango/MGL:
Agreed. In the context of many experiments the statement, “They tell you that instead of concluding you have a 65% weighted coin, you have a 95% chance that you have a coin weighted somewhere between 55% and 75%.” is incorrect. It is incorrect in exactly the same way that reporting a positive result on the significance test is incorrect because the data-mining/many trials were unaccounted for. So, I concede that showing uncertainty doesn’t solve the problem of publishing bias at all.
I am saying that confidence intervals have several effects that might reduce the impact of the problems caused by publication bias. First, passing a significance test (that doesn’t account for publication bias) will no longer be a license to treat the result as exact. Second, the term confidence ought to help researchers realize that it isn’t 100% certainty. Third, results that have p-Values of 0.05 or 0.025 will have large confidence intervals. These results are also the ones most likely to be a result of publication bias and so the conclusions drawn from them will be much more muted than without confidence intervals.
All that said, a quantitative approach, adjusting the significance test and the uncertainty to account for the ensemble of tests is preferred.
Last, I hope to answer Tango’s question given his proposed publication bias. But, it’ll take a little time. It isn’t trivial.
1. I agree with all the comments above.
2. As MGL/4 and Larry/9 note (but MGL/10 sort of contradicts in the last paragraph), a simple statistical correction for publication bias is unavailable, given the intractability of defining the family of studies whose error rate you’re attempting to control (e.g., all studies ever, all studies about baseball, all studies about pitching coaches, all studies I did on pitching coaches?).
3. The difficult thing with these sorts of analyses is not simply the publication bias problem, but the repeatability problem. Contrast clinical trials--rife with publication bias--but the best correction for the bias is to have other folks run the same trial. But that’s not possible here.
4. And, just to be clear…
Say I perform an experiment whereby I flip a coin 100 times, and I conduct that experiment 10000 times[...] I still don’t have any evidence of a biased coin because I data mined to get that result (and then committed an egregious case of publishing bias, by not reporting the results of all my experiments.
There’s an ambiguity here:
(a) performing the experiment on the *same* coin (or coins we know to be identical) 10000 times and only reporting one outcome. This is fraud, not data mining.
(b) performing the experiment 10000 times on 10000 different coins. This isn’t exactly data mining either.
(b) above is indeed a classic case of data mining, whereas, as you say, (a) is just fraud (or whatever you want to call it).
In my quest to find a biased coin, all I have to do is conduct thousands of experiments on thousands of different fair coins and I will likely find one experiment and one fair coin that is 2 or 3 sigmas from the mean.
(a) would be like looking at different subsets of data that Righetti might have an influence over and reporting any that might differ from the null H. (b) is looking at lots of coaches and lots of different ways that coaches can influence player stats and then reporting one that is anomalous. That is data mining.
BTW, data mining can be OK if once you find some candidates, you are able to conduct the same or similar tests on out of sample data. As you say, that is usually not the case with baseball research as we usually exhaust the data on the first pass.
BTW, you can only account for the data mining that YOU do in adjusting the value of the standard errors. You can’t adjust for the whole body of research that may or not be published. And it would not be appropriate to do that. You have to leave that to others who are evaluating the research…
MGL/13
Thanks for the clarification.
1. Amen on your point that data mining isn’t, in itself, bad--it’s a fine way of generating hypotheses to test. And I think a lot of these issues disappear if we take this research to be hypothesis-generating.
2. *However* I don’t think it’s always a straightforward matter as to how we should interpret the findings in a case when your experiment is like (a):
looking at different subsets of data that Righetti might have an influence over and reporting any that might differ from the null H.
*If* you have good reason to believe that the multiple tests on Righetti (HR/FB, injury, etc.) are statistically independent, then there is really no reason to think that your experimental analysis is relevantly similar to running a thousand versions of the same test on the same coin.
If independence holds--this is a *big* if, since a lot of us think pitching coaches’ ability to influence pitchers should show up in more than one variable--correcting for the fact that you’ve done other tests would be a mistake.
"If independence holds--this is a *big* if, since a lot of us think pitching coaches’ ability to influence pitchers should show up in more than one variable--correcting for the fact that you’ve done other tests would be a mistake.”
Sorry, that is wrong. The dependence/independence doesn’t matter. If I conduct 100 unrelated experiments, what are the chances that I get a 2 sigma (or more) result in at least one?
1-.95^100 = 99.4%
I should add, “by chance - assuming that the null hypothesis in all 100 experiments is exactly true...”
Thus, what are the chances that I am going to make a Type I error (rejecting the null hypothesis when true)?
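The same arithmetic for other numbers of experiments, as a small sketch. It assumes the tests are independent and that "2 sigma" means a 5% false-positive rate per test; the function name is made up.

```python
def chance_of_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one 'significant' result when every null is true."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 5, 30, 100):
    print(f"{n:>3} tests: {chance_of_false_positive(n):.1%}")
```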
The dependence/independence doesn’t matter. If I conduct 100 unrelated experiments, what are the chances that I get a 2 sigma (or more) result in at least one? 1-.95^100 = 99.4%
I really do get this, despite all appearances to the contrary.
The intuition behind my wrongheadedness. Consider these two cases:
(a) One hundred toin-cossing experiments on fair coins.
(b) One hundred experiments on 100 unrelated hypotheses (coins, chemotherapeutic agents, earthquakes) by 100 different experimenters.
We know that a single two-sigma result in (a) should not be accorded very much weight.
Should a single two-sigma result in (b) be treated just like the two-sigma result in (a)?
I honestly have no idea--the same logic *seems* to hold for discounting in (b) as in (a). Given 100 independent experiments of disparate phenomena, if null H were true in all cases, we could expect at least one 2 sigma result 99.4% of the time.
But it doesn’t seem right to discount a two-sigma result in a microbiologist’s 1st-ever experiment on penicillin because 99 other scientists have performed tests on earthquakes. *But I guess we should*? (Type II error rate notwithstanding)
Anyhow, if the 100 variables we’re testing on Righetti are *truly* independent--very very very unlikely--then (c) seems to me like (b), rather than (a).
Again, it may not matter that it’s just like (b).
But I feel that, for clarity’s sake, it’s worth pointing out that it is more analogous to (b) than (a).
The answer to your question is yes, (b) and (c) are just like (a). One needs to be appropriately skeptical of that penicillin result. If it can be repeated a few times independently, then you can be more certain.
That said, I want to say that I don’t think this is the biggest problem with this study. The question here shouldn’t have been what was the chance of Righetti having a 2-sigma result, but what was the chance of the largest of the 5 coaches studied having a 2-sigma result. Even that isn’t correct, since the study really looked at the largest of all HR/FB performances of all coaches over the 9 years (at least 30 opportunities, 1 for each team). I think the significance test is going to fail there even without going to this discussion of publication bias.
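To put a rough number on that, here is a sketch of how many sigma the best of n coaches would need before the "largest of n" result is significant at the usual 5% level. It assumes independent, normally distributed results and a one-sided test, none of which the study confirms, and uses a Šidák-style adjustment that I am swapping in for illustration. The function name is hypothetical.

```python
from statistics import NormalDist

def sigma_needed_for_best_of(n: int, alpha: float = 0.05) -> float:
    """z-score the maximum of n independent null results must exceed
    so that the chance of it happening by luck is only alpha."""
    # P(max <= z) = Phi(z)**n, so solve Phi(z)**n = 1 - alpha for z.
    return NormalDist().inv_cdf((1 - alpha) ** (1 / n))

for n in (1, 5, 30):
    print(f"best of {n:>2}: needs about {sigma_needed_for_best_of(n):.2f} sigma")
```

Under those assumptions a 2-sigma result for the best of 30 teams falls well short of the bar.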
Then there’s the problem of assigning a precise monetary value to the determined contribution.
"But it doesn’t seem right to discount a two-sigma result in a microbiologist’s 1st-ever experiment on penicillin because 99 other scientists have performed tests on earthquakes. *But I guess we should*? (Type II error rate notwithstanding).”
That is a great example! It is somewhat of a philosophical issue that has confounded smart people for a long time.
Yeah, theoretically, if the null hypothesis is true for all 100 of those experiments, someone is likely to have a 2 sigma result. Today it could be the microbiologist, tomorrow it could be one of the earthquake guys (seismologists? geophysicists?).
That is one reason you often get so many varied and contradictory results in common experiments. Not only can the methodologies (and other things) be slightly different among them, but there are Type I and Type II errors left and right if the experiments and studies are done often enough…
I responded to this there in the comments at Fangraphs a little while ago. This is what I wrote:
I just checked FanGraphs’ database, hoping to compare the Giants’ 2001 HR/FB rate vs. their 2002 HR/FB rate but, unfortunately, that stat has only been tracked since 2002. So I can’t compare the Giants to the year before.
However, I did do something else that I thought was worthwhile, and that was to check how the Giants HITTERS did in HR/FB. Because if the hitters also had an unusually small rate, then I think that a large factor in the cause of this unusually low HR/FB rate would be the ballparks that they were playing in.
The Giants hitters had the sixth-lowest HR/FB rate from 2002 to 2008 of the 30 MLB teams. While the Giants pitchers only gave up 8.6 home runs for every 100 fly balls, the Giants hitters only hit 9.2 for every 100 fly balls.
The ML average for home runs for every 100 flyballs was 10.4. So the Giants pitchers deviated from that by about -1.8, while the Giants hitters deviated from that by about -1.2. (Stat note: that’s just deviation from the mean; I’m not too good with standard deviation.)
Because of how close the Giants hitters were to the pitchers, I think that it’s almost certain that the ballparks (home and road) played a significant factor in the home run suppression – at least as big as any other factor. That’s my opinion.