Sunday, October 09, 2011
How not to do a study
Yesterday, I made a point of saying that technically, you cannot include in your control dataset the data that you are actually studying. Practically, if the data in question makes up only 5% of the whole control dataset, you don’t need to worry about it, but you still need to point out this bias.
Why? Because today I see someone do this.
For 2011, BB-REF PI lists 504 starts where the starter pitched at least 24 outs, which I choose so as not to bias the results by looking only at complete games. The 504 starts were distributed among 140 pitchers.
When I add up the game data for these starts, I get the following line:
W    L   W-L%   ERA   GS   CG   SHO  IP       H      ER   HR   BB   SO     WHIP
351  82  0.811  1.23  504  169  73   4,189.2  2,532  572  197  678  3,220  0.77
That seems to be a pretty good line to me, and would argue keeping a starter in for the 9th inning who has already gone 8 IP, in the absence of any other data.
But he DID bias the study.
There are two things being said here, both important to note. The first is that the poster selected all pitchers with at least 24 outs recorded. And among those pitchers, they had an ERA of 1.23 for the whole game. This should come as no surprise. If I had selected all pitchers who pitched in a game their team won, you’d also see some ridiculously low ERA.
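To see how selecting on innings pitched manufactures a low ERA all by itself, here is a minimal simulation sketch. Everything in it is made up for illustration: the per-inning run distribution, the rule that the (hypothetical) manager pulls the starter once he has allowed 3 runs, and the names `simulate_start` and `compare`. Every simulated pitcher is identical, at 0.4 runs per inning, or an ERA of about 3.60 by construction.

```python
import random

def simulate_start(rng):
    """One made-up start: returns (innings_pitched, runs_allowed)."""
    runs = 0
    for inning in range(1, 10):
        # Made-up run distribution per inning: 0, 1, or 2 runs (mean 0.4).
        runs += rng.choices([0, 1, 2], weights=[70, 20, 10])[0]
        # Hypothetical manager: pull the starter once he has allowed 3 runs.
        if runs >= 3:
            return inning, runs
    return 9, runs

def era(runs, innings):
    return 9 * runs / innings

def compare(n_starts=20000, seed=2011):
    rng = random.Random(seed)
    starts = [simulate_start(rng) for _ in range(n_starts)]
    # Select the same way the poster did: at least 24 outs (8 innings).
    deep = [s for s in starts if s[0] >= 8]
    overall = era(sum(r for _, r in starts), sum(ip for ip, _ in starts))
    selected = era(sum(r for _, r in deep), sum(ip for ip, _ in deep))
    return overall, selected

overall_era, selected_era = compare()
print(f"all starts: {overall_era:.2f}, starts of 8+ innings: {selected_era:.2f}")
```

The selected group comes out with a much lower ERA than the full group, even though no pitcher in the simulation is any better than any other. The low number is a product of the selection, not of the pitchers.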
The second thing the poster said is this:
“keeping a starter in for the 9th inning who has already gone 8 IP,”
Now, if you have someone who pitched through 8 innings, this guy may have had an ERA of 1.00 through 8 (for illustration purposes only). If he ends up with an ERA of 1.23, does this prove that you made the right choice to let him pitch in the 9th?
In the above line, there were 504 starts, meaning that in their first 8 innings (4032 innings), the pitchers would have had (if the completely made up ERA of 1.00 is true) 448 ER. Those pitchers actually ended with 4190 innings and 572 ER. That means that AFTER the 8th inning, they pitched an extra 158 innings and allowed an extra 124 ER, for an ERA of 7.06.
So, when you look at the poster’s comment, “keeping a starter in for the 9th inning who has already gone 8 IP”, we see that in this illustration it was a horrible choice.
Now, what if instead, the pitchers through 8 had an ERA of 1.25? In that case, in their 4032 innings, they’d have allowed 560 ER. And therefore, in their performance AFTER the 8th inning, they’d be at 158 more innings and 12 more ER, for a minuscule ERA of 0.68.
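The arithmetic in these two illustrations can be sketched in a few lines of Python. The totals come from the poster's line (with 4,189.2 IP rounded to 4190, as above); the through-8 ERA is the made-up input, and the function name `after_eighth` is just a label for this sketch.

```python
STARTS = 504
TOTAL_IP = 4190   # 4,189.2 in the poster's line, rounded
TOTAL_ER = 572

def era(earned_runs, innings):
    return 9 * earned_runs / innings

def after_eighth(assumed_era_through_8):
    """Back out what happened after the 8th, given an ASSUMED ERA through 8."""
    ip_through_8 = STARTS * 8                              # 4032 innings
    er_through_8 = assumed_era_through_8 * ip_through_8 / 9
    extra_ip = TOTAL_IP - ip_through_8                     # innings after the 8th
    extra_er = TOTAL_ER - er_through_8                     # earned runs after the 8th
    return extra_ip, extra_er, era(extra_er, extra_ip)

for assumed in (1.00, 1.25):
    ip, er, late_era = after_eighth(assumed)
    print(f"through-8 ERA {assumed:.2f}: {ip} IP, {er:.0f} ER, ERA {late_era:.2f} after the 8th")
```

With an assumed 1.00 through 8, the ERA after the 8th works out to 7.06; with an assumed 1.25, it is 0.68. The same combined line of 1.23 is consistent with both.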
What does all this mean? Well, the first thing you need is your control dataset, which in this case is performance through 8 innings. This is the group of pitchers you are interested in. And since the question is: “how did the guys who pitched 8 do in the 9th”, then what you need to study is the out-of-sample data: the performance in the 9th inning.
The above dataset from the poster doesn’t help us. The poster presented us with a combination of the in-sample dataset, the data that we are selecting on, and the out-of-sample dataset, the data that we are interested in studying. Building the dataset that way introduces a sampling bias.
As my two illustrations showed, we have no idea if the pitchers allowed a ton of runs in the 9th, or hardly any. And what we care about, what we are testing, is their performance in the 9th inning.
Let me give you another one: in 2011, there were 141 pitchers that pitched 9 innings. Their ERA was 0.65. Does this mean that these guys would be fantastic candidates to pitch in extra innings? And, if they did, suppose they each pitched one inning, and gave up one run in extra innings. Well, now their ERA would be 1.48.
If all I did was tell you that there were 141 pitchers that pitched 10 innings and their ERA was 1.48 through 10 innings, would you therefore conclude that the manager was correct in letting them pitch 10? If this is the only information I gave you, then you couldn’t come to any such conclusion.
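Here is the same point as a small sketch, keeping the in-sample data (the 9 innings we selected on) apart from the out-of-sample data (the hypothetical 10th inning). The one run per extra inning is the made-up figure from above, and `combined_era` is just a name for this sketch.

```python
PITCHERS = 141
ERA_THROUGH_9 = 0.65

def era(earned_runs, innings):
    return 9 * earned_runs / innings

def combined_era(era_in, ip_in, er_out, ip_out):
    """Blend the in-sample line with the out-of-sample line."""
    er_in = era_in * ip_in / 9
    return era(er_in + er_out, ip_in + ip_out)

ip_in = PITCHERS * 9    # innings we selected on
ip_out = PITCHERS * 1   # hypothetical extra inning each
er_out = PITCHERS * 1   # hypothetical one run each in that inning

blended = combined_era(ERA_THROUGH_9, ip_in, er_out, ip_out)
tenth_only = era(er_out, ip_out)

print(f"ERA through 10 innings: {blended:.3f}")
print(f"ERA in the 10th alone:  {tenth_only:.2f}")
```

The blended figure is roughly the 1.48 quoted above and looks terrific. The 10th inning on its own is an ERA of 9.00. Only the second number tells you anything about the manager's decision.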
This is why it’s important to separate the data that you sample on from the data you are actually testing. Even in cases where it makes little practical difference, you should still do it. Because many times, like here, a poster forgets this, or is unaware of it.
Thank you to the poster for providing the source material, which was used for instructive purposes.

