Saturday, January 17, 2009
Inning-level Linear Weights
I’ve wanted to do this for the longest time, and finally got around to doing it.
From 1993 to 2008, there were nearly 600,000 3-out innings, counting only the first 8 innings of each game. Here is how many runs were scored per inning (R_PER_I) in those innings, broken down by the number of HR hit in that inning, with N being the number of such innings:
HR R_PER_I N
0 0.353 528611
1 1.983 59475
2 3.439 5319
3 4.961 413
4 6.778 36
5 7 1
So, when no HR are hit, which happened in over half a million innings, there were 0.353 runs scored per inning. When exactly 1 HR is hit in an inning, there were 1.983 runs scored. The difference, 1.63 runs, we can attribute to the HR.
Well, not totally, since we didn’t control for the other events (singles, doubles, walks, etc). While we’d like to think those would be random for each HR class, the reality is that an inning with a HR in it was disproportionately pitched (or batted) by pitchers (or teams) predisposed to allow (or hit) other events too. Let’s keep going for the moment.
The number of runs scored with 2 HR hit in an inning is 3.439, which is 1.46 more than when 1 HR is hit. When 3 HR are hit, there are 4.961 runs scored, which is 1.52 more than when 2 HR are hit.
Generally speaking, we can say that each HR adds something like 1.4 to 1.6 runs.
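For anyone who wants to check the subtraction, here is a minimal Python sketch that takes the HR table above and prints the extra runs that go with each additional HR. The names `runs_per_inning`, `innings`, and `marginal_gains` are made up for this illustration; only the numbers come from the table.

```python
# Runs per inning by number of HR hit in the inning, copied from the table above.
runs_per_inning = [0.353, 1.983, 3.439, 4.961, 6.778, 7.0]
innings = [528611, 59475, 5319, 413, 36, 1]


def marginal_gains(runs):
    """Extra runs per inning when going from k-1 events to k events."""
    return [round(later - earlier, 3) for earlier, later in zip(runs, runs[1:])]


gains = marginal_gains(runs_per_inning)
for k, (gain, n) in enumerate(zip(gains, innings[1:]), start=1):
    print(f"HR #{k}: {gain:+.2f} runs (from {n} innings)")
```

The last rows rest on 36 innings and on a single inning, so their gains are mostly noise; the first three are the ones behind the 1.4 to 1.6 range.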
Let’s repeat with triples:
3b R_PER_I N
0 0.518 580901
1 1.866 12711
2 3.441 238
3 4.8 5
When one triple is hit in an inning, there are 1.35 more runs scored than in innings when no triples are hit. I have to admit that this is much higher than expected, even though we have over twelve thousand such innings.
Doubles:
2b R_PER_I N
0 0.352 486603
1 1.231 93687
2 2.676 12095
3 4.134 1343
4 5.426 115
5 7.6 10
6 6 2
When one double is hit in an inning, there are 0.88 more runs scored than when no doubles are hit. However, when two doubles are hit, there are 1.45 more runs scored than with one double. The pattern repeats itself with three doubles (1.46 more runs scored than with 2 doubles). Why so much? Well, if you know that you hit two doubles, then you are guaranteeing one run scored, plus whatever other runners you get to knock in. You are not getting the random events that you are hoping for.
Here it is for singles:
1b R_PER_I N
0 0.207 316712
1 0.543 183406
2 1.259 66297
3 2.366 20036
4 3.54 5539
5 4.745 1410
6 5.965 346
7 7.15 80
8 8.72 25
9 10 4
As you can see, that first single in the inning adds only 0.34 runs, but the second single adds 0.72 runs, the third adds 1.11, and the fourth adds 1.17. The fifth adds 1.20 runs, and the sixth adds 1.22 runs. Even more clearly than with the doubles, if you can bunch up singles in an inning, you’ll get quite the compounding effect.
The weighted average of all these numbers is +0.505 runs, which is a lot more like what we expected. The weighted average for the extra base hits was:
2B +0.951
3B +1.352
HR +1.615
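The weighting behind these averages is not spelled out, so the following Python sketch is a guess at it: each marginal gain (going from k-1 to k events) is weighted by the number of innings with exactly k events. That assumption is mine, as are the names `TABLES` and `weighted_marginal`, but it does reproduce the four figures above to within about 0.001 from the rounded table values.

```python
# (runs per inning, number of innings) indexed by event count, from the tables above.
TABLES = {
    "1B": ([0.207, 0.543, 1.259, 2.366, 3.54, 4.745, 5.965, 7.15, 8.72, 10.0],
           [316712, 183406, 66297, 20036, 5539, 1410, 346, 80, 25, 4]),
    "2B": ([0.352, 1.231, 2.676, 4.134, 5.426, 7.6, 6.0],
           [486603, 93687, 12095, 1343, 115, 10, 2]),
    "3B": ([0.518, 1.866, 3.441, 4.8],
           [580901, 12711, 238, 5]),
    "HR": ([0.353, 1.983, 3.439, 4.961, 6.778, 7.0],
           [528611, 59475, 5319, 413, 36, 1]),
}


def weighted_marginal(runs, innings):
    """Average the k-1 -> k gains, weighting each by the innings with exactly k events.

    The weighting scheme is an assumption, not something stated in the post.
    """
    gains = [later - earlier for earlier, later in zip(runs, runs[1:])]
    weights = innings[1:]
    return sum(g * w for g, w in zip(gains, weights)) / sum(weights)


for event, (runs, innings) in TABLES.items():
    print(f"{event}: {weighted_marginal(runs, innings):+.3f}")
```

Because innings with exactly one event dominate the weights, each average sits close to the gain from the first event, pulled up a little by the bunching effect in the multi-event innings.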
Here are the numbers for non-intentional walks:
bb R_PER_I N
0 0.369 425646
1 0.818 134835
2 1.529 27682
3 2.475 4784
4 4.013 783
5 5.647 102
6 7.158 19
7 8 3
8 16 1
The first walk adds +0.45 runs, while the second walk adds +0.71 runs. The third adds +0.95 runs, and the fourth adds +1.54 runs. Similar to the single, if you can bunch up walks, you have a devastating effect on the runs scored. Interestingly, the weighted average is +0.512 runs, which is MORE than the single! Almost certainly, if you see a lot of walks, you can infer far more about the talent level of the pitchers or hitting team than if you see a lot of singles. We’ll get to that in a few minutes.
Here we have hit batters:
hb R_PER_I N
0 0.52 578497
1 1.566 14970
2 3.021 384
3 4 4
The first hit batter adds +1.05 runs! That is insanely high, and really, impossible. What can it mean? Almost certainly, it means that a lot of hit batters are not random, and, as baseball fans would suspect, are linked to HR. So, that hit batter, in and of itself, should be worth about the same as a walk. But it carries extra hidden information: we can infer that more HR are hit in innings in which a batter is hit than otherwise. The weighted average is +1.06 runs.
Here are the numbers when a batter reaches base on error:
er R_PER_I N
0 0.511 569234
1 1.347 23823
2 2.734 772
3 3.808 26
The first error adds +0.84 runs, with a weighted average of +0.85 runs. Here, too, there is hidden information: we can infer that the fielding team is not that good.
Just to conclude this section, here are the intentional walks:
ib R_PER_I N
0 0.52 571721
1 1.234 21441
2 2.299 672
3 3.667 21
The first IBB adds +0.71 runs, for a weighted average of +0.73 runs. As you can guess, our inference is that IBB are issued when other runners are on base. And so, the +0.7 runs being added are not directly attributable to the IBB itself, but to the KNOWLEDGE that an IBB has been issued: the value goes partly to the IBB and partly to everything else that is not random.
***
Now, let’s take care of all that non-randomness with regression. Taking the 38,830 3-out innings of 2008 only, I get the following coefficients for the regression (r=.875):
+0.36 BB
+0.51 1B
+0.53 Error
+0.78 2B
+1.02 3B
+1.42 HR
Now, those numbers look VERY NICE. They are pretty much what we expected, give or take .03 runs.
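To show the mechanics of this kind of regression (runs in the inning regressed on the count of each event in the inning), here is a small Python sketch. It does not use the real 2008 innings: the data is synthetic, generated from the coefficients above plus Gaussian noise, so all it demonstrates is that ordinary least squares gets the weights back. Real innings have whole-number runs and non-random bunching, which is the whole problem. Every name in it (`EVENTS`, `TRUE_WEIGHTS`, `solve`, `ols`) is made up for the illustration.

```python
import random

EVENTS = ["BB", "1B", "2B", "3B", "HR"]
TRUE_WEIGHTS = [0.36, 0.51, 0.78, 1.02, 1.42]  # 2008 coefficients, used only to generate fake data


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(rows, y):
    """Least squares with an intercept, via the normal equations (X'X) beta = X'y."""
    xs = [[1.0] + list(r) for r in rows]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * t for x, t in zip(xs, y)) for i in range(k)]
    return solve(xtx, xty)


rng = random.Random(0)  # fixed seed so the run is repeatable
rows, runs = [], []
for _ in range(20000):  # synthetic "innings", NOT real play-by-play data
    counts = [rng.choice([0, 0, 0, 1, 2]) for _ in EVENTS]
    rows.append(counts)
    runs.append(sum(w * c for w, c in zip(TRUE_WEIGHTS, counts)) + rng.gauss(0, 0.5))

intercept, *weights = ols(rows, runs)
for name, fitted, truth in zip(EVENTS, weights, TRUE_WEIGHTS):
    print(f"{name}: fitted {fitted:+.3f} (generating value {truth:+.2f})")
```

The sketch sticks to the standard library so it runs anywhere; with NumPy installed, `numpy.linalg.lstsq` would be the usual way to do the fit.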
However, and this is why we don’t want to be a slave to the regression, the coefficient for the hit batter is +.26 runs, and for the IBB, it’s +.43 runs. The IBB value is especially ridiculous: IBB are given out with 1 or 2 outs, so the runners don’t have as many opportunities to score. The standard error of the IBB coefficient is .013 runs, meaning that we are 95% sure the run value is between .40 and .45 runs. Like I said: ridiculous. You have to use the regression as a tool, and not be a slave to it. You must be tempered by your baseball senses.
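The 95% range comes from the usual coefficient plus or minus about two standard errors. A minimal Python sketch, assuming a normal approximation (my assumption) and starting from the rounded +.43, so the last digit of each endpoint can differ from the range quoted above:

```python
coef = 0.43      # published IBB coefficient, rounded to two decimals
std_err = 0.013  # published standard error
z_95 = 1.96      # two-sided 95% normal quantile (normal approximation assumed)

low = coef - z_95 * std_err
high = coef + z_95 * std_err
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Either way, the interval sits nowhere near the run value you would expect from an IBB, which is the point: the regression is very sure of a number that makes no baseball sense.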
Let’s repeat the regression, but on 2007 data. We have 38,876 3-out innings, and the results of the regression (r=.877):
+0.37 BB
+0.51 1B
+0.58 Error
+0.78 2B
+1.02 3B
+1.39 HR
Most stayed the same, but notice how the reaching base on error jumped? The standard error was only .013, so it is fairly different from the 2008 value. The hit batter again comes in low (+.27 runs) and the IBB comes in high (+.40 runs). An additional reason for the high run value of the IBB is that the subsequent batters may be good hitters, and for the hit batter, you can have the opposite conditions. Well, that’s the story of what the regression might be telling you.
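One rough way to put a number on "fairly different" is a z-score for the gap between the two seasons' error coefficients. This Python sketch assumes the .013 standard error applies to both the 2007 and 2008 estimates and that the two seasons are independent samples; both are assumptions on my part, since only one standard error is quoted for each year.

```python
import math

coef_2008 = 0.53  # reached-on-error coefficient, 2008
coef_2007 = 0.58  # reached-on-error coefficient, 2007
std_err = 0.013   # assumed to hold for both seasons

diff = coef_2007 - coef_2008
se_diff = math.sqrt(std_err ** 2 + std_err ** 2)  # independent estimates assumed
z = diff / se_diff
print(f"difference {diff:.2f}, standard error {se_diff:.4f}, z = {z:.2f}")
```

Under those assumptions the gap is a bit under three standard errors, which is more than you would like to see from sampling noise alone.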
On the other hand, if you do it the right way, and look at run expectancy charts, as detailed in The Book, we don’t get these egregious differences.
We’ll try one more, for 2006. With 38,824 innings, and an r=.879, we get:
+0.37 BB
+0.51 1B
+0.60 Error
+0.77 2B
+1.07 3B
+1.38 HR
Check out the jump for the error. Otherwise, things were pretty stable. Hit batter again came in low (+.27) and IBB high (+.41).
What we’ve laid out here is a progression of analysis techniques of sorts. The first part simply presumes “all other things equal” (which is not true) to show you the impact of each subsequent event. The second part uses regression which (tries to) account for the “other things”. It succeeds mostly, but not totally. The third part was detailed in The Book, which is the best way to do it.


It is nice to see that the regression weights are reasonable at the inning level. And I would think it would be a blow to those who believe that season-level regressions provide any special insight into how runs are scored (this by now is a minority viewpoint, but I’m sure there are still a few hardliners out there).