Monday, June 12, 2006
The Secret Recipes of the Run Expectancy Matrix
To make some sense of the Run Expectancy (RE) matrix, I’ll show you how to deconstruct and reconstruct an RE matrix, without play-by-play data.
When you look at this page:
http://www.tangotiger.net/RE9902.html
does it mean anything to you? On the surface, it should at least tell you that this represents the average number of runs that scored from a particular base/out state to the end of the inning, from 1999-2002. This is an important chart to understand, and it is the core chart behind everything from Leverage Index to Win Probability to Linear Weights, and everything in between. To better appreciate those concepts, you should spend a lot of time on the core chart.
The Book also has an entire chapter devoted to Run and Win Expectancy, and I recommend that the interested reader pick up a copy. It also explains in detail how the numbers are derived.
But, is there a shortcut, especially for years with limited data? You bet! And the shortcut I’m about to present is the absolute easiest shortcut with the absolute minimum amount of data. What we are about to do is create an RE chart from the ground-up, using basic logic. Ready?
Runs Scored = number of baserunners times % of baserunners that score plus home runs
which we’ll write as:
R = baserunners * ScoreRate + HR
or
R = br * sr + HR
This is a truism. If you have 10 baserunners in a game, and 30% of them score, and you hit 1 homerun, how many runs scored? Four. 10 x .30 + 1 = 4. Got it? Keep that aside for now.
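As a quick check of the truism, here is a minimal Python sketch. The function name `runs_scored` and its parameter names are illustrative assumptions, not anything from the original data.

```python
def runs_scored(baserunners, score_rate, home_runs):
    """R = br * sr + HR: baserunners who come around to score, plus home runs."""
    return baserunners * score_rate + home_runs


# The example from the text: 10 baserunners, 30% of them score, 1 homerun.
example_runs = runs_scored(10, 0.30, 1)
print(round(example_runs, 3))
```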
Also remember the 3,2,1 rule. With 0 outs, you have 3 times more chance of scoring than with 2 outs. With 1 out, you have twice the chance of scoring as with 2 outs. So, if you have, on average, a 30% chance of scoring, you probably have a 45% chance with 0 outs, 30% with 1 out, and 15% with 2 outs. Keep that aside for now, too.
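The 3,2,1 rule can be sketched the same way. The helper name `score_rate_by_outs` is hypothetical, and it assumes the three out states are weighted equally, so that the 1-out chance equals the overall average.

```python
def score_rate_by_outs(average_rate):
    """Split an average scoring chance into (0-out, 1-out, 2-out) chances in a 3:2:1 ratio."""
    # The average of 3x, 2x and x is 2x, so the 2-out chance x is half the average.
    two_out = average_rate / 2
    return (3 * two_out, 2 * two_out, two_out)


# A 30% average chance of scoring, split by outs.
print([round(rate, 3) for rate in score_rate_by_outs(0.30)])
```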
Let’s take the 1994 data:
http://www.retrosheet.org/boxesetc/YS_1994.htm
We see that 15752 runs scored in 28586.1 innings, for an average of 0.551 runs per inning. There were also 3306 homeruns, for an average of 0.116 HR per inning. Let’s go back to our truism:
.551 = br * sr + .116
which means that:
br * sr = .551 - .116 = .435
That .551 goes into the “bases empty, 0 outs” cell.
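The per-inning rates above can be reproduced with a few lines of Python. The variable names are illustrative, and reading 28586.1 as 28586 and one-third innings is an assumption about the innings notation; using 28586.1 literally gives the same rounded results.

```python
RUNS_1994 = 15752
HOME_RUNS_1994 = 3306
# Assumption: "28586.1" is innings notation for 28586 and one-third innings.
INNINGS_1994 = 28586 + 1 / 3

runs_per_inning = RUNS_1994 / INNINGS_1994
hr_per_inning = HOME_RUNS_1994 / INNINGS_1994
# From R = br * sr + HR, the non-homerun part is br * sr = R - HR.
br_times_sr = runs_per_inning - hr_per_inning

print(round(runs_per_inning, 3), round(hr_per_inning, 3), round(br_times_sr, 3))
```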
Now, how about with bases empty and 1 out? That is, if an inning in baseball had only 2 outs, and if you only had the data above, how many runs would score per inning?
Well, if you have a 45% chance of scoring with 0 outs, 30% with 1 out, and 15% with 2 outs, then the average is 30%. But, if you are coming to bat with bases empty and 1 out, then a baserunner will score either 30% of the time (with 1 out) or 15% of the time (with 2 outs), for an average of 22.5%. As you can figure out, this figure for a two-out inning is 75% of the figure for a three-out inning.
As well, with two-thirds of the inning left, you will only have two-thirds of the baserunners, and two-thirds of the homeruns.
R = (br * 2/3) * (sr * 3/4) + (.116 * 2/3)
R = (br * sr) * (1/2) + (.116 * 2/3)
Since (br * sr) = .435 for this particular year, we get:
R = .435 * (1/2) + (.116 * 2/3)
R = .295
That .295 goes into the “bases empty, 1 out” cell.
Finally, how about with 2 outs? That is, if you are down to the last out for the inning, how many runs will you score with the bases empty?
R = (br * 1/3) * (sr * 1/2) + (.116 * 1/3)
R = (br * sr) * (1/6) + (.116 * 1/3)
R = .435 * (1/6) + (.116 * 1/3)
R = .111
So, the first line of our RE chart for 1994, for bases empty, reads:
.551, .295, .111
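The three bases-empty cells follow one pattern, which this hedged Python sketch makes explicit. The function name `bases_empty_line` is an assumption of mine; it takes the rounded inputs .435 and .116 from the text and applies the same fractions (share of the inning left, and the 3,2,1 score-rate weights for the out states still to come).

```python
def bases_empty_line(br_times_sr, hr_per_inning):
    """Estimate bases-empty run expectancy with 0, 1 and 2 outs."""
    weights = [3, 2, 1]  # 3,2,1 rule: relative scoring chance with 0, 1, 2 outs
    full_inning_average = sum(weights) / len(weights)
    line = []
    for outs in (0, 1, 2):
        share_of_inning_left = (3 - outs) / 3  # scales both baserunners and homeruns
        remaining = weights[outs:]
        # Average score rate over the remaining out states, relative to a full inning.
        sr_factor = (sum(remaining) / len(remaining)) / full_inning_average
        runs = br_times_sr * share_of_inning_left * sr_factor
        runs += hr_per_inning * share_of_inning_left
        line.append(runs)
    return line


print([round(cell, 3) for cell in bases_empty_line(0.435, 0.116)])
```

Feeding in a different year's runs per inning and HR per inning should give that year's bases-empty line under the same assumptions.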
The actual 1999-2002 data, for bases empty, for a slightly higher scoring environment, reads:
.555, .297, .117
Pretty cool, right? The 1994 line was determined using exactly two pieces of information: runs per inning and HR per inning. The 1999-2002 line was determined using 800,000 plays. Imagine, we will be able to fill out the entire RE matrix, without even looking at the play-by-play data. So far, we’ve got the first line down. And this was really the hard part. Once we can complete this, we will then be able to generate Linear Weights values for all events for any team or year in history. Stay tuned!
***
Updated: June 13 - Chances of Scoring
Start thinking about the RE chart out loud, and take each number, one at a time. The first is the .555. That’s the average number of runs that score from the start of the inning. Now, suppose I told you that a guy on 1B with no outs will score 39.8% of the time. Since we know that, from the batter at the plate to the last batter for the inning, the team will score .555 runs, we need to add the .398 runs that the runner on 1B represents. .555 + .398 = .953. This number is the RE with man on 1B and no outs. And we see this number in the chart. If you subtract the “Empty” line from the “1B” line, this will give you roughly the percentage of times that the runner on 1B will score with 0,1,2 outs. Those numbers are, respectively, .398, .276, .134. Pretty neat, right?
Do the same thing with the “2B” line. The chance of scoring from 2B with 0,1,2 outs is: .634, .428, .227. And from 3B it’s .927, .686, .270.
Now, it doesn’t exactly work that way due to selective sampling, and sample size issues, but it’s a pretty good start.
It’s now time for a quiz. What’s the RE with man on 1B and 2B and 0 outs? We know the chance that the runner on 1B will score is .398, so that’s how much run expectancy (RE) he gets. The runner on 2B is worth .634. The batter at the plate, and all subsequent batters to the end of the inning, are worth .555. Add them up: .398 plus .634 plus .555 gives you 1.587 runs. Our chart tells us 1.573. Pretty close!
Theoretically, you should be able to construct the entire RE chart for any era knowing only: the number of runs scored per inning (which we do know for any team or year in MLB history), and the chance that a runner will score from any base/out (which we don’t know).
(Note: The dependence of the runners on base, say bases loaded, makes it that it’s not such a simple additive exercise. But, it’s a great starting point.)
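The quiz above is just a sum, and a small Python sketch shows the pattern for any combination of runners. The names `EMPTY_RE`, `SCORE_CHANCE` and `additive_re` are illustrative; the numbers are the 1999-2002 figures quoted in the text, and the sketch deliberately ignores the runner dependence mentioned in the note.

```python
# 1999-2002 figures quoted in the text, indexed by outs (0, 1, 2).
EMPTY_RE = (0.555, 0.297, 0.117)
SCORE_CHANCE = {
    "1B": (0.398, 0.276, 0.134),
    "2B": (0.634, 0.428, 0.227),
    "3B": (0.927, 0.686, 0.270),
}


def additive_re(runners, outs):
    """Bases-empty RE plus each runner's chance of scoring (simple additive estimate)."""
    return EMPTY_RE[outs] + sum(SCORE_CHANCE[base][outs] for base in runners)


# Men on 1B and 2B, 0 outs: .555 + .398 + .634
print(round(additive_re(("1B", "2B"), 0), 3))
```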
How can we figure out the chance of a runner scoring from any base/out? Stay tuned!
***
Updated: June 13 - Scoring from 3B
Of the various base/out states, the easiest one to figure out the chance of scoring is the man on 3B, no one else on, and 2 outs. What has to happen for this runner to score? Well, the batter has to get a basehit, or reach base on error. If he’s out, the inning is over. If he walks or is hit by a pitch, it’s a “do-over”, unless you get three consecutive “do-over” plays.
In short, your chance of scoring from 3B = batting average + error average + “3 consecutive walks” average
The batting average is whatever it happens to be for the league (make sure to count SF as a regular out). It was .267 in 1994. The error average is typically around .015. The do-over plays occur 10% of the time, which means 3 consecutive do-over plays is .001. So, the chance of scoring from 3B with no one else on and two outs in 1994 is .283.
Using just the RE tables, we were expecting .270, which is close enough.
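Here is the two-out case as a minimal Python sketch. The function and parameter names are my own; the inputs are the 1994 batting average, the typical error rate, and the 10% do-over rate given above.

```python
def score_from_third_two_outs(batting_avg, error_rate, do_over_rate):
    """Chance a lone runner on 3B scores with 2 outs: hit, error, or 3 straight do-overs."""
    three_straight_do_overs = do_over_rate ** 3
    return batting_avg + error_rate + three_straight_do_overs


print(round(score_from_third_two_outs(0.267, 0.015, 0.10), 3))
```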
How about with one out? This one gets trickier, since we need to know about sac flies, or ground outs that score the runner. Let’s say we don’t know about those pieces of information. The chance of this runner scoring on a hit, error, or “do-overs” is .283 + (1-.283)*.283 = .486. However, sac flies and the like do give our runner a chance to score with 1 out.
The strikeout rate in 1994 was .178 SO per (AB+SF). The batting average + error average is .267 + .015 = .282. That leaves the groundball and flyball rate as 1 - .282 - .178 = .54
Let’s assume that 50% of the time, the runner from 3B will score on an out with less than two outs.
Maybe it should be 40%. It’s not too important right now, but will be if you want to be serious about your RE charts. We just need to come up with a reasonable number, and a little research will give us the result. This becomes important in the older years, where parks were much different, and the aggressiveness of the runners was different. Using the SF, if you have it, would definitely help. A little rolling-up-the-sleeves work is warranted here.
So, we have chance of scoring on 1 out = .283 + .27 = .553
Chance that he will still be on 3B with 2 outs = 1 - .553 = .467
Chance that he will subsequently score: .467 * .283 = .132
Add it up: .553 + .132 = .685
Using just the RE tables, we were expecting .686. Nice, right?
Finally, how about 0 outs. Repeating the same process:
Chance of scoring on 0 outs: .553
Chance that he will be on 3B with 1 out: .467
Chance that he will subsequently score: .467 * .685 = .320
Add it up: .553 + .320 = .873
And the RE tables told us .927. There is a gap here, but we are at the mercy of our sample size. If you refer to Table 9 in The Book, you will see that our expectation should have been around .87.
All we have to do now is add the .873, .685, .283 figures to the “bases empty” line of .551, .295, .111.
Our “man on 3B” RE line reads:
1.424, .980, .394
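The chain of steps for the runner on 3B can be sketched in Python as below. The variable names are illustrative, and the 50% chance of scoring on a ball-in-play out is the assumption made in the text, not a measured rate. One thing to watch: computed directly, 1 - .553 is .447, so the sketch comes out near .680 with 1 out and .857 with 0 outs, a little under the .685 and .873 used in the text.

```python
P_SCORE_TWO_OUTS = 0.283        # hit, error, or three straight do-overs
P_HIT_OR_ERROR = 0.282          # .267 batting average + .015 error average
P_STRIKEOUT = 0.178             # 1994 strikeouts per (AB + SF)
P_SCORE_ON_BALL_IN_PLAY_OUT = 0.50  # assumed in the text; 40% may be closer

# Groundball and flyball outs are what is left over.
p_ball_in_play_out = 1 - P_HIT_OR_ERROR - P_STRIKEOUT

# Chance the runner scores on the current plate appearance with fewer than 2 outs.
p_score_now = P_SCORE_TWO_OUTS + P_SCORE_ON_BALL_IN_PLAY_OUT * p_ball_in_play_out
# Otherwise he is treated as still standing on 3B with one more out.
p_stay = 1 - p_score_now

p_one_out = p_score_now + p_stay * P_SCORE_TWO_OUTS
p_zero_out = p_score_now + p_stay * p_one_out

print(round(p_score_now, 3), round(p_stay, 3))
print(round(p_one_out, 3), round(p_zero_out, 3))
```

Changing `P_SCORE_ON_BALL_IN_PLAY_OUT` to 0.40 is an easy way to see how sensitive the 0-out and 1-out chances are to that assumption.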
You just have to keep following the same process for all the other lines. It’s going to start getting complex, but not complicated.
***
Updated: June 15 - Scoring from 1B, 2B
Rather than going through the complex process right now, let’s look at it from the other viewpoint that I brought up earlier. Remember I started off with this truism:
R = br * sr + HR
And in 1994, on a per inning basis, I said that:
br * sr = .435
This means that if you had 1 baserunner (br) per inning, then you expect the average baserunner to score 43.5% (sr) of the time. If you had 1.5 baserunners (br) per inning, then your expectation is 29% (sr).
br * sr can be broken down as:
br * sr
= br1 * sr1
+ br2 * sr2
+ br3 * sr3
= .435
Where the “1,2,3” signifies the base. So, “br3 * sr3” is the number of initial baserunners at third base times the chance that the runner on third base will score. We know what sr3 is equal to, it’s the average of:
.873, .685, .283
which is .614
As for br3, that’s pretty much the number of triples per inning, which in 1994 was .025. (We should also count those baserunners that got on third base via error, but those are rather rare, and just get in the way at this point.)
So, br3 * sr3 = .015
This leaves us with this equation:
br1 * sr1
+ br2 * sr2
= .435 - .015
= .420
We’re getting close. Stay with me.
From 1960 to 2004, using the RE charts provided by Tom Ruane at Retrosheet, the gap between the chances of scoring from second base (sr2) and scoring from first base (sr1) was .170. And, this is true, regardless of the run environment. The top 20 run environments of the 90 league-seasons in this time period averaged .552 runs per inning (5.0 runs per game), and the gap (sr2 - sr1) was .167. The bottom 20 run environments averaged .425 runs per inning (3.8 runs per game) with a gap of .168. Even if I limit it to the top and bottom 5 run environments (5.3 runs per game, gap of .162, and 3.6 runs per game, gap of .175), the gap is essentially unchanged. The correlation between the 90 run environments and the 90 gaps is an r of .004, which again, is essentially zero.
Our equation is now:
br1 * sr1
+ br2 * (sr1 + .170)
= .420
Which is
sr1 * (br1 + br2)
+ .170 * br2
= .420
Conveniently, and just a little bit more dangerously, if we ignore runners reaching second base by error, the number of initial runners at second base was the doubles per inning rate of .200 (that’s br2).
Our equation is now:
sr1 * (br1 + br2)
= .420 - .170 * .200
= .386
Almost there. How many baserunners were there in 1994? Excluding triples and homeruns, br1 + br2 was around 39285 per 28586 innings for an average of 1.374. (Estimate based on hits minus 3B minus HR plus walks plus hit batters plus interference plus reached on error.)
sr1 * 1.374 = .386
This makes sr1 = .281
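The algebra above boils down to one line, sketched here in Python. The constant names are illustrative; the values are the 1994 figures from the text, and the sketch carries unrounded intermediates, which still rounds to the same answers.

```python
BR_TIMES_SR = 0.435            # runs per inning minus HR per inning
BR3, SR3 = 0.025, 0.614        # triples per inning, average chance of scoring from 3B
GAP_SR2_MINUS_SR1 = 0.170      # sr2 - sr1, from the 1960-2004 RE charts
BR2 = 0.200                    # doubles per inning
BR1_PLUS_BR2 = 39285 / 28586   # baserunners per inning, excluding triples and homeruns

# br1*sr1 + br2*(sr1 + gap) = br*sr - br3*sr3, solved for sr1.
sr1 = (BR_TIMES_SR - BR3 * SR3 - GAP_SR2_MINUS_SR1 * BR2) / BR1_PLUS_BR2
sr2 = sr1 + GAP_SR2_MINUS_SR1

print(round(sr1, 3), round(sr2, 3))
```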
That’s it. The chance of scoring from first base, in 1994, is .281. The chance of scoring from second base is .281 + .170 = .451. Applying our 3,2,1 rule, our chances of scoring from 1B with 0,1,2 outs are:
.421, .281, .141
Making our “1b only” line:
.972, .576, .252
Our chances of scoring from 2B are:
.677, .451, .226
Making our “2b only” line:
1.228, .746, .337
The other lines follow a similar process, giving us this estimated RE chart for 1994:
EST 1994    0 outs   1 out    2 outs
Empty       0.551    0.295    0.111
1st         0.972    0.576    0.252
2nd         1.228    0.746    0.337
3rd         1.424    0.980    0.394
1st_2nd     1.649    1.027    0.478
1st_3rd     1.845    1.261    0.535
2nd_3rd     2.101    1.431    0.620
Loaded      2.522    1.712    0.761
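The estimated chart is the bases-empty line plus the scoring chance of each runner on base, which this Python sketch rebuilds. The names `EMPTY`, `CHANCE`, `STATES` and `build_chart` are my own; the inputs are the 1994 figures derived in the text, taken as given.

```python
# 1994 estimates from the text, indexed by outs (0, 1, 2).
EMPTY = (0.551, 0.295, 0.111)
CHANCE = {
    "1st": (0.421, 0.281, 0.141),
    "2nd": (0.677, 0.451, 0.226),
    "3rd": (0.873, 0.685, 0.283),
}
STATES = {
    "Empty": (),
    "1st": ("1st",),
    "2nd": ("2nd",),
    "3rd": ("3rd",),
    "1st_2nd": ("1st", "2nd"),
    "1st_3rd": ("1st", "3rd"),
    "2nd_3rd": ("2nd", "3rd"),
    "Loaded": ("1st", "2nd", "3rd"),
}


def build_chart():
    """Additive RE chart: bases-empty RE plus each runner's chance of scoring."""
    chart = {}
    for state, bases in STATES.items():
        chart[state] = tuple(
            round(EMPTY[outs] + sum(CHANCE[base][outs] for base in bases), 3)
            for outs in (0, 1, 2)
        )
    return chart


chart = build_chart()
for state, cells in chart.items():
    print(state, cells)
```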
And what did Ruane at Retrosheet calculate for us?
RE 1994     0 outs   1 out    2 outs
Empty       0.549    0.300    0.116
1st         0.936    0.565    0.258
2nd         1.172    0.728    0.367
3rd         1.464    0.999    0.419
1st_2nd     1.595    0.936    0.472
1st_3rd     1.767    1.175    0.572
2nd_3rd     2.045    1.470    0.694
Loaded      2.391    1.573    0.798
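To put a number on how far off the estimate is, here is a hedged Python sketch that compares the two 1994 charts cell by cell. The dictionary and variable names are illustrative, and the choice of mean absolute gap as the summary is mine, not something from the original analysis.

```python
ESTIMATED = {
    "Empty": (0.551, 0.295, 0.111), "1st": (0.972, 0.576, 0.252),
    "2nd": (1.228, 0.746, 0.337), "3rd": (1.424, 0.980, 0.394),
    "1st_2nd": (1.649, 1.027, 0.478), "1st_3rd": (1.845, 1.261, 0.535),
    "2nd_3rd": (2.101, 1.431, 0.620), "Loaded": (2.522, 1.712, 0.761),
}
RETROSHEET = {
    "Empty": (0.549, 0.300, 0.116), "1st": (0.936, 0.565, 0.258),
    "2nd": (1.172, 0.728, 0.367), "3rd": (1.464, 0.999, 0.419),
    "1st_2nd": (1.595, 0.936, 0.472), "1st_3rd": (1.767, 1.175, 0.572),
    "2nd_3rd": (2.045, 1.470, 0.694), "Loaded": (2.391, 1.573, 0.798),
}

# Absolute gap for every (base state, outs) cell.
gaps = {
    (state, outs): abs(ESTIMATED[state][outs] - RETROSHEET[state][outs])
    for state in ESTIMATED
    for outs in (0, 1, 2)
}

mean_abs_gap = sum(gaps.values()) / len(gaps)
worst_state = max(gaps, key=gaps.get)
worst_gap = gaps[worst_state]

print(round(mean_abs_gap, 3), worst_state, round(worst_gap, 3))
```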
Now, it does look a little off. A few things conspired against us. First, we counted the runners who reached on error as starting from first base only. Second, we did not account for runners who reach base on sacrifice hits. If we had left on base (LOB) data instead, this process would have been a bit better. After all, PA = R + LOB + Outs. That is, every batter either scores a run, is left on base, or is put out.
You now have the basis to calculate the RE for any team or league.
More to come…
Awesome stuff Tango - I can’t believe it is so close to the Markov approach ... I’ll probably give it a crack for 2005 data