
THE BOOK--Playing The Percentages In Baseball


Monday, July 23, 2007

The fallacy of Pythagorean

By Tangotiger, 12:33 PM

Credit SABRMatt with opening my eyes to the impact.

Suppose you have a game like yesterday:


The Yanks went nuts and scored over 20 runs.  Suppose that game was followed with a shutout.  On average, they scored over 10 runs a game.  On average, they won 1 and lost 1.  Doesn’t make sense, right?

Here, let’s make it more technical and perfect:
http://www.tangotiger.net/markov.html

Set the AB to “24”, and we get this line:
AVG / OBP / SLG
0.417 / 0.500 / 0.625
Telling us that they will score 14 runs, over a 9 inning game. 

Now, set AB to a large number.  You will obviously get this line:
AVG / OBP / SLG
0.000 / 0.000 / 0.000
And you can guess the number of runs in a game.

The first game, the .500 OBP game, means you were on base 27 times and made 27 batting outs.  The second game, the perfecto, means you were on base 0 times and made 27 batting outs.  After two games, you got on base 27 times and made 54 batting outs, for a 0.333 OBP.  (A .333 OBP implies 4.7 runs per game.)

After two games however, you scored 14 runs total, or an average of 7 runs per 9 innings.
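The arithmetic of the two-game example can be checked with a minimal sketch. The function name is made up for illustration, and the 4.7 runs per game that goes with a .333 OBP is taken from the post rather than computed here.

```python
def pooled_obp(games):
    """Pool times on base and PA across games.

    Each game is a (times_on_base, batting_outs) pair.
    """
    on_base = sum(ob for ob, _ in games)
    plate_appearances = sum(ob + outs for ob, outs in games)
    return on_base / plate_appearances


# Game 1: the .500 OBP game (27 on base, 27 outs).
# Game 2: the perfecto (0 on base, 27 outs).
games = [(27, 27), (0, 27)]
obp = pooled_obp(games)

runs_scored = [14, 0]
avg_runs = sum(runs_scored) / len(runs_scored)

print(f"pooled OBP            = {obp:.3f}")   # 27 / 81
print(f"average runs per game = {avg_runs:.1f}")
```

The pooled OBP describes a 4.7-run offense, while the raw average says 7 runs per game. That gap is the disconnect.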

You see the disconnect here?  Now, given a large enough number of games, all these wild and crazy games will balance out.  By large, I mean LARGE, not 81 or 162.  I’m talking about several seasons’ worth.

For this reason, it makes no sense to use the average runs per game to establish the Pythagorean record.  You should convert the runs figure down to something bases-like, or convert it up to something wins-like.  A game where you score 14 runs total will give you a winning record of around .900, and a game where you are perfected-out will give you a winning record of .000.  The average of the two is .450.  Not quite the .500 we are looking for, but far better than the winning record of around .700 that would be implied by taking the average of 14 and 0 runs and then converting to wins.

So, the best solution is to convert to something OBP-like.  The next best solution, very close behind, would be to convert to something wins-like.  The third best solution, far behind, would be to stick to the cumulative runs scored and allowed figures.

Thanks Matt.

#1    Patriot      (see all posts) 2007/07/23 (Mon) @ 13:08

I’m not sure I understand what you are getting at.  I have always thought that the purpose of a Pyth estimator was to say “given that this team scores X runs per game, and given that they have a usual distribution of runs scored across games, they will win Y games”.

Now obviously 20 and 0 is not a usual distribution.  And since the hits were all bunched together in one game, the number of runs far exceeds the number of expected runs. 

As far as I can tell, the wild and crazy games do balance out fairly well over the course of a season.  Teams’ actual runs scored figures usually don’t deviate from their expected runs by very much (a standard error of ~ 25 runs/season). 

So why not just use Runs Created as the cumulative figure?  When you start converting each game as its own unit, you are getting away from what I have always seen as the fundamental purpose of using win estimators to evaluate teams--the assumption that the actual distribution of runs needs a long time to go to ability, and that in small numbers of games the average runs/game (or the average RC/G if we are tossing out the actual runs for the reasons you explained) is a better predictor of future distribution than the current actual distribution is.


#2          (see all posts) 2007/07/23 (Mon) @ 13:27

I disagree.

You can take a look at run distributions and see that by and large, most teams have the same run distribution. 

Run Distributions from 1998-2004: http://beyondtheboxscore.com/story/2006/2/23/164417/484
From 2006:
http://www.hardballtimes.com/main/article/feast-or-famine-first-draft/

While a few 25-run games will skew the distribution somewhat, there is nothing to indicate that teams have the ability to control the shape of their run distribution.  Well, that’s not totally true; teams that hit lots of homers will not be shut out as often as you would expect (http://www.hardballtimes.com/main/article/avoiding-the-famine/).  But beyond that, there’s no reason to think that a few offensive explosions are indicative of anything other than the overall quality of the team.

Of course, the Pythag has a std error of +/- .04 (~6 games) over a season even if teams distributed runs exactly as according to the Weibull distribution.  I think this goes as 1/(2*sqrt(games)), so even if we played 1000 games, the std error would be .016 (~2.5 equivalent games over a 162 game season).  If we’re interested in using the Pythag as an estimator for true quality, it helps to know how much overlap there is between, say, a “true” 90-win team and a “true” 60-win team, which you can see here:
http://beyondtheboxscore.com/story/2006/4/10/174829/417
http://beyondtheboxscore.com/story/2006/4/20/85517/0034
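The 1/(2*sqrt(games)) rule of thumb quoted above has the same form as the binomial standard deviation of win% for a .500 team, sqrt(.5*.5/games). A quick sketch of the two figures mentioned, with a made-up function name:

```python
import math


def pythag_std_error(games):
    """Rule of thumb from the comment: 1 / (2 * sqrt(games))."""
    return 1.0 / (2.0 * math.sqrt(games))


for n in (162, 1000):
    se = pythag_std_error(n)
    # Express the win% error as wins over a 162-game season.
    print(f"{n:>4} games: std error {se:.3f}, about {se * 162:.1f} wins per 162")
```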


#3    Tangotiger      (see all posts) 2007/07/23 (Mon) @ 13:31

SABRMatt is tracking Pythag two ways, both using PythagenPat:
1. aggregating RS, RA, and figuring win%
2. figuring win% using the RS, RA of one game, and averaging those win%
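A rough sketch of the two methods, for illustration only. The exponent form (runs per game raised to .28) is borrowed from the shortcut in post 11; applying it to a single game, the handling of zero-run games, and the game log itself are all assumptions, not Matt's actual code or data.

```python
def pythagenpat(rs, ra, games=1, k=0.28):
    """PythagenPat win%: the exponent is (runs per game) ** k."""
    if rs == ra:
        return 0.5   # even runs (also covers a degenerate 0-0 input)
    if ra == 0:
        return 1.0   # a shutout win
    if rs == 0:
        return 0.0   # a shutout loss
    x = ((rs + ra) / games) ** k
    return rs ** x / (rs ** x + ra ** x)


def aggregate_method(game_log):
    """Method 1: add up RS and RA, then figure win% once."""
    rs = sum(g[0] for g in game_log)
    ra = sum(g[1] for g in game_log)
    return pythagenpat(rs, ra, games=len(game_log))


def per_game_method(game_log):
    """Method 2: figure win% for each game, then average those win%."""
    return sum(pythagenpat(rs, ra) for rs, ra in game_log) / len(game_log)


# Hypothetical (RS, RA) pairs: one blowout win plus three close games.
game_log = [(21, 4), (0, 3), (5, 4), (2, 3)]
print(f"aggregate: {aggregate_method(game_log):.3f}")
print(f"per game:  {per_game_method(game_log):.3f}")
```

With a blowout in the log, the aggregate method comes out well above the per-game average, which is the kind of gap the two columns are meant to expose.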

The results of both methods are here:
http://detectovision.com/?p=1054

And, the largest gap, as of today, is the Mariners (460 RS, 456 RA), with the first method saying .504 and the second method saying .551.  Their actual record is .568.

I’m not sure that Matt’s way is better than what I’m proposing (sticking to the bases-like, meaning OBP, SLG).  If we look at the aggregate OBP and SLG, the Mariners hitters are doing worse than the Mariners pitchers, and Putz aside, they should have a Pythag record of UNDER .500!

***

The point remains that it takes a certain number of games before the imbalance (illustrated by the two-game example) balances out.

There is a certain range where wOBA (or 2OPS) is better than RS and RA, which in turn are better than actual winning percentage.  That is, after say 10,000 games, it doesn’t matter what the wOBA, OPS, RS, or RA was: the actual winning percentage will trump those other metrics, because it will contain additional information and have an uncertainty range that is no greater than that of the other measures.  After two games, wOBA trumps the other measures.  Everything in between is for us to figure out: at which point does one measure trump the next?


#4    Tangotiger      (see all posts) 2007/07/23 (Mon) @ 13:35

I’m not talking about offensive explosions being indicative of anything. 

I’m talking about taking two exactly equal teams (using BA, OBP, SLG).  But if you clump together their OBP/SLG explosions, they will end up with more runs scored overall than otherwise.  Those extra runs are meaningless, since they are the product of random synergy.

Like I said, my two-game example should give you a total of 9.4 runs.  That two-game example, based on luck of clumping, produced 14 runs.  That means that 4.6 of those runs don’t count in terms of evaluating the talent.


#5    Tangotiger      (see all posts) 2007/07/23 (Mon) @ 13:37

That is, I’m throwing out the 4.6 runs, without throwing out a single hit or out.


#6          (see all posts) 2007/07/23 (Mon) @ 13:50

Okay, I understand now.

I guess the first thing to do would be to test whether converting raw hitting stats to runs would be an advantage over the current Pythag.  I would imagine that it wouldn’t, considering that actual runs tracks pretty well to runs created, but I’m willing to be convinced in this regard.


#7    Guy      (see all posts) 2007/07/23 (Mon) @ 14:12

"And, the largest gap, as of today, is the Mariners (460 RS, 456 RA), with the first method saying .504 and the second method saying .551.  Their actual record is .568.”

I think I can improve on Matt’s model.  For each game, instead of assigning a pythag probability, assign a 1 if the team scores more runs than the opponent, and a zero if fewer runs.  Average the results.  I think you’ll get an extremely close correlation with actual team win%.

Seriously, I don’t get the point of this.  The measure of pythag’s value isn’t how closely it reflects a team’s current record (we know that) but its future record (i.e. real talent).  And the whole point of pythag is that a team winning blowout games is actually BETTER than its record indicates, while Matt’s point seems to be to correct for pythag allegedly OVERESTIMATING such teams’ real ability.  I think he’s mistaken about that. 

Tango’s point—that measures like OBP and SLG are more accurate measures of true offensive talent than RS over a small # of games—is of course true.  But didn’t we already know that?


#8    tangotiger      (see all posts) 2007/07/23 (Mon) @ 15:03

But I’m also suggesting that even if runs scored (the 21 that the Yanks scored) IS indicative of their true talent, that talent must be expressed in something additive like wOBA or winning percentage.

The entire point of my example is that you can have two .333 OBP games, or one at .000 (27 PA) and one at .500 (54 PA) with an aggregate average of .333, and they MUST yield the same expression of talent level.  But because of the compounding nature of how runs are recorded (runners left on base count as zero runs, runners that score count as one run), you lose that.

If, on the other hand, you were to assign say 0.6 runs for players who cross home plate, 0.4 for guys left on 3B, 0.2 for guys left on 2B, 0.1 for guys left on 1B (or something along those lines), then perhaps you wouldn’t have this problem when you aggregate runs scored.

PA are additive.  Times on Base are additive.  Runs Scored are NOT additive, insofar that you want to relate it directly to the talent level.


#9    Will      (see all posts) 2007/07/23 (Mon) @ 15:06

Is there a table on the Internet somewhere that converts offensive runs scored in a game to an estimated win %?  Tango’s post states that a game where you score 14 runs would give you a winning record of around .900.


#10    tangotiger      (see all posts) 2007/07/23 (Mon) @ 15:07

If it’s not clear, I’m saying to use actual runs scored as the basis, and express it in OBP terms.  A team scored 14 runs?  That’s an OBP of .500 (27 times on base in 54 PA).  A team scored 0 runs?  That’s an OBP of .000 (0 times on base in 27 PA).

Total?  27 for 81, or a .333 OBP.  That .333 OBP translates as 4.7 runs per game.

Remember, I had no idea what their actual OBP or SLG was in each game. I simply created an equivalent OBP, as a proxy for the actual runs scored.  And it’s that equivalent OBP that I am using to add, and not runs scored.


#11    tangotiger      (see all posts) 2007/07/23 (Mon) @ 15:15

Will/9: There is the Tango Distribution (though I didn’t program it for that high an offensive era… it’s set to max 10 runs per inning and 20 runs per game I think).

However, you can do it yourself with a shortcut. If you have 14 runs scored in a league where they allow 4.7, first figure out your exponent as:
(14+4.7)^.28 = 2.27
Then:
(14/4.7)^2.27 = 11.9 = W/L
That makes it a win% of .923
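The shortcut above, written out as a small sketch. The function and parameter names are made up; the 4.7 league figure and the .28 power are the ones used in the post.

```python
def single_game_win_pct(runs, league_runs=4.7, k=0.28):
    """Win% for scoring `runs` against a league-average `league_runs` allowed."""
    exponent = (runs + league_runs) ** k
    wl_ratio = (runs / league_runs) ** exponent   # wins per loss
    win_pct = wl_ratio / (wl_ratio + 1.0)
    return exponent, wl_ratio, win_pct


exponent, wl_ratio, win_pct = single_game_win_pct(14)
print(f"exponent = {exponent:.2f}")
print(f"W/L      = {wl_ratio:.1f}")
print(f"win%     = {win_pct:.3f}")
```

Looping the same function over 0 to 20 runs would give the kind of runs-to-win% table Will asked about, under the assumption that the shortcut holds at every run level.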

If you want to work that backwards into OBP, it’s a little more complicated, which is why I just use the Perfect Run Modeler, and trial and error my way there.


#12          (see all posts) 2007/07/23 (Mon) @ 15:22

If we look at the aggregate OBP and SLG, the Mariners hitters are doing worse than the Mariners pitchers, and Putz aside, they should have a Pythag record of UNDER .500!

I agree with Guy on the purposefulness of all this, but according to my Markov model based on batting events and base running events I get RS = 4.57 rpg; RA = 4.75 rpg for a w% of about .48 ... I’d be amazed if they played in October. (And it also shows how awesome the ‘pen is)


#13          (see all posts) 2007/07/23 (Mon) @ 15:23

Guy,

He’s not mistaken. Blowouts lead the Pythag method to overestimate a team’s real ability.  For this reason: The likelihood of scoring another run, given that run X has already scored, is greater for a greater X.  That is, if a team has already scored 10 runs, their likelihood of scoring again is greater than if they had scored 5. For the same team, with the same “real” ability.

The Pythagorean method ignores this difference. It treats the Yankees’ 20th run against Tampa Bay the same as a run in, say, a 1-run game against King Felix. To be as accurate as possible (i.e., reduce variance between prediction and actual win percentage), the Pythagorean method should discount runs after the first.

I can prove this empirically. Using only 2006 data, I did the math and found that using a discount factor improves your prediction of win percentage. Each run beyond the first should be discounted at about .925.  That is, the Yankees’ 20th run in their blowout should add only .925^19 (.227 runs) to their Pythagorean runs-scored tally. 

Discounting runs this way results in a pretty substantial reduction in the variance between predicted and actual wins (reduced from .021 to .016). It’s a better predictor of future performance, and of “real” skill.
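A minimal sketch of the discounting described in this comment, assuming the k-th run of a game counts .925^(k-1) runs. The function name is made up, and whether runs allowed get the same treatment is not stated in the comment, so only the per-game tally is shown.

```python
def discounted_runs(runs, discount=0.925):
    """Discounted tally for one game: run k (1-based) counts discount ** (k - 1)."""
    return sum(discount ** k for k in range(runs))


twentieth_run = 0.925 ** 19
print(f"value of the 20th run: {twentieth_run:.3f}")
print(f"20 runs count as:      {discounted_runs(20):.2f}")
print(f"5 runs count as:       {discounted_runs(5):.2f}")
```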


#14          (see all posts) 2007/07/23 (Mon) @ 15:25

Will - There is some historical data located in the “run distribution reports” in one of my links above (uh...this one: http://beyondtheboxscore.com/story/2006/2/23/164417/484).

For example, in 1998:
Aggregate win frequency by runs scored
0 0.00
1 0.07
2 0.20
3 0.34
4 0.51
5 0.58
6 0.70
7 0.77
8 0.82
9 0.88
10 0.90
11 0.94
12 0.95
13 0.96
14 0.97
15 1.00

In 2004:
0 0.00
1 0.08
2 0.19
3 0.33
4 0.48
5 0.61
6 0.69
7 0.74
8 0.83
9 0.90
10 0.91
11 0.97
12 1.00
13 0.97
14 0.96
15 0.92

I suppose you could make these charts by assuming league-average pitching and using the Pyth formula at each point.  For example, 2004 had ~4.8 runs/game.  So you could generate the chart for each level of offense as:

0 0.00 = 0^2/(0^2+4.8^2)
1 0.04 = 1^2/(1^2+4.8^2)
2 0.15
3 0.28
4 0.41
5 0.52
6 0.61
7 0.68
8 0.74
9 0.78
10 0.81
11 0.84
12 0.86
13 0.88
14 0.89
15 0.91
16 0.92
17 0.93
18 0.93
19 0.94
20 0.95
21 0.95

Doesn’t match up that well with observation, though.  I may have made a mistake.
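For anyone who wants to regenerate the chart, here is a sketch that assumes a fixed Pythagorean exponent of 2 in both numerator and denominator and 4.8 league runs per game. With those assumptions the output rounds to the values listed above; the function name is made up.

```python
def pyth_win_pct(runs, league_runs=4.8, exponent=2.0):
    """Fixed-exponent Pythagorean win% against league-average runs allowed."""
    rs = runs ** exponent
    ra = league_runs ** exponent
    return rs / (rs + ra)


for runs in range(22):
    print(runs, f"{pyth_win_pct(runs):.2f}")
```

A fixed exponent tuned to season totals is a crude fit for single games, which may be part of why the chart sits below the observed win frequencies in the middle of the range.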


#15    Patriot      (see all posts) 2007/07/23 (Mon) @ 15:26

I agree completely with Guy’s #7 re: Souders’ method.  Souders has written:

We know that the biggest problem with ordinary seasonal Pythag is that it puts too much emphasis on a team’s production in blowout games.  Generally, once the game starts to get out of hand, the losing side starts to send out the reserves, especially the reserve pitchers.  They in essence become a weaker team for the rest of that game.  This often leads to additional run scoring that has essentially no meaning (or very close to none).

I don’t think this conclusion is justified, for a number of reasons:
1. Teams don’t usually bail on the game until at least the 6th or 7th inning, no matter how bad the score is
2. Winning teams often take out their regulars and use their 11th and 12th pitchers too
3. There are essentially no major league pitchers that you can put on the field and expect to get taken apart for 20 runs.  If the Yankees scored 10 runs in 6 innings off the DRays real pitchers, and then they send out the bum for the last three, even if he’s truly awful, a 9 RA, you’re only expecting him to give up 3 runs, whereas an average pitcher would give up somewhere between 1.5 and 2.  The continuing offensive explosion is manifested evidence of the Yankees’ hitting prowess.

Furthermore, his method assumes not only that the distribution of runs scored is more telling than the average runs scored, but that the specific combinations of runs and runs allowed are significant.  I’m not going to argue that R and RA are independent, but I think they are a heckuva lot closer to being independent than the specific combinations are to being meaningful.  And of course I’m not impressed that it tracks actual W% better than the aggregate methods; I would certainly hope so!


#16    Patriot      (see all posts) 2007/07/23 (Mon) @ 15:39

For this reason: The likelihood of scoring another run, given that run X has already scored, is greater for a greater X.  That is, if a team has already scored 10 runs, their likelihood of scoring again is greater than if they had scored 5. For the same team, with the same “real” ability.

Where is your evidence to support this claim (other than what Tango says about the compounding nature of offense)?

To be as accurate as possible (i.e., reduce variance between prediction and actual win percentage), the pythagorean method should discount runs after the first.

I don’t think anyone is going to disagree with you that you can improve the accuracy of Pyth predictions by incorporating data about the actual scoring distribution of the team.  But that fact is not evidence that the actual distribution tells you more about the teams’ true quality than their average runs/game (or, going with Tango’s whole point here and ignoring the actual runs, their OBA and SLG and RC, etc.).  As Guy pointed out, a Pyth method that says “you win when you score more runs than the other team” defeats the purpose of using Pyth for teams at all. 

Discounting runs as such results in a pretty substantial reduction in the variance between predicted and actual wins (reduced from .021 to .016). It’s a better predictor of future performance, and of “real” skill.

Again, I doubt anyone will disagree that it reduces the variance between the expected and the actual.  But the second claim about being a better predictor is not supported by the reduced variance.  So if you have the evidence to back up that claim, I’m all ears.


#17    tangotiger      (see all posts) 2007/07/23 (Mon) @ 15:54

(or, going with Tango’s whole point here and ignoring the actual runs, their OBA and SLG and RC, etc.). 

I just want to reemphasize that this is NOT what I’m saying.  I’m saying to ignore everything EXCEPT actual runs.  And then converting the actual runs into an OBP-like measure (Matt is going the other way and converting actual runs into a wins-like measure).  14 runs scored in 9 innings is a .500 OBP (27 for 54).  0 runs scored in 9 innings is a .000 OBP (0 for 27).  The total, 27 for 81, is a .333 OBP which is 9.4 runs scored over 18 innings.

That’s the point: the extra 4.6 runs that were scored were a figment.  Why?  Because they were scored when there were so many runners on base that every hit or walk got a huge run value, far more than it would have been worth in two .333 OBP games.  A hit is a hit is a hit.  But a hit in a .500 OBP game is worth far more than in a .000 OBP game, and (weighted) averaging the two makes it worth far more than in a .333 OBP game.

To recap:
- only consider runs scored
- convert them in such a way as to make them additive


#18    studes      (see all posts) 2007/07/23 (Mon) @ 15:55

Personally, I find this intriguing, but (as others have said) would like to see the proof that this predicts future performance better than Pythag.  Has Matt posted that anywhere?


#19    Guy      (see all posts) 2007/07/23 (Mon) @ 15:57

Tango/10:  What’s the basis for converting RS to an imputed OBP?  The typical OBP in a shutout will of course be something like .150, not .000. 

I think you’re assuming that a high-scoring game always involves “bunching” of offense, and vice-versa.  But a team might score 10 runs on 25 hits and be inefficient, or score 2 runs on 2 hits and be very efficient.  In any case, wouldn’t it be simpler to just regress the teams back to league average (based on a 2-game sample) and be done with it? 

CDM/13:  Basically, what Patriot said. “Predicting” current W-L will of course be aided by knowing actual run distribution; if it aids with future prediction, that would be interesting. 

Also, the discount method you came up with could be improved.  The first run is not actually the most valuable—runs 2, 3, and 4 are.  If you weighted runs by the marginal value of each run in Sal/14’s win% chart—run#1 * .04, #2 * .11, #3 * .13, etc.—you would get an improved estimate.


#20    tangotiger      (see all posts) 2007/07/23 (Mon) @ 16:06

Guy/19: you can definitely regress the actual runs scored to the league average OBP for that runs scored level.  So, if a team scores 0 runs, their wOBA may actually be .100.  And if they score 20 runs, their wOBA may actually be .500.  I’m fine if you want to do it that way.

But, that’s the main point, that percentages are additive (OBP, or win%, etc), but runs scored are NOT additive.


#21    David Gassko      (see all posts) 2007/07/23 (Mon) @ 16:16

Tom,

I don’t really see where you’re going with this. It seems to me that you’re simply suggesting that Pythagorean record be calculated using expected runs scored and allowed, which is already something Clay Davenport does over at BP using EqR and we do over at THT with the Dartboard, using BaseRuns.


#22    tangotiger      (see all posts) 2007/07/23 (Mon) @ 16:21

From 2000-2006, here is the team’s wOBA for each runs scored level (if the team had 27 batting outs):
R wOBA PA OB n
0 0.181 32.9 5.9 1685
1 0.223 34.8 7.8 2851
2 0.257 36.4 9.4 3338
3 0.284 37.7 10.7 3350
4 0.311 39.2 12.2 2870
5 0.335 40.6 13.6 2341
6 0.353 41.7 14.7 1883
7 0.375 43.2 16.2 1356
8 0.393 44.5 17.5 970
9 0.412 45.9 18.9 681
10 0.421 46.7 19.7 525
11 0.441 48.3 21.3 346
12 0.453 49.3 22.3 226
13 0.470 50.9 23.9 143
14 0.479 51.8 24.8 103
15 0.494 53.4 26.4 60
16 0.502 54.2 27.2 41
17 0.513 55.5 28.5 18
18 0.548 59.7 32.7 12
19 0.543 59.0 32.0 9
20 0.522 56.5 29.5 3
21 0.554 60.5 33.5 3
22 0.543 59.1 32.1 3
26 0.523 56.6 29.6 1

Obviously, you’d want to smooth that out at the end.

So, if you scored 0 runs and 14 runs, you’d get this:
0 R: 32.9 PA, 5.9 times on base
14 R: 51.8, 24.8

The total is 84.7 PA, 30.7 times on base, for a wOBA (or OBP) of .362.

If you had a 6R game and a 7R game, you’d get a .364 OBP.

So, this is what I’m talking about (with Guy’s method).  The 0+14 runs is equivalent to 6+7 runs.
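The pooling in this example, as a small sketch. Only the four rows needed are copied from the table above; the dictionary and function names are made up.

```python
# Runs scored -> (PA, times on base), copied from the 2000-2006 table above.
EQUIV = {
    0: (32.9, 5.9),
    6: (41.7, 14.7),
    7: (43.2, 16.2),
    14: (51.8, 24.8),
}


def equivalent_obp(run_totals):
    """Pool the equivalent PA and times on base for a list of game run totals."""
    pa = sum(EQUIV[r][0] for r in run_totals)
    on_base = sum(EQUIV[r][1] for r in run_totals)
    return on_base / pa


print(f"0 R + 14 R: {equivalent_obp([0, 14]):.3f}")
print(f"6 R +  7 R: {equivalent_obp([6, 7]):.3f}")
```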


#23    tangotiger      (see all posts) 2007/07/23 (Mon) @ 16:25

David, I keep saying to only use actual runs scored. 

1. Take actual runs scored
2. Convert to equivalent PA, H, BB, TB
3. Add up the equivalent values
4. Convert the values in step 3 into equivalent runs scored

In short, convert actual runs into equivalent runs using only actual runs!  (See post 22 for one implementation method.)
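Step 4 is not spelled out, so here is one way it might go: invert the runs-to-wOBA table from post 22 by linear interpolation. The interpolation, the clamping at the ends, and the function name are all assumptions, and only the rows for 0 to 10 runs are copied in.

```python
# (runs, wOBA) rows copied from the 2000-2006 table in post 22.
RUNS_TO_WOBA = [
    (0, 0.181), (1, 0.223), (2, 0.257), (3, 0.284), (4, 0.311), (5, 0.335),
    (6, 0.353), (7, 0.375), (8, 0.393), (9, 0.412), (10, 0.421),
]


def equivalent_runs(woba):
    """Map a pooled wOBA back to runs per game by linear interpolation."""
    if woba <= RUNS_TO_WOBA[0][1]:
        return float(RUNS_TO_WOBA[0][0])          # clamp at the bottom row
    for (r0, w0), (r1, w1) in zip(RUNS_TO_WOBA, RUNS_TO_WOBA[1:]):
        if w0 <= woba <= w1:
            return r0 + (woba - w0) / (w1 - w0) * (r1 - r0)
    return float(RUNS_TO_WOBA[-1][0])             # clamp at the top row used here


# The 0-run + 14-run pair pooled to a .362 equivalent OBP in post 22.
print(f"equivalent runs per game: {equivalent_runs(0.362):.2f}")
```

Under these assumptions the 0 + 14 pair converts to between 6 and 7 runs per game, a bit under the raw average of 7.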


#24    tangotiger      (see all posts) 2007/07/23 (Mon) @ 16:32

I should have provided this as well:
R win%
0 0.000
1 0.036
2 0.103
3 0.173
4 0.278
5 0.407
6 0.513
7 0.651
8 0.729
9 0.803
10 0.867
11 0.905
12 0.947
13 0.972
14 0.981
15 0.983
16 0.976
17 1.000
18 1.000
19 1.000
20 1.000
21 1.000
22 1.000
26 1.000

In post 23, step 2, I said to convert runs DOWN to bases/outs.  You can instead convert UP to wins.

In that case, scoring 0 runs gives you a win% of .000 and scoring 14 gives you .981.  It should be noted however that the .000 wins was based on 32.9 PA and the .981 wins was on 51.8 PA.  The weighted average for wins is .600.

Scoring 6 runs (.513 wins on 41.7 PA) and 7 runs (.651 wins on 43.2 PA) gives us an average of .583.

So, a team that scores 0 runs + 14 runs is roughly equal to a team that scores 6 runs + 7 runs.
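The PA-weighted averages above can be reproduced with a short sketch. The win% values come from this post and the PA values from post 22; the names are made up.

```python
# Runs scored -> (win%, equivalent PA), from the tables in posts 24 and 22.
ROWS = {
    0: (0.000, 32.9),
    6: (0.513, 41.7),
    7: (0.651, 43.2),
    14: (0.981, 51.8),
}


def pa_weighted_win_pct(run_totals):
    """Average the per-game win% values, weighting each game by its PA."""
    total_pa = sum(ROWS[r][1] for r in run_totals)
    weighted = sum(ROWS[r][0] * ROWS[r][1] for r in run_totals)
    return weighted / total_pa


print(f"0 R + 14 R: {pa_weighted_win_pct([0, 14]):.3f}")
print(f"6 R +  7 R: {pa_weighted_win_pct([6, 7]):.3f}")
```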


#25    Guy      (see all posts) 2007/07/23 (Mon) @ 16:34

Tango:
I’m not sure this will change your estimate at all. 
11RS and 5RS:  OBP .393
OBP for 8RS:  .393

10RS and 2 RS:  OBP .350
6 RS:  .353

You’re assuming the RS reflects true talent, and in that case you’ll get the same results as just averaging RS.


#26    studes      (see all posts) 2007/07/23 (Mon) @ 16:51

So, in the converting up example, you and Matt are suggesting that the pythagorean formula could be used to turn each individual game RS and RA into a winning percentage (based on league-average RS), and then averaged out to come up with a new type of pythagorean output?


#27    tangotiger      (see all posts) 2007/07/23 (Mon) @ 16:52

It would seem that this effect shows up only in the very extreme cases.  It may very well turn out, as Guy is showing, that even for extreme teams the effect is negligible.

***

It should also be worth pointing out that the win% numbers should probably be weighted on PA, and not equally weighted.


#28    studes      (see all posts) 2007/07/23 (Mon) @ 17:34

By the way, I should point out that when Matt has posted about this issue before, he hasn’t been concerned with the forecasting aspect of pythagoras, but with allocating credit for wins between offense and defense (i.e. Win Shares).


#29    Guy      (see all posts) 2007/07/23 (Mon) @ 18:23

"It should also be worth pointing out that the win% numbers should probably be weighted on PA, and not equally weighted.”

If you use the win% approach, I would think 27 outs (game) is the right weighting.  Given the correlation between RS and PA, weighting by PA will tend to overstate the team’s ability to score per 27 outs, which is what matters.  Also, it seems odd to weight offense and defense in a single game differently to assess a team’s talent.  If a team wins 14-2, are you going to give 60% more weight to the hitters’ performance than the pitchers/fielders?

* *

And why only .407 for 5 RS?  Are you perhaps counting tie/xtra innings as a loss?


#30    tangotiger      (see all posts) 2007/07/23 (Mon) @ 19:25

I probably have a bug.  I did that real quick.

***

As for the win%, we are looking at talent levels of players, and noting them in “team” terms.  The weighting has to be in PA, I’d think.

In any case, it’s clear that OBP has to be in PA terms.

And it looks like the win% to OBP relationship sticks on a PA level, not an outs level.


#31    bedir than average      (see all posts) 2007/07/23 (Mon) @ 19:42

Studes, I do know that Matt this year is taking snapshots at various points during the season to compare the predicted win total for both versions of Pythag he’s running.  After the season he’ll only have one season to look at, but it is better than zero.


#32          (see all posts) 2007/07/23 (Mon) @ 20:01

I know in the beginning of the season the pythag record better predicts future record and towards the end actual record predicts final record more than the pythag....does anyone know the point in the season at which they cross?


#33    SABRMatt      (see all posts) 2007/07/23 (Mon) @ 20:52

The reason I prefer to go up to W% like figures rather than down is that going up to W% does a better job of showing a team’s SITUATIONAL strengths...a team with a great bullpen will look better by PythagenMatt than by any attempt to weigh each game on OBP/SLG


#34    tangotiger      (see all posts) 2007/07/23 (Mon) @ 22:12

Matt/33: it seems that you are also missing the point I was trying to make.  Must be me, since you’re the 4th or 5th so far.

I am not, at all, looking at OBP or SLG.  The absolute only single parameter I am looking at is runs scored.  That’s it.  I am simply expressing runs scored on an OBP scale.  That is, given that exactly 14 runs were scored, what combination of BA/OBP/SLG will give you that?  The answer: 0.417 / 0.500 / 0.625 .  It doesn’t matter if the team actually was on base .500 or not.  It doesn’t matter if the SLG was .625 or not.  But, the impact of 14 actual runs is equivalent to a random team hitting .417/.500/.625 randomly.

That’s all that I’m doing.  Since we’ve ascertained that we cannot add 14 runs from one game and 0 runs from another game to presume that such a team will score 7 runs per game, we need to construct a mechanism that will allow us to do that.

That scale is OBP.  The numerator of OBP and denominator of OBP are each additive.  And the aggregate of the numerators and denominators are divisible.

***

Guy suggested a different method, and that is to look at actual teams that actually scored 14 runs, and infer the OBP/SLG that would be its equivalent.  I’m ok with that process too (probably prefer it).

But again, I am looking at one and only one parameter: runs scored.


#35    Pizza Cutter      (see all posts) 2007/07/24 (Tue) @ 00:45

A way out: Take Matt’s game-by-game methodology and add a few statistical tricks.

1) Create a database of however many years of Retrosheet logs you want.  I did 1980-2006.

2) Isolate how many runs each team scored and gave up in the game and whether or not they won it.

3a) Either empirically tabulate the probability of the team winning the game having given up X number of runs (i.e., a team that scores 5 runs in a game wins XX% of the time) and the probability of the other team winning if the team we’re interested in gives up Y number of runs… or

3b) Run a couple of binary logit regressions and save the predicted probabilities.  This is what I did.  It solves for the fact that the 21st run doesn’t add much of anything over what the 20th run did.

4) Insert these two percentages into the log5 method equation.  This is the expected winning percentage of that game for your team, so if it comes out at .600, credit your team with .6 expected wins.  Add up and divide by the number of games for an expected win percentage.

5) Compare to the actual win percentage.  Over the 27 years in my database, the correlation is .973.  Compare that to the Pythag1.82, Davenport, and Patriot methods which check in around .935 each over that time span.
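Steps 3a and 4 can be sketched as follows. The win probabilities are a stand-in: they are the 1998 empirical win frequencies from post 14 (0 to 10 runs only), not the logit predictions from the 1980-2006 database, and the cap at 10 runs, the game log, and the names are assumptions.

```python
# Stand-in for step 3a: win frequency by runs scored (1998 figures, post 14).
WIN_PROB = {0: 0.00, 1: 0.07, 2: 0.20, 3: 0.34, 4: 0.51, 5: 0.58,
            6: 0.70, 7: 0.77, 8: 0.82, 9: 0.88, 10: 0.90}
MAX_RUNS = max(WIN_PROB)


def log5(a, b):
    """Log5: chance that a side with win rate a beats a side with win rate b."""
    numerator = a * (1.0 - b)
    denominator = numerator + b * (1.0 - a)
    return 0.5 if denominator == 0 else numerator / denominator


def expected_win_pct(game_log):
    """Step 4: average the per-game log5 expectations over (RS, RA) pairs."""
    total = 0.0
    for rs, ra in game_log:
        a = WIN_PROB[min(rs, MAX_RUNS)]   # our win rate when scoring rs
        b = WIN_PROB[min(ra, MAX_RUNS)]   # opponent's win rate when scoring ra
        total += log5(a, b)
    return total / len(game_log)


game_log = [(5, 3), (2, 6), (9, 1)]       # hypothetical games
print(f"expected win% = {expected_win_pct(game_log):.3f}")
```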

The bad news is that the residuals in my model also track actual winning percentage exceptionally well (r = .891), so it has that rather big bias.  Good teams always look like they are outperforming their predictions in here and bad teams always look worse than their prediction.  It’s not going to be a good measure of luck/managerial skill, if such residuals ever were.

An idea that came from reading the above.


#36    Guy      (see all posts) 2007/07/24 (Tue) @ 09:27

Matt/33, Pizza/35:
There are two factors that separate win% from pythag:  A) distributional efficiency – how helpful is the distribution of RS (consistency=good) and RA (consistency=bad), given certain overall averages, and B) matching efficiency – are RS and RA matched in a helpful way at the game level (4RS - 3RA, 7RS – 6RA).  Most analysts believe these are both matters of luck.  Matt clearly believes one or both are skills; I can’t tell where Pizza comes down.

To show either is a skill, you have to show your metric predicts future win% better than pythag.  Using game-level data to better estimate past performance is no great trick (as I mentioned, a formula of RS>RA=1, RS

Even if these aren’t skills, it might be interesting to separate the two factors, to see if over/underperformance vs. pythag is mainly a consequence of one factor or the other.  Pizza almost does this, but using individual game data muddies the waters.  First, you could calculate “distributional pythag”:  how many games a team should win with RS of a*0, b*1, c*2….and RA of x*0, y*1, z*2…, assuming total independence of RS and RA (might need a park adjustment on that).  Then you have:

Distrib. Pythag minus Pythag = distributional efficiency
Distrib pythag minus actual Win% = matching efficiency.
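One way the “distributional pythag” might be computed, as a sketch: pair every RS game with every RA game as if the two were independent and count how often RS beats RA. Splitting ties 50/50 (they would go to extra innings) is an assumption, as are the names and the sample logs; no park adjustment is attempted.

```python
from collections import Counter


def distributional_win_pct(rs_games, ra_games):
    """Win% implied by the RS and RA distributions, assuming independence."""
    rs_freq = Counter(rs_games)
    ra_freq = Counter(ra_games)
    wins = 0.0
    for rs, rs_count in rs_freq.items():
        for ra, ra_count in ra_freq.items():
            pairs = rs_count * ra_count
            if rs > ra:
                wins += pairs
            elif rs == ra:
                wins += 0.5 * pairs       # assumed coin flip in extra innings
    return wins / (len(rs_games) * len(ra_games))


# Hypothetical per-game runs scored and runs allowed for one team.
rs_games = [4, 4, 5, 3, 4]
ra_games = [0, 9, 1, 8, 2]
print(f"distributional pythag = {distributional_win_pct(rs_games, ra_games):.3f}")
```

Subtracting the ordinary Pythag figure from this number, and the actual win% from it, gives the two efficiencies defined above.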

But these are purely of academic interest, unless/until someone shows that one or both of these are real skills.


#37    Guy      (see all posts) 2007/07/24 (Tue) @ 09:32

That’s weird—half of my 2nd graf disappeared.  Should read:

To show either is a skill, you have to show your metric predicts future win% better than pythag.  Using game-level data to better estimate past performance is no great trick (as I mentioned, a formula of RS > RA = 1, RS < RA = 0 will beat Pizza’s much more complex model).  So, for example, Pizza might see if 2nd-half win% correlates better with 1st-half performance on his metric than it does with 1st-half traditional pythag.  My guess is no, but it’s an interesting question.


#38    tangotiger      (see all posts) 2007/07/24 (Tue) @ 09:52

I want to continue to reiterate that this is not about efficiency or even distribution of runs.

This is solely about how the compounding/synergy nature of the runs doesn’t allow for the sum of the parts to equal the whole at the extreme points.  That the distribution of performance (not runs) leads to more runs than would otherwise be expected in the aggregate.

Again, go back to my original example, of two games or the example in post 22.

The point is that the game-by-game runs are not summable as a proxy for the true talent.  It must be converted into a scale (OBP or some other rate state) that is summable.


#39    tangotiger      (see all posts) 2007/07/24 (Tue) @ 10:20

Guy, my problem with the chart is that I only looked at teams (home or away) that recorded exactly 27 outs.  You can imagine the selective sampling issues there.  (I was trying to focus on runs per 27 outs, and I didn’t want to have fractions, etc.)

Anyway, here is the full chart, without any filtering:

 R   win%    OBP   Outs     n
 0  0.000  0.180   27.1  1737
 1  0.080  0.223   27.2  3157
 2  0.215  0.258   27.1  4105
 3  0.335  0.284   27.0  4541
 4  0.473  0.311   26.9  4386
 5  0.594  0.336   26.8  3910
 6  0.688  0.356   26.6  3333
 7  0.783  0.377   26.4  2578
 8  0.836  0.393   26.3  1882
 9  0.880  0.413   26.2  1370
10  0.922  0.429   26.0  1046
11  0.944  0.444   25.8   695
12  0.961  0.460   25.8   460
13  0.984  0.473   25.6   307
14  0.985  0.484   25.8   194
15  0.982  0.495   26.3   112
16  0.988  0.508   25.7    82
17  1.000  0.519   25.3    41
18  1.000  0.534   25.3    27
19  1.000  0.556   25.6    17
20  1.000  0.542   24.8    11
21  1.000  0.554   26.3     4
22  1.000  0.560   26.3     4
23  1.000  0.563   24.0     3
25  1.000  0.655   24.0     1
26  1.000  0.523   27.0     1


#40    Guy      (see all posts) 2007/07/24 (Tue) @ 10:26

“I want to continue to reiterate that this is not about efficiency or even distribution of runs.”

I think we have two discussions going:  Tango’s issue of runs vs. OBP, and a discussion of Matt’s (and now Pizza’s) use of game-level data to modify pythag.  Nothing wrong with two threads (is there?), as long as everyone is clear which issue they’re addressing.

* *

On Tango’s point:  I don’t think extreme scores are so divergent from OBP as to make this an important distinction.  Let’s take 14 run games, the highest score for which you have n>100.  The actual OBP is .479, which would usually produce about 12.3 R/G.  So even in these very extreme games, about 80% of the runs above average (9 in this example) are created by actual improved production by the hitters, and only 20% comes from lucky “bunching” of hits.  (And that assumes the HR/OBP ratio is normal in these games, which seems unlikely.  So some of the 20% probably stems from a high SLG% in these games.) At the other end, when a team gets shut out the hitters have actually produced at about a 1.5 R/G level, so 70% of their shortfall is earned. 
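
The percentages above can be checked with a few lines of arithmetic. This is only a sketch: the league average of 5.0 R/G is an assumption inferred from "9 runs above average" for a 14-run game, and `earned_share` is a hypothetical helper name.

```python
LEAGUE_AVG = 5.0  # assumption: implied by "9 in this example" for a 14-run game

def earned_share(actual_runs, obp_implied_runs, average_runs=LEAGUE_AVG):
    """Fraction of the gap from average that the hitters' OBP accounts for."""
    return (obp_implied_runs - average_runs) / (actual_runs - average_runs)

high_share = earned_share(14, 12.3)  # 14-run games: OBP implies about 12.3 R/G
low_share = earned_share(0, 1.5)     # shutouts: OBP implies about 1.5 R/G

print(f"14-run games: {high_share:.0%} earned")
print(f"shutouts:     {low_share:.0%} earned")
```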

Unless one were trying to divine the Yankees’ true talent from a 2-game sample (and why would you ever do that), it doesn’t seem that this distinction is very important.


#41    Tangotiger      (see all posts) 2007/07/24 (Tue) @ 10:28

Looking at the above, I think that stating runs scored as an equivalent OBP (which from now on I’ll designate as rOBP, for run-equivalent OBP) is the best way to do the summing. 

I don’t think turning it into an equivalent win% is the correct thing to do.  A team can score 17 runs or 170 runs, and it gets the same value.  And that’s clearly not right, in terms of establishing the true talent value.


#42    Tangotiger      (see all posts) 2007/07/24 (Tue) @ 10:31

Guy, I’m happy with having two on-going discussions, as long as we’re clear on it. 

***

You may be right that the extreme cases that would cause us problems simply don’t occur enough to give us any trouble.  Only one way to find out.  I’ll be back soon…


#43    Anthony      (see all posts) 2007/07/24 (Tue) @ 13:01

Based on the chart in #39, a team with 16 runs in two games would have these equivalent OBPs:

0+16: .381
1+15: .386
2+14: .389
3+13: .390
4+12: .393
5+11: .394
6+10: .394
7+9: .395
8+8: .393

So this means the team with a 7-run game and a 9-run game has the highest true talent level...and we’d expect them to win more games in the future than the other combinations?

Essentially this is saying normal Pythagorean methods underrate consistency, no?
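
One reading that appears to reproduce the figures listed above: turn each game's OBP and outs (from the chart in #39) into times on base, pool the two games, and recompute OBP. The sketch below assumes that reading; `CHART`, `times_on_base` and `combined_obp` are hypothetical names, and only the rows for 0 to 16 runs are copied in.

```python
# (OBP, outs) by runs scored, copied from the chart in #39
CHART = {
    0: (0.180, 27.1), 1: (0.223, 27.2), 2: (0.258, 27.1), 3: (0.284, 27.0),
    4: (0.311, 26.9), 5: (0.336, 26.8), 6: (0.356, 26.6), 7: (0.377, 26.4),
    8: (0.393, 26.3), 9: (0.413, 26.2), 10: (0.429, 26.0), 11: (0.444, 25.8),
    12: (0.460, 25.8), 13: (0.473, 25.6), 14: (0.484, 25.8), 15: (0.495, 26.3),
    16: (0.508, 25.7),
}

def times_on_base(obp, outs):
    """Invert OBP = on_base / (on_base + outs) to get times on base."""
    return outs * obp / (1.0 - obp)

def combined_obp(runs_by_game):
    """Pool several games, given their run totals, into one equivalent OBP."""
    on_base = sum(times_on_base(*CHART[r]) for r in runs_by_game)
    outs = sum(CHART[r][1] for r in runs_by_game)
    return on_base / (on_base + outs)

# Every way to split 16 runs across two games
for low in range(0, 9):
    high = 16 - low
    print(f"{low}+{high}: {combined_obp([low, high]):.3f}")
```

Because the pooling is weighted by plate appearances, the high-scoring game (which has far more of them) counts for more than a simple average of the two OBPs would give it.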


#44    Pizza Cutter      (see all posts) 2007/07/24 (Tue) @ 13:08

Anthony, a way to test that postulate is to check to see whether a team’s SD in runs scored or runs allowed per game over the season correlates with its Pythagorean residuals.
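
A rough sketch of that check, assuming game-by-game scores are available as (RS, RA) pairs per team. The classic exponent of 2 stands in for whichever Pythagorean variant is preferred, and all function names are hypothetical.

```python
from statistics import pstdev

def pythag_residual(games, exponent=2.0):
    """Actual win% minus Pythagorean win% for a list of (RS, RA) scores."""
    rs = sum(s for s, _ in games)
    ra = sum(a for _, a in games)
    actual = sum(1 for s, a in games if s > a) / len(games)
    expected = rs**exponent / (rs**exponent + ra**exponent)
    return actual - expected

def pearson_r(xs, ys):
    """Plain Pearson correlation; fails if either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def rs_sd_vs_residual(team_games):
    """team_games maps team -> list of (RS, RA). Returns the correlation
    between each team's SD of runs scored and its Pythagorean residual."""
    sds = [pstdev([s for s, _ in games]) for games in team_games.values()]
    residuals = [pythag_residual(games) for games in team_games.values()]
    return pearson_r(sds, residuals)
```

The same function can be pointed at runs allowed by swapping which element of each pair feeds the SD.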


#45    Guy      (see all posts) 2007/07/24 (Tue) @ 13:33

I don’t think so, Pizza.  You need to make a distinction between past and future performance.  You should indeed find that positive pythag residuals correlate with large COV for RA, and with low COV for RS (given sufficient sample size), given the results in Tango/39 which show that scoring 4 or 5 runs yields the largest bang for the buck in wins/run.  But that doesn’t mean that “consistent” teams have more true talent.


#46    Pizza Cutter      (see all posts) 2007/07/24 (Tue) @ 13:54

Guy, you’re correct that the analysis I propose wouldn’t tell us whether consistency necessarily means greater true talent (measurement validity).  Anthony’s question was on whether Pythagorean methods are biased to over- or under-estimate consistent vs. non-consistent teams (measurement bias).


#47    studes      (see all posts) 2007/07/24 (Tue) @ 14:51

I’m in another place than you folks, but I did look at 2006 team records for the first half of the year and compared them to how they performed in the second half.  I found that Matt’s approach predicted the second half records slightly better than the PythagenPat formula did.

Just FYI.


#48          (see all posts) 2007/07/25 (Wed) @ 15:02

Patriot/16 & Guy/19:

OK. So you think it’s boring that discounting runs reduces the prediction error within a sample of games (i.e., reduces training error).  You want to see that it improves prediction for an entirely separate sample of games (i.e., improves generalization).  Essentially, you’re concerned that I’m over-fitting the data.  Does this address your concern:

I ran a cross-validation analysis (leave one out analysis, N-1 analysis, etc.).  For each team, for each game, I created a prediction based on the other 161 games. I either summed the runs scored and runs allowed (standard pythagorean method), or summed the discounted runs scored and allowed (discounted pythagorean method).

That gives me 162 predictions for each team, but critically, I never considered the “testing” game when I created my prediction.  Thus, this is, in the truest sense, a predictive measure: I’m not finding the best fit of one variable to another; I’m finding the best fit, and then applying that fit to a novel set of data and testing whether it generalizes. 

So, I compared each prediction to the actual outcome of each game.  Then I summed the variance between actual and prediction for both methods, and determined which made a better prediction.  I found that the discounted method provided a significantly better prediction than the standard pythagorean method.

The difference isn’t huge, but it’s highly statistically significant (due to the huge N).
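
The leave-one-out procedure described above might look roughly like this for a single team. It is a sketch under assumptions: the Pythagorean exponent of 2 is a placeholder, and `discount` is a purely hypothetical geometric scheme, since the actual discounting rule is not spelled out in this comment.

```python
def pythag(rs, ra, exponent=2.0):
    """Pythagorean win probability; 0.5 when there is no information."""
    if rs == 0 and ra == 0:
        return 0.5
    return rs**exponent / (rs**exponent + ra**exponent)

def discount(runs, rate=0.9):
    """Hypothetical discounting: each extra run in a game counts `rate`
    times as much as the one before it."""
    return sum(rate**k for k in range(runs))

def loo_squared_error(games, transform=lambda runs: runs, exponent=2.0):
    """games: list of (RS, RA) for one team. Predict each game from all
    the others and return the summed squared prediction error."""
    t_rs = [transform(s) for s, _ in games]
    t_ra = [transform(a) for _, a in games]
    total_rs, total_ra = sum(t_rs), sum(t_ra)
    sse = 0.0
    for i, (s, a) in enumerate(games):
        # The held-out game never contributes to its own prediction
        pred = pythag(total_rs - t_rs[i], total_ra - t_ra[i], exponent)
        outcome = 1.0 if s > a else 0.0
        sse += (outcome - pred) ** 2
    return sse
```

Calling `loo_squared_error(games)` gives the standard method and `loo_squared_error(games, transform=discount)` the discounted one; summing each over all teams and comparing the totals mirrors the comparison described above.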


#49    Sky      (see all posts) 2007/07/25 (Wed) @ 15:51

John/32—I don’t have an answer to your question, but I don’t think it’s as interesting as you think.  The only reason actual record is worth anything is because it tells you what actually happened.  And using A to prove A+x will eventually be perfectly accurate as x approaches zero—but it also becomes perfectly pointless.

A better question is “at what point in the season is actual record a better prediction of future record than Pythag record is of future record.” And my guess is that point doesn’t exist.


#50    tangotiger      (see all posts) 2007/07/25 (Wed) @ 16:36

Sky/49: after around 140 games
http://baseballprospectus.com/article.php?articleid=3490

Discussion of the above article is here:
http://www.battersbox.ca/article.php?story=20040923122101999


#51    john      (see all posts) 2007/07/25 (Wed) @ 16:38

yeah that’s exactly what i was wondering.  When it becomes a better prediction of future record than the pythag record is of future record.


#52    john      (see all posts) 2007/07/25 (Wed) @ 16:39

thanks tango

that’s exactly what i was looking for


#53    Guy      (see all posts) 2007/07/25 (Wed) @ 17:01

CDM:  I’m certainly prepared to believe that you’ll get a small improvement, if only because discounting extreme performances will essentially improve your assessment of the team’s true offensive ability.  Throwing out the 3 most lopsided games (or whatever) might get you the same thing.  But isn’t this an awful lot of work for a minor improvement? 

I’d also suggest you try modifying your discounting system.  It doesn’t seem that discounting every run after the first would be the best approach.


#54    tangotiger      (see all posts) 2007/07/26 (Thu) @ 06:59

http://www.hardballtimes.com/main/article/ten-things-i-didnt-know-last-week41/

I took the first half of 2006 and calculated each team’s winning percentage, Pythagenpat percentage and Matt’s approach, then compared how well each metric predicted that team’s second-half record. Here’s what I found:
...
The Pythagenpat formula was a definite improvement over the first-half winning percentage, with an R-squared of .11.

Matt’s method was a slight improvement over Pythagenpat; R-squared of .12.

r-squared of .11 means r of .33
r-squared of .12 means r of .35

That is, you regress using Matt’s method 65% (1-.35) toward the mean, after the first half of the season.
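
As a quick illustration of that arithmetic: regressing 65% toward the mean is the same as keeping the fraction r of the deviation from .500. The sketch assumes a .500 mean and similar spread in both halves; `regress_to_mean` is a hypothetical helper.

```python
import math

def regress_to_mean(observed, r, mean=0.500):
    """Keep a fraction r of the deviation from the mean, i.e. regress
    (1 - r) of the way back toward it."""
    return mean + r * (observed - mean)

r_pat = math.sqrt(0.11)   # Pythagenpat: R-squared of .11
r_matt = math.sqrt(0.12)  # Matt's method: R-squared of .12

print(f"r (Pythagenpat) = {r_pat:.2f}, r (Matt) = {r_matt:.2f}")
# A .600 first-half mark by Matt's method, regressed with r = .35
print(f"{regress_to_mean(0.600, 0.35):.3f}")
```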


#55    Guy      (see all posts) 2007/07/26 (Thu) @ 10:49

The question I have is whether Matt’s method is any more accurate than just regressing traditional pythag?  That is, is pythagMatt just applying a little regression to pythag—which we don’t need game-specific scores to do—or is it providing additional information?


#56          (see all posts) 2007/07/26 (Thu) @ 14:48

I think Matt’s method does have a built-in advantage because it requires additional data (matched pairs of RS and RA).  It removes, to a certain extent, the “problem” of blowouts.

Tango’s method seems just as functional, if not more so, in addition to not requiring that you know how the individual games’ RS match up with RA.  Sure, there’s an extra computational step—converting Runs to the OBA-scale and then back—but the input data is no different from traditional Pythag or its variants.

--------

Does anybody know a website with game-by-game scores in an easily importable (all games at once) format?  Shouldn’t be too hard to implement Tango’s approach with that data and his table above…


#57    Tangotiger      (see all posts) 2007/07/26 (Thu) @ 15:02

Guy/55: that is what is being done.  If the r=.35, then the equation is:
win% = .35 * (pyth - .500) + .500


#58    Tangotiger      (see all posts) 2007/07/26 (Thu) @ 15:07

Sky/56:
http://www.baseball-reference.com/pi/tpgl_finder.cgi

Set the two year parameters to the same thing.  Change “sorted by” to “date”, and check “Ascending”.

You get 200 games at a time.  So, for a full season, you have to click “get next 200” 11 or 12 more times.

Or download the Gamelogs from Retrosheet, which is what I do.


#59          (see all posts) 2007/07/26 (Thu) @ 17:51

Following the logic of this thread, pitcher ERAs are subject to the same “OBP chaining” bias: bad outings result in more runs than they “should” when viewed from an OBPruns perspective.  Thus it makes sense that inconsistent pitchers end up having a higher ERA than their composite ERA would show, which is another reason composite ERA is a better indicator of future ERA.


#60          (see all posts) 2007/07/27 (Fri) @ 09:35

My apologies for the misunderstanding...what you said sounded like you were taking the OBP of the Yankees to be the thing to add rather than a converted OBP.  I’m still not sure it adds anything to go down to OBP and then convert to W%...your correlations are almost exactly the numbers I got (though PythagenMatt got a slightly higher R^2 with all of the games from 1876 to 2005...that could just be the difference between my sample size and yours).

Although it’s true that most of the time, good teams (W% wise) look better in PythagenMatt than PythagenPat, that’s not universally the case.  The 2005 Cleveland Indians for example rated significantly worse by PythagenMatt than by PythagenPat despite a high W%...and in 2006, the same is true but the actual W% dropped to match the game by game prediction.

I have not tested in-season predictive power of PythagenMatt as that is not an area I’ve done much of (in terms of experimental design).  Can you (or anyone reading this thread) suggest a good method for checking the predictive power of this tool compared to PythagenPat?


#61    Guy      (see all posts) 2007/07/30 (Mon) @ 10:07

Tango/57:
Maybe I’m missing your point.  But the question I’m raising is whether pythagMatt gives us any info beyond what we’d get simply by regressing pythag?  Let’s say after 81 games we have a team that is 5.1 RS and 4.1 RA in a 4.5 league.  With no other info, your best 2nd-half prediction would be to regress each #, then calculate pythag.  If regressed pythagMatt can’t beat regressed pythag, it hasn’t really given us new information. 

In other words, does the game-by-game data really tell us something about a team’s ability to score/prevent runs efficiently, or is it just a back-door (and unnecessarily complex) way to regress a half-season’s worth of RS and RA data?  My guess is #2.


#62          (see all posts) 2007/07/30 (Mon) @ 10:56

I don’t think it’s a back-door way to regress RS and RA data...I think there is information within that gives you some indication of the CONSISTENCY of a team’s offense and defense, and I think a team with a great bullpen is likely to do better by PythagenMatt than by any other method.  I haven’t proved it obviously, but that is the reason I created PythagenMatt in the first place...I think the game by game record of a team contains useful information about whether a team is more capable of winning than its primary statistics would lead you to believe even when regressed.


#63    Guy      (see all posts) 2007/07/30 (Mon) @ 13:42

Matt:
That could be, and if so it would be really interesting.

However, keep in mind that “consistency” is only a good thing on offense.  On defense, it’s INconsistency that you want.  In fact, if your method is picking something up, that’s probably what it is.  A 4.5 RA team really has 5 different RA distributions, depending on who’s starting.  To take a simple example, pythag expects a 4.5/4.0 RS/RA team to be .553.  However, if half the starts come from 5.50 RA pitchers, and the other half are 2.50, then expected win% is .570.
However, I’m skeptical that team offenses have any inherent consistency.
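
The comment does not say which formula produced .553 and .570. As a sketch, a Pythagenpat-style estimator whose exponent is (RS + RA per game) raised to 0.28 happens to reproduce both figures to three decimals; that choice of formula and constant is an assumption, and `pythagenpat` here is a hypothetical helper.

```python
def pythagenpat(rs_per_game, ra_per_game, k=0.28):
    """Pythagorean win% with a run-environment-dependent exponent.
    k = 0.28 is an assumption chosen to match the quoted numbers."""
    exponent = (rs_per_game + ra_per_game) ** k
    return rs_per_game**exponent / (rs_per_game**exponent + ra_per_game**exponent)

# One blended staff: 4.5 RS, 4.0 RA
blended = pythagenpat(4.5, 4.0)

# Same 4.0 RA overall, but half the starts at 5.50 RA and half at 2.50 RA
split = 0.5 * pythagenpat(4.5, 5.5) + 0.5 * pythagenpat(4.5, 2.5)

print(f"blended staff: {blended:.3f}")
print(f"split staff:   {split:.3f}")
```

The gain comes from the curvature of wins against runs allowed: the games started by the 2.50 RA pitchers add more wins than the 5.50 RA starts give back.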

It does seem like a good bullpen should help teams beat their pythag.  But I don’t think anyone’s ever come up with real evidence for that (yet).  We’ll see....


#64          (see all posts) 2007/07/30 (Mon) @ 23:57

Right, but that’s exactly what I’m talking about.  I’ve watched the Mariners all year this year and I’ve observed three things about them.

1) Their offense is very consistent.  They tend to bang out 8-10 hits every night.

2) Their bullpen is DOMINANT.  They win just about every time they have a lead after 6.

3) Their rotation is spectacularly inconsistent.  They are all capable of good performances and catastrophic blowouts.

All three combine to make them a much better team than their regressed pythag would look.  PythagenMatt lines up with this observation.


#65    Guy      (see all posts) 2007/07/31 (Tue) @ 05:44

“All three combine to make them a much better team than their regressed pythag would look.  PythagenMatt lines up with this observation.”

Which tells you nothing.  PythagenMatt “lines up” because Seattle has outperformed its pythag and pythagenMatt uses game-level results.  The question is whether teams like Seattle have an actual ability to beat their pythag.  That remains to be seen.  And even if they do, the accuracy gain over regressed pythag is certain to be very small.


#66          (see all posts) 2007/07/31 (Tue) @ 11:55

You say that with absolute certainty that is not warranted by research.

Seattle is 104 games (almost 2/3) of the way through the season and has spent zero (0) days underperforming pythag.  Their Matt-Pat differential (the difference between their P-Matt and P-Pat projections) has consistently exceeded 4.5 games on a 162-game scale the whole season.

Obviously, I haven’t proved anything, but I am of the opinion right now that if you have all three of the things I mentioned above (consistent offense, inconsistent defense and a great bullpen), you will be able to outperform pythag by as much as 10% even over a long time scale.


#67    Steve      (see all posts) 2007/08/11 (Sat) @ 22:13

Look at the total WPA of relief pitchers on a team and see if there’s a positive correlation between high bullpen WPA and outperforming pythag (there is in 2007, every team outperforming pythag bar one - the White Sox - has a high or very high bullpen WPA). That quantifies the relationship between a good bullpen and pythag.


#68    Pizza Cutter      (see all posts) 2007/08/11 (Sat) @ 23:11

I was running much the same thing for another piece that I’m writing.  From 2000-2006, the correlation between bullpen WPA added per one-run game and outperforming of Pythagorean projection (I used the Patriot formula) is .374.  For starters, it’s .174 and for the offense, it’s .284.  The correlation between winning percentage in one run games and pythagorean residuals is .652 from 1980-2006.  A good bullpen sure does help win those one run games.


#69    Steve      (see all posts) 2007/08/12 (Sun) @ 10:17

Very interesting. I’ve been thinking for some time that it would be valuable to quantify how a team’s construction affects its ability to outperform its Pythagorean projection, and that sheds some light on the issue. It’s intuitive that this would be the case, of course: without looking at the numbers, we would expect that a superior bullpen can elevate a team above another with the same run differential. As mentioned above, the distribution of runs created by an offence may even itself out, but the distribution of runs given up by relief pitching can be controlled to a certain extent by bullpen management.

I’m glad that someone else has been thinking the same thing with respect to WPA and has run the numbers. Hopefully this sort of analysis can help put pythag predictions into context.

