THE BOOK--Playing The Percentages In Baseball

Tuesday, October 03, 2006

Baseball Prospectus’ WARP1 is wrong

By Tangotiger, 12:50 PM

Let’s start off with the definition of WARP-1:

WARP-1
Wins Above Replacement Player, level 1. The number of wins this player contributed, above what a replacement level hitter, fielder, and pitcher would have done, with adjustments only for within the season.

Then, let’s look at a .500 team. When I need a .500 team, pretty much without fail, I look for the Houston Astros, and they satisfy my needs. They scored about as many runs as they allowed, and won about as many games as they lost. Let’s take a look at their team:


BP does a great job in presenting their stats, making my job very easy:
http://www.baseballprospectus.com/dt//2006HOU-N.shtml

If you go down to the “Advanced Batting Statistics” section (a misnomer, since the data there covers batting, fielding, and pitching), the team WARP-1 total is 58.5 wins. The Astros won 82 games, which is pretty much what their RS/RA numbers would have predicted. 82 minus 58.5 is 23.5 wins, and 23.5 / 162 = .145. Another perennial .500 team I like is the Seattle Mariners. Their team WARP-1 is 52.1, and they won 78, which is around what their RS/RA would have predicted as well. 78 minus 52.1 = 25.9 wins, and 25.9 / 162 = .160. The Yanks won 97, which is also around their Pythag record. Their team WARP-1 is 71.9. 97 minus 71.9 = 25.1 wins, or a .155 record.

What do we learn here?  That BP’s WARP-1 treats the replacement level as a team with a .150 record. 
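
The arithmetic behind that conclusion can be sketched in a few lines of Python. The win and WARP-1 totals are the ones quoted in this post; the function name and the 162-game schedule length are assumptions made for illustration.

```python
# Implied team replacement level: (actual wins - team WARP-1) / games played.
# Win and WARP-1 totals are the 2006 figures quoted in the post.
GAMES = 162  # assumed full-season schedule

teams = {
    "Astros": (82, 58.5),
    "Mariners": (78, 52.1),
    "Yankees": (97, 71.9),
}

def implied_replacement_pct(wins, team_warp, games=GAMES):
    """Winning percentage left over once the WARP-1 wins are removed."""
    return (wins - team_warp) / games

implied = {
    name: implied_replacement_pct(wins, warp)
    for name, (wins, warp) in teams.items()
}

for name, pct in implied.items():
    print(f"{name}: {pct:.3f}")
```

All three teams land between .145 and .160, which is where the .150 figure comes from.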

I’ve shown elsewhere on this site (click on the “Talent Distribution” category at the bottom of this entry) how the most likely team replacement level is a .300 record.  This can be shown in many many ways.  And, that pretty much is the number most analysts would use.  To use anything else is, frankly, just plain wrong.  Or needs a ton of explaining.

The replacement levels that I use are .380 for a position player or a starting pitcher, and .470 for a reliever. A team of such players will play .300 ball.

So, why does BP calculate WARP-1 the way they do? The likelihood is that it treats a “replacement-level” position player as both a replacement-level fielder and a replacement-level batter. But, such a player is not the 420th best position player in the world. He’s probably not even in the top 1000 players in the world. Why is this the benchmark? What does it tell us?

I know all about the 1899 Spiders, and the recent Tigers.  It doesn’t matter.  Even if an MLB team posts a .140 or .250 record, our best estimate of the true talent level of these teams is nowhere close to those records.  They probably need to be regressed 25-50% towards the mean. 
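
As a rough illustration of what that regression does to the two records mentioned, here is a minimal sketch. It assumes a simple linear regression toward a .500 mean; the function name is made up for this example, and the 25% and 50% amounts are just the endpoints of the range given above.

```python
# Regress an observed winning percentage toward the league mean.
# A linear pull toward .500 is an assumption made for illustration.
def regress_to_mean(observed_pct, regression, mean=0.500):
    """Move observed_pct the given fraction of the way toward the mean."""
    return observed_pct + regression * (mean - observed_pct)

estimates = {
    (obs, r): regress_to_mean(obs, r)
    for obs in (0.140, 0.250)   # the two records mentioned above
    for r in (0.25, 0.50)       # the 25-50% range
}

for (obs, r), est in estimates.items():
    print(f"observed {obs:.3f}, regressed {r:.0%}: {est:.3f}")
```

Under those assumptions a .140 record regresses to somewhere between .230 and .320, and a .250 record to between roughly .310 and .375.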

#1    David Cameron, 2006/10/03 @ 02:58 PM

Clay Davenport, the creator of WARP, is aware of this.  It’s been brought to his attention many times.  I’ve talked to him about it personally.  He’s never bothered to fix it, for reasons no one can really understand. 

It’s also why VORP doesn’t appear on the BP player cards: Keith Woolner actually calculates replacement level correctly, so VORP and WARP use different replacement level values, and they don’t line up with one another. Even though both are “BP stats”.

Combining the incorrect replacement level with the major flaws of FRAA and FRAR as fielding metrics, WARP is totally useless.

#2    studes, 2006/10/03 @ 03:07 PM

Thanks for this very clear explanation, Tango.  I’ve tried to explain this to others but failed.  Your post will serve as a good reference.

#3    tangotiger, 2006/10/03 @ 07:50 PM

My issue is not really the way Clay does it. It’s the way it’s presented. People like me are nuts, and will think about this a lot. Most people are not nuts like that, and will take it on faith. I’d rather see BP put up a disclaimer showing what people think is wrong with it, so the reader is informed.

#4    David Gassko, 2006/10/03 @ 08:24 PM

Well, my main problem with WARP is that it invalidates any study done with it...i.e. Nate Silver’s otherwise spectacular chapter in Baseball Between the Numbers. The values he comes up with are simply too low, because the replacement threshold is too low. Now that’s a serious issue.

#5    Patriot, 2006/10/03 @ 08:30 PM

For many years the only widely available published comprehensive sabermetric system was the Pete Palmer methods, which set the baseline at average and therefore probably too high for most questions.

Now there are two more: Win Shares and WARP, both of them using baselines that are too low to even be called “replacement” (in the general sabermetric understanding of that term).

Nobody has published a system on the middle ground, despite the fact that the .350-.400 level is favored by a plurality of sabermetricians.  Weird.

#6    David Gassko, 2006/10/03 @ 09:01 PM

Brandon,

Just wait for The Hardball Times Annual—you’ll see something.

#7    tangotiger, 2006/10/03 @ 09:06 PM

And it doesn’t look like there’s any relief pitcher penalty, choosing instead the 6.11 RA, in a 4.50 RPG environment, for all pitchers.  I think. 

I choose the level as 5.80 RPG for a starter, and 4.80 RPG for a reliever. (That’s these days. My guess is that the relief penalty wasn’t as harsh in the older days.)

#8    dq, 2006/10/03 @ 09:26 PM

I don’t understand why people use .400 or even .350; there are lots of teams that play .400 ball that have players clearly above the replacement level.

There have been 4 teams in the last 10 years to play .333 ball or less. Two of them had a player clearly above replacement level -The Big Unit was on one of them, and the 98 Tigers had Higginson with an OPS+ of 147. The other Tiger team was well below .333. So if you take one player off of each of those teams and add the other Tiger team, you have 3 teams at around .300 in the last 10 years.

If you set a AAA/MLE at .80, and have a 5 run environment, your average AAA team gets 4 runs scored & 6.25 runs against; about a .307 team.

.300 makes a lot of sense.
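
The AAA estimate above can be reproduced with a Pythagorean record, sketched here. The comment does not say which exponent was used; 1.83 is an assumption that happens to land close to the quoted .307, and the classic exponent of 2 is shown for comparison. The function name is illustrative.

```python
# Pythagorean winning percentage for a team of AAA-level players.
def pythag_pct(runs_scored, runs_allowed, exponent=1.83):
    """Pythagorean W%; the 1.83 default exponent is an assumption."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

LEAGUE_RPG = 5.0   # run environment given in the comment
MLE = 0.80         # AAA-to-MLB equivalency given in the comment

runs_scored = LEAGUE_RPG * MLE    # 4.0 runs per game
runs_allowed = LEAGUE_RPG / MLE   # 6.25 runs per game

pct_183 = pythag_pct(runs_scored, runs_allowed)
pct_2 = pythag_pct(runs_scored, runs_allowed, exponent=2.0)
print(f"exponent 1.83: {pct_183:.3f}")
print(f"exponent 2.00: {pct_2:.3f}")
```

With exponent 1.83 the result is within a point or so of .307; with exponent 2 it drops to about .291. Either way it stays near .300.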

#9    Patriot, 2006/10/03 @ 09:35 PM

Ah, the old curse of the OW% terminology coming back.  A .350 team doesn’t have a .350 offense and a .350 defense.

When people talk about a .350 offense, this is a level of offense that would produce a .350 W% coupled with an average (.500) defense.  With Pyth exponent 2, this corresponds to about 73% of the league average runs scored.  For a defense, it’s about 137%. 

A team that scores runs at 73% of the average and allows runs at 137% of the average will have a .220 W%.  So a whole team of “.350” players would be a .220 team.
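
A minimal sketch of that calculation, using the Pythagorean exponent of 2 stated above; the function names are made up for this example. It carries the unrounded figures through, so the results land slightly off the rounded 137% and .220.

```python
# Convert a ".350 offense" into a run ratio, then build a whole team from it.
def run_ratio_for_pct(pct, exponent=2.0):
    """Runs-scored / runs-allowed ratio that yields pct under Pythagoras."""
    return (pct / (1.0 - pct)) ** (1.0 / exponent)

def pythag_pct_from_ratio(ratio, exponent=2.0):
    """Pythagorean W% for a given runs-scored / runs-allowed ratio."""
    return ratio ** exponent / (ratio ** exponent + 1.0)

offense = run_ratio_for_pct(0.350)   # runs scored as a fraction of league average
defense = 1.0 / offense              # runs allowed as a fraction of league average
team_pct = pythag_pct_from_ratio(offense / defense)

print(f"offense: {offense:.1%} of average runs scored")
print(f"defense: {defense:.1%} of average runs allowed")
print(f"team W%: {team_pct:.3f}")
```

Unrounded, this gives roughly 73% and 136%, and a team W% of about .225.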

#10    David Smyth, 2006/10/04 @ 06:40 AM

At post #6, Davenport tries to explain his choice.

#11    , 2006/10/04 @ 07:05 AM

I’m particularly annoyed by the low level of fielding replacement. It skews how a reader sees the process of accumulating value.

For instance, my favorite example. Barry Bonds has 1720 BRAR and 355 FRAR for a total of 2075 RAR. Which means that 17% of his total value above replacement is fielding. Now I think Barry’s been a very good, maybe great fielder for a long time during his career, but how likely is it that a corner outfielder who hits like Ruth and Williams is creating 17% of his value from fielding?

Or Gil Hodges: 512/253 BRAR/FRAR. 765 total RAR, 33% of them from defense. Hodges’ offense is generally overstated by the old-guard reliance on RBI and his defense is often talked about in glowing terms, but I just can’t imagine any All-Star-level 1B’s value being one third defensive.

Computes, but does not compute.

#12    Tangotiger, 2006/10/04 @ 07:10 AM

I don’t really have a problem with that.  It may very well be that, in isolation, Clay’s numbers in terms of the breakdown of Bonds and Hodges off/def value is correct.

The problem is that you can’t then just add the two numbers.  We seem to have this fascination to make sure things add up, and if they make sense in isolation, then, adding up should also make sense.

It doesn’t.

#13    David Smyth, 2006/10/04 @ 07:21 AM

My problem is that there is no such thing as *the* replacement level. There is only some variable baseline which is most appropriate for some given study or evaluation.

It certainly may be true that the .400 type baseline is more often appropriate than any other, but that doesn’t mean that it is *the* replacement level.

#14    , 2006/10/04 @ 07:26 AM

I don’t really know the mathematical mechanics behind WARP so I should probably keep my mouth shut, but maybe someone can help me understand. Let’s return to Bonds for the moment.

1720 BRAR, 355 FRAR, 2075 RAR when added, 220 WARP1. So if you divided the RAR by the WARP you get about 9.4 RAR/WARP. Which seems in the ballpark. Again, I don’t know how he goes from BRAR(P) and FRAR(P) to WARP, the method’s not detailed in the glossary, but the presentation of the numbers strongly suggests that there’s some additive process going on followed by a runs-per-win calculation.
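
For anyone who wants to check those figures, here is a short sketch. The inputs are the BRAR, FRAR, and WARP1 totals quoted in this thread; the function name is made up, and simply adding BRAR and FRAR is the assumption under discussion, not a documented part of the WARP method.

```python
# Fielding share of total runs above replacement, and implied runs per win.
def fielding_share(brar, frar):
    """Fraction of (BRAR + FRAR) that comes from fielding."""
    return frar / (brar + frar)

bonds_share = fielding_share(1720, 355)    # Bonds: 1720 BRAR, 355 FRAR
hodges_share = fielding_share(512, 253)    # Hodges: 512 BRAR, 253 FRAR
bonds_runs_per_win = (1720 + 355) / 220    # Bonds: 2075 RAR over 220 WARP1

print(f"Bonds fielding share:  {bonds_share:.1%}")
print(f"Hodges fielding share: {hodges_share:.1%}")
print(f"Bonds RAR per WARP1:   {bonds_runs_per_win:.1f}")
```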

#15    David Smyth, 2006/10/04 @ 07:31 AM

Dr. C, I don’t think anyone is saying that Davenport is making a math error. I think they’re saying that his choice will have little practical value in solving realistic baseball problems or questions.

#16    , 2006/10/04 @ 07:48 AM

David,

I didn’t intend to imply he was making a math error, sorry if that’s how it came out. What I’m trying to say is that it does look like he’s asking us to add things up, but when you add them up as presented, the results (33% fielding value for Hodges), while internally correct within the WARP system, don’t jibe with what we know about the relative importance and value of the tasks that first basemen perform (mashing and catching other infielders’ throws) and how well they generally perform them.

And I agree with the statement that .350 players on offense and defense don’t make a .350 team, so therefore, I’m led to believe that Davenport is over- or understating either batting or fielding relative to the other. At that point, I’m inclined to believe that the defensive replacement value is more likely to be the culprit in the misbalancing than the offensive one.

Hope that’s clearer, sorry for any confusion.

#17    David Gassko, 2006/10/04 @ 07:58 AM

The problem, Dr. Chaleeko, is that there is no such thing as offensive and defensive replacement level. There are replacement-level PLAYERS, and that’s all that matters.

#18    Tangotiger, 2006/10/04 @ 08:00 AM

David, agreed that there is no “the” replacement level.

A team replacement level of .150 means, in effect, a nonpitcher replacement level of .270 and a pitcher replacement level of .330. 

In “runs” terms, that’s a pitcher who allows 6.5 RPG, while his offense and fielding are league average, in a 4.5 RPG environment. 

For a nonpitcher, providing replacement level “off+def” with average pitching, that means a team of such nonpitchers will score 2.9 runs, and allow 5.0 runs (again assuming the average pitcher will allow 4.5 RPG with average fielders). Or, 3.2 for the hitters, and 5.5 for the defense.
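
One way to sanity-check those run figures against the .330 and .270 levels is a Pythagorean estimate, sketched below. The exponent of 1.83 is an assumption (the comment does not say which runs-to-wins conversion was used), and the function name is illustrative.

```python
# Check the quoted run figures against the quoted winning percentages.
def pythag_pct(runs_scored, runs_allowed, exponent=1.83):
    """Pythagorean W%; the 1.83 default exponent is an assumption."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

# Replacement pitcher: average 4.5 RPG offense behind him, 6.5 RPG allowed.
pitcher_pct = pythag_pct(4.5, 6.5)

# Team of replacement nonpitchers with average pitching: 2.9 scored, 5.0 allowed.
nonpitcher_pct = pythag_pct(2.9, 5.0)

print(f"pitcher:    {pitcher_pct:.3f}")
print(f"nonpitcher: {nonpitcher_pct:.3f}")
```

Under that assumption both results come out within about a hundredth of the .330 and .270 levels.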

Do pitchers who allow TRUE TALENT 6.5 RPG see the light of day in MLB?  Sure, I suppose.  But, not for long.  For every TRUE TALENT 6.5 RPG pitcher, there’s probably at least 10 of them in the minors better than him.  It’s either poor evaluation, or “back against the wall” time.  Is this a valid baseline?  I don’t think so, since this guy is preventing a better pitcher from pitching, in effect, causing negative wins. 

It’s like making Bill O’Reilly the replacement-level talking head, and you think anyone else smarter than him adds to your mind.  In fact, O’Reilly prevents other people from being there, and listening to O’Reilly actually makes you dumber, having wasted your life absorbing anything he said, then trying to shed it.

If Clay wants to say that anyone above O’Reilly is positive, then he should say so.  Most people would draw the line far above that, as a minimum level.  In an emergency, and if I’m too lazy to get the remote, and if my kid changed channels from Nick to Fox News, then ok, O’Reilly is my replacement level.

#19    dq, 2006/10/04 @ 08:15 AM

But if he is a .400 hitter and a .400 fielder, then he is a .307 player. A perfectly balanced .307 team has .400 offense, .400 defense, and .400 pitching. Each player is .400/.400, or a .307 player.

#20    Tangotiger, 2006/10/04 @ 08:44 AM

dq: the point is that the .307 nonpitcher is too low.  It should be .380.  That’s around a .395 offense, .485 fielding.

A .300 team is .380 nonpitcher, and .380 starter, and .470 reliever.  Or, .380 nonpitcher, .410 pitcher.  Or, .395 offense, .395 defense.
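
These combinations can be checked with a short sketch. It assumes the components combine by multiplying their odds against a .500 baseline (an odds ratio approach); the function name is made up for illustration. The last line applies the same assumption to the .270 nonpitcher and .330 pitcher from comment #18.

```python
# Combine component winning percentages by multiplying their odds.
def combine_odds(*pcts):
    """Multiply the odds p / (1 - p) of each component, then convert back."""
    odds = 1.0
    for p in pcts:
        odds *= p / (1.0 - p)
    return odds / (1.0 + odds)

nonpitcher = combine_odds(0.395, 0.485)     # offense and fielding
team_a = combine_odds(0.380, 0.410)         # nonpitcher and pitcher
team_b = combine_odds(0.395, 0.395)         # offense and defense
warp_team = combine_odds(0.270, 0.330)      # levels implied by a .150 team

print(f"nonpitcher:           {nonpitcher:.3f}")
print(f"team (.380 and .410): {team_a:.3f}")
print(f"team (.395 and .395): {team_b:.3f}")
print(f"team (.270 and .330): {warp_team:.3f}")
```

Under that assumption the nonpitcher comes out near .380, both team versions near .300, and the .270/.330 pairing near .150.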

#21    dq, 2006/10/04 @ 11:01 AM

Two questions to help me out here:

How is the .485 fielding derived?

How is the pitching/fielding split worked so that a .410 pitcher with a .485 fielding gets .395?

Thanks

#22    David Smyth, 2006/10/04 @ 02:07 PM

I was browsing that BBTN N Silver chapter alluded to by D Gassko. He presents 2 graphs:
1) the chance of making the postseason at each team win level
2) the economic value to a team of each incremental win

In the latter case, it’s a bell curve with a peak at around 89 wins. In the former case, it’s an S curve with a maximal change point also at around 89 wins.

But in both cases, the graphs appear to reach their low points at 78 wins. That’s 3 wins below avg, or about -30 runs.

That is, a team shows some tangible benefit, in terms of postseason probability or income, from being -25 runs or better--but does not produce a tangible loss from being -35 runs or worse--all relative to this -30 run center.

So, maybe that is another way to define repl level, as the zero point in terms of the real-world impact on the team. Using the standard technique of substituting a repl level player into an avg team, that would imply a level of -30 runs (more precisely, -3 wins) per 162G--even though the FAT level might only be -20 per 162, or so.

Just an idea, from seeing those graphs.

#23    tangotiger, 2006/10/04 @ 02:45 PM

dq, using the Odds Ratio method.  However, a quick shortcut is the differential method:

.500 + (.410 - .500) + (.485 - .500) = .395

***

The .485 was solved for, to ensure that both off and def came out to .395.
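
Here is a small sketch comparing the two methods on these numbers. The odds ratio version multiplies each component's odds against a .500 baseline, which is an assumption about how the method is applied here; the function names are illustrative.

```python
# Compare the differential shortcut with an odds ratio combination.
def differential(*pcts, baseline=0.500):
    """Shortcut: add each component's distance from the baseline."""
    return baseline + sum(p - baseline for p in pcts)

def odds_ratio(*pcts):
    """Multiply the odds p / (1 - p) of each component, then convert back."""
    odds = 1.0
    for p in pcts:
        odds *= p / (1.0 - p)
    return odds / (1.0 + odds)

shortcut = differential(0.410, 0.485)   # pitching and fielding
full = odds_ratio(0.410, 0.485)

print(f"differential: {shortcut:.3f}")
print(f"odds ratio:   {full:.4f}")
```

The two agree to within about a thousandth for these inputs, which is why the shortcut is good enough near .500.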

#24    dq, 2006/10/05 @ 08:21 AM

At first glance the .485 seems high, and I don’t like to defend it just because it works. Does it make sense? Is a replacement fielder almost as good as a regular fielder?

So, I did a mini test. I took all AL ss from 2000-2005 (Lahman Database), and computed assists/total outs. By team, I compared the players with the most total outs (the starter) to the second player, and calculated that as a ratio.
For teams where the 3rd player played at least 200 total outs (~ 8 games), I also computed their ratio to the starter.

Since I’m comparing within teams, I should be okay not to worry about park effects, GB/FB, or LH/RH. I used the 200-out cutoff for the 3rd player to eliminate guys with 1 inning played at the position. I didn’t do that for the backup, so I don’t eliminate a superstar from the base. I used SS since it’s a premium defensive position.

Based on these numbers, the backup was actually 1.036 better than the starter. The 3rd guy was .984 as good as the starter.

So, .485 makes sense.

So, a replacement team is .309, based on Det, Ariz - Big Unit, and Tigers minus Higginson.

Offense is .395, which is ~ to .79 MLE for AAA
Pitching is .41, based on various studies.
Fielding is .485, which fits the equation, and is validated by 2nd & 3rd string SS.

#25    Tangotiger, 2006/10/05 @ 08:56 AM

The important thing to remember, always, is that we have replacement players, not replacement hitters and replacement fielders.  This isn’t the NFL.  So, the .485 fielding that I used has to be taken in the context of the .395 hitter.  There are hundreds of MLB players that are worse than .485 fielders, and there are a handful that are worse than .395 hitters.

So, .380 SP, .395 hitting, .470 RP, .485 fielders.

Furthermore, we should always remember this:
http://www.tangotiger.net/talent.html

And finally, we must always remember that performance data is OBSERVATION, which means that Observed = True + Luck, which means that you must always try to figure out how much luck influenced the observation.
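
To get a feel for the size of the luck term, here is a minimal sketch. It assumes each game is an independent coin flip weighted by the team's true talent (a binomial model), which is a simplification introduced for this example rather than something stated in the post; the function name is made up.

```python
import math

# Standard deviation of observed W% due to luck alone, under a binomial model.
def luck_sd(true_pct, games=162):
    """SD of observed winning percentage for a team of known true talent."""
    return math.sqrt(true_pct * (1.0 - true_pct) / games)

sd_500 = luck_sd(0.500)   # true-talent .500 team over 162 games
sd_300 = luck_sd(0.300)   # true-talent .300 team over 162 games

print(f".500 team: {sd_500:.3f} ({sd_500 * 162:.1f} wins)")
print(f".300 team: {sd_300:.3f} ({sd_300 * 162:.1f} wins)")
```

Under that model, one standard deviation of luck is around six wins over a full season.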

Once you appreciate all this, you will see right away what level a team of replacement-level players will perform at, and it’ll be close to .300. If someone wants to say, based on all this, that it’s really .270 or .330, fine. That’s a quibble. .150 needs far more explanation.
