THE BOOK--Playing The Percentages In Baseball


Wednesday, February 14, 2007

How reliable are PECOTA Forecast Percentiles?

By Tangotiger, 03:22 PM

Chris Carpenter is arguably the second best pitcher in baseball.  Santana is first obviously, and after him you have a cast of characters, like Carpenter, Halladay, Oswalt among others.

Five forecasting tools are about equally impressed with him.  His K minus BB per 9 IP is forecast as:


5.51 (ZIPS)
5.47 (Chone)
5.36 (Marcel)
5.08 (PECOTA)
4.73 (Bill James)
---------------
5.23 Average

They all love him.  He is 51-18 in his last 3 years.  If there’s one thing we can be certain about, it is that Chris Carpenter is one of the best pitchers in baseball.

Adam Wainwright looked great last year.  But are we as sure about his greatness as we are about Carpenter’s?  Of course we can’t be.  We’ve got history to judge Carpenter, and we don’t have anywhere near as much with Wainwright.  On top of that, since pitchers perform much differently as starters than as relievers, this adds another level of uncertainty.

His K minus BB per 9 IP is forecast as:

5.66 (Chone)
4.74 (Bill James)
4.57 (Marcel)
4.36 (ZIPS)
4.33 (PECOTA)
-------------
4.73 Average

How they see them
Marcel is a great barometer, since we know exactly how it’s calculated, and we know exactly what kind of assumptions it makes.  We see here that Chone expects better performances from these two guys than Marcel does, while PECOTA expects worse performances than Marcel does.  In effect, Chone sees their historical numbers as being more “real” than does PECOTA.  Marcel has properly valued them, given the limited amount of data it looks at.  Chone has decided, by looking at additional data, that they’re better.  PECOTA, also looking at additional data, has decided that they are worse.

Bill James (or I guess BIS) has decided that Carpenter isn’t as good as his numbers, and Wainwright is better than his numbers.  ZIPS takes the opposite viewpoint.

All fun and games really, and not really germane to this blog entry.  What is more important is that the standard deviation of the forecasts is much smaller with Carpenter than it is with Wainwright: 0.29 to 0.49.

This is of course expected, and any other result would have been viewed with skepticism.
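
For readers who want to check the spread figures, here is a minimal sketch that recomputes them from the five forecasts listed above.  It assumes the population standard deviation (dividing by 5, not 4), since that is the convention that reproduces 0.29 and 0.49; the function name `summarize` is just for illustration.

```python
from statistics import mean, pstdev

# K minus BB per 9 IP forecasts quoted in the post.
carpenter = {"ZIPS": 5.51, "Chone": 5.47, "Marcel": 5.36,
             "PECOTA": 5.08, "Bill James": 4.73}
wainwright = {"Chone": 5.66, "Bill James": 4.74, "Marcel": 4.57,
              "ZIPS": 4.36, "PECOTA": 4.33}


def summarize(forecasts):
    """Return (mean, population standard deviation) of the forecast values."""
    values = list(forecasts.values())
    return mean(values), pstdev(values)


for name, forecasts in (("Carpenter", carpenter), ("Wainwright", wainwright)):
    avg, spread = summarize(forecasts)
    print(f"{name}: mean {avg:.2f}, SD {spread:.2f}")
```

With the sample standard deviation instead, both figures would come out somewhat larger, but the ordering (Carpenter tighter than Wainwright) would not change.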

Reliability
Marcel also supplies “reliability” scores for each forecast, with Carpenter at 0.83, just a shade below the leader, Santana, at 0.84.  What the reliability score measures is how much of a player’s stats we can trust.  In the case of Carpenter, we regress his historical data 17% toward the league mean.  Not much really.  Wainwright on the other hand has a reliability score of 0.46, meaning we regress over 50% of his stats toward the league mean.  Qualitatively, this makes perfect sense.  While the exact numeric values of the reliability scores may not be obvious, the fact that we get such a wide difference is.
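
To make the regression step concrete, here is a minimal sketch of how a reliability score turns into an amount of regression.  The reliability values (0.83 and 0.46) come from the post; the league mean and the observed rate are made-up numbers for illustration, and the function is a generic weighted blend, not Marcel’s actual code.

```python
def regress_to_mean(observed, league_mean, reliability):
    """Blend an observed rate with the league mean.

    `reliability` is the share of the observed rate we trust; the
    remaining (1 - reliability) is pulled toward the league mean.
    """
    return reliability * observed + (1.0 - reliability) * league_mean


# Hypothetical inputs, for illustration only (not actual Marcel inputs).
league_mean = 3.0  # assumed league K minus BB per 9 IP
observed = 6.0     # assumed weighted historical rate for both pitchers

for name, reliability in (("Carpenter", 0.83), ("Wainwright", 0.46)):
    estimate = regress_to_mean(observed, league_mean, reliability)
    print(f"{name}: regress {1 - reliability:.0%} toward the mean -> {estimate:.2f}")
```

Giving both pitchers the same observed rate isolates the effect of reliability: the less reliable record ends up much closer to the league mean.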

Enter the PECOTA percentile forecasts.  PECOTA does something similar to Marcel, and further expands it by introducing performance lines for various percentile levels.  If we focus on the 25th and 75th percentiles, PECOTA tells us that it’s 50% sure that Wainwright’s K minus BB per 9 IP will be between 4.53 and 3.97, or a gap of 0.56.  And with Carpenter?  Between 5.33 and 4.58, or a gap of 0.75.

Isn’t that the opposite of what we expected?  And if we look at the “equivalent peripheral” ERA, Carpenter is 3.18 to 4.37, or a difference of 1.19, while Wainwright is 3.73 to 4.96, or a difference of 1.23.  They both have essentially the same level of uncertainty.  And, of course, that can’t be.

Why?
Why does PECOTA do that?  I suspect that PECOTA first establishes the comparable pitcher list, and once that is established, the unreliability of the comps is thrown out the window.  That is, with Carpenter, we have a solid track record as to how good he is.  We’re pretty sure of it.  The “0.83” of Marcel.  With Wainwright, not so much (Marcel reliability of “0.46”).  But, once the comp list is created, the uncertainty of those comps is likely no longer considered.

Now, PECOTA does know more about Wainwright than Marcel, since Marcel is intentionally oblivious to minor league performance, and PECOTA is not.  But, how much higher can that reliability estimate go?  0.55?  0.60?  Marcel has 135 pitchers with a reliability estimate of at least 0.70.  If we consider all minor league performances as well, can Wainwright jump into that pool of pitchers?  I don’t see how he could.

Finally, is being 50% sure that a pitcher will have a peripheral ERA within a 1.20 range something great?  Let’s assume you have a pitcher with a true OBP allowed of .340, and he will face 700 batters.  What would the binomial give us in terms of the 25th and 75th percentile levels?  That would be .328 and .352, which translates into an ERA of 4.13 and 4.90, or a difference of only 0.77.  If you were 100% sure that a guy had a .340 OBP, you’d expect his peripheral ERA to land within that 0.77 range 50% of the time.
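
The binomial figures above can be reproduced with a short sketch.  It assumes the normal approximation to the binomial (reasonable at 700 batters faced); the function name `obp_quartiles` is made up for illustration, and the OBP-to-ERA translation is not shown because the post does not give its formula.

```python
from math import sqrt
from statistics import NormalDist


def obp_quartiles(true_obp, batters_faced):
    """25th and 75th percentiles of observed OBP from random variation alone.

    Uses the normal approximation to the binomial: the observed rate has
    standard deviation sqrt(p * (1 - p) / n).
    """
    sd = sqrt(true_obp * (1.0 - true_obp) / batters_faced)
    dist = NormalDist(mu=true_obp, sigma=sd)
    return dist.inv_cdf(0.25), dist.inv_cdf(0.75)


low, high = obp_quartiles(0.340, 700)
print(f"{low:.3f} to {high:.3f}")  # 0.328 to 0.352
```

This range reflects sampling noise only, with the true talent level assumed known; any uncertainty about the true mean widens it.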

But, we are not 100% certain of our true mean.  It seems to me that our uncertainty level around our true mean must be fairly high, if we can be only 50% sure that a pitcher’s peripheral ERA will fall within a 1.20 ERA range.

In short, I see no reason to believe that the forecast ranges of PECOTA actually represent what they purport to.  And I see no reason to believe that PECOTA’s uncertainty about a player’s forecast depends on how much information we have about that pitcher.

#1    Jerome      (see all posts) 2007/02/14 (Wed) @ 18:15

“If we focus on the 25th and 75th percentiles, PECOTA tells us that it’s 50% sure that Wainwright’s K minus BB per 9 IP will be between 4.53 and 3.97, or a gap of 0.56.”

The percentile forecast isn’t supposed to be used in that fashion.

“The percentile forecast is designed to work for the Key Statistic (EqR/27 and EqERA) only. If a player’s 90th percentile forecast for home runs is 42, this should not be read to mean that he has a 10% chance of hitting 42 home runs (or more). Rather, it means that he has a 10% chance of having a season as valuable as the line represented by the 10th percentile forecast, whether this comes from the particular combination of peripheral statistics listed in the percentile line, or an equally valuable (but different) combination of statistics.”
http://tinyurl.com/2trvhu

The BETA score is more akin to MARCEL’s “reliability”. 

BETA is “[A] measure of the relative volatility of a player’s EqA or EqERA forecast, as determined from his comparables.  The Beta for an average major league player is 1.00; players with Beta’s higher than 1.00 have more volatile forecast than others ("riskier"), while Betas lower than 1.00 represent less volatile forecasts ("less risky")."
http://tinyurl.com/36q6vx

Wainwright’s BETA is .82, Carpenter’s 1.09.  Johan’s BETA is 1.00, while Halladay is at .82.


#2    mb22      (see all posts) 2007/02/14 (Wed) @ 19:09

Wouldn’t PECOTA’s somewhat unreliable forecast of Carpenter be due to his relatively strange career path, and a list of comps ranging from guys like Clemens to Erik Hanson?

On a side note, it’s also interesting to look at Carpenter’s BABIPs in Toronto (usually well over .300) compared to St. Louis (about .284 so far).

Anyway, good stuff. Keep up the great work.


#3    tangotiger      (see all posts) 2007/02/14 (Wed) @ 20:06

“The percentile forecast isn’t supposed to be used in that fashion.”

Yes, it is.  A peripheral (or component) ERA is made up of ... components.  And the two most important ones are K and BB, which is why I chose them.

In any case, the gap in peripheral ERA still stands.

***

As for the Beta, there’s no possible reason that Wainwright’s Beta would be lower (less risky forecast) than Santana’s.  You just proved my point.

***

Carpenter’s last 3 years represent the bulk, if not the totality, of his performance data.  How he got to his Cardinals days seems to be irrelevant to Pecota (and Marcel, and most other forecasts).

In any case, you can go up and down the line for any pitcher or any team, and you will find that the peripheral ERA at the 75th percentile will be around 70% of his peripheral ERA at the 25th percentile.  (Picking Cards pitchers at random: Izzy is 68%, Looper 78%, Mulder 69%, Reyes 74%, Wells 70%)

If that’s the case, then what’s the point?

***

Check out this player with zero MLB experience:
http://www.baseballprospectus.com/pecota/GARCIA19860708A.php

The 25 to 75 percentile peripheral ERA is between 4.20 and 4.94! 

How is it even possible that the certainty level for this pitcher is so much stronger than for any other Cardinal pitcher?
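
A minimal sketch that puts the ranges side by side, using only the peripheral ERA figures quoted in the post (Carpenter, Wainwright) and in this comment (Garcia).  The helper name `ratio_and_gap` is made up for illustration.

```python
# (75th-percentile, 25th-percentile) peripheral ERA pairs quoted above.
# For ERA, the 75th percentile is the better (lower) number.
ranges = {
    "Carpenter": (3.18, 4.37),
    "Wainwright": (3.73, 4.96),
    "Garcia": (4.20, 4.94),
}


def ratio_and_gap(era_75th, era_25th):
    """Return (75th / 25th ratio, 25th minus 75th gap) for one pitcher."""
    return era_75th / era_25th, era_25th - era_75th


for name, (era_75th, era_25th) in ranges.items():
    ratio, gap = ratio_and_gap(era_75th, era_25th)
    print(f"{name}: ratio {ratio:.0%}, gap {gap:.2f}")
```

Carpenter and Wainwright land near the 70% to 75% mark, while Garcia, with no MLB experience, comes out at about 85%, the narrowest band of the three.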


#4    mb22      (see all posts) 2007/02/14 (Wed) @ 20:27

This is just somewhat of a guess, but look at the difference in BABIP with Carpenter between the 25th and 75th percentiles. It goes from .273 to .299. That seems to be a much wider gap than for pretty much all of the other guys I’ve looked at. Take a guy like Santana (.272 to .282) or Oswalt (.286 to .300). Perhaps PECOTA doesn’t think he can keep that kind of BABIP because of his former struggles in Toronto, or something else? It seems that may be causing the wider variance in expected performance.


#5    Rally      (see all posts) 2007/02/14 (Wed) @ 21:57

I have the best projection for Wainwright only because I made the decision to project him as a reliever.  Some of the other projections - certainly PECOTA - may be looking at him as a starter.  PECOTA has him at 151 innings.

I agree with your reservations about the percentiles.  I don’t think there’s any way we can have as reliable a forecast for a 19 year old who’s never pitched above A as we can for a major league veteran.


#6    tangotiger      (see all posts) 2007/02/14 (Wed) @ 21:59

If you read the blurb from Jerome in #1, the percentile is based SOLELY on the EqERA (i.e., peripheral ERA).  PECOTA is not doing percentiles for each of the components.

In any case, it seems the BABIP is based more on the team’s fielders, since almost all the Cards pitchers show the huge split you are describing.... except of course for Garcia, he of no MLB experience.

We should make a fine distinction between the mean forecasts, which are good, from the percentile forecasts, which are unproven, and therefore, unreliable.


#7    tangotiger      (see all posts) 2007/02/15 (Thu) @ 09:52

Timely article on forecasters:

http://sports.espn.go.com/espn/page2/story?page=keri/070214


#8    Nate Silver      (see all posts) 2007/02/15 (Thu) @ 17:34

I’m probably not going to get involved in this debate at length, but a couple of points to note:

1. Wainwright is an odd data point to use because we’re having the system perform a relief-to-starter conversion on him.  We don’t try and look at other pitchers that made the relief to starting jump, but instead project Wainwright as a reliever and then basically apply a linear multiple to his forecast based on his peripheral statistics.

2. EqERA is not the same as PERA.  Rather, it’s ERA normalized for league and park effects.  PECOTA can and does account for systematic differences between ERA and component statistics.

3. The Beta scores perform intuitively far more often than not.  Of the 15 pitchers with the highest Betas, 12 are pitchers who have never pitched in the majors.  The others are Randy Johnson, Jon Lieber, who is old and has a BB rate that is fairly historically unprecedented, and Jesse Foppert. 

4. I think people are overestimating the amount of information that a well-designed forecasting system needs.  Numbers like strikeout rate, walk rate, and groundball rate stabilize very fast.  If you plot information on one axis and forecast accuracy on the other, the curve is very steep at first but becomes flat sooner than you’d think.

5. PECOTA looks at usage data as well as rate performance when selecting its comparables, including major league career length.  Many of Wainwright’s comparables, for example, were pitchers entering their second or third full big league season.

6. Very good pitchers like Johan Santana and Chris Carpenter are inherently more unusual than garden variety pitchers like Adam Wainwright.

7. When we’ve actually tested the percentile ranges for PECOTA, we’ve found that they’ve done a very good job by and large (though better for hitters than for pitchers).


#9    tangotiger      (see all posts) 2007/02/15 (Thu) @ 17:53

Nate, thanks for stopping by. 

I think your #5 is important.  Wainwright is rather old to start being such a good pitcher in the majors (he’ll be 26 in August).  So, it’s a given that his comps will be guys with multiple years of MLB experience behind them.

However, that doesn’t mean that his forecast range should be similar to everyone else’s.  The uncertainty level around a forecast is, predominantly, based on how large your sample is.  And Wainwright, regardless of whether he scored comparable to lots of guys or few of them, still only has 318 TBF.

However, if you include minor league data, where a guy gets to add tons of TBF prior to joining the bigs, and therefore gets a larger sample, then the uncertainty will go down.

At the same time, any minor league data behind used will itself have greater uncertainty than MLB data.

***

In your testing of the PECOTA ranges, did you test between rookies and sophs, compared to the rest?  I’d love to see whatever testing you have done.


#10    tangotiger      (see all posts) 2007/02/15 (Thu) @ 17:54

behind used = being used


#11    tangotiger      (see all posts) 2007/02/15 (Thu) @ 18:40

Another test would be to look at all those guys with a narrow peripheral ERA range (like Garcia) and at all those guys with the widest peripheral ERA range.

And when you look at the actual results, did the guys in the narrow range actually produce ERAs within a much narrower band than the guys in the wide range?

***

Your statement about comparing hitters and pitchers seems curious.  If you test for the 25th to 75th percentile, you should expect to get 50% within that range, regardless of whether the player is a hitter or a pitcher.  If you don’t, you adjust the percentile levels to ensure that.

In short, hitters or pitchers, experienced or inexperienced, wild thrower, or contact hitter, speedsters, or power pitchers, all should fall within the percentile ranges, at the group level.  That’s the real test.
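
A minimal sketch of that group-level test: for each subgroup, count how often the actual result landed inside the forecast 25th to 75th percentile range.  Every row below is made up for illustration, the group labels are hypothetical, and this is not how PECOTA was actually tested.

```python
from collections import defaultdict


def coverage_by_group(records):
    """Share of actual results inside the forecast range, per group.

    Each record is (group, range_low, range_high, actual), where the range
    runs from the lower to the upper bound of the 25th-75th percentile band.
    """
    inside = defaultdict(int)
    total = defaultdict(int)
    for group, range_low, range_high, actual in records:
        total[group] += 1
        if range_low <= actual <= range_high:
            inside[group] += 1
    return {group: inside[group] / total[group] for group in total}


# Hypothetical rows, for illustration only.
records = [
    ("experienced", 3.20, 4.40, 3.90),
    ("experienced", 3.70, 4.90, 5.10),
    ("inexperienced", 4.20, 4.95, 5.60),
    ("inexperienced", 3.90, 4.70, 3.10),
    ("inexperienced", 4.00, 4.80, 4.40),
    ("inexperienced", 4.10, 4.90, 5.30),
]
print(coverage_by_group(records))
```

If the ranges are well calibrated, every group should come out near 0.50 given a large enough sample; a group sitting well below that would mean its ranges are too narrow.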


#12    David Gassko      (see all posts) 2007/02/15 (Thu) @ 18:59

Nate,

Have you tested percentiles based on comparable players versus percentiles based on a player’s component statistics (you could choose many different methods here, from a simple binomial random error, to some more complex regression-based analysis)? Intuitively, it doesn’t seem clear to me that the comparable players would do a better job, given the small sample size that’s inherent in your comparable samples, that’s exacerbated when you divide it into smaller groups for percentiles.


#13    Nate Silver      (see all posts) 2007/02/15 (Thu) @ 19:53

"The uncertainty level around a forecast is, predominantly, based on how large your sample is.”

See, this is where I disagree.  Or at least I potentially disagree.  There are a whole host of factors that can impact variability above and beyond sample size: for example a player’s age, what type of player he is, and the robustness of his comparables set.  In fact, once you get past a certain minimum level of sample size, I suspect that these considerations tend to be more important than the sample size itself (especially if PECOTA is also able to model the sample size in its comparable selection to a certain extent, which it does).  I don’t have any real problem with Adam Wainwright’s sample size, for example, as long as you’re able to account for his minor league numbers in a reasonable way.  The bigger problem with Wainwright is unrelated to sample size, which is his starter-to-reliever-to-starter conversion.

When we looked at the percentiles, this was focused on MLB data.  The hitter percentiles were almost perfect.  The pitcher percentiles underestimated the number of extreme performances at the 10th and 90th percentiles by about 5%, but this had to do with the way that PECOTA was “mapping” the comparable performance and should since have been corrected.  I don’t know whether they held up for any given subgroup of pitchers or hitters; we didn’t get that far.  But there definitely are systematic differences in the empirical data.  For example, power hitters have larger forecast ranges than contact hitters, and players with speed have less variability.


#14    tangotiger      (see all posts) 2007/02/15 (Thu) @ 22:28

Good stuff.

I think we probably disagree.  The sample size is the largest determinant, but you also have age, since, as we know, the slope of the aging curve is steepest as a player is young (pre 25), and is steep the other way when he is old (post 35).  So, the really young and old players have an extra adjustment that expands on our uncertainty.  But, nothing, I don’t think, to match the uncertainty of the sample size.

As for accounting for the size of minor league performance, this would be interesting to determine.  For example, if you have 300 PA in MLB and in minor leagues, they can’t possibly both provide the same level of uncertainty.  The minors have the extra adjustment factor (quality level is different, plus an additional translation to go from one league to another, and an age adjustment if you go beyond same-year).  So, these adjustment factors add to our uncertainty level.  How much, I don’t know.  Maybe someone else can tackle this as a project.

I suspect that 300 PA in MLB will be equivalent to 450 or 500 PA in minors, when accounting for the uncertainty level.

Fun stuff for sure.

***

Can you explain the narrow band for such a young and inexperienced pitcher like Garcia?


#15    obsessivegiantscompulsive      (see all posts) 2007/02/16 (Fri) @ 05:02

FYI:  to clarify for those who haven’t bought the book, the pitching projections in the Bill James Handbook are not Bill James, per se, though are in his book.  It notes in the section’s introduction that Bill James does not believe that pitching numbers can be projected and so he had nothing to do with the pitching projections.  They are the “brainchild of John Dewan” with help from others at BIS.


#16    Tangotiger      (see all posts) 2007/02/20 (Tue) @ 12:37

Nate Silver responds in a chat, likely to a reader of my blog posting:

Marcel (The Zoo): No one understands or uses the percentiles in Pecota. Why do you keep them?

Nate Silver: Marcel,

A lot of us do use the percentile forecasts—they’re useful for example in describing just how unusual a certain performance might have been. People thought that A-Rod had a really terrible season last year, for example, but his performance was right in line with his 25th percentile forecast, hardly anything too out of line.

FWIW, when we looked at the percentile distributions a couple of years back, we found that they did a nearly perfect job for hitters. That is, a player would exceed his 90th percentile forecast about 10 percent of the time, his 75th percentile forecast about 25 percent of the time, and so forth. Pitchers did nearly as well, except that guys were exceeding their 10th and 90th percentiles a little too often—e.g. there were a few more extreme performances than PECOTA was giving credit for. I think I’ve found the reason why this was the case and it’s since been corrected.

Nate is correct that it does describe how unusual something is.  However, I disagree about the need to have an individualized range, since I believe that range is predicated on the amount of historical data far, far more than on the profile of the player.  This is the “agree to disagree” until someone--him, me, or someone else--comes along to set the other straight. 

Alf (Cambridge, MA): Will you be comparing how Pecota fared against other projection systems again?

Nate Silver: Basically, every study I’ve seen, whether it’s ours or someone else’s, has shown PECOTA at the top of the heap. But I’m less interested in looking at this stuff than you might think. The differences in forecast accuracy for major league players are going to be relatively small, provided that a system is able to do forecasts with some minimum level of competency. From a research standpoint, I’m more interested in things like the prospect forecasts and the five-year forecasts, where there’s much more room for marginal gains, as well as the valuation elements of PECOTA.

I agree with Nate’s position here.  The real test is the guys on the periphery.  For example, Marcel has Pujols at .331/.424/.635, while ZIPS has him as .320/.423/.643.  Regardless of how Pujols does, neither Marcel nor ZIPS will lay a claim on having better forecast his season.

A great test would be Hanley Ramirez: .298/.361/.482 for Marcel v .277/.338/.439.  Marcel intentionally ignores minor league stats, because he’s lazy.  ZIPS, Pecota, Chone, Bill James use minor league data.  Each uses them in different ways.  That’s the fun part.

Since Nate was kind enough to participate in a project I’ll be studying (which will include the above named forecasts, plus Pete Palmer, and MGL), we’ll see just how each system treats the young and inconsistent ballplayers.

Part of Nate’s position on caring though belies this:

COMING SOON Complete depth charts and forecasts for AL and NL pitchers and hitters using Baseball Prospectus’ deadly-accurate PECOTA projection system--

If someone is making such a grandiose claim, it would behoove them to back it up every year.  So, either you care and you need to back it up, or you don’t care and don’t need to back it up.  Either way works for me.


#17    The Real Neal      (see all posts) 2007/02/24 (Sat) @ 13:17

Some other factors that I have never seen anyone attempt to quantify in projections, but that theoretically could be:

1. Quality of Coaching - Is there a Duncan factor?
2. Situational Stress - Are you the 7th batter for a crappy team or the third place hitter for a low scoring playoff contender?
3. Defense to Hitting Carryover- Error rate and K rate: to what degree do they go hand-in-hand?
4. Weather- Some like it hot
5. Synergy - Playing in front of a good defense, should not only decrease your BABIP allowed, but decrease your walk rate and consequently increase your innings pitched.

I have to disagree that sample size is the key to ‘reliability’, because sample size in actuality is always 0. Since you have no players with the exact same statistical record to make comparisons, you’re already fudging your sample size by relaxing the standards.  If sample size was the most important component, just make your sample every pitcher who ever played.  At that point, you may as well just use 3-year averages.


#18    tangotiger      (see all posts) 2007/02/24 (Sat) @ 13:32

You don’t need “an exact”.  Your sample size is your sample size.  Even if you have 10 guys who were “exactly” like Chris Carpenter, and none who were like Roy Oswalt, your sample size is your sample size.  And, the uncertainty level, which is really what you are talking about, would almost certainly be very similar, since having 10 guys exactly like Carpenter still won’t trump your sample size.


#19    The Real Neal      (see all posts) 2007/03/11 (Sun) @ 09:31

Anytime you use statistical analysis to predict human behavior, you’re on shaky ground.  Did you ever read The Foundation Series by Isaac Asimov?  He thought it was possible to predict trends in human behavior, given a sample size of 10’s of billions.  But now we’re talking about samples of 10’s and 20’s, the using of which would get you an F in a college statistics course.  That’s why it’s fair to give the probability percentiles, set aside from confidence intervals.  Of course when looking at them as a whole they tell you, if the player has a good year he is going to be good, if a player has a bad year he is not going to be good.  Which most people know anyway.

All that being said, I think any local baseball fan who follows the daily press clippings and has a basic understanding of how chance plays a part in a player’s season is likely to give you a better forecast for an individual player than any of these ‘mass’ systems will.  They’ll certainly give you better counting stats predictions than Pecota, because Pecota is terrible at projecting innings and PA’s.


#20    tangotiger      (see all posts) 2007/03/11 (Sun) @ 12:25

We’re about to find out, when I launch the Community Forecasts next week.

