THE BOOK--Playing The Percentages In Baseball

Thursday, March 11, 2010

Forecasting Home Runs in 2009

By Tangotiger, 06:04 PM

From 2006-2008, the MLB home run leaders were: Ryan Howard (58), Alex Rodriguez (54), and Ryan Howard (48).  It seems safe to say that the league leader in home runs in 2009 should have been somewhere close to 50.  But whom could we have picked in March of 2009?


There were 31 players who hit at least 35 home runs at least once during those three years:

n35    HR3    HR162    Player
1    87    30.9    Bay, Jason
1    101    36.1    Beltran, Carlos
1    108    38.2    Berkman, Lance
1    71    43.0    Braun, Ryan
1    97    33.3    Cabrera, Miguel
2    100    36.6    Delgado, Carlos
3    120    42.7    Dunn, Adam
1    106    40.8    Dye, Jermaine
1    112    38.8    Fielder, Prince
1    83    40.2    Giambi, Jason
1    85    34.5    Glaus, Troy
1    90    30.7    Gonzalez, Adrian
1    71    34.2    Hafner, Travis
1    64    28.7    Hall, Bill
1    95    33.2    Holliday, Matt
3    153    52.2    Howard, Ryan
1    70    31.3    Jones, Andruw
1    88    34.4    Konerko, Paul
1    97    36.3    Lee, Carlos
1    51    37.3    Ludwick, Ryan
2    112    42.5    Ortiz, David
1    78    43.5    Pena, Carlos
2    118    42.3    Pujols, Albert
1    50    34.2    Quentin, Carlos
1    91    34.2    Ramirez, Aramis
2    92    36.2    Ramirez, Manny
3    124    43.9    Rodriguez, Alex
1    108    40.9    Soriano, Alfonso
1    81    29.5    Swisher, Nick
1    73    34.7    Thomas, Frank
2    111    44.5    Thome, Jim

To read this chart: Ryan Howard hit at least 35 HR three times (n35), for a total of 153 home runs from 2006-2008 (HR3).  His average HR rate per 700 plate appearances, the equivalent of a full 162-game season (HR162), was 52.2.

It’s safe to say that if we had to guess the home run leader for 2009, it would most likely be one of these 31 players. Just for the sake of illustration, let’s put in some semi-intelligent odds of each player winning the HR title in MLB in 2009 as follows:

n35    HR3    HR162    Odds    Player
3    153    52.2    10%    Howard, Ryan
1    112    38.8    6%    Fielder, Prince
3    120    42.7    6%    Dunn, Adam
3    124    43.9    6%    Rodriguez, Alex
1    71    43.0    6%    Braun, Ryan
2    118    42.3    6%    Pujols, Albert
1    78    43.5    4%    Pena, Carlos
2    111    44.5    4%    Thome, Jim
1    106    40.8    4%    Dye, Jermaine
1    97    33.3    4%    Cabrera, Miguel
2    100    36.6    4%    Delgado, Carlos
1    108    38.2    4%    Berkman, Lance
1    101    36.1    4%    Beltran, Carlos
2    92    36.2    2%    Ramirez, Manny
1    108    40.9    2%    Soriano, Alfonso
1    90    30.7    2%    Gonzalez, Adrian
1    51    37.3    2%    Ludwick, Ryan
2    112    42.5    2%    Ortiz, David
1    83    40.2    1%    Giambi, Jason
1    95    33.2    1%    Holliday, Matt
1    91    34.2    1%    Ramirez, Aramis
1    85    34.5    1%    Glaus, Troy
1    87    30.9    1%    Bay, Jason
1    97    36.3    1%    Lee, Carlos
1    88    34.4    1%    Konerko, Paul
1    50    34.2    1%    Quentin, Carlos
1    81    29.5    1%    Swisher, Nick
1    64    28.7    1%    Hall, Bill
1    73    34.7    1%    Thomas, Frank
1    71    34.2    1%    Hafner, Travis
1    70    31.3    1%    Jones, Andruw
            9%    Someone else
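
A quick sanity check on the odds table, as a minimal Python sketch: the tiers below are copied from the table, and the grouping into a `tiers` dictionary is only an illustrative layout, not anything the article prescribes.

```python
# Illustrative odds (in percent) grouped by tier, copied from the table above.
tiers = {
    10: ["Howard"],
    6: ["Fielder", "Dunn", "Rodriguez", "Braun", "Pujols"],
    4: ["Pena", "Thome", "Dye", "Cabrera", "Delgado", "Berkman", "Beltran"],
    2: ["Ramirez (Manny)", "Soriano", "Gonzalez", "Ludwick", "Ortiz"],
    1: ["Giambi", "Holliday", "Ramirez (Aramis)", "Glaus", "Bay", "Lee",
        "Konerko", "Quentin", "Swisher", "Hall", "Thomas", "Hafner", "Jones"],
}
someone_else = 9  # percent reserved for a player not on the list

listed_players = sum(len(names) for names in tiers.values())
total_odds = sum(odds * len(names) for odds, names in tiers.items()) + someone_else

print(f"{listed_players} listed players, total odds = {total_odds}%")
```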

The total obviously has to come out to 100%.  If we look at Pujols, we gave him odds of 6%, which is tied for the next-highest number after Ryan Howard’s 10%.  If the top-end expectation for Pujols is roughly 50 HR, then the average HR expectation will obviously be less than 50 HR.  Let’s give Pujols this kind of HR expectation, again, purely for the sake of illustration:

50+    6%
45-49    9%
40-44    12%
35-39    15%
30-34    18%
25-29    12%
20-24    10%
15-19    8%
10-14    5%
5-9    3%
0-4    2%

That seems like a reasonable kind of range.  It includes his chance of injury or a possible bad year (for him).  And it includes the chance of him winning the HR crown.  The average of the above is 31 HR.  So, when you look at a forecast for the number of HR for Pujols, and you see “31”, that number actually means “I have no idea how many HR he will hit, other than it will be centered around 31, give or take 20 or 30 HR”.  And that’s pretty much the best we can do.
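
As a rough check on that 31 HR average, here is a minimal Python sketch that takes each bucket at its midpoint. The midpoint of 52 for the open-ended 50+ bucket is an assumption of this sketch, not something stated in the article; the probabilities are the illustrative ones from the table.

```python
import math

# (bucket midpoint, probability) pairs from the illustrative table.
# The 50+ bucket has no stated ceiling, so its midpoint of 52 is an assumption.
buckets = [
    (52, 0.06), (47, 0.09), (42, 0.12), (37, 0.15), (32, 0.18), (27, 0.12),
    (22, 0.10), (17, 0.08), (12, 0.05), (7, 0.03), (2, 0.02),
]

total_prob = sum(p for _, p in buckets)
mean_hr = sum(mid * p for mid, p in buckets)
# Spread of the distribution around its mean (standard deviation).
sd_hr = math.sqrt(sum(p * (mid - mean_hr) ** 2 for mid, p in buckets))

print(f"probabilities sum to {total_prob:.2f}")
print(f"mean = {mean_hr:.1f} HR, sd = {sd_hr:.1f} HR")
```

The mean lands at about 31 HR, and the spread is wide enough that the single number says little on its own.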

Can we prove that?  A simple forecasting system I developed is called Marcel The Monkey Forecasting System, or The Marcels, for short.  It’s named after the monkey from the TV show Friends.  I also like the name Marcel for the hockey great Marcel Dionne so even if the name looks dated, you can think of Dionne instead.  Anyway, Marcel listed 13 players as having a forecasted mean of 28 or more home runs for the 2009 season.  Here are those hitters:

40    Howard, Ryan
32    Rodriguez, Alex
32    Fielder, Prince
32    Dunn, Adam
32    Braun, Ryan
31    Pujols, Albert
31    Pena, Carlos
30    Thome, Jim
29    Dye, Jermaine
28    Delgado, Carlos
28    Cabrera, Miguel
28    Berkman, Lance
28    Beltran, Carlos

Now, remember what I said, and this is important: we are NOT forecasting Pujols to hit 31 HR in 2009.  We forecasted him to hit 31 HR give or take 20 or 30 HR.  You apply that same kind of thinking for each of the above players.  And, we are NOT forecasting Ryan Howard to lead the league with 40 HR.  We ARE forecasting SOMEONE to hit around 50 HR.  And these guys are among our best bets.  With the top-end of each of these hitters close to 50 HR, obviously the average will be much lower.

How many HR did these players hit in 2009? 

47    Pujols, Albert
46    Fielder, Prince
45    Howard, Ryan
39    Pena, Carlos
38    Dunn, Adam
34    Cabrera, Miguel
32    Braun, Ryan
30    Rodriguez, Alex
27    Dye, Jermaine
25    Berkman, Lance
23    Thome, Jim
10    Beltran, Carlos
4    Delgado, Carlos

As you can see, it runs the gamut from Delgado’s 4 to Pujols’ league-leading 47.  These 13 hitters were forecasted to hit a combined 401 HR in 2009.  And how many HR did they actually hit in 2009?  400.  That’s right, Marcel nailed it.
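
The group totals can be re-added with a short Python sketch. The numbers are copied from the two lists above; the mean absolute error is an extra summary added here for illustration and is not a figure the article reports.

```python
# Marcel's 2009 forecast means and the actual 2009 totals, from the lists above.
forecast = {
    "Howard": 40, "Rodriguez": 32, "Fielder": 32, "Dunn": 32, "Braun": 32,
    "Pujols": 31, "Pena": 31, "Thome": 30, "Dye": 29, "Delgado": 28,
    "Cabrera": 28, "Berkman": 28, "Beltran": 28,
}
actual = {
    "Pujols": 47, "Fielder": 46, "Howard": 45, "Pena": 39, "Dunn": 38,
    "Cabrera": 34, "Braun": 32, "Rodriguez": 30, "Dye": 27, "Berkman": 25,
    "Thome": 23, "Beltran": 10, "Delgado": 4,
}

total_forecast = sum(forecast.values())
total_actual = sum(actual.values())
# Average size of the individual misses, ignoring direction.
mean_abs_error = sum(abs(forecast[n] - actual[n]) for n in forecast) / len(forecast)

print(f"forecast total = {total_forecast}, actual total = {total_actual}")
print(f"mean absolute error per player = {mean_abs_error:.1f} HR")
```

The totals agree almost exactly even though the individual misses are large, which is the point: each forecast is the center of a wide distribution.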

So, the forecasting systems work… if you know how to properly interpret what it is they are trying to tell you.

#1    Barry    2010/03/11 (Thu) @ 22:42

For me, your illustrative distribution for Pujols tells a lot that most non-stats folks won’t get (and I see this in many more places than just baseball)… I think people, in general, don’t understand that when you “project” Pujols to hit 31 HR that you really only think he has ~20% chance of being within a couple HR of that actual number… If someone knows how to convey the distribution behind the projection in a way that won’t make non-stats people’s eyes glaze over, I’d be interested in hearing about it…


#2    Zach    2010/03/12 (Fri) @ 00:09

#1/Barry: I think it’s quite simple. CHONE projects 39 HR in 634 PA for Pujols, which gives a standard deviation of almost 6 HR exactly. So he has a ~16% chance to hit 45+ HR, a 16% chance to hit 33 or fewer, and a 68% chance to hit 33-45 HR. And he has a ~2.5% chance to hit either 51+ or <27 HR.


#3    Michael Bodell    2010/03/12 (Fri) @ 03:22

I don’t think you need to be a stats person in general to appreciate probability.  Make the simplifying assumption that each of Pujols’s PA is identical in its chance of producing a HR, which, while obviously false, hopefully doesn’t affect the estimate too much.  Then, doing what Zach does, 39 HR in 634 PA means that the chance he hits a HR in any given PA is about 6.151%.

Now we model it as a coin that comes up heads 6.151% of the time (HR) or tails the rest of the time (not HR).  Or, to be even more general but give people a better intuitive feel if they play cards, this is about the chance of you drawing a card from a normal deck and it being a non club Ace (so one of the spade ace, heart ace, or diamond ace).  So if you shuffle a deck of cards and draw one card and say Pujols hits a HR if you get a non club Ace and a non-HR otherwise you could experiment yourself drawing 634 PA worth of card draws (being sure to reshuffle after each simulated PA).

Instead of simulating it by shuffling a lot of decks of cards, we can use not too advanced math to calculate the answer.  Showing our steps, the variance therefore is 634 * 6.151% * (1-6.151%) = about 36.6 and the standard deviation is the square root of this which is about 6.05.
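
The arithmetic in that step can be reproduced with a few lines of Python. This is only a sketch of the binomial calculation described in the comment, using the 39 HR and 634 PA figures quoted there.

```python
import math

pa = 634   # projected plate appearances, as quoted in the comment
hr = 39    # projected home runs, as quoted in the comment

p = hr / pa                    # per-PA home run probability
variance = pa * p * (1 - p)    # binomial variance: n * p * (1 - p)
sd = math.sqrt(variance)       # standard deviation in home runs

print(f"p = {p:.5f}, variance = {variance:.1f}, sd = {sd:.2f}")
```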

And if we then use a lookup table for the normal distribution (the Wikipedia entry on standard deviation, for example), we can see that we predict Pujols will hit the following (the lower limit of each range is calculated in parentheses, using the mean estimate of 39 and the standard deviation of 6):

>57 HR around 0.1% of the time (39 + 3 * 6)
51-57 HR around 2.1% of the time (39 + 2 * 6)
45-51 HR around 13.6% of the time (39 + 1 * 6)
39-45 HR around 34.1% of the time (39)
33-39 HR around 34.1% of the time (39 - 1 * 6)
27-33 HR around 13.6% of the time (39 - 2 * 6)
21-27 HR around 2.1% of the time (39 - 3 * 6)
<21 HR around 0.1% of the time

That, I think, is the sort of explanation that can work for a non-stat person.  If you want you can even go into the caveats and complications with a paragraph like:

There were a couple of assumptions that went into this that are clearly wrong, which make the extreme estimates especially a little off.  The first one is that obviously facing Lincecum in SF for a PA has a different chance than facing the 5th starter for Colorado in Colorado does.  So some PA are less likely to be HR than our 6.151% estimate and some are more likely.  This will influence the estimates only slightly.  The bigger problem is that these numbers assume that Pujols gets exactly 634 PA.  If somehow he gets 675 PA he is much more likely than 0.1% to break 57.  On the other hand, the more realistic scenario is that if Pujols gets injured he is quite unlikely to get 634 PA, and if he only gets 400 PA, say, then the chance that he gets fewer than 21 HR is much more than 1 in 1000.
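
The band probabilities and the playing-time caveat in this comment can both be checked with a short Python sketch. It assumes the rounded mean of 39 and standard deviation of 6 for the normal bands, and an exact binomial with the same per-PA rate for the hypothetical 400 PA season; the helper names `normal_cdf` and `binom_cdf` are introduced here for illustration.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

mean, sd = 39.0, 6.0
edges = [21, 27, 33, 39, 45, 51, 57]   # mean plus or minus 1, 2, 3 sd
band_probs = {
    (lo, hi): normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)
    for lo, hi in zip(edges, edges[1:])
}
below_21 = normal_cdf(21, mean, sd)
above_57 = 1.0 - normal_cdf(57, mean, sd)

for (lo, hi), prob in band_probs.items():
    print(f"{lo}-{hi} HR: {prob:.1%}")
print(f"<21 HR: {below_21:.1%}, >57 HR: {above_57:.1%}")

# Caveat check: same per-PA rate, but an injury-shortened 400 PA season.
p = 39 / 634
short_season = binom_cdf(20, 400, p)
print(f"P(fewer than 21 HR in 400 PA) = {short_season:.1%}")
```

Under these assumptions the short-season probability comes out far above the 0.1% tail that the fixed 634 PA model gives, which is the commenter's point about playing time.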


#4    JD    2010/03/12 (Fri) @ 04:57

I think the problem is that the projections when reported on Fangraphs or wherever else just give a stat line. It looks a lot like an actual stat line from a previous year. It doesn’t say “there’s an X% chance of this happening.” It just says “we project this.”

So that’s the first problem. What many of us presume when we get projections (even if we don’t know exact percentage of likelihood, a guy like me who understands this stuff but doesn’t do all the math knows that a 31 HR projection doesn’t mean 31 homers on the nose) is NOT what the lay person presumes. Why? Well, the nature of the “prediction.” When that loudmouth on the Boo-ya network gives a football score prediction, that’s the score he thinks will happen. There’s no “give or take a few points” with it (and hey, if he’s even kinda right, it counts as FIVE wins!). People think the same is true about baseball stat projections.

The next problem is this: So you’ve gotten the lay person to realize the stat projection has a wide range of possibilities. Their reaction is then, “So why is this useful?” If you tell a guy “No, I’m not saying Pujols will hit 31 homers. I’m saying he’s going to hit between 6 and 56” they’re going to say “I didn’t need a projection system to know that.”


#5    2010/03/12 (Fri) @ 10:11

@ Barry/#1:
PECOTA does this fairly well with their ‘percentile’ predictions. If you are not familiar, it provides 25%/50%/75%/90% projections, which encompass the range of possible outcomes for a given player.


#6    2010/03/12 (Fri) @ 10:26

All,

Thanks… I think JD/#4 said more eloquently and went a step further with what I was trying to get at in his last paragraph…

Also, my understanding is that the PECOTA percentiles have not been (publicly) shown to be meaningful… Although, I like the idea of presenting a range of outcomes…


#7    Tangotiger    2010/03/12 (Fri) @ 11:42

That is correct: the PECOTA percentile ranges, while they look fine overall, are biased.  They show similar ranges for rookies as they do for veterans under 30 and for veterans over 30.

The actual range needs to be based on the uncertainty of our estimate.  And the uncertainty of our estimate is far higher for Montero and Wieters than it is for Chase Utley and Robinson Cano and Nick Markakis and their peers.


#8    Rally    2010/03/12 (Fri) @ 11:53

I had the percentiles last year.  It’s something I haven’t had time to reproduce for this season, though maybe I’ll do it next year.  A study on how well it works is incomplete.

http://www.baseballprojection.com/pujolal0047.htm

So this could be interpreted that Pujols had an 80% chance to hit 25-45 homers.  The way I did it shows the range as being much wider on the downside of 50% than the upside. That’s because of playing time.  His baseline projects 149 games played.  He can top that by 13 games at most (14 if the Cards tie for a playoff spot).  On the downside, he could get hurt and miss the whole season.


#9    Greg Rybarczyk    2010/03/12 (Fri) @ 12:29

I think the system needs to be tweaked at the low end. 

The 2009 Marcel for Juan Pierre predicted 3 home runs.  He had only hit 1 home run in 2007-08, in 1,043 AB’s.

This is obviously an artifact of the regression applied to prior results.

Now, of course no one really cares how many homers guys like Pierre hit, any more than they care how many bases David Ortiz steals… so maybe I shouldn’t say the system needs to be tweaked, but it does fly in the face of the explanation that the predicted value is the midpoint of all the possibilities…


#10    Tangotiger    2010/03/12 (Fri) @ 12:44

Take the most egregious case: someone hits 0 HR in 2100 PA in the last 3 years.  I am going to regress 12% toward the mean.

The mean will be about 17 HR in 700 PA.  And so, he’ll get 2 HR as his forecast.

In terms of a linear method, I have to accept that.  Now, if you are suggesting that the mean should be 0.1 rather than 2.0, that’s fine.  I’ll agree with you, especially if you tell me this hitter is a fielding wizard who plays SS and weighs 150 lbs, and who has 5 career doubles.

The problem is that you introduced something (the name Juan Pierre) that conjures something to a baseball fan that Marcel is unaware of.  You are using more information than Marcel, and so, if you use that information in an intelligent fashion, you’d have to be more accurate than Marcel.
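
The regression arithmetic in this comment can be sketched in Python. This covers only the linear regression step with the 12% and 17 HR per 700 PA figures quoted above; it is not the full Marcel system, and the function name `regress_to_mean` is a hypothetical helper introduced for illustration.

```python
def regress_to_mean(observed_rate, league_rate, regression):
    """Blend an observed rate with the league rate.

    regression is the fraction of weight given to the league mean.
    """
    return (1.0 - regression) * observed_rate + regression * league_rate

# The extreme case from the comment: 0 HR in 2100 PA over three years.
observed_hr, observed_pa = 0, 2100
observed_per_700 = observed_hr / observed_pa * 700   # HR per 700 PA
league_per_700 = 17.0                                # league mean quoted above
regression = 0.12                                    # 12% toward the mean

forecast = regress_to_mean(observed_per_700, league_per_700, regression)
print(f"forecast = {forecast:.2f} HR per 700 PA")
```

A hitter with no home runs at all still gets a forecast of about 2, because a linear method can only pull toward the league mean it was given.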


#11    Rally    2010/03/12 (Fri) @ 13:39

CHONE knows a few things about Pierre that Marcel does not.  He’s listed between 180-185 pounds depending on the source.  He hits more groundballs than most players.  But even 180 pounders who hit an above average amount of ground balls hit some homers, so there’s still a good bit of regression.

The end result is I project 2 HR instead of Marcel’s 3.

The downside is almost nonexistent.  So he might hit 2 homers fewer than I project.  But he might shock the world and top the projection.  David Eckstein once hit 8 in a season.  Freddie Patek once hit 3 in a game.  Strange things can happen.


#12    Tangotiger    2010/03/12 (Fri) @ 14:33

It’s also important to note that I cannot make Marcel smarter, because then someone would make a system dumber than Marcel.

The minimal level of competence is: past performance, regression toward the league mean, and age.  I don’t want to make anything better than that and call it Marcel.


#13    Rally    2010/03/12 (Fri) @ 16:23

“because then someone would make a system dumber than Marcel.”

then?  According to JCross’s results they are already up to the task.  Including yours truly on pitcher walk rates.


#14    The A Team    2010/03/12 (Fri) @ 17:36

Good post.  I drafted up a similar post recently about the purpose/intention/interpretation of a projection system that’s set to run sometime next week.  After reading this, I threw a graph of the Pujols chart into the text to illustrate the point I continuously repeated...a projection is a distribution.

#4, I answered the question of ‘Why important’ by talking about expectations.  Projection systems allow us to look at more than just the players we watch 162 times a year and form expectations for them.  This might not be important for a fan who only watches one team, but increasingly, more and more fans follow multiple teams and those that participate in fantasy leagues want to have some realistic expectation for ALL relevant players.  Lastly, it can help temper our enthusiasm/distaste for our own club’s studs and busts.

It’s not literally important to have player projections, it’s just a preference.  As fans we can go into the season without any projections and be no worse for the wear.  Humans just like to project things.

