THE BOOK--Playing The Percentages In Baseball

Thursday, April 12, 2007

Groups of Players and Regression Toward the Mean

By , 06:36 AM

I want to talk a little bit about a misunderstood or perhaps overlooked concept in statistics as it relates to baseball (or perhaps baseball as it relates to statistics) and why it is important.  One of the hard-core stat guys who frequent or lurk on this site may have to help me out with some of the nuts and bolts but I think I have a pretty good handle on the gist of the matter.


Remember that regression towards the mean depends upon two things and two things only.  One, the sample size, and two, the spread (variance) of true talent in the population.  As each one approaches infinity, regression approaches zero, regardless of the magnitude of the other one, and as each one approaches zero, regression approaches 100%, also regardless of the magnitude of the other one.  And of course, as both get larger together, regression gets smaller, and vice versa.
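One common way to make those limits concrete is the textbook reliability model, in which the share of an observed deviation you keep is r = talent variance / (talent variance + noise variance / n), and the regression is 1 - r. The sketch below assumes that model, which is not spelled out above. The function name and the example numbers (a binomial per-AB variance and a made-up talent spread) are illustrative only.

```python
def regression_fraction(n, talent_var, noise_var_per_trial):
    """Fraction of an observed deviation from the mean to regress away.

    Assumes the simple reliability model
        r = talent_var / (talent_var + noise_var_per_trial / n)
    with regression = 1 - r.
    """
    if n <= 0 or talent_var <= 0:
        return 1.0  # no data, or no spread in talent: regress all the way
    signal = talent_var
    noise = noise_var_per_trial / n
    return 1.0 - signal / (signal + noise)


# Illustrative inputs only: per-AB binomial variance for a .260 hitter
# and an assumed (not measured) spread of true BA talent.
noise_per_ab = 0.260 * (1 - 0.260)
talent_var = 0.018 ** 2

for n in (10, 300, 2000, 1_000_000):
    print(n, round(regression_fraction(n, talent_var, noise_per_ab), 3))
```

As n grows the regression shrinks toward zero, and as either n or the talent spread shrinks toward zero the regression climbs toward 100%, which is the behavior described above.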

Whether you fully comprehend or are able to digest that, as you read this, watch TV, eat a sandwich, and yell at the kids all at the same time, is not that important.  Know, however, that this is a very important concept in baseball (actually all sports) analysis.  Without a basic knowledge of this, you are going to be wrong in just about everything you think you know about team and player talent, projecting the likelihood of future events, etc.

But that is not exactly what I want to talk about.

Let’s say that you have an unknown (you know nothing about him other than his BA) player who hits .300 in 2000 AB’s.  What is your estimate of his true BA?  Again, that depends on two things: the number of AB’s or sample size – in this case, a pretty good amount, and two, the spread of BA talent in the population.  Without trying to figure that out (which is doable of course) we can simply (more or less, as in sports we always have selective sampling and survival bias issues that reduce random or representative samples to non-random and non-rep ones) look at “time period” to “time period” BA correlations among a large group of players and then extrapolate that number to players with 2000 AB’s.  As it turns out, among batters with around 300 some odd AB’s, you get a correlation of about .36, so that for a batter with 2000 AB’s, you get a correlation of around .77 or a regression of around .23 (remember that regression is 1-r given a roughly normal distribution).  So that .300 batter regresses to .291 assuming that league average is .260.  IOW, that is our estimate of this player’s true BA talent and that is what we would predict him to hit at any time in the future, assuming everything else, including his true talent, remains the same.
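Here is a rough sketch of that arithmetic. It assumes the correlation scales with sample size as r = n / (n + k), which is one standard way to do the extrapolation and may not be exactly how the numbers above were produced. Because "300 some odd" is not an exact figure, the 350 AB used below is an assumed stand-in, and both function names are made up for illustration.

```python
def extrapolate_r(r_known, n_known, n_target):
    """Scale a correlation from one sample size to another, assuming r = n / (n + k)."""
    k = n_known * (1 - r_known) / r_known  # sample size at which r would equal .50
    return n_target / (n_target + k)


def regress_to_mean(observed, mean, r):
    """Estimate true talent by keeping a fraction r of the deviation from the mean."""
    return mean + r * (observed - mean)


# 350 AB is an assumption standing in for "300 some odd".
r_2000 = extrapolate_r(r_known=0.36, n_known=350, n_target=2000)
estimate = regress_to_mean(observed=0.300, mean=0.260, r=r_2000)
print(f"r at 2000 AB = {r_2000:.2f}, estimated true BA = {estimate:.3f}")
```

With those inputs the result lands close to the .77 correlation and the .291 estimate quoted above. A different assumed starting sample size moves the answer by a point or two.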

O.K. fair enough.  But what if we have a bunch of players, say an entire team, who hit a collective .300 in a collective 2000 AB’s?  Let’s call that 10 or so full-time players on a typical team around 2 months into the season.  Your first instinct might be to think that the regression would be the same, and that we would project those same players to hit .291 for the rest of the season – after all, we still have a bunch of players who are part of the league where the average BA is .260 and we still have 2000 AB’s.  Well, that first instinct would be wrong - dead wrong.  Why?  It is because the spread of talent in the population, which is one of the two criteria that determine the amount of regression, is not nearly the same in the first example (the one batter) as it is with the second (a collection of 10 batters).

In order to get a handle on a difficult concept, it is often quite useful to imagine an extreme situation, but one in which the parameters are essentially the same. I do this all the time when trying to sort out the answer to a particular problem or question.  Let’s say that our collection of players is comprised of the whole league (starters at least), around 300 full-time position players.  And let’s say that they hit a collective .300 in their first 10 AB’s (3 days or so into the season).  That is a total of 3000 AB’s, a pretty large sample of performance. I think we know intuitively that we would not expect them to hit .290-something for the remainder of the season, which is what our single player model would predict.  Why is that?  Again, it is because our second parameter for determining the amount of regression is much different than in the single player model.  In fact, we know that the spread of talent of ALL PLAYERS within the population of ALL baseball players (technically in this example, it is 300 starters among many more players) is by definition zero.  So our regression is exactly 100% and we expect our 300 players to hit .260 for the remainder of the season.
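To put rough numbers on the group case, the sketch below assumes the group is a random draw of players from the league, each with an equal share of the at-bats. Real teams are not random draws, so treat this as a simplification. Under that assumption the spread of group-average talent shrinks in proportion to the number of players, and the single-player relationship r = n / (n + k) becomes r = n / (n + m*k) for m players. The constant k = 600 AB is an assumed value chosen because it roughly reproduces a .77 correlation at 2000 AB for one player.

```python
K_SINGLE = 600  # assumed: AB at which one player's r would be .50


def group_regression(total_ab, n_players, k=K_SINGLE):
    """Regression toward the mean for the pooled BA of n_players random players.

    Assumes equal AB per player and a random draw from the league, so the
    variance of group-average talent is 1/n_players of the single-player variance.
    """
    r = total_ab / (total_ab + n_players * k)
    return 1 - r


cases = [("one player", 2000, 1), ("10-man team", 2000, 10), ("300 starters", 3000, 300)]
for label, ab, m in cases:
    print(f"{label:>12}: {group_regression(ab, m):.0%} regression over {ab} AB")
```

The 300-player case comes out just short of 100% here only because the sketch treats those players as a sample from a larger pool. When the group is the whole population, its average talent is the league mean by definition and the regression is the full 100%.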

Why is understanding this concept important?  Because it allows us to put in perspective sample performances by teams and other groups of players, such that we don’t get too excited when we see especially good or bad performances even over what seems like substantial periods of time (AB’s or IP’s) or large sample sizes.

For example, when we see a team bat .290 in 1000 AB’s, the equivalent of a little more than a month into the season, does that mean that this team is likely a great hitting team (assuming we know little else about them)?  Well, even though we are talking about 1000 AB’s, it is likely that the regression toward the mean on that .290 is pretty substantial.  For an individual player, it might be only 37% or so.  For a team, it is probably closer to 70 or 80% (I don’t really know off the top of my head), maybe more.

Ditto for bullpens.  I should say especially for bullpens! When you see a team’s collective ERA in 250 innings at the all-star break and it is 5.50 or 3.00, do NOT think of that like you would a starting pitcher after one long season.  It is not even close.  It is likely that those same two bullpens will not be that far apart in the second half of the season (maybe ¾ of a run), especially given the large regression even for a single pitcher model.

In fact, I’ll close this little ditty with some real-life examples of bullpen regression.  I looked at all teams’ non-starter innings from 1989 to 2000.  I took their first half ERA’s and divided them up, collectively, into 6 groups from best to worst.  I then looked at each group’s collective second-half ERA.  Keep in mind that truly bad bullpens (true talent-wise) tend to improve (a little), true-talent wise (not including regression toward the mean), and that really good bullpens, true-talent wise, tend to get a little worse as pitchers get injured and replaced by, on average, worse ones.  In any case, I am making the assumption that most (80-90% maybe) of the difference we see from first half to second half is regression to the mean.  Notice the huge number of innings in each half season.

Team seasons   1st-half IP   1st-half ERA   2nd-half IP   2nd-half ERA
46             9741          3.10           11005         3.89
54             11331         3.65           12584         4.14
50             10618         4.09           11492         4.13
50             10479         4.47           10748         4.30
55             11830         4.88           12836         4.51
53             11351         5.72           11600         4.59

Amazingly, there is less than a ¾ run second-half difference between the teams with the worst and best first-half ERA’s, despite a 2.62 run difference in that first half!  That is a 74% regression to the mean in over 200 IP per team.  So the next time you hear an announcer, player, manager, or other “expert” tell us how great or terrible a bullpen (or starting staff or team, etc.) is (and what to expect in the future) after a half season, let alone a few weeks, month, or, if you read the papers, watch TV, or listen to the radio at all, A WEEK OR TWO, take their “wisdom” with a large grain of salt!
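For anyone who wants to check that arithmetic, here is a minimal sketch using only the best and worst groups from the table. It works from the rounded ERAs as printed, so it lands within a point or so of the percentage quoted above rather than matching it exactly.

```python
# Best and worst first-half groups from the table: (1st-half ERA, 2nd-half ERA).
best = (3.10, 3.89)
worst = (5.72, 4.59)

first_half_gap = worst[0] - best[0]
second_half_gap = worst[1] - best[1]

# Share of the first-half gap that disappeared in the second half.
regression = 1 - second_half_gap / first_half_gap

print(f"1st-half gap {first_half_gap:.2f}, 2nd-half gap {second_half_gap:.2f}, "
      f"regression {regression:.0%}")
```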

#1          (see all posts) 2007/04/12 (Thu) @ 09:47

I guess what it comes down to is this: do you think there is any underlying “team ability”?  Or is a team just a collection of individuals?  It matters because if you treat it as a collection of individuals, you end up regressing each guy a lot.  I tend to think this is correct.

In principle, there could be a great pitching coach (or terrible pitching coach, or whatever) that alters the ability of the entire team as a whole.  In practice, I doubt this is often much of a factor and you can get away with treating a team as a collection of individuals.

Bullpen ERAs seem unusually well suited to regressing to the mean.  It’s a bunch of individuals with very small sample sizes and a stat that tends to fluctuate wildly to begin with.


#2    MGL      (see all posts) 2007/04/12 (Thu) @ 17:29

Well, the spread of team talent, be it offense, defense, starting pitching, or bullpen, is definitely NOT based on a random collection of talent from the pool of major and minor league players for various reasons, but the point of the “article” is that team regression and single player regression are completely different animals given the same sample size, something that is easily overlooked and misunderstood when “evaluating” or simply commenting on groups of players while observing their collective performance.


#3    Pizza Cutter      (see all posts) 2007/04/12 (Thu) @ 19:35

What you’re talking about is a hierarchical model.  If we are looking at one player, we are dealing with repeated samplings of the same latent parameter.  If you take it up to the team level, but keep the sample size of AB the same, then you have a two-level model, and you are dealing with two new sources of variance: additional standard measurement error due to smaller sample sizes on each player, and the variance between the players themselves.  If you take it up to the league level, you have a three-level model… More variance means larger standard error means wider confidence interval means more likelihood of regression to the mean.  MGL is right on in calling for caution in interpreting groups of players the same way we would interpret single players.  You’re dealing with two completely different statistical models.

You can empirically test, by the way, to see what percentage of the variance is accounted for by team factors and what by individual factors (how much is a team just a collection of individuals?), by using variance decomposition methods.  (It’s on my to-do list...)


#4    MGL      (see all posts) 2007/04/12 (Thu) @ 20:54

Yes, while I am no statistics maven by any stretch, it seems to me that I can computer sim the variance among random groups of players (I’m sure that can be done theoretically, but I don’t know how to do it), say 10 players per group, and then compare that to the actual variance we see among teams (say their 10 most used players).  The difference should be attributed to how teams put together their rosters (payroll issues, philosophies) and how coaching influences performance.
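A minimal sketch of the random-groups half of that simulation might look like the following. Everything about the league here is hypothetical: 3000 players, true BA talent drawn from a normal distribution around .260, and an assumed .018 spread. Comparing the result with the actual variance among teams would need real roster data, which is not included.

```python
import random
import statistics

random.seed(0)

# Hypothetical league: true BA talent is normal around .260.
# The .018 spread is an assumed value for illustration, not a measured one.
league = [random.gauss(0.260, 0.018) for _ in range(3000)]


def random_group_means(players, group_size, n_groups):
    """Average true talent of n_groups randomly assembled groups."""
    return [statistics.fmean(random.sample(players, group_size)) for _ in range(n_groups)]


player_var = statistics.pvariance(league)
group_var = statistics.pvariance(random_group_means(league, group_size=10, n_groups=5000))

print(f"variance among players:          {player_var:.6f}")
print(f"variance among random 10-groups: {group_var:.6f}")
print(f"ratio: {group_var / player_var:.3f}  (about 1/10 if rosters were random)")
```

If the variance among real teams' 10 most used players came out well above the random-group figure, the excess would be the part attributable to roster construction and coaching.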

Same thing for total team talent, right?  We can look at the variance expected by chance if all teams were random collections of players (would that be the same as the random variance in 162-game w/l records, assuming all teams were equal going into and throughout the season?) and compare that to actual variance in w/l records for a large number of seasons.  The difference is the true variance in team talent.

All in all, I think this is an important concept.  I am surprised it is not getting much discussion.  I may expand and clean up my post and turn it into an article for one of the web sites.  I can’t stand when pundits make sweeping statements about teams, bullpens, starting staffs, etc., based on what look like fairly large sample sizes, but really aren’t.  Even sabermetric sites (e.g., BP) do that all the time (OK, maybe not all the time, but occasionally, depending on the individual writer)!


#5    Pizza Cutter      (see all posts) 2007/04/12 (Thu) @ 21:34

The oft-repeated variance(observed) = variance(true) + variance(expected) isn’t exactly right.  It’s actually true variance plus _error_ variance.  Error variance can include measurement error, expected random variation, and just dumb luck.  The assumption is that as your sample gets bigger, an unbiased measure would lead to a measurement error approaching zero.  Since there’s no accounting for dumb luck, we assume that if the sample is big enough, all that’s left of consequence is the expected variance, which can be calculated and subtracted so that the true variance may be isolated.
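As a rough illustration of that subtraction applied to team w/l records, the sketch below treats the error term as pure binomial luck over a 162-game season between evenly matched teams. That is a simplifying assumption; any other error sources are ignored. The function name is made up, and the 0.070 observed SD is a placeholder, not a measured value.

```python
import math


def true_talent_sd(observed_sd, games=162, p=0.5):
    """Back out the spread of true team talent from the observed spread in win%.

    Uses variance(observed) = variance(true) + variance(error), with the error
    term taken to be binomial luck over `games` games.
    """
    luck_var = p * (1 - p) / games
    true_var = observed_sd ** 2 - luck_var
    return math.sqrt(max(true_var, 0.0))  # clamp in case observed spread is below luck


luck_sd = math.sqrt(0.25 / 162)
print(f"SD of win% from luck alone over 162 games: {luck_sd:.4f}")

# 0.070 is a placeholder for the observed SD of team win%, not real data.
print(f"implied SD of true talent: {true_talent_sd(0.070):.4f}")
```

The subtraction has to be done on variances, not standard deviations, which is why the function squares the observed SD before removing the luck term.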

You can test the assumption that teams are random collections of players (or watch the Cardinals play the Nationals and figure out that they’re not).  You could find the variance in team talent and compare it to the variance of players nested within those teams (and to the error term) to see whether it’s the team-level effects or the player-level effects that drive the findings that you have.  If player-level variables are driving it, then if you’re calculating the regression of a bullpen, you need to do it as a function of each of the players’ regressions.  If team-level variables drive it, a more standard RTM will suffice.  My guess is that player-level variables will dominate.  This can all be done using GLM procedures and variance decomposition.  It sounds like you’re not up to speed on this particular type of analysis, but if you’d like to be, start looking into hierarchical modeling.  It’ll make for some nice light reading before bed!


#6    MGL      (see all posts) 2007/04/12 (Thu) @ 23:00

Thanks PC!  It’s not like I don’t have enough things on my to do list!


#7    Mike Green      (see all posts) 2007/04/13 (Fri) @ 12:30

This does not have much to do with the main point, but bullpen ERA, unlike starter ERA, has only a passing acquaintance with true talent.  You would notice a very significant difference in the rate of regression between bullpen ERA vs. bullpen FIP from the first half to the second. The rates for starter ERA and starter FIP would be, I expect, much more comparable.


#8    MGL      (see all posts) 2007/04/13 (Fri) @ 16:10

Mike, that is a good comment but I am not sure you are correct.  Certainly in the long run, bullpen ERA and FIP will converge just like starter ERA and FIP.  In the shorter run, I don’t think you’ll find much difference either.  I can check though.  In general, the whole idea that ERA for a reliever is not that indicative of talent (and you have to use something like strand rate) is ridiculous and simply not true.


#9    Chris Miller      (see all posts) 2007/04/13 (Fri) @ 17:40

Wouldn’t a reliever just need to be regressed more heavily than a starter because they pitch fewer innings/batters, or would you regress them differently?  I guess I could look for myself.


#10    John Beamer      (see all posts) 2007/04/13 (Fri) @ 17:56

Chris. You have to choose which baseline you regress to. Selection bias would dictate that a starter is generally a better pitcher than a reliever ... however, The Book showed that because of workload and usage relievers can have an ERA advantage of about 1 run over starters.

What I’m saying is that given the right info you’d regress starters and relievers to different means.


#11    MGL      (see all posts) 2007/04/13 (Fri) @ 18:53

I looked at all relievers and starters from 99 to 06 and did a year to year correlation.  Each pitcher must have had at least 50 IP in each of the paired years.  I paired 99 with 00, 00 with 01, etc.  I had 820 pairs of starters and 494 pairs of relievers.  Starters had at least 80% of their games as starters and relievers had at least 80% of their games as relievers.

The y-t-y r’s were as follows:

Starters avg of 172 IP per year

.292 for ERA
.533 for FIP

Relievers avg of 72 IP per year

.238 for ERA
.489 for FIP

Since the correlations are directly related to the number of IP per year of course, in order to compare apples to apples, we need to adjust those r’s for the average number of IP per year in each group.  After doing that, here is what we get (adjusting the relief pitchers’ r’s as if they also pitched 172 IP per year):

Starters avg of 172 IP per year

.292 for ERA
.533 for FIP

Relievers equivalent of 172 IP per year

.428 for ERA
.694 for FIP
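The adjustment method is not spelled out here. One standard way to rescale a correlation to a different workload is the Spearman-Brown relationship, which is equivalent to assuming r = n / (n + k). The sketch below uses it as an assumption, not as a statement of how the numbers above were produced. It comes out within a few thousandths of the adjusted reliever figures listed.

```python
def adjust_r(r, ip_observed, ip_target):
    """Rescale a year-to-year correlation to a different workload.

    Assumes the Spearman-Brown relationship, equivalent to r = n / (n + k).
    """
    ratio = ip_target / ip_observed
    return ratio * r / (1 + (ratio - 1) * r)


# Reliever correlations at an average of 72 IP, rescaled to the starters' 172 IP.
for stat, r in [("ERA", 0.238), ("FIP", 0.489)]:
    print(f"reliever {stat}: r = {r:.3f} at 72 IP -> {adjust_r(r, 72, 172):.3f} at 172 IP")
```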

So it seems as if relievers actually have a much lower regression to the mean given the same number of IP.  I’m not sure why that would be the case.  Maybe it is because starters get injured more often due to a higher number of IP thrown.

