THE BOOK--Playing The Percentages In Baseball


Monday, August 04, 2008

Here is a great article on “regression toward the mean”

By , 01:05 AM

Although I don’t think they once mentioned that term.  Anyway, this article was brought to my attention by one of our faithful readers, Eli Witus.  It is about the origins of the statistical concept of regression to the mean, which is apparently called Stein’s Paradox, more or less.  I had never heard of that.

What impressed me after reading the article was how inexact a science it is, in a manner of speaking, and how recently the concept was embraced (the 1950’s and 1960’s) and has evolved.  I assumed that it had been around for a hundred or more years.

BTW, Eli has a very nice web site/blog for anyone interested in NBA sabermetrics (called APBRmetrics).  Also, he now works for an NBA team, congratulations to him.


#1          (see all posts) 2008/08/04 (Mon) @ 09:08

Very cool!  Do we know what publication that’s from?

I agree completely that you’d think the concept had been around much longer.


#2    Eli      (see all posts) 2008/08/04 (Mon) @ 09:21

It’s from Scientific American in 1977:

http://www-stat.stanford.edu/~ckirby/brad/misc/

I’ve usually seen Galton credited as coming up with regression toward the mean in the late 19th century. I don’t know enough about the history of statistics to know whether Stein’s Paradox was (1) a reinvention of the wheel, (2) a similar but importantly distinct concept, or (3) an advancement on RTM in that it was the first time the concept was formalized mathematically.


#3    Tangotiger      (see all posts) 2008/08/04 (Mon) @ 11:00

Carl Morris, the co-author, has been referenced in either this blog, or my old one at Primer.  He’s done Markov work and 24 base/out states stuff.


#4    Peter Jensen      (see all posts) 2008/08/04 (Mon) @ 11:51

I do like the mental image of W. James being a precocious 12 year old Bill James helping out the elder statistician Stein with the baseball statistics and getting his name on a scientific paper. 

On a more serious note, the term “regression to the mean” as used in real statistics has nothing to do with how it is used in baseball statistics.  It is a completely separate concept that existed before Stein’s paradox and that is why you don’t see that particular phrase used in the article and why the two ideas have separate lineages.

This is probably the proper time and place to propose this.  It is never correct to regress to the league average when trying to project an individual player.  In fact, it is never correct to regress to the average of any group when trying to project an individual player.  When sabermetricians are using “regression to the mean” in this way, what they are really trying to do is make a small sample adjustment between a single year’s data (or any other relatively small sample) and the total amount of information that has been accumulated on the player, usually his career stats or a recent large subset of his career stats.  The correct way to adjust for a single year’s stats is to regress to the analyst’s previous best projection of the player’s “true value”.  The amount to regress should not be based on a specific number derived from the standard deviation of the population, but is dependent on the confidence that the analyst has in his prior projection of the player’s “true value”.  This confidence will vary depending on how much previous data is available on the player and where the player is on the arc of his aging curve.

Heresy, I know, but completely testable.


#5    Tangotiger      (see all posts) 2008/08/04 (Mon) @ 12:35

Please do so.

Because where I come from, it is totally acceptable (and required) to regress a player to the mean of the population he was drawn from.


#6    Sky      (see all posts) 2008/08/04 (Mon) @ 12:43

Peter, wouldn’t our best-guess at a player’s true talent pretty much be league-average?  Or maybe you could regress towards the subset of hitters that are “large” and “slow” and “over 32 years old”?

That is, isn’t league-average a decent stand-in for our best guess at a player’s true talent level?


#7    Eli      (see all posts) 2008/08/04 (Mon) @ 12:48

I don’t think that’s heresy, or even in conflict with the way RTM is used in The Book or on this blog. The standard deviation of the population in a particular stat is used to help estimate how much confidence we have in the player’s past statistical record (we also use the player’s number of plate appearances to estimate the confidence). I don’t see anything wrong with using the spread of talent among similar players to judge how much trust should be put into a player’s stats.

But maybe I’m misunderstanding something.


#8    MGL      (see all posts) 2008/08/04 (Mon) @ 12:54

Peter, I don’t think it is heresy, only that it is completely wrong (what you say).

What we do is exactly Stein’s paradox.


#9          (see all posts) 2008/08/04 (Mon) @ 13:17

Speaking of this… I was just on the Freakonomics blog, and the last graph here caught my eye: http://freakonomics.blogs.nytimes.com/2008/08/04/happiness-inequality-1-the-facts/

Is that not simply an example of regression to the mean?  Can we draw any other inferences at all from that final graph?


#10          (see all posts) 2008/08/04 (Mon) @ 13:39

When you regress, you want to regress based on your best guess of the distribution the sample is drawn from.

That means that if A-Rod goes 0-for-20 to start the season, your estimate of his actual talent will be higher than if John McDonald goes 0-for-20.

If you know absolutely nothing about the player, then you use the entire population of MLB as your best guess, and you regress to the MLB mean.  But if you know something about him, you should regress to your best guess about that particular player.

But isn’t this already what we do?  For instance, one common way of regressing is to take X% of the guy’s last year’s stats, Y% of his this year’s stats, and 100-X-Y% of the league average.

And regression is not always to the mean.  Suppose you have two groups of players.  One group all has talent of .100, and the other all has talent of .300.  The mean is .200.  But if you find a player hitting .280, you will regress him AWAY from the mean, to .300.

You “regress” based on the distribution, not the mean.  We just called it the mean because most of the time the distribution is symmetrical and bell-shaped.
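The two-group example above can be checked with a short calculation. This is only a sketch: the post gives no sample size for the .280 hitter, so the 28-for-100 line below is an assumption, as are the equal group sizes and the helper name `posterior_mean`.

```python
from math import comb

def posterior_mean(hits, at_bats, talents, priors):
    """Posterior mean of true talent under a discrete prior (hypothetical helper)."""
    # Binomial likelihood of the observed batting line at each candidate talent level
    likelihoods = [comb(at_bats, hits) * t**hits * (1 - t)**(at_bats - hits)
                   for t in talents]
    # Bayes: weight each talent level by prior times likelihood, then normalize
    weights = [lk * p for lk, p in zip(likelihoods, priors)]
    total = sum(weights)
    return sum(t * w for t, w in zip(talents, weights)) / total

# Assumed sample: 28 hits in 100 AB (.280); the two groups are assumed equally common
estimate = posterior_mean(28, 100, talents=[0.100, 0.300], priors=[0.5, 0.5])
print(f"estimated true talent: {estimate:.3f}")
```

With that sample the estimate lands essentially on .300, above the observed .280 and away from the overall mean of .200, which is the point being made: the adjustment follows the shape of the distribution, not its mean.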


#11    Peter Jensen      (see all posts) 2008/08/04 (Mon) @ 14:23

Sky - The league average is never the best information about the population that best represents a player, even a rookie.  A player’s fielding position, lineup position, MLE, or weight are going to be more accurate than the league average.  Why would you regress a power hitting first baseman with an MLE of .327 that is going to be a starter batting fourth in your lineup to the same league average as a slick fielding shortstop with an MLE of .290 who is brought up to be a fill-in infield replacement that will bat seventh?  It doesn’t make sense.  Nor is regressing to populations other than the entire league a novel concept.  It is mentioned specifically in “The Book” on page 49, “We note that when we say we are regressing “toward the mean”, we mean the mean of all similar players, and not necessarily the mean of all major leaguers.” It is also what is being discussed on page 126 of the article on Stein’s paradox where the authors suggest that the group of cities east of the Lempos might have a genuinely lower rate of toxoplasmosis, i.e. represent a subgroup of the population of the entire country, and thus Stein’s paradox should not be applied.

The heresy is suggesting that a player’s previous projection is the best number to regress to.  But why should that be an unusual idea?  The purpose of finding the best subgroup to regress to is to find the group that best represents all that we already know about a player.  But all that we already know about a player should already have been in the player’s previous projection.  All I am suggesting is that this be used as the player’s “best subgroup” even if it is a subgroup of one.

MGL - On pages 125-126 of the article on Stein’s paradox it states “As we have seen, the James-Stein method gives better estimates for a majority of cities, and it reduces the total error of estimation for the sum of all cities. It cannot be demonstrated, however that Stein’s method is superior for any particular city; in fact, the James-Stein prediction can be substantially worse.” So what makes you think that using Stein’s method (regression to the mean) is the correct thing to do to get the best projection for an individual player?


#12    Peter Jensen      (see all posts) 2008/08/04 (Mon) @ 14:35

Phil - We never know absolutely nothing about a player who is being brought up to the major leagues; see my reply to Sky in post #11.  The first three paragraphs of your post had everything else right, particularly the last sentence of paragraph 3.  All I am arguing is that the “best guess about that particular player” is our previous projection of a player’s true talent and that is what we should regress the current year’s data toward.  Forget entirely about the league average as it adds nothing to the projection because it is not a better representation of the player’s true talent.


#13    David Gassko      (see all posts) 2008/08/04 (Mon) @ 15:14

What Peter is saying is actually the general consensus on the blog; I think he’s just phrasing it incorrectly (or perhaps a bit too militantly). Of course you don’t just regress a player to the mean; you regress him to his own mean, which can incorporate size, age, position, scouting information, etc. That’s exactly what Tango is talking about when he says we want to regress UZR to what the Fan Scouting Report tells us!

As for, “the ‘best guess about that particular player’ is our previous projection of a player’s true talent and that is what we should regress the current year’s data toward,” that’s something Mickey likes to talk about all the time: It’s called Bayesian math, and it’s what Nate Silver tried to do in his Chipper Jones article. There are two things here, though: (1) That projection that Peter is talking about isn’t just pulled out of thin air, rather it’s exactly what I was talking about in my first paragraph, and (2) The Bayesian math is a much more complicated way of getting almost the same exact answer that regressing to the mean will give you. It’s more correct, but the cost:benefit ratio is just way too great.


#14    Tangotiger      (see all posts) 2008/08/04 (Mon) @ 15:16

The previous forecast would have included the league mean as well.

I agree with everything that Peter is quoting here.  For example, you don’t regress Randy Johnson’s OBP toward .340, but toward whatever a pitcher as a batter is (say .180).  And if you have a tall lanky hitter who spends 99% of his training as a pitcher, you use that information as well.

And yes, an estimate can be worse for an individual item.  It would have to be, since everything is unique, and we are not uniquely identifying all the characteristics of those items.  I’m sure there’s something unique about Lyle Lovett that we are not identifying that makes us miss that he scores well with women, for example.

I guess I’m just not getting the one thing that we are disagreeing on, since I seem to agree with 100% of Peter’s statements of facts, but 0% with his conclusion.  Peter, can you give a real example here?


#15    Eli      (see all posts) 2008/08/04 (Mon) @ 15:34

The controversial part of what Peter’s saying seems to be that the reliability of a stat (which is a product of binomial randomness and the spread of talent in some population) should not be included as a factor in regressing to the mean. But it seems to me like it has to come into play at some point.

One of Peter’s points might be that for a player who has been in the majors for several seasons, we have enough information about him in particular so that it’s no (or very little) added help to look to some population mean in making a projection. That’s probably right.

But what about rookies, who we don’t have much data on? Or what about statistics that take a long time to build up their sample sizes and/or for which there is little spread of talent (like clutch hitting)? In these cases it seems that inferring things about the player based on other similar players is essential.


#16    Eli      (see all posts) 2008/08/04 (Mon) @ 15:53

I think this could be part of the confusion -

Peter suggests regressing toward the “previous projection of a player’s true talent”. But regression to the mean IS how we estimate a player’s true talent over some period of time.

Projecting a player going forward is a little different. It starts with the estimate of true talent we got from RTM. Sometimes this is modified by weighting older seasons less than more recent seasons (which means we’re no longer exactly estimating true talent over a time period, but rather trying to also roughly capture any changes in true talent). And then we add on an aging curve.


#17    Sky      (see all posts) 2008/08/04 (Mon) @ 16:20

Regressing towards an MLE seems to make a lot of sense.  If that data was readily available, you might even say Marcel would buy into it.

The other information MLB teams would have on every player is a scouting report.  They should regress towards that.

I believe Justin over at On The Reds is regressing 2008 fielding data towards Tango’s 2007 fans scouting report information.


#18    Peter Jensen      (see all posts) 2008/08/04 (Mon) @ 16:23

Even a rookie has characteristics that differentiate him from other rookies and the average major league player.  That’s why I chose the descriptions of the players I did in my answer to Sky in post #11.  There is never any reason to regress even a rookie player to the league average.  You can always find a more relevant subgroup or combination of subgroups to regress toward.  And contrary to expectations, you may not want to regress a rookie’s first full year’s stats as much toward his relevant subgroups, as his actual performance at the major league level even in a small sample size may be more indicative of his true talent than his previous projection.

I said that I believe my method is the theoretically correct way to project a player.  Whether the improvement gains are worth the effort is another question which depends on how you are using the data.  If you are doing a study using the whole major league population and the results of the study only depend on having the least error for the population as a whole, then as the article tells us, Stein’s method will give you the correct answer.  Let the monkey do his thing. But if you are projecting individual players where the projections are to be used in the business of baseball, a one-run error in the projection means a $450,000 error at the current estimated value of $4.5 million a win.  So there is plenty of incentive to do the extra work to get the theoretically best estimate of a player’s true value.


#19    Tangotiger      (see all posts) 2008/08/04 (Mon) @ 16:37

I don’t think I disagree with anything Peter is saying, other than my initial confusion.

You regress a player toward the mean of whatever population he is drawn from, but that population does not have to be comprised of everyone.  There are different universes of populations, such as lefthanded wily veterans, or young hotheaded outfielders, or whatnot.

So, I think we’re on the same page Peter, or at least the same chapter.


#20    Eli      (see all posts) 2008/08/04 (Mon) @ 16:43

If the point is just that regressing to some subpopulation of the league is better than the whole league, that’s something that everyone here already agrees with. The Book talked about it, there have been posts and discussions about it, etc. The league average is better than not regressing to anything at all, and in some cases it is difficult to come up with a better comparison population (like in the case of more obscure stats like clutch hitting, platoon splits, etc.). And since choosing a population to regress to is subjective, it can make things worse than if you stuck with the league average. But your overall idea doesn’t seem to me to conflict with what was said in The Book or in posts on the blog relating to RTM. Do you have any examples where you think something was done incorrectly?


#21    Peter Jensen      (see all posts) 2008/08/04 (Mon) @ 17:25

Tango and Eli - Where we might not be on the same page is that I am specifically suggesting that from at least year three onward that a player’s performance for the year only be regressed toward the pre-season projection of the player’s true value.  That the pre-season projection will better represent the player than any other subgroup.  Since nobody that I know of currently does that, I think all current individual player projections are done incorrectly.


#22    Colin Wyers      (see all posts) 2008/08/04 (Mon) @ 17:31

Peter - So, let me see if I understand correctly.

For player with X number of career PAs, Season Y performance should be regressed to the weighted average of his performance in seasons Y-1 through Y-3, with no regression to the mean. Is this correct?

If so, provide whatever value of X you like and I’ll go ahead and run the projections.


#23    Eli      (see all posts) 2008/08/04 (Mon) @ 17:40

I actually kind of agree with Peter in that I always wondered how useful the RTM component was in Marcel projections, especially for players who had been in the league for a long time. I don’t think it has to be all or nothing though. The more plate appearances you have, the less the population mean should factor into the projection. Though the Marcel structure kind of obscures that by only counting the last three seasons even if a player has been in the league much longer.

The non-Marcel context is where I think RTM is really valuable - like with the clutch hitting stuff in The Book.


#24    Sky      (see all posts) 2008/08/04 (Mon) @ 18:10

Peter, what would you regress a player’s stats towards in the first few years?  League average?  Personal MLE history?  Other player subset?

And I assume you’d use a plate-appearance cut-off rather than a years cut-off right?  And the more plate appearances a player has, he’s being regressed less towards the mean, right?

Marcel currently regresses 2/14 towards the mean for full-time, three year players, right?  (5/4/3/2 weighting.) So if you’re regressing towards that, aren’t you still regressing somewhat towards the league mean?
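The 2/14 figure can be reproduced with a few lines of arithmetic. This is a sketch under assumptions not stated in the thread: that Marcel weights the last three seasons 5/4/3 by plate appearances and adds 1,200 PA of league-average performance, and that a full-time season is 600 PA. The function name `league_share` is made up for illustration.

```python
def league_share(pa_by_season, weights=(5, 4, 3), regression_pa=1200):
    """Fraction of a Marcel-style projection that comes from the league mean.

    pa_by_season: plate appearances, most recent season first.
    weights and regression_pa are assumed values, not taken from the thread.
    """
    weighted_pa = sum(w * pa for w, pa in zip(weights, pa_by_season))
    return regression_pa / (weighted_pa + regression_pa)

full_time = league_share([600, 600, 600])   # three full seasons
part_time = league_share([200, 200, 200])   # a bench player

print(f"full-time league share: {full_time:.3f}")
print(f"part-time league share: {part_time:.3f}")
```

For the full-time player the 1,200 regression PA act like a fourth "season" with weight 2 next to 5, 4 and 3, which is where 2/14 comes from. The same constant weighs much more heavily on a player with few plate appearances.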


#25    Peter Jensen      (see all posts) 2008/08/04 (Mon) @ 18:32

Colin - I never said and don’t believe that the best projection of an experienced player’s true talent is a simple weighted average of his previous 3 major league seasons.  You still need to age adjust, and as the thread on age adjustment shows there is no consensus as to how to do that. How many years back into a player’s career you should go and how much you should weight each year has never been tested as far as I know. And how much weight you should give to year Y stats and how much to the preseason projection definitely depends on where a player is in his aging cycle.

But if you want a rough test you should be able to use any current projection system that regresses to the league average.  Take that system’s projections for the 2005 season.  Regress each player’s 2005 stats back to his 2005 projection.  Regress the same amount as whatever projection system you are using usually regresses to the league average.  Age adjust and year weight the same as the projection system you have chosen. Use the results as your projection for 2006 and go through the same process for that year’s stats.  That will give you a projection for 2007.  Compare the difference of the actual 2007 stats for each player with both the preseason projection by this method and the 2007 preseason projection of the projection system you have chosen.  Sounds like a lot of work to me, but if you want to try it be my guest.  I should have my own player projection system ready by next season and I will make my predictions public for all to see.


#26    Peter Jensen      (see all posts) 2008/08/04 (Mon) @ 18:49

Sky - Rookies and other low PA players still never get regressed to the league mean.  What subset or subsets to use need to be determined by testing.  All the subsets I mentioned in answer to you in post 11 are possibilities.  How much to regress also needs to be tested.  As I said in post #4 the amount should be based on the confidence you have in your preseason projection.  The more confidence you have the more you would regress.  The factors that would affect your confidence are the number of PA’s of prior experience, the number of PA’s in the year’s sample, the variation of the stat you are predicting, any injury history, and where the player is on his aging cycle.  Any of these factors can be tested.


#27    MGL      (see all posts) 2008/08/04 (Mon) @ 19:06

“Tango and Eli - Where we might not be on the same page is that I am specifically suggesting that from at least year three onward that a player’s performance for the year only be regressed toward the pre-season projection of the player’s true value.”

That is completely wrong, Peter, and that is not what is meant when most people talk about regression toward the mean, although it is a similar concept.

First of all, as someone already mentioned, a player’s projection at any point in time already includes a regression toward the mean of the population that we think the player comes from (and we need to make sure that this “population” has nothing to do with the stats of the player we are trying to project).  Second of all, there is no amount of data on a player (or any other randomly drawn sample from a population) that does not get regressed.  It is certainly not “a few years of major league service” and it is not 100 years of major league service.  However, if you do some kind of “regression to a player’s previous projection,” as I already said, that INCLUDES a regression to the mean of a certain population.  And in any case, that is NOT what we mean by regressing a sample performance toward the mean!  We don’t mean towards the mean of a previous estimate.  That makes no sense.  None whatsoever.  Let’s say that we have a player who projects as a .270 hitter going into some time period and that is based on like 100 AB of .320 hitting (which is then regressed towards a mean of .250).  Now he hits .260 for another 600 AB.  We don’t regress the .260 towards the .270 because that was his previous projection!  That would be ridiculous.  We take all of his raw stats, the 100 plus the 600, do any kind of weighting or age or context adjusting we want, and then regress towards .250 again.

So can we please stop saying that a player’s projection or his true talent estimate gets “regressed towards his own mean or towards some prior projection!” That makes no sense and is completely wrong as you can see from the above example.

Let’s stop being solicitous.  Peter is a great guy and real smart, but he has this completely screwed up.


#28    MGL      (see all posts) 2008/08/04 (Mon) @ 19:16

And it has already been said 5 or more times in this one thread, but a player’s stats get regressed toward the mean of whatever population we think he comes from. Period. It is no more or less complicated than that.  We can legitimately argue all day about exactly what population we think a player comes from, and what the mean of that population is, but the description/explanation (whatever population we think the player comes from) is 100% correct.  And so is the methodology.  And as the article says and I have been saying for a long time, regression toward the mean is a shortcut for a two-part Bayesian probability exercise where the a priori distribution is normal.  If it is not, then using a regression toward the mean formula like that in the article and that which we generally use in baseball is merely an approximation.  For example, if the mean of all baseball players’ BA is .250 and it is normally distributed then the “shrinkage formula” once we know the SD of that distribution works just fine.  However, if that distribution consists of 99 players who hit .249 and one player who hits .349, then I don’t think that the shrinkage formula is exactly the same as doing the rigorous Bayesian version, which is, “Given that 99% of all players are true .249 players and 1% of all players are true .349 players, and given that our player hit .300 in 38 AB, what is the chance that he is a .249 player and what is the chance that he is a .349 player?” and then we take the weighted average, based on those 2 probabilities, of .249 and .349.  That is the 100% correct method of estimating a true BA from a sample one (when we know the exact distribution of true BA in the population of all players).  I don’t think that the “shortcut” RTM formula (the shrinkage one) will work, since it assumes a normal distribution of talent in the population, although it will be close (I think).
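The 99-to-1 example can be worked through both ways. The following is a sketch, not anyone's actual projection code. Assumptions: .300 in 38 AB is not a whole number of hits, so the likelihood is evaluated at 11.4 "hits" (the binomial coefficient cancels out anyway), and the shrinkage constant uses one common approximation, k = mean × (1 − mean) / variance of true talent. The function names are hypothetical.

```python
def exact_bayes(avg, ab, talents, priors):
    """Weighted average of the candidate talents using posterior probabilities."""
    hits = avg * ab           # may be fractional; only the likelihood shape matters
    outs = ab - hits
    weights = [p * t**hits * (1 - t)**outs for t, p in zip(talents, priors)]
    total = sum(weights)
    return sum(t * w for t, w in zip(talents, weights)) / total

def shrinkage(avg, ab, talents, priors):
    """Shortcut: regress the sample average toward the population mean."""
    mean = sum(t * p for t, p in zip(talents, priors))
    var = sum(p * (t - mean) ** 2 for t, p in zip(talents, priors))
    k = mean * (1 - mean) / var   # assumed approximation for the regression constant
    return mean + ab / (ab + k) * (avg - mean)

talents, priors = [0.249, 0.349], [0.99, 0.01]
bayes_est = exact_bayes(0.300, 38, talents, priors)
shrink_est = shrinkage(0.300, 38, talents, priors)
print(f"exact Bayes: {bayes_est:.4f}")
print(f"shrinkage:   {shrink_est:.4f}")
```

Under these assumptions both estimates land near .250 and differ by roughly one point of batting average, which is consistent with the guess that the shortcut is close but not identical when talent is not normally distributed.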


#29    Peter Jensen      (see all posts) 2008/08/04 (Mon) @ 19:34

With all due respect MGL I don’t think you know what you are talking about.  You obviously don’t think I do either.  I think you are a smart guy as well.  But whether you or I is the smarter guy on this subject is completely testable.  Next year you make your best preseason player projections for offensive players (and pitchers if you want) and so will I.  The person who has the predictions with the least total error for whatever comprehensive rate statistic you choose for the 2009 season wins.  I get to choose the PA cutoff point.  Deal?


#30    Rally      (see all posts) 2008/08/04 (Mon) @ 21:04

I look forward to seeing your projections Peter.

Actually, you can project players either way and it won’t make a difference as long as you are using proper weighting.  A player’s initial projection is going to include some regression to the mean, so once you start regressing to that projection, you are still regressing to the mean.

It gets a lot more complicated though, after a few seasons it would be hard to tell how much regression is still left and how much of the player sample is there.  MGL’s way is a little more straightforward.

For my projections I regress BABIP to the player’s speed score, and HR% to the player’s weight.  I haven’t found any stand-in for walks and strikeouts, so they get regressed to the straight league average, but those stats are regressed relatively little anyway.

I also regress minor leaguers based on the level they play at.


#31    Peter Jensen      (see all posts) 2008/08/04 (Mon) @ 21:37

Rally - A player’s initial projection does not include any regression to the league mean because you NEVER use the league mean in any projection.  You always have a better subgroup to regress toward in a player’s first couple of years.  From then on you regress only to the player’s pre-season projection for every stat.  Eventually your projections are based entirely on your pre-season projection, the current year’s stats regressed to that projection, and an age adjustment.


#32    MGL      (see all posts) 2008/08/04 (Mon) @ 23:27

Peter, A, I’ll take you up on that for any amount of money you want, although that has NO relevance to what we are talking about.  For one thing, ANY two decent projections are going to be decided by luck a good proportion of the time.  And heck, I may just use Rally’s projections if you don’t mind!

I may have to take back what I said about your being smart (wink), as Rally and about a hundred (well, not quite) other people have said over and over again that NO ONE is blindly regressing any player toward a league mean (not counting a “basic Marcel” perhaps).

Rally just got done saying that he regresses a player’s BABIP according to his speed score, etc.  I am sure he uses all kinds of population means to regress towards, as does EVERYONE who does serious projections.

Why do you keep insisting on arguing about something which does not exist (someone in this thread arguing that they would blindly regress a player’s stats toward the major league (or NL or AL mean) without regard to the “type” of player they are projecting)?

“you NEVER use the league mean in any projection”

In any case, that is completely wrong, as you would regress a player toward a league mean if you knew nothing about a player other than he comes from that league!  If you disagree with that, I will offer to wager, up to, say, $100,000 and we can ask Andy or one of the other pure statisticians to decide.

“You always have a better subgroup to regress toward in a player’s first couple of years.”

This statement implies that after a couple of years, you do not regress a player’s sample stats toward a mean of the population you think the player comes from.  Of course you ALWAYS do that, no matter how many years of stats you have on a player.  It is possible that your estimate of what population that player comes from changes, of course, and you CANNOT let any of those sample stats that you are using for a projection influence the population choice!  Again, the initial projection may already include that - I am not sure about that.

If you disagree with any of the statements in the above paragraph, I’ll wager another $100,000 if you like.

Now, whether regressing a player’s sample stats toward his prior projection is the same thing as regressing his sample stats toward some population mean and that’s it (other than weighting and age adjustments of course), I am not sure. I am sure that just taking the sample stats and then regressing toward some population mean is the EXACT, 100% CORRECT way to do a projection (again not including whatever may change a player’s true talent over time - we are assuming for this argument that all sample stats from a player are an unbiased estimate of his true talent, no matter what the time frame).  If you disagree with THAT, I’ll go another 100 grand!

So, I really have no idea what you are talking about in terms of “regressing a player’s sample stats toward his prior projection,” but I suppose that could work, although it doesn’t make much sense to me, and I don’t think it makes much sense to any one of our resident experts here.  You may be the smartest person on this blog!

In any case, you’ll have to give a simple example of what you are talking about for me to see whether we are talking about the same thing, and that therefore, there would be NO argument.

I can give you some simple data and you can simply tell me how to do a simple projection with no weighting or age adjustments, etc.  A player from a league bats .300 in 300 AB.  We know nothing about him. The league mean BA is .250.  The SD of true talent BA in that league is 30 points (.030).  What is your estimate of his true BA after those 300 AB?  Very simple to do.
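For reference, the first step of this problem can be sketched as follows. The regression constant is an assumption: it uses the approximation k = mean × (1 − mean) / SD², with the talent SD read as 30 points (.030). The function name `regress_to_mean` is illustrative.

```python
def regress_to_mean(avg, ab, league_mean, talent_sd):
    """Estimate true BA by shrinking the sample average toward the league mean."""
    # Approximate number of AB at which the sample and the league mean
    # deserve equal weight (assumed formula, not taken from the thread)
    k = league_mean * (1 - league_mean) / talent_sd ** 2
    sample_weight = ab / (ab + k)
    return league_mean + sample_weight * (avg - league_mean), k

estimate, k = regress_to_mean(0.300, 300, league_mean=0.250, talent_sd=0.030)
print(f"regression constant: about {k:.0f} AB")
print(f"estimated true BA:   {estimate:.4f}")
```

Under these assumptions the constant is a little over 200 AB, so after 300 AB the sample carries somewhat more weight than the league mean and the estimate sits slightly closer to .300 than to .250.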

Now he bats another 200 times and he bats .200.  Again, no weighting for recency or anything like that.  We are assuming that his true talent is constant and fixed.  What is your estimate of his true BA now?  To me, that is a simple problem and HAS TO BE EXACTLY the same as if I said that he batted 500 times with a BA of .230 - what is your estimate of his true BA? And that requires simply taking the .230 BA and regressing toward the league mean of .250, based on those 500 AB. If you disagree with that last sentence, I’ll go with a smooth million (or more) for a bet!  Again, whether taking the initial projection after the initial 300 AB and then regressing the next 200 AB at a .200 clip towards that works out to the same thing, I don’t know.  If it does, then we have no disagreement. If it doesn’t, then you are wrong, because just taking the .230 BA in 500 total AB and regressing toward .250 is THE (maybe not the only) way to do it (remember I have one million dollars riding in that, if you should choose to accept)!

Anyway, go ahead and answer that please using your, “regressing toward a player’s prior projection” thing…
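For readers who want to try the numbers, here is a minimal sketch of the estimate being asked for. It assumes a normal approximation, reads the true-talent SD as 30 points (.030), and approximates the per-AB sampling variance with the league mean (.250 * .750). The helper name `shrink` and the constant `K` are made up for illustration and are not from the thread.

```python
# Sketch only: normal approximation to "regress toward the league mean".
LEAGUE_MEAN = 0.250
TALENT_SD = 0.030  # assumption: "30 points" of batting average

# Number of league-average AB that carries as much information as the
# spread of true talent (binomial variance per AB / talent variance).
K = LEAGUE_MEAN * (1 - LEAGUE_MEAN) / TALENT_SD**2  # about 208 AB


def shrink(hits, ab, prior_mean=LEAGUE_MEAN, prior_ab=K):
    """Blend the observed AB with prior_ab AB hit at prior_mean."""
    return (prior_mean * prior_ab + hits) / (prior_ab + ab)


# Step 1: .300 in 300 AB (90 hits), regressed toward the league mean.
after_300 = shrink(90, 300)

# Step 2a: pool both samples (90 + 40 = 130 hits in 500 AB), regress once.
pooled = shrink(130, 500)

# Step 2b: treat the step-1 estimate as the prior (now worth K + 300 AB)
# and update it with the next 200 AB at .200 (40 hits).
sequential = shrink(40, 200, prior_mean=after_300, prior_ab=K + 300)

print(round(after_300, 4), round(pooled, 4), round(sequential, 4))
```

Under these assumptions the pooled and the step-by-step estimates agree, which is the question the wager turns on.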


#33    MGL      (see all posts) 2008/08/04 (Mon) @ 23:53

BTW, if anyone ever offers me a wager about something baseball related, I am going to do one of two things:  Gladly accept, or admit that I am wrong or that the person is “better” if it is a skill-type wager, like the projection thing sort of (mildly) is.  I have to play by my own rules.


#34    dq      (see all posts) 2008/08/05 (Tue) @ 00:12

A player batting .300 in 300 AB and then .200 in 200 AB does not “HAVE TO BE EXACTLY the same as ....500 times with a BA of .230”

Do you want to bet your $100,000 on this one?


#35          (see all posts) 2008/08/05 (Tue) @ 00:53

I did it again with the bold this time!  As I said, I either take the bet or admit that I screwed up!  My arithmetic skills are not what they used to be!


#36    MGL      (see all posts) 2008/08/05 (Tue) @ 04:16

I am not sure, but I think that both “methods” (they are really the same method) come out exactly the same, which means that there is no argument at all.

For example, let’s say that a player hits .200 in 800 AB and that the regression for 800 AB is 50% (which it is approx.).  So he is a true .225 hitter if the population he comes from has a mean of .250.

Now let’s say that he hits .360 in another 800 AB.  His BA is now .280 in 1600 AB, which requires a regression of 1/3, which makes him a .270 hitter.

If we regress his .360 toward the prior projection of .225, we would have to regress 2/3 of the way, which happens to be 1- 1/3 of course. 

Let’s see if this works for another example:

.240 in 800 AB.  He is a .245 hitter.  Another 400 AB of .300 hitting and we have a total of .260 in 1200 AB.  Regress that 40% and we have a .256 hitter.

Using Peter’s method, we have .300 and we regress towards .245.  To get .256, we need a regression of 80%.  Nope that does not work.

While our regression (how much to regress) is solely a function of the total number of PA, his regression is a function of the number of PA that went into the original projection and the number of PA in the new sample.  But I don’t know what that function looks like off the top of my head.

Maybe Peter can provide that.

In this case, I am using the formula 800/(PA+800) to determine the regression amount, which is around what it should be for BA.
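The arithmetic in the examples above can be reproduced with a few lines of Python. This is only a sketch: the function names are made up, and the .250 population mean and the 800 PA constant are taken from this post.

```python
def regression_amount(pa, k=800):
    """Fraction of the way to move the observed BA toward the population mean."""
    return k / (pa + k)


def project(ba, pa, mean=0.250, k=800):
    """Observed BA regressed toward the population mean."""
    return ba + (mean - ba) * regression_amount(pa, k)


# (observed BA, PA) pairs used in the examples above
examples = [(0.200, 800), (0.280, 1600), (0.240, 800), (0.260, 1200)]
for ba, pa in examples:
    print(f"{ba:.3f} in {pa} PA: regress {regression_amount(pa):.0%}, "
          f"projection {project(ba, pa):.3f}")
```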

Maybe Peter can provide the corresponding formula when we have PA1 and PA2 and we want to regress the BA in PA2 to the projection we got from PA1.

Or maybe we can solve for that.

The formula for a player’s projection after PA1 PA with a batting average of BA1, where the population mean is .250, is:

.250 + (BA1-.250) * PA1/(800+PA1)

(That is, we regress 800/(800+PA1) of the way toward .250, so PA1/(800+PA1) of the difference from the mean is kept.)

Now we have another BA, we’ll call BA2 in another sample of PA, we’ll call PA2.

The player’s total BA in both samples combined is now:

new BA = (BA1*PA1+BA2*PA2)/(PA1+PA2)

That is just the weighted average of the two samples of BA.

So the new projection (using the traditional method) is:

.250 + (new BA-.250) * (PA1+PA2)/(800+PA1+PA2)

or,

.250 + ((BA1*PA1+BA2*PA2)/(PA1+PA2)-.250) * (PA1+PA2)/(800+PA1+PA2)

Using Peter’s method, our formula for the new projection is:

BA2 + (old projection-BA2) * regression amount

or,

BA2 + (.250 + (BA1-.250) * PA1/(800+PA1) - BA2) * regression amount

If we make our method and Peter’s equal to each other, we have:

.250 + ((BA1*PA1+BA2*PA2)/(PA1+PA2)-.250) * (PA1+PA2)/(800+PA1+PA2) = BA2 + (.250 + (BA1-.250) * PA1/(800+PA1) - BA2) * regression amount

Solving for Peter’s regression amount…

(BA2 - .250 - ((BA1*PA1+BA2*PA2)/(PA1+PA2)-.250) * (PA1+PA2)/(800+PA1+PA2))

divided by

(BA2 - .250 - (BA1-.250) * PA1/(800+PA1))

Whatever the hell that reduces to, or we may have to use simultaneous equations to solve for Peter’s regression formula to get rid of the BA1 and BA2 terms since it should be only a function of PA1 and PA2.


#37    Peter Jensen      (see all posts) 2008/08/05 (Tue) @ 06:20

I believe that the second example works as well. The new projection of .245 represents 1600 PAs, the 800 actual PAs and the 800 added PAs from the league average to get the new projection.  The new sample is 400 PAs at .300 BA. As you correctly calculated you need to regress 80% to get to the final .256 BA.  The total PAs is 2000 so you are regressing 1600/2000 or 80%.  I believe it works in all cases but I don’t have a proof for it.
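A short way to see why this holds, assuming the 800/(PA+800) regression formula from #36 and a population mean m: the prior projection is (800*m + PA1*BA1)/(800+PA1), so it behaves like 800+PA1 PA of evidence, and the combined-sample projection is (800*m + PA1*BA1 + PA2*BA2)/(800+PA1+PA2). Substituting the first into the second gives BA2 + (prior projection - BA2) * (800+PA1)/(800+PA1+PA2). The sketch below checks that numerically; the function names are hypothetical.

```python
import random

K = 800       # regression constant used in #36
MEAN = 0.250  # population mean used in the examples


def project(ba, pa):
    """Traditional method: regress the observed BA toward the population mean."""
    return (MEAN * K + ba * pa) / (K + pa)


def update_prior_projection(old_proj, pa1, ba2, pa2):
    """Regress the new sample toward the prior projection."""
    r = (K + pa1) / (K + pa1 + pa2)  # regression amount
    return ba2 + (old_proj - ba2) * r


# Second example from #36: .240 in 800 PA, then .300 in 400 PA.
old = project(0.240, 800)
via_prior = update_prior_projection(old, 800, 0.300, 400)
combined_ba = (0.240 * 800 + 0.300 * 400) / 1200
via_total = project(combined_ba, 1200)

# Spot-check the agreement on random inputs.
random.seed(0)
max_gap = 0.0
for _ in range(1000):
    ba1, ba2 = random.uniform(0.15, 0.40), random.uniform(0.15, 0.40)
    pa1, pa2 = random.randint(1, 3000), random.randint(1, 3000)
    total = project((ba1 * pa1 + ba2 * pa2) / (pa1 + pa2), pa1 + pa2)
    prior = update_prior_projection(project(ba1, pa1), pa1, ba2, pa2)
    max_gap = max(max_gap, abs(total - prior))

print(round(via_prior, 3), round(via_total, 3), max_gap)
```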


#38    MGL      (see all posts) 2008/08/05 (Tue) @ 11:38

OK, I’ll play around with some numbers to see if that works in all cases.


#39    Pizza Cutter      (see all posts) 2008/08/07 (Thu) @ 00:12

Sorry to be jumping into this one so late, but let me see if I’ve got this right.  There seem to be two issues intertwined here.  One is the issue of picking the right prior to regress to when we regress to the mean.  In the past, that’s been league average.  People seem to be OK with the thought that we can do better than that, only there’s not a lot of consensus on exactly how to do that.  Fine, open discussion.

The other is methodological.  Peter seems to be arguing for some sort of ARMA (auto-regressive moving average) system for projections.  Phil is right in #10 that we intuitively don’t consider an 0-for-20 for John McDonald the same way as an 0-for-20 for A-Rod.  For McDonald, that’s par for his course.  For A-Rod, that’s weird and we know that he’s really a better player than that deep down and we have the numbers to back it up. 

That type of conceptualization is a complete paradigm shift away from the regression (to the mean) and (linear) regression systems that seem to be out there, not so much conceptually, but certainly in terms of statistical methodology.

I’m only superficially familiar with the ARMA/ARIMA concept.  As I understand it, it’s set up to handle time series data (which all baseball stats are… your OBP after your 342nd PA incorporates the data from the 1st-341st and tacks on another observation… hence a moving average...), but with the understanding that the 1st-341st events will have some correlation with the 342nd (it’s the same person in each case).  (I’ve come across it when reading about intra-class correlation.) I don’t understand it much more than that or any of the math behind it, but I think that this is what Peter is talking about.  Anyone else know more on this type of analysis and how it might be useful?


#40          (see all posts) 2008/08/07 (Thu) @ 04:08

Here’s what I am doing

Start with a weighted sum of career professional stats (including MLEs for minor leagues). Older data gets a smaller weighting. If there’s at least 3 years of data available it appears to be a reliable sample.

Next, regress that total to some concept of a league mean. If you have a full dataset, the effect of the regression will be minimal. The variance tests I ran yesterday showed that at all levels of weighting, regressed data was better, but in the .5 to .8 range, the difference was tiny. When the historical data is sparse, the regression creates a more accurate projection than just the data itself. This will be helpful with players in the low minors who have less than two full years of data available.

The test was to generate projections for each player after each season in which they played (1998-2008) and then compare that to the actual stats for the one season immediately following the projection.

I hope I have this published to the public correctly now
http://spreadsheets.google.com/ccc?key=pLg_vfW0QCD9LVK3sm2tVLg&hl=en

Next is to calculate projection variances on the test data grouped in intervals of sample size, to see how many PAs are needed to get a sample within acceptable error ranges, with and without regression.


#41    MGL      (see all posts) 2008/08/07 (Thu) @ 06:49

Brian, could you explain more what the various labels and numbers mean in your data sheet?

When you say “all levels of weighting,” what do you mean?  What does “in the .5 to .8 range” mean?  Normally we weight each year (or each day) differently.  When you say or write a “weight of .6” I don’t know what that means. Do you mean that each prior year gets a weight of .6 of the subsequent year, so that for 3 years, a “.6 weight” means 1/.6/.36?

Where did you get the MLE’s from?  What players did you use in the samples?

Is there somewhere you provide the exact methodology that created this data sheet?

What do you mean by “the variances”?  The sheet seems to contain variances, but I don’t know what you mean by that (obviously I know what “variance” is).


#42    tangotiger      (see all posts) 2008/08/07 (Thu) @ 07:07

Brian, could you also run it at .75?

It seems that .7, .75, .80, somewhere around there is the correct answer.

And, if you can try 200, 240, 280 as well… those 9 combinations I think might pretty much do it.

MGL: part of your answer is in post 12 at this thread:
http://www.insidethebook.com/ee/index.php/site/comments/here_is_an_interesting_question_having_to_do_with_one_of_the_other_threads/#12


#43          (see all posts) 2008/08/07 (Thu) @ 07:29

MGL, I apologize for not being clearer.

The weighting determines how heavily past seasons count in the current projection. A lower value places more emphasis on more recent performance, but decreases the sample size, increasing the effect of regression. A higher value allows more credit for older performances, increasing the sample size, and decreasing the effect of regression.

I created a SQL tool to generate MLEs and then group them into True Talent Level estimates at the end of each season. I then wrote another query to test the projections by comparing them to the next season for each player.

As best I can tell, my MLEs are computed in a manner very similar to what you have stated, by creating matched pairs of major league and minor league data, grouped by player, team, league and level. I also use pbp park factors where available (Retro years for MLB, 2005-2007 for minors). Leagues are regressed to levels and teams are regressed to leagues to compensate for those with small sample sizes. The main framework of the series of queries is in place, but the formulas are being tweaked.

The current season always has a weight of 1 (100% - no discounting).

The weighting formula is w^(date of projection minus season being considered)

If w=1.0 for 2008 projections, 2007 is 1, 2006 is 1, 2005 is 1, etc - just use total career numbers

If w=0.8 for 2008 projections, 2007 is .8, 2006 is .64, 2005 is .512, etc

If w=0.2 for 2008 projections, 2007 is .2, 2006 is .04, 2005 is .008, etc

If w=0 for 2008 projections, 2007 is 0, 2006 is 0, 2005 is 0, etc - just use last year
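The weighting rule above is easy to express in code. A minimal sketch, with a made-up function name, that applies w^(date of projection minus season) exactly as stated:

```python
def season_weights(w, projection_year, seasons):
    """Weight for each prior season: w ** (projection_year - season)."""
    return {season: w ** (projection_year - season) for season in seasons}


weights = season_weights(0.8, 2008, [2007, 2006, 2005])
for season, weight in weights.items():
    print(season, round(weight, 3))
```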


#44          (see all posts) 2008/08/07 (Thu) @ 07:44

These are the RMS values comparing my projections to the season being projected.

This was not meant as a test of my projections, but to show the effects of weighting of past performance and regression to the mean on the accuracy. Once those optimal values are determined, I can look for other things to improve the result.

vBA, vOB, vSA are SDs of BA, OB, SA - built from the following components
vSDT (singles, doubles, triples) is same as BABIP
vHR HR% of balls hit fair (BIP+HR)
vBB, vSO BB% & SO% per PA

For each yearly projection, took projection minus observed for that year, squared it, multiplied by lower of PAs from obs or exp, summed them, divided by sum of multipliers, then took square root.
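The error measure described in the last paragraph can be sketched as follows. The function name and the sample rows are invented for illustration; only the steps (squared error, weighted by the lower of the two PA totals, then the square root of the weighted mean) come from the description.

```python
import math


def weighted_rms(rows):
    """rows: iterable of (projected_rate, observed_rate, projected_pa, observed_pa)."""
    weighted_sq_error = 0.0
    total_weight = 0.0
    for projected, observed, projected_pa, observed_pa in rows:
        weight = min(projected_pa, observed_pa)  # lower of the two PA totals
        weighted_sq_error += weight * (projected - observed) ** 2
        total_weight += weight
    return math.sqrt(weighted_sq_error / total_weight)


# Made-up rows, not taken from the spreadsheet.
rows = [(0.270, 0.290, 500, 450), (0.250, 0.220, 300, 600)]
rms = weighted_rms(rows)
print(round(rms, 4))
```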


#45    MGL      (see all posts) 2008/08/07 (Thu) @ 14:43

Brian, thanks!  Nice work.


#46          (see all posts) 2008/08/08 (Fri) @ 14:29

Per Tango’s suggestion, I went back and reran the variances of the projection errors at various levels of prior season weighting and regression to league mean. The other day was the first time I added regression to the projection formula, and it was a little jerry-rigged, so this morning I put the PAs of regression and the prior season weighting into a table, which is then referenced by the queries. So, for example, if I want to turn off regression, I can just put 0 PAs in the table.

Here are the new results
http://spreadsheets.google.com/ccc?key=pLg_vfW0QCD_rOiV7N96CTQ&hl=en#

I’ve decided to use 150 PA of regression, prior season weighted at 0.7.

As expected, HR, BB & SO all minimize between 100 and 180 PAs. SDT (BABIP) never minimized, even after 550 PAs, and was just a hair better at 0.75 than 0.70, but the differences in accuracy, at least when there’s enough prior data available, are very small.

As this thread has suggested, as long as recent prior data is available, the amount of regression has very little effect on the overall accuracy of 4000 some projections. The regression will be most helpful when prior data is not available (less than 800 to 1200 PAs, after weighting). If using MLEs, this will be limited to players in the low minors.
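To illustrate why the regression matters mostly for sparse records, here is a rough sketch of a projection that combines the season weighting with a fixed number of regression PAs. It assumes that "PAs of regression" means adding that many PA at the league rate, which is how the thread describes it but is not confirmed by the actual queries. The function name, the league rate of .260, and the two sample players are hypothetical.

```python
def project_rate(seasons, league_rate, w=0.7, regression_pa=150):
    """seasons: list of (years_before_projection, successes, pa)."""
    weighted_successes = sum(w ** years * s for years, s, _ in seasons)
    weighted_pa = sum(w ** years * pa for years, _, pa in seasons)
    # Add regression_pa plate appearances at the league rate.
    return (weighted_successes + league_rate * regression_pa) / (weighted_pa + regression_pa)


LEAGUE_RATE = 0.260  # hypothetical league mean

# Both players have an observed rate of .300.
veteran = project_rate([(1, 180, 600), (2, 180, 600), (3, 180, 600)], LEAGUE_RATE)
rookie = project_rate([(1, 30, 100)], LEAGUE_RATE)

print(round(veteran, 4), round(rookie, 4))
```

With the full three-year record the projection stays close to .300, while the 100 PA record is pulled most of the way back to the league rate.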


#47    Tangotiger      (see all posts) 2008/08/08 (Fri) @ 15:11

Brian: one extra wrinkle, if you can indulge me.  Since no one believes that the MLE of 2007 will be as informative as someone’s MLB of 2007 (if both players are the same age) to the performance expectation of 2008, we should add an extra “different league” factor.  (For my discussion, NL and AL are the same league, they are really just two different conferences.)

So, can you test by applying an extra weight to MLEs, such as .9 or .8?

This way, it might turn out that MLB performance of 2006 might be as informative as MLE of 2007, for performance in 2008.

***

Secondly, what if you remove MLE altogether?  What do you get?  The reason Marcel puts in 240 PA of regression is because it has zero knowledge of the minors.  So, if you are introducing a lot of information, then it would make sense that a forecast system would need to regress less.

***

Finally, what if you only look at players with at most 1000 PA in the previous 3 years?  If it’s the case that for most players (PA at least 1200 in prior 3 years) it won’t have much of an effect whether you use 150 or 200 PA for the regression amount, then we really want to sacrifice a bit of accuracy with them (like, who cares if the forecast is 18.2 or 17.9 HR), if we can minimize the damage to the guys that we truly are blind on.

What’s the regression amount?


#48          (see all posts) 2008/08/08 (Fri) @ 15:28

What I have is a bunch of linked queries. The beauty of it is I can put batting data in the table, click on the projection query, and see the projections. Nothing else to do. The more data in the system, the smarter it becomes.

Anyway, after all the factors are computed, there’s a query that applies them to the batting table. I can put a level="MLB" criteria there, and all the following projections should be based only on MLB data. MLB is not yet park adjusted, that’s one of the next steps.

The results I posted need to be re-run one line at a time, but it doesn’t take all that long - put the regression PAs and prior season weights in the regression table, then click on the projections variance query, and wait about two minutes.

I should be able to get it later tonight, after the kids get to bed and the Steelers first pre-season game is over.


#49          (see all posts) 2008/08/09 (Sat) @ 23:45

If you remember in my “correlation” thread, minor league one year correlated less to major league the next year than MLB to MLB did.

So you definitely don’t want to treat MLE’s with the same weight as you treat MLB stats.  I don’t in my projections.  I weight them as around .8 times MLB stats no matter what year.


#50          (see all posts) 2008/08/12 (Tue) @ 02:56

mgl #49 - I believe you are correct, and it showed up in my own numbers

RMS    BA    OB    SA    BAB   HR    BB    SO
Major  .038  .040  .076  .043  .020  .025  .038
Minor  .045  .043  .091  .048  .023  .025  .044

I’ll consider your .8 weight for minors, but next I am going to be tweaking the MLE process, regressing leagues to their levels, and teams to their leagues. From 2005 on I have pbp park factors which will be used in place of sampling for team factors, then combined with sampled and regressed league factors to complete the MLE.

However, the question in this thread was do we regress to the league mean, or to the player’s historical record? Empirically I showed that 150 PAs is best when regressing all components by the same amount, although I agree with Pizza’s recent article at StatSpeak that each component should have its own regression factor, and for individuals it looks like the ones shown in “The Book” agree pretty well with my empirical data.

I also showed that using a weighting factor of 0.7 for each previous year was optimal in reducing the RMS for each of the components.

Reg  Wt    BA     OB     SA     BAB    HR     BB     SO
0    0.00  .0384  .0385  .0786  .0414  .0201  .0235  .0386
150  0.00  .0336  .0344  .0696  .0360  .0183  .0215  .0369
0    0.70  .0332  .0336  .0693  .0363  .0179  .0207  .0352
150  0.70  .0314  .0321  .0651  .0336  .0169  .0198  .0347

Mean error is minimized by using a combination of regression to the league mean and use of the player’s prior seasons, recent seasons weighted more heavily.

When I had pitchers’ batting in my table, and regressed to the league mean of position players’ batting, it was obvious that this was the wrong mean to use for that subclass. We know that the average position player hits very differently than the average pitcher. I believe that we should regress to values that fit what we know so far about the player. In my study, where the player’s prior record includes minor league data, if we know that the player is a 20 year old in High A, then we should regress to the historical record of all other matching players. This will matter most when a player’s record is sparse, for as his PAs mount, the effect of the regression lessens.


#51    tangotiger      (see all posts) 2008/08/12 (Tue) @ 07:11

Yes, you should include as many unique characteristics as possible when doing your regression.  Include position, ht, wt, age, handedness, GB/FB, etc.

Also, if .7 and .8 are very close in terms of RMSE, I’d go with .8.  This reduces the amount of regression you are going to do, since you are going to add 150 PA whether you count the prior year at 70% or 80%.


#52    MGL      (see all posts) 2008/08/12 (Tue) @ 13:59

However, the question in this thread was do we regress to the league mean, or to the players historical record?

Again, there is no such thing as regressing to a player’s historical record rather than the league mean. That makes no sense.

The discussion was whether to regress to a player’s prior projection or to the league (population from which the player comes) mean and it turned out that it is the same thing in different clothes and it doesn’t matter which one you use (as long as you use the proper formulas for each method)- you will come up with exactly the same answer.  It is like asking which method is correct for multiplying 100 times (3+2)?  Multiplying 100 times 5 or multiplying 100 times 3 and then 100 times 2 and then adding them together?

Imagine a player who bats .350 in one season.  And then he bats .300 in the next season.  Why in the world would you regress the .300 toward the .350?  Not only does that not make any sense, but it is dead wrong of course.

We don’t just regress because that is what someone told us to do and we have to wait for someone to tell us what number to regress towards.  We regress because it is a short cut for the Bayesian probability of a player hitting something in a sample of performance but coming from a population where we know the distribution of true talent (and that distribution is fairly normal).

Can we please stop saying that we regress a player towards his prior performance!  Sometimes (most of the time) it works out that way, but that is just an accident because a player’s prior performance (or past or any other time period) tends to already be regressed toward league average.  If a player hits .320 in one year, his prior BA will tend to be .290 (or whatever) if the league BA is less than that.  But, as I already said, if he happened to hit .350 in any prior sample, we certainly don’t want to regress the .320 towards .350 to estimate his true batting average or his BA going forward (a projection).

So the idea that you see thrown around in articles by pseudo-analysts and sometimes real analysts that a player will “regress toward his historical norms” is technically correct most of the time by accident, but a bastardization of what is really going on, which is that a player will tend towards whatever his current projection is.  And his current projection is always his total historical weighted stats plus a regression towards some population mean.

If a player hits .190 in his first year in the majors (let’s assume we have no MLE’s) in 100 PA, and then hits .210 in his next year (in 300 PA), does anyone think that he will regress toward the .190 or even towards the .205 (his total career BA)?  Of course not!
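As a rough illustration of that last example, the sketch below borrows the .250 population mean and the 800 PA regression constant from post #36; neither is stated for this player, so both are assumptions, and the variable names are made up.

```python
K = 800              # assumed regression constant (from post #36)
LEAGUE_MEAN = 0.250  # assumed population mean

# .190 in 100 PA, then .210 in 300 PA
hits = 0.190 * 100 + 0.210 * 300
pa = 100 + 300
career_ba = hits / pa

# Total stats plus K plate appearances of league-average hitting.
projection = (LEAGUE_MEAN * K + hits) / (K + pa)

print(round(career_ba, 3), round(projection, 3))
```

Under those assumptions the projection lands above both seasons and above the career mark, because it is pulled toward the population mean rather than toward the player's own record.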


#53    Tangotiger      (see all posts) 2008/08/12 (Tue) @ 14:25

Ditto what MGL/52 said, minus all the adjectives and exclamation points.


#54          (see all posts) 2008/08/12 (Tue) @ 15:24

mgl, I hope that was a general rant and not directed at my comments in #50. I had restated the original argument in this thread, and then given my conclusion that we should identify what we know about the player, identify other players who match that description, and regress towards that total, and that it really only alters the final numbers to a large degree when we have very few stats available for the player in question. I offered age and level of play as examples. Tango offered position, ht, wt, age, GB/FB, etc. as other characteristics.

I agree you don’t want to regress the projection to the same numbers the projection was built upon.


#55    MGL      (see all posts) 2008/08/12 (Tue) @ 21:05

It was only directed at anyone who is suggesting that:

...we regress...to a player’s historical record...

Which, as I said, you often see on the web in various and sundry articles.

I can’t believe that we had a 54-post thread about nothing that was ever in question in the first place!


#56          (see all posts) 2008/08/12 (Tue) @ 23:28

It was a very good article on regression. By the time I was able to finish reading it, there were about 32 more responses. Seems that to use Bayes, which is over 200 years old, you have to know the prior distribution. This Stein guy showed much more recently that he could regress without prior distributions. But I think that in baseball, we do have prior info about the population, but it helps if we can identify a sub-class of that population which fits best to our existing knowledge of non-statistical characteristics of the player being studied.


#57    Peter Jensen      (see all posts) 2008/08/13 (Wed) @ 03:05

“I can’t believe that we had a 54-post thread about nothing that was ever in question in the first place!”

MGL - The fact that in May you posted the “Amount of Regression” thread shows that this is not a completely settled matter.  Will’s post #7 in that thread was absolutely correct and is exactly the methodology that you ended up using in your post #36 in this thread for the first example.  But you didn’t get it to work in the second example in post #36, which to me means that you didn’t understand what you had done in the first example or why it had worked.  My subsequent explanation in post #37 was exactly the same as Will’s reasoning in post #7 in the other thread, and if you are now convinced that this is the correct methodology, then it seems that you have made progress since May and the length of this thread may have been worthwhile.


