THE BOOK--Playing The Percentages In Baseball


Friday, May 09, 2008

Amount of regression: Here is an interesting question having to do with one of the other threads…

By , 03:27 AM

Maybe Tango and some of the other stat guys can take a stab at this.


In reference to the work I just did on hot and cold starts for pitcher (K-BB)/PA, I was thinking this:

Let’s say that we decide or determine that a certain performance or stat should be aggressively weighted by recency, for whatever reason.  Presumably, it is because players, with regard to that stat, tend to change their true talent level either a lot or by a significant amount, or both.  But it doesn’t matter why, let’s just assume that that is the case.

I mentioned in that thread that we have to use an “effective” sample size for regression purposes when we weight a series of stats. For example, if we have 100 PA in year X and 100 PA in year X+1, and we weight year X+1 100 times more than year X (essentially ignoring year X), we can’t count year X’s 100 PA at full value in our sample for regression purposes, can we? No, of course not. But this creates a conflict. If we now have a much smaller sample to regress, due to our aggressive weighting, we effectively nullify the effect of that aggressive weighting!

Let’s say that in 1996-2006, our player has a rate of 10 (for anything) in 5000 PA and in 2007, he has a rate of 20, in 100 PA.  And let’s also assume that the league average of this guy’s population is 5.  We are going to weight 2007 100 times more than 2006, so we are essentially ignoring all of the prior years.  So we now have an effective sample size of only 100, and we regress toward a league average of 5 and we come up with something LESS (depending on the regression equation of course) than this guy’s career average!  That can’t be right.

Even though you are weighting 2007 very heavily, you would have to regress this guy’s 2007 stats (the 20) toward something other than the league average of 5.  You have to somehow take into consideration the fact that the guy has had 10 prior years with a rate of 10, even though you are weighting the 100 PA in 2007 so heavily.

So the question is, if you have a guy with a lot of history, and you are aggressively weighting recent performance, what do you use for your sample size to do the regression, and what do you use for your mean to regress toward?

Let’s use a more practical example. We are using a weighting system that weights each year 3 times the previous year. A player has 500 PA in each of 3 straight years, and then 300 in year 4 and 300 in year 5. For this particular stat, the regression equation we are going to use is 600/(PA+600). The population mean is 5. Our player has averaged 20 in the first 3 years (for 1500 PA total), and 30 in years 4 and 5 (for 600 total PA).

How would we do a basic Marcel using weighting and regression? I don’t see using mostly years 4 and 5, which we will end up doing because of the aggressive weighting, and then regressing only 600 or so PA toward 5 (the population mean). That does not make sense, as this guy has another 1500 PA at a rate of 20 that we are essentially ignoring. Surely there must be another way. We are essentially saying, “We are pretty sure that this guy is an above average player (with regard to this stat) since he is way above the population mean in 2100 PA,” but we also suspect that his true talent has changed after 2006 and 2007. How do we handle both of these things? If we just do a weighted average of all 5 years, and then regress toward the population mean using our “effective number of PA” (which will be close to only 600, the last 2 years of PA), we will come up with a bad answer because we will be regressing too much. I have no problem with the weighted average, but what do we use for the number of PA in the regression equation?
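To make the conflict concrete, here is a minimal Python sketch of the naive calculation described in the practical example. It assumes the most recent year gets a weight of 1 and each earlier year gets one third of the weight of the year after it. The function and parameter names are illustrative only and do not come from any actual Marcel implementation.

```python
def naive_weighted_regression(pa, rates, decay, k, pop_mean):
    """Weighted average regressed toward the population mean using effective PA.

    pa, rates: per-year PA and rates, oldest year first.
    decay: how many times more each year counts than the year before it.
    k: the constant in the regression equation k / (PA + k).
    """
    n = len(pa)
    # Assumption: most recent year has weight 1, each earlier year 1/decay of the next.
    weights = [decay ** -(n - 1 - i) for i in range(n)]
    effective_pa = sum(w * p for w, p in zip(weights, pa))
    weighted_rate = sum(w * p * r for w, p, r in zip(weights, pa, rates)) / effective_pa
    # Fraction of the distance to the population mean that we regress.
    regression = k / (effective_pa + k)
    estimate = weighted_rate - regression * (weighted_rate - pop_mean)
    return effective_pa, weighted_rate, estimate


eff_pa, rate, est = naive_weighted_regression(
    pa=[500, 500, 500, 300, 300],
    rates=[20, 20, 20, 30, 30],
    decay=3,
    k=600,
    pop_mean=5,
)
print(f"effective PA: {eff_pa:.0f}, weighted rate: {rate:.1f}, estimate: {est:.1f}")
```

Under these assumptions the effective sample is about 480 PA (a bit under 600, since year 4 is also discounted here), the weighted rate is about 28.3, and the regressed estimate lands near 15.4. That is below the 20 he posted over his first 1500 PA, which is exactly the result that looks wrong.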

#1    Tangotiger      (see all posts) 2008/05/09 (Fri) @ 09:56

If I may summarize for the benefit of those who may be missing it.

We are always giving a “weight” of 1 to the sample that we trust the most. The other samples that we trust less get a percentage of that weight. So, if you, for some reason, decide to trust 156 PA of someone who has a 39/2 K/BB ratio far more than his career, then you are weighting his 156 PA as “1” (total of 156) and his other 3000 (or whatever) career PA at “0.1” each, for a total of 456 PA.

However, someone with a more reasonable K/BB ratio would get a weight of “1” for his most recent performance, and perhaps “0.5” for the rest of his career, thereby giving him say 1656 PA. 

So, even though he’s got the exact same number of PA as Cliff Lee, he’s got more weight.

Normally, when we regress, we basically give each pitcher the same, say “league average of 250 PA”.  But, as we can see here, that doesn’t make any sense.
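The arithmetic behind those two totals can be sketched in a few lines of Python. This is only an illustration of the weighting described above; the function name is made up, and the 3000 career PA is the "or whatever" placeholder from the summary.

```python
def effective_pa(recent_pa, career_pa, career_weight):
    """Recent PA count at full weight; older PA are discounted by career_weight."""
    return recent_pa + career_weight * career_pa


# Trusting the recent 156 PA heavily: the older 3000 PA count 0.1 each.
heavy = effective_pa(156, 3000, 0.1)
# A more ordinary recent line: the older 3000 PA count 0.5 each.
ordinary = effective_pa(156, 3000, 0.5)
print(round(heavy), round(ordinary))
```

The two pitchers have identical raw PA, but the effective samples come out to roughly 456 and 1656, so any regression keyed to sample size treats them very differently.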

***

That’s the summary. 

I don’t know the answer.


#2    MGL      (see all posts) 2008/05/09 (Fri) @ 16:07

I thought about this a lot last night. I know there is a mathematical answer, as all of these things are just a basic Bayesian problem: what is the probability he is a true X, given what he has produced in Y number of trials, given the chance that his true talent has changed from A to B, and given the distribution of true talent in the population we think he comes from?

But I still cannot come up with a reasonable way to do a Marcel, the way we usually do it, given a heavy weighting for recent performance, and the example I give.

Tango, can you try and come up with an answer to the example I gave?  Come on, you are good at this stuff!

I’ve never seen you turn down a challenge like that!


#3    tangotiger      (see all posts) 2008/05/09 (Fri) @ 17:26

This is something Andy can do in his sleep I think.  But, you are right, it is somewhat challenging to think about, and try to make sense of.  Let me sleep on it…


#4    william      (see all posts) 2008/05/09 (Fri) @ 17:48

immediate guess as a method which might give a reasonable ballpark figure:

Use the amount of performance we are giving the heavy weighting to, regressed towards our previous estimate of his talent (i.e. all his previous performance regressed towards the league mean):

For the basic example, the 20 would be regressed an appropriate amount towards 9.9 (or whatever is appropriate).

Thinking some more I’m fairly convinced this is not right (see below).
Sanity test: Suppose our regression equation is 100/(100+PA), league mean is 5 and we have a 200PA sample at a rate of 10. Our standard result would be 8.333

divide the sample into 2 100PA samples and apply the suggestion I made above.

We regress to an estimate of 7.5, then regress 10 halfways towards 7.5 to get 8.75.

Late Night statistics results in failure
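The sanity test can be checked with a short Python sketch. It assumes the regression equation 100/(100+PA) from the test, and the helper name is illustrative only.

```python
def regress(rate, pa, mean, k=100):
    """Regress an observed rate toward `mean` by the fraction k / (k + pa)."""
    r = k / (k + pa)
    return rate - r * (rate - mean)


one_step = regress(10, 200, 5)      # all 200 PA at once, toward the league mean
prior = regress(10, 100, 5)         # first 100 PA, toward the league mean
two_step = regress(10, 100, prior)  # second 100 PA, toward that first estimate
print(round(one_step, 3), round(prior, 3), round(two_step, 3))
# prints: 8.333 7.5 8.75
```

Splitting the sample and reusing the same equation for the second step regresses too little, which is why the two-step answer (8.75) overshoots the standard one (8.333).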


#5    william      (see all posts) 2008/05/09 (Fri) @ 17:57

ah - the 2nd regression equation should be altered to 100/(200+PA), where the 200 is the 100 from the initial weighting (the population mean) plus the 100 from the first sample.


#6    MGL      (see all posts) 2008/05/09 (Fri) @ 19:45

I was going to say that I thought your initial idea was good. Then I saw how your example did not work. I don’t get your post #5. The first regression puts us at 7.5, as you say. Now we have another 100 PA at 10. We want to regress the 10 towards the 7.5, in order to hopefully get 8.333. That requires a regression of 67% of course, which implies a regression equation of 200/(PA+200) (I think that is what you meant). But I am not sure where that regression equation comes from, when our original one was 100/(100+PA). IOW, I am not sure how to derive that equation given different parameters.


#7    will      (see all posts) 2008/05/10 (Sat) @ 04:40

sorry - you are correct that the equation in #5 should be 200/(200+PA). The idea for altering the regression is that we are more confident in our estimate of his ability (of 7.5 in the example) after 100PA than we were in our initial estimate of league average (5). Given that the initial regression equation is 100/(100+PA) we are implicitly giving our knowledge of league average a weight of 100PA. The mean to be regressed to outside of the sample data is based on this amount.

Whereas for the second regression, the mean we are regressing to is based on our knowledge of the league mean (given a weight of 100PA) AND the first sample of 100PA. The regression equation then alters to be (100+100)/(100+100+PA).

More generally I would guess:

(A+X*y)/(A+X*y+PA)

where A is the number from our initial equation, y is our number of PA in the previous sample and X is the weighting we are giving to them.

N.B. Take with healthy pinch of salt, this is pretty much guesswork
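A minimal Python sketch of the guessed formula, applied to the earlier sanity test (league mean 5, A = 100, two 100PA samples at a rate of 10). The function and parameter names are illustrative. With X = 1 the second step reproduces the pooled 200PA answer; what the formula gives for X below 1 is the untested guess above, not an established result.

```python
def regress_with_prior(rate, pa, prior_mean, a, prior_pa=0.0, weight=1.0):
    """Regress `rate` toward `prior_mean` by (a + X*y) / (a + X*y + pa).

    a: constant from the initial regression equation.
    prior_pa (y): PA already folded into prior_mean.
    weight (X): weighting given to those earlier PA.
    """
    k = a + weight * prior_pa
    return rate - k / (k + pa) * (rate - prior_mean)


# Step 1: first 100 PA regressed toward the league mean of 5.
first = regress_with_prior(10, 100, prior_mean=5, a=100)
# Step 2: second 100 PA regressed toward step 1, earlier PA at full weight (X = 1).
second = regress_with_prior(10, 100, prior_mean=first, a=100, prior_pa=100, weight=1.0)
# Same step with the earlier PA down-weighted to X = 0.5 (speculative case).
discounted = regress_with_prior(10, 100, prior_mean=first, a=100, prior_pa=100, weight=0.5)
print(round(first, 3), round(second, 3), round(discounted, 3))
```

With X = 1 the second step comes back to 8.333, matching the standard one-step result. Lowering X shrinks the regression fraction, so the newer sample counts for more.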


#8    MGL      (see all posts) 2008/05/10 (Sat) @ 04:46

I don’t think you can alter the regression equation - only the variables within it.

The regression equation is based simply on the spread of talent in the population. That never changes.

Take that, as well, with a hefty pinch of salt!

