Friday, May 09, 2008
Amount of regression: Here is an interesting question having to do with one of the other threads…
Maybe Tango and some of the other stat guys can take a stab at this.
In reference to the work I just did on hot and cold starts for pitcher (K-BB)/PA, I was thinking this:
Let’s say that we decide or determine that a certain performance or stat should be aggressively weighted by recency, for whatever reason. Presumably, it is because players, with regard to that stat, tend to change their true talent level either often or by a significant amount, or both. But it doesn’t matter why; let’s just assume that that is the case.
I mentioned in that thread that we have to use an “effective” sample size for regressive purposes when we weight a series of stats. For example, if we have 100 PA in year X and 100 PA in year X+1, and we weight year X+1 100 times more than year X (essentially ignoring year X), we can’t include the 100 PA in our sample for regression purposes, can we? No, of course not. But this creates a conflict. If we now have a much smaller sample to regress, due to our aggressive weighting, we effectively nullify the effect of that aggressive weighting!
Let’s say that in 1996-2006, our player has a rate of 10 (for anything) in 5000 PA and in 2007, he has a rate of 20, in 100 PA. And let’s also assume that the league average of this guy’s population is 5. We are going to weight 2007 100 times more than 2006, so we are essentially ignoring all of the prior years. So we now have an effective sample size of only 100, and we regress toward a league average of 5 and we come up with something LESS (depending on the regression equation of course) than this guy’s career average! That can’t be right.
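To put a number on that, here is a minimal sketch of the regression step. The paragraph above does not specify a regression equation, so the form k/(PA+k) and the constant k = 600 are assumptions, and `regress_to_mean` is just an illustrative helper name.

```python
def regress_to_mean(observed, pa, mean, k):
    """Shrink an observed rate toward the population mean.

    The regression fraction k / (pa + k) is an assumed form;
    k = 600 below is an assumed constant, not a fitted value.
    """
    regression = k / (pa + k)
    return mean + (1.0 - regression) * (observed - mean)


career_rate = 10.0  # rate over the 5000 PA before 2007

# Use only the 2007 sample: a rate of 20 in an effective 100 PA,
# regressed toward the population mean of 5.
estimate = regress_to_mean(observed=20.0, pa=100, mean=5.0, k=600)

print(round(estimate, 2))
print(estimate < career_rate)
```

Under those assumptions the estimate comes out a little above 7, which is below the career rate of 10. With 100 PA and this form of equation, that happens for any k above 200.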
Even though you are weighting 2007 very heavily, you would have to regress this guy’s 2007 stats (the 20) toward something other than the league average of 5. You have to somehow take into consideration the fact that the guy has had 11 prior years (1996-2006) with a rate of 10, even though you are weighting the 100 PA in 2007 so heavily.
So the question is, if you have a guy with a lot of history, and you are aggressively weighting recent performance, what do you use for your sample size to do the regression, and what do you use for your mean to regress toward?
Let’s use a more practical example. We are using a weighting system that weights each year 3 times as heavily as the previous year. A player has 500 PA in each of 3 straight years, and then 300 in year 4 and 300 in year 5. For this particular stat, the regression equation we are going to use is 600/(PA+600). The population mean is 5. Our player has averaged 20 in the first 3 years (for 1500 PA total), and 30 in years 4 and 5 (for 600 total PA).
How would we do a basic Marcel using weighting and regression? I don’t see using mostly years 4 and 5, which we will end up doing because of the aggressive weighting, and then regressing only 600 or so PA toward 5 (the population mean). That does not make sense, as this guy has another 1500 PA at a rate of 20 that we are essentially ignoring. Surely there must be another way. We are essentially saying, “We are pretty sure that this guy is an above average player (with regard to this stat) since he is way above the population mean in 2100 PA,” but we also suspect that his true talent has changed in the last two years (years 4 and 5). How do we handle both of these things? If we just do a weighted average of all 5 years, and then regress toward the population mean using our “effective number of PA” (which will be close to only 600, the last 2 years of PA), we will come up with a bad answer because we will be regressing too much. I have no problem with the weighted average, but what do we use for the number of PA in the regression equation?
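Here is a rough sketch of that naive approach, so the problem shows up in the numbers. One assumption to flag: "effective PA" is computed here by scaling the weights so the most recent year counts at full value and each earlier year counts one third as much as the next. That is only one possible definition, and `weighted_projection` is a made-up helper name.

```python
def weighted_projection(rates, pas, year_factor, mean, k):
    """Recency-weighted average, then regression toward the mean.

    Assumption: the most recent season gets weight 1 and each earlier
    season is divided by year_factor once more; effective PA is the
    sum of weight * PA under that scaling.
    """
    n = len(rates)
    weights = [year_factor ** -(n - 1 - i) for i in range(n)]
    weighted_pa = [w * pa for w, pa in zip(weights, pas)]
    effective_pa = sum(weighted_pa)
    weighted_rate = sum(wp * r for wp, r in zip(weighted_pa, rates)) / effective_pa
    regression = k / (effective_pa + k)  # the 600/(PA+600) equation from the example
    estimate = mean + (1.0 - regression) * (weighted_rate - mean)
    return weighted_rate, effective_pa, estimate


rates = [20, 20, 20, 30, 30]      # years 1 through 5
pas = [500, 500, 500, 300, 300]

weighted_rate, effective_pa, estimate = weighted_projection(
    rates, pas, year_factor=3, mean=5.0, k=600
)
print(round(weighted_rate, 1), round(effective_pa), round(estimate, 1))
```

With this definition the weighted average lands a bit above 28 and the effective PA comes out near 480, so more than half of the gap to the population mean gets regressed away. The final estimate is in the mid-teens, lower than the 20 this player put up over his first 1500 PA, which is exactly the result that does not look right.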