Wednesday, April 30, 2008
Archives: Component Regression
This is a blast from the past. I highly recommend reading MGL’s article, or at the very least, all the comments.
That’s a good article, and I am glad that I read it again. It reminds me that I used to be able to write a little more clearly, and with less verbosity and convolution than I do now (and with fewer parenthetical statements).
I wrote this article before I knew how to come up with (or at least approximate) regression coefficients from correlations or by looking at the variance in sample distributions. I did it by the seat of my pants, using sample data from two different years (kind of like a manual, poor man’s linear regression).
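To make the "regression coefficients from correlations" idea concrete, here is a minimal Python sketch. It assumes that the year-to-year correlation r of a rate, taken across a group of pitchers with similar sample sizes, can be used as the fraction of a pitcher's deviation from the mean that you keep. The function names and the numbers in the demo are made up for illustration; they are not from the original article.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of rates."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def regress_to_mean(observed, population_mean, r):
    """Keep a fraction r of the deviation from the mean; regress the rest."""
    return population_mean + r * (observed - population_mean)


if __name__ == "__main__":
    # Hypothetical year-1 and year-2 rates for the same five pitchers.
    year1 = [0.280, 0.310, 0.295, 0.320, 0.270]
    year2 = [0.300, 0.290, 0.305, 0.300, 0.285]
    r = pearson_r(year1, year2)
    mean = sum(year1) / len(year1)
    print(round(r, 3), round(regress_to_mean(0.320, mean, r), 3))
```

A low r means most of the observed deviation is treated as noise, so the estimate lands close to the mean; an r near 1 leaves the observed rate almost untouched.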
But the concepts are good, I think.
Of course regressing components individually is not correct (although not too bad, I don’t think), because they are not independent. You will essentially end up “overregressing” overall. However, figuring out how to regress dependent components is NOT easy. In fact, I’ve never seen or heard of anyone doing that.
I thought that my explanation of why DIPS (or FIP or xERA) ERA is good and bad was good. All of these essentially regress BABIP all the way and HR, BB, and K not at all, both of which are incorrect. Because BABIP, even for large samples of data, gets regressed a lot, DIPS works well for large samples of data, but does NOT work well for very small samples, because it does NO regression on HR, BB, and K. For small samples of data, not regressing those components can be very misleading, especially if the sample rates are far from the population mean.
Some people seem to think that the only thing that is luck in a pitcher’s sample performance is his BABIP. Although that is going to have a much larger luck component than K, HR, and BB rates, it (BABIP) is certainly NOT the only thing that indicates luck for a pitcher, especially in small samples. In fact, you are regressing one thing which needs to be regressed heavily, and although the other things do not need to be regressed nearly as much (given a certain sample size, of course), there are 3 OF THEM for which you are completely ignoring regression. Unless you have a large sample of data, that can’t be a good thing.
That is like ignoring all the weeds in your back yard because there are only 1 or 2 of each species, but there are hundreds of species of weeds, so the next thing you know, your whole back yard is filled with weeds.
OK, ignoring K, BB, and HR isn’t quite as bad, but you know what I mean…
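The sample-size point can be sketched in a few lines of Python. This assumes the common shrinkage form, where the weight on the observed rate is n / (n + k) and k is a per-component constant. The k values, the rates, and the function names below are hypothetical placeholders chosen only to show the shape of the effect; they are not estimates from the article or from real data.

```python
def regression_weight(n, k):
    """Weight kept on the observed rate: n / (n + k)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return n / (n + k)


def regressed_rate(observed, population_mean, n, k):
    """Blend the observed rate with the population mean by sample size."""
    w = regression_weight(n, k)
    return w * observed + (1 - w) * population_mean


# Hypothetical constants: a larger k means more regression at a given n.
K_BABIP = 3000.0  # assumed: noisy component, regressed heavily
K_SO = 100.0      # assumed: more stable component, regressed less

if __name__ == "__main__":
    for n in (50, 500, 5000):
        babip = regressed_rate(0.350, 0.300, n, K_BABIP)
        so = regressed_rate(0.300, 0.200, n, K_SO)
        # DIPS-style treatment: K kept at its raw value (k = 0),
        # BABIP regressed all the way (k = infinity).
        so_dips = regressed_rate(0.300, 0.200, n, 0.0)
        babip_dips = regressed_rate(0.350, 0.300, n, float("inf"))
        print(n, round(babip, 3), round(so, 3), so_dips, babip_dips)
```

Under these assumed constants, the gap between the regressed K rate and the raw K rate is large at small n and shrinks as n grows, which is the sense in which skipping regression on K, BB, and HR matters mostly for small samples.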