Monday, October 27, 2008
Do first half/second half splits mean anything?
I thought the results were important enough to warrant their own thread, rather than continuing the last one on first half, second half splits. If you didn’t follow the discussion in that thread, here is the link:
http://www.insidethebook.com/ee/index.php/site/article/first_half_second_half_splits/
To summarize the discussion, a few people pointed out some unusually large first half, second half splits in performance for some players. The question is whether, for players in general, those splits “mean” anything, which is the same as asking whether they have any predictive value, which in turn is the same as asking whether they correlate to any degree from one year to another. For example, we find that platoon splits for RHB have very little predictive value. No matter what a RHB’s platoon splits are in any given time period, they will tend to revert to near the league average for all RHB in any other time period. For LHB, there is some predictive value: the larger the sample of data we have, the more predictive those sample results are.
For RHB (since there is still a small amount of predictive value), we might need 10 years of split data to “tell us anything” about that player’s true talent platoon ratio or difference. For LHB, it might be 2 or 3 years of data.
Anyway, I was skeptical that any sample of first half and second half splits means anything, i.e., has any predictive value. Of course, if there is even a tiny amount of predictive value in a small sample, then with a large enough sample we eventually get tremendous predictive value. But in baseball, for the data to have any practical significance, we really only get to use one year’s worth at the least and maybe 5 or 10 years’ worth at the most. If we have to wait until we have 15 or 20 years of data for it to have much predictive value, that is not particularly interesting, to me at least.
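To put rough numbers on the "tiny predictive value eventually becomes tremendous" point, here is a minimal sketch using the Spearman-Brown formula from test theory, which projects how a correlation between two equal-sized samples grows when each sample is made several times larger. It assumes a player's true split stays fixed and that each added year contributes independent noise. The one-year correlation of 0.05 is a made-up illustration, not a result from this post, and the function name is my own.

```python
def spearman_brown(r_single, factor):
    """Projected correlation when each sample is `factor` times as large.

    r_single: correlation between two samples of the base size
              (for example, one year versus another year).
    factor:   how many times larger each sample becomes
              (for example, 10 for ten years versus ten years).
    """
    return factor * r_single / (1 + (factor - 1) * r_single)


if __name__ == "__main__":
    # Hypothetical one-year-to-one-year correlation, chosen only for illustration.
    r_one_year = 0.05
    for years in (1, 5, 10, 20):
        projected = spearman_brown(r_one_year, years)
        print(f"{years:>2} years of data: projected r = {projected:.3f}")
```

Under those assumptions, a stat that correlates only weakly from one year to the next still needs well over a decade of data before the projected correlation reaches one half, which is the "not particularly interesting" territory described above.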
Anyway, one way to see how much predictive value there is in a certain amount of data is to run a regression from one time period to another. If the correlation is really low, then there is little or no predictive value in that particular stat for that amount of data (the number of opportunities underlying each element in the regression). Hopefully we have enough data (both enough data points in the regression and a decent number of underlying opportunities for each data point) that the uncertainty in the resultant “r” is fairly low (a small standard error).
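As a rough illustration of that kind of check, the sketch below simulates players whose first half minus second half OBP split is measured in two separate years, then computes the year-to-year "r" and an approximate standard error. Everything here is an assumption made for illustration: the simulated data, the 0.330 base OBP, the function names, and the common large-sample approximation sqrt((1 - r^2) / (n - 2)) for the standard error. It is not the methodology or the data used in this post.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def r_standard_error(r, n):
    """Approximate standard error of r based on n data points (players)."""
    return math.sqrt((1 - r * r) / (n - 2))


def simulate_splits(num_players, pa_per_half, true_split_sd, rng):
    """Simulate observed first-half minus second-half OBP for two years.

    true_split_sd is the spread of "true talent" splits across players;
    0.0 means no player has any real first half/second half tendency.
    """
    base_obp = 0.330  # assumed league-average OBP, illustration only
    year1, year2 = [], []
    for _ in range(num_players):
        true_split = rng.gauss(0.0, true_split_sd)
        p_first = base_obp + true_split / 2
        p_second = base_obp - true_split / 2
        observed = []
        for _year in range(2):
            first = sum(rng.random() < p_first for _ in range(pa_per_half))
            second = sum(rng.random() < p_second for _ in range(pa_per_half))
            observed.append((first - second) / pa_per_half)
        year1.append(observed[0])
        year2.append(observed[1])
    return year1, year2


if __name__ == "__main__":
    rng = random.Random(2008)
    for sd in (0.0, 0.020):
        y1, y2 = simulate_splits(300, 300, sd, rng)
        r = pearson_r(y1, y2)
        print(f"true split SD {sd:.3f}: r = {r:.3f} "
              f"(SE about {r_standard_error(r, len(y1)):.3f})")
```

With the true split spread set to zero, any nonzero "r" that comes out is pure noise, which gives a feel for how large a correlation needs to be relative to its standard error before it suggests real predictive value.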
I did such a regression on first half, second half splits. Here are the methodology and the results:

