THE BOOK--Playing The Percentages In Baseball

Monday, October 27, 2008

Do first_half/second_half splits mean anything?

By

I thought the results were important enough to warrant their own thread, rather than continuing the last one on first half, second half splits.  If you didn’t follow that discussion, here is the link:

http://www.insidethebook.com/ee/index.php/site/article/first_half_second_half_splits/

To summarize the discussion, a few people pointed out some unusually large first half, second half splits in performance for some players.  The question is whether, for players in general, those splits “mean” anything. That is the same as asking whether they have any predictive value, which in turn is the same as asking whether they correlate to any degree from one year to another.  For example, we find that platoon splits for RHB have very little predictive value.  No matter what a RHB’s platoon splits are in any given time period, they will tend to revert to near the league average for all RHB in any other time period.  For LHB, there is some predictive value: the larger the sample of data we have, the more predictive those sample results are.

For RHB (since there is still some predictive value, however small), we might need 10 years of split data to “tell us anything” about that player’s true talent platoon ratio or difference.  For LHB, it might be 2 or 3 years of data.

Anyway, I was skeptical that any sample of first half and second half splits means anything, i.e., has any predictive value.  Of course, even if there is only a tiny amount of predictive value in a given sample, a large enough sample eventually gives us tremendous predictive value. But in baseball we realistically get one year of data at the least and maybe 5 or 10 years at the most for it to have any practical significance.  If we have to wait until we have 15 or 20 years of data for it to have much predictive value, that is not particularly interesting, to me at least.
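
To put rough numbers on how a tiny one-year correlation grows with more data, here is a minimal sketch using the Spearman-Brown projection. That formula is an assumption brought in for illustration, not something used in the study, and the one-year “r” of .02 is a made-up value. The function name is hypothetical.

```python
def reliability_after_k_years(r_one_year: float, k: float) -> float:
    """Spearman-Brown projection: expected correlation between two samples
    that are each k times as large as the ones that produced r_one_year."""
    return k * r_one_year / (1 + (k - 1) * r_one_year)


# Made-up one-year correlation, chosen only to illustrate the shape of the curve.
HYPOTHETICAL_ONE_YEAR_R = 0.02

for years in (1, 5, 10, 20):
    projected = reliability_after_k_years(HYPOTHETICAL_ONE_YEAR_R, years)
    print(f"{years:2d} years of data: projected r = {projected:.3f}")
```

Under those assumptions, the projected “r” rises slowly, which is the point of the paragraph above: a very weak signal needs more seasons than most players ever have.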

Anyway, one way to see how much predictive value there is in a certain amount of data is to run a regression from one time period to another.  If the correlation is really low, then there is little or no predictive value to that particular stat for that amount of data (the number of opportunities underlying each element in the regression).  Ideally we have enough data (both data points in the regression and a decent sample size for the underlying number of opportunities for each data point) that the uncertainty in the resultant “r” is fairly low (a small standard error).
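
As a rough illustration of that idea, the sketch below computes a plain Pearson “r” between a stat in one period and the same stat in another. The function name and the five data points are made up for the example; they are not from the database used in this post.

```python
import math


def pearson_r(xs, ys):
    """Plain (unweighted) Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up split differences (first-half OPS minus second-half OPS), one entry
# per player: the same players in two different years.
year1_diff = [0.050, -0.030, 0.010, -0.080, 0.020]
year2_diff = [-0.010, 0.040, -0.020, 0.000, 0.030]

r = pearson_r(year1_diff, year2_diff)
print(f"year-to-year r = {r:.3f}")
```

An “r” near zero says that knowing a player's split in one year tells you essentially nothing about his split in the other.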

I did such a regression on first half, second half splits. Here is the methodology and the results: 


I used the database provided by our own “terpsfan.” First half, second half data from 1974 to 2007, for all players. I used OPS as the stat of choice.  I could have used lwts or RC. It does not make any difference.

I only looked at players who played on one team for the whole season.

I only looked at players who had at least 100 PA in the first and second halves.

There were 6427 data points (player seasons) in the regression.  I regressed the 1st half, second half OPS difference (1st half minus second half) for one year on the same thing for another year.

I overlapped years (e.g., 1974 on 1975 and 1976 on 1975), so the data points are not independent.  That is not the best thing to do, but it is no big deal. If I only used independent data pairs, I would get around the same results.

First I gave every data pair the same weight. In other words, a player who had 120 PA in each half in year 1, and around the same in year 2, got the same weight as a player who had 200 PA in each half of years 1 and 2.

Then I redid the regression, weighting each data pair by the minimum number of PA in any half (out of 4 possible halves, 2 for each year).
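
The steps above can be sketched roughly as follows. This is not the actual code used for the study; the `SeasonSplit` record, the function names, and the assumption that the data arrive as (year 1, year 2) pairs for the same player are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class SeasonSplit:
    """One player season (hypothetical record layout)."""
    pa_first: int      # PA in the first half
    pa_second: int     # PA in the second half
    ops_first: float   # OPS in the first half
    ops_second: float  # OPS in the second half

    @property
    def diff(self) -> float:
        # First half minus second half, as described in the post.
        return self.ops_first - self.ops_second


def qualifies(season, min_pa=100):
    """At least min_pa PA in both halves of the season."""
    return season.pa_first >= min_pa and season.pa_second >= min_pa


def pair_weight(a, b):
    """Minimum PA across the four halves (two per year)."""
    return min(a.pa_first, a.pa_second, b.pa_first, b.pa_second)


def weighted_r(xs, ys, ws):
    """Weighted Pearson correlation; equal weights give the ordinary r."""
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    cov = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    var_x = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    var_y = sum(w * (y - mean_y) ** 2 for w, y in zip(ws, ys))
    return cov / math.sqrt(var_x * var_y)


def split_correlation(pairs, weighted=False):
    """Correlate the year-1 split difference with the year-2 split difference.

    pairs: list of (SeasonSplit, SeasonSplit) for the same player in two years.
    """
    kept = [(a, b) for a, b in pairs if qualifies(a) and qualifies(b)]
    xs = [a.diff for a, _ in kept]
    ys = [b.diff for _, b in kept]
    if weighted:
        ws = [pair_weight(a, b) for a, b in kept]
    else:
        ws = [1.0] * len(kept)
    return weighted_r(xs, ys, ws)
```

Calling `split_correlation(pairs)` corresponds to the “same weight” method, and `split_correlation(pairs, weighted=True)` to the PA-weighted one. The one-team-per-season filter is assumed to have been applied before the pairs are built.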

Here is what I got:

Using the “same weight” method, I got an “r” of -.004 (again, 6427 player seasons).

Using the second method, weighting by PA, I got an “r” of .001.
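
For a sense of scale, here is a back-of-the-envelope check of how those two values compare with the sampling noise in “r”. It uses the textbook approximation for the standard error of a correlation and treats the 6427 data points as independent, which they are not quite (the years overlap), so the true uncertainty is probably somewhat larger. The function name is hypothetical.

```python
import math


def approx_se_of_r(r: float, n: int) -> float:
    """Textbook approximation to the standard error of a Pearson r,
    assuming n independent data points."""
    return math.sqrt((1 - r * r) / (n - 2))


N_DATA_POINTS = 6427  # from the post

for label, r in (("same weight", -0.004), ("weighted by PA", 0.001)):
    se = approx_se_of_r(r, N_DATA_POINTS)
    print(f"{label}: r = {r:+.3f}, approx SE = {se:.4f}, r / SE = {r / se:+.2f}")
```

Under those assumptions the standard error comes out at about .012, so both results are well within one standard error of zero.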

Sorry guys, I see no evidence of these splits having any meaning whatsoever.  None.

I have an open challenge to everyone.  If anyone finds any predictive value to any splits other than the ones we KNOW have predictive value, I’ll donate 100 dollars to the charity of your choice, for each split.  Obviously we have to establish a minimum level of predictability for any given number of opportunities, say an “r” of at least .2 for one year of data.  And I (or someone else I trust) have to verify

(15) Comments • 2008/11/04 • SabermetricsStreaks