Friday, October 24, 2008
First half, second half splits
Dewey (Dwight Evans) was one of my favorite players. He also played in only three All-Star games, a shockingly low total for a player who is at least a borderline Hall of Famer. Jim Rice has been in 8, and Fred Lynn in 9. Dave Parker in 7. Dave Winfield 12. Andre Dawson 8. Those are his peers, more or less. I could go on, but I’d guess all his peers were in at least 5. The reason, as I saw it back when I was a kid and didn’t really study the issue, was that Evans would get hot in the second half, and so lost out on the half-year popularity contest that is the All-Star game. His career stats show him with about a 13-point improvement in wOBA in the second half, based on around 5000 PA. One SD is around 7 wOBA points, which puts him a bit under 2 SD. So there may be something there: the 13 points match my memory, and the gap is at least borderline significant. Then again, I cherry-picked him, and we expect some players to be at the 2 SD level just by chance.
We have a similar player in our midst: Johan Santana. (Hat tip: Joe Poz reader.) On around 3000 PA, his wOBA difference between the 1st and 2nd half is around 28 points. One SD is 9 points, so he’s at THREE SD. You expect to see about 99.7% of all data between -3 and +3 SD. It’s certainly possible that Johan is among the handful of players who are at the extreme range by pure luck. But why does it have to be the best pitcher of this decade?
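For anyone who wants to redo the arithmetic, here is a minimal sketch. It takes the split sizes and SDs quoted above as given (it does not derive the SD from PA), and it assumes the splits are roughly normally distributed. The function names are just for illustration.

```python
import math

def z_score(diff_points, sd_points):
    """Observed split divided by its standard deviation (both in wOBA points)."""
    return diff_points / sd_points

def share_within(z):
    """Share of a normal distribution that falls between -z and +z SD."""
    return math.erf(abs(z) / math.sqrt(2))

# Figures quoted in the post, in wOBA points.
evans = z_score(13, 7)
santana = z_score(28, 9)

for name, z in (("Evans", evans), ("Santana", santana)):
    outside = 1 - share_within(z)
    print(f"{name}: z = {z:.2f}, share of players expected beyond that = {outside:.4f}")

print(f"share within 3 SD: {share_within(3):.4f}")
```

The "share beyond" figure is two-sided: it counts players who are that extreme in either half, which is the right comparison if you went looking for extremes without caring about direction.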
I’d love to see someone tackle the issue of half-splits (preferably using the All-Star game as the split) by this method, to see who the true extremes are, and whether the standard deviation of all the z-scores is more than 1 (a real spread in talent) or equal to 1 (what pure chance would give you).
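A rough sketch of what that test looks like, using simulated players rather than real data. Everything here is an assumption for illustration: the 8-point noise SD, the 6-point spread in "true" split talent, the player count, and the function name. With real data you would replace the simulation with each player's observed split divided by his own SD.

```python
import random
import statistics

def simulate_z_scores(n_players, noise_sd, talent_sd, seed):
    """Simulate one z-score per player: observed split / noise SD.

    noise_sd  -- random variation in the observed split (wOBA points)
    talent_sd -- spread in true split talent; 0 means nobody has a real split
    """
    rng = random.Random(seed)
    z_scores = []
    for _ in range(n_players):
        true_split = rng.gauss(0.0, talent_sd) if talent_sd > 0 else 0.0
        observed = rng.gauss(true_split, noise_sd)
        z_scores.append(observed / noise_sd)
    return z_scores

# Hypothetical league where splits are pure luck.
no_skill = simulate_z_scores(5000, noise_sd=8.0, talent_sd=0.0, seed=1)
# Hypothetical league where true splits vary with an SD of 6 points.
some_skill = simulate_z_scores(5000, noise_sd=8.0, talent_sd=6.0, seed=1)

sd_no_skill = statistics.pstdev(no_skill)
sd_some_skill = statistics.pstdev(some_skill)

print(f"SD of z-scores, luck only:   {sd_no_skill:.2f}")
print(f"SD of z-scores, real spread: {sd_some_skill:.2f}")
```

In the luck-only case the SD of the z-scores should come out near 1. When there is a real spread, it should come out near sqrt(1 + (talent_sd / noise_sd)^2), which is about 1.25 for the numbers assumed here.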


Or just run a correlation for all players. My guess is that it is as close to zero as possible, especially if you remove catchers. In fact, I will lay 2-1 that the correlation is between .1 and -.1. Once you do that (run the correlation), and it is near zero, you’ve pretty much ended the story. That essentially means that everyone’s splits get regressed nearly 100%, so who cares what any individual’s splits are, whether it is 1 SD or 3 SD.
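To make the regression point concrete, here is a small sketch. The comment does not say which two samples get correlated, so this assumes you compare each player's split in one set of seasons to his split in another (say, odd years against even years). The five players' numbers below are made up, and a real run would need far more players, each with enough PA.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def regressed_split(observed, r, league_mean=0.0):
    """Keep the fraction r of the observed split; regress the rest to the mean."""
    return league_mean + r * (observed - league_mean)

# Hypothetical 2nd-half-minus-1st-half splits (wOBA points), same five players.
odd_years = [12, -5, 3, -9, 7]
even_years = [-2, 4, 1, -6, 3]

r = pearson_r(odd_years, even_years)
print(f"correlation between the two samples: {r:.2f}")

# With r at the edge of the bet (.1), a 28-point split shrinks to about 3 points.
print(f"28 points regressed at r = 0.1: {regressed_split(28, 0.1):.1f}")
```

At a correlation of exactly zero, `regressed_split` returns the league mean for everyone, which is the "regressed nearly 100%" case.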
With all due respect to Tango (in this case, I actually mean that), I hate looking at a split and pointing out one or more players that have an extreme one, without first looking at correlation or a similar measure (such as what Tango is suggesting, which is to look at the SD of the Z scores, to see if the distribution looks random, I guess, or is more or less spread out - less is unlikely of course).
I mean, we can play this game forever. Make a list of the players who have extreme splits for anything you can possibly imagine (1st half/2nd half, day/night, home/road, half the opposing teams/the other half, odd days/even days, ad infinitum). My guess is that virtually every single one of them will have a regression of near 100%, other than the obvious ones that we know have a spread in “talent”, like some platoon differentials.