Tuesday, September 30, 2008
HR rates by height
Reason? Sampling bias. Who are the players 6’5” and greater? And do they appear in both sets (aging)? You only have 10% of the sample, so wild swings are much more likely. Create 4 groups of 25% of the ballplayers, and I’d bet you get smooth results.
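To put a rough number on why a 10% group swings more than a 25% group, here is a minimal sketch using the binomial standard error of a rate. The plate-appearance total and the HR rate are made-up values for illustration only, not figures from the data being discussed, and treating each plate appearance as an independent trial is a simplifying assumption.

```python
import math

def rate_standard_error(rate, trials):
    """Binomial standard error of an observed rate over `trials` opportunities."""
    return math.sqrt(rate * (1.0 - rate) / trials)

# Hypothetical inputs, for illustration only (not from the data discussed here).
TOTAL_PA = 100_000   # assumed plate appearances across all players
HR_RATE = 0.03       # assumed HR rate per plate appearance

# Same underlying rate, but one group holds 10% of the sample and the other 25%.
se_10pct = rate_standard_error(HR_RATE, 0.10 * TOTAL_PA)
se_25pct = rate_standard_error(HR_RATE, 0.25 * TOTAL_PA)

print(f"SE with 10% of the sample: {se_10pct:.5f}")
print(f"SE with 25% of the sample: {se_25pct:.5f}")
print(f"10% group is noisier by a factor of {se_10pct / se_25pct:.2f}")
```

Whatever rate and total you plug in, the factor depends only on the group sizes: it is the square root of 25/10.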
Although you never know, I would guess most of it is sampling error. Maybe David can provide the standard errors of those numbers so we can get a sense of how much sampling error there is.
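And remember that since we are dealing with two samples, the standard error of the difference between them is the square root of the sum of the two variances, assuming the samples are independent. The same idea applies, approximately, to the ratio if you work with relative errors.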
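As a sketch of that calculation, assuming the two samples are independent: the function names and the rates and sample sizes below are hypothetical, chosen only to show the arithmetic.

```python
import math

def binomial_se(rate, trials):
    """Standard error of an observed rate over `trials` opportunities."""
    return math.sqrt(rate * (1.0 - rate) / trials)

def se_of_difference(se_a, se_b):
    """Standard error of (a - b) for two independent estimates:
    the square root of the sum of the two variances."""
    return math.sqrt(se_a ** 2 + se_b ** 2)

# Hypothetical HR rates and plate appearances for one height group in two periods.
rate_before, pa_before = 0.040, 8000
rate_after, pa_after = 0.032, 8000

se_before = binomial_se(rate_before, pa_before)
se_after = binomial_se(rate_after, pa_after)
se_diff = se_of_difference(se_before, se_after)

diff = rate_before - rate_after
print(f"difference in HR rate: {diff:.4f}")
print(f"standard error of the difference: {se_diff:.4f}")
print(f"difference in units of its standard error: {diff / se_diff:.2f}")
```

Note that the standard error of the difference is always larger than either sample's own standard error, which is why a gap between two small groups needs to be quite large before it means much.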
And, as Tango implies, unless you are using matched pairs, which I don’t think he is, you may have selective sampling problems. If the environment changes (steroids, the baseball, whatever), the criteria for who plays the most innings might change.
Or David might be right about the steroids thing. However, I don’t really buy the “tall players are not good at defense, so they have an incentive to take steroids” argument. That sounds like a contrived argument to fit the data.
What if only the small and medium players had a precipitous drop-off in HR rates, but not the tall guys? I could easily say, “Well, small and medium guys have an incentive to take steroids, while the tall guys naturally hit more HR anyway.” Which is true.
Or what if only the tall guys had a large drop-off in HR but not the short and medium ones? I could say, “The smaller guys (short and medium) are mostly playing for their defense anyway, whereas the tall guys, who are most likely playing for their offense in the first place, need to make sure that they maintain that offense (by taking steroids).”
You’ve got to be really careful about speculating just to fit some unusual data. I’d much rather see a hypothesis which is on firm ground first, and then an “experiment” with data that either supports that hypothesis or doesn’t.