Wednesday, August 30, 2006
Forecasters: How Accurate Can They Possibly Be?
0.73
Here’s how we can tell:
Suppose you have thousands of samples, say students, and each takes several hundred tests in one session, say 550. Then have those same students take that exact same number of new tests in a second session. We can determine the correlation coefficient (r) between the two sessions in two ways.
1 - A sample-to-sample regression.
2 - Using only the results from the first session
The second one takes the true variance of the students in question, var(true), and divides it by the observed variance of their actual test results, var(obs). That’s your “r”.
The problem of course is that we don’t know var(true). We could estimate it from the equation:
var(obs) = var(true) + var(random)
However, we don’t know var(random) either.
Enter the binomial. Let’s forget about students, and look at baseball players, and their OBP. Fortunately, an OBP is simply the safe plays divided by the safe plus out plays. So, we can determine the random standard deviation using the binomial.
SD(random) = sqrt(OBP*(1-OBP)/PA)
Remember also that SD^2 = variance
If you select a few hundred ballplayers every year with 550 PA, we can figure that the var(random) = .020 ^ 2. It’s also easy enough to observe their OBP and get the var(obs) as around .039 ^ 2, depending what years you select. var(true) is then derived from these two numbers: var(true) = var(obs) - var(random) = .039^2 - .020^2, or about .033^2.
Our “r” is .033^2 / .039^2 = .72
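The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the league-average OBP of .335 is an assumption, since the post does not state which average was used. Because the SDs quoted above are rounded to three decimals, the unrounded result can differ from .72 in the second decimal place.

```python
import math

def binomial_sd(rate, pa):
    """Random (binomial) SD of an observed rate over `pa` trials."""
    return math.sqrt(rate * (1.0 - rate) / pa)

def max_r(sd_obs, sd_random):
    """var(true) / var(obs), where var(true) = var(obs) - var(random)."""
    var_obs = sd_obs ** 2
    var_true = var_obs - sd_random ** 2
    return var_true / var_obs

LEAGUE_OBP = 0.335  # assumed league average, not given in the post
SD_OBS = 0.039      # observed SD of OBP among 550-PA hitters (from the post)

sd_random = binomial_sd(LEAGUE_OBP, 550)
r = max_r(SD_OBS, sd_random)
print(round(sd_random, 3), round(r, 2))
```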
What does this mean? You can take several hundred ballplayers, give them 550 PA one year, give them 550 PA another year, make sure that these guys’ true talent level in OBP does not change, make sure they play in the same parks, make sure they face the same quality of pitchers, and their year-to-year correlation will be 0.72. That is, the absolute maximum year-to-year r you can hope for, given a large number of ballplayers is .72.
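One way to check that claim is a small simulation: give every player a fixed true OBP, play two independent 550-PA seasons, and correlate the two. The sketch below assumes true talent is normally distributed around .335 with an SD of .033; the normal shape, the .335 mean, the player count, and all function names are assumptions made for illustration, not part of the original analysis.

```python
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def simulate_season(true_obp, pa, rng):
    """Observed OBP from `pa` independent trials at a fixed true OBP."""
    on_base = sum(1 for _ in range(pa) if rng.random() < true_obp)
    return on_base / pa

def simulate_year_to_year(n_players=2000, pa=550, mean_obp=0.335,
                          sd_true=0.033, seed=0):
    rng = random.Random(seed)
    # Assumed: true talent is normal and does not change between years.
    talents = [rng.gauss(mean_obp, sd_true) for _ in range(n_players)]
    year1 = [simulate_season(t, pa, rng) for t in talents]
    year2 = [simulate_season(t, pa, rng) for t in talents]
    return pearson_r(year1, year2)

r_sim = simulate_year_to_year()
print(round(r_sim, 2))
```

With a couple of thousand simulated players the result should land near the theoretical ceiling, give or take a few hundredths of sampling noise. Nothing about talent, parks, or opposition changes between the two seasons here, so what is left is purely the binomial noise.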
How about instead of OBP, we look at wOBA (which is analogous to OPS)? Here, our var(true) is .036^2, var(random) is .022^2, and var(observed) is .042^2. Our r is .73.
So, when you look at forecasters and the correlation coefficient of their forecasts to the actual results, anything close to .73 means that they did as good a job as possible. (They could actually go over that level, since the number of players in their sample is still small enough that the level of uncertainty of that r will be a bit high. But, given thousands of players over several years, that uncertainty level will drop quite a bit.)
The other key question is: how does Marcel The Monkey do? A few years ago, when I ran it, I think the r was .65. I’ll have to redo that to see what it actually is after several years of results. But, that’s what everyone is fighting for, to get from the .65 level to the impossible .73 level.
And remember, I used 550 PA for each player. Drop that down, and the maximum r will drop down as well.
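To see how quickly the ceiling falls with playing time, here is a minimal sketch that holds the spread of true talent fixed and only changes PA. It uses the OBP figures from above (true SD of .033) and an assumed league OBP of .335; the function name is hypothetical.

```python
def max_r_for_pa(pa, sd_true=0.033, league_obp=0.335):
    """Ceiling on year-to-year r when every player gets `pa` plate appearances."""
    var_true = sd_true ** 2
    var_random = league_obp * (1.0 - league_obp) / pa  # binomial variance of a rate
    return var_true / (var_true + var_random)

for pa in (100, 200, 300, 550, 700):
    print(pa, round(max_r_for_pa(pa), 2))
```

Since var(random) is proportional to 1/PA while var(true) stays the same, the ratio shrinks as PA goes down.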
Hi Tango,
If I could paraphrase, you’re essentially arguing that 1) if we model the population of hitters as having two sources of variance, skill and error; and 2) we assume that the error variance can be approximated as the variance of n Bernoulli trials with p = OBP and n = PA; and 3) we assume that no system can reliably predict the error variance; then we can empirically calculate the portion of the variance the best predictive system could achieve.
This seems reasonable, but what happens if we don’t fully accept #2? How would it change the situation if the batter didn’t have a static OBP, but rather had a varying “true” OBP that depended on the circumstances of each PA?
Excuse me while I think out loud here… Let’s take a ridiculously extreme case, such that a player who appears to have a .335 OBP over 550 PA has a “true” rate of .100 for 275 PA, and .570 for the other 275 PA. The variance (np(1-p)) for a constant .335 OBP guy would be about 122 on-bases per 550 PA, whereas for a half & half .100/.570 guy, it would be about 92. The standard deviation for the .335 guys would be 11.1 on-bases per 550 PA, whereas the .100/.570 guys would have an SD of 9.6 on-bases.
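The comparison can be checked with a short sketch that sums np(1-p) over each block of plate appearances. It assumes the 550 PA are split evenly, 275 at each rate, and that every PA is independent; the function name and the list-of-segments layout are made up for this example, and other splits can be tried by changing the inputs.

```python
def on_base_variance(segments):
    """Variance of total times on base.

    `segments` is a list of (pa, true_rate) pairs; each block of PA is
    treated as independent Bernoulli trials at its own rate.
    """
    return sum(pa * p * (1.0 - p) for pa, p in segments)

constant = on_base_variance([(550, 0.335)])
split = on_base_variance([(275, 0.100), (275, 0.570)])  # assumed even split

print(round(constant, 1), round(split, 1))                # variances
print(round(constant ** 0.5, 1), round(split ** 0.5, 1))  # standard deviations
```

Both hitters average .335 overall, but the split hitter's total varies less, because p(1-p) is smaller at rates far from .500.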
Do you think moving beyond a Bernoulli model of hit probability could allow a predictive system to theoretically break the .73 barrier?