Wednesday, August 30, 2006
Forecasters: How Accurate Can They Possibly Be?
0.73
Here’s how we can tell:
Suppose you have a sample of thousands of students, and each one takes several hundred tests (say, 550) in one session. Then those same students take the same number of new tests in a second session. We can determine the correlation coefficient (r) between the two sessions in two ways.
1 - A sample-to-sample regression.
2 - Using only the results from the first session
The second one is based on taking the true variance among the students in question, var(true), and dividing it by the variance of their observed test results, var(obs). That’s your “r”.
The problem, of course, is that we don’t know var(true). We could estimate it from the equation:
var(obs) = var(true) + var(random)
However, we don’t know var(random) either.
Enter the binomial. Let’s forget about students, and look at baseball players, and their OBP. Fortunately, an OBP is simply the safe plays divided by the safe plus out plays. So, we can determine the random standard deviation using the binomial.
sqrt(OBP*(1-OBP)/PA)
Remember also that SD^2 = variance
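As a quick check of the binomial step, here is a minimal sketch in Python. The league-average OBP of .340 is an assumption chosen for illustration (the post does not state one), and `binomial_sd` is a hypothetical helper name. With 550 PA the result lands close to .020.

```python
import math


def binomial_sd(rate: float, trials: int) -> float:
    """Standard deviation of an observed rate due to binomial luck alone."""
    return math.sqrt(rate * (1.0 - rate) / trials)


LEAGUE_OBP = 0.340  # assumption: illustrative league-average OBP, not from the post
PA = 550

sd_random = binomial_sd(LEAGUE_OBP, PA)
var_random = sd_random ** 2  # SD^2 = variance

print(f"SD(random)  = {sd_random:.4f}")
print(f"var(random) = {var_random:.6f}")
```

The exact value shifts a little with the OBP you plug in, but for anything near league average it stays around .020.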
For ballplayers with 550 PA, the binomial gives var(random) = .020^2. It’s also easy enough to take a few hundred such players every year, observe their OBP, and get var(obs) of around .039^2, depending on which years you select. var(true) is then derived from these two numbers as var(obs) - var(random) = .033^2.
Our “r” is .033^2 / .039^2 = .72
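The arithmetic can be sketched in a few lines of Python. The function names are hypothetical; the inputs are the .039 and .020 figures quoted above. Note that the result is sensitive to rounding: squaring the rounded .033 gives about .72, while carrying the unrounded true SD through gives about .74.

```python
import math


def true_variance(sd_obs: float, sd_random: float) -> float:
    """var(true) = var(obs) - var(random)."""
    return sd_obs ** 2 - sd_random ** 2


def max_r(sd_obs: float, sd_random: float) -> float:
    """r = var(true) / var(obs)."""
    return true_variance(sd_obs, sd_random) / sd_obs ** 2


SD_OBS = 0.039     # observed spread in OBP, from the post
SD_RANDOM = 0.020  # binomial spread at 550 PA, from the post

sd_true = math.sqrt(true_variance(SD_OBS, SD_RANDOM))
r_unrounded = max_r(SD_OBS, SD_RANDOM)
r_rounded = round(sd_true, 3) ** 2 / SD_OBS ** 2  # the post's .033^2 / .039^2

print(f"SD(true)              = {sd_true:.4f}")
print(f"r with rounded .033   = {r_rounded:.3f}")
print(f"r with unrounded SD   = {r_unrounded:.3f}")
```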
What does this mean? You can take several hundred ballplayers, give them 550 PA one year, give them 550 PA another year, make sure that these guys’ true talent level in OBP does not change, make sure they play in the same parks, make sure they face the same quality of pitchers, and their year-to-year correlation will be 0.72. That is, the absolute maximum year-to-year r you can hope for, given a large number of ballplayers is .72.
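That thought experiment can be simulated directly. The sketch below is an illustration under stated assumptions, not the method used in the post: true talent is drawn from a normal distribution with mean .340 (an assumed league average) and SD .033, every player keeps exactly the same talent in both years, and each PA is an independent coin flip. The player count, seed, and helper names are all hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_obp(true_obp, pa, rng):
    """Observed OBP over `pa` plate appearances at a fixed true talent."""
    times_on_base = sum(1 for _ in range(pa) if rng.random() < true_obp)
    return times_on_base / pa


rng = random.Random(2006)  # fixed seed so the run is repeatable
N_PLAYERS = 2000           # assumption: large enough to steady the estimate
PA = 550
MEAN_OBP = 0.340           # assumption: illustrative league average
SD_TRUE = 0.033            # spread in true talent, from the post

talents = [rng.gauss(MEAN_OBP, SD_TRUE) for _ in range(N_PLAYERS)]
year1 = [simulate_obp(t, PA, rng) for t in talents]
year2 = [simulate_obp(t, PA, rng) for t in talents]  # same talent, new luck

r = pearson_r(year1, year2)
print(f"simulated year-to-year r = {r:.3f}")
```

Even with talent, parks, and opposition held perfectly constant, the simulated r should settle in the low-to-mid .70s rather than anywhere near 1.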
How about instead of OBP, we look at wOBA (which is analogous to OPS)? Here, our var(true) is .036^2, var(random) is .022^2, and var(observed) is .042^2. Our r is .73.
So, when you look at the correlation coefficient between a forecaster’s forecasts and the actual results, anything close to .73 means the forecaster did as good a job as possible. (A forecaster could actually go over that level, since the number of players in the sample is still small enough that the uncertainty in that r will be a bit high. But, given thousands of players over several years, that uncertainty will drop quite a bit.)
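To get a feel for how much that uncertainty shrinks with sample size, one option is the Fisher z-transformation, a standard approximation for the sampling error of a correlation. It is brought in here for illustration and is not part of the original argument. The sample sizes of 300 and 3000 players are assumptions, and the approximation treats the data as roughly bivariate normal.

```python
import math


def fisher_interval(r: float, n: int, z_crit: float = 1.96):
    """Approximate 95% interval for a correlation r observed on n pairs."""
    z = math.atanh(r)             # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)   # approximate standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


small = fisher_interval(0.73, 300)   # assumption: one season's worth of regulars
large = fisher_interval(0.73, 3000)  # assumption: several seasons pooled

for label, (lo, hi) in (("n=300", small), ("n=3000", large)):
    print(f"{label:>7}: r between {lo:.3f} and {hi:.3f}")
```

With a few hundred players the interval is wide enough that an observed r a bit above .73 would not be surprising; with thousands it tightens considerably.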
The other key question is: how does Marcel The Monkey do? A few years ago, when I ran it, I think the r was .65. I’ll have to redo that to see what it actually is after several years of results. But, that’s what everyone is fighting for, to get from the .65 level to the impossible .73 level.
And remember, I used 550 PA for each player. Drop that down, and the maximum r will drop down as well.
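Here is a rough sketch of how the ceiling moves with playing time, assuming the spread in true talent stays at .033 and the random part follows the binomial at an assumed league OBP of .340. Both the function name and the list of PA levels are hypothetical.

```python
def max_r_at(pa: int, sd_true: float = 0.033, rate: float = 0.340) -> float:
    """Ceiling on year-to-year r when every player gets `pa` plate appearances.

    rate is an assumed league-average OBP; sd_true is the post's .033.
    """
    var_true = sd_true ** 2
    var_random = rate * (1.0 - rate) / pa  # binomial variance of the rate
    return var_true / (var_true + var_random)


ceilings = {pa: max_r_at(pa) for pa in (100, 200, 350, 550, 700)}

for pa, r in ceilings.items():
    print(f"{pa:>4} PA: max r = {r:.2f}")
```

Under these assumptions the ceiling sits near .73 at 550 PA and falls to roughly one half by 200 PA, because var(random) grows as PA shrinks while var(true) stays put.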