Tuesday, April 01, 2008
Cross-checking the data providers
Fabulous article by Peter Jensen:
Let’s take the two observers in closest agreement, BIS and Greg, split the difference between them and call that the best guess of the actual hit location. What is the minimum distance and degrees that will have 95 percent of both Greg’s and BIS’ observations included? The answer is ±18 feet and ±4 degrees. That’s a pretty big area. It is two whole zones in width.
...
It doesn’t matter if you have three observers or 3,000, the composite data will never have any less error than that of the two closest. Having many observers is only useful for finding those two best observers.
Fantastic stuff. And great point. Peter is right that, if I throw in as many observers as I can, I wouldn’t want to weight each one equally. The better the estimator (relative to the other 2,999), the more I would weight that observer. Ideally, you’d be down to just one observer, the perfect guy. Realistically, you might have one observer carry 10% of the weight, another 9%, another 8%, and on and on, such that you only need about 20 observers out of the 3,000.
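To make the weighting idea concrete, here is a minimal sketch of one common scheme, inverse-variance weighting. It is not something Peter's article uses. It assumes each observer's errors are unbiased and independent of the others, which is a strong assumption, and the error sizes and distances below are made up for illustration.

```python
def inverse_variance_weights(error_sds):
    """Weights proportional to 1 / sd^2, normalized to sum to 1.

    error_sds: assumed typical error (standard deviation, in feet)
    for each observer. These are hypothetical inputs, not measured values.
    """
    precisions = [1.0 / (sd * sd) for sd in error_sds]
    total = sum(precisions)
    return [p / total for p in precisions]


def combine(estimates, weights):
    """Weighted average of the observers' estimates for one batted ball."""
    return sum(w * x for w, x in zip(weights, estimates))


# Hypothetical: one observer with a 5-foot typical error, two with 10 feet.
sds = [5.0, 10.0, 10.0]
weights = inverse_variance_weights(sds)

# Hypothetical recorded distances (feet) for a single ball.
distances = [300.0, 310.0, 290.0]
best_guess = combine(distances, weights)
```

The better observer ends up with four times the weight of each of the other two. If the observers share a bias, the independence assumption fails and these weights would overstate how much the extra observers help.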
However, his conclusion that the error is now 22 feet doesn’t necessarily mean that’s bad. If the two closest observers were within 18 feet of each other, but the third observer was in fact the best for a particular data point, I’m not sure that we’d want the 18 feet. For example, MGL and Marcel have a similar forecasting engine as their basis, while Chone does not. Selecting the two closest in agreement (MGL, Marcel) doesn’t mean that it’s necessarily bad if we also include Chone. Perhaps Greg and BIS are biased in the same manner (relying more on video than in-park).
Question to Peter: what is the correlation of STATS, BIS to Greg? And what is the weight for each of those two? Repeat for the other combinations. Couldn’t we come up with a better estimate of where a ball landed based on different weightings?
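As a rough sketch of what "repeat for the other combinations" could look like, the snippet below takes every pair of observers, splits the difference on each ball, and finds the half-gap that covers 95 percent of the balls. It handles distance only (not angle), the observer labels and numbers are hypothetical, and the percentile rule is one simple choice among several.

```python
from itertools import combinations
import math


def percentile_95(values):
    """Smallest input value v such that at least 95% of the inputs are <= v."""
    ordered = sorted(values)
    k = math.ceil(0.95 * len(ordered)) - 1
    return ordered[k]


def pairwise_half_gap(observations):
    """For each pair of observers, the 95% half-gap in feet.

    observations: dict mapping observer name -> list of recorded distances,
    with the same balls in the same order for every observer.
    """
    result = {}
    for a, b in combinations(sorted(observations), 2):
        # Distance from each observer to the midpoint of the pair.
        half_gaps = [abs(x - y) / 2 for x, y in zip(observations[a], observations[b])]
        result[(a, b)] = percentile_95(half_gaps)
    return result


# Hypothetical recorded distances (feet) for four balls, three observers.
observations = {
    "A": [300, 310, 250, 400],
    "B": [304, 306, 262, 398],
    "C": [330, 280, 270, 420],
}
gaps = pairwise_half_gap(observations)
closest_pair = min(gaps, key=gaps.get)
```

This only measures how far apart the observers are from each other. Without an independent measure of where the ball actually landed, it cannot say which observer in a pair is closer to the truth.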


It doesn’t matter what the actual landing location is. There is no possible actual landing location that has any two of the four systems’ observed landing points within 18 feet and 4 degrees 95% of the time. That doesn’t mean that Greg or BIS or STATS might not be exactly right with 100% of their observations. But if Greg is exactly right, it means that both BIS and STATS, who are each trying equally hard to be correct, are each going to be at least 18 feet or 4 degrees away from the correct (Greg’s) location at least 5% of the time.
You can’t weight systems and get a smaller error. You can’t add observers and get a smaller error, unless one of those observers is closer to one of the two “best” observers that you already have.