Tuesday, April 01, 2008
Cross-checking the data providers
Fabulous article by Peter Jensen:
Let’s take the two observers in closest agreement, BIS and Greg, split the difference between them and call that the best guess of the actual hit location. What is the minimum distance and degrees that will have 95 percent of both Greg’s and BIS’ observations included? The answer is +-18 feet and +-4 degrees. That’s a pretty big area. It is two whole zones in width.
...
It doesn’t matter if you have three observers or 3,000, the composite data will never have any less error than that of the two closest. Having many observers is only useful for finding those two best observers.
Fantastic stuff. And great point. Peter is right that, if I throw in as many observers as I can, I wouldn’t want to weight each one equally. The better the estimator (relative to the other 2,999), the more I would weight that observer. Ideally, you’d be down to just one observer, the perfect guy. Realistically, you might have one observer carry 10% of the weight, another 9%, another 8%, and on and on, such that you only need about 20 observers out of the 3,000.
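To make the weighting idea concrete, here is a minimal sketch. It assumes each observer's error is unbiased and independent of the others, in which case weighting by the inverse of the error variance is the standard choice. That is my assumption, not something from Peter's article, and the function names and error figures below are made up for illustration.

```python
def inverse_variance_weights(error_sds):
    """Weight each observer by 1 / sd**2, normalized so the weights sum to 1."""
    raw = [1.0 / (sd * sd) for sd in error_sds]
    total = sum(raw)
    return [r / total for r in raw]


def observers_needed(weights, coverage=0.9):
    """Smallest number of top-weighted observers whose weights reach `coverage`."""
    running = 0.0
    for count, w in enumerate(sorted(weights, reverse=True), start=1):
        running += w
        if running >= coverage:
            return count
    return len(weights)


# Hypothetical error standard deviations, in feet, for four observers.
sds = [5.0, 10.0, 10.0, 20.0]
weights = inverse_variance_weights(sds)

# The best observer dominates; the worst barely registers.
print(weights)
print(observers_needed(weights, coverage=0.9))
```

With 3,000 observers the same pattern would hold: the weights fall off quickly, and a small handful of observers carries nearly all of the load.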
However, his conclusion that the error is now 22 feet doesn’t necessarily mean that’s bad. If the two closest observers were within 18 feet of each other, but the third observer was in fact the best for a particular data point, I’m not sure that we’d want the 18 feet. For example, MGL and Marcel have a similar forecasting engine as their basis, while Chone does not. Selecting the two closest in agreement (MGL, Marcel) doesn’t mean that it’s necessarily bad to also include Chone. Perhaps Greg and BIS are biased in the same manner (relying more on video than in-park).
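Here is a rough sketch of why close agreement is not the whole story. It uses the textbook formula for the variance of a weighted sum of correlated errors. All the numbers are invented: two observers with a 10-foot error SD whose errors are correlated at 0.9, and a third, slightly worse observer (12-foot SD) whose errors are independent of theirs.

```python
def combined_variance(weights, cov):
    """Error variance of a weighted combination: sum of w_i * w_j * cov[i][j]."""
    n = len(weights)
    return sum(
        weights[i] * weights[j] * cov[i][j]
        for i in range(n)
        for j in range(n)
    )


# Hypothetical covariance of errors (feet squared) for the two close observers:
# variance 100 each, covariance 0.9 * 10 * 10 = 90.
pair_cov = [
    [100.0, 90.0],
    [90.0, 100.0],
]
pair_var = combined_variance([0.5, 0.5], pair_cov)

# Add a third observer with variance 144, uncorrelated with the first two.
trio_cov = [
    [100.0, 90.0, 0.0],
    [90.0, 100.0, 0.0],
    [0.0, 0.0, 144.0],
]
trio_var = combined_variance([1 / 3, 1 / 3, 1 / 3], trio_cov)

print(pair_var)  # about 95
print(trio_var)  # about 58.2
```

Under these made-up assumptions, even an equal-weight blend that includes the noisier third observer has less error than the two who agree with each other, because those two largely share the same mistakes.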
Question to Peter: what is the correlation of STATS, BIS to Greg? And what is the weight for each of those two? Repeat for the other combinations. Couldn’t we come up with a better estimate of where a ball landed based on different weightings?
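For anyone who wants to try this with real data, the two calculations in question could look something like the sketch below. The observer readings are placeholder numbers, not anything from STATS, BIS or Greg, and the helper names are my own.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length lists of readings."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def weighted_location(points, weights):
    """Weighted average of (x, y) landing spots, one point per observer."""
    total = sum(weights)
    x = sum(w * px for (px, _), w in zip(points, weights)) / total
    y = sum(w * py for (_, py), w in zip(points, weights)) / total
    return (x, y)


# Placeholder hit distances (feet) for the same four batted balls.
greg = [250.0, 310.0, 180.0, 400.0]
bis = [255.0, 300.0, 190.0, 395.0]
stats = [240.0, 330.0, 170.0, 410.0]

print(pearson(bis, greg))
print(pearson(stats, greg))

# Three observers' (x, y) for a single ball, with hypothetical weights.
spots = [(100.0, 220.0), (104.0, 226.0), (90.0, 210.0)]
print(weighted_location(spots, [0.5, 0.3, 0.2]))
```

The weights here are simply typed in; the interesting part of the question is how to choose them from the correlations.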