Thursday, April 08, 2010
Possible radical change in UZR methodology - need advice…
Most of you know the basic methodology for determining the league-wide out rates for each batted ball in UZR. It is a zone-based system, where I arbitrarily create zones based on the “exact” locations of the batted balls (they are actually recorded as x/y coordinates on the field). The zones are not too small (in order to increase sample size within each zone) and not too large (so that we don’t dilute the results too much, especially for small samples of player data).
Obviously, one of the ways to get the best of both worlds is to have very small zones but have one zone inform the other. For example, say you have 5 small adjacent zones directly behind the normal position of the typical CF’er and the catch rates for those 5 zones are .20, .21, .18, .15, and .13. Well, we can assume that the .21 is “wrong” (due to the relatively small number of balls hit in that zone and their different trajectories, as well as stringer error), and we can “smooth out” those zones, such that we make the out rates something like .20, .19, .18, .16, and .15, as long as the total out rate for all the zones being “smoothed out” is the same before and after the smoothing.
I have wished for a long time that I did something like that instead of using the relatively large zones that I use. Even with the large zones I use, I am sure I could also do some smoothing like I described above.
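The smoothing described above can be sketched in a few lines. This is only a minimal illustration, not anything UZR actually does: the function name `smooth_out_rates`, the 50/50 blend with neighboring zones, and the ball counts per zone are all assumptions made up for the example. Each zone's rate is blended with the average of its neighbors, then every rate is rescaled so that the total expected outs across the zones are unchanged.

```python
def smooth_out_rates(rates, balls, weight=0.5):
    """Blend each zone's out rate with its neighbors, keeping total outs fixed.

    rates  -- observed out rate per zone, in order across the field
    balls  -- number of balls hit into each zone (hypothetical counts here)
    weight -- how much of each zone's rate comes from its neighbors (assumed)
    """
    n = len(rates)
    blended = []
    for i in range(n):
        neighbors = [rates[j] for j in (i - 1, i + 1) if 0 <= j < n]
        neighbor_avg = sum(neighbors) / len(neighbors)
        blended.append((1 - weight) * rates[i] + weight * neighbor_avg)

    # Rescale so the expected number of outs is the same before and after.
    outs_before = sum(r * b for r, b in zip(rates, balls))
    outs_after = sum(r * b for r, b in zip(blended, balls))
    scale = outs_before / outs_after
    return [r * scale for r in blended]


rates = [0.20, 0.21, 0.18, 0.15, 0.13]   # the five zones from the example
balls = [40, 35, 50, 45, 30]             # made-up sample sizes
smoothed = smooth_out_rates(rates, balls)
print([round(r, 3) for r in smoothed])
```

A heavier `weight` smooths more aggressively. With real data one would probably want to weight neighbors by their sample sizes too, which this sketch does not do.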
For reasons which I won’t go into, I have not done that or anything like it.
Of course, better than just manually smoothing the out rates in the zones would be to use some kind of a smoothing function or a best fit model which incorporates the location parameters. Basically a best fit regression line I guess.
I think that the SAFE system does something like that, and I have always admired that kind of methodology (without knowing exactly what they do).
I didn’t feel like I had the expertise to do that, plus I didn’t really want to go back and start undertaking some large reconstruction of the UZR methodology since I think it works fairly well as it is.
Anyway, with all the publicity and exposure that UZR has been getting, I am feeling some pressure to improve the model.
Recently, I had an idea (see below this paragraph, after the jump). I would like some input from our smart readers as to whether it is correct and whether you think it will improve UZR (which would be a guess on your part, of course) enough to be worth the trouble. It actually might not be too much trouble.
Oh, and I am thinking about this mostly for outfield (air) balls, but it could apply to ground balls as well. I would like your opinion on that too.
First, look at the out rates for various coordinates on the field to try to determine the point where the fielder is most likely positioned (on average, though I will do this separately for various situations, namely R/L batter and the average power of the batter). Is there a mathematical way to do this? Do I construct a plot (like a heat map) and just eyeball it? Is there some calculus or other formula that can do this?
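One possible way to do this without eyeballing a heat map is sketched below. It rests on an assumption, not an established method: that the out rate peaks where the fielder stands and falls off roughly evenly in all directions. Under that assumption, you can chop the field into small grid cells, compute the out rate in each cell, and take the average of the cell centers weighted by out rate. The function name `estimate_position`, the 20-foot cell size, the minimum of 20 balls per cell, and the synthetic data are all hypothetical choices for illustration.

```python
import math
import random
from collections import defaultdict


def estimate_position(balls, cell=20.0, min_balls=20):
    """Estimate the typical fielder position from (x, y, caught) records.

    Returns the out-rate-weighted average of grid cell centers.
    """
    counts = defaultdict(lambda: [0, 0])  # cell index -> [balls, outs]
    for x, y, caught in balls:
        key = (math.floor(x / cell), math.floor(y / cell))
        counts[key][0] += 1
        counts[key][1] += caught

    wx = wy = wsum = 0.0
    for (i, j), (n, outs) in counts.items():
        if n < min_balls:          # skip cells with too few balls to trust
            continue
        rate = outs / n
        wx += rate * (i + 0.5) * cell
        wy += rate * (j + 0.5) * cell
        wsum += rate
    return wx / wsum, wy / wsum


# Synthetic data: a made-up fielder at (10, 310) whose catch probability
# drops off with distance. Real stringer data would replace this.
rng = random.Random(0)
true_x, true_y = 10.0, 310.0
balls = []
for _ in range(20000):
    x = rng.uniform(-150, 150)
    y = rng.uniform(150, 450)
    d2 = (x - true_x) ** 2 + (y - true_y) ** 2
    p = math.exp(-d2 / (2 * 40.0 ** 2))
    balls.append((x, y, 1 if rng.random() < p else 0))

est_x, est_y = estimate_position(balls)
print(round(est_x, 1), round(est_y, 1))
```

If fielders really do cover more ground coming in than going back, the estimate would be pulled toward the easier side, so it is a starting point rather than an answer.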
Anyway, once I have the average coordinates for where the LF, CF, and RF are stationed, the rest is simple. Simply construct either a regular or a logit regression in which the x and y (horizontal and vertical) distances from those points (where the fielder is normally stationed) are the independent variables, and the catch rate (for a bunch of balls, obviously) or a dummy variable for whether the ball was caught or not is the dependent variable. That is it!
Do you guys think it will work, and do you think it might be substantially better than using a zone system where each zone is independent of the others (no zone is informing any other zone), which is what I do now?
Which would be better to do: a logit regression where each data point is the distance, in x and y, from the average fielding position, along with whether the ball was caught or not (line drives and fly balls as separate regressions, of course), or a regular regression where each data point is a bunch of balls in the same general location on the field (the average distance, x and y, from the fielder’s normal position), with the dependent variable being the average catch rate for those balls in that location?
I think the logit version is a bit more rigorous since I am using each ball as a data point and I don’t have to combine a bunch of points to get a catch rate as the dependent variable, but I think that the results would essentially be the same. Actually the logit would be easier and cleaner.
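For the per-ball logit version, a rough sketch might look like the following. Several things here are assumptions rather than anything from UZR: the function names, the plain gradient-ascent fitting (a stats package would normally do this for you), the synthetic balls, and the convention that `dx` and `dy` are feet from the fielder's normal position. One more assumption deserves a flag: the sketch adds squared distance terms, because a logit with only straight x and y terms can only make the catch rate rise in one direction and fall in the other, and so cannot peak at the fielder's position.

```python
import math
import random


def features(dx, dy):
    # Work in hundreds of feet so the fitting steps stay well scaled.
    u, v = dx / 100.0, dy / 100.0
    return [1.0, u, v, u * u, v * v]


def predict(coefs, dx, dy):
    """Modeled catch probability for a ball landing (dx, dy) from the fielder."""
    z = sum(c * f for c, f in zip(coefs, features(dx, dy)))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit(balls, steps=600, rate=0.5):
    """Fit the logit by gradient ascent on the log-likelihood.

    balls -- list of (dx, dy, caught) with caught equal to 1 or 0
    """
    rows = [(features(dx, dy), caught) for dx, dy, caught in balls]
    coefs = [0.0] * 5
    for _ in range(steps):
        grad = [0.0] * 5
        for x, caught in rows:
            z = sum(c * f for c, f in zip(coefs, x))
            err = caught - 1.0 / (1.0 + math.exp(-z))
            for k in range(5):
                grad[k] += err * x[k]
        coefs = [c + rate * g / len(rows) for c, g in zip(coefs, grad)]
    return coefs


# Synthetic balls from a made-up "true" model, for illustration only.
rng = random.Random(1)
balls = []
for _ in range(800):
    dx, dy = rng.uniform(-150, 150), rng.uniform(-150, 150)
    u, v = dx / 100.0, dy / 100.0
    p = 1.0 / (1.0 + math.exp(-(2.0 - 4.0 * u * u - 3.0 * v * v)))
    balls.append((dx, dy, 1 if rng.random() < p else 0))

coefs = fit_logit(balls)
print(predict(coefs, 0, 0), predict(coefs, 100, 0))
```

Each ball is its own data point, so nothing has to be grouped into zones first. Line drives and fly balls would each get their own fit, as would each batter-handedness and power bucket.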
To any of you stats guys offering input: please assume that you are talking to a first-year STATS student. IOW, dumb down your comments and suggestions. It is a pet peeve of mine when someone asks for advice from experts and those experts give that advice such that a non-expert cannot understand it. I have experienced that so many times - I seek out an expert in order to improve or understand something I am working on, and I leave the discussion shaking my head…
Thanks!
MGL, I know that Dave Allen and some of the other saber gang do LOESS regression for the sort of problem you are talking about.
Now that has exhausted my knowledge on the subject, so perhaps someone can chime in who understands what LOESS does and how to use the results in the way you need.