THE BOOK--Playing The Percentages In Baseball


Thursday, April 08, 2010

Possible radical change in UZR methodology - need advice…

By , 12:01 AM

Most of you know the basic methodology for determining the league-wide out rates for each batted ball in UZR.  It is a zone-based system, where I arbitrarily create zones based on the “exact” locations of the batted balls (they are actually recorded as x/y coordinates on the field).  The zones are not too small (in order to increase sample size within each zone) and not too large (so that we don’t dilute the results too much, especially for small samples of player data).

Obviously one of the ways to get the best of both worlds is to have very small zones but have one zone inform the other. For example, say you have 5 small adjacent zones directly behind the normal position of the typical CF’er and the catch rates for those 5 zones are .20, .21, .18, .15, and .13.  Well, we can assume that the .21 is “wrong” (due to the relatively few # of balls hit and the different trajectories in that zone, as well as stringer error) and we can “smooth out” those zones, such that we make the out rates something like .20, .19, .18, .16, and .15, as long as the total out rate for all zones being “smoothed out” is the same before and after the smoothing.
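To make the "same total before and after" constraint concrete, here is a minimal sketch. It assumes each zone also has a ball count (the counts below are made up), smooths each zone with a ball-weighted average of itself and its immediate neighbours, and then rescales so the expected number of outs is unchanged. The function name and the choice of a three-zone window are illustrative assumptions, not how UZR works.

```python
def smooth_zone_rates(rates, balls):
    """Smooth adjacent zone out rates while keeping total expected outs fixed."""
    n = len(rates)
    smoothed = []
    for i in range(n):
        # Ball-weighted average of this zone and its immediate neighbours.
        lo, hi = max(0, i - 1), min(n, i + 2)
        outs = sum(r * b for r, b in zip(rates[lo:hi], balls[lo:hi]))
        smoothed.append(outs / sum(balls[lo:hi]))
    # Rescale so the total expected outs match the observed total.
    outs_before = sum(r * b for r, b in zip(rates, balls))
    outs_after = sum(r * b for r, b in zip(smoothed, balls))
    scale = outs_before / outs_after
    return [r * scale for r in smoothed]


rates = [0.20, 0.21, 0.18, 0.15, 0.13]   # catch rates from the example above
balls = [120, 90, 110, 100, 95]          # hypothetical ball counts per zone
smoothed = smooth_zone_rates(rates, balls)
```

With these made-up counts the .21 bump disappears and the rates decline steadily from the first zone to the last, while the overall out rate across the five zones stays where it was.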

I have wished for a long time that I did something like that instead of using the relatively large zones that I use.  Even with the large zones I use, I am sure I could also do some smoothing like I described above.

For reasons which I won’t go into, I have not done that or anything like it.

Of course, better than just manually smoothing the out rates in the zones would be to use some kind of a smoothing function or a best fit model which incorporates the location parameters.  Basically a best fit regression line I guess.

I think that the SAFE system does something like that, and I have always admired that kind of methodology (without knowing exactly what they do).

I didn’t feel like I had the expertise to do that, plus I didn’t really want to go back and start undertaking some large reconstruction of the UZR methodology since I think it works fairly well as it is.

Anyway, with all the publicity and exposure that UZR has been getting, I am feeling some pressure to improve the model.

Anyway, recently my idea was this (see below, after the jump):  I would like some input from our smart readers as to whether this is correct or not and whether you think it will improve UZR (which would be a guess on your part of course) enough to be worth the trouble.  It actually might not be too much trouble.

Oh, and I am thinking about this mostly for outfield (air) balls, but it could apply to ground balls as well.  I would like your opinion on that too.

First, look at the out rates for various coordinates on the field to try and determine the point where it is most likely that the fielder is positioned (on the average, but I will do this separately for various situations - namely R/L batter and the average power of the batter).  Is there a mathematical way to do this?  Construct a plot (like a heat map) and just eyeball the graph?  Is there a calculus or some other formula that can do this?
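One crude, eyeball-free way to do this is sketched below: bin the balls on a grid, compute the catch rate in every bin that has enough balls, and take the center of the bin with the highest rate. This is only a heuristic under stated assumptions; the bin size, the minimum ball count, and the function name are all made up for illustration.

```python
from collections import defaultdict


def estimate_start_position(balls, bin_size=10.0, min_balls=30):
    """balls: iterable of (x, y, caught). Returns the center of the bin
    with the highest catch rate, or None if no bin has enough balls."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [caught, total]
    for x, y, caught in balls:
        key = (int(x // bin_size), int(y // bin_size))
        counts[key][0] += 1 if caught else 0
        counts[key][1] += 1
    # Ignore sparse bins, whose rates are mostly noise.
    eligible = {k: c / t for k, (c, t) in counts.items() if t >= min_balls}
    if not eligible:
        return None
    bx, by = max(eligible, key=eligible.get)
    return ((bx + 0.5) * bin_size, (by + 0.5) * bin_size)
```

If several neighbouring bins all sit near a 100% catch rate, the single best bin is somewhat arbitrary, so averaging the centers of the top few bins might be more stable.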

Anyway, once I have the average coordinates for where the LF, CF, and RF are stationed, the rest is simple.  Simply construct either a regular or a logit regression in which the x and y (horizontal and vertical) distances from those points (where the fielder is normally stationed) are the independent variables, and the catch rate (for a bunch of balls obviously) or a dummy variable of whether the ball was caught or not is the dependent variable.  That is it!

Do you guys think it will work, and might it be substantially better than using a zone system where each zone is independent of the others (no zone is informing any other zone), which is what I do now?

Which would be better to do - a logit regression where each data point is the distance, in x and y, from the average fielding position, and whether the ball (line drives and fly balls as separate regressions of course) was caught or not, or a regular regression where each data point is a bunch of balls in the same general location on the field (the average distance, x and y, from the fielder’s normal position) with the dependent variable being the average catch rate for those balls in that location?

I think the logit version is a bit more rigorous since I am using each ball as a data point and I don’t have to combine a bunch of points to get a catch rate as the dependent variable, but I think that the results would essentially be the same.  Actually the logit would be easier and cleaner.
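For anyone who wants to see what the per-ball logit version looks like, here is a bare-bones sketch in plain Python. It fits P(caught) as a logistic function of the horizontal and vertical distances from the starting point using simple gradient ascent. The absolute-distance features, the learning rate, the synthetic data, and all function names are assumptions made for illustration; a real fit would use a statistics package.

```python
import math


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logit(rows, lr=0.1, epochs=2000):
    """rows: list of (features, caught) with caught in {0, 1}.
    Returns the weights, intercept first."""
    n_feat = len(rows[0][0])
    w = [0.0] * (n_feat + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_feat + 1)
        for feats, caught in rows:
            xs = [1.0] + list(feats)
            err = caught - sigmoid(sum(wi * xi for wi, xi in zip(w, xs)))
            for j, xj in enumerate(xs):
                grad[j] += err * xj
        # Step along the average gradient of the log-likelihood.
        w = [wi + lr * g / len(rows) for wi, g in zip(w, grad)]
    return w


def predict(w, feats):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, [1.0] + list(feats))))


# Synthetic balls: (|dx|, |dy|) in arbitrary units, caught when close enough.
rows = [((dx, dy), 1 if dx + dy <= 6 else 0)
        for dx in range(8) for dy in range(8)]
weights = fit_logit(rows)
```

Each ball is one row, so no binning is needed. The binned "regular regression" alternative would replace the 0/1 outcome with a catch rate per location.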

Any input from you stats guys, please assume that you are talking to a first year STATS student.  IOW, dumb down your comments and suggestions.  It is a pet peeve of mine when someone asks for advice from experts and those experts give that advice such that a non-expert cannot understand it. I have experienced that so many times - I seek out an expert in order to improve or understand something I am working on, and I leave the discussion shaking my head…

Thanks!


#1          (see all posts) 2010/04/08 (Thu) @ 01:33

MGL, I know that Dave Allen and some of the other saber gang do LOESS regression for the sort of problem you are talking about. 

Now that has exhausted my knowledge on the subject, so perhaps someone can chime in who understands what LOESS does and how to use the results in the way you need.


#2    Colin Wyers      (see all posts) 2010/04/08 (Thu) @ 02:05

I described using LOESS for a fielding metric in an article for BP:

http://baseballprospectus.com/article.php?articleid=9999


#3    John Walsh      (see all posts) 2010/04/08 (Thu) @ 03:40

Regarding the regression: I can think of some problems with a linear regression using x and y as you propose.  First, I’m not convinced at all that you expect a linear relationship between probability of catching the ball and the x and y variables.  E.g., we know it’s easier to come in on a ball than to go back on it.  This would not be captured in your x,y regression. 

I bet you’d do better using the total distance, d = sqrt(x^2+y^2) and the angle (call it phi) that the fielder has to move to catch the ball—i.e. phi=0 for a fielder moving towards the plate, phi=90 for a fielder moving to his left, 180 going back, 270 to his right, etc.  This will allow you to capture the differences in going in or out, left or right.
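A small sketch of that conversion, assuming a frame centered on the fielder's starting point in which +x is toward the fielder's left and +y is away from home plate (that frame and the function name are assumptions, not something the data is known to provide):

```python
import math


def to_polar(dx, dy):
    """Convert an offset from the fielder's starting point to (d, phi).

    Assumed frame: +dx is toward the fielder's left, +dy is away from the plate.
    phi is in degrees: 0 = coming in, 90 = to his left, 180 = going back,
    270 = to his right.
    """
    d = math.hypot(dx, dy)
    phi = math.degrees(math.atan2(dx, -dy)) % 360.0
    return d, phi
```

For example, `to_polar(0, -10)` is a ball 10 units in front of the fielder (phi = 0), and `to_polar(-10, 0)` is a ball 10 units to his right (phi = 270).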

However, even better than that would be a loess type regression—this doesn’t assume any particular linear (or other) dependence of the output on the input variables.  It basically, just smooths out the data points, which is what I think you want.  To do this, you could construct a heat-map style plot, where the data is binned in x and y and the contents of each bin is the out probability.  Then you perform the loess on the binned data.  The result will allow you to get the out probability for any given x,y value.  (x,y is as you defined it, the coordinates relative to the average position).

There might be other ways to do the loess smoothing, but this is how I would try it.


#4    MGL      (see all posts) 2010/04/08 (Thu) @ 03:51

Does a logit regression model have to be linear?  If yes, then what if I use a regular regression model with the x and y coordinates (horizontal and vertical distances from the starting point) as my ind. variables and the dep. variable is the out rate for that location (obviously I have to create bins to get the out rates for any given location)?  Then I can use whatever kind of curve results.  That sounds very similar to a loess if not the same thing.


#5    MGL      (see all posts) 2010/04/08 (Thu) @ 04:10

I don’t know that much about Loess, but it seems to me it is just another way of doing a best-fit curve where the data is such that no particular equation would apply.  And you can’t predict the dep. variable using a formula, right?  You have to plot your independents on the graph and then see where the dep. variable ends up, although I would have no idea how to do that with more than 1 ind. variable - I guess a computer program would do that for you, which is essentially the same as using a regression formula.  Is that right?

I would think that the probability of catching a ball as a function of starting point, using, maybe angle and distance as John suggests, should generate a curve that would be fine for a regular (linear or non-linear) regression model.  I would think that even a low order polynomial if not a linear relationship would apply. I don’t see how Loess is necessary, but as I said, I don’t know much about it. Isn’t it only necessary when you have data that is all over the place and the curve is constantly changing directions and what have you?

Colin, have you actually created your full defensive model?


#6    Depot      (see all posts) 2010/04/08 (Thu) @ 04:14

It doesn’t have to be linear, but I would caution against using a logit model anyway.  The more you interact your variables and put in different powers of those terms, the more it will approximate a non-parametric function.  But using logit automatically adds a lot of structure since it makes assumptions on the error term.

Anyway...I want to suggest coming at this from a different angle.  It seems like we always think of defensive (UZR-like) stats coming from the following: 1) divide up the field into zones (well-defined or more fluid); 2) put each ball into a zone.  Why not do something similar to this: take a fly ball with coordinates (x,y).  Now, draw a circle around those coordinates large enough (though, hopefully still very small) to contain 100 (or whatever you think is best) other fly balls.  Do the calculations using the fraction of balls in that circle that were caught.  I think I’m assuming that the distribution around the centered fly ball is even (am I?), or maybe you can re-weight some observations.

I guess, broadly, my suggestion is not to start with the zones and then put fly balls in those zones.  Start with the fly ball and then figure out what to do with it.  Does that make sense?  Obviously, I’m hoping other people add on to this, such as implicit assumptions.
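A brute-force sketch of the "start with the fly ball" idea follows. It sorts every ball by distance to the target and takes the catch rate of the nearest k. The function name is made up, and a ball sitting exactly at the target location is counted like any other, which is an assumption.

```python
import math


def knn_catch_rate(target, balls, k=100):
    """target: (x, y). balls: list of (x, y, caught).
    Returns the catch rate among the k balls nearest to target."""
    if not balls:
        raise ValueError("no balls to compare against")
    nearest = sorted(
        balls,
        key=lambda b: math.hypot(b[0] - target[0], b[1] - target[1]),
    )[:k]
    return sum(1 for _, _, caught in nearest if caught) / len(nearest)
```

This sorts the whole data set for every query, which is the slow part. A spatial index or a pre-sorted grid would cut that down, but the result would be the same.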


#7    MGL      (see all posts) 2010/04/08 (Thu) @ 05:05

"Start with the fly ball and then figure out what to do with it.  Does that make sense?”

Wow, that is interesting.  I never thought about that. That makes a lot of sense.  I’ll have to think about how to do that, computer-wise.

“I think I’m assuming that the distribution around the centered fly ball is even (am I?), or maybe you can re-weight some observations.”

What do you mean by that?


#8    MGL      (see all posts) 2010/04/08 (Thu) @ 06:52

While that (creating the zone around the ball) is a fantastic idea, I don’t know that I have the computing power for that. I am programming in DOS (I know, that’s horrible)…


#9    Brian Cartwright      (see all posts) 2010/04/08 (Thu) @ 06:59

"in DOS” as in a command line prompt running under Windows? That’s all I’ve ever used.


#10    Peter Jensen      (see all posts) 2010/04/08 (Thu) @ 08:48

MGL - What would you guess the error estimates to be for a single full year of a fielder’s UZR as it stands now?  How about the error per year based on 3 full years of UZR?  How much improvement do you think would occur if you had exact hit ball locations and hang times instead of the estimates that you have now?  How much sample size error is going to remain in a year’s worth of data even if you had exact hit ball locations and fielder locations and a “perfect” fielding metric?  Answer those questions and you should be able to estimate how much improvement you are going to get by smoothing the zones as you are proposing above.  I would be surprised if it amounts to more than a run or two less error per year.


#11          (see all posts) 2010/04/08 (Thu) @ 09:35

I think what he is saying with the distribution around the ball is…

...Imagine a fly ball is hit down the line in left field at Fenway.  There’s maybe 3 feet between the foul line and a giant wall that’s 10+ feet high.  Imagine this ball is dropping somewhere in between the foul line and the wall.

If you took the 100 nearest fly balls to that one, and looked at their probability of being caught, pretty much ALL of these 100 “comparables” would be inside fair territory, closer to where the fielder originally was stationed.

If fly balls are uniformly (or randomly) distributed around a given fly ball, then I think it makes sense that the fly ball in question will have a chance of being caught that’s about equal to the average of its 100 closest comparables.  But, in certain instances, like in Fenway’s Left Field example, the 100 closest comparables will pretty much ALL be EASIER to catch than the ball in question.


#12    J. Cross      (see all posts) 2010/04/08 (Thu) @ 09:46

I like the idea of using polar coordinates (angle and distance) instead of x and y here, but I’m guessing that the problem with combining them in a good old regression is that it’s the slope of catchability with respect to distance from the starting point that changes with the angle, and not really the intercept of catchability, so something like:

caught = function(angle)*function(distance)

makes more sense to me than something like

caught = f(angle) + f(distance)


#13    John Walsh      (see all posts) 2010/04/08 (Thu) @ 09:56

mgl/5

I’m not a stats person at all, but I think you’ve described Loess pretty well.  The resulting “best fit” is not expressible as a formula, but the software you use to do the Loess will provide the predicted values. 

It’s possible that a simple regression with polynomials in x, y (or d, phi) might describe the data and that a Loess is unnecessary.  You’d do well to verify that, though.

depot/6

That is an interesting idea, but I don’t see how it solves the smoothness problem.  The average out prob in your circle of 100 balls will have a binomial error of around 15% (assuming 50% out prob), which would lead to a very unsmooth distribution.

It would also seem very intensive computationally.  For every ball that you consider, you need to assume some circle size, run over _all_ the data to populate your circle.  If you don’t get the required 100 balls, you need to enlarge your circle and run over all the data again.  This for every ball you want to evaluate.  What you’d end up doing is mapping out an x,y grid and calculating your out probs with the circle method at each point.  But, that is equivalent, more or less, to binning the data, i.e. defining zones.
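For reference, the textbook binomial standard error is easy to check. This is a sketch under the usual independence assumption, and `binomial_se` is a name made up here. With 100 balls and a 50% out probability, one standard error comes to about 5 percentage points, so a figure of around 15% would correspond to roughly three standard errors (that reading is a guess, not something stated in the comment).

```python
import math


def binomial_se(p, n):
    """Standard error of an observed rate from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


se_100 = binomial_se(0.5, 100)    # about 0.05
se_400 = binomial_se(0.5, 400)    # about 0.025; four times the balls halves the error
```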


#14    Rally      (see all posts) 2010/04/08 (Thu) @ 10:31

I think this is a very cool idea, and I’m all for using the best method you can.  But I doubt you’ll find a huge difference in the results.  My guess is if you compare the results before and after the change you’ll find most players within a run of where they were, maybe one or two guys who change by 3 runs or so.

If you went in the other direction, used no zone at all other than who fielded the ball, you’d have TotalZone.  And for most players we’re still getting similar results.

I have a question for Peter or anyone who’s tried using the MLBAM hit location data:

How do you match that up that file with the play by play record?  The inning_hit.xml file gives you the batter, pitcher, and inning, and result.  Problem I see matching this up with a PBP file is sometimes you could have a lineup bat around against a pitcher.  How do you handle that?


#15          (see all posts) 2010/04/08 (Thu) @ 10:37

MGL -

I use a smoothing function - P’(x,y) = 0.5*(P(x,y) + (1/8)*(P(x+1,y+1) + P(x,y+1) + P(x,y-1) + ... + P(x-1,y-1)))

Essentially this says that the observer was 50% likely to have recorded the right location and 50% likely to have been one grid quadrant off.  I actually extend it to being two quadrants off.

Anyways, I used that when I made my own version of UZR and I use it for my NHL shooting charts.  It’s a technique that comes from image processing.
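A sketch of that one-quadrant smoothing on a grid of out rates, in plain Python. Interior cells get half their own value plus half the mean of their eight neighbours, as in the formula above. How to treat edge cells is not stated in the comment, so averaging over whichever neighbours exist is an assumption, as is the function name.

```python
def smooth_grid(p):
    """p: 2-D list of out rates. Returns a smoothed copy in which each cell is
    50% its own value and 50% the mean of its adjacent cells."""
    rows, cols = len(p), len(p[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            neigh = [p[a][b]
                     for a in range(max(0, i - 1), min(rows, i + 2))
                     for b in range(max(0, j - 1), min(cols, j + 2))
                     if (a, b) != (i, j)]
            if neigh:
                out[i][j] = 0.5 * p[i][j] + 0.5 * sum(neigh) / len(neigh)
            else:
                out[i][j] = p[i][j]  # a 1x1 grid has nothing to blend with
    return out
```

Extending this to two quadrants off would mean a 5x5 window with its own weights; those weights are not given in the comment, so they are left out here.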


#16          (see all posts) 2010/04/08 (Thu) @ 10:59

Rally/14, I enter the pbp info into the database first, then I come back and match up the info in the inning_hit.xml file second.

You can see all my parsing code here:
http://codepaste.net/gjbeyv

I query my database to find how many atbats exist for the given game, inning, hitter, and pitcher combination.

$find_ab_id_query = 'SELECT ab_id, hit_x, event FROM atbats WHERE (game_id = ' . $select_game_id . ' AND inning = ' . $hit_inning . ' AND batter = ' . $hit_batter . ' AND pitcher = ' . $hit_pitcher . ')';

Depending on how many matching atbats there are, I handle them differently.


#17    Peter Jensen      (see all posts) 2010/04/08 (Thu) @ 11:08

Rally - If I recall correctly there are about 20 to 30 double entries a year from batting around.  I haven’t found a way to correct them except by hand.


#18          (see all posts) 2010/04/08 (Thu) @ 11:16

I’m not a mathematician, but more of a philosopher (or dilettante) with a smattering of math understanding.  But here’s my proposal:

It seems to me that if we measure the amount of time each ball takes to either hit the ground or be caught, we can do away with the whole line drive/fly ball distinction in favor of a much more accurate measurement.  We can then incorporate this with the zones in the UZR, and factor in the historical value of a ball hit in such a zone which takes such a time to reach the ground.

Once we are able to assign such values, we can have an objective way to measure the defensive contribution of fielders in terms of run prevention.  I realize this would take quite a lot of work to set up and get a historical measurement, but I don’t see why it wouldn’t be possible given the vast amount of tape we have for pretty much every game for the last however many years.


#19    Colin Wyers      (see all posts) 2010/04/08 (Thu) @ 11:56

“Why not do something similar to this: take a fly ball with coordinates (x,y).  Now, draw a circle around those coordinates large enough (though, hopefully still very small) to contain 100 (or whatever you think is best) other fly balls.  Do the calculations using the fraction of balls in that circle that were caught.  I think I’m assuming that the distribution around the centered fly ball is even (am I?), or maybe you can re-weight some observations.”

That’s, essentially, what the LOESS regression does. (It actually does a least-squares fit based upon the nearby data points, weighting based upon the distance from the point in question.)
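To show what "a least-squares fit on nearby points, weighted by distance" means in practice, here is a one-dimensional sketch (say, catch outcome against distance from the fielder). It uses tricube weights and a local straight-line fit, which is the usual LOESS recipe, but it is a simplified illustration rather than what any particular package does. The function name and the `frac` default are assumptions.

```python
import math


def loess_point(x0, xs, ys, frac=0.5):
    """Estimate y at x0 with a weighted straight-line fit through the
    nearest frac of the points, using tricube distance weights."""
    n = len(xs)
    k = max(2, int(math.ceil(frac * n)))
    dists = sorted(abs(x - x0) for x in xs)
    h = dists[min(k, n) - 1] or 1e-12  # bandwidth: distance to k-th nearest point
    w = [(1.0 - min(1.0, abs(x - x0) / h) ** 3) ** 3 for x in xs]
    sw = sum(w)
    if sw == 0:
        return sum(ys) / n  # degenerate case: fall back to the plain mean
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)
```

Points beyond the bandwidth get zero weight, and points near x0 count the most, so there is no hard cutoff at "the 100th ball". With 0/1 catch outcomes the fitted value can stray slightly outside 0 to 1 and may need clipping.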


#20    dq      (see all posts) 2010/04/08 (Thu) @ 12:16

#13/#6

I like the comparable fly balls, but understand the problem with the 100 closest comps as you attempt to identify 100.

What if, instead of using the 100 closest balls, you use all balls within x distance (say 3 feet) of the batted ball? It could be 1 foot, it might be 5 feet; I’m making up the 3 feet.

So, for this batted ball A, all other balls within 3 feet of it are caught at a 63% rate, so this ball should be caught 63% of the time. There may be 600 balls in this sample - there might be 50 - If your sample is too small you obviously have to compensate for it.


#21    Colin Wyers      (see all posts) 2010/04/08 (Thu) @ 12:20

DQ,

Why reinvent the wheel?

Statistics already has a tool that does what we’re looking for here - and does it better than simply identifying 100 nearby balls or all balls within x, as it uses weighting rather than instituting an arbitrary cutoff.


#22    Brian Cartwright      (see all posts) 2010/04/08 (Thu) @ 12:20

Rally/14

I haven’t done it for the hit chart yet, high on my to-do, but I have done the procedure on another part of the Gameday records

1. do a subquery joining events with hits on game, inning, batter & pitcher, and as there may be more than one, find the minimum event number

2. then join events to the subquery using the returned game and min(eventnum)


#23    Brian Cartwright      (see all posts) 2010/04/08 (Thu) @ 12:33

One correction to above - as described it will always only return the first match of the inning. The subquery join must be an inequality, where hits.eventnum > events.eventnum. This will only return records greater than the current position in the events file, so that once the first matchup of the inning has been processed, the second would then have the minimum value, and so on.


#24          (see all posts) 2010/04/08 (Thu) @ 13:30

Has anyone considered cluster analysis to define zones? Once zones are defined, any individual ball can have its out probability calculated by the weighted average (based on the inverse squared distance from the cluster center) of the out probabilities of all the zones.
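The weighting step in that suggestion might look like the sketch below. The cluster centers and their out probabilities are taken as given (finding them is the cluster-analysis part and is not shown), and the function name is made up.

```python
def idw_out_probability(ball, zones):
    """ball: (x, y). zones: list of ((cx, cy), out_prob).
    Returns the average of the zone out probabilities, weighted by
    1 / squared distance from the ball to each zone center."""
    num = 0.0
    den = 0.0
    for (cx, cy), prob in zones:
        d2 = (ball[0] - cx) ** 2 + (ball[1] - cy) ** 2
        if d2 == 0:
            return prob  # ball sits exactly on a center
        weight = 1.0 / d2
        num += weight * prob
        den += weight
    return num / den
```

A ball halfway between two centers gets the plain average of their out rates, and a ball close to one center is dominated by that zone.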


#25    MGL      (see all posts) 2010/04/08 (Thu) @ 13:41

Great stuff guys!


#26    dan      (see all posts) 2010/04/08 (Thu) @ 13:46

Marc-

Max used cluster analysis here, but for a different purpose:

profpeppersassistant.blogspot.com/2009/02/refining-shift.html

Dan in Philly:

Tango has been trying to get stringers to do that for a long time. MGL had a project last year where he had people recording the hang time on every fly ball, but I’m not sure whatever happened to it.


#27    Tangotiger      (see all posts) 2010/04/08 (Thu) @ 14:13

Hawerchuk: maybe you can show us the Finley/Lofton charts you did.  Those are super cool.


#28    Guy      (see all posts) 2010/04/08 (Thu) @ 14:44

MGL: 
Have you ever looked to see if your ability to predict out% is improved by accounting for any of:
Count;
Times thru order; and/or
Hitter quality (or lineup position)?
All of them of course have a relationship with BABIP (e.g. leadoff hitters were .318 last year, #9 hitters just .273).  But I have no idea if they have any predictive power left once you account for batted ball type, location, and the other factors you consider.  Perhaps cumulatively they would add enough to be useful (a FB from Pujols, 3rd time up, 3-1 count has to have a lower out% than the average ball in the same zone).


#29    Tangotiger      (see all posts) 2010/04/08 (Thu) @ 15:24

Guy,

If the average “times through the order” that Adrian Beltre sees as a 3B is the same as the league average 3B, then it won’t matter.  There’s no bias there.

You’d have to focus on areas of bias.  Hitter quality might be one.

The count wouldn’t be.  The range of BABIP from 0-2 to 3-0 is something like .280 to .320.  That would mean that the BABIP level of the count will be somewhere around .298 to .302, for every player.  You just won’t find anything close to lopsided hitter’s or pitcher’s count for a player with more than 20 games played.

That said, it’s a good enough idea, and a simple enough correction to include.


#30    Guy      (see all posts) 2010/04/08 (Thu) @ 15:33

Tango:
Yes, they would pretty well average out for any FT player across all BIP while he’s on the field.  But couldn’t the balls hit into Beltre’s zones, in a single season, be skewed by count, hitter quality, and/or times thru order? Or do you think it would largely be a wash?


#31    Depot      (see all posts) 2010/04/08 (Thu) @ 15:37

I realize my suggestion is a programming nightmare, but maybe some people have suggestions on how to efficiently do it?

I should say that I actually like just using zones.  Local regression should be fine too.  Any refinement after these 2 methods is just nitpicking, but I think the point of this thread is to nitpick so…

My only problem with local regression is that it’s very “black box.” I don’t think it’s equivalent to the idea I’m suggesting, but maybe Colin can jump in and correct me here.  A local regression should generate a “grid” and perform a regression at each point on the grid (weighting closer observation more).  It then “fills in” the points in between each grid point.  This is equivalent to my idea _only_ if your grid includes every single data point.  Is that right? 

Now, should that make a big difference?  Unlikely.  While I’m usually fine with “black box” methods and this is pretty hypocritical of me to say, in this case, I do like the idea of a clear definition - we took the 100 closest fly balls and generated the results from that.  Instead of, the local regression method applied some weights to a set of fly balls, but we don’t have a clear definition of those weights for each case.  You have to make some arbitrary choices with local regression and I’d prefer them to be explicit.  Making an include-the-closest-X-fly-balls cutoff seems like a nice rule.  But I don’t have strong feelings about this.


#32    Rally      (see all posts) 2010/04/08 (Thu) @ 15:44

I use the count that a ball was put into play for TotalZone.  Doing so didn’t make much of a difference, some players moved up or down by a run.  But once it was in there, no reason to take it out.


#33          (see all posts) 2010/04/08 (Thu) @ 15:47

Depot/31, I’m with Colin here.  I don’t understand why you keep wanting to invent something new and difficult when LOESS regression is available and easy to use and does it better.

LOESS is more!


#34    Tangotiger      (see all posts) 2010/04/08 (Thu) @ 16:31

"But couldn’t the balls hit into Beltre’s zones, in a single season, be skewed by count, hitter quality, and/or times thru order? Or do you think it would largely be a wash? “

Just guessing, but I’ll say that the BABIP would be between .299 and .301 by count for every (no exceptions) fielder with at least 400 BIP.

For times through the order, I’ll say that’s as low as 100 BIP to get BABIP of .299 to .301.

For hitter quality, that one I would do, and that won’t cancel out for a season.


#35    Greg Rybarczyk      (see all posts) 2010/04/08 (Thu) @ 16:34

I was going to say something about hang times (my usual: they are vitally important), but then it occurred to me that although with any zone-based system you have some difficulties (the size of zone tradeoff, for example), you can definitely improve your zones, without even worrying about smoothing, by changing the shape of the zones.

The shape of the zones should be concentric sectors, centered on the average starting positions of the outfielders.  When the circular sectors reach each other and begin to interfere, make zones there that fill in the gaps.

Now, when an outfielder positions himself far from the average starting position, the sectors will look strange, so you might need to make two sets of zones, one for LH hitters, one for RH hitters.  Perhaps 4 sets, with another fold in the model for power hitters vs. non-power hitters.

Sector-based zones centered on home plate make sense for hitting analysis, but not for fielding analysis.

Obviously, the scenario that makes all of this unnecessary is when we have hang time, initial fielder position and exact landing point data.  When that happens, zones become unnecessary.  (sighs)
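The fielder-centered zones described in this comment could be indexed along these lines. This is a sketch only: the ring width, the number of sectors, and the function name are assumptions, and the handling of the gaps where neighbouring fielders' rings meet is not shown.

```python
import math


def sector_zone(ball, fielder, ring_width=15.0, n_sectors=8):
    """Return (ring, sector) for a ball relative to a fielder's average
    starting position. ring counts concentric bands outward from the
    fielder; sector counts equal angular slices around him."""
    dx = ball[0] - fielder[0]
    dy = ball[1] - fielder[1]
    ring = int(math.hypot(dx, dy) // ring_width)
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    sector = int(angle // (360.0 / n_sectors)) % n_sectors
    return ring, sector
```

Separate starting positions for LH and RH hitters (or power and non-power hitters) would just mean calling this with a different `fielder` point for each group.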


#36    Depot      (see all posts) 2010/04/08 (Thu) @ 16:45

Again, I don’t have strong feelings about this.  I use local regression methods all the time, but I do have a bias against them because they’re so “black box.” If I can create a simple easy-to-use-and-understand method/rule instead, I always go with that.  Local regression isn’t magic, but I think people just like to say, “This is totally non-parametric” when it isn’t really.  But, I’m sure it does a great job in this case. 

I guess the main distinction is (1) setting the zones/grid up initially vs. (2) looking at a specific fly ball and creating a zone from that point.  I think Colin is (at least implicitly) setting up a grid and then getting estimates for each point.  This is similar to setting up zones (like MGL) and then getting estimates for each zone.  In that respect, I’m lumping Colin and MGL together. 

I’m suggesting that for each data point (a fly ball), you do some type of local regression or zone centered around that point.  That’s the philosophical difference.  If I want to know the Pr(out) of a fly ball, why start with an arbitrary zone or grid when I can center everything around that very fly ball?  Should that make a difference in reality?  I would actually guess it doesn’t _at all_.  So, I’m not really disagreeing with anyone.  I like Colin’s idea.  I like MGL’s zones.  But if MGL is concerned that the zones aren’t ideal and there’s some concern that the local regression methods are suspect, then it might be worth starting from the idea I proposed and seeing what happens.  If not, that’s cool too.


#37    MGL      (see all posts) 2010/04/08 (Thu) @ 18:01

I like Depot’s method at least as a starting point, other than it being a programming nightmare, probably prohibitive for me (and for Fangraphs that has to update UZR every day, I think).

I think that many of these methods are very similar to one another.  Even a zone based system is similar to Loess, the difference being that Loess distinguishes points within a zone and smooths everything out in order to “connect” the zones.  And Loess is really just an alternative to a (non-linear usually) regression and not necessarily any better, depending on the plot. 

In general, I am not a huge fan of any regression that has more than one indy, since you can’t really visualize or plot the graph (I am not good at “seeing” a 3 or more dimensional graph).

As far as controlling for hitter quality, I already do that (and other controls like count, etc.) to some extent by using parameters such as the speed of the ball (a line drive or ground ball by Pujols is more likely to be classified as “hard” than one by Eckstein).  And this year, I added the power of the batter (as a proxy for the depth of the OF) and speed of the batter (as a proxy for the depth of the IF and how quickly they have to get the ball to first).

So I don’t think much would be added by adding parameters.  And of course the more parameters you add the more uncertainty you add to your result, regardless of which method you are using to compile the results.  With a zone-based system, it is especially dangerous to add parameters, particularly if you simply create more buckets.  (There are 2 ways to add a parameter to a zone based system:  One is simply to double the number of buckets for every parameter you add.  Another way is to leave the buckets alone and then apply an adjustment to all buckets. The second way is preferable if you think that the adjustment applies equally or in some constant fashion to every bucket.  For things that affect positioning of the fielders, it is usually best to use more buckets.  For things that affect catch rate uniformly (or close to it), like a park effect on ground balls based on the speed of the IF, it is best to just adjust everything (even though it is certainly possible that speed of the IF affects infielders disproportionately).)

I have another question. When using logit regression, in this case, the dummy variable (0 or 1) would be an out or not, does the relationship between the indy’s and the dep. have to be linear?  Or can you decide what kind of curve you want as with regular regression?  How do you know in logit if your equation is a good fit (R^2?), and if logit only uses linear associations, how do you know if your association is linear in the first place?  For example, if I used angle and distance as my indy’s and out or no out as my dep. for a logit regression, how would I know if it is a good model?  With regular regression, with one indy, I just look at the plot - with more than one indy for regular regression, I don’t know how to determine what the best-fit line looks like other than testing a whole bunch of lines and comparing R^2…
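One simple way to check a 0/1 model without relying on R^2 is a calibration table: group the balls by predicted out probability and compare the average prediction in each group with the share actually caught. The sketch below assumes you already have a predicted probability for every ball from whatever model was fitted; the function name and bin count are made up.

```python
def calibration_table(preds, outcomes, n_bins=5):
    """preds: predicted out probabilities in [0, 1]. outcomes: 0/1 results.
    Returns a list of (mean prediction, observed out rate, ball count)
    for each non-empty probability bin."""
    bins = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of preds, outs, count]
    for p, y in zip(preds, outcomes):
        i = min(int(p * n_bins), n_bins - 1)
        bins[i][0] += p
        bins[i][1] += y
        bins[i][2] += 1
    return [(s / c, o / c, c) for s, o, c in bins if c]
```

If the first two numbers in each row are close, the model's probabilities can be taken at face value. If the gaps follow a pattern (say, too high on long runs back), that hints the relationship is not linear in the chosen variables and that squared or interaction terms might help.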


#38          (see all posts) 2010/04/08 (Thu) @ 19:35

Looking at the LOESS methodology, I wonder if we run into the “effective sample size” problem that can plague us when we start trying to define more complicated curves.

One of the reasons why linear models (or non-linear approximations of a linear model) keep being used as a fallback is that one of their bugs is also a feature: with a simplistic curve you tend to avoid over-fitting your sample data.

I’m not saying this necessarily would be a problem with this method, but I’d sure be on the lookout for it. Also it would seem this method would make it harder for future problems to be remedied, but on that I’m not experienced in the method so I’m probably wrong.

Also I’m assuming there are LOESS packages that can handle multiple variables and since Mickey only has three (X, Y, trajectory) here he should be okay. But I would think this method might become more and more of a problem as more variables get added.


#39    Ryan J. Parker      (see all posts) 2010/04/08 (Thu) @ 20:56

You definitely want to share information in some way between zones. SAFE takes things a step further and shares information between players, which is another benefit of the model (basically handling regression to the mean for you), but that’s sort of a secondary goal I’d think.

I’m still learning spatial statistics myself, but perhaps you can figure out a way to model the data in a similar way as SAFE does by using some measure of distance, direction, and/or other predictors available to you. Logit or probit would work fine, and I believe that probit was chosen for computational reasons in the SAFE work. I’ve got no baseball intuition, so I hope this makes sense.


#40    Sal Paradise      (see all posts) 2010/04/08 (Thu) @ 23:18

There’s a great page on LOESS, what it does, and how to do it in Excel (linked to my name)

It also has the code for VBA to do it, which can probably be adapted to whatever language you’re using so long as you write any functions that are missing.


#41    Ryan J. Parker      (see all posts) 2010/04/09 (Fri) @ 00:39

"When using logit regression, in this case, the dummy variable (0 or 1) would be an out or not, does the relationship between the indy’s and the dep. have to be linear?  Or can you decide what kind of curve you want as with regular regression? How do you know in logit if your equation is a good fit (R^2?), and if logit only used linear associations, how do you know if your association is linear in the first place?  For example, if I used angle and distance as my indy’s and out or no out as my dep. for a logit regression, how would I know if it is a good model?  With regular regression, with one indy, I just look at the plot - with more than one indy for regular regression, I don’t know how to determine what the best-fit line looks like other than testing a whole bunch of lines and comparing R^2…”

To help answer some of your questions…

Logistic regression is a generalized linear model, where it is linear in terms of the log odds: log(Pr(out)/Pr(no out)) = (linear terms). If you want something nonlinear then you can use dummy variables (say for each zone). Ultimately you can do a drop in deviance test to see what predictors to include in the model. There are similar ideas for goodness of fit tests. See https://home.comcast.net/~lthompson221/Splusdiscrete2.pdf for some ideas on how to do this (an R/S handbook for Agresti’s Intro to CDA book).

You can also perform diagnostics using graphs. For example, you can plot sample proportions for various predictors against what the model fit says to see if it makes sense or perhaps an assumption (i.e. linearity in a continuous predictor) is causing problems with the model.
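
As a rough illustration of the drop in deviance idea, the sketch below fits a one-predictor logit by Newton-Raphson in plain Python, with and without a distance term, and compares the two deviances. The data are invented (outs out of 10 balls at each of six distances), `fit_logit` is a hypothetical helper written for this example, and a real analysis would use a statistics package rather than this hand-rolled fit.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(xs, ys, use_slope=True, iters=25):
    """Fit log(p/(1-p)) = a + b*x by Newton-Raphson; returns (a, b, deviance)."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0   # gradient and information matrix
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        if use_slope:
            det = h00 * h11 - h01 * h01
            a, b = (a + (h11 * g0 - h01 * g1) / det,
                    b + (h00 * g1 - h01 * g0) / det)
        else:
            a += g0 / h00                 # intercept-only model, b stays 0
    log_lik = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(a + b * x)
        log_lik += math.log(p) if y else math.log(1.0 - p)
    return a, b, -2.0 * log_lik

# Hypothetical data: outs out of 10 balls at each distance (in tens of feet).
outs_by_distance = {1: 9, 2: 8, 3: 6, 4: 4, 5: 2, 6: 1}
xs, ys = [], []
for d, k in outs_by_distance.items():
    xs += [d] * 10
    ys += [1] * k + [0] * (10 - k)

a0, _, dev0 = fit_logit(xs, ys, use_slope=False)  # intercept only
a1, b1, dev1 = fit_logit(xs, ys)                  # intercept + distance
drop = dev0 - dev1  # compare with chi-square on 1 df (3.84 is the 5% cutoff)
print("drop in deviance:", round(drop, 2))

# Graphical-style diagnostic as text: sample proportion vs. fitted probability.
for d, k in outs_by_distance.items():
    print(d, k / 10, round(sigmoid(a1 + b1 * d), 3))
```

If the fitted probabilities track the sample proportions at every distance, linearity in the log odds is probably a reasonable assumption for that predictor; a systematic gap suggests adding a transformed term or dummy variables.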


#42    John Beamer      (see all posts) 2010/04/09 (Fri) @ 02:26

MGL

I think it is fair to say in the logit model there is no simple measure of fit like R^2. There are many goodness of fit tests that can give you a similar reading in terms of how well the model fits the data, but they are harder to interpret. A likelihood ratio is what you use - typically the log of the likelihood ratio, which (times -2) effectively gives you a chi squared statistic.

****
“If you want something nonlinear then you can use dummy variables (say for each zone).”

As in a GLM, you can have exponents or logarithms of the independent variables in a logit.


#43    dcj      (see all posts) 2010/04/09 (Fri) @ 03:02

I don’t know much about regression, but I want to highlight the point Hawerchuk made. There are two reasons to use smoothing. First, even if we know the exact location the ball landed at, we don’t know the “true” league-average out rate for that location. Second, we don’t know the exact location; all we have is the stringer’s estimate.

--

For example, if I used angle and distance as my indy’s and out or no out as my dep. for a logit regression, how would I know if it is a good model?  With regular regression, with one indy, I just look at the plot - with more than one indy for regular regression, I don’t know how to determine what the best-fit line looks like other than testing a whole bunch of lines and comparing R^2…

Something I’ve done in the past is set up a kind of stop-motion movie. To make things simpler let’s say there are two independent variables x,y both of which go from 0 to 100. The ordinary zone system might make a 10x10 grid and calculate the average probability of an out within each little square. Now I have some sort of regression function f(x,y) and I want to see how well it fits the data. First I fix y=5 and plot f(x,5) for x from 0 to 100. On the same graph I plot 10 points representing the zone averages for the little squares at x=5,15,25,...,95. Now I can look at how well the regression curve matches the 10 data points. Next I press enter and the program shows me the same plot for the next row of zones, that is, y=15. In this way I get to look at each cross-section of the 3D graph.
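
A text-only sketch of that cross-section check might look like the following. Everything here is assumed for illustration: the surface `f` is an invented stand-in for a fitted regression function, the batted balls are simulated from it, and the "press enter" step is replaced by building all ten frames at once.

```python
import math
import random

GRID = 10            # 10x10 zones over a 100x100 field
CELL = 100 / GRID

def f(x, y):
    """Stand-in for a fitted regression surface (hypothetical)."""
    return math.exp(-((x - 50) ** 2 + (y - 50) ** 2) / 1800.0)

# Simulate batted balls whose true catch probability happens to be f itself.
random.seed(0)
outs = [[0] * GRID for _ in range(GRID)]
balls = [[0] * GRID for _ in range(GRID)]
for _ in range(20000):
    x, y = random.uniform(0, 100), random.uniform(0, 100)
    col = min(int(x // CELL), GRID - 1)
    row = min(int(y // CELL), GRID - 1)
    balls[row][col] += 1
    outs[row][col] += random.random() < f(x, y)

def cross_section(row):
    """One frame of the movie: (x, zone average, f(x, y)) along a fixed y."""
    y = (row + 0.5) * CELL
    frame = []
    for col in range(GRID):
        x = (col + 0.5) * CELL
        n = balls[row][col]
        zone_avg = outs[row][col] / n if n else None
        frame.append((x, zone_avg, f(x, y)))
    return frame

frames = [cross_section(row) for row in range(GRID)]
gaps = [abs(avg - fit) for frame in frames
        for _, avg, fit in frame if avg is not None]
mean_gap = sum(gaps) / len(gaps)

# Show the y=55 frame as text; a plotting library could draw it instead.
for x, avg, fit in frames[5]:
    if avg is not None:
        print(f"x={x:4.0f}  zone avg={avg:.3f}  f={fit:.3f}")
```

Each frame pairs the ten zone averages at x=5,15,...,95 with the curve at the same points, so a row where the curve consistently sits above or below the points is easy to spot.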

--

I like Greg’s idea of using distance and angle from the fielder’s starting position. If I remember right, that’s what SAFE does. But then there is the problem of estimating the starting position. Not sure how to do that…

--

On a completely different note, I always forget how UZR deals with balls where more than one fielder has a possible play. If it is:

P(caught by CF) = 0.2
P(caught by LF) = 0.3
P(hit) = 0.5

and the CF catches it, do you give the CF +0.5 plays, or do you give the CF +0.8 plays and the LF -0.3 plays?


#44    MGL      (see all posts) 2010/04/09 (Fri) @ 06:06

"Something I’ve done in the past is set up a kind of stop-motion movie. To make things simpler let’s say there are two independent variables x,y both of which go from 0 to 100. The ordinary zone system might make a 10x10 grid and calculate the average probability of an out within each little square. Now I have some sort of regression function f(x,y) and I want to see how well it fits the data. First I fix y=5 and plot f(x,5) for x from 0 to 100. On the same graph I plot 10 points representing the zone averages for the little squares at x=5,15,25,...,95. Now I can look at how well the regression curve matches the 10 data points. Next I press enter and the program shows me the same plot for the next row of zones, that is, y=15. In this way I get to look at each cross-section of the 3D graph.”

I like that!

“I like Greg’s idea of using distance and angle from the fielder’s starting position. If I remember right, that’s what SAFE does. But then there is the problem of estimating the starting position. Not sure how to do that…”

Shouldn’t be too hard.  At the very least you could start with a large zone where the most balls are caught, assume that the starting point is the center of that zone, generate your function, and see what exact point gives you the highest catch rate (the maximum of the curve). Then use that point and start over again by generating a new curve/function.  Do that a few times and voila!

“On a completely different note, I always forget how UZR deals with balls where more than one fielder has a possible play.”

A fielder never gets penalized for a ball that someone else catches, and a fielder gets credit for the difference between 1 and the catch rate in that zone (all fielders combined).  So the CF gets .5 credit if he catches a ball (and the LF gets nothing) and the LF gets .5 if he catches a ball.

For hits, there is a .5 debit split between the LF and CF, with the CF getting 40% of that debit and the LF getting 60%, since that is their portion of “responsibility” for that zone, based on the .2 and .3 catch rates.  That method minimizes the effect of ball hogging on easy plays.  On an easy play, no one gets much credit regardless of who catches it and regardless of the proportion of catches for the whole league. For example, say that in a certain zone or bucket 90% of the fly balls are caught, with the CF catching 70% and the RF catching 20%. If on one team the CF likes to hog balls and he catches 80% rather than 70%, it won’t make that much difference.  If there are 10 balls per season in that zone and 9 of them get caught, the league average CF’er will catch 7 and get .7 credit total (7*.1), and the RF will catch 2 and get .2 credit. For the one ball that drops, the CF will get docked .9*.778 or .7 and the RF will get docked .9*.222 or .2, so the totals will be CF gets 0 and RF gets 0, of course.  For the team where the CF is a ball hog, he catches 8 and gets a total of .8 credit, the RF catches only 1 and gets .1 credit, and for the 1 ball that drops in, they each get the same debit as on the league average team, -.7 for the CF and -.2 for the RF.  So their totals are .1 and -.1.  So the ball hogging is not that big a deal.
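
The arithmetic above can be checked with a short Python sketch. `zone_ledger` is a hypothetical helper written for this example, not UZR code; it only encodes the two rules stated here (credit of 1 minus the zone out rate for the fielder who makes the catch, and the zone out rate debited on a hit, split in proportion to each fielder's league catch rate).

```python
def zone_ledger(league_rates, catches, hits):
    """Net plays per fielder in one shared zone.

    league_rates: fielder -> league-wide share of balls in the zone he catches
    catches:      fielder -> balls this team's fielder actually caught
    hits:         number of balls in the zone that dropped in
    """
    out_rate = sum(league_rates.values())
    credit_per_catch = 1.0 - out_rate
    net = {}
    for fielder, rate in league_rates.items():
        credit = catches.get(fielder, 0) * credit_per_catch
        # Each hit costs the fielders out_rate in total, split by responsibility.
        debit = hits * out_rate * (rate / out_rate)
        net[fielder] = credit - debit
    return net

# The LF/CF question: one ball caught by the CF, then one ball that drops.
shared = {"CF": 0.2, "LF": 0.3}
print(zone_ledger(shared, {"CF": 1}, hits=0))   # CF about +.5, LF 0
print(zone_ledger(shared, {}, hits=1))          # CF about -.2, LF about -.3

# The ball-hog example: 10 balls, 9 caught, 1 drops in.
league = {"CF": 0.7, "RF": 0.2}
average_team = zone_ledger(league, {"CF": 7, "RF": 2}, hits=1)
ball_hog_team = zone_ledger(league, {"CF": 8, "RF": 1}, hits=1)
print(average_team)    # both about 0
print(ball_hog_team)   # CF about +.1, RF about -.1
```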


#45    MGL      (see all posts) 2010/04/09 (Fri) @ 06:12

I should say actually that we want the CF to get more credit in that scenario because we don’t know that he catches 8 and the RF catches 1 only because he is ball hogging.  We have to assume that the CF is a little better than average and the RF is a little worse, hence the .1 and -.1.

Suppose we always assumed that when the total number of balls caught in a zone shared by two or more fielders equals the league average catch rate in that zone (even if each fielder’s individual catch rate is NOT equal to the league average for his position), both fielders are average in that zone and one of them is just ball hogging. Then we would simply assign everyone a net of zero in that zone. We don’t really want to do that; otherwise we would never recognize when one fielder is good and an adjacent fielder is bad in a particular zone whenever the total number of balls caught in that zone was average (or better).

At least that is how I see it.


#46    Tangotiger      (see all posts) 2010/04/09 (Fri) @ 07:31

We resolved the issue of ball-hogging a couple of years ago in this blog.  It sounds like the way MGL described it.  We had a lengthy discussion with scenarios laid out.  I’ll see if I can find it.


#47    Tangotiger      (see all posts) 2010/04/09 (Fri) @ 07:34

This wasn’t the one, but it’s good.  Post 35:

http://www.insidethebook.com/ee/index.php/site/comments/the_best_defenders_according_to_uzr_of_the_decade/#35


#48    Tangotiger      (see all posts) 2010/04/09 (Fri) @ 07:37

This was the one I was remembering:

http://www.insidethebook.com/ee/index.php/site/comments/the_fielding_system_approach_ive_been_preaching/#41

Posts 41, 43, 51.


#49    dq      (see all posts) 2010/04/09 (Fri) @ 08:14

48 Tango let’s say we have a ball that is

ss 40%
2b 40%
cf 6%
no one 14%

This team has a great ss, and poor 2b and cf

so for 100 plays

the ss gets 55
the 2b gets 35
the cf gets 4
and no one gets 6

If you get .14 credit for making the play, and get charged your % for missed plays,

then the ss is 55 * .14 - .4 * 6 = 5.3
the 2b is 2.5
and the cf is .2

A total of + 8, as 94 plays are made instead of 86

So, the team is above average due to a great ss,
but the 2b and cf get credit for being above average even though they are below average.
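
For anyone who wants to replay these numbers, here is a small Python sketch of the season-total version of the calculation. `net_plays` is a hypothetical helper for this example; it assumes each play made earns 1 minus the zone out rate and each missed ball is charged to every fielder at his league catch share.

```python
def net_plays(league_rates, made, chances):
    """Season totals for one shared bucket: credits minus debits per fielder."""
    out_rate = sum(league_rates.values())
    missed = chances - sum(made.values())
    return {
        fielder: made[fielder] * (1.0 - out_rate) - missed * rate
        for fielder, rate in league_rates.items()
    }

league = {"ss": 0.40, "2b": 0.40, "cf": 0.06}   # 14% fall for hits
made = {"ss": 55, "2b": 35, "cf": 4}            # 94 plays made in 100 chances
net = net_plays(league, made, chances=100)
total = sum(net.values())

for fielder, value in net.items():
    print(fielder, round(value, 1))
print("team", round(total, 1))
```

The team total always equals plays made minus league-expected plays (here 94 - 86), so the scheme conserves the team result; what it cannot do is tell which fielder earned the surplus.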

What am I missing?

Thanks


#50    Tangotiger      (see all posts) 2010/04/09 (Fri) @ 09:51

dq: you aren’t missing anything.  That’s just the cost of doing it this way.

In fact, you don’t even know that they are below average. 

Imagine a SS that gets to 86 balls and the other two get nothing.  The 2B could be above, below, or average.  We don’t know, because this SS is so good.

We’re just trying to do our best with the data.


#51    Guy      (see all posts) 2010/04/09 (Fri) @ 10:05

dq:  the problem with your example is that you’re stipulating we “know” who’s good and who’s not.  If we knew that, we could assign better values.  But we don’t really know how much of the SS’s 55 were because more than 40% of the balls in this zone were in “his” part of the zone, how much because he’s a ballhog, and how much because he really got balls the other two guys couldn’t reach.  So, knowing that the total result was positive, we have to give some of the credit to the other two guys. 

And remember, most plays do not share this much responsibility.  So we’ll have lots of other data that will tell us how weak the 2B and CF are.


#52    Tangotiger      (see all posts) 2010/04/09 (Fri) @ 10:15

Guy: right.  What you COULD do is only calculate UZR on the non-shared zones.  Then, when you look at the shared zones, the split in responsibility changes based on their UZR in non-shared zones.

But, that is one heckuva lot of work, considering that EVERY zone is shared to some extent.  And two, the payoff is so low: you are trying to get 1 or 2 runs better when everyone here and in the mainstream is NOT going to believe that Beltre being +15 runs is necessarily “better” than Rolen at +12 runs.

Furthermore, since we only look at UZR over 2-4 year periods anyway, and we don’t have a Trammell/Whitaker situation to contend with, then it’s much safer to presume that the “other” players in the shared zone are fairly normal.

There’s accuracy, and there’s precision.  I worry about the accuracy, and I’ll leave the fighting over the precision to others.


#53    Tangotiger      (see all posts) 2010/04/09 (Fri) @ 10:17

That’s not to say that we shouldn’t have the discussion.  I just meant that we can discuss it for our edification, but not argue (too much) over whether we need to implement it.


#54    dq      (see all posts) 2010/04/09 (Fri) @ 10:54

52/

if the ss makes 70 plays and the other 2 make 0 the ss becomes below average even though he is making 30 plays more than the normal ss

51/ My problem isn’t that I know who is good - I am trying to determine if the system works when you have a good fielder on a bad team and vice versa.

I’m also assuming you can get the zones right - that the measurements you get are accurate, otherwise you have bigger issues.

You have 2 items here - how to measure the zones, and then what to do with them once they are measured.

I’m trying to determine what happens to the results once you get them.


#55    Guy      (see all posts) 2010/04/09 (Fri) @ 11:10

"I am trying to determine if the system works when you have a good fielder on a bad team and vice versa.”

“Works” compared to what?  It’s not perfect, and it will give too much credit to weak fielders playing alongside good fielders.  The question is whether you improve the metric by debiting the other 2 fielders when their teammate makes the play (that’s your alternative).  In that case you’re punishing fielders for plays on which an out is made, and which may very well have been discretionary.  So you have problems either way. 

This is the one big advantage WOWY has over the pbp systems:  it “knows” how many outs a player should typically make.  So in your 70/0/0 scenario, WOWY will love the SS and pummel the other 2 fielders.  That’s a big advantage for WOWY (but against that, WOWY counts a FB to RF as an “opportunity” for the SS).


#56    Brian Cartwright      (see all posts) 2010/04/09 (Fri) @ 12:12

Guy/55 - I’m not clear on your distinction between WOWY and pbp. I always thought I used WOWY on my pbp data. Sum everyone in each bin, then aggregate by player, finally everyone minus a given player equals everyone else.

Yes, it ‘knows’ because you establish a baseline of expected performance, but that is dependent on how you model the data. I can choose zones (even just big ones like LF, CF & RF) to bin the data into, or as PMR does, use xy location and everyone has a non-zero chance of making a play on the ball.


#57    Tangotiger      (see all posts) 2010/04/09 (Fri) @ 12:17

For WOWY, I also track whether it’s a GB, FB, LD, Pop, so it would be that a GB to 1B is counted as an opp to the SS.

But, I keep the WOWY separate (control for pitcher identity/batter hand, control for batter identity/pitcher hand, control for park / bat+pit hand, control for bat+pit hand / batted ball code).  I keep all of them separate, as I really only care to look at careers.

As it turns out, the batted ball code really doesn’t help me much.  I rarely get crazy different numbers.  That’s because a GB hitter is a GB hitter, and a bunch of them are not going to start hitting FB when Jeter is at SS.

***

Anyway, in the case of the SS getting 70 outs, while the other two get 0, such that the total is below the 84 that an average SS+2B+CF would get, then you can do what I said earlier: look at the non-shared zones, and then portion out the shared zones based on that.

So, if a SS is getting 70 outs in a shared zone while the 2B is getting 0, he is also probably getting to above-average outs in the non-shared zones, and the 2B is getting to below-average in the non-shared zones.

You use that information to infer what you might expect when the two have a shared zone.  So, an average 2B with a great SS might get 10 outs, and an average SS with a bad 2B might get 50 outs.

So, to the great SS, his baseline is the “average SS and bad 2B”, of 50 outs.  If he actually got 70, then he’s +20.  And to the bad 2B, his baseline is 10 outs and if he got 0, then he’s -10.  The two together are +10.

Then you have to reconcile that further, because the two together actually got 70 outs when the average was 84.  There’s a further 14 outs of shortfall to account for.

It gets complicated, but that’s the idea behind how to use the non-shared zones to infer the shared zones.


#58    Tangotiger      (see all posts) 2010/04/09 (Fri) @ 12:20

Brian, this is WOWY:

http://www.tangotiger.net/catchers.html

If that’s what you do, then you are doing WOWY.  If it’s not, then you are doing something else.


#59    Brian Cartwright      (see all posts) 2010/04/09 (Fri) @ 12:26

Yes, that’s where I learned it, and from the Jeter examples.

In the catching case, you have to pair players, aggregating every pitcher/catcher combo, then each pitcher’s and catcher’s totals.

Rogers minus Rogers/Carter equals Rogers with every other catcher.

Carter minus Rogers/Carter equals Carter with every other pitcher.

The sum of how each pitcher worked with every other catcher is WithOut.
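
A bare-bones Python sketch of that bookkeeping is below. The counts are invented, "Pitcher B" and "Catcher X" are placeholder names, and the "without" side is a plain sum as described here, with no reweighting to match the "with" opportunities.

```python
from collections import defaultdict

# Hypothetical (events, opportunities) for each pitcher/catcher pair.
pairs = {
    ("Rogers", "Carter"): (12, 300),
    ("Rogers", "Catcher X"): (10, 200),
    ("Pitcher B", "Carter"): (8, 250),
    ("Pitcher B", "Catcher X"): (15, 250),
}

def wowy(catcher):
    """Return ((with_events, with_opps), (without_events, without_opps))."""
    # Each pitcher's totals across all of his catchers.
    pitcher_totals = defaultdict(lambda: [0, 0])
    for (pitcher, _), (events, opps) in pairs.items():
        pitcher_totals[pitcher][0] += events
        pitcher_totals[pitcher][1] += opps

    with_ev = with_opp = without_ev = without_opp = 0
    for (pitcher, c), (events, opps) in pairs.items():
        if c != catcher:
            continue
        with_ev += events
        with_opp += opps
        # Pitcher total minus this pair = the pitcher with every other catcher.
        without_ev += pitcher_totals[pitcher][0] - events
        without_opp += pitcher_totals[pitcher][1] - opps
    return (with_ev, with_opp), (without_ev, without_opp)

(with_ev, with_opp), (wo_ev, wo_opp) = wowy("Carter")
print("with Carter:   ", with_ev, "/", with_opp)
print("without Carter:", wo_ev, "/", wo_opp)
```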


#60    dq      (see all posts) 2010/04/09 (Fri) @ 12:32

#55/ Better put - I am trying to understand how it works.

Maybe more realistic with a lot less shared plays:

So, if I have a CF and his plays are 70% for him, 10% for the LF and 20% no one, and he catches 74 of them and the LF gets none, the CF rates worse than the LF

You don’t debit someone when their teammate makes a play - those are the credits, which work okay.

I think you do need to allocate the debits differently. Who do you charge when plays are not made?


#61    Guy      (see all posts) 2010/04/09 (Fri) @ 12:54

Brian/56:  I just meant that with PBP data, you have to distribute some of the blame for bad fielding to surrounding fielders (and vice versa).  If Jeter misses a ton of balls, the 2B and 3B will look worse.  When a ball goes thru the hole, it doesn’t know that—in this case—it’s more likely the SS’s fault.  It just applies the average out probabilities.  (Unless you use the approach Tango is suggesting, which incorporates the ratings of adjoining fielders.)

WOWY knows that Jeter should make X number of outs, given the pitchers and hitters on the field.  He will get debited for any difference, with none of the penalty given to his teammates.  (However, I think OF ballhogging may still be a problem for WOWY.)


#62    Tangotiger      (see all posts) 2010/04/09 (Fri) @ 13:08

Right, you can do a WOWY with Jeter/Cano, Jeter/Soriano, Jeter/etc, and see how many plays Jeter makes in the shared (and non-shared) zones.

This process is what Rally and MGL did for their “scooping” study of 1B I believe, whereby the pairs of players were studied in this manner.

Whenever you have two players that are heavily involved in a play, I can’t think of a better way to do it than WOWY.


#63          (see all posts) 2010/04/10 (Sat) @ 01:10

Every game is broadcast. Why guess where the fielder was positioned? If you are trying to make a fielding metric, stop guessing and do it right. Get someone to plot each play on an x/y axis on the actual field.


#64    Colin Wyers      (see all posts) 2010/04/10 (Sat) @ 01:15

Because it’s a lot of work - I mean, a lot of work. Because the broadcast view often doesn’t cut to the fielder until they’ve already started moving on the play. Because it’s one more data point that suffers from unquantifiable observer bias.


#65    MGL      (see all posts) 2010/04/10 (Sat) @ 04:44

"If you are trying to make a fielding metric, stop guessing and do it right.”

Who are you talking about?  It would be nice if someone at the game did that.  But they don’t.  As Colin said, you can’t do that from video.  If you (TPF) want to volunteer to go to every game and start doing that, be my guest.  You’d have to clone yourself first and that would be expensive.

And I’m not really sure we need to know (so much) where fielders are positioned on every play…


#66    Guy      (see all posts) 2010/04/10 (Sat) @ 08:20

And even if you had the data on where the fielder was positioned, you wouldn’t use it to calculate the out probability of individual plays.  Part of what we want UZR to measure is good positioning --not just whether the player is good at getting the ball given where they happened to start.  If a player has made a play more challenging than it needed to be, due to poor positioning, we certainly don’t want to reward him for that.  Knowing their positions would let you precisely estimate the average position for that particular base/out/handedness context, but it seems like there are other ways to get that.  I suppose it would also let you break down UZR into components, positioning and ‘getting the ball’, but it isn’t needed to measure overall fielding.


#67    Colin Wyers      (see all posts) 2010/04/10 (Sat) @ 10:07

Right.

Broadly speaking, there are two elements to fielding positioning. One is strategic concerns (holding the runner on, playing in for the potential bunt, playing back for the double play, etc.) You want to control for that when figuring out the value of a play. The other is setting yourself up in the best position to catch the ball. You don’t want to control for that - that’s part of what you’re trying to measure.


#68          (see all posts) 2010/04/10 (Sat) @ 11:16

Re 65/66/67, that’s true for an overall fielding metric like UZR, not so sure that’s true for a club that’s trying to use fielding data.  You’d like to know those things separately because positioning is easy to change/improve/coach while the other things like range and arm probably are less so.


#69    MGL      (see all posts) 2010/04/10 (Sat) @ 16:16

#68, right.

Plus you want to know the average positioning for the various kinds of batters, L/R, power, non-power, etc., in order to have accurate UZR’s.  So basically once you set the average position of each fielder for each situation/batter, etc., at that point you don’t care about the position of the fielder since that is included in “fielding skill.”

BTW, a ‘reverse UZR’, using the advanced methodology we are talking about in this thread, can be used to set fielder positioning for any batter once you know a batter’s projected spray chart…

