THE BOOK--Playing The Percentages In Baseball

Saturday, September 18, 2010

Reliability of UZR

By Tangotiger, 10:49 PM

Someone asked, I replied:

400 balls in play for fielding stats is as reliable as 200 PA for a hitter in batting stats.

For a SS, 400 BIP is 80 games.  For a corner OF, 400 BIP is 130 games.

200 PA is a bit under 50 games.

So, yes, I definitely said 2-3 years of fielding stats is as reliable as 1 year of hitting.


#1    Colin Wyers      (see all posts) 2010/09/19 (Sun) @ 00:08

Are we talking about “reliability” in the sense of autocorrelation here? (Split halves, year to year, etc.)

Because that doesn’t really answer any of the pressing questions, does it? What we know is that UZR (or any other defensive performance metric - DRS, TZ, etc.) will “stabilize” at a certain level, and so that players who UZR assigns a high numerical value to in set A will tend to be the players UZR assigns a high numerical value to in set B.

What we don’t know is how well UZR is doing in assigning those numerical values to the players who actually were good or bad at fielding baseballs in either set, and so we don’t know how much of the persistence of UZR is the prediction of a player’s actual fielding ability.


#2    Matt Swartz      (see all posts) 2010/09/19 (Sun) @ 00:44

I’m having the same issues as Colin here.  In other words, there is a difference between NOISE and BIAS.

Noise is deviations in small sample size that go to zero as the sample size approaches infinity. 

Bias is deviations due to persistent effects on measurements that do not change as the sample size approaches infinity.  In fact, like skill differences, their effect on correlation grows as noise goes to zero.

I think a lot of sabermetric statistics should be re-framed to distinguish noise and bias.  Especially with fielding statistics.  The question of whether to use a UZR-type metric or a TZ-type metric is a matter of measuring noise (probably larger in TZ) versus bias (probably larger in UZR).  That is the argument we should be having, and the calculations we should be doing.


#3          (see all posts) 2010/09/19 (Sun) @ 03:22

How exactly do we know that 3 years of UZR is as reliable as 1 year of hitting?

And assuming we know that, is there a number to go along with that reliability claim, like UZR over 3 years = 30 RAA +/- 5 runs?

With hitting, we have play by play, where we can see the outcome of each PA.  A K, BB and HR are defined by the rules of baseball.  A BIP is either a hit, out or an error.  What kind of hit depends on how many bases he gets (excluding advancing on an error).  Aside from judgments by official scorers and umpires, we can validate each and every hitter’s offensive stats.  And we have many hitters’ season stats to see how consistent they are from year to year and as hitters age, pretty reliable stuff going back 50-60 years, longer for some of the more basic stats.

With UZR we get one cumulative number at a moment in time during the season, or at the end of the season, and have only had UZR for less than 10 years, and the new and improved version is in its first year.  Of those 400-800 plays for a given player in a given year, we cannot determine what value was assigned to any play, let alone every play.

If Adrian Beltre makes a great diving catch of a LD down the line with 2 outs and the bases loaded, we have no way to see how that play contributed to his UZR.  No way to validate UZR that I can see.

I can look at 200 PA for a given batter and say his production over those 200 PA was pretty good or bad, even if it tells me little about his true ability as a hitter.

MGL has said UZR in SSS will not tell you how a player fielded (regardless of ability), and depending on the alignment of the stars, it is either useless or not very useful, or is only useful after 1 year (still SSS), and reaches its maximum usefulness at 3 years, after which the player is now 3 years older than at the start, and may not be the same fielder his 3 year average says he is.

Logically, if UZR has any value after 3 years, and we take it to be so as a matter of faith and the fact that high/low UZR players tend to correlate pretty well with our eyes, except for 1B and OF’ers at Fenway, then it must have just a bit of value in smaller samples. 

So if a fielder has a high or low UZR in a small sample, it must (should) mean he is not getting to as many balls as he should, or he is getting to more balls than a league average fielder.  Just like a Mendoza line hitter can hit 350 for a month, or a 900 OPS guy can have a 500 OPS for a month.

This could be due to luck, balls hit at the fielder or just out of his reach (just as a batter may have a high or low BABIP due to luck), but this has some value to those looking at how a player’s position has impacted his team.

For example, if someone claims that to his eyes a perpetual GG’er called Player A has been terrible at 3B over 40 games, and his UZR has been high, that Player A may or may not have been terrible, we can’t say, but according to the data, his play did not cost his team much, maybe because he was pretty lucky in his positioning.

Now this may not be predictive, but folks get too caught up with that.  Some of us like to look at stats to measure what a player has done and his contribution to his team’s wins, even if it does not reflect his true ability and has no predictive value.


#4    tangotiger      (see all posts) 2010/09/19 (Sun) @ 08:15

r=.50 when BIP=400, and PA=200.

Agreed on potential for bias.  Source of bias would be stringers, parks, batters, and pitchers, correct?  We can account for the last 3 easily enough.  For stringers, we can also focus just on road data.


#5          (see all posts) 2010/09/19 (Sun) @ 09:58

For stringers, we can also focus just on road data.

That wouldn’t address any issue of what Guy (I believe) termed range bias, i.e., balls systematically being scored as closer to the fielder who fielded them than they really were.

Not to mention that you’re now talking about looking at a six-year sample.


#6    Tangotiger      (see all posts) 2010/09/19 (Sun) @ 10:40

I agree that there is a definite bias in terms of whether the player recorded the out or not (either in location, or distinction between FB/LD).  I think I may have been among the first to note that.  Thanks for reminding me of that systematic bias.


#7    Colin Wyers      (see all posts) 2010/09/19 (Sun) @ 12:02

And to the extent that those biases are persistent across the data set (year to year or split halves), isn’t that showing up in the correlations? So isn’t our best assumption that the reliability of UZR (as a measure of fielding ability) is lower than those numbers suggest?


#8          (see all posts) 2010/09/19 (Sun) @ 15:13

Colin,

Having looked at the scorer bias phenomenon in hockey, I share your concern.

There are two types of bias, home team bias (+ or -) and counting bias.

If there are occasions with more than one stringer recording a game, an inter-rater reliability test would be a good first check.  So long as both stringers are cool with that, they’re doing thankless work, after all.

In hockey we used Cohen’s Kappa test.  A simple online Kappa app is at timeonice dot com slash cohen dot html if you’re interested.  If you’re looking at “line drives counted relative to balls hit to the outfield” ... then the scoring chances field would be line drives and the shots directed at net field would be batted balls that reached the outfield airborne.

Kappa’s specific reasoning can be applied to a multinomial problem as well (such as fly, liner or fliner?), though I’ve never had cause to write an app for that.
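
For readers who have not used it, here is a minimal sketch of Cohen's Kappa for two stringers scoring the same batted balls. The function name `cohen_kappa` and the "LD"/"FB" labels are made up for illustration; this is the textbook formula, not the online app mentioned above, and it works for any number of categories.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (any number of categories)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal rates, summed over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (p_observed - p_chance) / (1.0 - p_chance)

# Hypothetical example: two stringers classify four balls hit to the outfield.
stringer_1 = ["LD", "LD", "FB", "FB"]
stringer_2 = ["LD", "FB", "FB", "FB"]
kappa = cohen_kappa(stringer_1, stringer_2)
```

A kappa of 1 means the two stringers agree on every ball; a kappa near 0 means they agree no more often than their marginal rates would produce by chance.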


#9          (see all posts) 2010/09/19 (Sun) @ 15:32

Colin

Further on the same subject, in hockey there is an interesting phenomenon.  Methods fueled by the second moment, or error^2, such as the Z-score method, will show little or no scorer bias in the general population of stringers.  Using absolute error shows dramatic scorer bias, and using (abs error)^3 implies negative scorer bias ... i.e., the universe is collapsing.

I don’t know why this happens.  I suspect that guys who are recording very high levels of a particular event (say missed shots) become aware of it, and they subconsciously start recording ‘tweeners as on-goal.  Also there is obvious home bias in the general population, but not enough to confidently pin a value on any scorer in particular.  This despite just 30 scorers in play.

Also, scorers with bad goalies over several years tend to start becoming softer markers on shots as time wears on.

Do MLB scorers who record for teams that make a tonne of errors ... do they become softer markers over time?  I don’t know, but with God as the adjudicator ... I’d put everything I own on “YES” at even odds.

This is a subject that is probably better suited to someone with a background in psychology, and that’s not me.  I do find it fascinating though, I’d appreciate any links to articles written on the subject regarding MLB.  It’s important, methinks.


#10          (see all posts) 2010/09/19 (Sun) @ 15:58

Colin,

How does UZR predict TZ?

Specifically, if we take 20 random road games from a team’s schedule in 2008 (or whatever year) and calc the team UZR and TZ ... then select another 20 random games from the remaining ones in the same season and do the same ... what is the correlation (just Pearson’s r for now) across the samples?

If this is done 10000 times, what is the average r for home UZR to future home TZ? (and I’m being flexible with the notion of time here).

The same for all combinations, average Pearson’s r for road UZR at the team level to home TZ at the team level etc. UZR2UZR road, TZ2UZR home/road, TZ2UZR road home, etc.

It’s a starting point, but that should give us an arrow in the right direction, no?  The inherent bias in the UZR methodology, if it exists, should start showing itself. 

Makes sense, no?
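
A rough sketch of the resampling test proposed in this comment, assuming game-level values for two metrics were available (as noted later in the thread, they may not be). The data layout, the function names and the default of 20 games per sample are all hypothetical; each team maps to a list of per-game `(metric_a, metric_b)` pairs.

```python
import math
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def mean_split_sample_r(team_games, games_per_sample=20, trials=1000, seed=0):
    """Average r between metric A in one random set of games and metric B in a
    second, non-overlapping random set of games from the same team-season."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        first_totals, second_totals = [], []
        for games in team_games.values():
            if len(games) < 2 * games_per_sample:
                raise ValueError("not enough games for two disjoint samples")
            picked = rng.sample(range(len(games)), 2 * games_per_sample)
            first, second = picked[:games_per_sample], picked[games_per_sample:]
            first_totals.append(sum(games[i][0] for i in first))    # metric A, sample 1
            second_totals.append(sum(games[i][1] for i in second))  # metric B, sample 2
        total += pearson_r(first_totals, second_totals)
    return total / trials
```

Feeding it road games only, home games only, or the same metric in both slots would give the other combinations listed above.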


#11    Tangotiger      (see all posts) 2010/09/19 (Sun) @ 19:07

Colin/7: yes, definitely.  If you have a systematic bias, then the correlation will capture that.  I don’t know the extent to which it does.  Just taking a guess that if r=.50 when BIP=400, that it would probably be r=.49 for fielding talent and r=.10 for the systematic bias.

To those confused by the math: .50^2 = .49^2 + .10^2
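
The arithmetic works because independent components add in variance (r squared), not in r. A small sketch, assuming talent and bias are independent of each other; the function names are illustrative only, and the .49/.10 split is the guess from this comment, not a measured value.

```python
import math

def implied_total_r(r_talent, r_bias):
    """Combine two independent sources of shared variance into one correlation."""
    return math.sqrt(r_talent ** 2 + r_bias ** 2)

def bias_share_of_shared_variance(r_talent, r_bias):
    """Fraction of the shared (non-random) variance that is bias rather than talent."""
    return r_bias ** 2 / (r_talent ** 2 + r_bias ** 2)

total_r = implied_total_r(0.49, 0.10)                    # very close to .50
bias_share = bias_share_of_shared_variance(0.49, 0.10)   # roughly 4% of the shared variance
```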


#12    pm      (see all posts) 2010/09/20 (Mon) @ 02:39

My problem with UZR is the 3 years needed for reliability. A player could start this season at age 29 and by the time he has enough reliable data he will be 32 years old. Chances are he has declined defensively, so it becomes hard to tell what his true talent is: the 3 year average or the declining UZR ratings. Plus 1 year of batting data is still unreliable to me. Aubrey Huff is one of the better hitters this year after having a -1.3 UZR season last year. If we had a guy go from -25 UZR in a 3 year sample to a +20 UZR in a 3 year sample, you would find it fishy.


#13    Colin Wyers      (see all posts) 2010/09/20 (Mon) @ 12:50

Okay, just musing aloud here. What I did was simulate 650 “forecasts” or “true talent levels” for some pseudo ZR stat, where the average was .710 and the SD of talent was +/- 10 per 133 chances, or 0.075. (133 chances is 400/3, so in order to keep with what Tango was presenting above.)

The correlation between this and the “results” the next season (the forecast with random error added in; 0.039 = 1 SD) was .89 (or so - I ran it a few times but it stays pretty steady). So looking to year to year correlation, the best we can hope for is 79% of the variance explained by regressed year one. (Again, given some pretty crude starting assumptions - you can play with the inputs and come up with some different values.)

So now let’s add in random noise to the year one values as well - let’s stop calling it year one, and call it the “prior,” since we’re going to see how increasing sample size changes the values. I’m listing the BIP used to figure random error, then the correlation:

133 CH: .77
266 CH: .81
400 CH: .83
532 CH: .85

Those are all “rough” values - I ran a handful of tests, but not enough. Anyway it’s a crude exercise and these are really just for illustration purposes. Random variance seems to reasonably quickly approach the theoretical “best” value. Now I’m holding true talent constant from the priors to the following season, which of course isn’t how it really works.
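
The exercise described above can be approximated with a short simulation. This is a sketch under stated assumptions, not the code actually used: talent is drawn from a normal distribution, random error is normal with the binomial SD at the league-average rate (which comes out near the 0.039 quoted above for 133 chances), and true talent is held constant between the prior and the following season. All function and parameter names are made up.

```python
import math
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def binomial_sd(rate, chances):
    """SD of an observed rate over a given number of chances."""
    return math.sqrt(rate * (1.0 - rate) / chances)

def simulate(prior_chances, n_players=650, mean=0.710, talent_sd=0.075,
             next_chances=133, seed=0):
    """Return (r of talent vs next season, r of noisy prior vs next season)."""
    rng = random.Random(seed)
    talent = [rng.gauss(mean, talent_sd) for _ in range(n_players)]
    prior_sd = binomial_sd(mean, prior_chances)
    next_sd = binomial_sd(mean, next_chances)
    prior = [t + rng.gauss(0.0, prior_sd) for t in talent]        # observed prior
    next_season = [t + rng.gauss(0.0, next_sd) for t in talent]   # observed results
    return pearson_r(talent, next_season), pearson_r(prior, next_season)

for chances in (133, 266, 400, 532):
    r_talent, r_prior = simulate(chances)
    print(chances, round(r_talent, 2), round(r_prior, 2))
```

Individual runs will wobble by a point or two depending on the seed, so the output should be read as rough values in the same spirit as the list above.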

So anyway. Let’s say, again just illustrating, that 69% of the variance in our pseudo-ZR metric in the season of interest is explained by the variance in the past three seasons of pseudo-ZR, compared to 66% for just the previous two seasons or 59% for the past season alone.

Of the covariance - which is to say, the variance not explained by randomness - we can roughly break it down into three components:

1) The individual player’s skill at fielding.
2) Environmental effects (park effects on *fielding*, the effects of teammates on positioning, etc.)
3) Measurement error (observer positioning bias, range bias, caught/not caught bias [those two are overlapping but not identical], etc.)

And the thing is - there’s really no reason to think the proportion of those three things, relative to each other, changes as sample size increases, is there?

So the effects of scorer bias become *greater*, not lesser, as you increase your sample size and you reduce the effects of randomness. In Tango’s example above, he’s suggesting that if 25% of the variance in the season of interest is explained by variance in the previous three seasons, then 1% of that may be explained by data bias issues. I think from what we’ve seen in comparing various defensive metrics, the magnitude of the effect is probably greater than that.

Let’s split it in half, just to make the math simple. We’ll go to the previous example, and just say that roughly 44 percent of the variance in the season of interest not in common with the previous seasons is due to aging (that’s a total bullsh!t number, but I’m just trying to illustrate this), in addition to the variance explained by randomness. So we end up with this:

3: 25%, 2: 22%, 1: 15%

That implies a year to year correlation of about .4 for pseudo-ZR, which really isn’t too far off from what you see with defensive stats. We’re in the ballpark here, at least. So that would give you variance in season of interest explained by bias in prior years, again presuming a 50/50 split for no good reason: 12.5%, 11%, 7.5%.
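
For anyone following the arithmetic, this sketch reproduces the chain of numbers: square the simulated correlations, subtract the made-up 44% attributed to aging, then split what is left 50/50 between skill and bias. The function name and parameters are illustrative, and the 44% and 50% inputs are the admittedly arbitrary values used in this comment.

```python
import math

def explained_shares(prior_r, aging_share=0.44, bias_fraction=0.5):
    """Map seasons-of-prior -> shared variance, bias portion, and implied r."""
    shares = {}
    for seasons, r in prior_r.items():
        simulated = round(r * r, 2)          # e.g. .83 squared is about 69%
        shared = simulated - aging_share     # variance still in common after aging
        shares[seasons] = {
            "shared": shared,
            "bias": shared * bias_fraction,  # portion assigned to bias
            "implied_r": math.sqrt(shared),  # correlation implied by the shared variance
        }
    return shares

shares = explained_shares({3: 0.83, 2: 0.81, 1: 0.77})
```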

So what we can see is that we should expect the effects of bias in fielding metrics to get WORSE as we increase our sample size (or regress to the mean, which is really just an approximation of how increasing sample size works at a population level). That’s why statistical reliability doesn’t tell us much about the usefulness of UZR (or any other defensive metric based on subjective data), in the absence of information on the magnitude of the effects of bias relative to player skill.


#14    Colin Wyers      (see all posts) 2010/09/20 (Mon) @ 12:52

Vic, those all sound like great ideas for tests to run - and I’ve thought of a few others besides. But without access to game-level UZR splits (or a second source of batted ball data, other than Gameday), I can’t run those tests.

Broadly speaking, though, I don’t know that interrater reliability tells us the whole story - I can think of plenty of reasons that two raters would agree without either of them being “correct.”


#15    Tangotiger      (see all posts) 2010/09/20 (Mon) @ 13:26

per 133 chances, or 0.075. (133 chances is 400/3

You mean 133 games, right?  Basically, there’s about 4 BIP per game per (the 7 non-C, non-P) position.  400 BIP is about 100 games (80 for SS, 133 for corner OF).


#16    Tangotiger      (see all posts) 2010/09/20 (Mon) @ 13:27

I also have to re-read starting from here:

And the thing is - there’s really no reason to think the proportion of those three things, relative to each other, changes as sample size increases, is there?

I got lost here, so I have to try to understand this.


#17    Colin Wyers      (see all posts) 2010/09/20 (Mon) @ 13:33

You mean 133 games, right?  Basically, there’s about 4 BIP per game per (the 7 non-C, non-P) position.  400 BIP is about 100 games (80 for SS, 133 for corner OF).

I got a little confused about the denominators - I read “if r=.50 when BIP=400” as meaning that three years of defensive data meant 400 CH. It doesn’t change the point any, just the individual numbers being bandied about. Correlations if I change those parameters:

True: .85
400: .75
800: .78
1200: .79
1600: .81


#18    Colin Wyers      (see all posts) 2010/09/20 (Mon) @ 13:35

Think of it in these terms. Variance in pseudo-ZR can be explained by:

Variance in skill + variance in environment + measurement error + randomness

So if you strip out the random element, the question is if:

Measurement error/(variance in skill + variance in environment)

ever changes as a result of an increase or decrease in sample size.


#19    Tangotiger      (see all posts) 2010/09/20 (Mon) @ 13:39

It doesn’t change the point any

Right I agree.

To be clear, the reliability numbers are BIP=400 : PA=200 : r=.50

PA=200 is 50 batting games.  BIP=400 is 100 fielding games.

So, if you have 150 batting games (one season), that level of reliability would be 300 fielding games (2 seasons).

For corner OF, BIP=400 -> 130 games.


#20    Colin Wyers      (see all posts) 2010/09/20 (Mon) @ 16:04

Or to put it another way.

Is there any evidence that UZR over three years is more descriptive of what occurred in those three years than UZR in one year?

Or a third way, which is the second way but a bit broader - does the quality/usability/etc. of the batted ball data increase with sample size?


#21    Tangotiger      (see all posts) 2010/09/20 (Mon) @ 16:52

Is there any evidence that UZR over three years is more descriptive of what occurred in those three years than UZR in one year?

If you run a 2008/2009 regression against 2010, and we get an r greater than 2009 onto 2010, would that satisfy that requirement?  If not, what is the greater r telling us?


#22          (see all posts) 2010/09/20 (Mon) @ 18:11

Tango/21, the correlation between one season and the next includes the correlation due to bias.


#23    Colin Wyers      (see all posts) 2010/09/20 (Mon) @ 20:20

If you run a 2008/2009 regression against 2010, and we get an r greater than 2009 onto 2010, would that satisfy that requirement?  If not, what is the greater r telling us?

R tells us how much of the population variance is shared between two sets of data. That’s it. 2009 is more likely to have fielding skill in common with 2010 than 2008 is, but it is also more likely to have bias in common (assuming we have a decent idea of the potential sources of bias).

In other words, what the year to year correlation of UZR tells us is how well UZR predicts UZR. Well, what good is knowing (or estimating, I should say) future UZR? And that really depends on how well UZR describes what happened, doesn’t it?

But let’s say we have a player whose defensive performance we’re interested in. In one season, he puts up a +15 UZR in 150 games. How confident are we that his performance (not his ability) was worth 15 runs to his team? Say we now have three years of data, and it prorates out to +15 runs per 150 games. How confident are we now that his performance has saved 15 runs for his team per season?

And knowing the likelihood that he will have a +15 UZR in 150 games the next season doesn’t answer that question at all - everybody agrees that with three years of data, we can be more confident in predicting what his UZR will be in the future. It’s just that until we know how good a descriptor of defense UZR is, knowing its ability to predict itself is meaningless.


#24    tangotiger      (see all posts) 2010/09/20 (Mon) @ 20:42

Mike: yes.  But, what’s your point?

***

In other words, what the year to year correlation of UZR tells us is how well UZR predicts UZR. Well, what good is knowing (or estimating, I should say) future UZR?

This doesn’t apply just to UZR, but anything.

UZR, year1 = talent, year1 plus noise plus bias
UZR, year2 = talent, year2 plus noise plus bias

So, the “good” to know is that talent is part of the equation.

I’m not sure I understand the objection here.  Is your objection to ANY year-to-year correlation, or just to UZR?


#25          (see all posts) 2010/09/20 (Mon) @ 21:42

We know that fielding measurements are especially susceptible to bias.  Indeed, for reasonable sample sizes, i.e., season or more, we are concerned that bias may be a bigger factor than random noise.

Year-to-year correlations are great measures when you are mostly concerned about the size of the noise relative to the signal.  When you are more concerned about persistent bias than about noise, year-to-year correlations can give you an incorrect sense of what is going on.
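
To make the noise-versus-bias distinction concrete, here is a minimal sketch using the "talent plus noise plus bias" model from #24. It assumes the three components are independent and that talent and bias are identical in both seasons; the variances are made-up numbers chosen only to show the pattern.

```python
def year_to_year_r(var_talent, var_bias, var_noise):
    """Correlation between two seasons that share the same talent and the same bias."""
    persistent = var_talent + var_bias
    return persistent / (persistent + var_noise)

def talent_share(var_talent, var_bias, var_noise):
    """Fraction of the variance in the observed metric that is actual talent."""
    return var_talent / (var_talent + var_bias + var_noise)

# Equal talent and bias variance, with noise shrinking as the sample grows.
for var_noise in (4.0, 1.0, 0.25):
    print(var_noise,
          round(year_to_year_r(1.0, 1.0, var_noise), 2),
          round(talent_share(1.0, 1.0, var_noise), 2))
```

As the noise term goes to zero the year-to-year correlation heads toward 1, but the talent share can never exceed var_talent / (var_talent + var_bias). Under these assumptions, a high year-to-year r says nothing about how that ceiling is split between talent and bias.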


#26    tangotiger      (see all posts) 2010/09/21 (Tue) @ 00:01

i.e., season or more, we are concerned that bias may be a bigger factor than random noise

Bias will overtake random noise after several seasons.  I don’t know how many, but something around 5-10 seasons I’d guess.

Mitigating that would be that players move around a lot.  Not to mention that some of the bias is systematic toward good players and bad players.


#27    Colin Wyers      (see all posts) 2010/09/21 (Tue) @ 00:46

All this hinges on how much bias there is, doesn’t it?

Okay, a quick little study. I took the RZR data from Fangraphs, as well as BIP data from Retrosheet, and mapped the two together. From here, when I refer to BIP, I mean team BIP, not prorated BIP like Tango was referring to before.

I took two quantities: BIZ per BIP, and plays made (Plays plus OOZ) per BIP. For players who were on the field for at least 200 BIP in consecutive seasons, the year to year correl for BIZ/BIP and PM/BIP:

1B: .14, .15
2B: .16, .22
3B: .16, .16
SS: .27, .24
LF: .39, .15
CF: .28, .28
RF: .42, .23

Those corner OF numbers are scary, aren’t they? Corner outfielder opportunities appear to be a much more persistent “skill” than corner outfielder plays made!


#28    tangotiger      (see all posts) 2010/09/21 (Tue) @ 01:27

Yes, those are fascinating numbers!  All of those numbers are very bothersome.


#29    joe arthur      (see all posts) 2010/09/21 (Tue) @ 05:55

Colin,

Did you limit your study in #27 to 2006 and onwards? RZR went through methodology changes. In 2004 and 2005 for LF and RF, BIZ were identified more generously (50% more in 2004-2005 than in 2006-2009) and in 2004 only, plays made were in a .94:1 ratio to putouts, whereas plays made are roughly equal to putouts from 2005 on.


#30    Peter Jensen      (see all posts) 2010/09/21 (Tue) @ 07:01

I took two quantities: BIZ per BIP, and plays made (Plays plus OOZ) per BIP. For players who were on the field for at least 200 BIP in consecutive seasons, the year to year correl for BIZ/BIP and PM/BIP:

Colin - I am not sure that I am completely understanding what you did here.  The BIP is still Team BIP?  If so, when you say that you included players that had 200 BIP in consecutive years that would be all players that played in about 8 full games in consecutive years?  Isn’t this just showing what a lousy proxy Team BIP makes for chances in a defensive metric for players with a small sample size?  How does this relate in any way to bias?


#31    Colin Wyers      (see all posts) 2010/09/21 (Tue) @ 09:36

Joe, I looked at BIZ per BIP at the league level and 2003 seemed like the only severe outlier (I dropped it from the study). I’ll look at that and get back to you.

Peter, I’m not sure how including low-BIP players in the study would increase the correlation.


#32    Tangotiger      (see all posts) 2010/09/21 (Tue) @ 10:04

Now that I’m awake. 

Colin’s study is creative.  Let’s look at what it’s trying to do.  If there is no stringer bias, we would expect that the number of balls in zone would have no correlation year-to-year on that basis.  We would expect correlation based on batters and pitchers and possibly parks. 

If you look at each 3B, he probably sees a random enough set of batters that there would be no correlation year-to-year.  Basically, we don’t expect Zimmerman to face a disproportionate number of RHH who will pull the ball compared to Longoria.

With parks, while a bit bothersome (much moreso for OF than IF), let’s dismiss it for IF.

That leaves us with pitchers.  So, when you see Colin’s data for correlation of balls in zone per ball in play for infielders:

1B: .14
2B: .16
3B: .16
SS: .27

We don’t get zero, possibly because of the pitchers.  And/or the stringers.

In order to handle the pitchers effect, we should look at fielders who switched teams.  Or run a correlation by splitting the data (in-season, or combining two adjacent seasons) by looking at SP v RP for the same team.  I’m sure there are other bright ways you can do it (correlate all the A-L pitchers of the two seasons to the M-Z pitchers).

If you do that, we should have r=.00.  If you still end up with an r above zero, then we have possible stringer bias.

***

I’m not sure that what Colin is doing here proves what he’s saying:
LF: .39, .15
CF: .28, .28
RF: .42, .23

As I noted, you will get a high correlation because you’ve got the parks issue.  The “zone” will change from park to park, I presume.  I don’t know how many square feet (the in-zone) Fenway’s LF is, but I’m sure it’s different from (the in-zone) Coors’ LF.  And even if it wasn’t, I don’t know that we should expect the same number of BIZ for each LF park.

So, I think Colin is on the right path here.  But, I don’t think he’s isolated bias from the signal.  Yet.


#33          (see all posts) 2010/09/21 (Tue) @ 10:19

In order to handle the pitchers effect, we should look at fielders who switched teams.

Tango, I believe I reported that already, both at THT (from Twitter) and reposted here, unless we are talking about something new that I’m not following.

http://www.hardballtimes.com/main/blog_article/from-twitter-uzr-and-plus-minus/
http://www.insidethebook.com/ee/index.php/site/comments/state_of_fielding/#12

CW: Can we look at year-to-year correlation for ExO in players who DON’T switch?
MF: I can’t find ExO on Fangraphs any more, is it okay to use BIZ?
CW: Hrm. Looks like they pulled DG as well. I don’t think BIZ quite gets at what I’m measuring, but it could work.
MF: Team-switchers BIZ/inn for outfielders y-t-y correlation R^2=0.06, for non-switchers R^2=0.18.
MF: Team-switchers BIZ/inn for infielders y-t-y correlation R^2=0.55, for non-switchers R^2=0.65.
MF: The difference between BIZ and ExO being ExO includes a measure of difficulty, e.g. how hard the scorer thought the ball was hit? Is that what you’re saying?
CW: Yeah. Still, I wonder how much of the BIZ correlation is pitcher tendencies and how much is scorer bias.


#34          (see all posts) 2010/09/21 (Tue) @ 10:22

Oh, I see, Colin did BIZ/BIP, whereas I did BIZ/Inn, which isn’t quite the same thing.

For team-switchers, it should still be close to zero, but for players on the same team the strikeout tendencies of the pitching staff will matter more per inning than they would per BIP.

So my earlier data should still be a valid example for team-switchers.


#35          (see all posts) 2010/09/21 (Tue) @ 10:33

And there’s some discussion in the previous thread of other improvements that could be made to my comparison, such as normalizing per position, etc.  I had forgotten some of that discussion.  Carry on.


#36    Peter Jensen      (see all posts) 2010/09/21 (Tue) @ 10:35

Peter, I’m not sure how including low-BIP players in the study would increase the correlation.

Colin - I don’t see any evidence that including low-BIP players in the study does increase the correlation.  I am not sure what the correlation is supposed to mean.  I asked 4 questions in my previous post. I really did need clarification on how you were performing the study and how you were connecting the results to bias.


#37    Tangotiger      (see all posts) 2010/09/21 (Tue) @ 10:52

MF: Team-switchers BIZ/inn for infielders y-t-y correlation R^2=0.55, for non-switchers R^2=0.65.

Well, this is quite alarming, isn’t it?  I can’t believe I didn’t address it.  r-squared = .55 means r=0.74.

While you used inning, and we should use BIP, we should expect similar results because we are looking at team switchers.

Now, what Mike is saying is: the correlation of balls in zone per ball in play is 0.74 year-to-year for guys who switched teams.  This is not balls fielded, but simply balls marked as reasonably playable for an out!

This is such a huge number, this 0.74, as to be, well, so unbelievable that it is unbelievable.  I mean, I get r=.74 for batting stats for players who have 500 PA.  It is that strong and powerful.

This is what I’m going to guess Mike did: he did not focus on position by position.

That is, imagine that there is no stringer bias at all.  And we find that Longoria has 0.13 balls in zone per ball in play with Tampa, then he’s traded to the Sox, and he gets 0.12 balls in zone per BIP.  And we have Ichiro with 0.10 balls in zone per BIP who gets traded and with the new team he has 0.105 BIZ per BIP.

You can see that there is a definite position bias.

What you want to do is run a correlation on a position by position basis.  That is, for all the team-switching 3B, what is the y-2-y correlation of BIZ per BIP.

I’m going to guess r will be .10 or less.


#38          (see all posts) 2010/09/21 (Tue) @ 11:00

Yeah, Tango/37, we covered some of that in the previous thread, as I realized after I reposted.

I can pretty quickly give the numbers position by position, but what I don’t have the data for is team BIP, all I have is defensive innings.  Colin has the team BIP data, I believe.


#39          (see all posts) 2010/09/21 (Tue) @ 11:05

For example, for shortstops, the correlation in BIZ/Inn is r=.22 (R^2=.05) for team-switchers and r=.49 (R^2=.24) for non-switchers.

For second base, r=.14 for team-switchers and r=.22 for non-switchers.


#40    Tangotiger      (see all posts) 2010/09/21 (Tue) @ 11:06

Well, you can get pretty close by just doing:
(INN*3-SO)*.95+H-HR
That is, about 95% of the non-K outs are batting outs.

But, as you noted, since we are looking at team switchers, we can just stick with innings with little loss of accuracy.
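
The estimate in this comment, written out as a function. The name `estimate_team_bip` and the example inputs are hypothetical; the formula and the 95% factor are the ones given above.

```python
def estimate_team_bip(innings, strikeouts, hits, home_runs, batting_out_share=0.95):
    """Approximate balls in play: batting outs on non-strikeouts plus hits that stayed in the park."""
    non_k_outs = innings * 3 - strikeouts
    return non_k_outs * batting_out_share + hits - home_runs

# Made-up line: 100 innings, 50 strikeouts, 90 hits, 10 home runs.
example_bip = estimate_team_bip(100, 50, 90, 10)
```

As #42 points out, the inputs would need to be the team totals for the innings when a given fielder was actually on the field, not the full-season team totals.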


#41          (see all posts) 2010/09/21 (Tue) @ 11:11

For third basemen, correlation in BIZ/Inn is r=.06 for team-switchers and r=.34 for non-switchers.

For first base, r=.04 for team-switchers and r=.08 for non-switchers.


#42          (see all posts) 2010/09/21 (Tue) @ 11:16

Tango/40, the problem is in figuring out those numbers for when each fielder was on the field.  It’s not simply the team BIP totals for the year, it’s the team BIP totals when that fielder was on the field.

Yes, I could build a query to do that, and if I’m doing that I might as well query actual team BIP, but Colin already has that, I think.


#43    Colin Wyers      (see all posts) 2010/09/21 (Tue) @ 11:37

Doing as Joe suggests, I restricted the seasons of interest to 2006 through 2009:

Pos    BIZ    PM
1B    0.02    0.17
2B    0.14    0.22
3B    0.15    0.21
SS    0.23    0.25
LF    0.06    0.08
CF    0.22    0.18
RF    0.16    0.24

Severe changes seem to be restricted to 1B, LF and RF.

Yes, this is all a very clumsy way to study the issue. The trouble is that we don’t have any non-clumsy ways to study any dataset other than Retrosheet/Gameday.


#44    Tangotiger      (see all posts) 2010/09/21 (Tue) @ 11:42

Mike: got it.

Can you also show the number of players, and the sum of min(INN1, INN2) for each position?

Are you restricting a minimum playing time, or are you weighting the correlation based on the innings?


#45          (see all posts) 2010/09/21 (Tue) @ 11:59

Tango/44

For team switchers

Pos    n    min-Inn
SS    23    22912
3B    21    17774
2B    26    21132
1B    21    17387

I restricted to minimum of 400 innings in both seasons.  I did not weight the correlation based on innings.
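
For anyone who wants to compare the two approaches raised in comment 44, here is a minimal sketch of an unweighted and an innings-weighted Pearson correlation after a 400-inning cutoff. The player rows are invented, and weighting each pair by min(INN1, INN2) is only one possible choice, not something settled in this thread.

```python
from math import sqrt


def pearson(xs, ys, weights=None):
    """Pearson r; with weights, each (x, y) pair counts in proportion to its weight."""
    if weights is None:
        weights = [1.0] * len(xs)
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    cov = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    var_x = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    var_y = sum(w * (y - mean_y) ** 2 for w, y in zip(weights, ys))
    return cov / sqrt(var_x * var_y)


# Hypothetical rows: (BIZ/Inn year 1, Inn year 1, BIZ/Inn year 2, Inn year 2)
players = [
    (0.31, 1200, 0.29, 1100),
    (0.27, 450, 0.33, 900),
    (0.35, 1300, 0.34, 1250),
    (0.30, 380, 0.22, 700),  # dropped: under 400 innings in year 1
    (0.28, 800, 0.30, 600),
]

MIN_INN = 400
kept = [p for p in players if p[1] >= MIN_INN and p[3] >= MIN_INN]
rate_1 = [p[0] for p in kept]
rate_2 = [p[2] for p in kept]
weights = [min(p[1], p[3]) for p in kept]  # assumed weighting scheme

r_unweighted = pearson(rate_1, rate_2)
r_weighted = pearson(rate_1, rate_2, weights)
print(r_unweighted, r_weighted)
```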


#46          (see all posts) 2010/09/21 (Tue) @ 12:03

For non-switchers

Pos    n    min-Inn
SS    114    114482
3B    122    113092
2B    113    103125
1B    119    110763

Min 400 innings


#47    Tangotiger      (see all posts) 2010/09/21 (Tue) @ 12:11

Great stuff.  Ok, for team-switchers, we have these correlations (r), number of players (n), and number of full seasons (minimum innings divided by 1458)

Pos    n    seasons    r
SS    23    16    .22
2B    26    14    .14
3B    21    12    .06
1B    21    12    .04
---    --    --    --
Tot    91    54    .12
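
The table can be reproduced from the numbers in comments 39, 41 and 45 with a few lines. One assumption here: the comment does not say how the total r was combined, so this sketch uses a seasons-weighted average of the positional r values (weighting by n happens to give the same two decimals). An average of correlations is not the same thing as a correlation on the pooled data.

```python
FULL_SEASON_INNINGS = 1458  # 162 games * 9 innings

# pos: (n players, sum of min innings, r for team-switchers), taken from the thread
rows = {
    "SS": (23, 22912, 0.22),
    "2B": (26, 21132, 0.14),
    "3B": (21, 17774, 0.06),
    "1B": (21, 17387, 0.04),
}

# Full-season equivalents, rounded to whole seasons as in the table
seasons = {pos: round(inn / FULL_SEASON_INNINGS) for pos, (n, inn, r) in rows.items()}

total_n = sum(n for n, inn, r in rows.values())
total_seasons = sum(seasons.values())

# Assumed combination rule: average r weighted by full-season equivalents
pooled_r = sum(seasons[pos] * r for pos, (n, inn, r) in rows.items()) / total_seasons

print(total_n, total_seasons, round(pooled_r, 2))
```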

The high r for the middle-infielders is bothersome.  I presume that since 3B and 1B don’t play too far off the line, regardless of who the player is, but SS and 2B can shift a great deal, that the marking of plays for the middle infielders is subject to the “did he make the play or not” bias.

This is huge if true, and definitely deserves further study with more data.

It will also be good to know if BIS and STATS have the same bias.  Ben, MGL, anyone else out there want to try?

Great job!


#48          (see all posts) 2010/09/21 (Tue) @ 16:14

Tango

Just to further Colin’s point regarding autocorrelation.  And knowing that you follow hockey ...

As you know, a lot of folks have attempted to define shot quality in NHL hockey.  The parallels between that and MLB defense are enormous, though hockey has the distinct advantage of an end result for comparison (actual shooting% and save%).

Chris Boersma, Gabriel Desjardins, Graeme Johns, Alan Ryder and, of course, many others have researched this.  The writer that I think captures the problem is Krzywicki.  This paper is terrifically honest:
http://www.hockeyanalytics.com/Research_files/Shot_Quality_Krzywicki.pdf

The season to season autocorrelations are terrific.  (Ryder/Kryz blend here)

06/07 to 05/06—r=.61
05/06 to 04/05—r=.77
04/05 to 03/04—r=.58
03/04 to 02/03—r=.50

Ken used logistic regression, and he factored in a tonne of minutiae that had been ignored by other methods:

Type of shot (slap, wrist, snap, deflection)
Time since a giveaway
Weighted rebound/scramble effects
Game situation
etc.

All were reasonably applied.  Ken knows the game.  Somewhat paradoxically, this was damaging.  The more detail added, all this extra information applied in a reasonable way by someone who understood the game ... it made it worse.

This is because the effect of the rogue scorers, for any particular stat, became exaggerated with his logistic regression.  That was the cause of the high autocorrelations.

The predictive value in road games, at even strength, vanishes to virtually nil.  That tells us that Ken has inadvertently concentrated scorer bias with his model.

So with the several defensive metrics for MLB, personally I tend to lean away from the ones with the highest autocorrelation.  This until I have seen home/road splits for the stat.

Fangraphs doesn’t seem to want to give me home/road splits for UZR.  Probably because I’m doing something wrong.  I have no idea where some of you cats get your wonderful numbers.


#49          (see all posts) 2010/09/21 (Tue) @ 16:25

Also, google just isn’t being my friend.

Are there places where the methodologies used to calculate TZ, UZR, DRS are explicitly defined?



#51          (see all posts) 2010/09/21 (Tue) @ 18:30

Thanks Tango, though for the life of me I can’t find the fielding zones on retrosheet.  I apologize in advance, I’m sure it will be obvious in hindsight.

In any case, I like that MGL has implicitly given us the framework for a zero-sum game re infield defense.  And it’s what happens at the edges of the zones that matters.  It is indeed the Krzywicki effect at play.

We can model this, I think.  It would be better if there had been more stringer bias testing.  I suspect that there are a few rogues and another bunch of vulcan mind melds happening in a division or two.  Still, this should be workable.

Bill James style sensibility tests, which we will call multivariate hypergeometric tests from here forward (this to make ourselves appear more clever) ... that should work here.

A ground ball has to go somewhere, after all.  And if the spread amongst infielders, home to away, is more than luck itself expects ... that difference is our scorer bias in the critical area.

You agree?
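
A rough sketch of the home/away spread check described above. It swaps in a plain chi-square test of homogeneity as a large-sample stand-in for the multivariate hypergeometric test, and the ground ball counts are invented. A statistic well above the 5% critical value (7.815 for 3 degrees of freedom) would say the home and away splits differ by more than luck alone expects.

```python
def chi_square_homogeneity(home, away):
    """Chi-square statistic comparing how balls are split across positions, home vs away."""
    positions = sorted(home)
    home_total = sum(home.values())
    away_total = sum(away.values())
    grand_total = home_total + away_total

    statistic = 0.0
    for pos in positions:
        pooled = home[pos] + away[pos]
        for observed, row_total in ((home[pos], home_total), (away[pos], away_total)):
            # Expected count if home and away share one underlying distribution
            expected = pooled * row_total / grand_total
            statistic += (observed - expected) ** 2 / expected

    degrees_of_freedom = len(positions) - 1
    return statistic, degrees_of_freedom


# Hypothetical ground balls assigned to each infielder's zones for one team
home_counts = {"1B": 180, "2B": 420, "SS": 520, "3B": 330}
away_counts = {"1B": 190, "2B": 450, "SS": 470, "3B": 340}

stat, dof = chi_square_homogeneity(home_counts, away_counts)
print(stat, dof)
```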


#52    tangotiger      (see all posts) 2010/09/21 (Tue) @ 19:38

Vic:
http://www.retrosheet.org/location.htm

And yes, the framework that MGL laid out is pretty standard really, and applicable to fielding or goalies or what have you.  It’s always the same question being answered: “Given these characteristics or parameters or context, what would an average player have done?”

So, we use our baseball smarts to figure what kind of contexts to consider: batters, pitchers, park, count, pitch types, pitcher tendencies, base/out, game state, angle, distance, trajectory, hang time, fellow fielders, and what have you.

We then have to figure out the data quality, and how much bias and noise is in the metric.

Once you have determined all the parameters to consider, you figure out how you want to handle each of them (independently or in conjunction with other parameters).

It’s straightforward in concept, and somewhat straightforward in implementation.
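
A bare-bones sketch of that framework, for illustration only. The context here is just a (batted ball type, zone) pair and the plays are invented. Real systems use far more parameters, and none of this is how UZR or any other named metric is actually computed.

```python
from collections import defaultdict


def league_out_rates(plays):
    """Share of balls turned into outs in each context, across all fielders."""
    seen = defaultdict(int)
    made = defaultdict(int)
    for context, fielder, was_out in plays:
        seen[context] += 1
        made[context] += was_out
    return {context: made[context] / seen[context] for context in seen}


def plays_above_average(plays, rates):
    """Per fielder: actual outs minus what an average fielder would have done."""
    totals = defaultdict(float)
    for context, fielder, was_out in plays:
        totals[fielder] += was_out - rates[context]
    return dict(totals)


# Hypothetical plays: (context, fielder, 1 if out else 0)
plays = [
    (("GB", "hole"), "A", 1),
    (("GB", "hole"), "B", 0),
    (("GB", "up_middle"), "A", 1),
    (("GB", "up_middle"), "B", 1),
    (("GB", "hole"), "A", 0),
    (("GB", "hole"), "B", 0),
]

rates = league_out_rates(plays)
result = plays_above_average(plays, rates)
print(rates, result)
```

By construction the fielder totals sum to zero, since the average is taken over the same plays being scored.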


#53          (see all posts) 2010/09/21 (Tue) @ 20:10

Thanks for the retrosheet link, Tango.  Much appreciated.

I apologize for the inconvenience in advance; can you link to the home/road splits for any of the fielding metrics?

And by the by, I’m most definitely not questioning the baseball smarts of the people who are tackling these issues.  The opposite in fact, I’m impressed, and I’m learning a lot. 

By the same token, I wasn’t taking a shot at Ken Krzywicki above, far from it.  I think his knowledge of the game of hockey is terrific ... that’s shown explicitly in his English language reasoning.


#54    Tangotiger      (see all posts) 2010/09/21 (Tue) @ 20:13

Vic, I didn’t take it that you were making a swipe at anybody.

The splits data is at the bottom of this page:
http://www.fangraphs.com/statsplits.aspx?playerid=1101&position=OF&season=0


#55          (see all posts) 2010/09/21 (Tue) @ 21:14

Thanks Tango, you’re a gracious host.

Even I could scrape that data off for all players.  It would be awfully nice if everyone settled on a standard player id system, though.

And surely this has been done already, no?  For a garden variety compufool like myself ... this could take a while.  Twenty minutes of programming and 3 hours of debugging lie ahead, I fear.  I suspect that you, MGL, Colin, etc could surely mow this down in minutes.  One of you probably has already.  Am I wrong?


#56    tangotiger      (see all posts) 2010/09/21 (Tue) @ 21:37

MGL supplies the data to Fangraphs.


#57          (see all posts) 2010/09/21 (Tue) @ 22:52

Has he supplied the UZR home/road splits anywhere besides the individual player pages on Fangraphs?


#58          (see all posts) 2010/09/23 (Thu) @ 11:11

Tango

Regarding your point in comment 4 about using away games to eliminate stringer bias; I used the data from fangraphs that you linked to, the home road splits, 2006 through 2009 seasons.

I just used a PHP script to scrape the data.  Super simple and super slow, but it seemed to work.  Hopefully my data is correct.  I used UZR/150 as my criterion for fielder quality.

Anyhow, for 1B, 2B, SS AND 3B:

SEASON TO SEASON Pearson’s r (min 25 games played at the position, both home and away in both seasons):

Home—r = .33
Away—r = .19

So Colin Wyers’ point about stringer bias would seem to be a significant one.

.

Same thing now but for outfielders:

Home—r = .10
Away—r = .46

I have no idea why that would happen.  Go figure.  I’m sure that someone has checked this already and dug deeper.  I’d be interested in hearing their rationale.

The infielders results by this crude metric ... they don’t seem to be affected much by the cutoff point for games played at the position (in this case min 25 both home and away).  The Home r is about double the Away r in every case.

For outfielders it swings wildly.  The higher the bar is set for qualification ... the closer the home and away numbers get.

I’m confounded.  I’ll have to read through the UZR description in detail, I suppose.


#59          (see all posts) 2010/09/23 (Thu) @ 11:38

On the off chance that anyone is still reading this thread, the data I used is contained in an html table at this url.
http://timeonice.com/uzrdata0609.htm


#60    Tangotiger      (see all posts) 2010/09/23 (Thu) @ 11:55

Great stuff!  I love that we have more questions than answers.


#61    Colin Wyers      (see all posts) 2010/09/23 (Thu) @ 14:39

I took a bit of a different tack - I just looked at expected outs, which is where we should expect to see scorer bias. (And which, absent bias, should be independent of the player involved.) And I summed to the team level, to relieve me of needing any playing time cutoffs.

Year to year correlations on the outfield:

Pos    Away    Home
LF    .32    .17
CF    .17    .32
RF    .25    .36

That’s, uh, kind of odd, innit?


#62    BenJ      (see all posts) 2010/09/23 (Thu) @ 15:27

Colin/61, how does that account for GB/FB tendencies, or park differences?  It seems to me a FB staff will have a lot of outfield ExpO every year.  Or am I missing your point?

Even team-switching players bothers me.  We assume that things will even out across all players who switch teams, but can we be sure?

Good outfielders are most useful on fly ball staffs and in big ballparks.  Wouldn’t there be some tendency for players to move between teams that value them the most? 

One attendee at the Sportvision Summit voiced what I’ve been thinking:  year-to-year correlation isn’t a perfect test. 

Pulling a page out of an old book, what about comparing Defensive Metric X on the team level to Runs, or Wins?  That has its own set of problems, but it’s more bottom-line than year-to-year.

Just some thoughts… it’s a complicated issue.


#63    Colin Wyers      (see all posts) 2010/09/23 (Thu) @ 15:32

Colin/61, how does that account for GB/FB tendencies, or park differences? It seems to me a FB staff will have a lot of outfield ExpO every year. Or am I missing your point?

I don’t know as I made a point, exactly. I’m noodling with the numbers.

And you’re right, we should expect to see some y-t-y correlation with expected outs. (How much? I dunno.) But we should, in the aggregate, see high-K teams behave as such home and road, and same with high-FB teams. It’s the gap between the home and away correlations that’s interesting, to me. I can’t explain the LF data, at all. And I don’t know how much of those differences in correlations are significant.
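
One rough way to put a number on whether a home/away gap like that is significant is Fisher's z transformation. The sketch below assumes 90 team-season pairs per correlation, which is a guess and not something stated in this thread, and it treats the home and away samples as independent, which they are not quite, since they are the same teams.

```python
from math import atanh, erf, sqrt


def compare_correlations(r1, n1, r2, n2):
    """Fisher z test for the difference between two independent correlations."""
    z1, z2 = atanh(r1), atanh(r2)
    standard_error = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    # Two-sided p-value from the standard normal distribution
    normal_cdf = 0.5 * (1.0 + erf(abs(z) / sqrt(2.0)))
    p_value = 2.0 * (1.0 - normal_cdf)
    return z, p_value


ASSUMED_PAIRS = 90  # hypothetical sample size, not taken from the thread

# LF correlations quoted in comment 61: away .32, home .17
z_lf, p_lf = compare_correlations(0.32, ASSUMED_PAIRS, 0.17, ASSUMED_PAIRS)
print(z_lf, p_lf)
```

With a sample of that assumed size, the LF gap does not reach the usual 5% threshold, so the oddity could simply be noise.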


#64    Colin Wyers      (see all posts) 2010/09/23 (Thu) @ 17:44

Pulling a page out of an old book, what about comparing Defensive Metric X on the team level to Runs, or Wins?  That has its own set of problems, but it’s more bottom-line than year-to-year.

I probably should have responded to this as well - it turns out that what we’re trying to do with defensive metrics doesn’t really tell us anything about team wins in the retrospective sense.

That’s going to turn some heads, so I’ll clarify.

At a team level, what matters in terms of wins is:

* How many BIP there are,
* How many of them are outs, and
* The base advancement value of the non-outs (single, double, triple).

That’s it. The first obviously isn’t a function of defense at all, but the other two are to some extent (how much is the question under consideration).

Where all of the interesting questions about defensive metrics come in is the split of responsibility for those last two categories. You have two functional reasons for trying to split credit:

* Splitting responsibility between fielders as a unit and “batted ball distribution” (which you can either assign to pitchers, or simply ignore, depending on your feelings there), and
* Splitting responsibility between fielders who are part of the same unit.

The easiest way to come up with a defensive metric that is in high agreement with team runs/wins is to ignore those things. Seriously - we can count plays made and BIP, right? That gives us DER - maybe wDER corresponds a bit better to team wins, but we can do that pretty easily too. And we can count that at the player level, too - so and so made so many plays, and was on the field for so many BIP.

So you have these nine “buckets” (10, I guess, if you distinguish between “pitcher as pitcher” and “pitcher as fielder”), and plays are pretty easy to figure out (except for that “pitcher as pitcher” bucket - it’s a significant piece of the puzzle, but I’d suggest ignoring it for now if that helps you picture it better). The question is really how you want to parcel out the BIP. And so long as when you add it all back together you still have the same amount of BIP at the team level, there’s no relationship between the quality of the metric and the correlation with team runs/wins!
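
A toy illustration of that last point, with invented numbers and a single flat out rate standing in for a real expected-outs model. Two very different ways of parceling the same BIP among three fielders give different player ratings but the identical team total.

```python
LEAGUE_OUT_RATE = 0.70  # assumed flat rate, a simplification

# Hypothetical plays made, which are easy to count
plays_made = {"SS": 450, "2B": 420, "3B": 300}

# Two hypothetical ways of assigning the same 1660 team BIP to fielders
allocation_a = {"SS": 620, "2B": 600, "3B": 440}
allocation_b = {"SS": 700, "2B": 560, "3B": 400}


def plays_above_expected(plays, bip, out_rate):
    """Per fielder: plays made minus expected plays on the BIP assigned to him."""
    return {pos: plays[pos] - out_rate * bip[pos] for pos in plays}


rating_a = plays_above_expected(plays_made, allocation_a, LEAGUE_OUT_RATE)
rating_b = plays_above_expected(plays_made, allocation_b, LEAGUE_OUT_RATE)

team_a = sum(rating_a.values())
team_b = sum(rating_b.values())
print(rating_a, team_a)
print(rating_b, team_b)
```

The shortstop goes from well above average to well below depending on the allocation, while the team figure never moves, so agreement with team totals cannot tell the two allocations apart.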

At the macro level, what you have is your measure of team defense and your measure of team pitching on batted balls - in UZR’s case you have PZR, in PMR you have predicted outs behind pitchers:

http://www.baseballmusings.com/archives/030232.php

(The lack of team PZR on Fangraphs is probably the biggest reason you’d see a significant deviation between a team’s third-order win percentage and their team fWAR plus rep-level wins, incidentally.)

If you’re omitting the “pitcher”/“batted ball distribution” part of the equation, the defensive metric you will see have the greatest relationship with team runs allowed is the one with the smallest share of responsibility being assigned to P/BBD - regardless of the real importance of P/BBD. If you are including P/BBD runs in your analysis, then unless someone is flat out screwing up the math, you shouldn’t see any significant difference in agreement with team runs/wins.

What we really need to pay attention to is the distribution of BIP credit (or to follow the lead of most defensive metrics, the concept of expected outs/predicted outs/etc.). The two questions that we have to resolve in order to ensure that defensive metrics are “sound” are:

* Does a player’s environment have an impact on his expected outs that is not related to the actual difficulty of his fielding chances?
* Does a player’s own abilities on defense have an impact on his expected outs?

Team runs allowed doesn’t tell us anything about that.


#65    Tangotiger      (see all posts) 2010/09/23 (Thu) @ 17:58

Finally.  I’ve been hoping someone would articulate this, as I’ve seen the point brought up time after time.

If you want to get ridiculous, just do Runs allowed by PO+A, times player PO+A.  You get 100% correlation.  I don’t know what the y2y correlation is, but I’ll bet you it’s more than 0.


#66          (see all posts) 2010/09/23 (Thu) @ 18:04

I don’t know if this is what Ben meant, but using fielding metrics projected to the following year and testing against team defense in that following year should tell us something worthwhile.

This is what Rally did here for one year:
http://www.hardballtimes.com/main/blog_article/evaluating-defensive-projections/


#67    Peter Jensen      (see all posts) 2010/09/23 (Thu) @ 18:28

Everything that you say in post #64 is a well-reasoned description of the fielding metric conundrum.  My feeling, and this may have been Ben’s point as well, is that agreement with team runs is a necessary but not sufficient requirement of any fielding metric.  One would still have to make some reasonable estimate of the allocation of responsibility between pitcher and fielders as a group, but having the total of individual fielders’ runs way out of whack with total team runs allowed minus pitching runs allowed would be a credible sign that a fielding metric has fundamental problems.


#68    Tangotiger      (see all posts) 2010/09/23 (Thu) @ 19:17

But we don’t insist on that for hitting, do we?  You can have team runs scored that is 50 or 80 runs more or less than the component runs scored (say BaseRuns) and we don’t bat an eyelash.

The reason is sequencing.  If you have a team that gives up a high OBP or SLG with men on base, then naturally the runs scored will be much higher than the component version.

At the least, if you want to make it “Add up”, it has to be against the component runs allowed, not the actual runs allowed.


#69    joe arthur      (see all posts) 2010/09/24 (Fri) @ 08:46

I think Peter and Tango add excellent clarifications to Ben’s suggestion: reconciling to “runs” is a necessary but insufficient condition to have a good quality fielding metric, but assuming you’re ignoring sequencing, that means component runs. But that’s not quite the end of it. Colin alluded to park effects at the end of #64, and depending on how and where a metric brings in park adjustments, that might have to be amended to park adjusted component runs - and this takes us further away from testing against the objective standard of runs or the nearly objective standard of component runs ...

Didn’t someone write an article a year or two ago experimenting with extending the “Value Added” approaches of Gary Skoog, Mark Pankin and Tom Ruane to fielding? I can’t remember enough details and can’t find it quickly. That’s a very different framework which would have the advantage of reconciling to runs.


#70    Tangotiger      (see all posts) 2010/09/24 (Fri) @ 09:05

Joe makes the necessary point as well.  Basically, any adjustment you do to one side of the equation, you have to do to the other side of the equation as well.


#71    Peter Jensen      (see all posts) 2010/09/24 (Fri) @ 10:12

Joe - I calculate the run value for my BZM fielding metric several ways, one of which is using run value added.  I may have mentioned that in the 3 part description of BZM at THT and that may be what you are remembering.  I also use run value added for the DIRVA pitching metric that I described last winter in a THT article.  That pitching metric divides the pitcher’s contribution into DIPS and hit ball components and compares each individual pitcher’s hit ball component to the average team fielding rates for the year.  So in theory I have a system that has a pitching component unaffected by fielding, a team average fielding component, and a pitching component relative to the team fielding average, all of which reconcile closely to actual team runs allowed.

