
THE BOOK--Playing The Percentages In Baseball

Friday, July 02, 2010

WAR, or WARP if you must… but not VORP

By Tangotiger, 09:59 PM

VORP ignores fielding, which means that VORP is not a “total” type of stat.  WARP is such a stat.

I’ve seen many articles like this that use WAR, good articles at Fangraphs or ESPN or Beyond The Boxscore.  In this day and age, ignoring fielding is really being lazy (or ignorant).  (I would not use WARP, because of its reliance on FRAA.)

Anyway, WAR or WARP, not VORP.


#1    JD      (see all posts) 2010/07/02 (Fri) @ 22:18

Playing Devil’s Advocate a bit, but how reliable is WAR when the one-year reliability of the fielding stat is questionable?

In other words, might it not be better to look at VORP to judge offense then look at defense separately?

I actually don’t do this. I always go to WAR. But I’ve been wondering how smart this is when even the best metrics (UZR for instance) can have wild one-year fluctuations.

Has anybody considered some sort of multi-year UZR calculation for WAR? Is this a really dumb idea?


#2    Kincaid      (see all posts) 2010/07/02 (Fri) @ 23:19

In other words, might it not be better to look at VORP to judge offense then look at defense separately?

That’s basically what WAR does.  It measures offense and defense separately, and then only as a final step adds them up to get a total contribution.  You have to combine them at some point if you are going to consider defense at all, so that’s not really any different from if you used VORP for offense and then looked at defense separately.

The only major differences I think between doing that with VORP and how WAR does it is that VORP puts the positional adjustment in the offense and bases it on offensive production, and I think they handle replacement level a bit differently.  Other than that, looking at VORP for offense and then looking at defense separately would more-or-less just be another form of WAR, as long as you are still looking at defense in addition to VORP and not just VORP by itself with no defense at all.


#3    minesweeper      (see all posts) 2010/07/02 (Fri) @ 23:38

ZR: UZR :: VORP: WAR

There’s just no reason to use ZR/VORP today.


#4    MGL      (see all posts) 2010/07/03 (Sat) @ 00:30

As long as you regress UZR properly there is no problem using it over any time period.  The same is true for offense of course. For some reason, we seem comfortable quoting unregressed offensive numbers but not defensive ones.  Again, the reason is that a sample of offense is what really happened, but not so for advanced defensive metrics.  But, that is a terrible reason not to regress offense (unless you are talking strictly about the past, and even then...). Unless you want to give the same credit to a batter for a 20 hopper through the 5 hole as you do for a screaming line drive up the middle.

I have always said, I don’t ever want to hear a sample number in a discussion about someone’s true talent or his future performance.  Ever.  Unfortunately, you hear it all the time, even from saber-friendly journalists like Rob Neyer.  If I had a nickel for every time I read that so-and-so should not be playing because he has hit X this year, or that so-and-so SHOULD be playing because he has hit Y this year, I would have lots of nickels yet continue to be nauseated.

Now, it is better to regress offense and defense separately and THEN combine them, but it is also better to regress K, BB, HR, and the rest of offense separately, rather than combine them into wOBA or OPS or RC and then regress.  So it all depends on how lazy you want to be.
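A minimal sketch of what regressing a sample number looks like, assuming the common shrinkage form where the sample gets weight n / (n + k). The function name and the constant k=400 are made up for illustration; the right constant depends on the metric and is not given anywhere in this thread.

```python
def regress_to_mean(observed, n, k, mean=0.0):
    """Shrink an observed value toward a population mean.

    n is the sample size (opportunities). k is the number of opportunities
    at which the sample would get 50% weight (hypothetical, metric-specific).
    """
    weight = n / (n + k)  # share of the estimate that comes from the sample
    return mean + weight * (observed - mean)


# The same observed +10 runs over two sample sizes (k=400 is illustrative only).
half_season = regress_to_mean(10.0, n=200, k=400)
three_years = regress_to_mean(10.0, n=1200, k=400)
print(round(half_season, 2), round(three_years, 2))
```

The `mean` argument could be the league average or some prior estimate for the player; the code does not settle which one is appropriate.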


#5    David      (see all posts) 2010/07/03 (Sat) @ 09:56

MGL, are you saying that if we’re going to use WAR that we should use a regressed UZR along with regressed batting stats?  I’m comfortable using current batting stats in WAR because, as you said, it’s what really happened even if there was luck involved.  Should season totals of UZR be regressed before calculating WAR and if so, how much should they be regressed?  What about half a season like where we are right now.  Do we regress to average or to the player’s career numbers if he has a large enough sample?  If we know a player over a large sample has been worth 10 runs on fielding and it’s halfway through the season and he’s at 5, it doesn’t make sense to me to regress that 5 with average, but I don’t know if that’s the correct way of doing so or not.


#6    Colin Wyers      (see all posts) 2010/07/03 (Sat) @ 12:09

As long as you regress UZR properly there is no problem using it over any time period.

In the immortal words of Tom Tango, summary opinion without evidence is bullshit.

Let’s list out some presumably testable claims here:

1) Using UZR to measure defense over a sample period is x runs more accurate than TotalZone.

2) Using UZR to measure defense over a sample period is y runs more accurate than FRAA.

3) Using UZR to measure defense over a sample period is z runs more accurate than simply assuming all players are average defenders for their position (the VORP assumption).

Can anyone tell me what the values of x, y and z are, relying on evidence and not a thought experiment? I’m betting they can’t.


#7    tangotiger      (see all posts) 2010/07/03 (Sat) @ 12:42

Colin’s point is made that if you need to regress more than 50%, and if you are going to do 100% UZR (or TZ or FRAA) or 0% those stats, then you are better off using NO FIELDING.

This is the same argument of BABIP, that if you have decided to do 0% or 100% (if you limit yourself to those 2 choices), then 0% is preferable.
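The 0%-or-100% argument can be checked with simple arithmetic. This is a sketch under two assumptions that are not stated in the post: the noise is independent of true talent, and the metric has no systematic bias. The function name is hypothetical.

```python
def all_or_nothing_errors(reliability, observed_variance=1.0):
    """Mean squared error of two crude estimates of true talent.

    reliability is the share of observed variance that is real talent,
    so the proper regression amount is 1 - reliability.
    """
    var_true = reliability * observed_variance
    var_noise = (1.0 - reliability) * observed_variance
    mse_use_100_percent = var_noise  # take the stat at face value
    mse_use_0_percent = var_true     # assume everyone is average
    return mse_use_100_percent, mse_use_0_percent


for r in (0.3, 0.5, 0.7):
    full, none = all_or_nothing_errors(r)
    choice = "0%" if none < full else "100%" if full < none else "tie"
    print(r, choice)
```

When the stat needs more than 50% regression (reliability below 0.5), ignoring it has the smaller error of the two choices.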

***

That said, in terms of past accounting, you have to decide if the monkey throwing darts 20 years ago landing on MSFT and CSCO would “deserve” his million bucks or not.  That is, he has the cash in his bank account, but do we regress that to almost 0 because he was lucky to have picked Microsoft stock?


#8    Colin Wyers      (see all posts) 2010/07/03 (Sat) @ 12:50

Tom, this simply isn’t a question of “luck” or “deserving.” This is a question of how well any of these defensive metrics (including the ones I can claim responsibility for, due to my working for BP) actually represent the underlying reality.

And this isn’t a question of regression, either. You can’t solve this with regression! You’ve got three terms to explain observed variance:

VarO = VarTrue + VarRand + VarBias

What we want, if we want to know true talent, is simply the first term, with the other two excluded. If we want “value” or “performance” or whatever, what we want is the first two.

What we GET, if we regress fielding metrics, is:

VarTrue + VarBias

Regressing to the mean doesn’t address bias at all! Neither does increasing the sample size!

And I think we absolutely have to quantify these things - provide actual proofs that what our measurements say and what is actually going on has a meaningful correspondence at the individual player level - BEFORE we start telling people that “you have to use a fielding metric, preferably UZR, to evaluate a player.” There is NO EVIDENCE for that statement. And if I’m wrong - show me the evidence.
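A small simulation of the three-term decomposition, as a sketch only. Every number here (the standard deviations, the player count, the function name) is invented for illustration, and the three components are assumed to be independent and normally distributed.

```python
import random
import statistics


def observed_variance(n_plays, n_players=20000, sd_true=5.0, sd_bias=4.0,
                      sd_noise=30.0, seed=0):
    """Variance across players of a rating built from n_plays each.

    Each player gets a true talent, a systematic measurement bias that does
    not depend on sample size, and random noise that shrinks with n_plays.
    """
    rng = random.Random(seed)
    observed = []
    for _ in range(n_players):
        true_talent = rng.gauss(0.0, sd_true)
        bias = rng.gauss(0.0, sd_bias)
        sampling_noise = rng.gauss(0.0, sd_noise / n_plays ** 0.5)
        observed.append(true_talent + bias + sampling_noise)
    return statistics.pvariance(observed)


small_sample = observed_variance(n_plays=100)
large_sample = observed_variance(n_plays=10000)
print(round(small_sample, 1), round(large_sample, 1))
```

With these made-up inputs the true-talent variance is 25, but the large-sample variance should settle near 25 + 16 rather than 25: the random term fades with more plays and the bias term does not.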


#9    Colin Wyers      (see all posts) 2010/07/03 (Sat) @ 12:56

Because this entire line of reasoning is proceeding from a logical fallacy:

1) We should be measuring defense when we evaluate baseball players,

2) UZR (or TZ or FRAA) purport to measure defense,

3) Therefore we should use UZR (or whatever) to evaluate baseball players.

The first claim is undoubtedly true. But in order to prove three you first have to prove number two. Do we know that UZR is, in fact, measuring individual player defense? And how well is it doing it? Is it actually doing it better than TZ or FRAA?


#10    MGL      (see all posts) 2010/07/03 (Sat) @ 13:03

David and Colin, I don’t really get the question.  What are you using WAR for?  And what “reality” are you trying to capture?  On offense if you want to capture the “reality” of how many singles, doubles, etc., a player actually got, then of course you use some unregressed offensive stat.  But even then, if you are using lwts or wOBA as your offensive stat, you are taking the “reality” of singles, doubles, etc., and turning them into “theoretical runs,” right?  Is that “reality?”

And what if the “reality” you are trying to capture is how well the batter actually hit the ball and not necessarily what he got out of it, like a line drive out is better than a bloop hit?

So, for defense, what kind of reality are you trying to capture?  When I say that UZR does not really capture reality, that isn’t really true.  To some extent it captures the reality of what it does, which is to simply record how often each player turns a certain kind of batted ball into an out.  That’s reality!  The fact that those “catch rates” get turned into “theoretical runs” is the same as offensive singles, doubles, etc. getting turned into theoretical offensive runs in lwts or wOBA!

If you want to talk about “reality” on defense the same way you talk about “reality” on offense, you better just use fielding percentage and forget about UZR, Totalzone, ZR, range runs, FRAA, etc.

Bottom line for me is that I don’t ever talk about sample performance.  To me, I ALWAYS talk about regressed performance if all I have is a certain sample, or long-term performance if I have a larger sample. Talking about how a player has done “this year” is never interesting to me. Ever.  Offense or defense.

And if you want to talk about performance in an “MVP” type of manner (only focusing on a small sample of performance), it is NEVER clear what the proper stats are to use (RBI, wins, ERA, RBI/Runs, etc.). Never.


#11    Colin Wyers      (see all posts) 2010/07/03 (Sat) @ 13:11

MGL, I’ll make it as easy for you as I can.

You define what UZR is supposed to measure. You define the time period it’s supposed to measure it in. And then you tell me how accurate it is at measuring it. That’s all I’m asking.


#12    MGL      (see all posts) 2010/07/03 (Sat) @ 15:47

It simply measures how often a certain kind of ball is caught as compared to an average fielder at that position.  You know that.  How “accurate” is it?  How do you want me to respond to that?  We don’t know exactly where the ball was hit or how hard it was moving, and we don’t know where the fielder was starting and why, and we don’t know how hard the ball was to catch, e.g., bad hops, spin, etc.  The larger the sample, the more all of those things will even out. That is why we regress starting with 1 opportunity and extending to an infinite number of opportunities.  What is your point?

On offense, it is exactly the same thing. As I said, the fact that a player did in fact get a walk or a certain kind of hit is not interesting to me at all.  Not even 1% interesting. All I am interested in is what he is going to do in the future and what his true talent is in the present (and the future).  And for that, any offensive stat is EXACTLY like UZR.  Despite the fact that all batted balls get classified as outs, hits, errors, etc. for the batter, we don’t know how hard or where the ball was hit, how hard it was for a fielder to catch, etc.  In many senses UZR is much more accurate than any offensive metric, at least the ones that don’t use batted ball information, which is most of them. The fact that we know the classification of a PA for the batters tells us something of course about the batter’s true talent, but not a whole lot in the short run…


#13    terpsfan101      (see all posts) 2010/07/03 (Sat) @ 15:53

"I would not use WARP, because of its reliance on FRAA.”

Are you referring to Davenport’s fielding runs, or any defensive metric that only uses basic fielding statistics? Fielding runs can be almost as reliable as the advanced defensive metrics if they are adjusted properly. Rally uses fielding runs in his WAR metric for the non PBP years. Last year I devised some fielding runs formulas from play by play data. I was going to see how well they correlated with UZR, but never got around to it. As far as I know nobody has tested fielding runs against the play-by-play metrics.


#14    Colin Wyers      (see all posts) 2010/07/03 (Sat) @ 16:07

MGL, this isn’t supposed to be a trick question. Tom is telling us that we should be using UZR over FRAA to evaluate player defense. I am asking how much more accurate UZR is than FRAA.

(And to address what terps said - FRAA as published on BP right now uses some PBP data, as described in the ‘09 book, from 2005 on.)


#15    terpsfan101      (see all posts) 2010/07/03 (Sat) @ 19:04

My problem with using WARP boils down to three things:

1. How are the fielding runs calculated? I know what types of adjustments are being made, but I don’t know the details.

2. How are the positional adjustments calculated? Are they based on 3 years, 5 years, 10 years worth of data? Are they based entirely on offensive production?

3. How is replacement level calculated?

The same 3 points also apply to Rally’s WAR. However, I trust Rally more than I trust BP.


#16    tangotiger      (see all posts) 2010/07/03 (Sat) @ 19:10

"As far as I know nobody has tested fielding runs against the play-by-play metrics. “

I hate it when people say “as far as I know”.  If it’s Thomas Edison talking about light, that’s meaningful.  Otherwise, why would terps need to be the barometer to test anything.  (No offense to terps… just ANYBODY who says that… me included.)

Anyway, I think Justin did that test, among many other people.

***

And FRAA is a subset of UZR.  There’s no new information that FRAA uses that UZR doesn’t already use.

FRAA is also probably a subset of TZ.

So, it makes no sense to use FRAA, anymore than you would use BA instead of OBP or SLG.


#17    Colin Wyers      (see all posts) 2010/07/03 (Sat) @ 19:34

Yes, Tom, and if STATS, BIS and MLBAM all had disagreements over how many walks a player had that were this serious, we’d probably be having this conversation about OPS.

I mean seriously, it’s a simple question. If UZR is more accurate than FRAA, how much more accurate is it? And if you can’t answer it, why do you think it’s more accurate?


#18    Chris Dial      (see all posts) 2010/07/03 (Sat) @ 19:49

My problem with using WARP boils down to three things:

1. How are the fielding runs calculated? I know what types of adjustments are being made, but I don’t know the details.

How does this not apply to UZR?


#19    Rally      (see all posts) 2010/07/03 (Sat) @ 19:59

It’s a simple question only if you have a standard to compare to.  How much closer does UZR meet this standard than FRAA?

If we could answer this question, we’d use that higher standard to measure defense, and have no need of UZR or FRAA, or any of the others for that matter.


#20    Colin Wyers      (see all posts) 2010/07/03 (Sat) @ 20:05

Well back it out to the team level, Rally, like we do with offense. How accurate is UZR at measuring team defense?


#21    Rally      (see all posts) 2010/07/03 (Sat) @ 20:15

That actually is not a good question.  A stat like TotalZone basically starts with team defense and then apportions the credit/blame to the individuals.  By definition it will be very accurate on team totals, that does not mean it is as accurate for individual players.

What I have suggested in the past is to take defensive projections derived from various systems, and use those to predict future team DER based on who actually winds up playing for each team.  But nobody’s ever taken the challenge, and I don’t have the time to do it myself.

Colin, On BP player pages how can I find past fielding ratings?  I tried Derek Jeter’s page, and I get his WARP which I assume includes FRAA (If I knew what you guys were using for replacement level, the exact offense rating, position adjustments I could back calculate it, but that is a ton of unknowns).
The historical stats section has last 3 years fielding ratings.  I look up Lou Gehrig and you’ve got WARP, but no breakdown.  Are these available on the site?


#22    Chris Dial      (see all posts) 2010/07/03 (Sat) @ 20:26

What I have suggested in the past is to take defensive projections derived from various systems, and use those to predict future team DER based on who actually winds up playing for each team.  But nobody’s ever taken the challenge, and I don’t have the time to do it myself.

This is actually something I worked on in the off-season.  Basically I calculated what a pitcher did in front of his specific defense, and how that would be affected in front of average defense *based on his specific BIPs*.  Then you can change the position players, subbing in and out different quality of defenders and see how that would impact a given pitcher’s RA.


#23    terpsfan101      (see all posts) 2010/07/03 (Sat) @ 20:47

Tango, Justin’s study only used one year of data and was limited to 250 players. Plus his conclusions only apply to Davenport’s Fielding Runs. We still don’t know how well Palmer’s fielding runs, Rally’s total zone (non PBP), David Gassko’s range, Dan’s simple fielding runs, my fielding runs, and Colin’s simple zone rating correlate with the PBP systems.


#24    terpsfan101      (see all posts) 2010/07/03 (Sat) @ 20:53

It looks like the detailed information that used to be included on the BP player cards has been removed. FRAA used to be listed on the player cards.


#25    Chris Dial      (see all posts) 2010/07/03 (Sat) @ 21:02

We still don’t know how well Palmer’s fielding runs, Rally’s total zone (non PBP), David Gassko’s range, Dan’s simple fielding runs, my fielding runs, and Colin’s simple zone rating correlate with the PBP systems.

Not exactly.  Of course, that’s *my* DRS.  It also isn’t Rally’s TZ, but its precursor.


#26    terpsfan101      (see all posts) 2010/07/03 (Sat) @ 21:55

Chris, your Dr. Strangeglove is based on Stats ZR, which is figured from PBP data. I was only referring to defensive metrics that use basic fielding statistics (PO, A, E, DP) for their inputs. Interestingly, Rally found that ZR correlated better with UZR than TZ did. Of course, this was 2 years ago. Rally has since modified the way he calculates total zone.


#27    Chris Dial      (see all posts) 2010/07/03 (Sat) @ 22:34

terpsfan,
I understand, but the grid compares Gassko’s Range as well, which is traditional, and the Rally stat there is traditional.


#28          (see all posts) 2010/07/03 (Sat) @ 22:53

What is this reality stuff?  You are trying to measure a player’s true fielding ability, with X uncertainty, and to do this you say you need 3 years data.  That data is simply an accumulation of all the data on balls hit to a player’s position over 3 years.

If the 3 year UZR has any value, then so does the SSS data.  If SSS has no value, even after regression, then 3 year UZR does not have any value either.  Can’t have it both ways, even if you are not interested in SSS data or how a player has fielded this year. 

And BTW, after 3 years fielding data is collected, the player is older, and thus UZR may not be that predictive.  All it can do really is say over the past 3 years the player was good/bad or whatever.

While a 40 game sample of UZR tells us nothing of a player’s ability, it does tell us that over say 40 games the number of balls fielded at his position, or not fielded, is either above or below the league average, within the limits of the uncertainty in the data being collected.  That’s useful IMHO.  So if someone says Beltre has been terrible defensively and the UZR says otherwise, as do the eyes, then UZR can support such an argument within the limits of its estimated uncertainty and assuming UZR is not underestimating the impact of errors.

In terms of MVP, “what happens” means producing actual runs.  I know of no stats other than the HR and RBI, or run scored, that actually change the scoreboard every time, and not considering RBI’s in an MVP vote is beyond silly. 

Theoretical estimators should not be used alone to determine who is the MVP.  A walk does not always contribute to a run for example. 

I would like someone to come up with a stat called the assist that is tallied when a productive out, hit or walk contributes to an ACTUAL run scored without an RBI or run scored for the batter.

UZR and DRS are fine for 1 year fielding for MVP purposes as we do not have play by play fielding data, unlike offensive stats.

If you are trying to vote for the player with the most ability, or who had the best year, as opposed to being the MVP, then you can use the theoretical offensive estimators and discard the RBI, but even these have uncertainty.  A batter hitting in the Orioles lineup has a much tougher go of it than a batter hitting in the Red Sox lineup that wears pitchers down (not to mention facing Orioles pitching instead of Red Sox pitching).  Hard to compare ability of 2 batters hitting in such different lineups/SOS.


#29    Rally      (see all posts) 2010/07/03 (Sat) @ 23:38

"If the 3 year UZR has any value, then so does the SSS data.”

What the heck is SSS data?


#30    Colin Wyers      (see all posts) 2010/07/04 (Sun) @ 00:13

For any DT stuff that isn’t in the cards or the sortables, you can typically find it by poking around in here somewhere:

http://www.baseballprospectus.com/statistics/eqa2010.shtml

That’s if you’re interested in just browsing around. If you’re looking for a data dump, just shoot me an e-mail.


#31    minesweeper      (see all posts) 2010/07/04 (Sun) @ 00:19

pft, the problem lurking behind your post is that you elevate your own personal criteria for MVP to that of objective platform.  When you say “In terms of MVP,” you really should say instead, “In terms of *MY* MVP"…

I personally don’t reference RBI ever.  I honestly don’t think I’ve sought out RBI values in...well, since I began to learn SQL and would ORDER BY RBI.  So I’d say two years.  Research and logic tell me that RBI is a byproduct of what I would like to actually measure.  It’s a statistic filled with noise and clutter.  If I ignore it when evaluating a player’s ability, then I likewise should ignore it when evaluating a pool of players for MVP candidacy.  It seems contradictory to accept RBI as a polluted statistic and then use - no matter how insignificantly - it when reviewing the qualifications of various MVP candidates.

Now that is my opinion.  Others, seemingly like you, hold a conflicting opinion that traditional metrics should be considered because an MVP vote should go toward those players who contributed the most Real Runs, even if a not-too-small percentage of those Real Runs resulted from batted ball luck, timing, and miscellany outside of one’s control.  Because “MVP” is not clearly defined, and because “value” means so many different things to different people, these two viewpoints must coexist in one of the truly rare moments when it is actually correct to assert “well, that is just my opinion.”

Of course, for most people that just means they haven’t fully considered what they’re even voting for.  For most people “my opinion” in this context means that they value some criteria that they won’t consistently apply historically or even topically.  Like one writer may not vote for Hanley Ramirez because Hanley’s team did not make the playoffs, and then this same writer will turn around and vote for Prince Fielder because Fielder clubbed 40+ HR.


#32    minesweeper      (see all posts) 2010/07/04 (Sun) @ 00:21

addition:

...to sum, you say “If you are trying to vote for the player with the most ability, or who had the best year, as opposed to being the MVP...”. In my mind, the player who had the best year IS the MVP…


#33    Rally      (see all posts) 2010/07/04 (Sun) @ 00:30

Colin, didn’t find anything there for defense. Say I want to find how many runs Kirby Puckett saved by FRAA in 1988.  Am I SOL?

I have no interest in his vorp or warp without the components broken out.


#34    Colin Wyers      (see all posts) 2010/07/04 (Sun) @ 00:35

It’s there, but we aren’t exactly making it easy to find. (I essentially had to guess the URL of the old player card to find it.)

http://www.baseballprospectus.com/dt/puckeki01.shtml

He had 3 FRAA for ‘88.


#35    terpsfan101      (see all posts) 2010/07/04 (Sun) @ 00:52

Thanks for finding the old DT cards. The BP playerID’s match up to the lahman/BDB/baseball reference ID’s. Well they match up 99% of the time. So if you want to look up any player’s FRAA, follow Colin’s link and substitute the baseball reference ID.
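The substitution described above can be sketched as a one-line helper. It assumes the URL pattern in Colin’s link holds for other players, which the comment itself says is true only about 99% of the time; the function name is hypothetical.

```python
def dt_card_url(player_id):
    """Build an old DT player-card URL from a Lahman/Baseball-Reference style ID.

    The pattern is copied from the Kirby Puckett link posted in this thread;
    that it generalizes to every player is an assumption.
    """
    return "http://www.baseballprospectus.com/dt/{}.shtml".format(player_id)


print(dt_card_url("puckeki01"))
```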


#36    MGL      (see all posts) 2010/07/04 (Sun) @ 01:04

"I am asking how much more accurate UZR is than FRAA.”

I don’t know anything about FRAA, so I can’t answer that question, and in any case, I don’t really know how to answer a question like “How much more accurate is A than B?” I know what OPS is and I know what OBP is, but I don’t really know how to answer the question, “How much more accurate is OPS than OBP in terms of representing a player’s run-producing performance/ability?”

“A lot?”


#37    MGL      (see all posts) 2010/07/04 (Sun) @ 01:07

"SSS” is small sample size I think.

Using UZR (or any defensive metric) to predict team DER is not such a gold standard for evaluating UZR because DER is subject to park effects and to pitcher G/F ratios.

I think the best, by far and away, standard for evaluating a defensive metric is WOWY.  I am not exactly sure how to do that, but I think it is the best (and perhaps only) way…


#38    Nick Steiner      (see all posts) 2010/07/04 (Sun) @ 04:54

What I would like to see someone do is to take certain groups of defenders via UZR and compare that to WOWY with the same groups.  In other words, you look at players with UZRs of:

-20 to -15
-15 to -10
-10 to -5
-5 to 0, etc.

And see if WOWY shows that the -20 to -15 group actually saved that amount of runs.  You could even break it down to smaller groups if you wanted. 

Because you would be doing it leaguewide, I don’t think there would be sample size errors with the WOWY. This would seem like something pretty simple for Colin or MGL to do and it would, IMO, be the best possible test of UZR/TZ/FRAA.
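A rough sketch of the grouping test described above, assuming you already have a UZR value and a WOWY-based runs estimate for each player. The sample pairs below are invented placeholders, not real ratings, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def bin_floor(value, width=5):
    """Lower edge of the bin containing value (e.g. -18 falls in -20 to -15)."""
    return math.floor(value / width) * width


def calibration_by_bin(players, width=5):
    """Group (uzr_runs, wowy_runs) pairs by UZR bin.

    Returns {bin_lower_edge: (count, mean_uzr, mean_wowy)} so the two means
    can be compared group by group.
    """
    groups = defaultdict(list)
    for uzr, wowy in players:
        groups[bin_floor(uzr, width)].append((uzr, wowy))
    result = {}
    for lower in sorted(groups):
        rows = groups[lower]
        n = len(rows)
        mean_uzr = sum(u for u, _ in rows) / n
        mean_wowy = sum(w for _, w in rows) / n
        result[lower] = (n, mean_uzr, mean_wowy)
    return result


# Placeholder data only: (UZR runs, WOWY runs) per player.
sample = [(-18.0, -12.0), (-16.0, -10.0), (-7.0, -5.0), (-6.0, -1.0), (-2.0, 0.0)]
for lower, (n, mean_uzr, mean_wowy) in calibration_by_bin(sample).items():
    print(lower, lower + 5, n, mean_uzr, mean_wowy)
```

If the metric is well calibrated, the mean WOWY in each group should track the mean UZR in that group; how close is close enough is a separate question the sketch does not answer.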


#39    dq      (see all posts) 2010/07/04 (Sun) @ 11:00

"Using UZR (or any defensive metric) to predict team DER is not such a gold standard for evaluating UZR because DER is subject to park effects and to pitcher G/F ratios”

You should be able to make those adjustments (and any other that are required) and reconcile whatever defensive measure you have to DER.

DER is the most objective measure.
DER is what actually happened. How many balls were fielded out of how many possible.

I would love to see a reconciliation of UZR (or any other fielding metric) to DER; that would lay out what assumptions and adjustments are being done to UZR.


#40    MGL      (see all posts) 2010/07/04 (Sun) @ 14:47

"This would seem like something pretty simple for Colin or MGL to do and it would, IMO, be the best possible test of UZR/TZ/FRAA.”

The problem is that it doesn’t tell you much about the robustness of the metric. I can also create a metric that has everyone at zero and if we compare that to team DER, guess what?  It works perfectly!  I can create a metric that has all the above average defenders at +2 and all the below average ones at -2, and that will work also.

My guess is that all metrics will work if you compare it to DER.  The one with the largest spread is of course the most robust.


#41    MGL      (see all posts) 2010/07/04 (Sun) @ 14:49

The thing is, they all work because they are all based on sound methodologies and reasonably sound data.  They really are.  The ones that use the most granular data are the ones that are the best, it is as simple as that.

Now, the ones that use granular data and the ones that don’t will converge as the sample sizes get larger, although they may not completely converge depending on what data the less granular ones are using.


#42    Nick Steiner      (see all posts) 2010/07/04 (Sun) @ 17:02

MGL - I don’t understand your objection to my test.  I’m not advocating comparing to DER, I’m advocating testing whether or not players’ UZRs are reflective of how many runs they saved - in the aggregate.  Of course if you set everyone at zero then it would break down, but nobody’s doing that. 

The point Colin is making is that we don’t even know that they “work” just because the methodologies and data are sound. There needs to be an actual test.


#43    Colin Wyers      (see all posts) 2010/07/04 (Sun) @ 18:46

Yeah. That and I totally dispute the assertion that the data is sound. (I guess I also dispute the assertion that the methods are sound, too - if they were ALL sound there’d be more agreement, wouldn’t there?)


#44    Tangotiger      (see all posts) 2010/07/04 (Sun) @ 20:21

I mean seriously, it’s a simple question. If UZR is more accurate than FRAA, how much more accurate is it? And if you can’t answer it, why do you think it’s more accurate?

In order for me to answer it, I’d need to have a data dump the way MGL has provided it for me in the past.

retroid,yearid,teamid,pos,runs

If you want to do that, email me at my tangotiger.net address.

In any case, it has to be more accurate because FRAA is a subset of UZR.  I don’t know why you ask me the question whereby I give the same answer.  Are you disputing that FRAA is a subset?  If so, what is it that FRAA does that UZR does not do?


#45    terpsfan101      (see all posts) 2010/07/04 (Sun) @ 22:49

UZR 2002-2009:

http://spreadsheets.google.com/pub?key=tjzH1kXO74ZIhhUMz8pytRQ&output=csv

UZR 1999-2003:

http://spreadsheets.google.com/pub?key=tvdNP7RO_wccwxK1xBN2Gwg&output=csv

I don’t know where you are going to find FRAA. If you are serious about comparing UZR to fielding runs, then I will calculate fielding runs using the formulas I devised in this thread:

http://www.insidethebook.com/ee/index.php/site/comments/simple_zone_rating/


#46          (see all posts) 2010/07/04 (Sun) @ 22:55

In any case, it has to be more accurate because FRAA is a subset of UZR.

That’s not true.  By doing more, UZR or Plus/Minus could be losing accuracy over FRAA.  Doing more with the data does not ipso facto mean doing better.

Tom, doesn’t the fact that +/- DRS and UZR disagree so badly bother you quite a bit, when both are calculated on the same ball-in-play data from BIS?  Seeing the huge disagreements between the two systems pretty much sank my confidence in both systems.  I’m willing for my confidence to be restored, but no data I have seen so far has done that.

An explanation or examination of the methods of each system absent real data is not what is needed here.  That is not proving anything.  UZR and +/- both seem like well-designed systems.  I even like some of UZR’s construction better.  But that doesn’t seem to translate to better year-to-year correlations for UZR.  So good construction be damned, I want to see either good year-to-year correlations (at some position other than shortstop) or an explanation of why the two systems have such different results.  Simply taking it on faith that what seems like sound construction will lead to sound results is a recipe for bad sabermetrics.
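For anyone who wants to run the year-to-year check, here is a minimal sketch of a Pearson correlation between the same players’ ratings in consecutive seasons. The numbers are placeholders, not real UZR or +/- values, and the function name is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder ratings for four players in year 1 and year 2.
year1 = [5.0, -3.0, 0.0, 8.0]
year2 = [2.0, -1.0, 1.0, 4.0]
print(round(pearson_r(year1, year2), 3))
```

Running this per position and per system on real data would give the comparison asked for here.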


#47    MGL      (see all posts) 2010/07/05 (Mon) @ 00:26

"MGL - I don’t understand your objection to my test.”

I don’t have any objection.  But I already know the results, and, as usual, I am willing to put my money where my mouth is.  ZR, P/M, UZR, Total Zone, whatever.  They are all going to work when compared to a WOWY. I personally wouldn’t waste my time doing it.

As far as Dewan P/M and UZR, they do NOT horribly disagree.  They more or less agree. It does not bother me at all that they disagree some and even horribly disagree on a few.  It is the nature of the beast.  They both work fine. They have to, unless someone is making up code. The fact that there is some significant disagreement does not mean squat.  They both work fine.  I don’t know why some people are obsessed with the fact that they don’t agree as much as they would like.  I guess it is because they just don’t understand how they both work.  Or they just like to be contrarians. Probably both.  And the fact that some people quibble about the data is ridiculous. The data is not perfect, but it is just fine, given what it represents. Obviously we would like to have hang time and the like, but if you are quibbling about whether some line drives are really fly balls or ground balls, or things like that, you really have nothing better to do.  It doesn’t make any difference. It really doesn’t.


#48          (see all posts) 2010/07/05 (Mon) @ 00:39

As far as Dewan P/M and UZR, they do NOT horribly disagree.  They more or less agree.

Wow.  Speechless here…


#49          (see all posts) 2010/07/05 (Mon) @ 00:57

Okay, I’m regaining my capacity for speech here.

MGL, for claiming that you’re all about data and don’t make personal attacks, I find it very disappointing that your response on this issue devolves to claiming that Colin and I (1) don’t grasp basic statistics, (2) don’t understand defensive systems like UZR and +/-, (3) just want to be contrarian, and (4) expect the data to be perfect before we will use it.  That’s sad.  I would think we’ve earned a better response than that.  If you don’t feel in the mood to defend your system with facts, that’s fine, but please don’t descend to insults instead.

As for facts, for 2009, the RMSE for starting position players (>1200 defensive innings) between UZR and +/- DRS from the same BIS data set is 7.5 runs.  If you call that “more or less agree”, then we have a very different standard. 

It gets even worse when you start looking at fielders who switched teams, in which case the year-to-year correlation for both UZR and +/- DRS drops practically to zero for non-shortstop fielders.


#50    Tangotiger      (see all posts) 2010/07/05 (Mon) @ 01:02

That’s not true.  By doing more, UZR or Plus/Minus could be losing accuracy over FRAA.  Doing more with the data does not ipso facto mean doing better.

As I’ve stated many times in the past (but not in this thread), as long as the extra things being done are intelligent things, then UZR and +/- cannot lose accuracy.

Even if there is some (not a lot) of bias in the extra data being used (say like speed of batted ball), it will still help the PBP systems.

Tom, doesn’t the fact that +/- DRS and UZR disagree so badly bother you quite a bit, when both are calculated on the same ball-in-play data from BIS?  Seeing the huge disagreements between the two systems pretty much sank my confidence in both systems.  I’m willing for my confidence to be restored, but no data I have seen so far has done that.

It all depends on what things are done.  +/- seems to be mostly a subset of UZR.  That is, UZR has all the parameters of +/-, plus more.  Perhaps that’s not true any more, I don’t know.

It really is about how the run values for each zone are determined.  The first version of +/- did no smoothing whatsoever, simply treating each bucket at whatever level of granularity it decided on (very low, from what I remember), so that you ended up with very small buckets.

UZR had the opposite problem, creating too-huge buckets.

Shane’s system is better conceptually because it created a smoothing function.  That’s how I would do it.  I don’t know if UZR does that now or +/-.

So, I’m not surprised by any large disagreements on processing the same data.

It all depends on how you slice the data, and furthermore, what adjustments you apply for the additional parameters, like handedness, fb/gb tendencies, parks, etc.


#51          (see all posts) 2010/07/05 (Mon) @ 01:13

So, I’m not surprised by any large disagreements on processing the same data.

I didn’t ask if you were surprised.  I asked if you were bothered.  How can you use something that has such a big error margin around it?  Don’t you consider a +/-7.5 run error margin troubling?  That’s +/- $3 million a year, or something in that neighborhood, right?  Is that good enough?  Is it any better than you get with Total Zone or equivalent systems?

If you believe UZR is better than +/-, has anyone ever showed that?  My investigations have not shown one to be much better than the other.  (They both look equally poor.)


#52    Tangotiger      (see all posts) 2010/07/05 (Mon) @ 01:35

Not sure how to answer the “bothered” part. 

And I wouldn’t say they are “equally poor”.  If you look at 3-yr UZR, the results conform reasonably well to the Fans’ Scouting Report.  When I look at career WOWY (not that it was brought up here), there are no big surprises in rank (just in magnitude).

Even career FRAA does ok (which is why I would be fine with WARP, which is the lead-in to this thread).  This thread has diverged from its intent, which is to say that you need to include some fielding component.  Colin had a good argument (as I interpreted) that having no fielding is better than having an unregressed fielding component, because 0% is closer to the actual talent level of the underlying observed metric than 100% of the observed metric.  It’s a good enough argument, if very unsavory (basically, the FIP or xFIP argument).
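
The "0% versus 100%" argument can be sketched numerically. This is a minimal illustration, assuming the observed fielding number is true talent plus independent zero-mean noise; the variance figures are hypothetical and chosen only to show the mechanics. Under that assumption, ignoring fielding beats the unregressed number whenever the talent variance is smaller than the noise variance (reliability below 0.5), and regressing by the reliability beats both.

```python
def mse_of_estimate(weight, talent_var, noise_var):
    """Expected squared error when true talent is estimated as weight * observed.

    Assumes observed = talent + noise, both zero-mean (runs relative to
    average) and independent, so the error is
    (weight - 1) * talent + weight * noise.
    """
    return (1.0 - weight) ** 2 * talent_var + weight ** 2 * noise_var


def reliability(talent_var, noise_var):
    """Share of observed variance that comes from true talent."""
    return talent_var / (talent_var + noise_var)


# Hypothetical one-season variances in runs squared (not measured values).
talent_var, noise_var = 16.0, 36.0

r = reliability(talent_var, noise_var)
ignore_fielding = mse_of_estimate(0.0, talent_var, noise_var)  # 0% of observed
unregressed = mse_of_estimate(1.0, talent_var, noise_var)      # 100% of observed
regressed = mse_of_estimate(r, talent_var, noise_var)          # weight = reliability

print(f"reliability={r:.2f}  ignore={ignore_fielding:.1f}  "
      f"unregressed={unregressed:.1f}  regressed={regressed:.1f}")
```

With these assumed variances the ordering is regressed, then ignored, then unregressed, which is the shape of the argument attributed to Colin above.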


#53    Tangotiger      (see all posts) 2010/07/05 (Mon) @ 01:38

Unsavory in its overall implication, not in its premise, similar to the need for taxes.


#54    Colin Wyers      (see all posts) 2010/07/05 (Mon) @ 01:49

Lemme start off by saying - I really don’t understand some of what feels like personal acrimony on display here. (And this isn’t an isolated thing, either, which is why I stopped posting here for a while, and will probably stop posting here again shortly.) I’m trying to have a conversation about data and methods - it shouldn’t be personal.

Hitting a few of the highlights here…

In any case, it has to be more accurate because FRAA is a subset of UZR.  I don’t know why you ask me the question whereby I give the same answer.  Are you disputing that FRAA is a subset?  If so, what is it that FRAA does that UZR does not do?

I am disputing that, at least if you want to draw an equivalence to the difference between batting average and on-base percentage. The reason that’s a subset is that what you are doing is taking everything in AVG and adding one set of facts (walks) to that. And so that’s how OBP improves on AVG.

When it comes to fielding data, you have two kinds of data:

1) Facts. These are things that are falsifiable - how many plays a fielder made, for instance. So long as you’re agreeing on the definition of play made, you can come to an agreement on the number of plays made (until you start subdividing down to GB PM, FB PM, etc.)

2) Subjective data. This is things like batted ball classifications, hit location data ... some of it is potentially falsifiable, but as it stands nobody has been able to falsify it. (And some of it, like the batted ball types, is not and cannot be factual data. It is entirely subjective data.)

What does FRAA do that UZR does not do? It restrains itself to factual data, or at least more so than UZR does, and ignores subjective data, or again at least more than UZR does. (Some batted ball data is utilized, where it is available, but in a very coarse fashion.) It is possible that the subjective data in UZR (or DRS, or vanilla ZR, or whatever) improves upon the objective data used to calculate FRAA (or DSG’s Range, or James’ Defensive Win Shares, or what have you - ideally we’d have a fully play-by-play system that restrains itself wholly to factual data so that we can compare apples to apples; I may take that on here in the near term). But it is not certain that it does.

As I’ve stated many times in the past (but not in this thread), as long as the extra things being done are intelligent things, then UZR and +/- cannot lose accuracy.

Even if there is some (not a lot) of bias in the extra data being used (say like speed of batted ball), it will still help the PBP systems.

Okay, but… is that true? And if it is true, how much is “some (not a lot) of bias?” How much is a lot of bias? And how much bias is present in the data? And even if there is not a lot of bias in the data, how do we know that the extra things being done to it are intelligent? And how do we know that intelligence is sufficient or even necessary?

Because let’s go ahead and concede that MGL is an intelligent person who has a lot of experience with the data. And let’s concede that John Dewan is an intelligent person who has a lot of experience with the data. But what we see is that intelligence doesn’t cause them to converge upon very similar methodologies. And so it seems to me that you can apply a lot of intelligence to the problem without coming to a consensus on what should be done.

It all depends on what things are done.  +/- seems to be mostly a subset of UZR.  That is, UZR has all the parameters of +/-, plus more.  Perhaps that’s not true any more, I don’t know.

...

So, I’m not surprised by any large disagreements on processing the same data.

It all depends on how you slice the data, and furthermore, what adjustments you apply for the additional parameters, like handedness, fb/gb tendencies, parks, etc.

And what it comes down to - without a standard of accuracy, some sort of benchmark, we have no way of knowing which adjustments are correct. So when MGL says he has no idea how to answer the question (and frankly I’m not very certain I have the answer yet, either) - when MGL makes a change to the methodology of UZR, how do we know that UZR has gotten better? Or has it gotten worse? Is it better? Or is DRS better? Or is the unadjusted data (or at least minimally adjusted - RZR and some measure of OOZ plays, like OOZ/Inn, converted to plus-minus) just as good or better? All three of them have practically identical year-to-year correlations, for instance.

To sort of bring this back around, in essence there are two components to measuring fielding:

1) How many plays did a player make, and
2) How many chances did he have to make plays?

The first is a simple factual question. The second is not measurable directly - we have to estimate it from other parameters, some factual (batter handedness), and some not (batted ball types). The assumption that UZR is superior to other fielding metrics (at least, ones that don’t use the same level of batted ball data) is that the data is increasing the accuracy of the estimate of chances.

But it’s an assumption, and an untested one at that. It’s as if we knew a batter’s hits, walks and hit by pitches, but didn’t know his plate appearances (and didn’t know his batting outs or what have you to derive plate appearances indirectly). And so we have to guess how many PAs a hitter might have had. Or if maybe, say, you had major discrepancies between the number of PAs recorded by STATS, and BIS and MLBAM - the number of hits and walks stays the same, but PAs change dramatically. So if you have a player with an above-average OBP - is he really above average at getting on base, or is he just having his PAs underreported?

And these are issues I think we need to resolve, or at least acknowledge, before endorsing ANY fielding system.
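
The dependence on estimated chances can be sketched as follows. This is a minimal illustration, not any published system's method: the function name, the league conversion rate, and the chance counts are all hypothetical. It holds the factual count (plays made) fixed and varies only the estimate of chances.

```python
def plays_above_average(plays_made, estimated_chances, league_rate):
    """Plays made minus the plays an average fielder would be expected to
    make on the same estimated number of chances."""
    return plays_made - estimated_chances * league_rate


plays_made = 400    # factual: countable from the record
league_rate = 0.80  # hypothetical league-average conversion rate

# Same fielder, same plays made, three different estimates of his chances.
for chances in (480, 500, 520):
    rating = plays_above_average(plays_made, chances, league_rate)
    print(f"estimated chances={chances}  plays above average={rating:+.0f}")
```

Under these assumed numbers a swing of 20 chances either way moves the same fielder from clearly above average to clearly below, with no change in what he actually did.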


#55    Colin Wyers      (see all posts) 2010/07/05 (Mon) @ 02:01

As for the request for FRAA data associated with retroID:

http://www.editgrid.com/user/cwyers/fraa_data

That is the data as it appears on the site (as of the time of this post) - as in, the 2005-2010 data uses a different method than the other data.


#56    Tangotiger      (see all posts) 2010/07/05 (Mon) @ 02:22

I really don’t understand some of what feels like personal acrimony on display here. (And this isn’t an isolated thing, either, which is why I stopped posting here for a while, and will probably stop posting here again shortly.)

You should impugn someone in particular. You shouldn’t make such a specific claim and generally lay it out to this blog and its readers. 

We’re all big boys who respect each other, and I think that if this was a bar, we’d have resolved this issue in 2 minutes.  Same deal with Matt’s issue from a few months back.

The strength of this blog is the personalities of the commenters who post here.  Sometimes, things come out stronger than they otherwise would, like drunks at a bar.  But you still go back to that bar, because at some point, you become the drunk that says something the wrong way.  Me included.

But if you’ve decided that you just want a divorce, because the effort to get things in a better shape is more than the value you get in posting here, so be it.  To me, that would be almost impossible.  It would have to be a series of many many personal attacks and general meanness for me to want to stop posting here.  That’s because I believe in the sincerity of the commenters, and the belief that they want to move discussions forward, and to look past any ego-ness (that a word?).

***

What does FRAA do that UZR does not do? It restrains itself to factual data, or at least more so than UZR does, and ignores subjective data, or again at least more than UZR does.

This is the WOWY argument that I support.  But it doesn’t apply to FRAA, because FRAA takes factual data, like handedness split at a team level, and applies it to pitchers (based on the last reading of the system published).  There are a lot of arbitrary decisions made in FRAA: even though it uses factual data, it combines them in whatever way Clay figured made the most sense.

Now, if you tell me that Clay uses the actual number of PA by LHH and RHH in the games by a particular pitcher, then great.  That adds to FRAA’s value.  But, I haven’t seen that.  Is this what happens?

WOWY, for example, DOES look at the actual handedness split, in addition to the actual identity of the batter, since not all LHH are the same.

***

I don’t have a disagreement with the rest of your post, other than this:

“And these are issues I think we need to resolve, or at least acknowledge, before endorsing ANY fielding system.”

BPro currently (implicitly) endorses FRAA.  I’ve stated in the past something like this:

80% UZR
70% PMR
60% ZR / FRAA
50% Palmer

Something like that.  That’s pretty much the scale, based on granularity of data used, and intelligence in putting it together.  FRAA uses more than ZR, so it elevates it to ZR’s level.

TZ is somewhere between 65% and 75%, as is SFR and WOWY.


#57    Colin Wyers      (see all posts) 2010/07/05 (Mon) @ 02:45

Well sure. Look, all the fielding metrics come down to two things:

1) The inputs, and
2) How you handle them.

To make it clear, in case it wasn’t already - I am not claiming that FRAA has superior inputs or methods of handling those inputs. I’m simply claiming that right now we lack a way to determine what inputs and what methods are superior.

And here’s the question I’m asking, I guess - does additional granularity and intelligence actually provide us with better answers? And I could tell you what I think, but I keep coming around to the fact that what I think just doesn’t matter, at least until I have data to support those thoughts one way or another.

So I’ll ask - are you saying that until we resolve these issues, BP should stop publishing FRAA? I’m actually willing to listen to that argument (although I think it’s broadly amusing that it’s the exact opposite of the original premise). But let me ask - does the same line of reasoning extend to PMR? DRS? UZR? What criteria are we using to decide which defensive metrics “should” and “shouldn’t” be published?


#58    terpsfan101      (see all posts) 2010/07/05 (Mon) @ 03:13

Colin, is there any way you can make your spreadsheet downloadable? I’ll add Retrosheet IDs to my UZR spreadsheets (post #45) so comparisons can be made.

Colin says:

“To sort of bring this back around, in essence there are two components to measuring fielding:

1) How many plays did a player make, and
2) How many chances did he have to make plays?”

End quote

The PBP metrics only have a small advantage on the first component. You can figure out how many plays an infielder made by looking at his assists (and subtracting DP’s turned for 2B and SS). For outfielders, putouts will give you the exact number of plays made for the range component. You can use outfielder assists for an arm rating, but the PBP metrics have a big advantage in this area. They also have a big advantage in determining plays made by first basemen. In my fielding runs metric, I followed UZR’s parameters of what constitutes a play. UZR ignores infielder popups, so I didn’t use infielder putouts in my formulas. And the data I used to figure my formula for unassisted putouts didn’t include popups caught by first basemen.

Now, the second component is where the PBP metrics set themselves apart from the non-PBP ones. The non-PBP metrics can only infer opportunities indirectly (by using data on batter/pitcher handedness, GB/FB ratio, and the total number of BIP to adjust the number of expected putouts, assists), whereas the PBP metrics can determine this directly, although there is some disagreement among the systems in what constitutes an opportunity.
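
The box-score counting rules for the first component, as described in this comment, can be sketched like this. It is a rough illustration only: the function name and the sample totals are hypothetical, and the actual fielding-runs formulas from the linked thread are not reproduced here. First base is deliberately left out, since the comment says traditional stats handle it poorly.

```python
def range_plays_made(position, putouts=0, assists=0, double_plays=0):
    """Approximate range plays made from traditional fielding stats.

    2B and SS: assists minus double plays turned.
    3B: assists.
    Outfielders: putouts (outfield assists belong to an arm rating instead).
    """
    pos = position.upper()
    if pos in ("2B", "SS"):
        return assists - double_plays
    if pos == "3B":
        return assists
    if pos in ("LF", "CF", "RF"):
        return putouts
    raise ValueError(f"no simple box-score rule for position {position!r}")


# Hypothetical season totals, for illustration only.
print(range_plays_made("SS", assists=450, double_plays=90))
print(range_plays_made("CF", putouts=380, assists=8))
```

This covers only plays made; the estimate of opportunities, which is where the systems actually differ, is not addressed by these rules.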


#59    Peter Jensen      (see all posts) 2010/07/05 (Mon) @ 07:40

MGL - A couple of questions.  BIS has stated that they began collecting hang times last year.  Does the data that you currently receive from BIS have hang times for outfield hit balls?  You and Greg also had a project to collect that data from 2008. Have you incorporated this information into UZR?  Colin has a concern about bias in the hit ball classifications.  You (MGL) are, as far as I know, the only person that has access to Retrosheet, BIS and STATS data.  Would it be possible for you to compare the hit ball classifications from all three datasets on a park basis to determine the extent of the disagreement between the different data providers and whether there is bias by park?  Would that be within the bounds of your confidentiality agreements?  This is a fairly important question and it appears that you are the only person in a position to provide an answer.

Colin -

1) How many plays did a player make, and
2) How many chances did he have to make plays?

This is certainly the essence of any fielding metric as you state above, but how you convert “plays made” and “plays not made” into runs is also a significant variable in fielder ratings.


#60    Tangotiger      (see all posts) 2010/07/05 (Mon) @ 09:11

does additional granularity and intelligence actually provide us with better answers

Yes, as long as there is “little” bias in the additional data being considered.  The more bias, the more that will cancel out the benefit of the granularity.

So, the question is how much bias is there, and is that bias large enough to cancel out the gain of the granularity.  For under 3 seasons, I would guess that the bias is not large enough, and so, things like UZR reign.  More than 6 seasons, and that moves the others closer to UZR.  Reasonable-enough guesses.
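
The trade-off between bias and granularity can be sketched with a simple error model. This is an assumption-laden illustration, not a measurement: it treats bias as a fixed per-player error that does not shrink with more seasons, treats noise as averaging out across seasons, and uses invented numbers for both metrics.

```python
def expected_sq_error(bias, noise_var_per_season, seasons):
    """Expected squared error of a multi-season average rating.

    Assumes the bias is constant across seasons, while independent
    season-to-season noise averages down as 1 / seasons.
    """
    return bias ** 2 + noise_var_per_season / seasons


# Hypothetical metrics (runs): a granular one with some bias but less noise,
# and a coarse one with no bias but more noise.
GRANULAR = {"bias": 1.5, "noise_var_per_season": 25.0}
COARSE = {"bias": 0.0, "noise_var_per_season": 49.0}


def advantage_of_granular(seasons):
    """How much lower the granular metric's expected squared error is."""
    return (expected_sq_error(seasons=seasons, **COARSE)
            - expected_sq_error(seasons=seasons, **GRANULAR))


for n in (1, 3, 6):
    print(f"{n} season(s): granular advantage = {advantage_of_granular(n):.2f}")
```

With these invented inputs the granular metric stays ahead but its lead shrinks as seasons accumulate, which is the pattern guessed at above; a larger assumed bias would eventually flip the ordering.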

BP should stop publishing FRAA

Not for the reason you state, but simply because there’s no reason to estimate the LHH split that Pedro faced in 2000 when we know EXACTLY the split.  I know the reason that Clay has stated in the past for keeping things “consistent”.  They were wrong then, and they are wrong now.

In addition, why use some combination of PO and A to figure out how many plays to give the 2B and SS and 3B?  The metric is riddled with estimates, when we’ve got the actual data.

Put simply, if the PBP were available today, would Clay (and Bill James) have constructed their metric as it currently is, for 1950-2010?


#61    Guy      (see all posts) 2010/07/05 (Mon) @ 11:08

“If you look at 3-yr UZR....[and] I look at career WOWY (not that it was brought up here), there are no big surprises in rank (just in magnitude).”

This is the big question in my mind:  do the metrics artificially reduce variance (and by how much)?  Tango, how different are the spreads in UZR vs. WOWY?

Obviously, a tighter spread by UZR isn’t necessarily “wrong.” Even with WOWY’s controls, the highest-rated fielders should have somewhat easier opportunities.  However, I do think that the larger the sample, the more UZR should converge on WOWY.  That is, the more seasons of data you have, the less likely it is a fielder faced an unusual BIP distribution (controlling for pitchers).  So I think a good test would be to compare 3-yr to 5-yr UZR samples (or 2-yr to 4-yr), to see if the UZR-WOWY difference shrinks as sample size increases.  My own guess is that it doesn’t shrink very much, resulting in “over-regressing” of very good and very bad fielders, but I could certainly be wrong about that. 

Complicating this is that WOWY includes a lot of BIP ignored by UZR.  So ideally, for infielders you would use “GB-WOWY” to assess UZR and other metrics.  Looking at WOWY may also tell us that infielders do vary in their ability to handle LDs and/or foul popups (as we discussed re: Rolen), which would be important.


#62    Matt Swartz      (see all posts) 2010/07/05 (Mon) @ 12:33

I don’t really have anything to add but I want to get an email when someone says something!  Interesting stuff.


#64    Peter Jensen      (see all posts) 2010/07/05 (Mon) @ 14:04

Guy - I thought we had determined in the Rolen thread that UZR’s method of sharing fractional outs with adjacent fielders for plays made within a zone would have some dampening effect on fielders that were either above or below average.  And I also thought that we established that TotalZone’s method of calculating chances using plays made as an input was theoretically incorrect and created an even greater dampening effect.

Like you, I also would like to see a comparison of a GB-WOWY with UZR.  But I am not sure that Tango is set up or has an inclination to use GB-WOWY, as he seemed reluctant to use any subset of a WOWY based on all BIP.  A GB-LEFT-HALF-WOWY for 3rd and short and RIGHT-HALF for 1st and 2nd would be even better.  I am also not sure that even 3 and 5 years are long enough for WOWY to be accurate enough to show a meaningful convergence.

Although I think it is useful to do testing of fielding metrics for the purpose of testing some of the assumptions made in their construction, I don’t think that it is possible to construct a test that would resolve the division of responsibility between pitcher and fielder for the fielding result.  I think it is clear that individual pitchers do have a skill that can affect the “fieldability” of their balls in play.  What is not clear is how much the fieldability is affected and how important that factor is when looking at an entire pitching staff of a team for a single year. 

I am also less than convinced that greater granularity equates to an improved metric even when applied intelligently and without bias as Tango claims.  I think that it is entirely possible that having zones that are more granular than the average error of the observational data being used may result in less accurate yearly results.  But testing this hypothesis would be much more trouble than it’s worth.

Like MGL, I am surprised that people who can live with offensive metrics that vary a dozen runs a year or more for a player think the sky is falling if a defensive metric has the same discrepancy.


#65    MGL      (see all posts) 2010/07/05 (Mon) @ 14:19

“Does the data that you currently receive from BIS have hang times for outfield hit balls?  You and Greg also had a project to collect that data from 2008. Have you incorporated this information into UZR?”

I have no hang time data from BIS.  I have not incorporated the hang time data from what Greg and I did into UZR.  We only have 2 years (I think - maybe 3) and we only have deep fly balls.  Of course something is better than nothing.

I have STATS data from a few years back - nothing from the last few years.  I suppose I could compile some comparisons.

Biases and inaccuracies in the data really don’t concern me all that much.  As I said in my previous post, whether a few balls that are ambiguous are classified as ground balls, line drives, or fly balls, or there is inaccuracy or bias by a few feet here or there is NOT going to make much difference in the numbers when the smoke clears, especially in large samples (where the inaccuracies, and even some of the biases, will tend to even out).

“If you believe UZR is better than +/-, has anyone ever showed that?  My investigations have not shown one to be much better than the other.  (They both look equally poor.)”

Ridiculous statements like the one above in parentheses really do not even dignify a response, and I don’t say that as the UZR author. I would say that about any reasonable metric whether I had anything to do with it or not, and I include Dewan’s (which I have nothing to do with of course) in that editorial.

And that, BTW, is NOT a personal attack.  There have been exactly zero personal attacks on this thread and rarely one on this blog.  Whenever someone starts with the nonsense about personal attacks and insults, which certain persons seem to do, I take that as an attempt to deflect the discussion away from substance, whether it is meant that way or not.

As far as the agreement of UZR and DRS, here are the top and bottom 20 (by total UZR of all qualified fielders) from FG. The first number is DRS without their “HR saved (rHR)”, and the second is total UZR.

Franklin Gutierrez 29, 31
Evan Longoria 17, 18
Carl Crawford 24, 18
Chone Figgins 31, 17
David DeJesus 5, 15
Jack Wilson 27, 15
Adrian Beltre 21, 15
Ryan Zimmerman 22, 14
Hunter Pence 19, 12
Elvis Andrus 14, 12
Casey Blake 8, 12
Placido Polanco 7, 12
Juan Rivera 20, 12
Mike Cameron 1, 11
Chase Utley 13, 11
J.D. Drew 4, 11
Ian Kinsler 22, 10
Michael Bourn 12, 10
Pedro Feliz 5, 10
Nelson Cruz 11, 10

Average DRS: 16
Average UZR: 14

Worst UZR

Jermaine Dye -18, -22
Brad Hawpe -9, -20
Yuniesky Betancourt -19, -17
Vernon Wells -12, -17
Dexter Fowler -15, -16
Orlando Cabrera -33, -14
Andre Ethier -2, -13
Ryan Braun -16, -13
Miguel Tejada -16, -12
Garret Anderson -6, -11
Alberto Callaspo -11, -11
Luis Castillo -11, -11
Michael Cuddyer -6, -10
David Wright -13, -10
Dan Uggla -8, -10
Carlos Lee -5, -10
Jacoby Ellsbury -9, -8
Bobby Abreu 1, -8
Chase Headley -3, -8
Kosuke Fukudome -5, -8

Average DRS: -11
Average UZR: -12

Sometimes RMSE and correlations can be deceiving and/or misleading.  To say that these two metrics horribly disagree or some such thing is so disingenuous, I don’t even know what to say.

In fact, I’ll ask our readers to simply take a look at the above numbers (I realize that these are only a subset of both systems, but I did NOT cherry pick them), and with as little bias as possible, please answer the question, “These two metrics...”

1) Agree very well.
2) More or less agree.
3) Agree/disagree a little.
4) Disagree a lot.
5) Do not seem to agree at all.
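
For readers who want to put numbers on that question, here is a minimal sketch that summarizes the 40 pairs listed above. The function names are illustrative, not from any library. One caveat, stated as an assumption: these players were selected for extreme UZR values, so figures from this subset are not directly comparable to an RMSE computed over all starters.

```python
import math

# (DRS without rHR, UZR) pairs copied from the two 2009 lists above.
BEST_20 = [
    (29, 31), (17, 18), (24, 18), (31, 17), (5, 15), (27, 15), (21, 15),
    (22, 14), (19, 12), (14, 12), (8, 12), (7, 12), (20, 12), (1, 11),
    (13, 11), (4, 11), (22, 10), (12, 10), (5, 10), (11, 10),
]
WORST_20 = [
    (-18, -22), (-9, -20), (-19, -17), (-12, -17), (-15, -16), (-33, -14),
    (-2, -13), (-16, -13), (-16, -12), (-6, -11), (-11, -11), (-11, -11),
    (-6, -10), (-13, -10), (-8, -10), (-5, -10), (-9, -8), (1, -8),
    (-3, -8), (-5, -8),
]


def mean(values):
    return sum(values) / len(values)


def rmse(pairs):
    """Root mean squared difference between the two ratings."""
    return math.sqrt(mean([(d - u) ** 2 for d, u in pairs]))


def mean_abs_diff(pairs):
    """Average absolute difference between the two ratings."""
    return mean([abs(d - u) for d, u in pairs])


def count_large_gaps(pairs, threshold=10):
    """Number of players whose two ratings differ by at least `threshold` runs."""
    return sum(1 for d, u in pairs if abs(d - u) >= threshold)


listed = BEST_20 + WORST_20
print("RMSE:", round(rmse(listed), 2))
print("Mean absolute difference:", round(mean_abs_diff(listed), 2))
print("Players 10+ runs apart:", count_large_gaps(listed))
```

The group averages can agree closely while individual players sit well apart, so both the averages and the per-player gaps are worth looking at before picking an answer.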


#66    Peter Jensen      (see all posts) 2010/07/05 (Mon) @ 15:15

Biases and inaccuracies in the data really don’t concern me all that much.  As I said in my previous post, whether a few balls that are ambiguous are classified as ground balls, line drives, or fly balls, or there is inaccuracy or bias by a few feet here or there is NOT going to make much difference in the numbers when the smoke clears, especially in large samples (where the inaccuracies, and even some of the biases, will tend to even out)

Even though I said that this is a fairly important question, I tend to think you are correct in the above assessment that it will not really amount to a big deal.  I meant that it is an important question because it has been raised as a possible problem and until a comparison is made it will be an unanswered issue for many people.  Even if the comparison is only BIS to Retrosheet I think it would be helpful. 

I think the two metrics more or less agree.  I think most PBP fielding metrics will more or less agree.  For a full time fielder who gets between 400 and 800 balls hit in his zone in a year, there are only about 5% of those that differentiate the best from the worst fielders.  As Colin noted above, calculating plays made is pretty straightforward; the only real issues are calculating chances and assigning run values.  You can have pretty bad data and make some pretty big errors in methodology and still will be pretty close to every other fielding metric.


#67    Colin Wyers      (see all posts) 2010/07/05 (Mon) @ 15:41

Oh come on, MGL. By your own admission you’re a very smart person. So how can you say “there have been exactly zero personal attacks on this thread” after saying something like:

I don’t know why some people are obsessed with the fact that they don’t agree as much as they would like. I guess it is because they just don’t understand how they both work. Or they just like to be contrarians. Probably both.

So if there have been exactly zero personal attacks, can you clarify - was that supposed to be a critique of my data or my methodology? Because the actual words you used are pretty clear in critiquing either my intellect or my motivations - “probably both,” actually.

Tom insists that this personal disagreement could be easily hashed out - maybe it could, maybe it couldn’t, I don’t care. I’m not here to be friends with MGL. But I am trying to have a frank and open discussion about the data and the way it’s being handled.

Now, MGL says “Biases and inaccuracies in the data really don’t concern me all that much.” And that’s his prerogative. But I don’t think that all the valid lines of sabermetric inquiry are the ones that interest MGL. Now, when he goes on to say, “whether a few balls that are ambiguous are classified as ground balls, line drives, or fly balls, or there is inaccuracy or bias by a few feet here or there is NOT going to make much difference in the numbers when the smoke clears, especially in large samples,” well, sure, if you want to ignore what bUZR and sUZR are actually saying.

But MGL is the only person with access to both the data and methods used to calculate UZR. And so he’s the only one that can answer the question of why UZR disagrees with itself so much regarding, say, Ichiro. Or Chipper. But he doesn’t care to. And… that’s fine for him. But then I must strenuously disagree with Tom when he says that we ought to be using UZR to evaluate fielders. It’s a totally baseless claim.

And when Peter says:

You can have pretty bad data and make some pretty big errors in methodology and still will be pretty close to every other fielding metric.

he could well be right. But I don’t think he follows the implications of that statement to the logical conclusion - if agreement among fielding metrics occurs in large samples regardless of the batted-ball data used or the method used to process it, and we have no way of knowing which data or methods are correct, why not simply restrict ourselves to evaluating fielding using the simplest possible method? What is the extra data and processing in UZR providing us at all?


#68          (see all posts) 2010/07/05 (Mon) @ 16:11

I think the two metrics more or less agree.  I think most PBP fielding metrics will more or less agree.

If the standard is, are UZR and DRS better than throwing darts, then sure, they’re great systems.  If the standard is, are they better than Total Zone and RZR and FRAA, I don’t know, and I don’t think anyone has demonstrated that they are.

If you take 2009, the average difference between DRS and UZR for “qualified players” on the Fangraphs leaderboard is 6.5 runs.  The average difference between UZR and TZ is 7.4 runs.  The average difference between DRS and TZ is 7.9.

MGL notes that the 2009 UZR leaders average +16 in DRS and +14 in UZR and the trailers average -11 in DRS and -12 in UZR.  They also happen to average +12 in TZ and -13 in TZ.

If the standard is that these systems are better than no fielding information at all, then absolutely yes.  But aren’t UZR and DRS claiming to be better than systems like TZ?  That’s the claim I’m questioning.  When I say “equally poor”, I mean equally poor at giving us the improvements they are claiming over the existing systems.

Somebody coming up with a new run estimator is likely going to have a statistic that does pretty well at measuring offensive output.  But the standard it will be judged against is whether it is better than what is already in existence, not whether it “more or less” agrees with OPS or EqA or wOBA.  If it uses complex calculations, the author would be asked to justify why that was necessary and whether it added to the accuracy of the results.

I would think that’s a legitimate expectation of a fielding system, too.  Getting in the ballpark with a very complex system doesn’t cut the mustard when much simpler systems based on freely available data also get in the very same ballpark.

If the superior reputation of UZR and +/- can’t be backed with data, that reputation doesn’t mean much to me, and it shouldn’t mean much to the other readers of this blog, where evidence, not reputation, is supposed to be king.


#69    David      (see all posts) 2010/07/05 (Mon) @ 16:20

There are several players among the 40 that MGL listed that have a difference of 10 or more runs in the two metrics he provided.  Is that more or less the same?  I sure don’t think so.  I don’t even consider that to be disagreeing only a little.  That’s a significant amount of value for a significant number of players.  I think they disagree a lot.


#70          (see all posts) 2010/07/05 (Mon) @ 16:24

Since MGL has questioned my motives, let me say this.  I came into this investigation back in April expecting to find that UZR was the best system out there, that it was better constructed than the competitors, and that advantage would enable it to show superior results.  That may still in fact be true, but I haven’t been able to find any evidence that shows that.

MGL, if you don’t feel like responding to questions about UZR, you can just say that.  Say you’re tired of the subject, bored of defending UZR, annoyed that people are questioning the value of something you have freely given to the community, etc.  That’s fine.  Or whatever other reason you may have.  But impugning my motives is incorrect and rude.


#71    d      (see all posts) 2010/07/05 (Mon) @ 17:22

Gee. I’ve been reading all this back and forth about the fielding systems for a while in these recent threads here. It’s fascinating, but it really leads me to start wondering if the best system out there might be the Bill James Defensive Misplays and Defensive Good Plays, from the last Fielding Bible, and with some intelligent weighting and opportunity factor.


#72    MGL      (see all posts) 2010/07/05 (Mon) @ 18:26

"But aren’t UZR and DRS claiming to be better than systems like TZ?”

I really don’t know. I would assume so.  I don’t recall trumpeting UZR over any other system. I may have, but it is not really my intention, and I don’t really care.

“That’s the claim I’m questioning.  When I say “equally poor”, I mean equally poor at giving us the improvements they are claiming over the existing systems.”

That is fair enough.  I interpreted it as, “They are both poor metrics.”

“MGL, if you don’t feel like responding to questions about UZR, you can just say that.  Say you’re tired of the subject, bored of defending UZR, annoyed that people are questioning the value of something you have freely given to the community, etc.”

That is 100% correct.

And I find many of the criticisms particularly unintelligent, without singling anyone out.  There is absolutely no doubt in my mind that some of the regulars like to criticize for no other reason than they like to criticize, which I find particularly annoying. I am not singling anyone out on that either.

Not only have I freely given it to the community, but I have given the exact methodology, so anyone can repeat and test it on any set of data they happen to have access to.

Frankly, I don’t care whether someone uses FRAA, TZ, ZR, UZR, DRS, or what have you, in order to evaluate fielding.  As long as the proper regression is done, any of them are just fine.


#73    Kincaid      (see all posts) 2010/07/05 (Mon) @ 18:29

MGL, if you select your lists as the best and worst according to UZR, then you are assuring that your list should be picking players who rate better as a group or worse as a group in UZR than in another system.  If the systems agree, you should not expect to pick the top 20 in one and then look at their rating in the other and get a higher rating for the group.  They appear to agree well based on just the lists you provide, if they were randomly selected, but that could be because while the 2 metrics disagree about the spread of talent in the league, your lists were cherry picked (perhaps not intentionally) to inflate the ratings of UZR relative to the DRS ratings.  That they appear to agree could just be 2 biases (the bias of DRS having a greater spread and the bias of selecting players by who has the best or worst UZR) canceling each other out.


#74    Tangotiger      (see all posts) 2010/07/05 (Mon) @ 19:49

If the standard is, are they better than Total Zone and RZR and FRAA, I don’t know, and I don’t think anyone has demonstrated that they are.

I have demonstrated that UZR and probably TZ are better than FRAA: there is nothing that FRAA does that UZR and probably TZ don’t already do.

And if FRAA chooses to do something that they don’t do, UZR and TZ could easily do it.  That’s because FRAA limits itself to seasonal data.

What MGL could, and should, do is a UZR1 and UZR2 to make the distinction: UZR1 would be based on all the objective parameters and some of the subjective parameters.  UZR2 would be UZR1 plus all the more questionable subjective stuff, like batted ball speed and anything else where one might think the data is too granular.  You can even make it so that UZR1 would use Retrosheet zones, while UZR2 uses the smaller STATS/BIS zones.

MGL cuts out the middle man and goes straight to UZR2.

But seriously, exactly who here is claiming that FRAA is doing something that UZR isn’t doing?  And what is FRAA doing?

TZ could be considered a poor man’s UZR (i.e., UZR1).  Exactly what does FRAA do that TZ doesn’t do?  And if FRAA is doing that, TZ could easily do it (if it chose to).

I want to hear exactly what FRAA does that makes it not a subset of the other more advanced metrics.


#75    Colin Wyers      (see all posts) 2010/07/05 (Mon) @ 20:47

Ok Tom, let’s flip this around and examine a similar claim (one which I must emphasize I am *not* making):

“I thought I had already proven PECOTA has to be better than Marcels - what is Marcels doing that PECOTA isn’t?”

Without evidence, would you let me make that claim unopposed, or would you call bullshit?


#76    MGL      (see all posts) 2010/07/05 (Mon) @ 21:44

For what it is worth, I am always careful about making sure that more granular data is not “overused,” if that makes any sense, just in case the data is not that good.


#77    MGL      (see all posts) 2010/07/05 (Mon) @ 21:48

Kincaid, I don’t really understand your post, but you are right that the best and worst UZR SHOULD be better and worse than comparable numbers (for the same players) in any other system.  The reason they are not, however, in this case, is that UZR and DRS use the same data and essentially the same methodology.  One more piece of evidence that the results of both systems are quite similar, as they should be.


#78    Chris Dial      (see all posts) 2010/07/05 (Mon) @ 22:32

but I have given the exact methodology,

Link?  Because I don’t believe that all the adjustments you make are available.


#79    Kincaid      (see all posts) 2010/07/05 (Mon) @ 22:39

Or, the reason the DRS for the group is not less extreme in this case is that DRS has a wider spread than UZR, and that cancels out the effects of the selective sampling.

Are you saying that if you took a list of qualified players from FanGraphs’ leaderboards, and then picked 20 random players from the list (as opposed to selectively sampling the top or bottom 20 from the metric with the smaller spread), you would expect the same results as you got using your sampling method, and that basically you think the average UZR and DRS for the group would end up being within 1 or 2 runs of each other on average?  Or, if you selected by the top 20 in DRS instead of the top 20 in UZR, you would still get the two metrics for the group to come just as close?


#80          (see all posts) 2010/07/05 (Mon) @ 22:53

Don’t forget that TotalZone comes right there within 1 or 2 runs of the other two systems, too.  So is that evidence that it, too, uses essentially the same methodology?


#81          (see all posts) 2010/07/05 (Mon) @ 23:25

I’ve reported some of these findings elsewhere previously (Twitter, THT Live), but let me repeat and elaborate with greater detail.

Using 2003-2009 data for DRS and UZR from Fangraphs, and looking only at players who switched teams from one year to the next and played at least 400 defensive innings in each year, here are the year-to-year correlations that I found.

UZR-Range runs/inning, year2 vs. year1
Right field: r=.05, slope=.06, n=26
Center field: r=.05, slope=.07, n=34
Left field: r=.23, slope=.29, n=23
Third base: r=.38, slope=.32, n=21
Shortstop: r=.41, slope=.44, n=23
Second base: r=.18, slope=.13, n=26
First base: r=.13, slope=.13, n=21

DRS-Range runs/inning, year2 vs. year1
Right field: r=.01, slope=-.01
Center field: r=.07, slope=.11
Left field: r=.35, slope=.44
Third base: r=.16, slope=.15
Shortstop: r=.63, slope=.56
Second base: r=.17, slope=.15
First base: r=.07, slope=-.06

What makes the left side of the field perform so much better than the right side of the field in year-to-year correlations?  It’s not simply the number of chances, because CF and 2B get more chances than LF and 3B.
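
For anyone who wants to repeat this kind of check, here is a minimal sketch of how an r and a slope like the ones above could be computed for one position. It assumes a plain unweighted least-squares fit of year-2 rate on year-1 rate; the post does not say whether the original numbers were weighted by innings, and the five player values below are made up purely for illustration.

```python
import math

def year_to_year(year1, year2):
    """Return (r, slope) for an unweighted regression of year-2 rates on year-1 rates."""
    n = len(year1)
    mean1 = sum(year1) / n
    mean2 = sum(year2) / n
    sxx = sum((x - mean1) ** 2 for x in year1)
    syy = sum((y - mean2) ** 2 for y in year2)
    sxy = sum((x - mean1) * (y - mean2) for x, y in zip(year1, year2))
    r = sxy / math.sqrt(sxx * syy)   # Pearson correlation
    slope = sxy / sxx                # least-squares slope of year2 on year1
    return r, slope

# Hypothetical range runs per inning for five players (not real data).
y1 = [0.010, -0.004, 0.002, -0.008, 0.006]
y2 = [0.004, -0.001, 0.003, -0.002, 0.000]
r, slope = year_to_year(y1, y2)
print(f"r={r:.2f}, slope={slope:.2f}")
```

With samples of only 21 to 34 players per position, an r computed this way will bounce around a lot from one sample to the next, which is worth keeping in mind when comparing positions.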


#82          (see all posts) 2010/07/05 (Mon) @ 23:29

Obviously I should look at the same for Total Zone.  I have not done so.  I don’t think the Total Zone data was available yet on B-Ref when I did this comparison on the Fangraphs data.


#83          (see all posts) 2010/07/05 (Mon) @ 23:39

Let me expand, too, on the number of chances for the different pools of players.

Total BIZ year1, BIZ year2
Right field: 6743, 6155
Center field: 9147, 8217
Left field: 5330, 4598
Third base: 4669, 4196
Shortstop: 7029, 7338
Second base: 6655, 7451
First base: 2856, 2816

Divide by n (the number of players in the sample) listed in post #81 to get the average BIZ per player.
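
The division can be done directly from the numbers in posts #81 and #83. This is just a worked version of that arithmetic, using the n values listed with the UZR table; the variable names are mine.

```python
# Sample sizes (n) from post #81 and total BIZ (year1, year2) from post #83.
n_players = {
    "Right field": 26, "Center field": 34, "Left field": 23, "Third base": 21,
    "Shortstop": 23, "Second base": 26, "First base": 21,
}
total_biz = {
    "Right field": (6743, 6155), "Center field": (9147, 8217),
    "Left field": (5330, 4598), "Third base": (4669, 4196),
    "Shortstop": (7029, 7338), "Second base": (6655, 7451),
    "First base": (2856, 2816),
}

# Average BIZ per player in each year, by position.
avg_biz = {
    pos: (y1 / n_players[pos], y2 / n_players[pos])
    for pos, (y1, y2) in total_biz.items()
}

for pos, (a1, a2) in avg_biz.items():
    print(f"{pos}: {a1:.0f} BIZ/player in year1, {a2:.0f} in year2")
```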


#84    Rally      (see all posts) 2010/07/05 (Mon) @ 23:40

Tango, you may want to move this into another thread, but TZ has been upgraded:

http://www.baseballprojection.com/articles/tz_hitlocation.htm

Colin, thanks for finding the old FRAA data.  Is there any reason why that isn’t more accessible on the searchable player pages?  Has BP decided not to display that detail anymore and just had not gotten around to deleting the old cards yet?
Or will it be there but the new cards are still works in progress?


#85    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 00:06

“I thought I had already proven PECOTA has to be better than Marcels - what is Marcels doing that PECOTA isn’t?”

Without evidence, would you let me make that claim unopposed, or would you call bullshit?

If you tell me that PECOTA does do the three things that Marcel does (plus more), then PECOTA would be better than Marcel.  But PECOTA doesn’t do that.  Marcel is not a subset of PECOTA, nor of ZiPS.

Marcel however does seem to be a subset of Chone.  And it does seem to be a subset of MGL’s system.  And so, in the long run, Marcel will lose to those systems.

So, change PECOTA to Chone or MGL, and your statement is accurate and is exactly what I am talking about.  Great analogy actually.


#86    Colin Wyers      (see all posts) 2010/07/06 (Tue) @ 00:30

Well, this is going off on a tangent.

I presume the three things are weighted average of past performance, regression to the mean and an age adjustment? Because PECOTA does all three of those.

So, change PECOTA to Chone or MGL, and your statement is accurate and is exactly what I am talking about.  Great analogy actually.

So you’re saying that if someone created a new projection system, very similar in conception to the Marcels, that included more data than the Marcels (some form of MLEs, say) that we can automatically say it’s PROVEN to be better than the Marcels, without testing it first?


#87          (see all posts) 2010/07/06 (Tue) @ 00:44

So you’re saying that if someone created a new projection system, very similar in conception to the Marcels, that included more data than the Marcels (some form of MLEs, say) that we can automatically say it’s PROVEN to be better than the Marcels, without testing it first?

Yes.  Take Oliver for example.  Its projection of Strasburg is proven to be better than Marcel’s. :)


#88    MGL      (see all posts) 2010/07/06 (Tue) @ 00:55

"Or, the reason the DRS for the group is not less extreme in this case is that DRS has a wider spread than UZR, and that cancels out the effects of the selective sampling.”

Sure, that is probably true.  I don’t disagree with the fact that the best in one system should be better than the comps from any other system even if both systems are great and are similar.  Same for the worst.  It is the same thing as taking the best players in one year and looking at those same players in another year.  The best or worst in one system will regress when you look at the same players in another system. How much they will regress depends on two things:  One, how similar the systems are - the more similar, the less they will regress.  Two, the spread of each system, as you point out - the more spread, the more one system will have extremes. If in fact DRS has a larger spread, then when we look at the top and bottom of DRS, we should see a fairly large regression for the same players in UZR even though the two systems are using the same data and are similar in methodology.
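
The two effects described here (similarity and spread) can be seen in a toy simulation. This is only a sketch under made-up assumptions: each system is modeled as a shared component plus its own independent noise, and the standard deviations of 8 and 5 runs are illustrative, not estimates for UZR or DRS.

```python
import random

def simulate(n_players=131, top=20, spread_a=1.0, spread_b=1.0, noise=5.0, seed=0):
    """Pick the top players by system A and return their average in A and in B.

    Both systems see the same underlying component (sd of 8 runs, an assumption)
    and add independent noise; spread_a and spread_b scale each system's output.
    """
    rng = random.Random(seed)
    players = []
    for _ in range(n_players):
        shared = rng.gauss(0, 8)
        a = spread_a * (shared + rng.gauss(0, noise))
        b = spread_b * (shared + rng.gauss(0, noise))
        players.append((a, b))
    best_by_a = sorted(players, key=lambda p: p[0], reverse=True)[:top]
    mean_a = sum(p[0] for p in best_by_a) / top
    mean_b = sum(p[1] for p in best_by_a) / top
    return mean_a, mean_b

# Equal spreads: the group selected on A regresses when viewed in B.
print(simulate(spread_b=1.0))
# B has the wider spread: that can offset or even reverse the regression.
print(simulate(spread_b=1.5))
```

Raising `noise` makes the systems less similar and the regression larger; raising `spread_b` pushes the other way, which is the canceling effect Kincaid describes.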

So what?

“Are you saying that if you took a list of qualified players from FanGraphs’ leaderboards, and then picked 20 random players from the list (as opposed to selectively sampling the top or bottom 20 from the metric with the smaller spread), you would expect the same results as you got using your sampling method, and that basically you think the average UZR and DRS for the group would end up being within 1 or 2 runs of each other on average?  Or, if you selected by the top 20 in DRS instead of the top 20 in UZR, you would still get the two metrics for the group to come just as close?”

Yes, I am. Why don’t you try it?

In fact:

The top 20 in DRS avg: +18.  Those same players average +11 in UZR.  16 of the top 20 players in DRS are at least +5 in UZR.

For the worst, the avg DRS is -14 and the same players in UZR are -11.  17 out of 20 players in the bottom 20 in DRS are at least -6 in UZR.

A random selection of 20 guys in DRS averaged +1 and those same players averaged +1 in UZR.  Of course that would be the case with any two systems (any randomly selected players would have the same avg numbers in both systems)!

Of those 20 randomly selected players, every player who was at least +3 in DRS was also at least +3 in UZR.  And of 8 players who were at least -3 in DRS, 7 of them were at least -3 in UZR and the 8th was -8 in DRS and -1.3 in UZR.

How about we just cut the crap.  The correlation coefficient between UZR and DRS in 2009 among all qualified fielders, N=131, is .79.

If that is “disagree” or “horribly disagree,” then I am a monkey’s uncle.

Can we please put to rest the notion that, “Two systems that use the same data can’t even agree?”


#89          (see all posts) 2010/07/06 (Tue) @ 01:14

How about we just cut the crap.  The correlation coefficient between UZR and DRS in 2009 among all qualified fielders, N=131, is .79.

You’re not cutting any crap.  That’s been found and stated before.  I didn’t consider that a point in favor of UZR and DRS.

The correlation coefficient between UZR and TZ in 2009 among all qualified fielders is .69, for goodness sake.  And between DRS and TZ it is .72.

One would think that systems using the very same data would get a lot closer than systems using completely different data.


#90    Colin Wyers      (see all posts) 2010/07/06 (Tue) @ 01:29

Or how about this. Take the RZR data, and figure a plus-minus for it, based on the average RZR at that position. Then figure the average OOZ/Inn and do a plus minus for that. Combine them.

For qualified starters in ‘09:

Correl with UZR: 0.65
Correl with DRS: 0.76

So what it comes down to is this:

* To the extent that DRS and UZR agree, it seems that almost ANY treatment of the BIS data that isn’t Fielding Pancake Flops will render roughly that level of agreement.

* In cases where different treatments of the BIS data lead to disagreements (that is to say, cases where the underlying data doesn’t say more than the adjustments that are being made to it) we have no way to adjudicate between the different methods.

* In cases where different treatments of the BIS data lead to agreement, we have no way of telling if the metrics are converging upon the “correct” answer, or if they are all reporting the same incorrect conclusion based upon the biases present in the BIS data.
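
The RZR-based plus-minus described at the top of this comment could be sketched roughly as follows. The function and argument names are hypothetical, the result is left in plays rather than runs because the comment does not give a runs conversion, and the league averages in the example are invented.

```python
def rzr_plus_minus(biz, plays_in_zone, ooz, innings, lg_rzr, lg_ooz_per_inn):
    """Plays made above an average fielder at the position, in and out of zone.

    biz: balls in zone, plays_in_zone: plays made on those balls,
    ooz: out-of-zone plays, lg_rzr: league RZR at the position,
    lg_ooz_per_inn: league out-of-zone plays per inning at the position.
    """
    in_zone = plays_in_zone - lg_rzr * biz          # plus-minus from RZR
    out_of_zone = ooz - lg_ooz_per_inn * innings    # plus-minus from OOZ/Inn
    return in_zone + out_of_zone                    # combine the two

# Hypothetical fielder and league averages (not real data).
print(rzr_plus_minus(biz=300, plays_in_zone=250, ooz=60, innings=1200,
                     lg_rzr=0.80, lg_ooz_per_inn=0.04))
```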


#91    Kincaid      (see all posts) 2010/07/06 (Tue) @ 01:37

I did try it.  I just didn’t want to assume that you were saying that would make no difference when you obviously know it makes a difference.

Selecting the sample from the top/bottom 20 in DRS makes the comparison look quite a bit different from when you selected based on UZR.  They are definitely not within 1 or 2 runs anymore.  Additionally, you no longer see the alternate metric showing more extreme results than the one for which you selected the most extreme results.  When you selected based on the most extreme UZR results, the metric you selected on was actually 2 runs behind the other metric (which is a red flag that there is some difference going on, not necessarily that they are in strong agreement), and then when you selected based on the most extreme DRS results, it swung to the metric you selected on being 7 runs ahead of the other.  The bottom 20 selected by DRS also showed a larger difference than either of the groups you reported initially, but not by as much.

Whether you want to say that differences of 7 and 3 runs are still the two metrics saying pretty much the same thing is purely subjective, but that is definitely a bigger difference than 1 or 2 runs.  Reporting that the difference should really be 1 or 2 runs because of the first list and then gawking at the differences of 7 and 3 runs saying anything different is not very objective.

As I said, I did test the random samples as well.  I pulled the qualified fielders from FanGraphs for each of the past 3 seasons (2007-2009), N=393, average innings 1183.  Then I wrote a VBA macro for Excel to ensure I actually was looking at random samples (I can post it if you want to double-check the code).  The macro randomly selected 20 players from the list, making sure to not select the same player twice, and then figured the average UZR and DRS (with HR runs removed) for the group of 20.  I set the macro to pick 1000 random groups of 20.  The average difference between UZR and DRS for each group of 20 was 4.8 runs, RMSE 6.3 runs.
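
For readers without Excel, the same kind of resampling can be sketched in Python. This is not the VBA macro itself, which was not posted; it takes one possible reading of the description (compare the group's average UZR to its average DRS, then summarize those differences over all groups) and uses a synthetic player list rather than the FanGraphs data.

```python
import math
import random

def random_group_differences(players, group_size=20, n_groups=1000, seed=None):
    """players: list of (uzr, drs) pairs. Returns (mean absolute difference, RMSE)
    of group-average UZR minus group-average DRS across random groups."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_groups):
        group = rng.sample(players, group_size)   # no player picked twice in a group
        avg_uzr = sum(u for u, _ in group) / group_size
        avg_drs = sum(d for _, d in group) / group_size
        diffs.append(avg_uzr - avg_drs)
    mean_abs = sum(abs(d) for d in diffs) / n_groups
    rmse = math.sqrt(sum(d * d for d in diffs) / n_groups)
    return mean_abs, rmse

# Synthetic stand-in for the 393 qualified player-seasons (not real data).
data_rng = random.Random(42)
fake_players = []
for _ in range(393):
    base = data_rng.gauss(0, 8)
    fake_players.append((base + data_rng.gauss(0, 4), base + data_rng.gauss(0, 4)))

print(random_group_differences(fake_players, seed=1))
```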

I don’t care if you want to say that is still strong agreement or not, but pretending that your selectively sampled list before had no issues and should be legitimately taken as more accurate than the numbers other people have cited, or that a 1-2 run difference is actually what we should expect from a randomly sampled group of 20 just because that is what those two lists showed is pretty disingenuous.


#92    Nick Steiner      (see all posts) 2010/07/06 (Tue) @ 02:32

Mike/89

I don’t think showing that TZ and UZR and TZ and DRS have nearly as high a correlation as DRS and UZR means that much.  For one, a difference of .10 points of R could very well be significant.  Because correlation coefficients are essentially unitless, there is no way of knowing how significant that difference is.  Besides, your test shows that UZR and DRS *do* have a higher correlation than UZR and TZ and DRS and TZ.  That shows that there is something extra involved in the more granular methodology (not necessarily improvement).  Again, how much extra is still unknown and to be honest trying to subjectively interpret correlation coefficients does nothing for me. 
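
One standard way to put some uncertainty around a single correlation is the Fisher z-transformation. The sketch below applies it to the r of .79 with N=131 quoted earlier in the thread. It is only a rough guide: it assumes roughly bivariate normal data, and because the UZR/DRS and UZR/TZ correlations are computed on overlapping players, comparing two such intervals is not a proper test of the difference between them.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson r from n pairs."""
    z = math.atanh(r)               # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)     # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

low, high = fisher_ci(0.79, 131)
print(f"r=.79, N=131: roughly {low:.2f} to {high:.2f}")
```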

Your test of repeatability also does not really convince me.  We know that there are massive sample size constraints in annual defensive performance, so comparing year-to-year is picking up a lot of things other than the repeatability of the metric (aging, improvement/decline, outlier performances, etc.)

Breaking it up into only players that switched teams makes the sample size even smaller to tease out a higher correlation.  That certainly doesn’t invalidate the metric IMO. 

If you have the data handy could you please look at all position players who switched teams and run a correlation of year-to-year wOBA among qualifying players?  I would bet that number would be similar to the one with UZR.

Here is my view on the UZR vs. TZ debate (which I guess is what we are having now).  UZR is theoretically better than TZ and includes the same inputs and more.  And all of the tests comparing the two are ultimately inconclusive and their results are interpreted very subjectively.  So until we can get a better test of the metrics, I think that UZR has to be considered the best because it is better constructed - theoretically.  I guess that gels with Tom’s view.  I know it’s unsatisfying for guys like Colin and Mike to accept, so I hope they are able to come up with a good test of the metrics (I still like my WOWY idea).


#93    Nick Steiner      (see all posts) 2010/07/06 (Tue) @ 02:53

Also, I think David in post 5 has a good question.  If we are solely looking at WAR for retrospective value (we don’t care how much true talent it shows) how much should we regress single year defensive metrics?  We know that defensive metrics have measurement error in terms of what actually happened, and in most cases it’s likely that that error is going to be on the closer-to-league-average side, so it makes sense to regress UZR a little bit for retrospective purposes.  How much?


#94    Brian Cartwright      (see all posts) 2010/07/06 (Tue) @ 03:19

I made a list of Oliver’s FRAA from 2005 to 2009.
http://spreadsheets.google.com/ccc?key=0Akieb136KCz2dDkzZUdKSG9DOF9BUmVXZ0VvelNQemc&hl=en&authkey=CJ3alcEP#gid=0

Grouped by player & year, so all positions (>=3) are combined. Only includes MLB performances.


#95          (see all posts) 2010/07/06 (Tue) @ 04:59

Mike Fast said:

As far as Dewan P/M and UZR, they do NOT horribly disagree.  They more or less agree.

Wow.  Speechless here…

He also said:

“If you call that “more or less agree”, then we have a very different standard.”

So you said twice that you think that the statement that “UZR and DRS more or less agree” is preposterous.

Then I run a regression for all qualified fielders in 2009 and the “r” comes out as .79.

I am the one that should be speechless.

So, you are saying that a correlation of .79 for one year does not qualify as “more or less agree?”

Then you say:

That’s been found and stated before.  I didn’t consider that a point in favor of UZR and DRS.

Quoting the correlation has NOTHING whatsoever to do with how good or bad UZR or DRS is. It has everything to do with how well they agree or don’t agree.

You mocked me when I said they more or less agree, so I ran a correlation and it came up .79.

How much higher would you need for something to “more or less agree?”


#96    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 07:29

Because PECOTA does all three of those.

No it doesn’t.  It discards parts of those things in favor of extra weighting of its similar players.

So you’re saying that if someone created a new projection system, very similar in conception to the Marcels, that included more data than the Marcels (some form of MLEs, say) that we can automatically say it’s PROVEN to be better than the Marcels, without testing it first?

I have already said this:

As I’ve stated many times in the past (but not in this thread), as long as the extra things being done are intelligent things, then UZR and +/- cannot lose accuracy.

So, if you tell me that MGL or Rally or Brian or Colin is doing Marcel 2.0, then, yeah, I would just need minimal testing to prove it.  It would not have to be extensive. 

Unfortunately, Brian, as intelligent as he is, seems to be doing something I don’t understand, and therefore, I need more testing.

I’ve done minimal testing on MGL and Rally, and they’ve proven themselves enough that they are some form of Marcel 2.0.

ZiPS is not, and PECOTA is not.  BJ is not, and Shandler is not.

But, seriously, FRAA uses seasonal data, not play-by-play data.  How much testing would you really need to prove that a PBP system that uses purely objective data would be better than FRAA?  FRAA estimates outs and handedness split that we know for sure.

In order for FRAA-seasonal to be better than FRAA-events, we have to accept that an estimate of what happened gives more accuracy than what really happened, in terms of what really happened.

Just on its face, it makes no sense.  At the VERY LEAST, why can’t FRAA be changed to use what really happened?  That’s it.  Just that, instead of this philosophical discussion, can’t FRAA be changed to use actual data?  Can we discuss that for a few minutes?


#97    Rally      (see all posts) 2010/07/06 (Tue) @ 09:05

Colin did mention this:

“That is the data as it appears on the site (as of the time of this post) - as in, the 2005-2010 data uses a different method than the other data.”

The FRAA from 2005 on is using PBP data.  Probably very similar to totalzone, or the fielding runs system that Dan Fox used.


#98    Peter Jensen      (see all posts) 2010/07/06 (Tue) @ 10:23

Rally #84 - Congratulations on your update!  It’s a lot of work, but I think you will find it is worth it.  Did you update your past files as well as 2010?  Also, the convention for designating hit ball angles that was adopted here last year by consensus has 3rd base at -45 and 1st base at +45.  I think you will find that most people have been using that system since then, reluctantly even me.  Please email me as we have many things to compare between your new system and BZM.
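
As an illustration of that angle convention, here is a small sketch that converts a hit location to an angle. The coordinate system is an assumption made for the example (home plate at the origin, y pointing toward center field, x pointing toward the first-base side); real hit-location files use their own coordinates and would need to be translated first.

```python
import math

def hit_angle(x, y):
    """Angle in degrees: 0 is straight up the middle, -45 is the third-base
    line and +45 is the first-base line.

    Assumes home plate at (0, 0), +y toward center field, +x toward first base.
    """
    return math.degrees(math.atan2(x, y))

print(hit_angle(-1.0, 1.0))   # a ball down the third-base line
print(hit_angle(1.0, 1.0))    # a ball down the first-base line
```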


#99          (see all posts) 2010/07/06 (Tue) @ 10:33

Nick/92

I don’t think showing that TZ and UZR and TZ and DRS have nearly as high a correlation as DRS and UZR means that much.  For one, a difference of .10 points of R could very well be significant.  Because correlation coefficients are essentially unitless, there is no way of knowing how significant that difference is.  Besides, your test shows that UZR and DRS *do* have a higher correlation than UZR and TZ and DRS and TZ.  That shows that there is something extra involved in the more granular methodology (not necessarily improvement).  Again, how much extra is still unknown and to be honest trying to subjectively interpret correlation coefficients does nothing for me.

My point with the correlation coefficients is that .79 is really quite poor for two systems that are using the very same data.  You’d think they’d have a correlation coefficient of .95 or something like that if they really agreed.  Instead they are in the same ballpark in agreement as they are with a system using completely different data.  If you look at Justin’s old tests, a correlation coefficient in the neighborhood of .75 told us that two systems using different underlying data had pretty good agreement.  If that’s all that UZR and DRS can do, I find that surprising.  If you ran a correlation between wOBA and EqA (or even wOBA and Total Average or OPS) on the same set of offensive data of a similar sample size, what do you think the correlation coefficient would be?  It would surely be well over .9 and very close to 1.

I’ll address your other question in another post.


#100    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 10:33

The FRAA from 2005 on is using PBP data.  Probably very similar to totalzone, or the fielding runs system that Dan Fox used.

Did I miss the announcement on the change in methodology?  All my objections were based on FRAA being based on seasonal-data and estimates thereof.  If this is not the case, then my objections only apply to the previous version of FRAA.

Colin, can you point me to a reference in the change, so I can have an informed opinion?


#101    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 10:45

You’d think they’d have a correlation coefficient of .95 or something like that if they really agreed.  Instead they are in the same ballpark in agreement as they are with a system using completely different data.

That’s unreasonable.  When MGL makes changes to UZR, the correlation from one version of UZR to another does not approach anything close to .95. 

That MGL looks at a bunch of parameters and Dewan looks at a different bunch, some of which intersect, there’s no expectation at all for a correlation to be .95.

I can create two (intelligent) fielding systems right now based on the same underlying data, and I wouldn’t be surprised if I could get r under .70.  I can ignore park in one system, ignore batting handedness in another, create small buckets of zone in one, and smoothed out buckets in another.  I can ignore base/out situations in one and GB/FB tendency in another.  I can baseline against one-year of data or ten years, etc.  I can make a whole bunch of intelligent defensible choices and I can look at that exact same data in different ways to give me different answers enough to get an r under .70.


#102          (see all posts) 2010/07/06 (Tue) @ 10:51

Tango/101,

In other words, you’re saying that methodology makes a HUGE difference in the answers that fielding systems give you.  I agree.  And that’s concerning to me.

I didn’t choose to report the difference in terms of correlation coefficient.  I don’t think that’s terribly illuminating.  MGL was the one who offered it up as evidence.

To me, it’s much more meaningful to say that we’re looking at +/- 7 runs of difference on a season level simply due to methodology.  Add in the error from data bias and random error, and the overall error in seasonal fielding measurements is higher than +/- 7 runs.  How much higher, nobody seems to know.


#103          (see all posts) 2010/07/06 (Tue) @ 11:06

Nick/92, here is the same data for UZR as in #81 but this time for players who stayed with the same team from year1 to year2.

UZR-Range runs/inning, year2 vs. year1
Right field: r=.55, slope=.63, n=104
Center field: r=.44, slope=.44, n=115
Left field: r=.56, slope=.60, n=101
Third base: r=.42, slope=.45, n=122
Shortstop: r=.53, slope=.49, n=114
Second base: r=.52, slope=.50, n=113
First base: r=.30, slope=.27, n=119
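
To put these next to the numbers for players who switched teams, here is a small side-by-side of the UZR r values from post #81 and from the list above. The values are copied from the two posts; the variable names are mine. The switcher samples are small (n=21 to 34), so the gaps themselves are noisy.

```python
# Year-to-year r for UZR range runs/inning, from post #81 (switched teams)
# and post #103 (stayed with the same team).
switched = {
    "Right field": 0.05, "Center field": 0.05, "Left field": 0.23,
    "Third base": 0.38, "Shortstop": 0.41, "Second base": 0.18,
    "First base": 0.13,
}
stayed = {
    "Right field": 0.55, "Center field": 0.44, "Left field": 0.56,
    "Third base": 0.42, "Shortstop": 0.53, "Second base": 0.52,
    "First base": 0.30,
}

# How much higher the correlation is for players who did not change teams.
gap = {pos: stayed[pos] - switched[pos] for pos in stayed}

for pos, diff in sorted(gap.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pos}: stayed r={stayed[pos]:.2f}, switched r={switched[pos]:.2f}, gap={diff:+.2f}")
```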


#104    Peter Jensen      (see all posts) 2010/07/06 (Tue) @ 11:11

Now that Rally has incorporated MLB hit locations into TotalZone we will have another pair of PBP fielding metrics that use the same data.  I am excited about comparing BZM and the new Totalzone to see how close we are in general and at what specific points we differ.  That was the whole point for me in creating BZM in the first place.  I hope that a point by point comparison between the metrics will lead to a philosophical discussion of the many methodological choices that are made in creating a fielding metric and lead to improvements in the construction of fielding metrics.  I wish we had more people using the MLB hit locations.  At one point both Brian and Colin had fielding metrics in the works that were going to use the hit locations, but I don’t think that Brian has incorporated them yet, and I don’t know where Colin is in the development of his metric.


#105    Colin Wyers      (see all posts) 2010/07/06 (Tue) @ 11:14

Tom, you’ve discussed the new FRAA (not entirely kindly) before:

http://www.insidethebook.com/ee/index.php/site/comments/intentionally_using_less_data/

And I mean we can take the question up again - once you go from seasonal totals to PBP, what data should you use? What data should you maybe consider not using? How should you use that data? And how do you know those decisions are right?

Because, as you say:

I can create two (intelligent) fielding systems right now based on the same underlying data, and I wouldn’t be surprised if I could get r under .70.  I can ignore park in one system, ignore batting handedness in another, create small buckets of zone in one, and smoothed out buckets in another.  I can ignore base/out situations in one and GB/FB tendency in another.  I can baseline against one-year of data or ten years, etc.  I can make a whole bunch of intelligent defensible choices and I can look at that exact same data in different ways to give me different answers enough to get an r under .70.

And so where does that leave us?


#106    Colin Wyers      (see all posts) 2010/07/06 (Tue) @ 11:17

Peter, development stalled when I found some issues with how the batted ball vector data behaves at the 1B/2B and SS/3B boundaries that I’ve been unable to reconcile, and that’s shaken my confidence in the usefulness of the vector data. I’ve never seen anyone else mention this issue, so maybe I’m overthinking things. It could be an issue of using too fine a curve for the LOESS, maybe.


#107    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 11:27

The blueprint for UZR was posted several years ago here, along with the adjustment factors:

http://www.insidethebook.com/ee/index.php/site/comments/mgl_archives/

I don’t understand why anyone would say that UZR is a black box.  He not only tells you the parameters he used, but he tells you enough about the adjustment factors that you can try to replicate UZR.  As an example, quoting MGL’s article:

FB Out Percentages by Fly Ball Depth and Batted Ball Speed

Fly ball depth      Easy   Medium   Hard
all                 .697     .931   .728
deep                .957     .939   .708
medium              .832     .944   .855
short               .591     .882   .877

How is the batted ball speed applied to UZR? Rather than using a park factor type adjustment, I opted to “split” each zone into six separate “sub-zones”, and keep track of player outs and chances and league outs and chances separately in each “sub-zone”. Why six and not three (soft, medium, and hard)? Well, I also “tacked on” the handedness of the batter, which is another important adjustment, as you will see later on. In other words, a fielder’s runs saved or cost is calculated six separate times for each zone on the field. I warned you that there were going to be lots of “rigor versus sample size” issues in the UZR adjustments!

PECOTA doesn’t go into those kinds of details.  Neither does Chone or ZiPS.  Marcel does give you that kind of detail.  UZR is reproducible to some extent, just like Marcel.

Anyway, MGL did do what I said he should do: UZR-Basic and UZR-advanced.

And for the 27 SS in 2002, he shows this:
ZR
UZR-Basic
UZR-Advanced

And I ran a correlation of all three:
.70 ZR to UZR-advanced
.80 ZR to UZR-basic
.85 UZR-basic to UZR-advanced

As MGL said in that article:

If you compare the above charts with those at the beginning of this article, you will see that STATS simple ZR correlates very well with UZR. This suggests that ZR is a pretty good measure of fielder ability, assuming that UZR is the “gold” standard.

And that’s for single-season UZR.  Correlation will obviously go up (systematic biases notwithstanding) as the number of games goes up.  Which is why plain ZR is decent enough for a player’s career.


#108    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 11:30

To me, it’s much more meaningful to say that we’re looking at +/- 7 runs of difference on a season level simply due to methodology.  Add in the error from data bias and random error, and the overall error in seasonal fielding measurements is higher than +/- 7 runs.  How much higher, nobody seems to know.

I agree, but it doesn’t damn either system.  It’s just illuminating that it depends on how you choose to slice the data.

Also, when you did the 7.5 runs difference, is that based on the unadjusted or adjusted Dewan DRS?  Because, as we’ve said, it’s bothersome to present something as above/below average, and possibly have every player above average in a given season.


#109    Colin Wyers      (see all posts) 2010/07/06 (Tue) @ 11:31

For qualified starters in ‘09:

Correl. between (TB+BB+HBP)/PA and wOBA: .94
Correl. between ABSO and wOBA: .98

==============================

Tom, have you read the Kevin Maas chapter in Baseball Between The Numbers? Or Silver’s introduction to PECOTA in (I think) the ‘03 Annual? You can read the latter article for free on Amazon with the “Look Inside” feature.


#110    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 11:44

Tom, you’ve discussed the new FRAA (not entirely kindly) before

I forgot I wrote that.  And I was perfectly kind.  I for example would have no problem repeating everything I said word-for-word to Clay. Not that my kindness level matters anyway.

And I mean we can take the question up again - once you go from seasonal totals to PBP, what data should you use?

At the risk of repeating myself:

At the VERY LEAST, why can’t FRAA be changed to use what really happened?  That’s it.  Just that, instead of this philosophical discussion, can’t FRAA be changed to use actual data?

So, to do FRAA v2.0, the ONLY thing Clay has to do is to stop estimating those things that can actually be counted.  That’s it.  Just that. That’s where Clay can stop.

AFTER THAT, after FRAA2.0 is done, we can have the discussion as to what other data to use (how much overkill it is, etc).  For example, when Pinto did his original PMR, he had included the identity of the batters and pitchers.  He discarded it very quickly because he saw that the results didn’t change much.

In WOWY, the batter and pitcher identities are crucial, and so, I have to have it.  Why?  Because I don’t use zone data, and PMR did use slice data.  So, you need to use some sort of distribution of BIP, either inferred like I do, or recorded by stringers as he (and everyone else) does.

But, Clay could try to estimate that in other ways as well if he wanted.

My original point remains: of those things that are not in doubt, why is Clay trying to estimate it, if we can count it?

My post 8 in that thread you posted summarizes my feelings on the matter:

Yes, there are sound reasons to leaving out data in a metric.

But, there are no sound reasons for leaving out data in this particular metric.  Indeed, Clay in the book reiterates what he has often given as the reasoning for his bias toward the one-system approach.

This is wrong.

Suppose, for example, you have plus/minus figures going back only to 1967.  Are you then going to completely ignore plus/minus for all players because you don’t have it for Maurice Richard?

Or suppose you have saves going back only to 1982.  Are you going to completely ignore the number of shots faced by Martin Brodeur and Dominik Hasek because you don’t have it for Jacques Plante?

No. If the data that you want would help you in evaluating a player, then you cannot discard that piece of data because you don’t have it historically.

FIP and DIPS intentionally leave out the data because they are only interested in evaluating a player on plate appearances that don’t involve the fielder.  Just as OBP intentionally ignores the fact that a HR is more valuable than a walk, because it is only interested in certain facets of the player’s performance.

For what Clay is doing, it is simply a bad choice on his part to stick to his bias of preferring the single-methodology approach.

Colin, exactly what is it that I have said that you are disagreeing with?  Or, are you not disagreeing at all with what I’ve said?


#111    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 11:46

Peter, development stalled when I found some issues with how the batted ball vector data behaves at the 1B/2B and SS/3B boundaries that I’ve been unable to reconcile, and that’s shaken my confidence in the usefulness of the vector data. I’ve never seen anyone else mention this issue, so maybe I’m overthinking things.

Colin, if you are talking about how the plotting of the batted ball by the stringers would be conditional (subconsciously biased) on where the fielders are positioned and/or made the play, then this is a known issue that has been discussed.

Are you talking about something else?


#112          (see all posts) 2010/07/06 (Tue) @ 11:47

Also, when you did the 7.5 runs difference, is that based on the unadjusted or adjusted Dewan DRS?  Because, as we’ve said, it’s bothersome to present something as above/below average, and possibly have every player above average in a given season.

It’s based on what was on Fangraphs in early April 2010, with no adjustments.  However, at the season level, DRS and UZR should be using the same baseline, according to what Ben said.  The different baselines come into play when you are comparing partial seasons.

Let me check…

Huh, DRS for 2009 sums to +424 runs for MLB, whereas UZR sums to -1 runs.  Is there still something screwy with the baseline for DRS even at the season level?  The rolling baseline that Ben described would not address this problem.  Does anyone else see the same thing?

Making the adjustment to zero out DRS for the league makes a small change to the RMSE between the two systems.

For all players with >1200 defensive innings, it moves it from +/- 7.6 runs with DRS unadjusted to +/- 7.2 runs with DRS adjusted.  For all players with >900 defensive innings, it moves it from +/- 6.9 runs unadjusted to 6.5 runs adjusted.
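
The kind of adjustment described above can be sketched as follows. This is only an illustration: the post does not say how the league total was zeroed out, so the sketch assumes the surplus is removed in proportion to innings played, and the four players and their run values are invented, not real DRS or UZR figures.

```python
import math

def zero_league_baseline(runs, innings):
    """Shift each player's runs so the league total is zero.

    Assumption: the league surplus is spread in proportion to innings played.
    """
    rate = sum(runs) / sum(innings)  # league surplus in runs per inning
    return [r - rate * inn for r, inn in zip(runs, innings)]

def rmse(a, b):
    """Root-mean-square difference between two equal-length lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Invented numbers for four full-time players (not real data).
innings = [1300, 1250, 1210, 1240]
drs = [12.0, 5.0, -3.0, 8.0]
uzr = [6.0, -1.0, -7.0, 2.0]

drs_adj = zero_league_baseline(drs, innings)
print("league total after adjustment:", round(sum(drs_adj), 6))
print("RMSE unadjusted:", round(rmse(drs, uzr), 2))
print("RMSE adjusted:  ", round(rmse(drs_adj, uzr), 2))
```

In the real comparison the baseline shift was small relative to the player-to-player disagreement, which is why the RMSE only moved by a few tenths of a run.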


#113    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 11:48

And so where does that leave us?

Room for improvement and discussion, as long as you start with the pbp data.


#114    Colin Wyers      (see all posts) 2010/07/06 (Tue) @ 11:51

Tom, the data that Clay left out of nFRAA is the hit location data. You may disagree with the reasons Clay stated for excluding it. I don’t agree with the reasons Clay stated for excluding it.

But that’s the sort of data that’s at the crux of this discussion - how good is that data? How much accuracy does it add to the metrics? How much bias does it add to the metrics? Does the added accuracy outweigh the added bias?

And I think before we proceed to determine what metrics are “good” or “bad” we need to first determine the validity of that data. That’s all I’m saying.


#115    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 11:52

Tom, have you read the Kevin Maas chapter in Baseball Between The Numbers? Or Silver’s introduction to PECOTA in (I think) the ‘03 Annual? You can read the latter article for free on Amazon with the “Look Inside” feature.

Yes, I read both.  I remember in one of the annuals, he talked about PECOTA for about 2 or 3 pages, where he listed the ten or so parameters he considered.  That was good.  But, he didn’t specify the how.  He didn’t write it like MGL did so that it was reproducible.


#116    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 11:56

For qualified starters in ‘09:

Correl. between (TB+BB+HBP)/PA and wOBA: .94
Correl. between ABSO and wOBA: .98

What is the relevance here?  You are taking known hits, known extra base hits, and known outs, and you have one system that arranges it one way and another that arranges it another way.  The correlation would have to be in the high r=.9x.

With fielding systems, you are taking known outs (in some systems), estimated outs (in another), and estimated hits (for all systems), and trying to find the correlation.


#117    Colin Wyers      (see all posts) 2010/07/06 (Tue) @ 11:56

Mike, when I reported my figures for UZR and DRS, I used rPM for DRS (just the range/fielding component, in other words) and RngR+ErrR for UZR. DRS has some components, like the arm ratings and the home runs saved, that don’t sum up to zero. I tend to ignore those because I don’t know what they mean (home runs allowed relative to all home runs allowed?).


#118    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 12:03

Tom, the data that Clay left out of nFRAA is the hit location data. You may disagree with the reasons Clay stated for excluding it. I don’t agree with the reasons Clay stated for excluding it.

I don’t have (much of) an issue with ignoring hit location data, as long as you compensate for it somewhere.

I’m obviously out of the loop somewhere, as I’m making assumptions that it seems I should not.  Let me start from the beginning.

1. Is FRAA estimating, or using the actual, handedness split data?

2. Is FRAA estimating, or using the actual, unassisted putouts for 1B?


#119          (see all posts) 2010/07/06 (Tue) @ 12:05

Thanks, Colin.  I did the same for what I reported in #81 and #103.  I’m not sure why I didn’t do the same for the overall comparison.  Probably because I did the overall comparison first before I looked at the more granular data.

rPM sums to -11 runs for the league in 2009.  UZR’s RngR and ErrR sum to -4 runs for the league.  Both are close enough to zero, with rounding, for me.

New RMSE figures comparing only range between Plus/Minus and UZR are +/- 6.5 runs for players with >1200 innings and +/- 5.8 runs for players with >900 innings.


#120    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 12:06

But that’s the sort of data that’s at the crux of this discussion - how good is that data? How much accuracy does it add to the metrics? How much bias does it add to the metrics? Does the added accuracy outweigh the added bias?

And I think before we proceed to determine what metrics are “good” or “bad” we need to first determine the validity of that data. That’s all I’m saying.

I obviously agree with you, since I have WOWY, which takes the position to ignore all subjective data.  It is not only fair for you to bring up the subjective issue, but a necessity.  So, there’s no debate on this issue (with me anyway).

All I’m saying for my part (the ONLY thing) is for FRAA to not estimate what we can already count, as evidenced by my two questions in the previous post.  If Clay is estimating those things, he’s wrong.  If he’s counting them (in the seasons we have it), then this is the longest 50-post discussion where we simply find out we agree.


#121    Colin Wyers      (see all posts) 2010/07/06 (Tue) @ 12:09

For 2005 through now, FRAA is using play by play data. It has actual handedness data, actual unassisted putouts for 1B, etc. It uses batted ball data (it lumps LD/FB/PU into one category, air ball, rather than distinguishing between them separately).  If you want to argue that it should be doing all of this back to ‘89 (where we have batted ball data on all events), yes, it should. If you want to say that the non batted ball portions of it should go back to 1950, now, yes, you’re right.

So on that point, I think we all concur. Is that what you wanted to hear?


#122    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 12:30

Colin, yes, that’s exactly what I wanted to hear. 

This is the longest non-disagreement exchange I’ve ever had, other than my wife and I doing “you’re the schmoopie, no you’re the schmoopie”:

http://www.seinfeldscripts.com/TheSoupNazi.htm

JERRY: Which one you wanna go to shmoopy?

SHEILA: You called me shmoopy. You’re a shmoopy.

JERRY: You’re a shmoopy!

SHEILA: You’re a shmoopy!

JERRY: You’re a shmoopy!

GEORGE: All right, shmoopies...what’s it gonna be? Pick a theater.

Anyway, based on this exchange, FRAA may provide good value from 2005-onwards. 

Did I miss the announcement on the change to FRAA?


#123    MGL      (see all posts) 2010/07/06 (Tue) @ 12:37

I have a new (as of around 3 weeks ago I think) article on FG that pretty much describes everything in UZR, including the 2010 changes.  David A. should have a permanent link to it somewhere in the fielding leaders section, but I don’t think he does. He needs to update his glossaries too.

HR saved does not scale to above/below average for some reason. Basically, if you save a HR, you get some credit for X runs (I am not sure how many - I guess it should be around 1.7), regardless of how many HR the average player at that position saves, which is ridiculous of course.

It should NOT be added to PM runs to give us DRS. That is horrible!  I wish that were fixed.  David A. should scale HR saved to above/below average and THEN add it to PM runs if Dewan does not want to do that.

RMSE tells me nothing that means anything to me.  If Mike had said the RMSE between UZR and DRS were 5.2 runs or 6.8 runs or 7.3 runs or 3.9 runs, I can guarantee that Tango, me, and everyone else would think, “OK, that’s nice.” RMSEs are nice for comparing one thing to another, but by itself an RMSE, unlike something like “r”, doesn’t really mean anything in terms of whether two things agree with one another.  The reason is this:

If I have a metric with an enormous spread, say -500 to +500 and I compare it to another similar metric with a similar spread, if the two agree very well with a high correlation coefficient, the RMSE is going to be quite high simply because the SD of the metric is so high.  RMSE divided by the SD of one or both of the metrics might mean something, I don’t know (is that a thing that is ever used?) - but just the RMSE by itself?  No.

And FWIW, I agree 100% with Tango about systems or metrics whereby one uses more (or more granular) data than the other.  As long as the extra data is handled reasonably intelligently, and the data is reasonably good, it is virtually guaranteed that the more complex system is the better one.  I typically would not waste my time testing for that.  Now, how much better is another story…


#124    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 12:49

I am not sure how many - I guess it should be around 1.7

1.6 is actually the correct number, if you presume that
a. long fly outs are not as damaging as regular fly outs (i.e., runners move up)
b. a HR saved can still lead to extra base hits (not necessarily outs)

So, if you knock out about .05 runs for a) and .05 runs for b), 1.6 is better than 1.7.


#125    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 13:05

I did a simple little test, where I took 40 numbers from -19 to +20, and then sorted the top 10 randomly.  Then sorted the (new) 6th through 15th randomly, and so on.  I get this:

20 13
19 18
18 17
17 19
16 16
15 10
14 20
13 6
12 11
11 9
10 14
9 5
8 12
7 1
6 2
5 4
4 -4
3 8
2 -3
1 3
0 15
-1 -1
-2 -2
-3 -5
-4 -6
-5 -10
-6 0
-7 -13
-8 -12
-9 7
-10 -16
-11 -11
-12 -15
-13 -14
-14 -8
-15 -9
-16 -19
-17 -7
-18 -17
-19 -18

Is that a good match, or not a good match?  I dunno… seems good to me.

r=.89
r-squared=.78

RMSE=5.5

***

RMSE divided by the SD of one or both of the metrics might mean something, I don’t know (is that a thing that is ever used?)

That’s pretty much what r is, right?

Using the above sample numbers, the SD of either of the two columns is 11.7.  The SD of the differences of the two columns is 5.5.

5.5/11.7, squared, is .22.

One minus that is .78, which is r-squared.
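
For anyone who wants to check these figures, here is a minimal sketch that recomputes them from the 40 pairs listed above. The helper names are just for illustration, and it assumes the SD quoted in the post is the sample SD (dividing by n-1).

```python
import math

# Second column of the table above; the first column is 20, 19, ..., -19.
shuffled = [13, 18, 17, 19, 16, 10, 20, 6, 11, 9,
            14, 5, 12, 1, 2, 4, -4, 8, -3, 3,
            15, -1, -2, -5, -6, -10, 0, -13, -12, 7,
            -16, -11, -15, -14, -8, -9, -19, -7, -17, -18]
original = list(range(20, -20, -1))

def mean(v):
    return sum(v) / len(v)

def sample_sd(v):
    """Standard deviation with the n-1 denominator (an assumption)."""
    m = mean(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))

def pearson_r(a, b):
    """Pearson correlation computed from centered sums."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) *
                           sum((y - mb) ** 2 for y in b))

diffs = [x - y for x, y in zip(original, shuffled)]
rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
r = pearson_r(original, shuffled)
sd = sample_sd(original)

print(f"r = {r:.2f}, r-squared = {r * r:.2f}")
print(f"RMSE = {rmse:.1f}, SD = {sd:.1f}")
print(f"1 - (RMSE/SD)^2 = {1 - (rmse / sd) ** 2:.2f}")
```

Run as written, this should land on the r=.89, RMSE=5.5, and SD=11.7 quoted in the post.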


#126          (see all posts) 2010/07/06 (Tue) @ 13:26

It was drilled into my head in my physics training that no measurement has meaning unless you know something about the error associated with that measurement.

What I have provided is an estimate of the error due to methodology in fielding systems, at +/- 7 runs per season for full-time players.

That’s far more meaningful than simply knowing the correlation coefficient between two systems.  Yes, if you also know the SD of the distribution, you can use r and the SD to arrive at the RMSE.  So, I’m not saying r is useless, but on its own, it is not very illuminating.

If you quote WAR, the error on the fielding component is, at a minimum, +/- 0.7 wins due to methodology error.  Of course, the total error on the fielding component is probably much larger, but we know that it is at least +/- 0.7 wins.  Does anyone disagree with this?

Whether you consider that good or bad is really irrelevant.  That’s the beauty of putting a number on something rather than arguing about qualitative descriptions.

If you want to consider +/- 0.7 wins per season a “good” or “acceptable” or “bad” error margin, any of those descriptions are fine with me, I guess.  That’s not really important.

The far more important questions are whether that estimate is accurate and what the error due to the data bias is.  I’d far rather entertain a discussion about the objective numbers than whether we want to subjectively label them good or bad.  Subjective labels don’t help us quantify fielding.


#127    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 13:38

I see the 6 or 7 run RMSE and… I have no reaction.  Until I ran my simple test above, I did not appreciate how close an RMSE of 5.5 is, when the observed SD is 11.7.

Mathematically, I should have realized it was very close.

When the spread in the observed difference is 87% as wide as the standard deviation of the observations, the correlation is r=.50.

sqrt(1-.87^2)=.50

Mike is reporting an RMSE of around 6.5.  The SD of all the observations is, I dunno, 10?  That gives you r=.76.  (Is that the correlation coefficient you have Mike?)

Seems to me that I should not have much of a reaction if Mike reports RMSE of 6.5 between two systems using the same data that try to estimate what an average player would do under two separate systems.
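
A hedged side note on the conversion: sqrt(1 - ratio^2) is exact when the "spread" is the residual SD around a regression line. If the RMSE is instead the SD of the plain differences between two systems that share the same SD and mean, the identity is r = 1 - RMSE^2 / (2 * SD^2). The two agree closely when r is high and drift apart as r falls. A small sketch comparing them (function names are illustrative, and the SD of 10 is the guess from the post):

```python
import math

def r_shortcut(rmse, sd):
    """Shortcut used in the post: r = sqrt(1 - (RMSE/SD)^2)."""
    return math.sqrt(1 - (rmse / sd) ** 2)

def r_equal_sd(rmse, sd):
    """Assumes both systems have the same SD and mean, and RMSE is the
    root-mean-square of the simple differences between them."""
    return 1 - rmse ** 2 / (2 * sd ** 2)

for rmse, sd in [(5.5, 11.7), (6.5, 10.0), (8.7, 10.0)]:
    print(f"RMSE={rmse}, SD={sd}: "
          f"shortcut r={r_shortcut(rmse, sd):.2f}, "
          f"equal-SD r={r_equal_sd(rmse, sd):.2f}")
```

For the 6.5 and 10 case the shortcut gives about .76 and the equal-SD identity about .79, so either way the answer is in the same neighborhood.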


#128          (see all posts) 2010/07/06 (Tue) @ 13:57

What reaction you have I suppose depends on your expectations.  That’s why we finally have to quit talking about opinions and reactions to the data.  Of course everyone will have a different reaction because they had different expectations coming in.  I expected the methodology error to be much lower, so 7 runs seems bad to me.  But that’s irrelevant.

If we can come to an agreement about the numbers, that’s good enough.

(As to the correlation coefficient you asked about in #127, the exact number depends on exactly which sample of data you are talking about, but yes, that’s the right ballpark.  I think MGL quoted r=.79 for one particular sample, and I think I agreed with that.)

What’s the error on the batting measure in WAR?  The baserunning portion?  The aging adjustment?  And then of course we don’t know yet what the error is due to the fielding data bias.  Somebody may have calculated the error in the fielding measurement due to sample size--I suspect that’s been done--but I don’t know it off the top of my head.

Take all those errors and add them together (in quadrature) and you’ll know the error associated with WAR.

With the information about error sizes in hand, you could also intelligently decide whether you are better off including fielding in WAR or not for a given application, which was the original question in the thread, yes?


#129          (see all posts) 2010/07/06 (Tue) @ 14:01

Let me clarify one point from #128.  You add errors in quadrature if they are independent.  That seems like a decent assumption here, but somebody might show that it’s wrong, so I wanted to note that.
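
Adding in quadrature just means taking the square root of the sum of the squared errors. A minimal sketch, assuming the components are independent; only the 7-run methodology figure comes from this thread, and the other component values are placeholders made up for illustration.

```python
import math

def combine_in_quadrature(errors):
    """Total error from independent components: sqrt(e1^2 + e2^2 + ...)."""
    return math.sqrt(sum(e * e for e in errors))

# Only the 7.0 is from this thread; the rest are hypothetical placeholders.
components = {
    "fielding methodology": 7.0,
    "batting": 3.0,       # placeholder
    "baserunning": 1.0,   # placeholder
}

total = combine_in_quadrature(components.values())
print(f"combined error: +/- {total:.1f} runs")
```

Note how the largest component dominates: with these inputs the total is only a little above 7, which is why pinning down the biggest error source matters most.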


#130    Colin Wyers      (see all posts) 2010/07/06 (Tue) @ 14:05

Tom, can you explain where those formulas are coming from? My understanding is that correlation is just the covariance divided by the product of the standard deviations of the two sets.

The formula for covariance is similar to the formula for variance, which is reminiscent of RMSE:

http://www.visualstatistics.net/Visual%20Statistics%20Multimedia/covariance.htm

In this case the square of the RMSE between the two sets may in fact represent covariance, since the mean for each should be zero.
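
For reference, here is a short derivation of how RMSE, covariance, and r relate. It is a sketch under two assumptions that are not guaranteed for real fielding systems: both sets have the same mean, and both have the same standard deviation (written as sigma below).

```latex
\begin{align*}
\mathrm{RMSE}^2 &= \frac{1}{n}\sum_{i}(x_i - y_i)^2
  = \operatorname{Var}(X) + \operatorname{Var}(Y)
    - 2\operatorname{Cov}(X,Y) + (\bar{x} - \bar{y})^2 \\
\intertext{With $\bar{x} = \bar{y}$, $\operatorname{Var}(X) = \operatorname{Var}(Y) = \sigma^2$,
  and $\operatorname{Cov}(X,Y) = r\,\sigma^2$:}
\mathrm{RMSE}^2 &= 2\sigma^2 - 2r\sigma^2 = 2\sigma^2(1 - r) \\
r &= 1 - \frac{\mathrm{RMSE}^2}{2\sigma^2},
\qquad
\operatorname{Cov}(X,Y) = \sigma^2 - \tfrac{1}{2}\,\mathrm{RMSE}^2
\end{align*}
```

So under these assumptions the squared RMSE is not the covariance itself, but the covariance can be recovered from it once the common variance is known.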


#131    Jeremy      (see all posts) 2010/07/06 (Tue) @ 15:17

Colin, do you think you could expand on this?

“I found some issues with how the batted ball vector data behaves at the 1B/2B and SS/3B boundaries that I’ve been unable to reconcile.”

Thanks.


#132    Colin Wyers      (see all posts) 2010/07/06 (Tue) @ 15:31

Let’s swing this around to the original premise - that using any play by play fielding metric, so long as it is “intelligent” and uses some reasonable sort of data, is better than using no fielding metric whatsoever. Call that the “VORP assumption.”

And let’s lay out some common assumptions for a certain use case for the data. Not all use cases for the data follow these assumptions, but they seem to describe the use case that Tom was getting at with his post:

1) We are talking “value” or “performance,” not talent. We aren’t trying to estimate a player’s performance going forward, although obviously the better we can estimate performance the better we will do at projecting/estimating talent.

2) We are talking, necessarily, of single season “samples” or smaller. Because that’s how baseball is played. For projecting, yes, the single season is essentially an arbitrary unit. But in terms of description it is all that matters.

Okay, so we know that:

1) According to UZR and/or DRS, the spread of fielding performance (SD, which helpfully works like RMSE) for qualified starters is about 11 runs or so. That’s performance, not talent. (I’m using the numbers as a useful illustration, so while they are not meant to be for illustration purposes ONLY, I don’t pretend that I’ve been rigorous in selecting them.)

2) The RMSE between DRS and UZR on the same data for qualified starters is around 7.

So.

sqrt(11^2 - 7^2) = 8.5

In other words, if the error bars from everything ELSE are larger than 8.5 (or I would say even close to 8.5), then the VORP assumption is “correct” - you should prefer to use a metric that does not include fielding over one that does, as the error bars for your measurement are larger than the precision you’re able to use for measuring.

Everything else includes but is not limited to:

1) Errors in kind between UZR and DRS - that is to say, mistakes that both systems are making.

2) Errors in the data. You can subdivide this out again to errors that separate two data sources and (this is the real pain to deal with) errors in common with ALL sources of batted ball data.

Does anyone have reason to suggest that the sum of the error created by those two things is less than 8.5 runs per season for qualified starters?
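
The rule of thumb above can be written out as a small check. This is only a restatement of the arithmetic in this post, with illustrative names; the 11 and 7 are the rough figures quoted above, and the "other error" inputs are hypothetical.

```python
import math

def error_budget(spread_sd, methodology_rmse):
    """Room left for all other errors: sqrt(spread^2 - methodology^2)."""
    return math.sqrt(spread_sd ** 2 - methodology_rmse ** 2)

def fielding_worth_including(spread_sd, methodology_rmse, other_error):
    """True if the remaining error is smaller than the budget."""
    return other_error < error_budget(spread_sd, methodology_rmse)

budget = error_budget(11, 7)
print(f"budget for everything else: {budget:.1f} runs")

# Hypothetical values for the unknown 'everything else' error.
for other in (5.0, 8.0, 10.0):
    print(other, fielding_worth_including(11, 7, other))
```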


#133          (see all posts) 2010/07/06 (Tue) @ 15:48

Colin/132, it’s worth noting that what you say is true for individual players, such as might be the case for evaluating a potential free agent contract.

But in other applications you may be dealing with groups of players, in which case the errors that are not systematic in the same direction for the players in the group (i.e., random), will be smaller for the group than for the individual.

For example, hypothetically, let’s just say we had somehow determined that the error for everything else in the fielding measurement at the season level was 8.5 runs.  Combining that with a 7-run error due to methodology would make a total error of +/- 11 runs on the fielding estimate.

For one such player, the hypothetical error is +/- 11 runs.  But for a group of 100 players, the total error is sqrt(11^2 * 100) = 110 runs, such that the average error per player is 1.1 runs.  That is assuming that the error for each player is independent of the error for every other player.  Probably not completely true.
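
With independent errors, the total for a group of N players grows like sqrt(N), so the average per player shrinks like 1/sqrt(N). A minimal sketch of that arithmetic, using the hypothetical 8.5 and 7 run figures from this post and assuming full independence between players:

```python
import math

def group_error(per_player_error, n_players):
    """Return (total error, average error per player) for independent errors."""
    total = math.sqrt(per_player_error ** 2 * n_players)
    return total, total / n_players

# Hypothetical inputs from the discussion above.
per_player = math.sqrt(8.5 ** 2 + 7 ** 2)   # about 11 runs
total, average = group_error(11.0, 100)

print(f"per-player error: +/- {per_player:.1f} runs")
print(f"group of 100: total +/- {total:.0f} runs, "
      f"average +/- {average:.1f} runs per player")
```

For 100 players at +/- 11 runs each this works out to 110 runs in total, or about 1.1 runs per player. Any error shared across players (a park or scorer bias, say) would not shrink this way.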


#134    Colin Wyers      (see all posts) 2010/07/06 (Tue) @ 15:53

Everything you say is correct, Mike. If we only want to look at how POPULATIONS of players behave, we don’t have to worry as much about any of the particular causes of error, be they random or persistent.

But the original statement, as I read it, was that if we want to talk about how any PARTICULAR player did in, say, a season, we need to use a measure (like WARP or fWAR or rWAR) that includes fielding, and not a measure like VORP that doesn’t. And in that case, the way those errors behave at a population level doesn’t help us any.


#135    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 16:02

As I noted at the beginning:

Colin’s point is made that if you need to regress more than 50%, and if you are going to do 100% UZR (or TZ or FRAA) or 0% those stats, then you are better off using NO FIELDING.

This is the same argument as with BABIP: if you have decided to do 0% or 100% (if you limit yourself to those 2 choices), then 0% is preferable.

What Colin’s point really highlights is that choosing either/or is not the best choice.  That regressing performances should be done, even for MVP stuff, simply to account for the uncertainty of the metric.

You *could* make the argument that using 0% fielding is preferable to using 100% fielding.  But, you won’t be able to make the argument that 0% is better than 1%.  There’s going to be some point, like with BABIP, that you can’t simply use 0%, that there’s going to be some number in between that you should use.

As usual, best way is to be lazy, and use 50%, and then let the rest of us worry about what that number actually should be.


#136    Colin Wyers      (see all posts) 2010/07/06 (Tue) @ 16:24

And maybe I’m being too literal here, but again, whenever you say “regression” I run right back and look at this:

VarObs = VarTrue + VarRand + Error

I called error bias earlier, but since we’re being really literal I figured I would cover all my bases.

Regression based upon the number of opportunities, like illustrated in the Appendix to The Book, will take care of VarRand. For a future projection, we do care to eliminate or reduce the effect of VarRand, and so regression takes care of that.

What we are talking about is error. Regression doesn’t help that. Increasing sample size has no effect on certain kinds of errors. Park effects in the data on batted ball types is the example I’ve been using, but it’s one example of a persistent data error. There is the potential for other forms of data errors and other kinds of persistent biases in the methods as well.

I could be way off base here, but I don’t see how regression helps us, when we don’t know what those biases are or their magnitude.
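
The distinction can be made concrete with a small sketch. It assumes a player's measured value is true value plus independent random noise each season plus a persistent bias that is the same every season; the SD values are invented for illustration.

```python
import math

def measurement_error_sd(n_seasons, random_sd, bias_sd):
    """SD of (multi-season average minus true value).

    The random part averages down as 1/sqrt(n); the persistent bias does not.
    """
    return math.sqrt(random_sd ** 2 / n_seasons + bias_sd ** 2)

# Invented magnitudes, in runs per season.
RANDOM_SD = 8.0
BIAS_SD = 4.0

for n in (1, 4, 16, 100):
    err = measurement_error_sd(n, RANDOM_SD, BIAS_SD)
    print(f"{n:>3} seasons: error SD = {err:.2f} runs")
```

However many seasons are averaged, the error never drops below the bias term, which is the sense in which more sample (or regression sized to the sample) does not touch a persistent error.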


#137    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 16:54

Well, there are two kinds of “errors”: errors in recording the data (which may be random and/or systematic), and errors in applying adjustments.

I think you can make a reasonable enough argument that it’s just as likely that MGL and Dewan are “over adjusting” as they are “under adjusting”.  So, there’s no need to alter the results of MGL or Dewan with regard to errors associated with applying adjustments.

We can presume that there are some systematic errors being recorded.

So, I was wrong to use the word regression (as we use it to apply to true talent of players).  It’s a “regression” of the scoring practices of the stringers.  What I should say is that you split the difference between what they say about the data, and presuming that all players are equals.  So, if UZR says that Franklin Gutierrez is a +20 fielder, then for the purposes of MVP and salary talk, you should at least consider him as having performed as a +10 CF (who may even have been say a +2 true talent player, just for the sake of argument).


#138    Matt Swartz      (see all posts) 2010/07/06 (Tue) @ 17:04

I feel like Colin’s point hasn’t resonated loudly enough here for some reason, but it’s a very good point that I’d like to hear the answer to.  If there is a systematic bias in scorers, then regression does not take into account the issue.  If there are two players who are +20 fielders as measured (and unregressed) by UZR, but the press box is higher up in one stadium, and that makes more plays look like line drives, then you should regress the +20 fielder further (assuming that caught line drives give a defender more runs than caught fly balls).  Maybe I don’t quite understand how UZR is calculated, but it seems like what Colin is seeing is that certain players (and most likely, teams) will be always biased upwards or downwards even with infinite data.  That would mean that you can’t and shouldn’t regress all +X runs or -X runs fielders equally.  Maybe I don’t understand this argument as I’m following this thread for the sake of learning here, but scorer bias seems like an important issue that regression won’t hide any more than regressing HR/FB for hitters without looking at park effects to determine home run skill.


#139    Colin Wyers      (see all posts) 2010/07/06 (Tue) @ 17:06

You could just as easily make a reasonable argument that UZR is overadjusting. I would go to some effort to demonstrate, but as we know Chris Dial is paying attention to this thread I’ll go ahead and see if he does it for me.

Except I will add - obviously at least ONE of them has to be adjusting incorrectly, right? They’re using the same raw data - they can’t BOTH be right, can they?

As for the reasoning that we should do 50-50, isn’t that exactly how you tell everyone NOT to handle split credit?

http://www.insidethebook.com/ee/index.php/site/comments/how_to_split/

Or am I missing something that makes this a separate case?


#140    Colin Wyers      (see all posts) 2010/07/06 (Tue) @ 17:08

Er, more likely to be overadjusting.


#141    Guy      (see all posts) 2010/07/06 (Tue) @ 17:20

The most likely systematic bias in the data will be exacerbated, not remedied, by regression.  That is the bias toward rating plays as “easier” when they become outs, or when fielders get to them quickly.  Imagine having people rate the difficulty of 200 GBs into the 3B-SS hole from video.  Now, imagine that the fielders are digitally removed, and the video stopped before it’s clear whether the ball reaches the OF, and the plays are scored again.  Does anyone doubt that the balls that became hits will on average be rated as easier in the second scoring, while the outs become more difficult? 

Now, whether this is a 5% bias or a 15% bias or a 30% bias, I have no idea.  It may be so small that it hardly matters.  But it absolutely has to be there, and I think it’s worth figuring out how big it is.


#142    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 17:23

As for the reasoning that we should do 50-50, isn’t that exactly how you tell everyone NOT to handle split credit?

The split credit in that thread is immaterial here.  The issue there is about looking at one player as his own universe, and how everything else around him will (eventually) be “average”.  So, Bobby Orr is a +124, and you don’t divide that by 5 (or 6), and give Bobby Orr just credit for +20 goals.  You presume that he has played with the other 17 skaters on the team equally, and therefore presume that the +124 is him, 3 average Bruins forwards, one average Bruins defenseman, and one average Bruins goalie.  (It’s a reasonable enough guess, though you can also say that because he was +124, he probably played more with the better players; if for example Bobby Orr played 50% of the time, and Phil Esposito played 40% of the time, then you’d think that, on average, they were together on the ice 20% of the time; but, because Orr was a +124, Espo was probably on the ice a bit more with him, say 25% of the time).

In the issue we are discussing, it’s about figuring out what an average player would do given our player’s opportunity distribution.  If, for example, the average SS makes an out on 12.5% of all BIP, but UZR, with all its adjustments says that Jeter’s opportunities would give an average SS 11.5% outs, and Jeter himself made 11.5% outs, then there is likelihood of either overadjustment or bias in scoring (depending on number of BIP, naturally).

So, I was saying, split the difference, if you can’t trust UZR at 100%, and you don’t want to presume that everyone is equally average.  It’s a hedging the bet kind of statement.

And yes, Dewan and UZR can’t both be right.  They both agree that say Jeter made 11.5% outs on all BIP (factual evidence), but then one says that an average SS would have made 11.8% outs, and another says an average SS would have made 12.3% outs, while a league average SS made 12.5% outs.  You figure that the truth is somewhere in there.
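
To make the arithmetic in this example concrete, here is a minimal sketch. The 11.5%, 11.8%, 12.3% and 12.5% rates are the illustrative ones from the comment above; the 4,000 BIP total, the function name and the "system_a"/"system_b" labels are assumptions added only for illustration, and no claim is made about which real system corresponds to which baseline.

```python
def plays_vs_baseline(actual_rate, baseline_rate, bip):
    """Plays made above (+) or below (-) what the baseline expects."""
    return (actual_rate - baseline_rate) * bip

BIP = 4000          # hypothetical number of balls in play behind the SS
ACTUAL_RATE = 0.115  # the SS turned 11.5% of all BIP into outs

# Three competing views of what an average SS would have done
baselines = {
    "system_a": 0.118,    # one adjusted estimate
    "system_b": 0.123,    # the other adjusted estimate
    "league_avg": 0.125,  # no adjustment at all
}

results = {
    name: plays_vs_baseline(ACTUAL_RATE, rate, BIP)
    for name, rate in baselines.items()
}

for name, plays in results.items():
    print(f"{name}: {plays:+.1f} plays vs. average")
```

The same fielder comes out anywhere from a dozen to forty plays below average depending only on which baseline is believed, which is the sense in which the truth is "somewhere in there".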


#143          (see all posts) 2010/07/06 (Tue) @ 17:39

You figure that the truth is somewhere in there.

Yes, somewhere.

One thing people have suggested (which is not what I hear you saying, Tom, with that statement--correct me if I’m wrong) is that since DRS and UZR differ, we improve our measurement if we average the two measures.  I found that averaging DRS and UZR (let’s call that AVG) did nothing to improve the year-to-year correlation, whether you look at AVG vs UZR, AVG vs DRS, or AVG vs AVG from one year to the next.
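
A rough sketch of the comparison described above, for anyone who wants to repeat it. The player values are made up, and the table layout and helper names are assumptions; real inputs would be per-player DRS and UZR for two consecutive seasons.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical runs per player: (year 1, year 2). Invented numbers.
uzr = {"A": (10, 4), "B": (-5, -8), "C": (3, 6), "D": (-2, 1), "E": (7, 2)}
drs = {"A": (6, 8), "B": (-9, -3), "C": (5, 1), "D": (2, -4), "E": (3, 5)}

# AVG is the simple mean of the two metrics in each season
avg = {
    p: ((uzr[p][0] + drs[p][0]) / 2, (uzr[p][1] + drs[p][1]) / 2)
    for p in uzr
}

def year_to_year_r(year1_table, year2_table):
    """Correlate year-1 values of one table with year-2 values of another."""
    players = sorted(set(year1_table) & set(year2_table))
    y1 = [year1_table[p][0] for p in players]
    y2 = [year2_table[p][1] for p in players]
    return pearson_r(y1, y2)

for label, first, second in (
    ("UZR vs UZR", uzr, uzr),
    ("DRS vs DRS", drs, drs),
    ("AVG vs UZR", avg, uzr),
    ("AVG vs DRS", avg, drs),
    ("AVG vs AVG", avg, avg),
):
    print(label, round(year_to_year_r(first, second), 2))
```

With invented inputs the printed values mean nothing; the point is only the shape of the test, where averaging "helps" if the AVG rows beat the single-metric rows.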


#144    Nick Steiner      (see all posts) 2010/07/06 (Tue) @ 18:09

How much has the bias been shown to be?  I mean let’s take the LD park factors at face value:

http://www.hardballtimes.com/main/article/batted-balls-and-park-effects/

Angels stadium has a line drive park factor of .97 - I believe that number is regressed although I’m not sure how.  Honestly that doesn’t seem like such a huge bias that it is what is causing Torii Hunter to be rated a below average defender during his time in Anaheim.

MGL, what would happen to UZR scores if you accounted for the observed park bias in LD allowed?


#145    Colin Wyers      (see all posts) 2010/07/06 (Tue) @ 18:10

So, I was saying, split the difference, if you can’t trust UZR at 100%, and you don’t want to presume that everyone is equally average.  It’s a hedging the bet kind of statement.

But Guy is exactly right - if you’re talking about systemic error, not random error, that sort of “regression” won’t help any. If such and such a player is consistently over- or underrated, either for reasons having to do with the data or the method, you’re still going to see that represent itself if you halve everyone’s defensive rating.

I know we don’t want to presume that everyone is average defensively. But I’m still trying to find a reason to believe it isn’t at least as accurate as using one defensive metric, or even the aggregation of several of them.
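
The systemic-versus-random distinction can be shown with a toy case. This is a sketch under loud assumptions: two hypothetical fielders with identical true performance, a made-up scorer/park bias of 8 runs in opposite directions, and no random noise at all (as if the sample were infinite).

```python
def split_the_difference(rating, weight=0.5):
    """Hedge a rating halfway toward average (zero)."""
    return rating * weight

# Hypothetical: both fielders actually performed at exactly average
true_runs = {"player_x": 0.0, "player_y": 0.0}

# Hypothetical systematic bias from the data or the method
bias = {"player_x": +8.0, "player_y": -8.0}

observed = {p: true_runs[p] + bias[p] for p in true_runs}
hedged = {p: split_the_difference(r) for p, r in observed.items()}

for p in observed:
    print(p, "observed", observed[p], "hedged", hedged[p], "true", true_runs[p])
```

Halving shrinks the gap between the two from 16 runs to 8, but the favored player is still rated above the other even though they performed identically. The hedge reduces the size of the error without removing its direction.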


#146    Brian Cartwright      (see all posts) 2010/07/06 (Tue) @ 18:19

Guy 141 - something I’ve been advocating for a long time is for the scorer (be it BIS or BAM or whoever) to simply record which infielder had the best chance. 30 years ago, using a paper scoring system, my scorers would record a ground ball hit to lf as 7H, with a dot over the left part of the H if the 3b had the best chance at the ball, or a dot over the right side of the H if the SS had the best chance. Gameday could say “Chris Burke singles on a ground ball to left fielder Reggie Sanders, past shortstop Hector Luna”. Then we would know that it’s Luna’s ball and not the 3b.

Adding vectors would help decide the difficulty of the play, but given a large enough sample it evens out. At the core I want to count how many plays made, and also count how many plays not made.


#147          (see all posts) 2010/07/06 (Tue) @ 18:31

Nick/144, the problem is that we don’t know if LD park bias is the only bias involved.  Doesn’t the data in #81 and #103 imply that the park/team/scorer bias is much larger than the 3% park factor you quote?


#148    Colin Wyers      (see all posts) 2010/07/06 (Tue) @ 18:31

Brian, how is that not duplicating the exact problem Guy is talking about? At issue is the ability of a scorer to figure out the position of an object in relation to other objects.

Presume, as Guy says, 100 balls hit between the SS and 3B, such that (with an average 3B and average SS), each player would field roughly 50 percent of the outs. Now, if you have an SS who is above average, and a 3B who is below average, how is the SS not going to appear to a scorer to be responsible for a larger percentage of those balls than the 3B? At the point of scoring, the SS will appear to be closer to the ball and thus more responsible (whether it’s through better positioning or better range) than the average SS.

Guy (rightly) points out that this can occur when a scorer is asked to mark the location of the ball on the field. If you rephrase the question the way you state it, shouldn’t we expect at least the SAME amount of bias, if not more?

(And it occurs to me - at some level, scoring off video has to be worse than scoring in person for this. On TV, they try to present you as close a view of the action as possible, thus depriving you of other reference points.)


#149    Brian Cartwright      (see all posts) 2010/07/06 (Tue) @ 18:43

Nick #144 - I first wrote about LD park factors here
http://www.fangraphs.com/blogs/index.php/what-i-hate-about-line-drives/

Currently I have Anaheim Stadium at 0.81, lowest in MLB. Coors at 1.13 and Arlington at 1.12 are the highest.


#150    Nick Steiner      (see all posts) 2010/07/06 (Tue) @ 19:22

Brian, do you know why your numbers are so much more extreme than David’s?

Mike, I’m not sure about that.  For one, the “switched teams” numbers you quote are only dealing with N = ~25.  I really don’t think that’s a large enough sample to glean much compared to the “non switched teams”, which is N = >100.  Haven’t MGL and Tango been drilling the point home on this blog that the correlation is a function of both the amount of agreement between the two variables AND the sample size? 

Mike, could you please run the same numbers by position, but instead of using every player who stayed on the same team, could you randomly select a subset of those players so that you have the same sample size for each group?  Could you also run the same analysis using RMSE instead of R?

Also, I could think of a whole host of possible selection bias issues for players who switched teams so that the correlation or RMSE might be skewed.
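
The check requested above could look something like this. It is only a sketch: the year-to-year pairs are randomly generated stand-ins (both groups are built with the same assumed carryover of 0.5), and the group sizes of 120 and 25 merely echo the rough N values mentioned in this comment.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def rmse(xs, ys):
    """Root mean squared difference between year 1 and year 2."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

def subsample(pairs, n, seed=0):
    """Randomly pick n pairs without replacement."""
    return random.Random(seed).sample(pairs, n)

# Made-up data: year 2 = carryover * year 1 + noise
rng = random.Random(42)

def fake_pairs(n, carryover):
    pairs = []
    for _ in range(n):
        y1 = rng.gauss(0, 8)
        pairs.append((y1, carryover * y1 + rng.gauss(0, 8)))
    return pairs

same_team = fake_pairs(120, 0.5)
switched = fake_pairs(25, 0.5)

# Cut the larger group down to the size of the smaller one
matched = subsample(same_team, len(switched))

for label, pairs in (("same team (all)", same_team),
                     ("same team (matched N)", matched),
                     ("switched", switched)):
    y1 = [a for a, _ in pairs]
    y2 = [b for _, b in pairs]
    print(label, len(pairs), round(pearson_r(y1, y2), 2), round(rmse(y1, y2), 1))
```

Repeating the subsample with many different seeds would show how much r bounces around at N = 25 from sampling alone, which is the question being raised.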


#151    Brian Cartwright      (see all posts) 2010/07/06 (Tue) @ 19:37

Nick, other than methodology I don’t know.

David’s article at THT says it was using 2003-2007, which I presume is RetroSheet data.

My Fangraphs article used RetroSheet 2003-2008. Oliver uses Gameday 2005-2010, and results are consistent, same spread, small (generally <.05) differences in factors.


#152    MGL      (see all posts) 2010/07/06 (Tue) @ 22:25

"MGL, what would happen to UZR scores if you accounted for the observed park bias in LD allowed?”

I assume that in any parks where there are more line drives than there should be (because of scorer bias for whatever reason - height of press box, etc.), the outfielders will be overrated and in parks where there are fewer line drives, outfielders will be underrated. The effect should not be that great since I am already doing park adjustments, and these are in essence park factors.

Sure, we would like to adjust for that.  Do we know that line drive park factors represent scorer bias as opposed to legitimate park factors?


#153    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 23:01

If we consider a “line drive” to be based on a certain range of the speed off the bat, and a certain range of the launch angle, it would be hard to believe that the type of park could affect that. 

Why would a park for example, get a BIP to get the average Albert Pujols launched at an average launch angle of 12 degrees, but another one at 10 degrees, and one park at 92 mph and another at 89 mph (controlling for pitchers)?

All numbers for illustration purposes only.
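
As a sketch of what a measurement-based definition might look like: the angle and speed ranges below are hypothetical thresholds chosen only for illustration, not an established definition, and the batted-ball values reuse the illustrative 12 degrees / 92 mph and 10 degrees / 89 mph figures from above.

```python
# Hypothetical thresholds, for illustration only
LD_ANGLE_RANGE = (10.0, 25.0)    # launch angle, degrees
LD_SPEED_RANGE = (75.0, 120.0)   # speed off the bat, mph

def is_line_drive(speed_mph, launch_angle_deg):
    """True if both measurements fall inside the assumed line-drive ranges."""
    lo_a, hi_a = LD_ANGLE_RANGE
    lo_s, hi_s = LD_SPEED_RANGE
    return lo_a <= launch_angle_deg <= hi_a and lo_s <= speed_mph <= hi_s

def ld_rate(batted_balls):
    """Share of (speed, angle) pairs classified as line drives."""
    hits = sum(1 for speed, angle in batted_balls if is_line_drive(speed, angle))
    return hits / len(batted_balls)

park_one = [(92.0, 12.0), (88.0, 30.0), (95.0, 5.0), (90.0, 18.0)]
park_two = [(89.0, 10.0), (85.0, 28.0), (93.0, 3.0), (87.0, 16.0)]

print("park one LD rate:", ld_rate(park_one))
print("park two LD rate:", ld_rate(park_two))
```

Under a definition like this, a park could only change the line-drive rate by changing the measured speed or angle themselves, not by changing how the ball looks from the press box.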


#154    Nick Steiner      (see all posts) 2010/07/06 (Tue) @ 23:13

It could be the visual stimuli (although that has to be minor) or more likely air effects.  I would think Coors Field is more likely to yield line drives than others.


#155    Tangotiger      (see all posts) 2010/07/06 (Tue) @ 23:25

I believe, accept, and know that the visual stimuli will give you more or less contact in each park (K and BB rates are the most impactful of the park effects).

But, GIVEN that a ball has been contacted, are we going to expect to see the launch angle and speed off the bat vary at anything other than small differences?  I’m not going to expect a park to add 10% or 15% LD am I?


#156    Nick Steiner      (see all posts) 2010/07/07 (Wed) @ 01:52

Well I think it’s possible that the visual stimuli could change how well the batter squares up the ball, which could manifest itself in more or less line drives.  It isn’t contact or nothing, there are obviously different degrees of contact.


#157    Colin Wyers      (see all posts) 2010/07/07 (Wed) @ 02:19

Nick, I’m going to make this sound a lot easier to do than it is (if it were THAT easy, I’d have done it already - there are a lot of confounds), but here goes:

If the park effect is changing the actual trajectory of the batted ball, it will impact the batted ball outcomes. What, for instance, is the correlation between the park factor for LD% and the park factor for BABIP? If a park sees an increase in both (after you account for regression and use a large enough sample, etc.), then it’s likely that you’re seeing an effect on the actual batted ball. If you see a change in LD without a change in BABIP, scorer bias may be the more likely explanation. (There are other indicators to look at - correl between FB PF and HR/Con PF, maybe? Or look at TB/Hit, perhaps.)


#158          (see all posts) 2010/07/07 (Wed) @ 09:44

Nick/150, the slope should not be affected by the number of players in the sample.  That is why I presented that number.

Also, you’ll note that the numbers don’t go way up for non-switchers for the left side of the field, just the right side of the field, and, as I noted earlier, positions like 2B and CF with lots of chances don’t have higher numbers than 3B and LF.  So if you want to allege that this is all or mostly due to sample size, I’d like to see some evidence of that.

I’m happy to share my data if you want to do further investigation.  Of course it’s all freely available and very easily downloadable from Fangraphs using their Excel/CSV button.

I don’t think I have all the answers.  I definitely believe this merits a lot more investigation by the community, but I don’t think someone’s going to be able to simply deflate my findings without also getting his hands dirty in the subject.


#159    Chris Dial      (see all posts) 2010/07/07 (Wed) @ 11:09

I haven’t read very far, but DAMMIT, DRS was the name I put on my system a decade ago.  Dewan needs his own freakin abbreviation.

I feel like Microsoft is stealing my ideas.


#160    Tangotiger      (see all posts) 2010/07/07 (Wed) @ 11:34

Chris, I follow all the saber articles, and I completely forgot that the DRS name was yours. In any case, he didn’t steal any ideas from you.  But he did unintentionally lift your name.

What should be the proper redress for that?

IIRC, the term “Ultimate Zone Rating” was used in the STATS scoreboard, and then MGL popularized it, and now it’s associated only to him.

I had Leverage Index (LI) for years, and then Woolner created “Leverage” (LEV).  I’m not upset that he took the name, but I am upset at the potential confusion in the marketplace. 

(Interestingly, Phil Birnbaum created Relative Importance a bit after I created LI, and it has a very similar methodology to LI.  Pete Palmer predated LI by decades with his measure called Stress, with again, a similar methodology.  I don’t think Pete is upset with me that I created something similar, slapped a different name, and set it so that the scale is average = 1.)

It’s a tough call here.  What do you think should be done?


#161    Nick Steiner      (see all posts) 2010/07/07 (Wed) @ 12:18

Mike, the slope should not be affected by sample size, but the odds of pulling in a good correlation will be.  But I’d be glad to try to prove you wrong ;)

Do you think you could post your data here?


#162    Chris Dial      (see all posts) 2010/07/07 (Wed) @ 13:01

I follow all the saber articles, and I completely forgot that the DRS name was yours. In any case, he didn’t steal any ideas from you.  But he did unintentionally lift your name....What should be the proper redress for that?...It’s a tough call here.  What do you think should be done?

Oh, nothing.  It’s just tough cookies for me.  So it goes.  To me, it’s just disrespect - I am not significant enough to worry about in the industry.

But yes, I think MGL should be using something other than UZR - that was STATS.  Maybe Dewan figures if MGL can clip his, then he can clip mine. 

Truth be told, in 1997, defensive analysis had largely died before I resurrected it using STATS ZR (on USENET).  You and MGL etc did a critique of my work on Fanhome when James Fraser posted it on Baseball Scholars.  We had a specific discussion around the value of an out at the time (this was 1999 or 2000, IIRC), and you were initially unhappy with the value of a defensive play being ~0.8.  You solved it eventually though.  MGL in the thread said it could be done better with the raw data, and so he went and bought it.  That’s largely why there is no MGL-UZR before that discussion.  His methodology is nearly precisely mine (or was originally), including how one has to assign run values.  Likewise, until a few years ago, nearly everyone used my run values in all defensive analysis, but when the thank yous are passed around (or credit is given), I am passed over.  It does sting.  When I was interviewed in Popular Science for my system, I credited Dale Stephenson and Ron Johnson (and mentioned Pete Palmer’s work on LWts), despite the fact it had been well over a decade since either was involved in the work.

I’m sure I’ll lose OPD before too long as well.

All I ask for from anyone is that “yes, Chris Dial begat a good portion of this work”.  And of course, only if someone actually thinks it happened that way.


#163    dkappelman      (see all posts) 2010/07/07 (Wed) @ 13:26

Chris, the DRS acronym I believe is my fault and I just wasn’t familiar with your DRS fielding system.  I have not seen John Dewan refer to his system ever as DRS and solely as “Defensive Runs Saved”.

I originally on FanGraphs put up what is now DRS, under the heading +/-, but there was confusion between that and the +/- plays made, so I changed it to DRS because of “Defensive Runs Saved”. 

Actually, all the headings on FanGraphs for Dewan’s Fielding system, (rPM, rHR, etc...) are just my doing trying to squeeze in as much data in a limited amount of space as possible.


#164    Tangotiger      (see all posts) 2010/07/07 (Wed) @ 13:33

But yes, I think MGL should be using something other than UZR - that was STATS.  Maybe Dewan figures if MGL can clip his, then he can clip mine.

I would say Dewan had no knowledge of your work.  Why do I say that?  When I was shopping The Book around, I talked with ACTA Sports, and ended up talking with Dewan.  And he told me he was doing “this thing called plus/minus for fielders”. 

I said: “??  Uh, that sounds exactly like what my co-author has already popularized with UZR… do you know about it?”

Dewan was either completely surprised by it, or that his was very different.  Seeing that his system and MGL’s largely overlap in concept, it’s clear to me that Dewan was on the very periphery of the internet-saber age, if involved at all.  I would be surprised that he would know about your work.  Or anyone’s on the internet.  How could he not have known about MGL’s UZR, seeing that that was an advanced version of STATS’ UZR when he was there?  Well, he didn’t know.

Dewan does now surround himself with more internet-saber guys, like Ben J and studes, so, he’s far more aware these days.

Basically, we’re Canal St, he’s Wall Street, and he didn’t go up Broadway to see us.


#165          (see all posts) 2010/07/07 (Wed) @ 13:34

Nick/161, the data is literally straight from Fangraphs. 

What I did with it is make a worksheet for the Fangraphs data dump from each year, then did a bunch of VLOOKUP statements to create the year-to-year comparisons on a new worksheet.  Then I sorted based on innings played, position, and team-switching and made graphs from those y-t-y comparisons.

The raw data is better taken from Fangraphs than me trying to repost it somewhere.  The vlookup statements and charts aren’t easily postable here.  I could post the charts, but with the correlation coefficient and the slope, you basically know what the charts look like, and there are a lot of them such that it’s not a very efficient use of my time.

I’m willing to email the spreadsheet that I used to anyone that wants a copy of it.  It is a 10 MB .xlsx file.
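
For anyone who would rather script the workflow described above than build it with VLOOKUP, here is a minimal sketch of the same steps: join two seasons by player and position, filter on innings, flag team switchers, then compute the slope and r. The rows, the column layout and the 500-inning cutoff are all assumptions for illustration; an actual Fangraphs export will have its own columns.

```python
from math import sqrt

# Hypothetical rows: (player, team, position, innings, range runs)
season_1 = [
    ("p1", "AAA", "SS", 1200, 8.0),
    ("p2", "BBB", "SS", 1100, -4.0),
    ("p3", "CCC", "SS", 300, 15.0),
    ("p4", "DDD", "SS", 1000, 2.0),
    ("p5", "EEE", "SS", 1250, -6.0),
]
season_2 = [
    ("p1", "AAA", "SS", 1150, 5.0),
    ("p2", "ZZZ", "SS", 1000, 1.0),
    ("p3", "CCC", "SS", 900, 3.0),
    ("p4", "DDD", "SS", 1050, 0.0),
    ("p5", "EEE", "SS", 1200, -3.0),
]
MIN_INNINGS = 500  # assumed playing-time cutoff

def join_seasons(first, second, min_innings):
    """Match each year-1 row to its year-2 row (the VLOOKUP step)."""
    lookup = {(row[0], row[2]): row for row in second}
    joined = []
    for player, team, pos, innings, runs in first:
        match = lookup.get((player, pos))
        if match is None or innings < min_innings or match[3] < min_innings:
            continue
        joined.append({"player": player, "pos": pos,
                       "switched": team != match[1],
                       "y1": runs, "y2": match[4]})
    return joined

def slope_and_r(pairs):
    """Least-squares slope of year 2 on year 1, and Pearson r."""
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sxx, sxy / sqrt(sxx * syy)

rows = join_seasons(season_1, season_2, MIN_INNINGS)
same_team = [(r["y1"], r["y2"]) for r in rows if not r["switched"]]
slope, r = slope_and_r(same_team)
print("same team: n =", len(same_team), "slope =", round(slope, 2), "r =", round(r, 2))
```

The switched group would be handled the same way once it has more than a single toy row; real data would also be split by position before computing anything.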


#166    Tangotiger      (see all posts) 2010/07/07 (Wed) @ 13:38

David: well, they are both called Defensive Runs Saved (Dewan and Dial).  So, we’d expect them to both have the same acronyms too.  I don’t see any issue with you here.

I however would propose that SOMEONE change it to Fielding Runs Saved (unless of course someone else has that).  It bothers me to no end that “defense” = “fielding”, when defense is actually pitching+fielding.

That’s why I call it FIP, not DIP.  It wasn’t to differentiate it from Voros’ DIPS.


#167    Tangotiger      (see all posts) 2010/07/07 (Wed) @ 13:40

Googling this:
FRS “fielding runs saved”

And there are exactly TWO hits in the whole internet world.  And that was by our own Matt Swartz who was, funnily enough, referencing John Dewan’s work.


#168    Chris Dial      (see all posts) 2010/07/07 (Wed) @ 15:06

I would say Dewan had no knowledge of your work

I agree 100%.  I kid when I say he saw it and went “Oh, I’m using that”.  That’s why it just happens.  I apologize if my particular tone around that name thing is anything other than a curiosity.  David, I appreciate you making your comment.  I am just being flippant.  I probably shouldn’t write it anymore.  I do like Tango’s suggestion regarding Fielding runs.  Perhaps *I* should use that.  Oh, Palmer and Gillette had Fielding Runs, and that’s why I didn’t use it - I considered what was already in the vernacular.

OTOH, this wouldn’t have happened if my work were cited instead of “ZR runs”, given that *I* made the “ZR runs” and, AFAICT, STATS never even did it the way I did it; no one has.  Rally did a little originally (and we correlated at 0.97 using the same dataset).  So wherever you see ZR Runs, that has usually been DRS, but since no one bothered saying Dial’s DRS, it wasn’t “mainstream”, and so David didn’t pick up on it.

Again, I’m not a professional at this, and I am IN NO WAY offended or hurt that BIS uses DRS.  I worked very hard to move defensive analysis from the cast aside child in 1997 to MGL’s next generation in 2000, and being omitted from that timeline stings, considering both UZR and TZ were generated after lengthy consideration of my methodology and input.  I cite my progenitors, and all I ask is the same consideration.

I really appreciate SG at RLYW for keeping my work alive.


#169    Brian Cartwright      (see all posts) 2010/07/07 (Wed) @ 18:12

When I was reading references to DRS in this thread, I was assuming it was Chris Dial’s work.

After I posted the spreadsheet of my defensive numbers, I realized I called it FRAA - I used that term in my code thinking of it as a generic term, not realizing that it is BP/Clay’s brand name.

Oliver Defensive Runs? I’ll think of something.


#170    terpsfan101      (see all posts) 2010/07/07 (Wed) @ 19:00

There are a ton of metrics out there that share the same name/acronym. There are different versions of batting runs, runs created, WAR, fielding runs, park factors, etc… Just pick a name that best describes your system. Don’t give it an inferior name just because the name you want to give it is already in use.


#171    Chris Dial      (see all posts) 2010/07/07 (Wed) @ 19:35

Sure, terpsfan, but there are a handful of defensive stats.  Two have the same name.


#172    Tangotiger      (see all posts) 2010/07/07 (Wed) @ 19:47

Palmer may have Fielding Runs, but it’s not called Fielding Runs Saved, is it?


#173    terpsfan101      (see all posts) 2010/07/07 (Wed) @ 20:26

Chris, I don’t think that your work on DRS is unappreciated. I referenced your work a few times when I devised my fielding runs metric. You were way ahead of your time. Your metric still holds up well because zone rating is a solid metric, and the method you used for converting plays made into runs is very accurate. Your run values for plays made closely match what I got from the Retrosheet zone data (1989-1999). Your metric is actually better now than it was when you first devised it, since zone rating is a better stat now than it was 10 years ago. Prior to 2000 (not too sure about the exact year), zone rating didn’t charge a fielder with an opportunity when they made a play outside of their prescribed zone. But they did give the fielder credit for the plays made out of their zone. So a play made would get added to the numerator without an opportunity being added to the denominator.

Here’s an early link to Chris explaining DPI:

http://groups.google.com/group/rec.sport.baseball.analysis/msg/908d1d9b6b6674c4?pli=1

Chris, I hope you are still not using XR. About a year and a half ago, you said that you would switch to linear RC if someone came up with some equations for you. Did you ever use the linear RC formulas that I devised for you here:

http://www.insidethebook.com/ee/index.php/site/comments/my_1b_is_better_than_your_1b/
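
On the zone rating point earlier in this comment, the effect of the old bookkeeping can be sketched as follows. The two formulas are inferred only from the description given here (out-of-zone plays added to the numerator, with or without a matching opportunity in the denominator), not from any published STATS definition, and the counts are made up.

```python
def zone_rating_old(in_zone_plays, out_of_zone_plays, in_zone_chances):
    """Out-of-zone plays credited as plays made, but no opportunity charged."""
    return (in_zone_plays + out_of_zone_plays) / in_zone_chances

def zone_rating_new(in_zone_plays, out_of_zone_plays, in_zone_chances):
    """Out-of-zone plays count in both numerator and denominator."""
    return (in_zone_plays + out_of_zone_plays) / (in_zone_chances + out_of_zone_plays)

# Hypothetical season for one fielder
IN_ZONE_CHANCES = 400
IN_ZONE_PLAYS = 320
OUT_OF_ZONE_PLAYS = 40

old = zone_rating_old(IN_ZONE_PLAYS, OUT_OF_ZONE_PLAYS, IN_ZONE_CHANCES)
new = zone_rating_new(IN_ZONE_PLAYS, OUT_OF_ZONE_PLAYS, IN_ZONE_CHANCES)
print(f"old-style ZR: {old:.3f}")
print(f"new-style ZR: {new:.3f}")
```

Under the old-style formula the rating is inflated for anyone with many out-of-zone plays, and in principle it could even exceed 1.000.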


#174          (see all posts) 2010/07/11 (Sun) @ 21:49

Having just actually read this thread for the first time (and i know i’m late), would i be correct in assuming that at least a part of the bias would be corrected if we had Hit/FX Angle/Velocity Data (and we could theoretically create a new perfect system with Field/FX)? 

And Hit/FX is only currently being sold to teams?  I know it’s not public, but i was wondering if that sort of thing was available for sale like BIS data or if it was different.  Since it seems like it would solve a lot of problems here.


#175          (see all posts) 2010/07/12 (Mon) @ 12:45

Garik/174, the simple answer to all your questions is yes.  In practice it’s not so simple.  Not having spin information in the HITf/x data turns out to make a big difference for air ball trajectories, for instance.  But it could still be useful for some purposes.

I’m not aware that HITf/x data is being sold to anyone but the teams.


#176          (see all posts) 2010/07/12 (Mon) @ 12:51

Here is the update to post #81 and #103 using Total Zone data, as gathered from Baseball-Reference.com on or about July 10.  I assume that was still the old system at that point (not using Gameday horizontal spray angle data).  Unlike for UZR and Plus/Minus, the range-only runs were not split out, so I used total fielding runs.

First the correlation coefficient r:

Pos    Switched    Same Team
RF    0.18    0.23
CF    0.02    0.27
LF    0.11    0.30
3B    0.16    0.33
SS    0.31    0.30
2B    0.06    0.40
1B    0.18    0.43

And the slope:

Pos    Switched    Same Team
RF    0.21    0.20
CF    -0.02    0.26
LF    0.14    0.31
3B    0.20    0.30
SS    0.28    0.29
2B    0.06    0.42
1B    0.18    0.43


#177          (see all posts) 2010/07/12 (Mon) @ 12:56

I may as well put the numbers from #81 and #103 into similar format so that it’s easy to compare.

For UZR range, correlation coefficient r:

Pos    Switched    Same Team
RF    0.05    0.55
CF    0.05    0.44
LF    0.23    0.56
3B    0.38    0.42
SS    0.41    0.53
2B    0.18    0.52
1B    0.13    0.30

For UZR range, slope:

Pos    Switched    Same Team
RF    0.06    0.63
CF    0.07    0.44
LF    0.29    0.60
3B    0.32    0.45
SS    0.44    0.49
2B    0.13    0.50
1B    0.13    0.27


#178    Bosox1324      (see all posts) 2011/11/06 (Sun) @ 08:52

sorry to bump this because it’s really old, but I found some interesting stuff.

I used team data from 2002-2010 from Fangraphs, with UZR as the fielding component of their WAR, and found the correlation of WAR to wins: it was .88. I then replaced UZR in the WAR with DRS and with TZ: DRS got .87 and TZ got .85. Using no defense at all got .86.


#179    MGL      (see all posts) 2011/11/06 (Sun) @ 14:11

In your WAR, did you use pitching WAR?  I’m not real clear what regression you did to get your “r”.  Can you elaborate?


#180    Bosox1324      (see all posts) 2011/11/06 (Sun) @ 20:57

I just calculated the hitting and added the pitching unadjusted afterwards. I used a multiple linear regression to get R

