THE BOOK--Playing The Percentages In Baseball


Tuesday, June 24, 2008

Pizza’s fielding system: OPA!

By Tangotiger, 10:42 AM

As Pizza and I fight over the best-sounding fielding system (WOWY v OPA!), he offers some background as followup to his introduction last week.

Gotta admit, while wowee is good, it’s really something that kids would say; opa is something that adults say, and with pride around plenty of other adults.


#1    MGL      (see all posts) 2008/06/24 (Tue) @ 14:07

I read his articles and the system looks like a good one using non-PBP data.  I especially like the idea of breaking up the evaluation into three categories, range, arm, and hands, although ultimately, of course, it does not matter how a fielder gets the job done.  It does, however, allow us to get a handle on what makes the various fielders good or bad (or in between).  For example, Jeter has always had excellent hands and arm strength and accuracy.  What makes him a bad fielder is only his range.  If you just look at one number (like UZR, or Dewan’s plus minus, or ZR, etc.), you would not necessarily know this.  Again, it does not make a difference in terms of how many runs a fielder saves or costs, but it is interesting to know.

Actually there is some use for breaking a fielder’s skills and performance down into those categories.  One, there is probably a different aging curve for the various skills (e.g., a recent article by someone - I forgot who - suggested that “hands” skill peaks much later than range skill), and two, there is probably a different regression for each skill, for example, hands has more ‘luck’ than range, as well as the fact that everything should probably be regressed separately, as I don’t think there is a whole lot of positive correlation between a player’s range, hands, and arm, and in fact, there may be some negative correlations, as Pizza suggests in his article.

However, let’s not forget that this is a non-PBP system, and by definition, it cannot hold a candle to a PBP one.  So, for Pizza to say that Jeter is vindicated (although he does say “at least according to my system,” or something like that) is a bit disingenuous for two reasons:  One, as I said, if the PBP systems say that Jeter is -20 and Pizza’s system says that Jeter is -8, the PBP systems trump Pizza’s.  And two, the “other” defensive analysts have said that Jeter IS a terrible fielder, which means that the data over the last 5 years or so indicate that he is a terrible fielder, not that his performance according to the data was necessarily terrible in any one year.  If Pizza’s system has Jeter as an average -8 outs per year over the last 5 years, then he can say that his system indicates that Jeter is NOT a terrible fielder.  I doubt that that is the case.  And even if it is, the PBP ratings necessarily trump his.

A non-PBP system, like Pizza’s, no matter how good it is, adds nothing to a (good) PBP one, unless the PBP one is a bad one of course.  And to clear up one other thing, UZR, and I assume most, if not all, of the other PBP systems, although I cannot speak for them, does NOT give a demerit to a fielder who makes a play but the first baseman boots the throw. That would be ridiculous and I am not sure why Pizza would think they would.  Maybe STATS ZR classifies that as an “error” or missed play for the fielder, I don’t know.  If they do, they made a mistake of course, and should correct it.

Great job by Pizza, but again, let’s please not think that any of these non-PBP systems adds anything to the PBP ones. They don’t.  Unless they discovered some important thing that the PBP ones are missing and those important things more than make up for the lack of hit location data that is inherent in the non-PBP systems like Pizza’s.

One more thing:  In the long-run the non-PBP systems are almost as good as the PBP ones, since exact hit location and batted ball speed tend to “even out” in the long run.  They don’t completely even out because fielders tend to play behind a certain group of pitchers and in certain parks throughout their careers and those parks and pitchers create biases in the hit location data that don’t necessarily even out in the long run.

In the short run, non-PBP systems are much worse than the PBP systems.


#2    Tangotiger      (see all posts) 2008/06/24 (Tue) @ 15:01

I think Pizza is saying that if Jeter is throwing to Gary Sheffield at 1B, and Sheff bobbles the ball, then Jeter doesn’t get a credit for an out.

***

MGL is correct that, if a PBP system uses the same data as the non-PBP system, plus the extra ones (granular grid location, etc), then he is right.  However, I don’t think that each non-PBP system is a subset.

For example, WOWY is not.  While UZR and the others give a general adjustment to all infielders for a home park, WOWY adjusts based on the exact number of innings played at each of the 30+ parks.  WOWY considers a groundball given up by Roger Clemens to be different from one given up by Mike Mussina, even if all the other parameters (slice, location, hardness hit) are the same.  UZR doesn’t distinguish between the two, other than their tendency to allow GB.

So, there is some intersection between WOWY and UZR, and what WOWY doesn’t know is certainly a lot more than what UZR doesn’t know.  However, because UZR chooses to ignore known information that WOWY embraces, that means that we cannot say for sure that UZR is better.

UZR v ZR however, we can, since ZR is a subset of UZR.

BP’s fielding system doesn’t use any data that UZR doesn’t use, and therefore, is inferior.  However, BP and ZR intersect, and one is not a subset of the other.  For example, BP considers handedness of “staff”, while ZR ignores it completely.  From this standpoint, it is possible that BP would be superior.  I’m not saying that it is, but just saying it’s possible, depending on what data one system is using that the other is not.


#3    Pizza Cutter      (see all posts) 2008/06/24 (Tue) @ 15:25

MGL, a couple of clarifications.  Jeter is -8 on groundballs only for last year.  I’ve run the LD numbers, but I don’t have them on me.  IIRC, he was another couple plays down on that one as well.  There are also pop-ups to consider.  I might not get down to -20, but I think it at least passes the smell test to say that Jeter is well below average in my system.  Also, I’m not sure how the more detailed systems handle double plays.  Jeter picked up some points there in my system.

I have no delusions that I’m going to out-do any of the PBP systems, because Retrosheet’s data just aren’t that fine-grained.  (But they’re free!)


#4    MGL      (see all posts) 2008/06/24 (Tue) @ 20:46

I think Pizza is saying that if Jeter is throwing to Gary Sheffield at 1B, and Sheff bobbles the ball, then Jeter doesn’t get a credit for an out.

I know what he is saying.  I am saying that in UZR, he DOES get credit for an out, as he should.

I agree that all non-PBP systems are not necessarily a subset of the PBP ones.  I think WOWY is a good example, and I have said that it is brilliant!  Given large enough samples, it is almost perfect!

However, I think that Pizza, Gassko, and the other “retrosheet” data systems are all subsets of the PBP ones like UZR and Dewan.

I am not taking away anything from these non-PBP systems.  They are great when you don’t have the hit location data.  But you should never use them to “trump” or even to “add to” a PBP one.  They should only be used when a PBP one is not available.  Just like OPS adds nothing to lwts.  It is still useful though for various reasons.


#5          (see all posts) 2008/06/24 (Tue) @ 21:10

What I got most out of reading Pizza’s articles was the ability to break fielding down into the various components. While overall contribution is still what matters most to winning games, it is nice to be able to do this kind of analysis. Plus, I believe that this concept can be carried over to other systems that may have more precise input data.


#6    joe arthur      (see all posts) 2008/06/24 (Tue) @ 21:21

I’m stumped.

To me, PBP should stand for “play by play”; the retrosheet systems are play by play systems, adjusted for handedness, base-out context and sometimes for park as well.  What they are not is systems which make fine-grained adjustment for hit location.

What does PBP stand for which translates to that?  Precise Batted-ball Parameter?


#7    Pizza Cutter      (see all posts) 2008/06/24 (Tue) @ 22:10

The PBP thing confuses me as well.  In any case, I’d be absolutely thrilled if the component aspect was carried over and used with something like BIS’s data.  One of these days…


#8    MGL      (see all posts) 2008/06/24 (Tue) @ 23:55

Come on Joe!  We all know what we mean, even if we are using the wrong terminology!  What rag of a college did you go to?


#9          (see all posts) 2008/06/25 (Wed) @ 00:38

Question for Tango:

Is it possible to use a WOWY system to evaluate minor league fielders? Dan Fox’s SFR is the only MiLB fielding stat (besides errors and range factor) that I know of, and we don’t really know if it’s good or not since there’s nothing to compare it to. Would changing talent levels (because players are still learning and whatnot) diminish the value of anything a minor league WOWY would produce?

I’d like to know your thoughts on the subject. Thanks.


#10    Peter Jensen      (see all posts) 2008/06/25 (Wed) @ 06:07

Pizza - I am glad that someone has taken on the task of trying to improve the estimate of the actual opportunities an infielder has in a non-zone type of fielding metric.  While the concept of the earlier Retrosheet based fielding metrics was a good first step, the 50-50 distribution was obviously not accurate enough for more than ordinal comparison of fielders.

If you could break down your chart of responsibility into yearly data it would be interesting to see how much variation there is year to year.  You could probably eliminate pitcher handedness as a factor since your chart from the 1990s data seems to show that most of the variation in the responsibility comes from the handedness of the batter with very little from the handedness of the pitcher.  This will also give you larger sample buckets for the different Retrosheet zones when you begin to tackle the problem of the interaction of adjacent players with different range abilities on each other’s total GB opportunities.


#11    Rally      (see all posts) 2008/06/25 (Wed) @ 09:39

I used the 92-98 retrosheet files to look at the hit distribution, and I think it was used for the 2007 TotalZone ratings. 

It does not make a big difference though.  The original estimate said that ground singles to LF were split 50/50 between 3B and SS.  The more accurate split is 60/40.  But it makes almost no difference once you calculate runs relative to position.

Yes, I’m charging 10% fewer hits to the third baseman, but I’m charging 10% fewer hits to every third baseman.  The 60/40 makes it a little more accurate, but it is not a big deal at all.  I doubt it changed the ratings of any player by more than a run or 2.
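The arithmetic behind that point can be sketched as follows. This is only an illustration: the three third basemen and their counts of ground singles to LF are invented, and the 3B share is assumed to move from 50% to 40% (reading "10% fewer hits to the third baseman" that way).

```python
# Hypothetical counts (not real data): ground singles through to LF
# while each third baseman was on the field.
singles_to_lf = {"3B_A": 100, "3B_B": 120, "3B_C": 80}

def hits_charged(share_to_3b):
    """Hits charged to each third baseman for a given 3B share of the blame."""
    return {name: n * share_to_3b for name, n in singles_to_lf.items()}

def relative_to_position(charged):
    """Charged hits minus the average charged to the position."""
    avg = sum(charged.values()) / len(charged)
    return {name: h - avg for name, h in charged.items()}

old = relative_to_position(hits_charged(0.50))  # original 50/50 estimate
new = relative_to_position(hits_charged(0.40))  # assumed 40% share for 3B

for name in singles_to_lf:
    print(name, round(old[name], 1), round(new[name], 1))
```

Every third baseman is charged fewer hits in absolute terms, but once each is measured against the positional average the ratings move by only a couple of hits in this made-up sample.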


#12    Tangotiger      (see all posts) 2008/06/25 (Wed) @ 10:08

Rally/11: good point.  While we also love the extra precision, the gain is fairly slight, since, as Rally pointed out, all the 3B get the same bias.

***

Peter/10: I agree with Peter’s general sentiment here. And to me the fascination is not so much in terms of incorporating things into a system, but simply knowing the split for the sake of knowing, as something to appreciate.

***

Dan/9: the WOWY is great if “all other things equal”.  As it is now, WOWY always only looks at park/handedness in “with or without”, and pitcher/bathand, batter/pitchhand.  It still has an issue since I should use all those parameters at once, except it would kill my sample size.  However, we can accept it, because we can look at it in totality here.  But, on the minor league side, it’d be much tougher, since we don’t really expect the same quality of hitters at every park if we start crossing over league lines.  I mean, even in MLB, it’s not the same, but it’s “close enough” that we can look at park/handedness and not worry that the distribution of the other parameters we are not controlling for (identity of pitchers, batters) is not causing an undue bias.  I am much less convinced that this could work in the minors.
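A rough sketch of a single "with or without" comparison, controlling for one variable (the pitcher). It is not the actual WOWY implementation: the function name, the input layout, and especially the weighting by the smaller of the two samples are assumptions made for illustration.

```python
def wowy_diff(pitcher_splits):
    """Weighted out-rate difference (with fielder minus without).

    pitcher_splits: list of tuples, one per pitcher:
        (outs_with, balls_with, outs_without, balls_without)
    """
    num = 0.0
    den = 0.0
    for outs_w, bip_w, outs_wo, bip_wo in pitcher_splits:
        if bip_w == 0 or bip_wo == 0:
            continue  # this pitcher offers no with/without comparison
        weight = min(bip_w, bip_wo)  # assumed weighting: the smaller sample
        num += weight * (outs_w / bip_w - outs_wo / bip_wo)
        den += weight
    return num / den if den else 0.0

# Invented numbers: two pitchers, ground balls with and without the fielder.
example = [(75, 100, 140, 200), (30, 50, 20, 40)]
print(round(wowy_diff(example), 4))
```

Controlling for several variables at once (park, batter hand, pitcher) would mean splitting these buckets further, which is the sample-size problem described above.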

Dan’s system, or any system, I think would be tough for the minors, if our expectation is that fielding for minor leaguers is on the steep slope upwards.  Remember that speed peaks at around age 23, and so I can see a huge improvement in fielding between ages 19 and 22, things that are probably more noticeable visually than with sample data.

***

MGL/8: for those new around here, that’s an inside joke between MGL and Joe. 

***

Joe/6: yes, to me, PBP is anything that identifies the batter, pitcher, and fielder for a play.  Any additional parameter is a bonus.

***

MGL/4: well, that’s great!  I didn’t know that.  Maybe you can give us a rundown at a high-level as to the changes you’ve made since your release about 5 years ago.


#13    Colin Wyers      (see all posts) 2008/06/25 (Wed) @ 12:03

The thing is, we have hit location data for minor league fielders, at least as much as we do PBP data - it’s in the Gameday XML files. I’m trying to do a fielding evaluation system based upon that data, but I need a way to get the Retrosheet file format from the XML file so I know who the fielders were for each play.


#14    Rally      (see all posts) 2008/06/25 (Wed) @ 12:43

Colin, do you have Adler’s Baseball Hacks?

I think he had something in there with getting the xml into a retrosheet format.  I tried using the script last year some time to get gameday data but eventually gave up.  I’m not very good at perl programming.

On giving the SS credit for a 1B dropping a throw, it’s great that this gives credit to the shortstop, but how often does it happen?  We all remember Sheffield struggling in the playoffs, but I can’t think of any other 1B to do that.  Almost every play with a runner safe at first on a throw, the error is charged to the infielder.

For the 1B to get the E, it would have to be a throw that does not bounce in the dirt, or cause the 1B to leave the bag.  For a major leaguer, Sheffield excepted, catching that throw is probably more automatic than catching an infield popup.


#15    Colin Wyers      (see all posts) 2008/06/25 (Wed) @ 13:15

MLBAM changed some of their file formats since Adler published Baseball Hacks; I’m pretty sure that his code doesn’t work past 2006.


#16          (see all posts) 2008/06/25 (Wed) @ 14:42

I was browsing the MLBAM server at lunch, and they do appear to have changed some of the file formats a little each year - so there’d have to be a slightly different parser for each year.

There’s a couple different parsers out there, but so far none that I know of that output into something close to RetroSQL.

I am currently working (off and on) on how to map the xy coordinates in the GameDay files to locations on the field, measured in feet. Not all of their hit placements are real accurate, but the vast majority should be plenty close enough for the kind of fielding analysis we’re doing here.


#17    Colin Wyers      (see all posts) 2008/06/25 (Wed) @ 15:29

Brian, I know Mat Kovach’s parser works for multiple seasons; the problem is that we don’t have a record of who the responsible fielder was for that play, and of who caught the ball. The second we could parse out of the verbatim, I suppose, but I lack confidence in my ability to do that reliably (I am the world’s worst computer “programmer”).

For the first - obviously the XML has the information in it, because we know that the Flash app shows us who is playing where. I just have no idea how to parse that data out.


#18    Mike Fast      (see all posts) 2008/06/25 (Wed) @ 16:14

I am currently working (off and on) on how to map the xy coordinates in the GameDay files to locations on the field, measured in feet. Not all of their hit placements are real accurate, but the vast majority should be plenty close enough for the kind of fielding analysis we’re doing here.

I believe Peter Jensen has done this.  You might want to talk to him.

http://www.hardballtimes.com/main/article/is-seeing-believing/


#19          (see all posts) 2008/06/25 (Wed) @ 18:52

Mike, thanks for the link. It looks like Peter has already done what I was attempting, which is to incorporate the GameDay batted ball locations into my Retro database. Greg sent me the same Hunter/Jones spreadsheet, and I was in the process of matching coordinates.

Mike, you are apparently a much better Perl programmer than I am at this point (I am studying your code). Do you have the time to modify your parser to create RetroSQL compatible tables?

Colin, I have Matt’s parser also, although I believe it was in Python and I was having problems with that. I have been using Matt’s Perl spider as it allowed me to get the minor league games as well.

Peter rightly points out that the GameDay data gives the location that the ball was retrieved, which for fly ball hits is not what we want for fielding analysis, but it is a piece of information that is still worthwhile to have. I posted on this site recently that I would like to have location the ball could be fielded, who had the best chance to field it, the location where it was retrieved, and who retrieved it.

One of the best things about the GameDay data is that it exists for the minor leagues as well, from 2005 forward, and is FREE for those of us with enough programming smarts to exploit it. I look forward to having the 2005-2008 minor league pbp in the same database as my major league pbp by next year’s opening day.

Then not only can we project a minor league player’s batting and pitching, but also be able to look at his fielding, as Pizza has shown us (range, hands, arm, etc) as well as baserunning, outfield throwing, catcher’s throwing, pitcher’s holding runners, etc - just about everything we can do with major league pbp data.


#20    Peter Jensen      (see all posts) 2008/06/25 (Wed) @ 21:08

Brian - I am having to redo all my hit location data.  I had used Greg’s HitTracker data on Home runs to establish a location for home plate and a multiplier to convert the MLB data into an angle distance format.  It turns out that the home plate location is different for each park’s data.  So I am having to establish a new home plate location and multiplier for each park.  I hope to have that done by the All Star break and will post it here then.  The same problem may exist for the minor league data and I am not sure how that can be solved without some known hit locations (angle and distance) for each park.


#21          (see all posts) 2008/06/26 (Thu) @ 03:20

Peter - I suspected that may be true.

With GameDay using Java, they have an upper left origin, and I had calculated a home plate location of 125,200 (I may have the x and y reversed) with a scale of approx 3 feet per pixel. With the Hunter/Jones data, I was sticking to fly ball outs in my comparisons. I also used some plays this year when the ball was hit exactly to a base.
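A minimal sketch of that conversion, assuming the home plate pixel (125, 200) and the roughly 3 feet per pixel quoted above, and further assuming that straightaway center field points toward the top of the image. The name `pixel_to_field` is made up for illustration, and the constants would likely have to be set separately for each park.

```python
import math

HOME_X, HOME_Y = 125.0, 200.0  # assumed home plate pixel (x, y may be swapped)
FEET_PER_PIXEL = 3.0           # assumed approximate scale

def pixel_to_field(x, y):
    """Convert an upper-left-origin pixel to feet relative to home plate.

    Returns (dx_ft, dy_ft, distance_ft, angle_deg), where angle 0 points
    straight up the image (assumed CF), negative toward LF, positive toward RF.
    """
    dx = (x - HOME_X) * FEET_PER_PIXEL
    dy = (HOME_Y - y) * FEET_PER_PIXEL  # flip: screen y grows downward
    distance = math.hypot(dx, dy)
    angle = math.degrees(math.atan2(dx, dy))
    return dx, dy, distance, angle

print(pixel_to_field(125, 100))  # a ball straight up the middle
print(pixel_to_field(155, 170))  # a ball toward the right side
```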

Once we get into the minors, there are so many more ballparks.

Here’s something that just popped into my head. How about if we selected ballpark, batter handedness, ground ball outs, infielder with assist, and x-y, then cluster the balls hit to each infield position at each ballpark, finding the mean and SD of the location for each of the infield positions. This should be the same from park to park. We should then be able to compare these 6 clusters at each ballpark with known data from a control group, and then calculate the location of home plate for each park.

That is, as long as they keep the same origin at each ballpark, and don’t change it game to game.
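The clustering step could look something like the sketch below. The record layout (park, batter hand, fielder, x, y) and the sample rows are invented for illustration; real input would come from parsed GameDay ground ball outs with an infield assist.

```python
from collections import defaultdict
from statistics import mean, pstdev

def cluster_stats(plays):
    """Mean and SD of ball location per (park, batter hand, fielder).

    plays: iterable of (park, bat_hand, fielder, x, y) tuples.
    Returns {(park, bat_hand, fielder): (mean_x, mean_y, sd_x, sd_y, n)}.
    """
    groups = defaultdict(list)
    for park, hand, fielder, x, y in plays:
        groups[(park, hand, fielder)].append((x, y))
    stats = {}
    for key, pts in groups.items():
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        stats[key] = (mean(xs), mean(ys), pstdev(xs), pstdev(ys), len(pts))
    return stats

# Invented sample rows in GameDay-style pixel coordinates.
sample = [
    ("BAL", "R", "SS", 100, 150),
    ("BAL", "R", "SS", 110, 160),
    ("BAL", "R", "1B", 150, 170),
]
for key, value in cluster_stats(sample).items():
    print(key, value)
```

Comparing each park's cluster centers against a control set is then a matter of fitting a transform between the two sets of centers.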


#22    Peter Jensen      (see all posts) 2008/06/26 (Thu) @ 07:08

Brian - Your plan might get you close enough.  It is actually what I did for the major leagues that caused me to discover the problem there.  The clusters for Baltimore 1st and 3rd basemen were in foul territory! Apparently changes were made between 2007 and 2008 to try and standardize the grid system for major league parks so 2008 data should be more consistent.  I don’t know if they also tried to standardize the system for minor leagues but I will inquire.


#23    Rally      (see all posts) 2008/06/26 (Thu) @ 09:52

Anyone who is building a minor league pbp database, would you consider donating a copy to Retrosheet?

I wonder if Dave Smith would be interested in expanding Retrosheet’s coverage in that direction.


#24    Peter Jensen      (see all posts) 2008/06/26 (Thu) @ 10:14

Rally - It would be a great boon to sabermetric research to have public access to a minor league PBP database.  That said, David Smith has maintained extremely high standards of accuracy in the information released under the Retrosheet name and rightfully so since it is that accuracy that gives researchers confidence to use the information for their research.  Since the only current available source for minor league PBP information is the unvetted MLB gameday XML files, I am not sure that a PBP database constructed from them could ever meet Retrosheet’s standards for accuracy.


#25          (see all posts) 2008/06/27 (Fri) @ 23:02

Rally - I fully intend to have a minor league pbp database in MySQL by opening day. I have partially spidered the xml files, and once RetroSQL is production ready, I will need to have a parser that will export the data into that schema, so that the minor league data will be in the same db format and can be mixed with the major league data, and any queries developed for the major league pbp will be valid for the minor league data, except for pfx (they do have pitch location and result for recent seasons, but not speed or pitch type).

I will gladly make this database available to the public.


#26          (see all posts) 2008/06/27 (Fri) @ 23:16

MGL or anyone else here who might have the data - in trying to translate the GameDay hit location coordinates, I believe I can do an affine transformation (shift, scale, rotate) but I need a set of control points in both coordinate systems.

What I do not have is the mean xy (or distance and angle) from home plate, by batter hand and infielder (c,p,1b,2b,3b,ss) of as many ground balls as possible. This would be the bulk of the control points in the system we want to convert to.

I (or Peter) can calculate the same mean xy in GameDay’s Java coordinates, by batter hand, infielder, and ballpark. Assuming the distribution of ball locations is consistent from ballpark to ballpark in a large enough sample (2005-2008), the six infield positions can be matched with the mean shift, scale and rotation necessary to minimize the rms error. This generates a conversion matrix from one system to the other. (I’ve written software for generating these transformation matrices at the mapping company where I work).
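One way such a fit might be sketched is below. It fits a four-parameter similarity transform (shift, uniform scale, rotation), which is the shift/scale/rotate subset of a general affine transform, by least squares using complex arithmetic. It is not the mapping software mentioned above; the function names and control points are invented, and the screen y axis is assumed to have been flipped beforehand, since this form cannot represent a mirror image.

```python
import cmath
import math

def fit_similarity(src, dst):
    """Least-squares shift/scale/rotate taking src points onto dst points.

    src, dst: equal-length lists of (x, y) control points.
    Returns complex (a, b) such that dst ~= a * src + b.
    """
    z = [complex(x, y) for x, y in src]
    w = [complex(x, y) for x, y in dst]
    zbar = sum(z) / len(z)
    wbar = sum(w) / len(w)
    num = sum((zi - zbar).conjugate() * (wi - wbar) for zi, wi in zip(z, w))
    den = sum(abs(zi - zbar) ** 2 for zi in z)
    a = num / den          # encodes scale (modulus) and rotation (phase)
    b = wbar - a * zbar    # encodes the shift
    return a, b

def rms_error(a, b, src, dst):
    """Root-mean-square residual of the fitted transform at the control points."""
    sq = [abs(a * complex(*s) + b - complex(*d)) ** 2 for s, d in zip(src, dst)]
    return math.sqrt(sum(sq) / len(sq))

# Invented control points: dst is src scaled by 3, rotated 90 degrees,
# then shifted by (10, 20).
src = [(0, 0), (1, 0), (0, 1)]
dst = [(10, 20), (10, 23), (7, 20)]
a, b = fit_similarity(src, dst)
print(abs(a), math.degrees(cmath.phase(a)), b, rms_error(a, b, src, dst))
```

With only infield cluster centers as control points, the residual reported here says nothing about how well the fit holds in the outfield.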

One problem I can think of is that all of the control is in the infield, and even if the errors seem small there, extrapolating outside of that, into the outfield, can magnify the errors (don’t map outside your control!). I could include the outfielders in the cluster matching, but I’m assuming that the clusters in the outfield are much larger (more variance) and thus the centroid location may not be as accurate. If outfield locations are included, we’d have to stick to fly ball outs, as GameDay reports the retrieved location for hits.


#27    Rally      (see all posts) 2008/06/27 (Fri) @ 23:19

Brian, you are my new hero.  For my part I should set up TotalZone queries compatible with retrosql so we can continue analysis of minor league fielding.



#29          (see all posts) 2008/07/09 (Wed) @ 23:38

I do remember Rennie Stennet hitting a groundball up the middle, past the ss, at Candlestick back in the mid 70’s that went for a HR!

I would not limit the input to just groundballs, but also look at how many xbh on fly ball hits as well.

I would also normalize for ballpark, even on gb the park can have an impact on what % are xbh.

This is similar to what Dan Fox did in SFR. He used just about everything - xb on both gb & fb hits, and advancement of the runners, to get a +/- run value for each fielder (in addition to the hit or out ratio)


#30          (see all posts) 2008/07/09 (Wed) @ 23:46

On Retro’s missing hit locations - mlb.com now has reinstated Condensed Games video. For $15/month in-season or $15 for the entire off-season, we can watch video of most every game in a max of 10-15 minutes.

GameDay gives us the location the ball was retrieved (still need to do conversion formula for the coordinates). Its major shortcoming is not recording where fly ball hits landed.

By parsing the GameDay data, and then going back and watching the video, a team of people should be able to get almost all the batted ball locations, especially the hits. Even at 15 minutes a game, someone could do one team’s home games in about 20 hours, a full weekend of work - or if done in-season could be spread out.

Video archives are available for 2006 until now.


#31    Tangotiger      (see all posts) 2008/07/29 (Tue) @ 11:07

The OPA! leaders for 2007:
http://mvn.com/mlb-stats/2008/07/28/the-real-gold-and-lead-gloves/

One day, I’ll try to get to the bottom of the Ichiro conundrum.

