THE BOOK--Playing The Percentages In Baseball


Thursday, February 19, 2009

Intentionally using less data?

By Tangotiger, 10:55 AM

Tim Marchman writes:

while reading the new edition of the Baseball Prospectus annual, I was a bit put off by this information on their new play-by-play defensive metric:

The best PBP systems rely on highly detailed batted-ball data: a direction for where the ball was hit, some indication of how hard, and the result of the play, with the field broken down into many, many fairly small zones. That data is typically available only for the majors. To keep the majors and minors on an even setting, we’re dealing with a reduced set of data.

As I understand the idea here, BP wants to make apples-to-apples comparisons between their minor league and major league defensive numbers, and so is artificially crippling the data set they’re using to derive the major league numbers to bring it into line with the less granular data available for the minors. I see the appeal, but it makes the topline numbers suspect, especially when the system arrives at seemingly wonky results like Bobby Abreu rating as a plus defender and Hanley Ramirez as a Gold Glove candidate last year. Of course even very good systems have outliers, but not every system intentionally deals with a reduced set of data. For now I’ll continue to rely on UZR and Plus/Minus, though I’ll be curious to see what people like Tom Tango have to say about the technical pros and cons of the new system.

Since Tim asked, I’ll respond. 

I have no doubt that Clay wrote that.  Why is that?  Because this is the same thing he said a few years ago when defending using the non-PBP numbers in FRAA, to make the “apples-to-apples” comparison for all of baseball history.  I disagreed completely then, and I disagree just as much now.

Why is that?  Because the biases are not systematic.  Presuming that Clay is using Dan Fox’s Simple Fielding Runs (SFR), then the SFR of Darin Erstad in 2000 has no stronger relationship to the SFR of Darin Erstad in 2001 than to the UZR of Darin Erstad in 2001.  Indeed, I would bet that the relationship of SFR in year X is stronger with UZR in year X+1 than to SFR in X+1!  (Shades of ERA/FIP discussion.) The only way for the year-to-year SFR relationship to be stronger is if there is a systematic bias with SFR to begin with.
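
To make that argument concrete, here is a toy simulation. It is only a sketch under assumed numbers, not a model of SFR or UZR themselves: every player gets a fixed true talent, a hypothetical "coarse" metric measures it with a lot of random noise, and a hypothetical "fine" metric measures it with less. The noise levels, the optional player-specific bias, and the function names are all made up for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def simulate(bias_sd, n_players=20000, seed=1):
    """Return (coarse-to-coarse, coarse-to-fine) year-to-year correlations.

    Assumptions (all hypothetical): true talent has SD 1, the coarse
    metric adds random noise with SD 2, the fine metric adds random
    noise with SD 1, and bias_sd is the SD of a per-player bias that
    repeats every year in the coarse metric only.
    """
    rng = random.Random(seed)
    coarse_x, coarse_next, fine_next = [], [], []
    for _ in range(n_players):
        talent = rng.gauss(0, 1.0)
        bias = rng.gauss(0, bias_sd) if bias_sd > 0 else 0.0
        coarse_x.append(talent + bias + rng.gauss(0, 2.0))     # coarse, year X
        coarse_next.append(talent + bias + rng.gauss(0, 2.0))  # coarse, year X+1
        fine_next.append(talent + rng.gauss(0, 1.0))           # fine, year X+1
    return pearson(coarse_x, coarse_next), pearson(coarse_x, fine_next)


if __name__ == "__main__":
    for label, bias_sd in (("no systematic bias", 0.0),
                           ("systematic bias", 1.5)):
        same, cross = simulate(bias_sd)
        print(f"{label}: coarse->coarse {same:.2f}, coarse->fine {cross:.2f}")
```

With these assumed noise levels, the coarse metric in year X should track the fine metric in year X+1 more closely than it tracks itself when there is no repeating bias. Once a repeating player-specific bias is added to the coarse metric, the ordering flips, which is the only situation where sticking to the one system buys you a stronger year-to-year relationship.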

As a good example, an early version of PMR (Pinto’s model) had a super love affair with Orlando Hudson, while UZR merely liked-to-loved him.  Why is that?  One possibility was that Orlando Hudson was a popup hog, and he was getting a benefit there every year.  That’s a systematic bias. 

But, this doesn’t apply here with SFR and UZR.  There is no question at all that you always want to use the maximum data (combined with an intelligent approach, natch) and using the same methodology does not, in-and-of-itself, provide an apples-to-apples comparison.

(Hat tip: Repoz)


#1    TOLAXOR      (see all posts) 2009/02/19 (Thu) @ 12:43

SOUNDS LIKE A JOB FOR BAYESIAN MODEL SELECTION!!!


#2    David Cameron      (see all posts) 2009/02/19 (Thu) @ 14:43

Yep, Clay wrote it.  I flipped through BP09 at Barnes and Noble the other day, figuring that Clay would have two long back-of-the-book articles explaining his new PBP system and the changes to WARP. 

Instead, they were compressed into one article that was about three pages long, ignored a lot of important questions, and was generally head-shaking in nature. 

I didn’t bother reading any of the rest of the book.


#3          (see all posts) 2009/02/19 (Thu) @ 21:45

I can understand his thinking. I am writing queries to analyze play by play, and I have software that will download and parse GameDay xml files into a MySQL db, for both minor and major leagues. There is better data available for the majors, but I probably can’t afford it right now, although I figure BP can. Dan Fox told me that when he was at BP he got his data already parsed.

But, then I have all the data in one db and can run the same queries, and one of the things high on my to-do list is see how things like fielding change when the player moves from one level to another.

So, yes, BP can get more detailed/granular data for the majors, but even GameDay is still way better than Clay’s old FRAA.


#4    Tangotiger      (see all posts) 2009/02/20 (Fri) @ 08:12

We don’t like his statement justifying it, that’s all.  It would be better to say “it’s easier for me to code”, which is the reason I’d do it that way.


#5          (see all posts) 2009/02/22 (Sun) @ 12:16

I was surprised at how limited the detail was in Clay’s article at the end of the book.  He admits that their old system didn’t work very well for first basemen, but he doesn’t have a table showing which players’ scores changed. 

I don’t see why he doesn’t just show multiple fielding statistics.  There’s nothing wrong with being able to see the old FRAA, this new system, and also the best possible PBP system they could come up with for MLB players only.

The best insights come from understanding where various systems diverge…


#6    Rally      (see all posts) 2009/02/22 (Sun) @ 14:11

I can see where space limitations would force them to pick a fielding system and stick with it for the book.  They had to leave out the player index because the book was too big (it’s now on their website).

For some reason, I found the article on the fielding system very, very familiar.

Clay mentions that the data is incomplete before 2005.  They must be using some other source of data as this is not true for Retrosheet. For 2004, Retrosheet codes every single, double, or triple as a G, F, L, or P, and is missing who fielded a hit on only a small number of liners and flyballs, about 500.  Most of these are probably explained by ground rule doubles.


#7    Peter Jensen      (see all posts) 2009/02/27 (Fri) @ 11:34

There is no question at all that you always want to use the maximum data (combined with an intelligent approach, natch) and using the same methodology does not, in-and-of-itself, provide an apples-to-apples comparison.

Tango - I do not have the new Prospectus and have not read the actual reasoning about why information was left out, but I am surprised that you are taking such an adamant approach on this.  There are sound reasons for leaving out data in a metric.  Isn’t that the one thing that Voros taught us with DIPS?


#8    Tangotiger      (see all posts) 2009/02/27 (Fri) @ 12:26

Yes, there are sound reasons for leaving out data in a metric.

But, there are no sound reasons for leaving out data in this particular metric.  Indeed, Clay in the book reiterates what he has often given as the reasoning for his bias: the one-system approach.

This is wrong. 

Suppose, for example, you have plus/minus figures going back only to 1967.  Are you then going to completely ignore plus/minus for all players because you don’t have it for Maurice Richard?

Or suppose, for example, you have saves going back only to 1982.  Are you going to completely ignore the number of shots faced by Martin Brodeur and Dominik Hasek because you don’t have it for Jacques Plante?

No. If the data that you want would help you in evaluating a player, then you cannot discard that piece of data because you don’t have it historically.

FIP and DIPS intentionally leave out the data because they are only interested in evaluating a pitcher on plate appearances that don’t involve the fielders.  Just as OBP intentionally ignores the fact that a HR is more valuable than a walk, because it is only interested in certain facets of the player’s performance.
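
As a rough illustration of "leaving data out on purpose", here is a sketch of the two stats using their commonly published formulas. The function names and the sample stat lines are hypothetical, and the FIP constant varies by league and season, so the 3.10 default below is only an assumed placeholder.

```python
def obp(h, bb, hbp, ab, sf):
    """On-base percentage: every way of reaching base counts the same."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)


def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding Independent Pitching, standard weights.

    Hits on balls in play are not an input at all. The constant is
    league- and season-specific; 3.10 is an assumed placeholder.
    """
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


if __name__ == "__main__":
    # Same made-up hitter, plus one extra plate appearance that is
    # either a home run or a walk: OBP cannot tell the two apart.
    with_hr = obp(h=51, bb=20, hbp=0, ab=201, sf=0)
    with_bb = obp(h=50, bb=21, hbp=0, ab=200, sf=0)
    print(with_hr == with_bb)

    # Made-up pitcher line: however many singles fell in behind him,
    # the FIP inputs (and so the result) stay the same.
    print(fip(hr=20, bb=50, hbp=5, k=180, ip=200.0))
```

Both stats throw information away by design, because of what they are trying to measure. That is a different thing from throwing information away so that every league and era runs through the same formula.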

For what Clay is doing, it is simply a bad choice on his part to stick to his bias of preferring the single-methodology approach.

***

By the way Rally, seeing the presentation in BP09, I can see why you said it looked “really really familiar”.

