
THE BOOK--Playing The Percentages In Baseball


Tuesday, January 25, 2011

Fangraphs Library

By Tangotiger, 06:02 PM

More saber-friendliness from Fangraphs.

I’ve been meaning to dump, I mean, move, the Tangotiger Wiki somewhere.  Steve, if you are out there, and want it, let’s talk.


#1    Steve Slowinski      (see all posts) 2011/01/25 (Tue) @ 18:46

Got it...awesome. I’ll shoot you an email real quick.


#2    BenJ      (see all posts) 2011/01/26 (Wed) @ 00:33

Yeah… the wiki.  A few of us got together and made some progress… I ended up finding that the regular wikipedia was already pretty thorough, and we were making progress at a snail’s pace. 

Kudos to Steve for what he’s done.  (I think I’ll have a few comments on the Defensive Runs Saved section.)


#3    Colin Wyers      (see all posts) 2011/01/26 (Wed) @ 01:37

I have a question. From the section on “Plate Discipline”:

Baseball Info Solutions (BIS), where FanGraphs gets this type of data, changed how it calculates these statistics over the years. When looking back at plate discipline numbers from 2002 or so, be cautious because the league averages may be very different than what they are today.

Is this true? I thought the calculations for these numbers were the same over time - O-Swing, for instance, is just the portion of pitches marked outside the zone that are swung at, right?

Or is this just an awkward way to refer to the problems in the underlying data?


#4          (see all posts) 2011/01/26 (Wed) @ 11:36

Colin - I wrote what had always been my understanding (and what I could find, to the best of my knowledge), but to be honest, someone like you will know better than me. I can fix it if you confirm this...just let me know. Click on my name above and you should get my email.

Also, I’d love it if smart people like you and Tango can provide feedback on sections. I tried to be as accurate as possible, but I know that my base of knowledge on saber stuff is far from perfect. I’m also hoping to eventually expand and add some BPro stats like SIERA, so I may be in touch to make sure I’m not being inaccurate.


#5    Lee Panas      (see all posts) 2011/01/26 (Wed) @ 15:29

Steve, I’m not smart like Wyers and Tango, but I’m very interested in saber education and plan to read the whole thing thoroughly when I get a chance.  I’ll let you know if I see anything that needs to be changed.



#7    Tangotiger      (see all posts) 2011/01/26 (Wed) @ 16:42

Colin:

I’ve asked BPro about this in the past, and with you being a player there, perhaps I can get you on my side on this one.

Huckabay once told me that the vast majority of the articles are read when they are “new”.  That is, a tiny minority of articles are read via searches or archives.  The main benefit of the BPro articles is that they are read fresh.

Seeing that the older articles have limited value from a subscriber standpoint, and they would have tremendous value from a research perspective, can’t the older articles, say those more than one year old, be opened up to the public?

In addition to having limited loss of value to current subscribers, you get these benefits:
- hits from google search
- free advertising for potential subscribers

Furthermore, BPro’s contributor’s agreement has an IP clause that allows the author to republish his work after 6 months or 18 months or something.  Dan Fox, for example, likely saw it my way, because he would republish his old articles on his blog.  Rather than Dan republishing on his blog, I always thought it would have been better for all concerned to simply open up those articles.

***

Do you think this is a good idea?  And if so, can you champion it?


#8    Colin Wyers      (see all posts) 2011/01/27 (Thu) @ 00:23

Well, lemme just say (and this is mostly for anyone else who might read this) a few words in defense of making money: it lets me do things like eat food that I’ve grown rather fond of. So it would have to be something that makes sense for us there.

That said… something like this really needs to be updated:

http://www.baseballprospectus.com/library/

And I think there would be some value to opening up old pieces. I wouldn’t expect any movement on it right now, as it’s getting to be our peak time of the year for fantasy stuff and I’m busier than all get out. But in a few months, maybe we see what that looks like.


#9    Tangotiger      (see all posts) 2011/01/27 (Thu) @ 12:26

Making money:

Right, I wasn’t suggesting doing some public service or something.  I was suggesting that you would “invest” in opening up old articles (those that are a year old and that are being virtually unread right now and therefore would barely cut into your bottom line if at all), and in return, you get massive google hits, and newbies who would be exposed to those articles.  And that investment would pay off with additional subscribers.  As a byproduct to that, you’ve also done a public service.

Let’s take an example.  Let’s say that I sell Marcel annually.  What value do the 2009 Marcels have today?  So, why would I not then post them so that researchers can use them?  Perhaps they can even test them, publish their results, and say “hey, darn it, those Marcels ARE good”.  Google hits come back also showing baseball forecasts, and so on.  All the while, I can still sell the 2010 and 2011 Marcels.

And, like I said, BPro ALREADY removes the publishing constraints from the authors after a certain period of time.  Therefore, they no longer have exclusive publishing rights to those articles.  Why make Dan Fox repost his articles on his own blog?

Anyway, think about it after spring…


#10    Nick Steiner      (see all posts) 2011/01/27 (Thu) @ 19:15

Colin, Steve is definitely right that the changes in the league averages over the years are largely due to changes in the way the stat is calculated.

http://www.hardballtimes.com/main/blog_article/is-the-bis-data-right/

O-Swing, for instance, is just the portion of pitches marked outside the zone that are swung at, right?

I believe that how “outside” is being defined is the root cause.  I don’t think it’s a problem with the data per se (because that wouldn’t be reflected in changes in the league averages), but just changes in the way the stat is defined.


#11    Colin Wyers      (see all posts) 2011/01/27 (Thu) @ 19:43

Okay, so how is it being calculated differently? The definition as presented is simply one number being divided by another - how on Earth can it be “calculated differently” without being an entirely different stat?

And how is a changing definition of the strike zone boundary not a problem in the data? I mean, the commonly called strike zone doesn’t change that much from year to year, does it? (And in fact for pitches where the batter does not swing, the BIS definition of the strike zone seems to be whatever the umpire called it.) The rulebook strike zone certainly hasn’t changed.


#12          (see all posts) 2011/01/27 (Thu) @ 20:24

Nick, the link you cite just speculates that, among many possible explanations for the problems with the data, one possibility is that BIS has changed the definition somehow.

Do you have a different, definitive source about a change in their definition, why they would have changed, the exact details of a change, etc.?

I don’t see how a problem with the data couldn’t be reflected in a change in league average.  In fact, when PITCHf/x does not show such a change, a problem with the BIS data would seem to be the simplest answer.

We can speculate that they’ve changed their method, but if they have, why would they sell that data to people without telling them that they can’t compare data from year to year because of a change in method?

With all the other problems that have been found with BIS data, the simplest assumption would seem to be that the data is just bad.


#13    Colin Wyers      (see all posts) 2011/01/27 (Thu) @ 20:30

Okay, here’s a simpler question.

If it’s a change in definition, rather than an issue intrinsic to the data, why can’t the old values just be recomputed using the new data?

(And why on earth are the LA teams from 2007 “using a different definition” from the rest of the league that season?)


#14    dkappelman      (see all posts) 2011/01/27 (Thu) @ 20:55

First off, I calculate O-Swing% based on the raw pitch data BIS provides.  The strike zone is defined as the coordinates that BIS gave me that matches up with the strike zone.  Consider this a “fixed” strike zone.
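
A minimal sketch of that calculation, assuming each pitch carries a plate location and a swing flag. The field names and the zone boundaries below are hypothetical placeholders, not the coordinates BIS actually supplies.

```python
from dataclasses import dataclass


@dataclass
class Pitch:
    x: float      # horizontal plate location in feet from center (hypothetical field)
    z: float      # height at the plate in feet (hypothetical field)
    swung: bool   # True if the batter swung


# Hypothetical fixed zone, in feet. These are NOT the real BIS coordinates.
ZONE = {"x_min": -0.83, "x_max": 0.83, "z_min": 1.5, "z_max": 3.5}


def in_zone(pitch, zone=ZONE):
    """True if the recorded location falls inside the fixed zone."""
    return (zone["x_min"] <= pitch.x <= zone["x_max"]
            and zone["z_min"] <= pitch.z <= zone["z_max"])


def o_swing_pct(pitches, zone=ZONE):
    """Swings at pitches recorded outside the zone / pitches recorded outside the zone."""
    outside = [p for p in pitches if not in_zone(p, zone)]
    if not outside:
        return float("nan")
    return sum(p.swung for p in outside) / len(outside)


def zone_pct(pitches, zone=ZONE):
    """Share of all pitches recorded inside the fixed zone."""
    if not pitches:
        return float("nan")
    return sum(in_zone(p, zone) for p in pitches) / len(pitches)
```

Because the zone is held fixed, any year-to-year drift in either number has to come from the recorded locations rather than from the formula.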

Looking at the league averages, something has changed from year to year.  Since 2007 things had been pretty stable.  In 2010 I believe they used improved entry software, which may have had something to do with the 3% downtick in Zone%.

For pitches where the batter does not swing, it is not whatever the umpire called it. I see taken strikes that are “out of zone” and balls that are marked “in zone”.

If you look at O-Swing above/below average, things are very consistent from year to year, even though the average is changing.  Basically, as long as you’re taking the average into consideration, you’ll be ok.  I think when I took a look at BIS data vs pitchf/x data somewhere in this blog they were quite similar.  This reminds me that I was supposed to put O-Swing above/below average up on FanGraphs!
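
As a rough sketch of what "O-Swing above/below average" means here, assuming per-season rates are already computed: the rates below are made-up numbers for illustration only, and the function name is hypothetical.

```python
def o_swing_vs_league(player_rates, league_rates):
    """Return each season's player O-Swing% minus that season's league O-Swing%.

    Both arguments map season -> rate expressed as a fraction (0.25 means 25%).
    """
    return {season: rate - league_rates[season]
            for season, rate in player_rates.items()}


# Made-up illustration: the league baseline drifts upward, the player's raw
# rate drifts with it, and the league-relative number stays put.
league = {2006: 0.23, 2010: 0.29}
player = {2006: 0.28, 2010: 0.34}
relative = o_swing_vs_league(player, league)
```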


#15          (see all posts) 2011/01/27 (Thu) @ 21:09

So, David, you’re saying that the definition has not changed because you are the one that is applying the definition to the data, and you are using a consistent definition?

That confirms that something is badly wrong in the incoming BIS location data, as I suspected.


#16          (see all posts) 2011/01/27 (Thu) @ 21:30

If you look at O-Swing above/below average, things are very consistent from year to year, even though the average is changing.  Basically, as long as you’re taking the average into consideration, you’ll be ok.

Your second sentence does not follow from the first.

If there is something wrong with the BIS method for recording locations, it would be quite possible for them to be consistently wrong in the same way and in a way that adjusting to league average would not fix.  Consistency relative to league average is not a good proof of data quality.

For instance, they could be wrong about the zone size/location for all left-handed batters.  That would show up as being consistent from year to year, but it would be a problem when evaluating individual players.  Or they could be wrong about the pitch locations in specific ballparks.  I’m not saying either of those are what is happening.  I’m just pointing out ways that the data could be bad that would not be illuminated by looking at year-to-year correlations for comfort.


#17    Colin Wyers      (see all posts) 2011/01/27 (Thu) @ 21:32

And if you look at the evidence of park bias, you should expect to see some level of persistence year-to-year in terms of things like O-Swing Above Average even if the data is otherwise totally random at the individual player level.


#18    Colin Wyers      (see all posts) 2011/01/27 (Thu) @ 21:42

And to clarify an earlier comment - I saw a very high level of agreement between the percentage of balls and strikes on pitches where the batter did not swing, between BIS and the actual league rates. And this data is much more stable than the balls/strikes rate for pitches where the batter swung, which seems to cause almost all of the instability seen year-to-year.

Again, another point - you should expect to see SOME stability year to year, based upon pitches where the batter took. Between that and the park effects, there should be a presumption of some stability year-to-year in values like O-Swing. But none of that is any indication that BIS is adding value to data that’s freely available - the actual number of balls and strikes on pitches where the batter did not swing - with their estimates of balls and strikes on pitches where the batter did swing.

And as you note, BIS has trouble maintaining a consistent level of pitches in the zone given a consistent set of “coordinates that BIS gave me that matches up with the strike zone.” If they can’t figure out the actual rate of balls and strikes for the league, what reason do we have for thinking that they can produce useful data for individual players?


#19    Colin Wyers      (see all posts) 2011/01/27 (Thu) @ 22:05

Okay, one more and then I promise I’ll clear the floor and let someone else weigh in before I comment again.

But… 2007, Los Angeles.

http://www.fangraphs.com/teams.aspx?pos=all&stats=pit&lg=all&type=5&season=2007&month=0&season1=2007


Is there ANY conceivable explanation for those two teams that season that doesn’t involve either being unable to catch serious data quality issues or catching them and just not caring?


#20    dkappelman      (see all posts) 2011/01/27 (Thu) @ 22:15

So, we know there’s a problem with LA teams in 2007.  This has been well documented.

The correlation between BIS OSwing and PitchFX Oswing in 2009 and 2010 is:  .95 and .97 by my calculations.  (I’m including/excluding the same data in both)

If BIS data has severe issues when measuring what is inside and outside the strike zone, then it would seem pitchf/x is too.


#21    dkappelman      (see all posts) 2011/01/27 (Thu) @ 22:18

I should correct myself and say that there may be problems with one/both of them, but they are capturing the same thing in regards to this stat. 

Also, I looked at batters with at least 1000 pitches seen.
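
A minimal sketch of this kind of comparison, assuming one record per batter with a pitch count and an O-Swing% from each source. The record layout and key names are hypothetical, and this is not necessarily how the figures above were produced.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def compare_sources(batters, min_pitches=1000):
    """Correlate BIS and PITCHf/x O-Swing% for batters above a pitch threshold.

    batters: list of dicts with hypothetical keys 'pitches', 'bis', 'pfx'.
    """
    kept = [b for b in batters if b["pitches"] >= min_pitches]
    return pearson_r([b["bis"] for b in kept], [b["pfx"] for b in kept])
```

Note that a correlation coefficient is unchanged if a constant is added to every value from one source, so a high r by itself says nothing about whether the two sources agree on the level.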


#22    Colin Wyers      (see all posts) 2011/01/27 (Thu) @ 22:28

So, we know there’s a problem with LA teams in 2007.  This has been well documented.

I’m glad we have at least come to one point of agreement. You have access to the raw data, and presumably BIS is more interested in answering your questions on the topic than they are mine - do we know what the problem is? And do we know if there are other teams that are affected by the same problem, albeit more subtly?

As for your comparison with Pitch F/X, I’ll reiterate what I said the last time you compared the two data sets:

http://www.insidethebook.com/ee/index.php/site/comments/bis_bias_bizz_buzz/

Findings about the quality of the BIS data compared to Pitch F/X are not necessarily transferable. I did the same SSAA correlation I did for the article, but I broke it down into ‘02-’06 and ‘08-’10, in other words before and after Pitch F/X:

Years     R
‘02-’06   0.590
‘08-’10   0.228

The data quality is vastly better for the years where Pitch F/X is available than for the years it isn’t. (And I suspect the availability of Pitch F/X data is at least part of the explanation for the improvement in data quality.)


#23    dkappelman      (see all posts) 2011/01/27 (Thu) @ 22:49

I don’t know what caused the issue with LA.  Nothing like it has popped up since or is noticeable before that season.

I’m not sure what it matters if BIS data improved in 2008.  It could be for any number of reasons (HD feeds for every game, pitchf/x, better data checks, you name it...) but for the years where there’s something to test against, it seems pretty good.

I really don’t know what else to say and I’m not really sure what your point of contention is with the O-Swing stat as calculated using BIS data.  It’s the best thing we have pre pitchf/x data, and it’s nearly the same as current pitchf/x data.  And you can ignore all the 07 Angels/Dodgers data if you like (but I think it’s probably just skewed too high and could be re-adjusted if I ever got around to it).

Maybe there are some park issues that could be cleaned up, but honestly I don’t know.  I’m sure there are park issues with pitchf/x that could be cleaned up too.

You could look at only road data.  OSwing stabilizes really quickly, so that could potentially mitigate some of your concerns.

Is there something else I’m missing?


#24          (see all posts) 2011/01/27 (Thu) @ 23:06

I’m sure there are park issues with pitchf/x that could be cleaned up too.

There are, and I’ve spent untold hours identifying and documenting them so that I can understand them and correct them.  I don’t accept what MLBAM and Sportvision tell me about the data at face value.  I also don’t satisfy myself by simply looking at league-wide correlations and saying that PITCHf/x is probably right most of the time (which it is).  I have figured out what I can trust and what I can’t in the PITCHf/x data because I have very specifically characterized and quantified the error by park, homestand, etc.  It frankly boggles my mind that people do not do the same with the BIS data and still expect it to be useful for serious sabermetric research.

It’s a major problem with sabermetrics these days that it’s about being entertained by pretty stats and not grappling with the data in a serious manner befitting scientists.


#25    Colin Wyers      (see all posts) 2011/01/27 (Thu) @ 23:58

Well just earlier you said:

Looking at the league averages, something has changed from year to year.  Since 2007 things had been pretty stable.

So we can see that something has been introduced that’s “stabilized” the yearly averages. We can also see that, according to the data I presented on park effects, those have decreased sharply. So the data set has improved in quality - does anyone argue that point? Is there anyone who disagrees?

So then why would we think that findings about the quality of the data set after the improvement should hold for the data set prior to the improvement? I mean, I hesitate to point this out because it feels like I’m arguing from tautology.

And of course it matters why the improvement happened. If the improvement happened because of the introduction of Pitch F/X - in other words, if the Pitch F/X data is influencing the collection of the BIS data - then we should expect to see high correlations between the two data sets, right? And if we don’t know the extent that Pitch F/X influences the collection of the BIS pitched ball data, we have no idea how well your findings indicate the ability of the BIS stringers to independently record balls and strikes, as opposed to their ability to regurgitate the findings of Pitch F/X.

But I am willing to stipulate that there is a “reasonable” quality to the BIS pitch location data for any point in time when that data is totally superfluous. (After all, does anyone seriously think that BIS is doing a better job of tracking pitched ball location than Pitch F/X? And Pitch F/X data is freely available to boot.) I just don’t know why you think that gives us any indication of the quality of the data recorded prior to the introduction of Pitch F/X.

And without an evaluation of that data’s quality, how are we to know that it is, as you say, “the best thing we have pre pitchf/x data?” It is, I’m sure I will be told, the only effort to record where a ball was when the batter swung at it (absent things like Questec). But there’s absolutely no indication that the data was recorded with enough care to make it useful for evaluation - in other words, no reason to think that having the data tells us something that looking at things like K rate, BB rate, and called balls/strikes doesn’t. What we do know is that the data can be, and has been, rampantly misused by people who compare players in different parks, or who compare values across seasons without being aware of the drift in baselines over time.

And it’s all well and good to say “oh, we can just subtract the average.” But have you thought through the potential source of the error? Let’s say, as a total hypothetical, that in one year BIS operators were overstating the amount of balls above and below the strike zone, but understating the amount of balls to the left and right of the strike zone. Subtracting the league average may help “normalize” the data at the league level, but it will deeply misstate the swing characteristics of players who have better strike zone discernment in one direction than the other.
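
The hypothetical can be made concrete with a toy calculation. Everything below is invented for illustration: two batters with opposite chase tendencies, and a simplifying assumption that the operator bias only changes how many out-of-zone pitches get recorded in each direction (in reality, misrecorded pitches would also carry different swing rates).

```python
def recorded_o_swing(chase_vertical, chase_horizontal, n_vertical, n_horizontal):
    """O-Swing% given chase rates by direction and recorded out-of-zone counts."""
    swings = chase_vertical * n_vertical + chase_horizontal * n_horizontal
    return swings / (n_vertical + n_horizontal)


# Invented chase rates: (high/low pitches, left/right pitches).
BATTERS = {"A": (0.40, 0.10), "B": (0.10, 0.40)}


def league_relative(n_vertical, n_horizontal):
    """Each batter's O-Swing% minus the average of the two batters."""
    rates = {name: recorded_o_swing(v, h, n_vertical, n_horizontal)
             for name, (v, h) in BATTERS.items()}
    league = sum(rates.values()) / len(rates)
    return {name: rate - league for name, rate in rates.items()}


# Unbiased recording: 100 out-of-zone pitches in each direction.
true_view = league_relative(100, 100)
# Biased recording: vertical misses overstated, horizontal misses understated.
biased_view = league_relative(150, 50)
```

In this toy setup both batters are exactly league average under unbiased recording, and the league average itself is the same in both cases, yet the biased recording puts one batter well above average and the other well below. Subtracting the league average leaves that distortion untouched.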

As for the LA data - I mean, it blows my mind that we can all sit here and go, “Yeah, that’s obviously wrong,” and still have this conversation about how much we can trust this data without trying to figure out how this happened. I mean - and someone please correct me if I’m wrong - but people tell me that to buy a year’s worth of all of the data you get, you’re basically dropping enough money to buy a new car. Not finance it, drive it off the lot paid in full. If I paid that much money for data, and I saw what looked like obvious errors, I would become the single-most annoying human being on the face of the planet (I’m sure you think this has already occurred) until I was either satisfied the problem was addressed or until my money was refunded. I absolutely cannot fathom why you wouldn’t be more interested in what the answer was there.


#26    Tangotiger      (see all posts) 2011/01/28 (Fri) @ 03:12

A few points as I head off to bed:

1. Other than consistency, why do we need to have BIS location data if Gameday is available?  Is it for the PAs when there is no Gameday?

2. Can BIS flag or otherwise note its data source?  Can they say “this data fed directly from Gameday”, “this data overridden from obvious crappy Gameday”, “this data from TV, no Gameday available” for each of its 700,000 pitches?

And this is really a problem with Gameday as well, when operators start adjusting the calibration device.  Basically, what we really want is a complete log of what is happening.  But, adjustments happen on the fly, we aren’t aware of it, and then, we’re not sure how to handle biases we observe.

If you want to get scared, try to process NHL.com event files.  Goalies not on the ice for 20 minutes, 20 skaters on the ice, the same skater on the ice twice, and so on.

In some respect, it’s good to have the complete unadjusted data, so that I can intelligently go through it and figure out the bias or other bad data.

On the other hand, I’d rather someone else do it… but that person tell me what it is they overrode and why.

3. The title of that linked thread, “Bis Bias, Bizz Buzz” is my favorite.


#27          (see all posts) 2011/01/28 (Fri) @ 09:11

1. Other than consistency, why do we need to have BIS location data if Gameday is available?

I don’t see that we do need it.

Is it for the PAs when there is no Gameday?

That was around 3-4% of pitches back in 2008.  In 2010 it was about 1% of pitches.

And this is really a problem with Gameday as well

Yes, as discussed above in this thread (and in a lot of my other writings).  I find the standard deviation of the error in PITCHf/x plate location to be about 1 inch.

when operators start adjusting the calibration device.

This situation would be true even if Sportvision never touched the calibration of their cameras.  Actually, it would probably be quite a bit worse.  It’s not their adjustments to the calibration that’s the problem.  It’s that every measurement in the world has an error associated with it.  We want them to minimize the error, which they mostly do a pretty good job of.

Basically, what we really want is a complete log of what is happening.

Definitely.  Any information we can get is helpful.  It’s probably more helpful for the data provider in that they can get feedback to improve the quality of their data gathering than it is for us.  If it’s a matter of judging the data quality and measuring the error post hoc, we don’t need to have the data provider’s input to do a good job of that.  In fact, we probably don’t want their input on that.  You can do better science when you start with an independent view.

But, adjustments happen on the fly, we aren’t aware of it, and then, we’re not sure how to handle biases we observe.

When we have the data itself in hand, as we do with PITCHf/x, we can figure out how to handle the biases and errors quite well.  It’s when we don’t have the data in hand, as is the situation where we are second-hand consumers of processed BIS data, that it is extremely difficult to identify and correct the biases and errors in the data.

In some respect, it’s good to have the complete unadjusted data, so that I can intelligently go through it and figure out the bias or other bad data.

Yes, exactly.  And we have that with PITCHf/x data, which is why data quality is not nearly so much of a problem with that data set as it is with the BIS data set.


#28          (see all posts) 2011/01/28 (Fri) @ 09:15

Ack. This part was poorly worded.

I find the standard deviation of the error in PITCHf/x plate location to be about 1 inch.

What I meant was that I find 68% of the points to be within roughly 1 inch of true and 95% of the points to be within roughly 2 inches of true location.

That’s for random + systematic error.
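
Those two figures are what a standard deviation of about 1 inch implies if the error is roughly normal along one dimension. A quick check under that assumption (the normality is an assumption of this sketch, not something stated above):

```python
from math import erf, sqrt


def fraction_within(distance, sd):
    """Share of normally distributed errors whose absolute size is <= distance."""
    return erf(distance / (sd * sqrt(2)))


within_1_inch = fraction_within(1.0, 1.0)   # about 0.68
within_2_inch = fraction_within(2.0, 1.0)   # about 0.95
```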


#29    Tangotiger      (see all posts) 2011/01/28 (Fri) @ 10:33

Back to the BIS data then: we’re not sure if there are systematic or random biases then.

David did a compare in the BIS BIAS thread, but we’re not sure if the reason it came out so well is because BIS used Gameday as a datasource in the majority of cases or not, and in the cases they didn’t, we’re not sure if they did a better job of cleaning up the problems or not.

In some respects, this reminded me of when we were cleaning up the Lahman database from other “independent” sources, when those sources would also use the Lahman DB.

I would think BIS should track the data source and put codes on each pitch.  Who knows, they probably do this already, and it’s not part of the main feed to Fangraphs and other customers.  After all, if someone at BIS overrides Gameday, his supervisor is going to want to know this, so that he can put a second guy on it to make sure the override was appropriate.

From the customer perspective, this is all a black box.  And if you don’t have someone independent verifying all this (that’s why god created the auditor), it becomes a case of “trust me”.  This applies to anything really, any system that produces results, and we don’t have access to the source code or in some cases the data set itself.


#30    Peter Jensen      (see all posts) 2011/01/28 (Fri) @ 10:49

What I meant was that I find 68% of the points to be within roughly 1 inch of true and 95% of the points to be within roughly 2 inches of true location.

Mike - When you say “roughly 1 inch of true” how do you know what “true” is?  Are you defining true as the Pitch f/x data after the normalizing process that you do?  Or are you comparing the extrapolated plate location from the Pitch F/x parameters to the more complex trajectory path used by Alan Nathan in his paper on the accuracy of the Pitch F/x system?


#31          (see all posts) 2011/01/28 (Fri) @ 10:55

After all, if someone at BIS overrides Gameday, his supervisor is going to want to know this, so that he can put a second guy on it to make sure the override was appropriate.

I can’t imagine a situation when an “override” of PITCHf/x plate location data by a BIS stringer would ever be appropriate.  There’s no way a human can get within a couple inches from video.  Not even close.  Now, the MLBAM stringers do sometimes make errors on assigning the data to the wrong pitch, etc.  They frequently come back and correct those in the audit.

But, my word, if BIS is reselling data that MLBAM is publishing, I’d imagine MLBAM would want a word or two or a hundred with them about the propriety of that!  I have no idea, personally, what BIS is doing on that front, other than a few anecdotes, but surely, Tango, you can’t be seriously suggesting that BIS label their data, “we stole this from MLB and resold it to you” and “we didn’t steal this from MLB”?  I’m guessing that’s a complete non-starter of a suggestion.


#32          (see all posts) 2011/01/28 (Fri) @ 10:59

Mike - When you say “roughly 1 inch of true” how do you know what “true” is?  Are you defining true as the Pitch f/x data after the normalizing process that you do?

Yes.

Or are you comparing the extrapolated plate location from the Pitch F/x parameters to the more complex trajectory path used by Alan Nathan in his paper on the accuracy of the Pitch F/x system?

Well, that, too.  That’s the random piece of random + systematic error. 

Alan found the constant-acceleration fitted points to be within about half an inch of true trajectory, iirc, and that was similar to the random error that Marv and his crew found with the foam board experiment.


#33    Tangotiger      (see all posts) 2011/01/28 (Fri) @ 11:26

I should first say I have NO IDEA how BIS collects its data and what its data source is.  I have no idea of copyright implications, or if BIS has any kind of licensing agreement with MLB or MLBAM.

***

“I can’t imagine a situation when an “override” of PITCHf/x plate location data by a BIS stringer would ever be appropriate. “

I was thinking of those cases where an intentional ball was shown to be right down the middle.  You know, I’m thinking of the override on the “obvious” cases, much like when goals are reviewed in hockey or plays in football, where you are only allowed to override the call on the field if it’s clear the call was wrong.  If it’s close, you let the call on the field/ice stand even if you think that call was more likely to be wrong than right.


#34    Tangotiger      (see all posts) 2011/01/28 (Fri) @ 11:42

Since a baseball is almost 3 inches in diameter, then any marking where the baseball is tracked to be at least 6 inches away from where it actually was is probably an easy override call.  Just guessing.

Note that being 6 standard deviations away, by random chance, is just about impossible even with 700,000 pitches.  Even at 5 standard deviations, you would expect fewer than one pitch out of 700,000 to be marked that badly by random chance alone.  (Hopefully I did that right.)

So, you can use that as the standard: any pitch more than 5 standard deviations away from its true landing spot is an automatic override.  So, if you think the pitch is marked almost two ball lengths away from where you think it actually landed, then override.
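
That arithmetic can be checked with a short sketch, assuming normally distributed error in one dimension and counting misses in either direction:

```python
from math import erfc, sqrt


def expected_beyond(k_sd, n_pitches):
    """Expected number of pitches more than k_sd standard deviations off,
    in either direction, under a normal error model."""
    two_sided_tail = erfc(k_sd / sqrt(2))
    return n_pitches * two_sided_tail


at_5_sd = expected_beyond(5, 700_000)   # roughly 0.4 pitches
at_6_sd = expected_beyond(6, 700_000)   # roughly 0.001 pitches
```

So under this model the 5 standard deviation threshold does come out to fewer than one pitch per 700,000.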


#35          (see all posts) 2011/01/28 (Fri) @ 11:53

Tom, I’m not sure what you’re suggesting here (#34).

Are you suggesting that we use BIS as our quality control for PITCHf/x data?  There are far better ways to quality control the PITCHf/x data.  Obviously bad PITCHf/x data is pretty easy to identify simply by noting release points, speeds, or spin deflections that are nonsensical.  Because of the constant-acceleration constraint on the PITCHf/x trajectory, you don’t get wildly screwy plate locations that don’t also have wildly screwy results in other parameters.

There are more subtle errors in the PITCHf/x data, of course, but I’m not sure why BIS would have any expertise about those.


#36    Tangotiger      (see all posts) 2011/01/28 (Fri) @ 12:09

Mike, I’m not setup to answer this question, but perhaps you are.

Let’s agree that every intentional ball thrown will never be anywhere close to the middle of the strike zone.  After all, if a pitcher intends to throw a ball, he will throw it at half-effort, and if it ends up in the strike zone, the batter is going to swing, and therefore, it won’t be a called ball.

Therefore, if I ask you to look at your PITCHf/x database, can you tell me, of all the intentional balls thrown, how many are within a 9 inch radius of the center of the strike zone.  And if you draw a larger circle of say radius 15 inches, how many are in the outer circle but outside the inner circle.  Finally, how many are outside the outer circle.  (Or use better distances after looking at the data.)

What percentage of the intentional balls were in the inner circle and how many outside the outer circle?
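The three-way split described above could be sketched roughly as follows. This assumes each pitch is available as a horizontal and vertical offset in inches from the center of the strike zone; that layout, the function names, and the sample offsets are all made up for illustration and are not the PITCHf/x schema.

```python
import math
from collections import Counter

INNER_RADIUS_IN = 9.0   # inner circle radius suggested above
OUTER_RADIUS_IN = 15.0  # outer circle radius suggested above

def ring(x_in, z_in):
    """Classify a pitch by its distance from the center of the strike zone.
    x_in and z_in are offsets in inches (hypothetical layout)."""
    distance = math.hypot(x_in, z_in)
    if distance <= INNER_RADIUS_IN:
        return "inner"
    if distance <= OUTER_RADIUS_IN:
        return "middle"
    return "outer"

def ring_shares(pitches):
    """Return the fraction of pitches that fall in each ring."""
    counts = Counter(ring(x, z) for x, z in pitches)
    total = len(pitches)
    return {name: counts[name] / total for name in ("inner", "middle", "outer")}

# Made-up offsets for illustration only, not real PITCHf/x data.
sample = [(30.0, 5.0), (28.0, -2.0), (16.0, 0.0), (40.0, 10.0)]
print(ring_shares(sample))
```

Any intentional ball that lands in the "inner" ring would then be a candidate for a closer look at its other recorded parameters.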

Finally, of those in the inner circle, how many can you spot using other PITCHf/x parameters that were obviously garbage?

Might make for a good BPro article…


#37          (see all posts) 2011/01/28 (Fri) @ 12:47

Tom, of the 3813 intentional balls in the PITCHf/x data set from April 5, 2010 onwards (which includes some from the playoffs and Arizona Fall League), none of them were recorded as being within 15 inches of the center of the strike zone.

The closest was 16 inches from the center of the zone and the next closest was 18 inches from the center of the zone.  The third-closest intentional ball was marked at 21 inches from the center of the zone, and then there are another two dozen pitches out to 30 inches.

Btw, this is why it stuns me to see people bring up PITCHf/x errors when BIS data quality is discussed.  BIS data is a pile of junk compared to PITCHf/x data.  Moreover, the errors in PITCHf/x have been much, much better documented and explored than the errors in BIS data.  They are really on completely different planes of existence in terms of data quality control.


#38    Tangotiger      (see all posts) 2011/01/28 (Fri) @ 12:57

Mike, well that’s fantastic to know!  I remember in the early PITCHf/x days that someone, maybe it was John Walsh, showed an intentional ball that was down the middle.  Given that my initial guess was that every IBB would be more than 15 inches from the center of home plate, and you are showing that indeed the closest was 16 inches away, that at least tells me that in the extremely obvious cases, PITCHf/x didn’t have any meltdowns.

Thanks for running that.


#39    dkappelman      (see all posts) 2011/01/28 (Fri) @ 14:38

Nowhere have I ever claimed that BIS location data is better than pitchf/x.  Lately I have seen BIS data categorized as “garbage” and “junk”.  All I was trying to show was that that is not the case with regard to OSwing% in recent years, where the average changed and the correlation with the pitchf/x data is extremely high.

If I didn’t chime in, you’d still be calling the 09 and 10 OSwing BIS data junk without fully exploring the publicly available dataset.  Anyone could have run the correlation I ran last night.

All I’m trying to say is that I really fail to see the point in jumping to conclusions on the BIS data.  Discrediting it because it’s not open and free and some preliminary tests don’t match up with what you’d expect doesn’t seem like a reason to completely lambast BIS and their data.  Is it a reason to take pause, yes, but the rhetoric in my opinion is more the former.


#40    Tangotiger      (see all posts) 2011/01/28 (Fri) @ 14:56

If you have two pieces of data that correlate at r=.99 or something, then it would be tough to say that one is good and the other one is junk.

Now, we can say that BIS provides no value-added beyond Gameday.  So, we can say that their effort in that respect is worthless.  Or that their value is limited to making the Gameday data more accessible, and is worth only that.

The effort may have been wasted effort, but the final result still bears close resemblance to other sources, no?

***

I don’t look at the data, so I don’t know.  I’m not making a judgement call, nor am I saying what BIS is or is not doing.  I’m just setting some parameters to debate on.


#41          (see all posts) 2011/01/28 (Fri) @ 14:59

David/39, you did bring up PITCHf/x errors as a defense that we shouldn’t be concerned about BIS errors.  My point is that the two are not in the same ballpark in data quality.  I stand by that statement.

We’ve already discussed why the correlation test is irrelevant for determining data quality here.

You say 2009 and 2010 BIS O-Swing data is not junk.  Which one of them has 3% of its data wrong, then?

Discrediting it because it’s not open and free and some preliminary tests don’t match up with what you’d expect doesn’t seem like a reason to completely lambast BIS and their data.

The problem is that we can’t do real tests of the data because it’s “not open and free”.  If you want to attack my criticisms as “preliminary tests”, you can’t at the same time defend the closed nature of the data.

Since you have access to the data, would you be willing to post a list of the top 50 or top 100 largest discrepancies between PITCHf/x and BIS plate locations from 2010 (or 2009), so that we can actually have a comparison with some teeth to it?


#42          (see all posts) 2011/01/28 (Fri) @ 15:04

Tom, I would think that the concerns that Colin raised in the articles listed in #6 should define the parameters of this debate.

The points he raised, particularly in the second link, remain unaddressed.


#43    Tangotiger      (see all posts) 2011/01/28 (Fri) @ 15:19

I presume David is limited by his licensing agreement as to what he can and can’t publish.

***

Mike: that second article is the one I linked to under BIS BIAS right?  Great stuff from what I remember, with heavy park influence right?

So, yes, in order for David to make a stronger case, he’d have to present the various correlations at a park/year level at least.  It’s easy to get r=.95 if you have r=.97 in 28 parks and r=.60 in 2 parks.

Furthermore, it’s also easy to get a high r on pitches around the middle or way outside, but the big concern would be on the smaller number of pitches on the edges.

That is, systematic biases can be easily covered by a huge quantity of noise.
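A small simulation can illustrate how a pooled correlation hides a couple of bad parks. This is only a sketch with synthetic data: it assumes 28 parks where the two sources agree at about r=.97 and 2 parks at about r=.60, with 500 made-up observations per park and the same spread everywhere. None of the numbers come from BIS or PITCHf/x, and `simulate_park` is a hypothetical helper.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def simulate_park(rng, n, target_r):
    """Simulated (reference, stringer) values whose correlation is near target_r."""
    noise_sd = math.sqrt(1 / target_r ** 2 - 1)
    reference = [rng.gauss(0, 1) for _ in range(n)]
    stringer = [x + rng.gauss(0, noise_sd) for x in reference]
    return reference, stringer

rng = random.Random(0)  # fixed seed so the sketch is repeatable
parks = [simulate_park(rng, 500, 0.97) for _ in range(28)]
parks += [simulate_park(rng, 500, 0.60) for _ in range(2)]

per_park_r = [pearson(ref, obs) for ref, obs in parks]
pooled_ref = [x for ref, _ in parks for x in ref]
pooled_obs = [y for _, obs in parks for y in obs]
pooled_r = pearson(pooled_ref, pooled_obs)

print("pooled r:", round(pooled_r, 3))
print("worst park r:", round(min(per_park_r), 3))
```

In this setup the pooled figure stays well above the two weak parks, which is why a park-by-park breakdown says more than a single league-wide r.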

So, yes, it would be very interesting to see more of a drill down here, by park, by location, even pitch type, to see where the biases are, and how large are these biases.

***

Again, BIS is likely going to limit any kind of testing or publishing of results of their data.


#44          (see all posts) 2011/01/28 (Fri) @ 15:37

Again, BIS is likely going to limit any kind of testing or publishing of results of their data.

I completely understand that.  Likewise, BIS and Fangraphs should understand that we will have little trust of their data under that arrangement when our tests show significant problems with it.

I would be happier with the BIS zone data if two questions were answered with some actual facts.

1. Why did the percentage of pitches marked in the zone at which batters swung drop from around 80% in the 2002-2005 time period to around 65% in 2010?

(Or, as a subset of that question, why did it drop from around 72% in 2009 to around 65% in 2010?)

If BIS changed something, how specifically did that change affect the data?

2. Why were the 2007 zone percentage numbers for LA and Anaheim 10% lower than for the rest of the league?

It’s not addressing the issue to write these off as outliers.  How do we know that what went wrong with those two team-seasons of data doesn’t also affect the rest of the data, just to a lesser extent?


#45    Tangotiger      (see all posts) 2011/01/28 (Fri) @ 15:42

I agree, BIS should answer these questions.  Make a formal request of them, and see what they say.  We need you to be your best Murray Chass here.

Normally I would do it myself, but I simply never use that data.

I know we DID do something like this with their hit location data that I was using, and I or someone did alert Dewan that there was a huge problem, and they did address it and fix it and they were happy to be told.  I think we had a thread about this back in 2007 or 2008.


#46    dkappelman      (see all posts) 2011/01/28 (Fri) @ 18:09

I think finding anomalies is important, but you want to throw the baby out with the bathwater because of something that happened in 2007, when in fact the data from 2008 (I just checked), 2009 and 2010 for OSwing all look pretty good using BIS data.

Next time I talk to BIS I will ask them about the 2007 data.  There is not an obvious outlier like that in any of the other years, or else I’m sure you would have found it by now.  There are probably park effects and I will concede that, but with such strong correlations with pitchf/x data I do not think they are that large.

I feel I have made a pretty strong argument that the 2008 - 2010 OSwing data is not junk.  Even the 2007 data has a .9 correlation and that is with the LA anomaly and the chance BIS was referring to any pitchf/x data that year seems pretty slim.

I did mention in my other post that I think the strike zone is slightly shifted from pitchf/x, but it doesn’t really seem to have much of an effect on the O-Swing results.  And I agree there might be some players where the effect is greater than it would normally be.

While I don’t want to trivialize your findings of park bias because I will re-iterate I think they are important, I think it is incorrect to use that argument to try and invalidate the entire dataset.


#47    Colin Wyers      (see all posts) 2011/01/28 (Fri) @ 21:18

If you give a baby a bath, you’ll have dirty water. That doesn’t mean that every pail of dirty water has a baby in it.

You have not yet shown any reason for us to think that the BIS scorers have an independent ability to determine whether or not a pitched ball was in the strike zone. I have shown many reasons to think that they do not, both in studying the data and presenting known optical effects in converting a three-dimensional trajectory into a two-dimensional image.

I mean, it would be one thing if I was just finding random irregularities. I’m finding irregularities (by which I mean the park effects) that were totally predicted by the known effects of parallax and other basic principles of optics. Putting a camera way out in center field and using a variety of optical tricks to make it look like the camera isn’t all the way out in center field makes it exceedingly difficult to accurately discern location of the pitched ball. I have no idea why you think that the default presumption should be that this data set is “valid” unless I can disprove every aspect of it.

As for:

Discrediting it because it’s not open and free and some preliminary tests don’t match up with what you’d expect doesn’t seem like a reason to completely lambast BIS and their data.

Really? I’m pretty sure that’s what Bill James did to Elias, and I’m pretty sure that’s why he did it. Lambasting data providers for this sort of thing is a foundational principle of what sabermetrics is - the search for objective knowledge about baseball. And that’s JUST as true when Bill James is one of the black hats now.

I feel I have made a pretty strong argument that the 2008 - 2010 OSwing data is not junk.  Even the 2007 data has a .9 correlation and that is with the LA anomaly and the chance BIS was referring to any pitchf/x data that year seems pretty slim.

The second half of this seems to be counter to the first part, in that you are showing that it’s possible for an exceedingly flawed data set (one where two teams have their O-Swing underrepresented by about 8%, just windaging it) to post correlations of .9 or higher when sample size is restricted to 1000 or more pitches seen. The ‘07 correlations would seem to indicate that such correlations are insensitive to problems in the dataset, not that the dataset is of high quality.
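To make the insensitivity point concrete, here is a rough sketch with synthetic numbers. It assumes 450 hypothetical hitters whose true O-Swing rates are spread around .28, a small amount of scoring noise, and an 8-point shortfall applied to 30 of them (standing in for two teams). All of these values are assumptions chosen for illustration, not measurements.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
N_PLAYERS = 450         # hypothetical: 30 teams x 15 hitters
N_BIASED = 30           # hypothetical: the hitters on two teams
BIAS = -0.08            # assumed shortfall of about 8 points

true_rate = [rng.gauss(0.28, 0.05) for _ in range(N_PLAYERS)]
clean = [t + rng.gauss(0, 0.01) for t in true_rate]
biased = [c + BIAS if i < N_BIASED else c for i, c in enumerate(clean)]

r_clean = pearson(true_rate, clean)
r_biased = pearson(true_rate, biased)
print("r without the bias:", round(r_clean, 3))
print("r with two biased teams:", round(r_biased, 3))
```

Under these assumptions the correlation drops only modestly even though every hitter on the two affected teams is badly misstated, so a high r on its own does not rule out a problem of that size.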

I think finding anomalies is important

Good, I look forward to you continuing to look for them and announce them when you find them.


#48    dkappelman      (see all posts) 2011/01/29 (Sat) @ 10:33

Your last comment got me thinking a little more, because in some sense we are talking about two different things.  I’m focusing on O-Swing%, because that is where the conversation started, but to address what you are talking about I really should look at Zone%, because that is balls inside the strike zone / total pitches.
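As a reference for how these rates relate to one another, here is a minimal sketch that computes them from per-pitch records. The `Pitch` record and its field names are hypothetical, the sample is made up, and Z-Swing% is included as the in-zone counterpart of O-Swing%.

```python
from dataclasses import dataclass

@dataclass
class Pitch:
    in_zone: bool  # scorer marked the pitch inside the strike zone
    swung: bool    # batter swung at the pitch

def plate_discipline(pitches):
    """Compute Zone%, O-Swing% and Z-Swing% from a list of Pitch records."""
    in_zone = [p for p in pitches if p.in_zone]
    out_zone = [p for p in pitches if not p.in_zone]
    return {
        "Zone%": len(in_zone) / len(pitches),
        "O-Swing%": sum(p.swung for p in out_zone) / len(out_zone),
        "Z-Swing%": sum(p.swung for p in in_zone) / len(in_zone),
    }

# Made-up sample: 4 pitches in the zone (3 swings), 6 outside (2 swings).
sample = (
    [Pitch(True, True)] * 3 + [Pitch(True, False)]
    + [Pitch(False, True)] * 2 + [Pitch(False, False)] * 4
)
print(plate_discipline(sample))
```

Every one of these rates depends on the `in_zone` flag, so any error in marking the zone carries through to O-Swing% as well as Zone%.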

When you look at Zone% in 2010 across the two data sources, you get a correlation of .9.  In 2009, it’s .85, in 2008 it’s .82, and in 2007 with LA data removed, it’s about .62.  What is more interesting is that if you look at only data where the batter swung at the ball, the correlations are .93 in 2010, .87 in 2009, .89 in 2008, and .76 in 2007 (LA data removed).

So this is why the O-Swing% numbers are not particularly affected, because of the combination of it correlating strongly with Swing%, and that the correlations for Zone% in 2007 are better when a batter is swinging.

Anyway, perhaps that is more of an apples to apples comparison with what you’re talking about. 

With the numbers being lower for zone% for BIS data, but the correlation with pitchf/x being high for 2008-2010 I’m curious why you attribute the improvement in correlation solely to pitchf/x and that they have no independent ability to determine if a ball is in the strike zone or not?  If they were just going off pitchf/x wouldn’t the Zone% numbers be nearly identical?  While pitchf/x might certainly be a contributing factor, isn’t it also possible that the LA problem in 2007 led to more stringent data integrity checks?

In any event, I will add Plate Discipline data to our pitchf/x section for those who would like to use that data as well.  The more options the better, in this case.


#49    studes      (see all posts) 2011/01/29 (Sat) @ 11:14

Just catching up with this conversation, but Steve, you’re welcome to use/refer to this when you’d like:

http://www.hardballtimes.com/main/reference/

I’m still building it, so I haven’t publicized it yet.


#50          (see all posts) 2011/01/29 (Sat) @ 11:54

isn’t it also possible that the LA problem in 2007 led to more stringent data integrity checks?

This cuts to the heart of one of my issues here.  A lot of things are possible.  But what actually happened?  If BIS recognized a problem with the 2007 data, what did they identify as the cause, and what kind of fix did they implement, and how did they verify that the fix was effective?  It’s not clear to me that they did any of those four things. 

In absence of evidence, it’s contrary to good sabermetric practice to speculate what BIS might have done, and if we don’t know what they did, how can we know if it was effective? 

Well, we could know more if we had the data, but that’s a non-starter for those of us who haven’t paid for the data. David, you have access to the detailed data and could do a more thorough study and possibly determine the answer to some of these questions, even if for some reason you are not willing to demand the answers from BIS.  Doing correlations from season to season at a league level is not helpful here.  The reasons for that have been outlined several times in this thread and elsewhere.  You need to do an investigation into the data to try to determine the cause of the problems.  Are they missing on outside pitches to LHB?  Is it curveballs that cause the problem?  Is it certain ballparks?  Etc.

Without those answers, we outsiders have two sorts of evidence that cast considerable doubt on the accuracy of the BIS zone data.  (As you mentioned, yes, there’s not much reason to question the accuracy of the swing data.  It’s not that hard for a video observer to record fairly accurately whether the batter swung or not.)

One is the problems in the data itself, such as detailed above, with the decreasing in-zone swing percentage and the LA parks in 2007.

The other is the evidence that it’s hard for video observers to accurately judge 3-D locations on a 3-D trajectory in the middle of space.  Colin’s articles do a thorough job of explaining why that is, but I’ll add one thing he didn’t mention.  We have the detailed data from 2007 from the Gameday stringers who did exactly what BIS is trying to do.  Their strike zone locations were wildly inaccurate--by that I mean typically 6 or 12 inches off.  I have no reason to believe that BIS stringers do better.  For a long time with batted ball data it was alleged that BIS stringers did better than Gameday stringers because of better quality control or what not.  Then we finally got the evidence and it showed that their batted ball distributions were amazingly similar.

Hoping that BIS does good quality control is not good enough.  Any detailed evidence that we have shows that they do not.  So far it is only the speculative and vague that claims that they do.  Why should we accept that?  In this business, data talks and bullshit walks.  I am fully willing for BIS (or Fangraphs) to prove the quality of the BIS data.  I don’t have some hatred of BIS or Fangraphs.  Far from it.  Heck, I was very critical of Baseball Prospectus for a long time on these same pages, and look where I am now.  I am very forgiving when there is evidence of changes or proof that my skepticism was unnecessary.


#51    tangotiger      (see all posts) 2011/01/29 (Sat) @ 13:13

Doing correlations from season to season at a league level is not helpful here.

David noted that the correlation was between the “two data sources”, which I presume means he correlated BIS to PITCHf/x.


#52          (see all posts) 2011/01/29 (Sat) @ 13:22

David noted that the correlation was between the “two data sources”, which I presume means he correlated BIS to PITCHf/x.

And no more useful, for reasons you yourself have cited here, so what’s the point of saying that?  You’re obscuring the good points I’ve made by even bringing that up.  I’m disappointed.


#53    Tangotiger      (see all posts) 2011/01/29 (Sat) @ 14:01

Mike, I was correcting a statement you made. There was no point to be made.

No need to try to interpret it any negative way, and certainly not in such a negative way that it disappoints you.


#54    Colin Wyers      (see all posts) 2011/01/29 (Sat) @ 14:36

With the numbers being lower for zone% for BIS data, but the correlation with pitchf/x being high for 2008-2010 I’m curious why you attribute the improvement in correlation solely to pitchf/x and that they have no independent ability to determine if a ball is in the strike zone or not?

Well, that’s not exactly what I said. I said that without knowing how independent the scorers were of Pitch F/X data, it was not evidence of an ability of the scorers to determine from video alone whether or not a pitched ball was in the strike zone.

Of course, absence of evidence is not evidence of absence. It is possible - given ONLY what we see of the two data sets, in the time period of ‘08 to ‘10 - that the BIS scorers have shown some ability to determine balls from strikes. But without knowing more about the data collection process and possible “contamination” of the BIS data with Pitch F/X, we can’t determine that. (I know of at least two scorers who used Pitch F/X at time of data collection, so we know there’s SOME contamination.)

So given all that, why don’t I think that there is a significant ability to tell balls from strikes on TV? For the same reason people with eyepatches aren’t allowed to fly airplanes. The technical term is stereopsis:

http://en.wikipedia.org/wiki/Stereopsis

In other words, when you have two eyes, the parallax difference (or binocular disparity) between the two images allows the brain to reconstruct a three-dimensional representation of the distance between objects. (This is similar to why Sportvision uses multiple cameras in Pitch F/X.)

When you lack stereopsis, your brain falls back on depth perception - using visual cues such as the relative size of objects, how clear/fuzzy they appear, etc. - to substitute for the lack of stereopsis. Depth perception is NOT as accurate as stereopsis under normal circumstances, however - if you don’t believe me, cover up one of your eyes and go walking around for a little bit.

But these aren’t normal circumstances - you’re dealing with a lens that’s offset from the path of travel of the pitched ball, and then zoomed in. If you look at the fundamental properties of lenses, the magnification of a lens is affected by two factors - the distance from the lens to the focal point and the distance of the lens from the object:

http://www.tech.plym.ac.uk/maths/resources/PDFLaTeX/lens_equation.pdf

In other words, the lens is going to magnify distant objects by a different amount than near objects. Again, recall what I said earlier: the relative size of objects is one of the major visual cues used to provide depth perception when binocular disparity is not occurring. But the lens, if the focal length is greater or less than that of the so-called “normal” lens (about 50mm or so for a 35mm camera; just totally spitballing here, but the typical center field camera I’d estimate is well over 200mm in terms of focal length), is distorting the perspective and in essence “fooling” our brain into thinking the mound is closer to the plate than it is in actual fact. So the trajectory of the pitched ball looks different on TV than it does in real life.
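The thin-lens relationship can be put into numbers with a short sketch. It uses the standard thin-lens magnification f / (d - f) for an object at distance d, and compares how large the same object appears at the mound versus at the plate. The camera distances and focal lengths below are guesses for illustration, not surveyed values, and the function names are hypothetical.

```python
MOUND_TO_PLATE_M = 18.44  # 60 feet 6 inches, rubber to plate

def magnification(focal_length_m, object_distance_m):
    """Thin-lens magnification (magnitude) for an object beyond the focal length."""
    return focal_length_m / (object_distance_m - focal_length_m)

def mound_vs_plate(camera_to_plate_m, focal_length_m):
    """How much larger an object at the mound appears than the same object
    at the plate, for a camera behind the mound looking toward the plate."""
    at_mound = magnification(focal_length_m, camera_to_plate_m - MOUND_TO_PLATE_M)
    at_plate = magnification(focal_length_m, camera_to_plate_m)
    return at_mound / at_plate

# Hypothetical setups: a long lens far away vs. a "normal" lens up close.
far_long_lens = mound_vs_plate(camera_to_plate_m=140.0, focal_length_m=0.200)
near_normal_lens = mound_vs_plate(camera_to_plate_m=30.0, focal_length_m=0.050)
print("far camera, 200mm lens:", round(far_long_lens, 2))
print("near camera, 50mm lens:", round(near_normal_lens, 2))
```

The closer the ratio is to 1, the less the ball appears to shrink on its way to the plate, which weakens the relative-size cue described above. In this simplified model the ratio is driven mostly by how far away the camera sits; the long focal length is what makes shooting from that distance practical.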

(I’m actually understating the problems here - in real life the “thin lens” equations don’t totally predict the behavior of lenses made out of actual glass, and so you have factors like barrel distortion and what have you as well. It’s straightforward, once you have the equations solved for, to account for this systematically. But it’s typically not something that you notice is occurring unless you’re paying attention to it.)

Now if you actually study ballparks and look for where they’re placing the center field cameras, you find that there is actually quite a bit more divergence in location than you would assume from watching games. TBS used to publish all their ballpark site surveys for baseball telecasts - they’ve stopped updating it, but it’s still the single most valuable resource for ballpark camera locations I’ve seen:

http://depth_of_field.blogs.com/

I encourage anyone who cares about this to flip through the slideshows there and take a look at how different parks are set up.

So that’s why the findings of park bias are so interesting to me, absent the very real concerns of the effect park bias has on the data. What they’re telling us is that the scorers are not able to account for the differences in camera placement in terms of the optical effects involved. In other words, the idea that you can accurately collect pitched ball types and locations is dependent upon the idea that the “video scouts” can take into account the optical illusions inherent in the video feed and compensate for them. All the evidence we have on the matter says this simply isn’t so.


#55    Colin Wyers      (see all posts) 2011/01/29 (Sat) @ 14:57

And I just want to note, for the record, that I have corresponded privately with people at BIS in the past about my concerns about their data quality. I don’t think these conversations were intended for public dissemination so I won’t repeat any of what was said, but nothing was said that increased my confidence in their data collection, or gave me the impression that they were going to become more transparent about what they’re doing.

And I have attempted to correspond with people like Dave Cameron and David Appelman at Fangraphs to privately air my concerns with the data and their use of it. I don’t know if Appelman ever got that e-mail, but I know Dave Cameron did. Again, I did not get the impression that there was much interest in dealing with the issues I was pointing out.

I am trying to be reasonable here, and I am trying to do what I think is best for all of us as a research community. Earlier, you said:

Discrediting it because it’s not open and free and some preliminary tests don’t match up with what you’d expect doesn’t seem like a reason to completely lambast BIS and their data.

I would be very open to hearing what other approaches you feel are available to me. Because it’s not just that the data appears flawed given preliminary testing (otherwise known as the ONLY testing possible without access to a more granular view of the data), or that the data is proprietary rather than open. It’s that every time I raise these issues with people who DO have access to the data they simply shut their mouths. BIS is free to conduct their business however they like but they should not be immune to criticism, and if they decide not to participate in the discussion over these data issues the rest of us have to decide how to proceed from there. Nothing I’ve seen gives me the confidence to give BIS the benefit of the doubt anymore, and if BIS is unwilling to defend themselves by presenting evidence only they would be able to, it’s starting to seem more and more like that evidence doesn’t exist.


#56    dkappelman      (see all posts) 2011/01/29 (Sat) @ 18:22

Colin, I do not think it is fair of you to characterize FanGraphs / me or Dave Cameron as uninterested in these issues.  You sent us one e-mail regarding this data on which we did not see eye to eye, which I believe Mike publicly responded to on BP.  I have tried to be as open as I can about the data on FanGraphs and I typically respond to errors in a very timely manner.  I get a lot of e-mails and sometimes things fall through the cracks, so if I missed something of yours then I do apologize.

I have tried to present data that shows that despite whatever irregularities or data collection issues there are in the BIS data, that the way we aggregate it on the player level on FanGraphs washes out a lot of the potential issues you are suggesting and matches up quite closely with a separate (but as you said, perhaps not completely independent) data source.  I am not using any data that is not publicly available to run these tests (with the exception of one with the swinging only correlations). I’m sure there are potentially better ways to aggregate the data and apply park factors and what not, but at the moment we’re not doing that and perhaps it’s something we should look into.


#57    Colin Wyers      (see all posts) 2011/01/30 (Sun) @ 11:53

David, it feels like we’re talking past each other here, and I think it’s because we’re coming at this from different starting points. From where I sit, BIS (and their defenders) saying that they accurately discern whether or not a pitch was a ball or a strike when watching on TV is about as credible as if they’d said they had invented cold fusion, or could transmute lead into gold. I simply don’t think it’s reasonable to start from an assumption that they’re doing what they claim to be doing.

And I have laid out, in excruciating detail, why I feel this way. But to sum up - it’s because I know a lot about cameras. Do you have nothing to say about any of that? Do you think I invented a time machine and forged ancient Greek writings about optics so that I could impugn the integrity of certain baseball stats I don’t like?

The article David is referring to, by the way, is Mike’s analysis of BIS cutter classifications:

http://www.baseballprospectus.com/article.php?articleid=12308

I’ll let everyone determine which side they want to fall on there.

And when you say “perhaps not completely independent” - fine, let me lay my cards on the table. Here are two pictures taken by a former BIS video scout while he was working there:

http://plixi.com/p/17434507

http://plixi.com/p/23548800

Please note the presence of Gameday in both those pictures. I am not speculating about Pitch F/X data “contaminating” the BIS pitched ball data, I am absolutely certain it happens and that’s because I have evidence. (The question is simply to what extent it occurs.) You can speculate all you want to about “improved data quality controls” being responsible for the radical changes in the BIS data, but it strikes me as odd to put your speculation on equal footing with the facts I’m presenting here.

As for:

I have tried to present data that shows that despite whatever irregularities or data collection issues there are in the BIS data, that the way we aggregate it on the player level on FanGraphs washes out a lot of the potential issues you are suggesting and matches up quite closely with a separate (but as you said, perhaps not completely independent) data source.

I’m absolutely certain you have tried to do that. But we know at least two issues didn’t wash out - since the ONLY view I have of the data is the aggregated view, then any problems I notice in the data by definition didn’t wash out. If there are problems that aggregating the data has solved, I wouldn’t be able to detect what they are. So if anything I’m understating the number of problems with this dataset.


#58    Colin Wyers      (see all posts) 2011/01/31 (Mon) @ 13:26

But to swing this back around to the original point. I think at this point everyone concurs the original glossary entry I pointed out was misstated. Looking at that page, that entire section was removed:

http://www.fangraphs.com/library/index.php/offense/plate-discipline/

So now there’s nothing on there to alert readers to the issues identified in this thread. If you dig into the comments of one of the linked articles, you’ll find out about the LA data:

http://www.fangraphs.com/blogs/index.php/plate-discipline-stats/#comment-33374

But the findings of league average drift? Or the larger issues of park bias, excluding the LA data?

I’m reminded of this exchange from the Hitchhiker’s Guide to the Galaxy:

“But Mr Dent, the plans have been available in the local planning office for the last nine months.”

“Oh yes, well as soon as I heard I went straight round to see them, yesterday afternoon. You hadn’t exactly gone out of your way to call attention to them, had you? I mean, like actually telling anybody or anything.”

“But the plans were on display ...”

“On display? I eventually had to go down to the cellar to find them.”

“That’s the display department.”

“With a flashlight.”

“Ah, well the lights had probably gone.”

“So had the stairs.”

“But look, you found the notice didn’t you?”

“Yes,” said Arthur, “yes I did. It was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard’.”

So let’s consider the average Fangraphs reader. Or better yet, let’s consider the median - the average is probably dragged down by the guys who only stick around waiting for Cameron to post so they can say #6org in the comments. And let’s say my team has just traded for Mike Napoli (which happens far more often than you’d expect). So I pull up his player card:

http://www.fangraphs.com/statss.aspx?playerid=3057&position=C

And I’m looking at his plate discipline stats, and I go, “Hmm, that’s interesting, his O-Swing numbers this last year were the same as in ‘07, both much higher than the rest of his career.” And so I write up a little FanShot or whatever on how if Napoli can just hold off on swinging at pitches outside of the zone, maybe he can have another season like he had in ‘08.

So how, exactly, is this fan supposed to know that Mike Napoli didn’t swing at more pitches outside the zone in ‘07?


#59    Colin Wyers      (see all posts) 2011/02/03 (Thu) @ 18:00

Bumping, so that people looking at the other O-Swing thread know what I’m talking about.

