THE BOOK--Playing The Percentages In Baseball

Friday, September 24, 2010

Strikeout rates and… that’s all there is

By Tangotiger, 09:57 AM

Had Matt led with his conclusion, it would have been apparent we should have had low expectations:

Of course, strikeout rate for pitchers is one of the quickest to stabilize among all baseball statistics, and so the added value of information beyond knowing historical strikeout rate is least likely to be significant for strikeout rate as compared with any other statistic. Thus, next week I will look at walk rates and attempt to determine whether this type of information can inform our knowledge about walk rates any more than it could have informed us about strikeout rates. 

So, I did have low expectations.  His slicing and dicing and chopping and grinding of the BIS data on Fangraphs leaves us with almost no expectations.

The data, at the least, can be used as “profile” data.


#1    Colin Wyers      (see all posts) 2010/09/24 (Fri) @ 10:54

Can it?

Two baseball highlight clips, chosen after about 5 minutes of looking. The first, a Pujols home run in Pittsburgh:

http://mlb.mlb.com/video/play.jsp?content_id=12360933&topic_id=&c_id=mlb

Now, Ichiro’s 200th hit:

http://mlb.mlb.com/video/play.jsp?content_id=12365637

Oh, heck, why not a third. Like anyone who reads the blog doesn’t like baseball highlights. Prince Fielder goes yard:

http://mlb.mlb.com/video/play.jsp?content_id=12384401&topic_id=11493214

When you’re watching, pay attention to the different camera views you’re getting. If we take two players with different home parks - and each park has a different center field camera view - how comfortably can we use stringer observation of the strike zone to compare these hitters to each other? So-and-so may in fact be swinging at more pitches out of the zone, or he may just be playing in a park where the camera placement makes it look like he swings at more out of the zone pitches than the typical hitter.

I mean, maybe it can. But before we say “at the least,” I’d like to see some validation that the camera placement isn’t having a significant effect on the data collection.


#2    Tangotiger      (see all posts) 2010/09/24 (Fri) @ 12:17

If this is a BIS v PITCHf/x issue, then we obviously prefer PITCHf/x, right?


#3    Colin Wyers      (see all posts) 2010/09/24 (Fri) @ 12:39

I think that’s a universal area of agreement.

Here’s the more interesting question, to me. Pitch F/X doesn’t exist prior to 2007. The BIS data goes all the way back to 2002.

If you do not have the Pitch F/X data at all, what is going to give you the best information about a batter?

* Using the BIS pitch-charting data (O-Swing, Z-Swing, etc.), or

* Ignoring that data entirely?


#4    Tangotiger      (see all posts) 2010/09/24 (Fri) @ 12:49

How about correlating the BIS data 2008-2010 to the PITCHf/x data?

Take the correlation (r), and regress based on that.  Presume data quality is the same for the pre 2008 data, and apply to all data.
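A minimal sketch of that suggestion, assuming the overlap seasons give paired BIS and PITCHf/x values for the same hitters. All numbers below are made up for illustration, the helper names are hypothetical, and regressing toward the sample's own mean is a simplification.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

def regress_toward_mean(values, r):
    """Shrink each value toward the group mean, keeping a fraction r of its deviation."""
    m = mean(values)
    return [m + r * (v - m) for v in values]

# Hypothetical O-Swing% for the same five hitters in the overlap seasons.
bis_o_swing = [0.22, 0.25, 0.28, 0.31, 0.34]
pfx_o_swing = [0.24, 0.24, 0.29, 0.28, 0.33]

r = pearson_r(bis_o_swing, pfx_o_swing)

# Hypothetical pre-2008 BIS values, adjusted on the presumption that
# data quality was the same then as in the overlap seasons.
pre_2008_bis = [0.20, 0.27, 0.35]
adjusted = regress_toward_mean(pre_2008_bis, r)
print(round(r, 2), [round(a, 3) for a in adjusted])
```

The adjusted values keep their order and their mean; only the spread narrows, and it narrows more as r falls.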


#5          (see all posts) 2010/09/24 (Fri) @ 13:02

Tango/4, would regression help if the differences between BIS and PITCHf/x are primarily systematic (say, park/camera-based) rather than random?


#6          (see all posts) 2010/09/24 (Fri) @ 13:07

Clarifying #5, if we found a systematic and knowable bias, say, for right-handed hitters vs. left-handed hitters, that would be valuable information and we could adjust pre-2008 data on that basis.

However, if the bias appears systematic but its nature is unknown, say, from shifting camera vantage points over time, the power of regression toward the mean for tackling that kind of problem seems very limited.

If the bias is more random, such as individual stringer differences in the case where stringers are rotated frequently, then regression might be successful in improving the data quality.


#7    Colin Wyers      (see all posts) 2010/09/24 (Fri) @ 13:20

Given the change in BIS’s data collection practices over time, I’m uncomfortable in assuming that, actually. The O-Swing parameter seems in particular to change over time:

http://www.flickr.com/photos/32431989@N00/4909374614/

This seems to be an artifact of data collection, not a change in behavior by hitters:

http://www.hardballtimes.com/main/blog_article/is-the-bis-data-right/

And I mean, compare even that Pujols video above to his Schumacher video:

http://mlb.mlb.com/video/play.jsp?content_id=12359297

There are a lot of variables to track down in terms of camera positioning, and if that’s a source of bias, it’s unclear how well that applies in retrospect - cameras are easy to move between seasons, after all.

I mean, can you picture trying to do park factors based upon that kind of thing? Even with a more granular dataset (which means having BIS or maybe Fangraphs doing the grunt work, as all we have are seasonal totals, right?) this is a lot harder than figuring out typical run-based park factors, and even for that we insist on three years of data, right?


#8    Tangotiger      (see all posts) 2010/09/24 (Fri) @ 13:23

If it’s systematic bias, but we are unaware of this, then I don’t see why regression would be hurtful.  Colin is asking if we are better not knowing at all.

Let’s use a more real-life example: say all you had was pitch speed.  You don’t know if it’s a fastball or curve or changeup.  Indeed, all you have is average pitch speed.  Say you have someone who throws a lot of changeups at 80 mph, with a fastball at 90, and someone else who throws a lot of fastballs at 90 and very few changeups at 80.  While they have the same speed, pitch for pitch, their overall averages would be 81 mph for one guy and 89 mph for the other guy.

So you have a list of all pitchers like that, and you are asked: who has the fastest fastball?  Well, you don’t know, but can’t you infer it with some degree of certainty using only that data?

Yes, the r to fastball speed will not be 1.00, but it won’t be .00 either.  All the non-fastballs are noise in the data, and some pitchers will have a lot more noise than others.

What do you do?  Well, you regress the pitch speeds to infer fastball speeds.  Maybe the 89mph guy will regress to 92 and the 81mph guy will regress to 87 or something.  While they should both be 90, we just say that one is 92 +/- 3 and the other is 87 +/- 4 or something.
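The 81 and 89 mph averages in this example fall out of a simple weighted mean. A small sketch, assuming the two pitchers throw 10% and 90% fastballs respectively (those shares are inferred from the averages quoted, not stated in the thread):

```python
def average_speed(fastball_mph, changeup_mph, fastball_share):
    """Overall average pitch speed for a two-pitch mix."""
    return fastball_share * fastball_mph + (1 - fastball_share) * changeup_mph

# Same 90 mph fastball and 80 mph changeup, very different mixes (assumed shares).
changeup_heavy = average_speed(90, 80, 0.10)
fastball_heavy = average_speed(90, 80, 0.90)
print(round(changeup_heavy, 1), round(fastball_heavy, 1))
```

Both pitchers have an identical fastball, yet the averages differ by about 8 mph, which is the noise that regression is being asked to discount.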


#9    Colin Wyers      (see all posts) 2010/09/24 (Fri) @ 13:32

Okay, let’s talk pitch speeds.

Let’s say all you have is pitch speed data, as recorded by an old-school scout with a radar gun sitting somewhere in the stands. We do not, for whatever reason, know which scout recorded which data.

And to make it easy, let’s assume all our scouts are using the Stalker SPORT, the most popular radar gun of its type in existence:

http://www.stalkerradar.com/pdf/sport_manual.pdf

Please consult page 11 of the Stalker manual for the chart on angle error.

Let’s say you have three pitchers, with the average pitch speeds listed below:

86
85
83

Each pitcher was recorded by a different scout. You know all of the scouts are positioned between 5 and 15 degrees from the direction the pitch is traveling.

Now - which of those pitchers has the fastest fastball?
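A rough sketch of why these readings are hard to rank, assuming the angle error in question is the standard Doppler cosine effect (reading = true speed × cos(angle)). The manual's chart is not reproduced here, and `true_speed_range` is a hypothetical helper.

```python
import math

def true_speed_range(reading_mph, min_angle_deg=5.0, max_angle_deg=15.0):
    """Range of true speeds consistent with a reading, under the cosine-effect assumption."""
    low = reading_mph / math.cos(math.radians(min_angle_deg))
    high = reading_mph / math.cos(math.radians(max_angle_deg))
    return low, high

ranges = {reading: true_speed_range(reading) for reading in (86, 85, 83)}
for reading, (low, high) in ranges.items():
    print(f"{reading} mph reading: true speed between {low:.1f} and {high:.1f} mph")
```

Under this idealized model, the possible true speeds behind the 85 and 86 readings overlap, so the scout's unknown position can be enough to flip their order.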


#10          (see all posts) 2010/09/24 (Fri) @ 13:48

If you had to pick one pitcher, you pick the guy with the 86-mph reading, even though you’d be wrong some substantial fraction of the time.  But you’d be right more often picking the 86-mph guy than you would be picking the 85-mph guy or 83-mph guy.  I think that’s Tango’s point.

The question is about the size of the error relative to the size of the signal we’re trying to measure.  It’s also about random vs. systematic errors.  If the error is random, we just take larger sample sizes, but that doesn’t help us with systematic errors.  But I believe Tango’s point is that regression helps us in either case.

In the pitch speed example, if pitchers’ fastballs vary by 10 mph across the population, and we can’t measure them within an error of 50 mph, our measurement is pretty useless, even after regression.  But if we get our error down to 3 mph, our measurements could be pretty valuable.

In the case of O-Swing/Z-Swing, we don’t yet know the size of the error, so it’s difficult to make judgments about how useful the pre-PITCHf/x data might be.


#11    Colin Wyers      (see all posts) 2010/09/24 (Fri) @ 14:02

“If you had to pick one pitcher, you pick the guy with the 86-mph reading, even though you’d be wrong some substantial fraction of the time.  But you’d be right more often picking the 86-mph guy than you would be picking the 85-mph guy or 83-mph guy.”

Right, but regression doesn’t change anything, in this instance - these three guys could go out there and throw 300 or 500 or 800 more pitches for these three scouts and the radar readings would stay consistent (assuming each scout is continuing to sit in the same place each time).

So regression may not “hurt,” but it’s not changing the fundamental problem.
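A small simulation can illustrate the distinction, assuming a fixed per-scout offset (systematic) plus Gaussian gun noise (random). The 90 mph true speed, the -3 mph offset, and the 2 mph noise are all made-up values.

```python
import random
from statistics import mean

def simulate_readings(true_mph, bias_mph, noise_sd, n, rng):
    """n radar readings with a constant bias and random per-pitch noise."""
    return [true_mph + bias_mph + rng.gauss(0, noise_sd) for _ in range(n)]

rng = random.Random(0)  # seeded so the run is repeatable
true_mph, bias_mph = 90.0, -3.0

small_sample = mean(simulate_readings(true_mph, bias_mph, 2.0, 30, rng))
large_sample = mean(simulate_readings(true_mph, bias_mph, 2.0, 30000, rng))
print(round(small_sample, 2), round(large_sample, 2))
```

With more pitches the average settles very close to 87, not 90: the random part washes out while the scout's offset stays put.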


#12    Colin Wyers      (see all posts) 2010/09/24 (Fri) @ 14:43

Okay, real quick - I took team Zone% for hitting and pitching from 2008-2010 and divided by the league Zone% for that season.

The correlation between Zone% for hitting and pitching at the team level?

0.88.


#13    Tangotiger      (see all posts) 2010/09/24 (Fri) @ 14:58

Right, Mike/10 is exactly what I’m saying.

The ONLY thing regression is going to do is change the spread of the list, while maintaining order.

So, if you have say a list of
89
88
85
83
81

And correlation is weak, then after regression, we’d get back:
91
91
90
90
90

If correlation of average speed to average fastball speed is high, then we’d get back
94
93
92
91
90
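The "changes the spread, keeps the order" behavior can be sketched as a linear shrink toward a target mean. The 91 mph target and the two r values are assumptions chosen for illustration, so the output will not match the rounded lists above exactly.

```python
def regress(values, r, target_mean):
    """Keep a fraction r of each value's deviation from the observed mean,
    re-centered on target_mean (here, an assumed league fastball average)."""
    observed_mean = sum(values) / len(values)
    return [target_mean + r * (v - observed_mean) for v in values]

average_speeds = [89, 88, 85, 83, 81]

weak = regress(average_speeds, r=0.1, target_mean=91)    # weak correlation
strong = regress(average_speeds, r=0.8, target_mean=91)  # strong correlation

print([round(x, 1) for x in weak])
print([round(x, 1) for x in strong])
```

In both cases the ranking is untouched; a weak r simply bunches everyone near the target mean.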


#14    Tangotiger      (see all posts) 2010/09/24 (Fri) @ 15:03

“The correlation between Zone% for hitting and pitching at the team level?

0.88.”

Yowza!  Great way to do it.

We should have expected 0, which I presume is your point?


#15          (see all posts) 2010/09/24 (Fri) @ 15:09

We probably shouldn’t expect a correlation of 0.  Teams might have a preference for similar hitters and pitchers to some extent.

By way of comparison, I looked at the correlation in BB/PA and SO/PA for hitting and pitching at the team level in 2010.  It was r=.17 for BB/PA and r=.08 for SO/PA.

An r=0.88 seems very high by comparison.


#16          (see all posts) 2010/09/24 (Fri) @ 15:11

Or beyond team preference for players, the hitting background or other offensive characteristics of the park might tend to correlate hitter and pitcher Zone% on the same team.  Not to a large extent, we wouldn’t think, though, and the BB/PA and SO/PA correlations tend to bear that out.


#17    Colin Wyers      (see all posts) 2010/09/24 (Fri) @ 15:16

Right, and there could be atmospheric effects on how the ball breaks - expecting a zero correlation isn’t fair.

Expecting one below .5 is probably more than fair, though.


#18    Tangotiger      (see all posts) 2010/09/24 (Fri) @ 16:38

Mike/16, 17: excellent point about the sight lines, etc, and it plays to what I was talking about in the other thread about the home field advantage.

Really, if you want to do it better, correlate the BB/PA and SO/PA only with home data, this way you are comparing the home players to their opponents in the same (single) park.

Just a great idea to look for park bias this way.


#19    Colin Wyers      (see all posts) 2010/09/24 (Fri) @ 17:36

Okay, so let’s be generous (I think) and assume that in a measure of balls in the strike zone, we should expect without bias to see a correlation of .4. So some simple math:

0.88^2-0.4^2 = 0.6144

Okay, let’s bump it up to .5, and…

0.88^2-0.5^2 = 0.5244

I mean, it doesn’t feel to me like we can get around thinking that more than half of the variance at the team level is explained by park bias. And I don’t think that’s the only source of bias that’s possible.
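The arithmetic above, written out as a short sketch. The function name is hypothetical, and the 0.4 and 0.5 "unbiased" correlations are the guesses used in this comment, not measured values.

```python
def excess_shared_variance(r_observed, r_expected):
    """Share of common variance beyond what the expected (unbiased) correlation explains."""
    return r_observed ** 2 - r_expected ** 2

for r_expected in (0.4, 0.5):
    share = excess_shared_variance(0.88, r_expected)
    print(f"expected r = {r_expected}: excess shared variance = {share:.4f}")
```

Lowering the assumed unbiased correlation toward the BB/PA and SO/PA values quoted earlier in the thread would push the excess share higher still.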

There’s more variance at the player level, so there’s probably more signal there than at the team level. But is there enough signal to make the data useful in spite of the sort of bias we’re seeing?


#20    Tangotiger      (see all posts) 2010/09/24 (Fri) @ 18:26

Is the question whether we are better off not knowing the data than knowing the data in all its biased glory?


#21    Matt Swartz      (see all posts) 2010/09/24 (Fri) @ 19:46

I think we’re worse off if people take it to mean more than it does, and better off if people can see it for what it is and only compare players on the same team, who likely have the same bias.  But someone should check if that’s true, I guess.

Keep in mind the O-Swing% was WEAKLY significant and that pitchers with similar strikeout rates who had higher O-Swing% had SLIGHTLY higher strikeout rates the year after.  I take that to mean that it’s probably an important skill in improving strikeout rates, and that it’s probably measured horribly and that is muting an effect of a real variable.  I’d bet if the pitch f/x guys used the data they had over a period of five or ten years, you’d start to see an effect of guys with higher O-Swing% increasing their K’s more.  All this information should be as much of a call to stop trusting the bad data as it would be to try to get that data.

That’s how I feel about a lot of batted ball stuff incidentally.  You need to put it in context but it’s ridiculously useful to know who the fly ball pitchers are when trying to determine who is home run prone given the lack of persistence of home run rates.  But I’d bet you could do a lot better if you figured out which guys are actually giving up fly balls versus line drives.


#22    Colin Wyers      (see all posts) 2010/09/25 (Sat) @ 00:59

Put it this way.

Say we have such-and-such information - in this case, we have a pitcher’s K rate, BB rate, etc. This data is largely unbiased and factual (by which I mean falsifiable).

Now we want to add in this additional data set, which has bias in it. (At this point I am really comfortable saying that there IS bias in the plate discipline numbers from BIS, and that the remainder of the discussion is about figuring out the magnitude and the causes.) We also have random measurement error as well, presumably - although we can compensate for that to some extent with regression or increased sample size.

(This is still a bit different than the first case in that our random measurement error is not reflective of what happened - a pitcher who has a “fluke” strikeout rate over a short sample still had those strikeouts, where with random measurement error that’s not the case.)

And let’s say for the sake of discussion, at least, that we think it’s probable that that sort of data, if it could be collected in an unbiased way, would be useful. That may not always be true (I’m trying to discuss this in a way so that it’s applicable over a broad range of subjects, because I think that these sort of data problems are far from unique).

I think it’s pretty clear that the worst thing that can be done is to treat the biased data the same way you would treat the hypothetical unbiased data. Doing so will almost always, I think, lead you to make larger errors in judgement than if you didn’t have the data at all.

But once you recognize the potential for bias, a lot of interesting questions come up with some not-so-clear answers. I think you can break the problem down to a set of cases:

* If your bias is substantial (and you can’t control for it by correcting the data), and is strongly related to the effect you’re trying to measure, then the data is simply useless. You have no way to differentiate between a purely bias-driven effect and a causal relationship.

* If your bias is substantial (and you can’t control for it by correcting the data), but the source of the bias is essentially unrelated to the effect you’re trying to study, it’s possible that you can use the data to derive general population effects, but be unable to attribute them to individual players.

* If your bias is minimal, or can be well controlled for, then it is possible that the data can still be useful for analyzing individual players.

In order to determine where on that sliding scale you fall, it is necessary to understand both the magnitude of the biases involved and the causes, right?

The general problem seems to be in dealing with these data sets in a very non-granular fashion, when they come from a collection process that is fairly opaque. In that case, how are we to be confident that we’ve identified the data biases and how they are affecting our results?


#23    Tangotiger      (see all posts) 2010/09/25 (Sat) @ 11:12

Right, we do need to understand the magnitude.  This is exactly what I mean when I say we would need to regress.  If you don’t regress at all, then you are almost surely worse off.  But, what if you regress 99%?  Then, wouldn’t we be better off?

This goes back to my simple example of the observed pitch speed v estimated fastball speed.  Once you know the magnitude in bias and error, you regress, and then you would be better off.

Or are we suggesting the bias is so systematic that we would be worse off?

As an example, let’s say there are 29 parks and Coors.  But we don’t know who plays at Coors.  You are saying that we should regress all the hitting stats by 1/30th, because for 29/30ths of the players, you are worse off, and for 1/30th of the players, you didn’t regress enough.

Is this what you are saying?  In that case, I think you are right.


#24    Colin Wyers      (see all posts) 2010/09/26 (Sun) @ 16:03

Right. Now, regressing to remove random noise may improve the data quality somewhat, but regression is not going to improve biased data - the order is going to stay the same, regardless of how much you regress.

The comparison to parks is apt - in the absence of park adjustments, regression is not a good answer to that problem.

In the absence of those sorts of adjustments, to me what it comes down to is… how much bias can you live with? And it seems to me that the breakeven point is whether or not the magnitude of what’s being measured is greater than the magnitude of the bias being measured. (To be clear: not the magnitude of the performance overall, but the magnitude of the ability to be able to measure that performance - if you have a rather wide range of performance, but low ability to distinguish between performance, then that’s what’s more relevant.)


#25    Colin Wyers      (see all posts) 2010/09/27 (Mon) @ 01:59

I can’t take credit for this idea - Mike Fast suggested it - but a real quick look at bias based upon batter handedness. A look at zone percentage for qualified starters in 2009, according to batter handedness:

http://flic.kr/p/8EkWrg

Is it possible that pitchers throw more strikes to right-handed hitters than left-handed hitters? Sure. But is it possible that there’s a camera angle bias for LHH versus RHH? Sure.


#26    Nick Steiner      (see all posts) 2010/09/27 (Mon) @ 02:43

Colin, if I’m not mistaken the correlation doesn’t mean anything by itself.  It’s a good way of showing bias, but not how much there is.  You asked if it’s better to ignore the BIS data entirely rather than use it in its biased form.  For that we need to know the magnitude of the bias, not how consistent it is.


#27    Colin Wyers      (see all posts) 2010/09/27 (Mon) @ 10:19

Actually, Nick, I’m pretty sure we could use the correlation to determine the magnitude of the bias (with respect to park), if we had a good estimate for the correlation the data would show if it were unbiased (again, with respect to park).

All we’d have to do is subtract the square of the estimated correlation from the square of the observed correlation. That gives the percentage of shared variance not explained by the non-bias factors the two have in common (team tendencies, park effects on the pitched ball, etc.). Then figure out the observed variance and take that percentage of it to get the variance related to bias.

So the sticking point is really the estimate of the “true” correlation; I gave some real off-the-cuff estimates and I wasn’t too encouraged by the results (given what Mike showed us with K and BB rates, I think I was well on the high side for “true” correlation). But I can understand if you’re not willing to take my WAG for it.

But the batter-pitcher correlation method is the only one I can conceive of, given the way we’re presented the dataset, to estimate the magnitude of the bias. And I can think of no method to, from there, actually work out the effects for individual parks (and again, park bias isn’t the only bias I suspect, it’s just the easiest one to measure).

And when it comes down to:

“You asked if it’s better to ignore the BIS data entirely rather than use it in its biased form.  For that we need to know the magnitude of the bias, not how consistent it is.”

I think the next question to ask is, what if we can’t determine the magnitude of the bias? And I think that’s going to be the case for most things. (Because this isn’t just about this one dataset, I don’t think - the plate discipline stuff is pretty easy to study in this way, compared to the BIS pitch type or batted ball data. And even with the plate discipline data we hit a wall pretty quickly.)

And we simply don’t have enough granularity to provide good estimates of the quality of any of this data. There are people who are paying for this data, and they have that level of granularity. Most of them are obviously disinclined to let outside parties know what they know, though.

So we pretty much have the option of accepting the data on faith until someone with full access to the data provides us with the evidence we need, or we reject the data due to lack of evidence until someone provides the proof that we need to be able to use the data with confidence. I can’t tell you what to do, of course. But I think it’s pretty obvious which of those is good science. And I think it’s pretty obvious which is going to encourage the commercial data providers to be more forthright about their data quality.


#28          (see all posts) 2010/09/27 (Mon) @ 11:07

I took a quick look at what PITCHf/x had to say about the LHB-RHB bias in Zone% according to the rulebook zone.  It finds a similar bias to what BIS has.  Pfx has 42% of pitches in the rulebook zone to LHB and 45% to RHB, with one standard deviation in the neighborhood of 2.5-3.0%.

That’s not an in-depth, careful look yet, but since I raised the question of why that bias was present in the BIS data, I thought I should mention that it’s also present in the PITCHf/x data as soon as I observed that.

My suspicion is that it comes down to the difference between the rulebook zone and the actual umpire-called zone.  Umpires call pitches past the rulebook outside edge as strikes more to LHB than they do to RHB.  Since pitchers throw more to the outside edge than to the inside edge, this could make up some or all of the 3% difference between batter handedness.  That is pure conjecture at this point, though it can be checked.

Also, I should note that I did not adjust the zone for batter height in this first pass, though for the whole population it shouldn’t make much difference.  I used 1.7 and 3.5 feet as the lower and upper limits of the zone.
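A first-pass zone check along these lines might look like the sketch below. The 1.7 and 3.5 foot limits come from the comment above. The horizontal limit is an assumption (half of the 17-inch plate, ignoring the ball's radius), as are the `px`/`pz` argument names and the sample pitches.

```python
PLATE_HALF_WIDTH_FT = 17 / 2 / 12  # home plate is 17 inches wide

def in_rulebook_zone(px, pz, bottom=1.7, top=3.5, half_width=PLATE_HALF_WIDTH_FT):
    """True if a pitch crossing the plate at (px, pz), in feet, is inside the fixed zone.
    px is horizontal distance from the center of the plate; pz is height off the ground."""
    return abs(px) <= half_width and bottom <= pz <= top

def zone_pct(pitches):
    """Fraction of (px, pz) pitch locations inside the zone."""
    return sum(in_rulebook_zone(px, pz) for px, pz in pitches) / len(pitches)

# Hypothetical pitch locations: two in the zone, one wide, one high.
sample = [(0.0, 2.5), (0.5, 1.8), (1.2, 2.5), (0.0, 4.0)]
print(zone_pct(sample))
```

Splitting the input by batter handedness and calling `zone_pct` on each half gives the LHB versus RHB comparison; changing `half_width` is one way to try a different zone definition.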


#29    Nick Steiner      (see all posts) 2010/09/27 (Mon) @ 16:11

Why not correlate Zone% based on BIS with Zone% based on Pitch fx?  Or better yet calculate the difference between the two for each player and correlate that by home ballpark?


#30          (see all posts) 2010/09/27 (Mon) @ 16:19

Nick/29, that’s exactly what I’ve been doing.  I’m not sure I’m any closer to understanding what’s going on.

For 2010, the BIS Zone% and PITCHf/x Zone% by home ballpark (which includes home and road for players since we don’t have home-road splits for the BIS data) correlate with an r=0.59.  I wonder if what we’re measuring at the team level is really the percentage of lefties?  I’m not sure, but whatever it is, it’s showing up in both data sets.

At the individual level, PITCHf/x has a little better correlation from 2009 to 2010 than does BIS from 2009 to 2010, r=0.81 compared to r=0.66.  I don’t know if we can make anything out of that, though, unless we figure out what’s going on at the team level.

I may experiment with different strike zone definitions on the PITCHf/x and see if that cleans up the PITCHf/x side any.


#31          (see all posts) 2010/09/27 (Mon) @ 16:39

“I wonder if what we’re measuring at the team level is really the percentage of lefties?”

So, I have found that it’s not this, at least not in 2010.


#32          (see all posts) 2010/09/27 (Mon) @ 16:44

“For 2010, the BIS Zone% and PITCHf/x Zone% by home ballpark (which includes home and road for players since we don’t have home-road splits for the BIS data) correlate with an r=0.59.  I wonder if what we’re measuring at the team level is really the percentage of lefties?  I’m not sure, but whatever it is, it’s showing up in both data sets.”

I should have phrased this part of #30 more clearly.  Obviously, some of what’s showing up in both data sets is the actual thing being measured, the in-zone percentage of pitches.  And if we take PITCHf/x zone% as a perfect proxy for that (which it’s not, of course), the problem is that we’re still left with over half the variance at the team level in the BIS zone data coming from some other source, as yet unidentified.

