THE BOOK--Playing The Percentages In Baseball

Tuesday, August 14, 2007

A fascinating study, worthy of some discussion I think…

By , 01:33 AM

Here is the Study

There is a discussion of said article, where I made some comments, on BTF.


#1    David Gassko      (see all posts) 2007/08/14 (Tue) @ 03:10

Mickey,

How about looking at the race of catchers as well? Given how much influence they have on the call, you would expect some effect.


#2    Guy      (see all posts) 2007/08/14 (Tue) @ 09:00

MGL:  I haven’t had time to read through the article yet, but the data in Table 2 seems to suggest that the racial bias—if it exists—is limited to Asian and Hisp pitchers.  White and Black pitchers get the same called strike rate by umpires of all races.  Asian pitchers get lower rate from both Black and Hisp umps (or, less plausibly, a positive bias from white umps).  And Hispanic pitchers may get a slight positive bias from Hisp umps, slight negative bias from Hisp umps. 

As a result, the authors’ statement, quoted in Time, that the highest K rate is white-on-white, and lowest is Black P and White Ump, is quite misleading (although technically accurate).  White pitchers also have a high called strike rate with Bl and Hisp umps, and Black pitchers have a low rate across the board.

I think the only way to really get a good read on this is the Gameday data, which would allow us to see how umps call pitches in comparison to actual pitch location.  That would allow us to see, for example, if Black pitchers really should be getting so many fewer called strikes than white pitchers.


#3    Guy      (see all posts) 2007/08/14 (Tue) @ 12:34

I added this comment over at BTF:

The more I look at this study, the less impressed I am by the magnitude of the findings.  If you look at table 2 in the study, you’ll see how tiny the differences in called strike % really are.  For example, a black pitcher gets the call 30.62% of the time, with these racial “disparities”:  W Ump:  30.61, H ump: 30.77, B Ump: 30.76.  So, if a black pitcher was judged by a same-race (black) ump 91% of the time (as white pitchers now are), he would gain one extra called strike for every 784 called pitches.  A starter has about 50 called pitches per start, so a black starter might get 2 more strike calls in a season.  Let’s be generous and say that results in 1 fewer walk (or 1 more K) each year (it’s probably less than that), in which case he would give up 1 fewer run every 3 years or so.  The same analysis for Hispanic pitchers yields an advantage about twice as great, maybe 5 extra strike calls for a full-time starter.  (And of course, we don’t know if that difference reflects positive bias by the 3 (!) Hispanic umps for Hispanic pitchers, or discrimination by white and Black umps).
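For anyone who wants to check the arithmetic in that paragraph, here is a minimal sketch. The rates are the Table 2 figures quoted above; the assumption that the other 9% of pitches would be called at the white-ump rate, and the 33 starts per season, are my own guesses for illustration and are not from the study.

```python
# Called-strike rates for black pitchers by umpire race, as quoted from Table 2.
RATE_BY_UMP = {"white": 0.3061, "hispanic": 0.3077, "black": 0.3076}
CURRENT_RATE = 0.3062      # observed overall called-strike rate for black pitchers
SAME_RACE_SHARE = 0.91     # share of called pitches white pitchers get from white umps

# Assumption: the remaining 9% of pitches are called at the white-ump rate.
hypothetical_rate = (SAME_RACE_SHARE * RATE_BY_UMP["black"]
                     + (1 - SAME_RACE_SHARE) * RATE_BY_UMP["white"])
gain = hypothetical_rate - CURRENT_RATE
pitches_per_extra_strike = 1 / gain

CALLED_PER_START = 50      # from the comment
STARTS_PER_SEASON = 33     # assumption: a full-time starter
extra_calls_per_season = CALLED_PER_START * STARTS_PER_SEASON * gain

print(f"one extra strike per {pitches_per_extra_strike:.0f} called pitches")
print(f"about {extra_calls_per_season:.1f} extra strike calls per season")
```

Under those assumptions the result lands in the same neighborhood as the 1-in-784 figure and the 2 calls per season mentioned above; a different split of the remaining 9% moves it slightly.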

The authors’ own estimate of “same-race” advantage, after a lot of complex regression, is just 0.34%, or about 5 calls per season.  So a white pitcher gains perhaps 1 win every 15 years.  And that’s if you buy their regression results.  Using the unadjusted data, a white pitcher who faced only black and Hisp umpires would see a reduction in called strikes of 0.15%, which means he would lose about 2 calls per season or maybe 1 win every 30 years.

Also, although the authors control for a boatload of factors in their regressions, it doesn’t appear that they control for park effects.  That probably doesn’t matter in the aggregate, but it could have a big impact on the Questec/non-questec part of their analysis.  Once you start looking at same-race matchups for minority pitchers (for which the sample is small to begin with), separately by Questec/non-questec, you almost certainly could see a real park effect.


#4          (see all posts) 2007/08/14 (Tue) @ 14:49

I comment at my blog post here

My main point is that I have no idea how they get statistical significance given the small sample sizes in the black/black and hispanic/hispanic cells.  If anyone can clear that up, much appreciated.


#5    MGL      (see all posts) 2007/08/14 (Tue) @ 17:34

Hmmm, I thought that the overall difference between same and different races was double the .34%, or .68% which is quite significant given the totality of the data (2 mil pitches of which around half are “called pitches”).

Honestly I don’t understand all the regression stuff.  If it were me (and I am going to do it) I would simply look at all the called strike rates for the various combinations of pitchers/umpires and adjust for any differences in park, home/road, pitchers, and batters faced.  That is essentially what they are doing with the regressions, but regressions are always “black boxes” (at least to non-statisticians), by definition.  That is actually one reason why I like probit models as they are more intuitive.  If I want to test if there is any bias in umpiring, I would simply look at the results for the various groups (again, adjusted for any differences in pools of batters/pitchers/parks/home and away) and then test the significance of any differences using the standard error of the binomial.  It is easy to do that with ball and strike calls.
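As a rough illustration of the "standard error of the binomial" test described above, something like the following would do. The function names and the counts in the example are hypothetical, made up only to show the mechanics; this is a plain two-proportion z-test, not necessarily what anyone in this thread actually ran.

```python
import math

def binomial_se(p, n):
    """Standard error of an observed proportion p over n called pitches."""
    return math.sqrt(p * (1 - p) / n)

def two_group_z(strikes_a, n_a, strikes_b, n_b):
    """z-score for the difference in called-strike rate between two groups,
    using the pooled rate to estimate the standard error."""
    p_a = strikes_a / n_a
    p_b = strikes_b / n_b
    pooled = (strikes_a + strikes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical counts, for illustration only (not from the study or Retrosheet).
z = two_group_z(strikes_a=3210, n_a=10000, strikes_b=3150, n_b=10000)
print(f"z = {z:.2f}")
```

A |z| of about 2 or more is the usual rough threshold for calling a difference significant; with small cells like B/B the standard error is so large that only big gaps would clear it.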

I would be surprised if the catcher had any effect.  Why would they, considering that the batter apparently has no effect?  I agree with the authors’ conjecture that in calling a pitch, the umpire focuses much more on the pitcher and not the batter.  When I watch a game, I do the same thing, at least in terms of whether I think a pitch is a ball or strike.  I certainly don’t pay any attention to the catcher when thinking about whether a pitch is a ball or strike and I doubt the umpires do either.


#6    David Gassko      (see all posts) 2007/08/14 (Tue) @ 18:42

Catchers talk to the umpires a lot more than pitchers do, I think. If umpires sub-consciously prefer people of the same race, how much they like the catcher (and therefore how willing they are to extend the strike zone a bit) would be influenced by their race.


#7    MGL      (see all posts) 2007/08/14 (Tue) @ 20:25

I see your point (#6), but I don’t think there is enough connection between the pitch and the catcher.  I think the focus of the umpire when he makes a call is the pitch and the pitcher.  The fact that the batter appears to have no influence bears this out.  If batters had an influence then I would be more optimistic that catchers would too.


#8    MGL      (see all posts) 2007/08/16 (Thu) @ 03:58

Well, I’ve been working on my own study for a few days now, and I have some preliminary results.  I’m afraid they are not in accordance with those of the cited authors and study above.

I also looked at all called pitches, called strikes and balls only, from 04 to 06, using the Retrosheet game logs.

I classified all umps who umpired (home plate) at least one game in 04-06 as white, excepting the following:

Black
Bucknor, Danley, Diaz, Meriwether

Hispanic
Hernandez, Marquez

Asian
None

I also went through every pitcher who pitched to at least one batter in 04-06 and classified them as white, black, hispanic, asian, or unknown.  I did not have a database of race or ethnicity as I thought I did, only where each MLB player was born.  As the authors above mentioned, lots of the hispanic players look black but are hispanic (born in Latin American countries).  I have no idea whether umpires would consider them black or hispanic or both, for purposes of this study, but I only classified a black pitcher if he was NOT hispanic.

I did most from memory and recollection, so I probably goofed on a few (maybe 1% out of around 1000). I don’t think that should matter much.

I found only 12 black pitchers, 205 hispanic ones, 23 asian, and the rest, 654, were white.  I was really surprised how few black pitchers there are.  I think only 2 or 3 are starters.

Anyway, for each group of umpires, I found their baseline strike percentage against all pitchers they “faced.”

Whites: 31.79% (around 600,000 pitches)
Blacks: 31.40 (around 36,000 pitches)
Hispanics: 32.40 (around 18,000 pitches)

Overall, 31.78% of all called pitches are strikes.

Now, since white umpires umpire the vast majority of games, the pitchers they face are pretty much all pitchers.  That is not true of the black and hispanic umps though. They might have faced pitchers who were particularly high or low strike % pitchers and that may be why black umpires have a lower strike % and hispanic ones have a higher strike % (than all umpires combined or as compared to white umpires).  Let’s see:

The pitchers faced by the black umpires have an overall strike percentage for the 3 years (when facing ALL umpires) of 31.525%, so that may be why they have such a low strike % themselves.  Assuming that the pitchers they face are low strike % pitchers overall, to the tune of .9920 of the average rate for all pitchers, then the “adjusted” (for pitchers faced) strike % for black umpires is now 31.65, still a little lower than all umpires.

The hispanic umpires, all 2 of them, faced pitchers who had an average strike % overall of 31.75, just a tad less than the average pitcher.  If we “pitcher adjust” their K %, we get 32.43 rather than 32.40 (the unadjusted rate).  So here is the above data again, for each group of umpires, adjusted for the pool of pitchers they faced in 04-06:

Whites: 31.79% (around 600,000 pitches)
Blacks: 31.65 (around 36,000 pitches)
Hispanics: 32.43 (around 18,000 pitches)
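The pitcher adjustment above can be sketched in a few lines. The numbers are the ones quoted in this comment; the function name is just a label I picked for illustration.

```python
LEAGUE_RATE = 0.3178  # overall called-strike rate quoted above

def pitcher_adjusted_rate(ump_rate, pool_rate, league_rate=LEAGUE_RATE):
    """Divide the umpire group's raw rate by the relative strike rate of the
    pitchers it happened to face (pool rate / league rate)."""
    return ump_rate / (pool_rate / league_rate)

black_umps = pitcher_adjusted_rate(ump_rate=0.3140, pool_rate=0.31525)
hispanic_umps = pitcher_adjusted_rate(ump_rate=0.3240, pool_rate=0.3175)

print(f"Black umps:    {black_umps:.2%}")
print(f"Hispanic umps: {hispanic_umps:.2%}")
```

This reproduces the 31.65 and 32.43 figures in the adjusted list, so the adjustment is a simple ratio, much like a park factor.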

Now here is the deal.  If we look at all games in which the umpire and the pitcher are the same race/ethnicity (w/w, b/b, h/h), we expect the strike percentage (if there is no racial bias) to be 31.77 and it is 31.74, a difference which is not even close to being statistically significant.

If we look at all matches where the pitcher and umpire are of different races/ethnicity, we expect a strike % of 31.30 and we get 31.25, also essentially the same rate.

In a future post, probably tomorrow night, I will give the results for all the combinations.  The B/B and H/B (umpire/pitcher) are obviously rare enough so that any small differences between expected and actual K % will probably not have any significance.  Even the B/A and H/A are rare enough that you would need large differences between expected and actual to have any statistical significance.

I am also going to have to double check my data and my methodology as I have been doing this work piecemeal and I have not been paying too much attention to being meticulous.

Anyway, if my results hold up, I would have to say that there is something seriously wrong with the original study above.  Unfortunately, while my numbers and results are fairly easily duplicated, we cannot say the same for theirs, which, as I said, is one reason why I do not like complex regressions, when “pure” numbers, like I have presented, will more than suffice.


#9    tangotiger      (see all posts) 2007/08/16 (Thu) @ 10:39

MGL, when you say:

If we look at all matches where the pitcher and umpire are of different races/ethnicity, we expect a strike % of 31.30 and we get 31.25, also essentially the same rate.

Is that based on the identity of the pitchers/umpires?  That is, does it just so happen that the pitchers/umpires in mixed-race matchups have, OVERALL, lower strike percentages?

It will be interesting to see the final chart by the 9 combinations (if you can provide both the “n” and the “rate”, that’d be great).


#10    Guy      (see all posts) 2007/08/16 (Thu) @ 10:46

Great work, MGL.  I expect your numbers will hold up, for this reason:  The overall spread the authors found between same-race and mixed-race matchups (no regression) was .006.  But if you simply adjust for the racial makeup of the pitchers in those two samples—same-race is about 98% white, while mixed-race is predominantly Hispanic, and white pitchers have higher strike%—you get an expected spread of about .006.  (White umps also dominate the same-race sample, and they call slightly more strikes in their sample.) So the “bias” is 100% created by their regression, which as you say is a bit of a black box. 

My own guess is that the fixed effects methodology does not successfully control for race of pitcher, for reasons I don’t understand.  As a result, their “same race” variable becomes a partial proxy for pitcher=white, and since white pitchers have higher strike % the coefficient is positive. 

* *

To go back to the batters issue, if umps were biased I think it’s just as likely that batter race would matter, based on physical proximity and the fact that hitters more obviously complain about ball/strike calls.  And where I’d most expect to see it is where ump is same race as hitter but not pitcher, or vice versa, so the ump has to “choose sides” in racial terms.  (The greatest racial bias against black defendants by white jurors is when victim is white.) Yet the authors tested for that and found no impact.  To me, that is strong evidence that the pitcher-only bias they find is a statistical artifact, not real.

I also find it very troubling the way they exaggerate the impact of their finding, describing it as about one pitch per game (and calling the .0034 impact a “1% increase” based on average strike% of .30).  Even if they are right, a .0034 difference and about 70 called pitches per game means that a black or Hispanic pitcher is losing about 1/5 of a call every complete game compared to a white pitcher.  (You don’t double-count the impact, because the .0034 is already compared to a pitcher facing opposite race ump). And on top of that, they report themselves that there is no bias on any 2-strike or 3-ball pitch, which has to significantly reduce the real damage/benefit of any bias. 

Finally, an impact of 1 pitch per 5 games clearly cannot give hitters a meaningful advantage/disadvantage that would impact HR rate and other offensive production.  The fact that the authors nevertheless find such differences, and a difference in overall win%, is therefore not further evidence of bias, but quite the contrary:  it’s evidence that their models are failing to control for all essential factors, most likely pitcher race.
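The per-game arithmetic in this comment is easy to verify. A minimal sketch, using only the figures quoted above (.0034 same-race effect, about 70 called pitches per game, average strike rate of .30):

```python
SAME_RACE_EFFECT = 0.0034   # authors' estimated change in called-strike rate
CALLED_PER_GAME = 70        # approximate called pitches in a complete game
AVG_STRIKE_RATE = 0.30      # average called-strike rate used by the authors

calls_per_game = SAME_RACE_EFFECT * CALLED_PER_GAME
games_per_call = 1 / calls_per_game
relative_increase = SAME_RACE_EFFECT / AVG_STRIKE_RATE

print(f"{calls_per_game:.2f} calls per game, or one call every {games_per_call:.1f} games")
print(f"relative increase in strike rate: {relative_increase:.1%}")
```

So the "1% increase" is a relative change in the strike rate, while the absolute effect is a fraction of one call per complete game.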


#11          (see all posts) 2007/08/16 (Thu) @ 11:56

Guy, very nice summary of the issues.  The more I look at this, the more I think that, yes, maybe it’s just an error in the study, failing to control for pitcher race somehow. 

And since you bring up HR rate, etc. ... on the last page of the study, in Figure 4, you see that minority pitchers do better in every category when facing a matching ump, and white pitchers do better in every category except strikeouts (which are neutral) when facing a white ump.

The differences are very large.  Minority pitchers give up 14% fewer home runs when facing an own-race umpire!  The sample size here is 17,000 (actual, not called) pitches.  Say 17K pitches is 120 nine-inning games.  At 1 HR per game, that’s 120 home runs.  Minority pitchers with matching umps gave up 17 less.  OK, I guess that could be coincidence.  The authors do say that the results aren’t statistically significant. 

But what about the white pitchers?  The non-white-ump sample size is about 130,000 pitches.  Say, 950 games.  An ERA .2 lower in 950 games?  That seems significant, although I’m too lazy to test.


#12    MGL      (see all posts) 2007/08/16 (Thu) @ 12:17

Yes, I was very disturbed by the large effect on ERA, win percentage, HR rate, etc., which did not seem to jibe with the small differences they found in K %.

And as Guy says, those K % differences would not make much of a difference in runs, etc., if they were only found in non-terminal counts.

I was also disturbed by the fact that if you look at their charts you see reverse bias for everything in Questec parks, high attendance, and on terminal counts.  IOW, it appears that umpires go out of their way to favor pitchers of a different race/ethnicity when they were being scrutinized.  I find that a little hard to believe.

I also was a little skeptical of the idea that any scrutiny, especially the attendance level, would significantly change their behavior.  After all, the differences they found were so small that I doubt that Questec, the crowd, etc. would be able to “notice” it.

Anyway, there are a couple more things I have to do besides double check the data/methodology.  One is to control for home/road, which I did not do yet, just in case some of the umpire/pitcher combinations had a disproportionate percentage of home or road pitchers.  There is a pretty big difference in the K % for home and road pitchers, 31.98% versus 31.59%.

And yes, the reason for the low K% for the mixed race situations is that the pitchers happened to be low K% pitchers.  My “expected” K % is always the umpire’s overall K % divided by (odds ratio is probably better, but it probably does not make much difference) the pool of pitchers faced normalized K%.  Basically the “pitcher adjusted” umpire K%, like a park factor.
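To see how little the choice between a simple ratio and an odds ratio matters here, a quick sketch using the black-umpire figures from the earlier comment (31.40% raw, 31.525% pitcher pool, 31.78% league). The helper names are mine, for illustration only.

```python
def odds(p):
    return p / (1 - p)

def from_odds(o):
    return o / (1 + o)

def adjust_by_ratio(ump_rate, pool_rate, league_rate):
    """Divide by the pool's strike rate relative to the league."""
    return ump_rate / (pool_rate / league_rate)

def adjust_by_odds_ratio(ump_rate, pool_rate, league_rate):
    """Same idea, but the division is done on the odds scale."""
    return from_odds(odds(ump_rate) / (odds(pool_rate) / odds(league_rate)))

ratio_version = adjust_by_ratio(0.3140, 0.31525, 0.3178)
odds_version = adjust_by_odds_ratio(0.3140, 0.31525, 0.3178)

print(f"ratio:      {ratio_version:.4%}")
print(f"odds ratio: {odds_version:.4%}")
```

With rates this close to each other, the two methods agree to well within a hundredth of a percentage point, which is consistent with the remark that it probably does not make much difference.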

Maybe I can send the raw data to someone else, if they want to volunteer, and they can double check the work.


#13    Guy      (see all posts) 2007/08/16 (Thu) @ 12:57

Phil/11:  I agree those look like real differences.  And perhaps there’s something there, although it’s hard for me to see how the home plate ump can raise or lower HR rates.  However, it does seem possible that we’re seeing League, home/away, and/or park effects here. It could be that minority umps have not been perfectly randomly assigned over this period (it’s only 8 guys).  And certainly, the same-race matchups for minority pitchers could easily be skewed.

Interestingly, to the extent the unadjusted data reveals any bias, it looks mainly like bias by minority umpires (for same-race pitchers, and against other minorities), whereas the authors focus almost entirely on how the bias hurts minority pitchers. 

MGL/12:  I agree that the Questec results are puzzling.  If you look at fig. 1 and look at the robust samples (white umps only), it basically shows slightly fewer strikes in Questec parks, as we’d expect.  What varies a lot are minority umps when you break them down by pitcher race and questec/nonquestec.  At that point, you could be seeing park/league/home-away effects.  Questec is of course not random, but installed in specific stadiums with specific home teams.


#14    tangotiger      (see all posts) 2007/08/16 (Thu) @ 14:03

Umps may have a park bias (maybe some parks are better for them), plus their distribution of parks is not uniform.

From 2000-2006, Bucknor has 13 games in Toronto, 4 in Oakland/Milwaukee.  Danley has 12 in the (real) LA, 4 in Atlanta.  Laz Diaz has 13 in Denver, and he’s 25% of the black umpires!

So, I think you definitely would need to control for the park.


#15          (see all posts) 2007/08/16 (Thu) @ 14:05

Tango/14, good catch!  I looked up ump assignments and found they seemed to be evenly distributed by team, but I didn’t bother checking home/road.


#16    Guy      (see all posts) 2007/08/16 (Thu) @ 14:43

And remember that a big part of this analysis is being driven by same-race results of minority pitchers.  It’s certainly possible, for example, that Hispanic-Hispanic matchups (which are 80% of same-race matchups for minority pitchers) have tended to occur in pitchers’ parks, NL parks, and/or home parks, thereby suppressing HR and BB rates.


#17    MGL      (see all posts) 2007/08/16 (Thu) @ 20:37

#14, I don’t know what you mean about home/road.  Tango didn’t say anything about that.

To control for park would be a mess.  There are very few K% true park effects.  Whatever differences you find among parks would have to be heavily regressed otherwise you are introducing a lot of noise.

I don’t think that you can find fruitful, significant results with so few minority umpires and so few black pitchers.

I also need to control for the batters, which I forgot to do.


#18          (see all posts) 2007/08/16 (Thu) @ 23:08

Oh, right, it shouldn’t matter much what park the ump is in.  Never mind!


#19    MGL      (see all posts) 2007/08/16 (Thu) @ 23:31

I am out of town for a day or so.  I’ll have more data probably on Friday night.


#20    Guy      (see all posts) 2007/08/17 (Fri) @ 07:10

MGL:  I agree park doesn’t matter for your analysis.  I was talking about the authors’ later analysis of variations in pitchers’ HR rate, ERA, etc. based on race of umpire.  For that analysis, park, league, and home/away could have a distorting impact on results for minority pitchers and/or minority umps.


#21    Guy      (see all posts) 2007/08/17 (Fri) @ 12:11

There’s an interesting discussion stimulated by this paper at The Sports Economist, http://thesportseconomist.com/2007/08/calling-strikes-discrimination-in.htm, focused on whether researchers should make public work that hasn’t yet been fully vetted.  I made this comment there:

“I think the topic and conclusions of the research play a role in determining the author’s obligation for care and prior review. In this case, the bar should be set very high: the claim that “umpires are racist” is newsworthy and gets lots of attention, while the contrary finding would be ignored. Unlike a paper on, say, the economic impact of subsidized stadiums, there are no experts available to reporters to refute the finding (unless MLB tries to, which they’ve wisely refrained from). And even if a definitive refutation of this paper comes along later, it will be little noticed (outside academia).

As a result, fans now “know” that umps are racially biased, and it will be very hard to change that perception. Researchers have a duty to be extremely careful with a topic like this one, because you can’t unscramble this egg.

That’s quite unfortunate, because the authors’ conclusions are significantly overstated, if not wrong. They (and the media), describe the impact as “about 1 pitch per game.” In fact, it is about one pitch per 5 games: having a same-race ump increases the called strike rate by .0034, and there are about 70 called pitches per game.

Look at table 2 in the paper if you want to see the magnitude of the bias we’re talking about. White pitchers get a strike call .3206 from white umps, and .3192 from minority umps (a difference of 1 pitch in 700). If you re-run the numbers assuming that Hisp pitchers faced same-race umps 91% of the time (as white pitchers now do), their called strike rate would increase .0029. That means 4.6 pitches over a 200 inning season, or about one-third of a run. For an average pitcher, his ERA would drop from 4.20 to 4.19! Moreover, the change for black pitchers facing 91% black umps is only half this great. Or turn it around and assume white pitchers faced opposite-race umps 95% of the time, as minority pitchers do today: again, the impact is less than half of the Hispanic-Hispanic example.

There are good reasons to doubt whether this data in fact reveals any discrimination at all. But these issues have been addressed elsewhere. My point here is that, even if you completely accept the authors’ finding, their interpretation has been exaggerated and somewhat irresponsible, given the extreme sensitivity of their topic. Ninety three professional umpires have potentially been libeled, and it’s not clear who will clear their name, or how.”


#22    tangotiger      (see all posts) 2007/08/17 (Fri) @ 12:21

I don’t see a comment link on that link.  However, I do see it from the homepage:
http://www.haloscan.com/comments/skipsauer/6714267678478680720/


#23    MGL      (see all posts) 2007/08/17 (Fri) @ 18:50

Guy, #20, gotcha, sure.

The thing is, if you use the HR, ERA, runs allowed, and win percentage data, I think you get a much higher effect than what you would expect from their K% rate differences, right?  Is there anywhere where they give the confidence intervals, p-values, significance levels, etc. for those things?  I mean, if those values are significant (and they did in fact properly control for parks, etc.), then perhaps they could justify the claim that umpires are significantly biased and that it really hurts a pitcher with a different race umpire behind the plate and that therefore the minority pitchers are really at a disadvantage, given that most umpires are white.

Honestly, I don’t know what it means to include “fixed effects” in a regression, so I have no idea whether they are properly controlling for the things we have been talking about, some or all of which they mention.

To be fair to the authors, I think the study is well-written and fairly well thought out.  I am not even sure whether the authors themselves use the word “racist” which I don’t like even if there is some bias in their calls.  “Racism” usually implies a conscious, deliberate act of racial prejudice or discrimination.  This kind of bias is something that can (and probably does in many cases) easily come from the most non-racist persons in the world.

As far as what the authors named the article, I have no idea what that means.  Something in the article talks about it though.

There really are only two things that stood out to me in the FAQ.  One was that they said something like the “biased findings they found have almost no probability of occurring by chance.” IOW, they really overstated (I think) the statistical significance of their results.

And two, it has nothing to do with the study, but they misused the word “unconscious.” They meant “subconscious” but they kept writing “unconscious” which means while sleeping, in a coma, etc.  It may seem like some umpires operate while sleeping or in a coma, but I don’t think that is what they meant.

Anyway, maybe someone can contact someone at BTF and get another thread going about refuting or commenting on this study.  There was an original thread on the subject but it is long dead and buried.

If someone can do that, please wait until I provide more data here and also double check the accuracy of that data.


#24    MGL      (see all posts) 2007/08/18 (Sat) @ 20:57

O.K., I scrutinized my programs, adjusted for batter K%, and found some bugs. I did not do any park adjusting. Here is the data:

League-wide, there were around 1,124,941 called pitches.

31.90% of these were called strikes.

For home pitchers, it was 32.090%.  They throw more strikes as compared to balls.

For visiting pitchers, it was 31.715%

The home pitchers have more called pitches, because they pitch the extra half inning more often.  570,010 called pitches were thrown by the home pitchers and 555,009 by the road pitchers (50.7% of all called pitches are by home pitchers).

Here is the basic overall data for the 2 Hispanic umpires, Hernandez and Marquez:

N pitches = 30,934
.5093 are from home pitchers (.5067 is average)

32.00 K%

Their average pitcher faced = 31.9
Their average batter faced = 32.0

So, overall they faced a 31.95 group of batters and pitchers, so a slight adjustment is done to their K% of 32.00, giving them a pitcher and batter adjusted K rate of 31.95%. 

Here is the basic overall data for the 4 black umpires, Meriwether, Diaz (actually Jamaican I think), Bucknor, and Danley.

N pitches = 61,929
.5055 are from home pitchers (.5067 is average)

31.60 K%

Their average pitcher faced = 31.8
Their average batter faced = 31.9

So they faced a pool of batters and pitchers who threw fewer strikes per called pitch.  Assuming that these umpires did not substantially “cause” that, we would have to adjust their 31.6 K% to 31.65%, still lower than the average ump.

For the white umpires, 93% (87 of 93) of all umpires, we expect their numbers to basically be equal to the overall numbers for umpires (and batters and pitchers of course).

White umpires K% = 31.9
N= 1,032,156
Home % = .5067

The pitchers and batters they faced are both 31.90 overall, as we expected, so no adjusting needs to be done.

Now to the various combinations.  The “expected” columns are the expected K% of the umpires in that combination, based on those umpires overall adjusted K rates and the batters and pitchers they faced as well as a further adjustment for % of home pitches.  “Actual” is the K% they actually called.

To summarize the batter and pitcher adjusted K rates for the three groups of umpires:

White: 31.9 (same as all umpires on the average)
Black: 31.65
Hispanic: 31.95

Here are the various combinations of umpires/pitchers, at all parks. Umpires are first and pitchers are second.

Black/Black
N pitches: 1,244
Expected: 31.70
Actual: 31.10
1 SD = 2%

Black/Hispanic
N: 13,753
Expected: 31.30
Actual: 30.80
1 SD = .4%

Black/Asian
N: 1,709
Expected: 31.2
Actual: 30.00
1 SD = 1%

Black/White
N: 45,102
Expected: 31.69
Actual: 31.9
1 SD = .2%

Hispanic/White
N: 22,851
Expected: 32.10
Actual: 32.00
1 SD = .3%

Hispanic/Asian
N: 698
Expected: 31.99
Actual: 30.50
1 SD = 1.8%

Hispanic/Black
N: 495
Expected: 32.45
Actual: 32.90
1 SD = 2.1%

Hispanic/Hispanic
N: 6,808
Expected: 31.74
Actual: 31.90
1 SD = .6%

White/White
N: 742,134
Expected: 32.00
Actual: 32.10
1 SD = .05%

White/Non-white
N: 289,985
Expected: 31.65
Actual: 31.50
1 SD = .09%

White/Black
N: 18,056
Expected: 31.85
Actual: 31.80
1 SD = .3%

White/Hispanic
N: 241,271
Expected: 31.65
Actual: 31.40
1 SD = .09%

White/Asian
N: 29,513
Expected: 32.05
Actual: 32.00
1 SD = .27%

Same race/ethnicity
N: 750,186
Expected: 32.00
Actual: 32.10
1 SD = .05%

Different race/ethnicity
N: 374,796
Expected: 31.70
Actual: 31.50
1 SD = .07%

The only combos that have enough sample size for any small differences to be meaningful are the white/white, white/Hispanic, white/non-white, same, and different.
For white/white, the difference between actual and expected suggests a positive bias and is significant at the 2 sigma level.  For white/non-white, the difference also suggests a bias, but is only significant at the 1.5 sigma level.  White/Hispanic also suggests bias and is significant at the 2.5 sigma level.  Same race suggests positive bias and is significant at the 2 sigma level.  Different race numbers suggest a bias at the 3 sigma level.
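The sigma levels can be recomputed from the N, expected, and actual figures listed above. This is a minimal sketch that assumes the "1 SD" values are plain binomial standard deviations at the expected rate; the thread does not spell that out, so treat it as my reading.

```python
import math

# (N called pitches, expected K%, actual K%) copied from the combinations above.
ROWS = {
    "White/White": (742134, 32.00, 32.10),
    "White/Non-white": (289985, 31.65, 31.50),
    "White/Hispanic": (241271, 31.65, 31.40),
    "Same race/ethnicity": (750186, 32.00, 32.10),
    "Different race/ethnicity": (374796, 31.70, 31.50),
}

def sd_pct(expected_pct, n):
    """One binomial standard deviation of the strike rate, in percentage points."""
    p = expected_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n)

def sigma(n, expected_pct, actual_pct):
    """Number of standard deviations between actual and expected."""
    return (actual_pct - expected_pct) / sd_pct(expected_pct, n)

for name, (n, expected, actual) in ROWS.items():
    print(f"{name:26s} SD = {sd_pct(expected, n):.3f}%  sigma = {sigma(n, expected, actual):+.2f}")
```

Because this uses unrounded SDs, the sigma values come out a little different from the rounded levels quoted in the text, but the ordering and rough magnitudes are the same.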

The largest differences are .2 to .25%.  For a typical starter who throws around 2500 pitches per year, around 1300 will be called pitches.  .2 to .25% is 3 extra strikes that would be balls.  Again, assuming about .1 runs for every extra strike (that would be a ball), we have around .25 extra runs per year as the difference between a white and non-white starting pitcher.  That is a .01 difference in ERA, the same thing that Guy found in the numbers in the study.
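Here is that runs and ERA estimate as a short sketch. The 1300 called pitches and .1 runs per flipped call come from the paragraph above; the 200 innings used to convert runs into ERA is my assumption for a typical full-time starter.

```python
CALLED_PITCHES = 1300      # called pitches per season for a typical starter
RUNS_PER_CALL = 0.1        # run value of turning a ball into a strike
INNINGS = 200              # assumption: innings pitched in a season

def era_impact(bias):
    """Return (extra strike calls, runs, ERA change) for a given rate difference."""
    calls = CALLED_PITCHES * bias
    runs = calls * RUNS_PER_CALL
    era_change = runs * 9 / INNINGS
    return calls, runs, era_change

for bias in (0.0020, 0.0025):
    calls, runs, era_change = era_impact(bias)
    print(f"bias {bias:.2%}: {calls:.1f} calls, {runs:.2f} runs, ERA change {era_change:.3f}")
```

Both ends of the range round to roughly a .01 change in ERA under these assumptions.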

Questec and non-Questec park breakdowns in next post.


#25    MGL      (see all posts) 2007/08/18 (Sat) @ 20:58

Here is some more data broken down into Questec and non-Questec parks:

As far as I am aware, the parks with Questec installed are: Angels, Arizona, Boston, Cleveland, Houston, Milwaukee, Mets, Yankees, Oakland, and Tampa.

Questec Parks
N pitches = 377,423
K%=31.9

Non-Questec
N pitches = 747,596
K%=31.9

So there is no difference in overall K% in Questec and non-Questec parks.  Here are the umpire/pitcher combinations broken down by Q and non-Q parks.

Non-Questec Parks

White/Non-white
N: 193,303
Expected: 31.65
Actual: 31.50
1 SD = .1%

White/White
N: 491,699
Expected: 31.95
Actual: 32.10
1 SD = .06%

Black/Black
N: 993
Expected: 31.60
Actual: 30.80
1 SD = 1.5%

Black/Non-black
N: 42,241
Expected: 31.55
Actual: 31.50
1 SD = .2%

Hispanic/Hispanic
N: 4,170
Expected: 31.89
Actual: 32.50
1 SD = .7%

Hispanic/Non-Hispanic
N: 15,153
Expected: 32.06
Actual: 31.50
1 SD = .4%

Same race/ethnicity
N: 496,862
Expected: 31.95
Actual: 32.10
1 SD = .07%

Different race/ethnicity
N: 250,697
Expected: 31.75
Actual: 31.50
1 SD = .09%

Questec Parks

White/Non-white
N: 96,682
Expected: 31.65
Actual: 31.40
1 SD = .15%

White/White
N: 250,435
Expected: 32.05
Actual: 32.10
1 SD = .09%

Black/Black
N: 251
Expected: 32.20
Actual: 32.50
1 SD = 2.9%

Black/Non-black
N: 18,444
Expected: 31.75
Actual: 31.80
1 SD = .34%

Hispanic/Hispanic
N: 2,638
Expected: 31.55
Actual: 31.00
1 SD = .9%

Hispanic/Non-Hispanic
N: 8,973
Expected: 32.14
Actual: 32.70
1 SD = .5%

Same race/ethnicity
N: 253,324
Expected: 32.05
Actual: 32.10
1 SD = .09%

Different race/ethnicity
N: 124,099
Expected: 31.70
Actual: 31.60
1 SD = .13%
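The "1 SD" figures in these tables look like ordinary binomial standard deviations, sqrt(p(1-p)/N). That reading is an assumption, since the post does not say how they were computed, but a short sketch reproduces them. The function names are made up for illustration.

```python
import math


def binomial_sd(p, n):
    """Standard deviation of an observed proportion p over n called pitches."""
    return math.sqrt(p * (1 - p) / n)


def z_score(actual_pct, expected_pct, n):
    """Number of SDs separating actual from expected (both given in percent)."""
    sd_pct = 100 * binomial_sd(expected_pct / 100, n)
    return (actual_pct - expected_pct) / sd_pct


# First White/Non-white row above: N = 193,303, expected 31.65, actual 31.50.
print(round(100 * binomial_sd(0.3165, 193_303), 3))   # SD in percent
print(round(z_score(31.50, 31.65, 193_303), 2))       # gap in SD units
```

The same two functions applied to the Black/Black row (N = 993) give an SD of about 1.5%, which is why the small cells cannot show anything either way.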

There is a slight difference between the biases we find overall and that in the Questec and non-Questec parks, but I doubt that these differences (between the Q and non-Q parks ) are significant, unlike the results the original authors found.  I am not surprised, as I did not think that being in a Questec park (or not) would affect such small and presumably subconscious decisions.

I did not look at the effect of park attendance.

My results are similar to those of the authors, and I would have to come to the same conclusion (that there is some small racial bias), but the difference created by such a bias is likely so small as to not make much of a practical difference in terms of an advantage for pitchers/teams when a same race/ethnic umpire is behind the plate or disadvantage when the umpire is of a different race/ethnicity.


#26          (see all posts) 2007/08/18 (Sat) @ 22:10

mgl, could you give us a paragraph or so on how you got the white/white expectation in the non-Questec case?

For instance, something like “On average, the white umpires called X strikes, the white pitchers threw Y strikes, and the league average is Z, so we calculate it this way ...”


#27    MGL      (see all posts) 2007/08/18 (Sat) @ 22:37

For all of the “expectations” I took the overall rate of the group of umpires, in this case, the white umpires called 31.9% strikes for all parks and all pitchers, and then adjusted by the pitchers and batters they faced in a particular subset.  The white/white is not a good example, because white/white is most of the data, although the white pitchers throw more strikes than the pitchers overall, so there is some adjustment.

Anyway, the white umpires had an overall K% of 31.90 (same as all umpires combined since 93% of them were white).  The white pitchers they faced (I’m not sure if this was in Q, non-Q, or all parks) were 32.20, higher than all pitchers combined.  The batters they faced were 31.90, same as all batters combined.  Anyway, I took the average of the batters and pitchers, which in this case is 32.05.  Then I took the ratio of batters and pitchers faced to overall batters and pitchers, which in this case is 32.05/31.90.  That is the “adjustment factor.” IOW, they faced higher than normal K% batters and pitchers so we would expect them to call higher than normal K%.  Their overall K% (31.90) is multiplied by the adjustment factor of 1.0047021, to get 32.05. That is the expectation.

BTW, I just wrote to one of the authors and sent him my results.  I also asked him for a list of his umpires and pitchers.  I only had 2 Hispanic umps, Marquez and Hernandez.  They had 3.  I only had 4 black umpires.  They had 5.

They also had way more black pitchers than I did and way more Asian ones.  Here is what I and they had:

White
me: 678
them: 669

Hispanic
me: 210
them: 219

Asian
me: 23
them: 29

Black
me: 0
them: 27

I think they called a lot of the hispanic pitchers black. Basically, if a pitcher was born outside of the U.S. in an Hispanic country, I called him Hispanic, even if he looked black.  I thought that is what they did, but maybe not.  Or maybe I just missed a lot of black pitchers.  I did it by memory and to some extent the names for pitchers I did not know.

I think that in order to compare studies, we at least ought to get closer in terms of the classifications of the pitchers and umpires.  Having an extra black and Hispanic umpire makes a big difference.


#28    MGL      (see all posts) 2007/08/18 (Sat) @ 22:39

Oops, that “me” for black should be 9 and not 0!


#29          (see all posts) 2007/08/18 (Sat) @ 22:44

>“The white pitchers they faced ... were 32.20 ... batters they faced were 31.90, same as all batters combined.  Anyway, I took the average of the batters and pitchers, which in this case is 32.05.  Then I took the ratio of batters and pitchers faced to overall batters and pitchers, which in this case is 32.05/31.90.”

I don’t think you want to average the batters and pitchers.  You want to do two multiplications, one for pitchers and one for batters: first, 32.20/31.90 times 31.90/31.90.

Look at it this way: the batters don’t affect the results at all, so the adjustment for batters has to be 1.000.  But you’ve used the fact that the batters are average to reduce the pitcher adjustment.
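To make the disagreement concrete, here is a small sketch of the two adjustments side by side, using the numbers quoted from #27. The variable names are illustrative only.

```python
league = 31.90     # overall called-strike % (all umpires, pitchers, batters)
umpires = 31.90    # white umpires' overall called-strike %
pitchers = 32.20   # called-strike % of the white pitchers they faced
batters = 31.90    # called-strike % of the batters they faced

# Method described in #27: average pitchers and batters, then take one ratio.
averaged = umpires * ((pitchers + batters) / 2) / league

# Method suggested in #29: one ratio for pitchers times one ratio for batters.
separate = umpires * (pitchers / league) * (batters / league)

print(round(averaged, 2), round(separate, 2))
```

Averaging gives an expectation of 32.05, while adjusting separately gives 32.20, because the league-average batters no longer dilute the pitcher adjustment.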

What do you think?


#30          (see all posts) 2007/08/19 (Sun) @ 00:00

Okay, I think I may have figured out what’s going on with your study and the original Hamermesh study, and why both of you are coming up with statistically significant estimates of bias when I (still) think there’s no evidence of bias showing.

I think your study (mgl) is making several hidden assumptions.  I’m trying to figure out what all those are, but for now I have pinpointed one.  It’s the assumption that pitchers will keep their exact proportion of strike calls even if umpires are unbiased.  If white pitchers throw X% strikes with biased umpires, they will also throw X% strikes with unbiased umpires. 

Intuitively, the assumption is obviously false.  It’s quite possible that white umpires are perfectly fair to black pitchers, but give a little extra help to white pitchers.  If that’s the case, then white pitchers will throw LESS than X% strikes if the umpires become unbiased. 

But your study assumes the percentage stays equal.  (You can check that.  Ignore the adjustments for batters and home field for now, and figure out the *actual* strike percentage for white pitchers, and the *expected* strike percentage.  I think you’ll find they’re the same.)

The assumption can be put another way: “For all umpires, their bias in favor of their own race must be opposed by an opposite bias in favor of the other races, in exact proportion to the number of pitches seen by each.” This *must* be false.  Here’s a proof by contradiction.  Suppose it’s true.  And suppose that after 161 games, white pitchers have 10 times as many pitches as minority pitchers.  And suppose that umpires favor white pitchers by Z%.  Then, by the assumption, they disfavor minority pitchers by Z/10 percent. 

Suppose now that on the 162nd day, only minority pitchers throw, and are responded to with the usual bias.  Now minorities have *more* than 10% of pitches, so now the equation doesn’t balance.  The assumption is then false after the 162nd day, which is a contradiction.

This is a big assumption, although a hidden one.  And it’s false.  So the assumption limits the possible “curves” (actually, matrices) that you are permitting to fit the data.  And so when you find statistical significance, you are finding that the probability of seeing those real-life data GIVEN the assumption.  Since the assumption is not realistic, the significance level isn’t relevant.

I think this is related to what I said on one of my blog posts (http://sabermetricresearch.blogspot.com/2007/08/alternative-significance-test-for.html).  There are lots of possible “expected” matrices that would show no significance, but the matrix produced by the method used here (and probably in the Hamermesh study) is not one of them. 

Here’s one intuitive explanation of why the adjustments don’t work.  Suppose you find that white umpires call strikes A% above average, and white pitchers receive strikes B% above average.  You would think that white umpires should call strikes on white pitchers A% times B% above average.  And that’s true IF umpires are unbiased.  But if they’re biased, then there is no reason to expect that either A% or B% is the “correct” value if umpires are unbiased.  So when you use A% and B% to calculate what the significance level would be *if umpires were unbiased*, that number doesn’t mean anything.  Because if it’s significant, and you conclude the umpires were biased, then the A% and B% were wrong in the first place.

Anyway, I need to think some more about this, and figure out a more elegant way to explain it, but I’m pretty sure this is what’s going on.


#31    MGL      (see all posts) 2007/08/19 (Sun) @ 00:28

Phil, you are right in #29.  I need to do them separately.

Let me read #30 and I’ll try and comment.


#32    MGL      (see all posts) 2007/08/19 (Sun) @ 00:32

Sorry Phil, maybe I am tired, or maybe I am not smart enough, but I have no idea what you are talking about.


#33    MGL      (see all posts) 2007/08/19 (Sun) @ 00:57

I just read your post linked to above, and with all due respect, Phil, and I am no statistician, I have more than a sneaking suspicion that these PhDs and professors (who did the original study) know what they are doing, statistics-wise, and that you do not.  I presented my data, and I’ll let any statisticians, if they want, review it for significance or whatever.  As I have said many times in the past, and I think you have agreed, adherence to this “if it meets the 2 sigma level of significance, I can make my point, but if it doesn’t, I can’t,” is ludicrous.  ANY differences are EVIDENCE of bias, especially if the bias MAKES SENSE and similar-type subconscious biases are found in other studies where one person deals with same and different race people (remember these are all Bayesian problems when we can make some common sense assumptions or where we have a priori data from similar-type studies).  Now, how much certainty we want to attach to that evidence in terms of concluding the likelihood of “true” bias, is up to the individual to decide, depending upon what he wants to do with the information.  Regardless of whether we find results that are significant at the .5 sigma level or 2 sigma level, I am comfortable with saying that both the authors and I found evidence of racial bias, with the caveat for “civilians” who are not familiar with empirical studies being that there is some chance, be it 5% or 20%, that there is no bias and these results occurred by chance.  As well, it is likely that if there is bias, that it is NOT exactly of the magnitude we found, but the further we go from the “mean” difference that we found, the less likely it is that that is the true difference.  Let’s NOT get stuck on this being a binary result - there either IS bias or there IS NOT, based on whether the results pass some arbitrary test of significance.

That being said, I am not sure what the significance level of the authors’ or my results are, and it certainly depends on what result we are looking at (difference between overall and same race K%, difference between whites and blacks only, etc.).  I am reasonably sure that the overall difference that I got between ALL same race combinations and different race combinations is significant at or near the 2 or 3 sigma level despite what you wrote in your post above.  In that case, we simply have 2 cells - one for same and one for different with two different means, with the standard error of the difference being the square root of the sum of the binomial variances, which is about as far as my statistical knowledge goes (binomial variances).  The standard error of the difference is .084% and the difference between the two cells (same and different races) is .3%, which is more than 3 standard errors.  Where I got the .3% from was that for same race combos, they call .1% more than expected and for different race combos they call .2% less than expected, for a difference of .3%.  I THINK that is the number that is subjected to the standard error of the difference between the two samples.
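The standard error of a difference between two proportions, as described here, can be sketched as follows. The sample sizes below are an assumption: they are the same-race and different-race N values from the two park tables in #25 added together, with rounded K% values, so they need not reproduce the quoted .084% exactly.

```python
import math


def se_of_difference(p1, n1, p2, n2):
    """Square root of the sum of the two binomial variances."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


# Figures quoted in the post, in percentage points.
quoted_se = 0.084
quoted_diff = 0.3
print(round(quoted_diff / quoted_se, 2))   # 3.57 standard errors

# ASSUMPTION: N values summed from the two park tables in #25, rounded K%.
same_n = 496_862 + 253_324
diff_n = 250_697 + 124_099
se = se_of_difference(0.321, same_n, 0.315, diff_n)
print(round(100 * se, 3))                  # SE in percent
```

With those assumed inputs the standard error comes out near .09%, in the same neighborhood as the quoted figure, so a .3% gap is still about 3 standard errors.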


#34    Guy      (see all posts) 2007/08/19 (Sun) @ 08:58

I haven’t expended the necessary energy to form an opinion on the statistical significance debate.  But Phil/29 is clearly right, and could affect your results significantly.  For example, you have the expected Wh/Wh rate as 32.00, making the actual 32.10 rate signif. higher than expected.  However, in post 27 you mention that white pitchers overall are 32.20, which means the correct expected Wh/Wh matchup must be at least 32.1, if not higher.  So it’s important to re-run your numbers before forming conclusions about bias.

Think about it:  how likely is it that the white/white matchup will be significantly different than expected, given that the expected rate for white pitchers is almost entirely determined by white umps in the first place?


#35          (see all posts) 2007/08/19 (Sun) @ 09:40

MGL/32, You’re definitely smart enough.  Either I’m wrong, or I didn’t explain it well.  Actually, I definitely didn’t explain it well.  I’ll think about it and try again sometime today.

MGL/33, I think all your significance level stuff is right.  It’s the adjustments that I worry about.  Not that they’re wrong (they’re a decent way to smooth out the data to get a “null hypothesis” of non-bias), but they’re not the *only* way.  Which was my point in 30.


#36          (see all posts) 2007/08/19 (Sun) @ 09:56

Okay, let me try this—it’s like umpires/batters, but in only one dimension.  Maybe just umpires.  But it should illustrate the point.

Suppose you go to the deep south in 1960 and ask people whether blacks should have the right to vote.  Here’s what you get:

Whites: 50%
Blacks: 100%

You want to test for bias.  So you figure (step 1) what’s the *expected* percentages you should get in the absence of bias?  Since there are 80% whites, and 20% blacks, the numbers come out to

Whites: 60%
Blacks: 60%
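Step 1 is just a population-weighted average of the two observed rates, applied to both groups. A minimal sketch, using only the numbers in the example:

```python
share = {"white": 0.80, "black": 0.20}      # population shares from the example
yes_rate = {"white": 0.50, "black": 1.00}   # observed "yes" rates from the example

# Pooled rate: what every group is "expected" to say if there is no bias.
pooled = sum(share[g] * yes_rate[g] for g in share)
expected = {g: pooled for g in share}

print(round(pooled, 2))
```

The pooled rate is 0.6, which is where the 60%/60% expectation comes from.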

You then run a significance test (step 2).  What’s the chance that whites would have scored at only 50%, if their real chance was 60%?  You find it’s statistically significant.

In this case, the method seems to work: from what we know about race relations in the deep south in 1960, there IS bias.  And the test found it.

However, look back at step 1.  What we did was that we said the results “should” have been

Whites 60%
Blacks 60%

But from what we actually know about race relations in the deep south in 1960, if there had been no bias then, the REAL numbers would have been

Whites 100%
Blacks 100%

Because all the bias was on the part of the whites.

In this particular test, it still worked when we tested against 60%.  It probably would have worked against ANY numbers, because the bias is large.

However, it seems at first glance that it didn’t HAVE to work out that way.  It could have worked out that we found significance in one case, but not the other.  (Actually, I think in this one-dimensional case it always works out the same, and it’s only in the 3x3 baseball case that it doesn’t, but I’m not sure.  The point is, we really only have tested the 60%/60% case, but the 100%/100% case is the “real” situation.)

And I think that’s what’s happening in the baseball case.  If you calculate the “expected” strike calls the way you have, you get significance.  If you adjust them the way I do, in my own post, you do NOT get significance.

Which way is right?  We don’t know.  But I argue that for moral and statistical reasons, you have to give the benefit of the doubt to the umpires.  As long as we find at least ONE adjustment that is reasonable and shows no bias, you go with that one.


#37          (see all posts) 2007/08/19 (Sun) @ 10:07

Here’s a quick 2x2 example.

In a certain league, black umpires hate black pitchers.  In any other combination, there are 4 K’s per game called.  But if it’s black/black, there are NO strikeouts called.  All the players and umpires know this is the case, and know about the feud between black players and black umpires.

The matrix looks like this:

4 4
4 0

Now a researcher comes along, who doesn’t actually know about the feud.  He calculates that the league average K rate is 3 (there are equal numbers of black and white on both sides of the matrix).  He then adjusts each cell to an “expected” number. 

For the white/white cell, he figures, “okay, the average is 3, but white umpires call 4, and white pitchers get 4.  So I take 3, multiply by 4/3, and multiply by 4/3 again.  The result is 5.3.  So white pitchers are expected to get 5.3 strikeouts against white umpires.”

He computes the full “expected” matrix to this:

5.3 2.7
2.7 1.3

And, admittedly, he probably finds significance.
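The researcher’s procedure can be written out in a few lines. This is a sketch of the multiplicative adjustment as described in the example; treating rows as umpires and columns as pitchers is an assumption (the matrix is symmetric, so it does not change the result).

```python
# Rows: white ump, black ump.  Columns: white pitcher, black pitcher (assumed layout).
observed = [[4.0, 4.0],
            [4.0, 0.0]]

n_rows, n_cols = len(observed), len(observed[0])
league = sum(sum(row) for row in observed) / (n_rows * n_cols)
ump_avg = [sum(row) / n_cols for row in observed]
pitcher_avg = [sum(observed[r][c] for r in range(n_rows)) / n_rows
               for c in range(n_cols)]

# Expected cell = league average x umpire factor x pitcher factor.
expected = [[league * (ump_avg[r] / league) * (pitcher_avg[c] / league)
             for c in range(n_cols)]
            for r in range(n_rows)]

for row in expected:
    print([round(x, 1) for x in row])
```

This reproduces the 5.3 / 2.7 / 2.7 / 1.3 matrix, and shows how the one biased cell leaks into every marginal average and therefore into every “expected” value.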

But my point is, the above matrix is not a correct representation of what the matrix should look like under conditions of no bias.  As everybody associated with the league (but, alas, not our researcher) knows, it should look like this:

4.0 4.0
4.0 4.0

I think either case would give us significance, but can you at least imagine a case where it wouldn’t, where a guess at what the adjusted should be (first matrix) gives significance, while the unknowable real life adjusted matrix does not give significance?

I think that’s the case in the Hamermesh data, and I think you have to base your conclusions on the one that shows the least significance.  That is, since you don’t know which one is right, you choose the one that’s most favorable to the umpires—the one that’s most compatible with the null hypothesis.


#38          (see all posts) 2007/08/19 (Sun) @ 11:55

Guy continues the discussion at The Sports Economist in what I think is an excellent post:

http://www.haloscan.com/comments/skipsauer/6714267678478680720/?a=31792#432796


#39    MGL      (see all posts) 2007/08/19 (Sun) @ 15:07

Phil, O.K., I get your point now.  I’ll have to play around with the numbers to see if it makes a difference with numbers that are so close to one another.

In any case, I have to redo the numbers since I did the wrong adjustments as you pointed out (nice catch).

Also, as Guy pointed out, it is a little tricky to do the pitcher and batter adjustments since technically you should use the overall batter and pitcher rates when they are NOT facing the umpires you are adjusting.  I’m not sure it makes that much difference though (since you don’t want to adjust when the umpires are “causing” the adjustment).  The whole thing is a little trickier than I thought.  I would still not be surprised if there were SOME bias found in the data.  I really think that you should expect some.


#40    John Beamer      (see all posts) 2007/08/19 (Sun) @ 15:24

I would still not be surprised if there were SOME bias found in the data. I really think that you should expect some

It depends on how you define bias. If you define bias as a difference irrespective of statistical significance (however that is defined) then sure. However, to say that you would expect bias just because is wrong. In fact the null hypothesis has to be that there is NO bias, and the burden of proof is on the researcher to reject that. As we have discussed extensively, that burden, I don’t think, has yet been met.

The fact that batters have no bearing on umpire calls is slightly odd. Also the fact that no black/white bias exists also points to supporting the null hypothesis. If a small bias is present then judging by some of the numbers it isn’t worth worrying about, I don’t think.


#41    MGL      (see all posts) 2007/08/19 (Sun) @ 16:47

I agree that if there is bias, the numbers definitely suggest that it isn’t worth worrying about.

However, as I have said many times, many if not most of these kinds of empirical studies are exercises in Bayesian probability.  Given what we know or think we know a priori, what is the probability that the results support or don’t support some hypothesis.

Apparently (I am not familiar with the research of course) in other similar studies in other venues, researchers have found that when people deal with and make judgments or evaluations of same and different race people, that subconscious bias exists.  So it is not unreasonable to say, a priori, that there is x chance that bias exists and treat the data in a Bayesian fashion.  Trying to find racial bias in the data is A LOT different than trying to find bias among umpires based on the pitcher’s birthday.  In fact, for those of you that doubt my “Bayesian” argument, do you think that data suggesting racial bias and data suggesting bias by pitcher birthdays should be treated the same way?  If I flip a coin from my pocket 100 times and come up with statistically significant higher heads, am I supposed to conclude that the coin is likely biased?  Why not?  Because it is a BAYESIAN problem where the a priori probability of a biased coin in my pocket is near zero!  All of these studies are Bayesian problems!  It is NOT correct to assume the null hypothesis and try and “refute” it at some arbitrary sigma level.  That is only true when the null and alternative hypotheses are EQUALLY likely beforehand (a priori).  They may be in this case, but I don’t think so.
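The coin argument can be illustrated with Bayes’ rule. Everything specific below is hypothetical and chosen only for illustration: 60 heads in 100 flips (about 2 sigma), a “biased” coin that lands heads 60% of the time, and the two priors.

```python
from math import comb


def posterior_biased(prior, heads, flips, p_biased=0.6):
    """Posterior probability that the coin is biased, given the flips observed."""
    like_biased = comb(flips, heads) * p_biased**heads * (1 - p_biased)**(flips - heads)
    like_fair = comb(flips, heads) * 0.5**flips
    return prior * like_biased / (prior * like_biased + (1 - prior) * like_fair)


# HYPOTHETICAL priors: a coin from your pocket vs. a 50/50 starting belief.
print(posterior_biased(prior=1e-6, heads=60, flips=100))
print(posterior_biased(prior=0.5, heads=60, flips=100))
```

The same data leave the pocket coin almost certainly fair but make the 50/50 case lean strongly toward bias, which is the point about priors mattering as much as the sigma level.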


#42    MGL      (see all posts) 2007/08/20 (Mon) @ 01:25

Here I just looked at the differences for the various combos between Questec (Q) and non-Questec (NQ) parks.  I also added US Cellular Field to my list of Q parks, for a total of 11 Q parks, which is what the original authors used and what is listed in Wikipedia under Questec.

I think there is strong evidence of racial bias among minority umpires but not necessarily among white umpires.

In this analysis, I did not do any adjusting for pitchers and batters faced.  There were some fairly significant differences in the pool of pitchers and batters faced in the various combos, but in looking at them, they would not change the conclusions.

League-wide, there were around 1,124,941 called pitches.

31.90% of these were called strikes.

For home pitchers, it was 32.090%.

For visiting pitchers, it was 31.715%

The home pitchers have more called pitches, because they pitch the extra half inning more often.  570,010 called pitches were thrown by the home pitchers and 555,009 by the road pitchers (50.7% of all called pitches are by home pitchers).

Here is the basic overall data for the 2 Hispanic umpires, Hernandez and Marquez:

N pitches = 30,934
.5093 are from home pitchers (.5067 is average)

31.97 K%

Their average batter faced = 31.98
Their average pitcher faced = 31.95

They faced 74% white pitchers, 22% Hispanic pitchers, 2% black, and 2% Asian pitchers.  These percentages are by PA which is essentially the same as by pitches or by called pitches.  In other words, 74% of the PA with an Hispanic umpire behind the plate were with a white pitcher on the mound, etc.

Here is the basic overall data for the 4 black umpires, Meriwether, Diaz (actually Jamaican I think), Bucknor, and Danley.

N pitches = 61,929
.5055 are from home pitchers (.5067 is average)

31.57 K%

Their average batter faced = 31.86
Their average pitcher faced = 31.81

They faced 73% white pitchers, 22% Hispanic pitchers, 2% black, and 3% Asian pitchers, essentially the same distribution as the Hispanic umps.

For the white umpires, 93% (87 of 93) of all umpires, we expect their numbers to basically be equal to the overall numbers for umpires.

White umpires K% = 31.92
N= 1,032,156
Home % = .5067

Their average batter faced = 31.90
Their average pitcher faced = 31.91

They faced 72% white pitchers, 23% Hispanic pitchers, 2% black, and 3% Asian pitchers, also essentially the same distribution as the Hispanic and black umps.

Before I get to the various combinations of umpires and pitchers, here is some evidence that there is racial bias in non-Questec parks but not in Questec parks, as the authors of the original study found, or at least there is more bias in the non-Questec parks (if there is bias in both).

Overall, the K% in non-Questec parks is almost .1% higher than in Questec parks, as we would suspect (if you watch a lot of games, you will see that umpires “like” to call pitches strikes that are not strikes, if they can get away with it, which they are more likely to do in non-Q parks).  The numbers are 31.94 in NQ parks and 31.85 in Q parks.

For white umpires, their K% is 31.84 in Q parks and 31.97 in NQ parks, around the overall average as you would expect.  Keep in mind that 72% of the time, white umpires face white pitchers.  So if they were biased in favor of white pitchers at least in NQ parks, we would expect to see this bias show up even when we look at ALL of their calls.

For black umpires and Hispanic umpires, they usually face different race pitchers.  98% for black umpires and 78% for Hispanic umpires.  We find that black and Hispanic umpires have a higher K% in Q parks, the exact opposite of the white umpires. 

Black umpires have a K% of 31.66 in Q parks and 31.53 in NQ parks.  Hispanic umps are 32.23 in Q parks and 31.80 in NQ parks.

If we buy the premise that umpires are much less likely to be biased in parks where they are being monitored, then the Q parks become “control parks” for purposes of this study.

If this is true, that there is a racial bias effect in the NQ parks only, or at least more of a bias in the NQ parks than the Q parks, we should see a larger effect if we look at W/W only, B/non-B only (although all black umpires are mostly B/non-B anyway), and H/non-H only, in the Q and NQ parks.

For the W/W combo, in Q parks, the K% is 32.03 and in NQ parks, it is 32.13.  So we DON’T see a larger effect as we expected.

In fact, for the W/non-W combo, we see 31.38 in Q parks and 31.54 in NQ parks.  Basically no different from the W/W combo, suggesting that if there is no bias in Q parks, there is also no bias in NQ parks, for white umpires.

Let’s look at B/non-B and H/non-H combos in Q and NQ parks.  For the B/non-B combo, we should see essentially the same thing as with black umpires overall, as 98% of their pitchers faced are non-black anyway.

For the H/non-H combo though, we should see stronger evidence of bias than for Hispanic umpires overall.  Indeed the B/non-B combo had around the same numbers as the black umpires overall, 31.65 in Q parks and 31.55 in NQ parks. 

The H/non-H combo had a bigger effect than the overall Hispanic umpire numbers, suggesting that there is a significant bias in non-Q parks for Hispanic umpires.  The numbers for the H/non-H combo were 32.66 in the Q parks and 31.53 in the NQ parks.

So looking at differences between Q and NQ parks, we definitely find evidence of racial bias among Hispanic and black umpires, but not among white umpires.
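For reference, the Q versus NQ gaps quoted in this post can be tabulated in one place. The sketch only restates the numbers given above; the dictionary keys are labels made up for the table.

```python
# (K% in Q parks, K% in NQ parks), as quoted in the post.
k_pct = {
    "white umps, all pitchers": (31.84, 31.97),
    "black umps, all pitchers": (31.66, 31.53),
    "Hispanic umps, all pitchers": (32.23, 31.80),
    "W/W": (32.03, 32.13),
    "W/non-W": (31.38, 31.54),
    "B/non-B": (31.65, 31.55),
    "H/non-H": (32.66, 31.53),
}

# Positive means more called strikes in the unmonitored (NQ) parks.
diffs = {name: round(nq - q, 2) for name, (q, nq) in k_pct.items()}

for name, gap in diffs.items():
    print(f"{name:28s} NQ - Q = {gap:+.2f}")
```

Only the H/non-H gap exceeds one percentage point; the W/W and W/non-W gaps both sit between .10 and .16 and point the same way.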


#43    Pizza Cutter      (see all posts) 2007/08/20 (Mon) @ 03:28

Oh sure, I wander off to San Francisco for a week (American Psych Assn convention) and this pops up.

I have a few tiny critiques of the method used in the original paper (pitching wins used as an outcome measure!), but the biggest one is that the authors never control for the pitcher.  A simple fixed effect for pitcher would do nicely.

For some reason, I haven’t heard anyone use the words “chi-square” in this discussion.  Using the data that MGL posted in #24 (excellent adjustments MGL!) for actual and expected called strike counts, I calculated a chi-square, although only for White, Black, and Hispanic combos (9 possible combos of pitcher/ump).

Overall, the chi-square value is 10.656 with 8 degrees of freedom.  P-value is .22, meaning no significant findings overall.

To get a little more fine-grained, I re-ran it for just the cells involving white umpires, then just Hispanic umpires, then…

White umpires had a chi-square of 7.094 (2 degrees of freedom), which had a P-value of .028, which is significant.  White pitchers get more strike calls than expected from White umpires; Black and Hispanic pitchers get less, although the effect was much more pronounced for Hispanic pitchers.

There’s also a marginal effect for Hispanic pitchers (p = .0523).  Hispanic pitchers get shafted by both White and Black umpires, while when judged by Hispanic umpires, they about break even.

The effects are fairly small, statistically, but significant (i.e., not zero) nonetheless.  The rest is a moral question of how big an effect you want until it makes a difference to you.
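The p-values can be checked without a statistics package, because the chi-square upper tail has a simple closed form when the degrees of freedom are even. The sketch below assumes a goodness-of-fit setup, so 9 cells give 8 degrees of freedom and a 3-cell subset gives 2; the function name is made up.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail p-value of a chi-square statistic (closed form, even df only)."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))


print(round(chi2_sf_even_df(10.656, 8), 3))   # overall, 9 cells
print(round(chi2_sf_even_df(7.094, 2), 3))    # white umpires, 3 cells
print(round(chi2_sf_even_df(7.094, 8), 3))    # same statistic if df were 8
```

The overall value matches the .22 quoted above, and 7.094 yields roughly .03 only at 2 degrees of freedom; at 8 degrees of freedom the same statistic would be nowhere near significant.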


#44    John Beamer      (see all posts) 2007/08/20 (Mon) @ 07:54

Pizza—I think you need to do the Chi Square on the original data and not on MGL’s adjustments in #24, as those adjustments weren’t quite right—see posts above.

I agree with MGL that the data mostly points to bias from minority umps. Actually this trend can be discerned from figure 1 in the original study and the resulting regressions, although I still think there are potential biases in the sample data given the non-random nature of Questec parks and also the very few Hispanic umpires (where the bias seems to originate from).


#45    Guy      (see all posts) 2007/08/20 (Mon) @ 09:53

MGL:  I wasn’t arguing that to calculate expected percentages you must rely on pitcher performance outside the category currently being analyzed.  That might make sense if you had 20 different racial/ethnic groups, but here you would have results like basing black pitchers’ “expected K%” only on black and Hispanic umps when trying to measure white ump bias, dramatically shrinking your sample.  I think you have to use their overall rates.

However, I do think doing the adjustments correctly will make a real difference.  Most likely, white pitchers’ expected rate will be higher, meaning there is no white/white bias. 

In post 42, do you think the Hisp/Hisp matchup could be affected by the distribution of pitchers, hitters, and/or park?  The sample size can’t be that large, and a concentration of NL games alone in non-Q parks could raise the K%.


#46    tangotiger      (see all posts) 2007/08/20 (Mon) @ 10:17

Pizza/43:

Wouldn’t you have to do the chi on all 9 cells, and not at all on the 3-cell set?  After all, if the Hispanic pitchers are dominated by wild throwers, then all you are showing with the white umpire set is that fact.  (Imagine if you will that you would do the “white umpires against each team”, and conclude that white umpires are biased against the Reds and toward the Padres.) The reason you need all 9 combinations is that it controls for that fact (sample size notwithstanding).

So, I’m sticking with your p=.22.


#47    MGL      (see all posts) 2007/08/20 (Mon) @ 10:59

My batter and pitcher adjustments were wrong, as Phil pointed out (I took the average of the two rather than adjusting for each separately) so I have to redo that.  Plus it is not clear what to use to adjust since the batter and pitcher overall numbers are affected by bias, if it exists, in the first place, and we really want to do the adjustments with unbiased pitcher and batter numbers.  I’ll post some more numbers later.  It does “look” to me, though, that if there is a bias it is by minority umps and not white umps. 

I still have not received a reply from the authors about who their 5th black and 3rd Hispanic umpires are.  Any ideas from you guys?


#48    Pizza Cutter      (see all posts) 2007/08/20 (Mon) @ 13:25

I see the error in MGL’s adjustments, although I think they’re a fair sight prettier than the original data.  The methodological problem is that the White/White cell is overwhelming absolutely everything here.  I don’t have the data set, but the way around that would seem to be to equalize the distributions by taking a random sample of the over-represented ones.
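The subsampling idea could look something like the sketch below. The data are hypothetical toy values, not the real pitch data, and `equalize` is a made-up helper name.

```python
import random


def equalize(cells, seed=0):
    """Randomly downsample every cell to the size of the smallest one."""
    rng = random.Random(seed)
    target = min(len(calls) for calls in cells.values())
    return {name: rng.sample(calls, target) for name, calls in cells.items()}


# HYPOTHETICAL toy data: 1 = called strike, 0 = ball.  Not the real data set.
toy_rng = random.Random(1)
cells = {
    "W/W": [int(toy_rng.random() < 0.32) for _ in range(5000)],
    "H/H": [int(toy_rng.random() < 0.32) for _ in range(400)],
}

balanced = equalize(cells)
print({name: len(calls) for name, calls in balanced.items()})
```

The trade-off is that every cell ends up as noisy as the smallest one, so with cells of only a few thousand pitches the balanced sample would have far less power than the full data.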

Tango/46, it’s a little bit more complex than that.  The White umpires were the only ones that showed any sub-group effects, although the Hispanic pitchers showing an effect suggests that they may indeed be more wild.  Sub-group analyses are more volatile due to the fact that you have less information available and in this case, we have two factors which depend on each other.  The .22 overall number is going to be more stable, and is probably the better number with which to summarize the data.  But sometimes things shake out in the aggregate while hiding real effects in the details.


#49    tangotiger      (see all posts) 2007/08/20 (Mon) @ 14:08

Dan Fox posts some Questec numbers:
http://danagonistes.blogspot.com/2007/08/umpires-and-questec.html

I have a comment in the comments section of that link that specifies to be careful with what you see.


#50    John Beamer      (see all posts) 2007/08/20 (Mon) @ 16:03

I’ve been reading this link here where Hamermesh gives a bit of commentary, which I find interesting, if slightly sloppy.

The link is: http://www.msnbc.msn.com/id/20252500/

For instance Hamermesh said:

“The umpires hate those [QuesTec] systems,” Hamermesh said. “When you’re going to be watched and have to pay more attention, you don’t subconsciously favor people like yourself. When discrimination has a price, you don’t observe it as much.”

And this

We all have these subconscious preferences for our own group

The big issue with this study is that if any effect exists it is in the Hispanic/Asian sample data, where the data is patchier and less likely to be robust. I dunno ... but I don’t feel comfortable drawing conclusions about racial bias when the biggest group (white umps/pitchers) shows no bias, and the group most traditionally discriminated against (black umps/pitchers) shows no bias. In the group that does show the most bias there are only a couple of umps! Unless someone can come up with a convincing explanation I think we have to say that while there may be some racial bias in some sectors of the umpiring population it is at best marginal.

Someone needs to do this study for other time periods and see if the same conclusions are drawn


#51    Pizza Cutter      (see all posts) 2007/08/20 (Mon) @ 16:35

John/50 - Hamermesh is arguing there from a psychological POV.  There is a lot of evidence from other settings about subconscious in-group bias effects.  For example, if you assign people randomly to groups, even if you tell them it’s completely random assignment, they still show subtle preferences to people in their group, although they all deny it.  It’s a very robust finding.

The biggest problem is that we don’t have an independent third arbiter of whether the pitch was actually a ball or a strike (QuesTec data itself?  Pitch f/x?) We’re relying on the umpires for our data on what a true ball/strike is, while at the same time making the argument that they may be biased themselves!


#52    tangotiger      (see all posts) 2007/08/20 (Mon) @ 17:08

I think Dan Fox, John Walsh et al can easily answer Pizza/51.

And, as I pointed out earlier, even if there was a large bias, it would not necessarily be a race-bias, since it doesn’t look like the authors introduced a parameter for the pitchers and umpires.  All that it would show, if there was significance, is that one to many of the umpires in the subgroup have a bias.  And, if one umpire makes up 20% or 33% of a group, and you don’t control for the identity, then this is out-and-out racism (painting a group with a single brush for the bias of a few, or one).

Did I miss the part in the paper where the identities of the umps/pitchers were controlled for?  Imagine if this study was done in the 1997 playoffs.  The black ump / hispanic pitcher bias would be off the charts.


#53    MGL      (see all posts) 2007/08/20 (Mon) @ 18:20

I agree that even if there were some highly significant differences, when you have a study where you are getting data on 2 or 3 persons (the Hispanic umpires) or even 4 or 5 (the black ones), to generalize about “racial bias” would be ludicrous.

For example, to anyone that watches a lot of games, Angel Hernandez, one of the only 2 Hispanic umpires in my sample, and one of only 3 in their sample, is known as somewhat of a “nut case” of an umpire.


#54    Guy      (see all posts) 2007/08/20 (Mon) @ 19:12

Tango/52: 
In some regressions they use fixed effects to control for individual pitchers, and in a few (like the Questec regressions) they do the same for individual umps. But it doesn’t appear to be consistent.

I think there’s an even more basic problem, which is failing to control for pitcher’s (or umpire’s) race.  As a result, their same-race variable becomes a rough proxy for pitcher=white, since pitcher is white in 98% of same-race matchups.  I suspect that if they simply controlled for pitcher’s race, then the pitcher-ump match impact would largely disappear.
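
The proxy problem can be shown with a few lines of Python. This is only a sketch: the two called-strike rates and the 25% share of white pitchers in different-race matchups are assumed for illustration, and the rates depend on the pitcher's group alone, so no umpire bias is built in anywhere.

```python
# Hypothetical called-strike rates that depend ONLY on the pitcher's group,
# never on the umpire: no bias is built in anywhere.
rate = {"white": 0.3208, "nonwhite": 0.3150}

def pooled_rate(share_white):
    """Called-strike rate of a pool where `share_white` of pitches come from white pitchers."""
    return share_white * rate["white"] + (1 - share_white) * rate["nonwhite"]

same_race = pooled_rate(0.98)   # pitcher is white in about 98% of same-race matchups
diff_race = pooled_rate(0.25)   # ASSUMED share, purely for illustration
gap = same_race - diff_race
print(f"same-race {same_race:.4f}, different-race {diff_race:.4f}, gap {gap:.4f}")
```

A same-race "advantage" shows up even though every umpire in the toy example calls every pitcher identically. The gap is just the difference in pitcher mix between the two pools.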


#55    MGL      (see all posts) 2007/08/20 (Mon) @ 20:57

Here is some more data.  To jump the gun, my final conclusion is that there is not any significant evidence of bias among any combinations of umpires/pitchers other than with Hispanic umpires and Hispanic and non-Hispanic pitchers. Given that there are only 2 Hispanic umpires in my sample of umpires (all umpires in 04-06) I would definitely hesitate to make any generalizations about racial bias.

Here are some data for various combinations of umpires and pitchers, in all parks, Q parks, and NQ parks:

Same race

All parks:

K%=32.09
N=750,186
Batters: 31.92
Pitchers: 32.08

Q parks:

K%=32.02
N=274,098
Batters: 31.89
Pitchers: 32.15

NQ parks:

K%=32.13
N=476,088
Batters: 31.94
Pitchers: 32.04

Different race

All parks:

K%=31.53
N=374,796
Batters: 31.88
Pitchers: 31.55

Q parks:

K%=31.51
N=140,483
Batters: 31.88
Pitchers: 31.55

NQ parks:

K%=31.54
N=234,313
Batters: 31.88
Pitchers: 31.55

Now, since the same-race group is mostly comprised of white pitchers, we see that the pool of pitchers is at 32.08, about that of white pitchers.  The different-race group has lots of minority pitchers, and as you can see, those pitchers are at 31.55 overall.  If we adjust for pitchers in each “cell” we get virtually the same K% for same and different combos.  Of course, it could be that the reason that the minority pitchers have low K% is that they face mostly white umps.  In other words, it could be that their “real” K rates are the same as white pitchers, but for the fact that they face mostly white, different race umpires.  So from the above data, I don’t think we can tell whether there is any overall bias.
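
As a rough sketch of that pitcher adjustment, using only the two “All parks” blocks above: subtract the pitchers' overall K% from the observed K% in each group, then compare. Treating the adjustment as a simple subtraction is an assumption here, not necessarily the exact method used.

```python
# K% (called-strike rate) and the pitchers' overall K%, all parks,
# copied from the two "All parks" blocks above.
cells = {
    "same race":      {"k_pct": 32.09, "pitchers": 32.08},
    "different race": {"k_pct": 31.53, "pitchers": 31.55},
}

def pitcher_adjusted(cell):
    """Observed K% minus what these pitchers post overall (assumed adjustment)."""
    return cell["k_pct"] - cell["pitchers"]

raw_gap = cells["same race"]["k_pct"] - cells["different race"]["k_pct"]
adj_gap = pitcher_adjusted(cells["same race"]) - pitcher_adjusted(cells["different race"])
print(f"raw gap: {raw_gap:.2f} points, pitcher-adjusted gap: {adj_gap:.2f} points")
# prints: raw gap: 0.56 points, pitcher-adjusted gap: 0.03 points
```

The raw gap of about half a point nearly vanishes once each group is compared with what its own pitchers do overall.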

If we break these cells down even further, here is what we get:

W/W

Q parks:

K%=32.03
N=271,040
Batters: 31.89
Pitchers: 32.16

NQ parks:

K%=32.13
N=471,094
Batters: 31.94
Pitchers: 32.04

W/N-W

Q parks:

K%=31.38
N=110,173
Batters: 31.87
Pitchers: 31.43

NQ parks:

K%=31.54
N=179,812
Batters: 31.89
Pitchers: 31.46

B/B

Q parks:

K%=32.83
N=264
Batters: 32.35
Pitchers: 32.55

NQ parks:

K%=32.13
N=980
Batters: 32.15
Pitchers: 31.60

B/N-B

Q parks:

K%=31.65
N=20,636
Batters: 31.90
Pitchers: 31.86

NQ parks:

K%=31.55
N=40,049
Batters: 31.82
Pitchers: 31.79

H/H

Q parks:

K%=30.76
N=2,794
Batters: 31.80
Pitchers: 31.17

NQ parks:

K%=32.79
N=4,014
Batters: 31.96
Pitchers: 31.79

H/N-H

Q parks:

K%=32.66
N=9,674
Batters: 31.94
Pitchers: 32.20

NQ parks:

K%=31.53
N=14,452
Batters: 32.04
Pitchers: 31.97

Without going through each of the above combos, it looks to me like the only significant difference is among Hispanic umpires in NQ parks, and there are only two Hispanic umpires in my sample.  The difference, after adjusting for batters and pitchers (but not parks) faced, is around .7% in K%, which is more than 2 standard errors above what you would expect if there were no bias.  I don’t find much if any bias among any other groups.  And given that there are only 2 Hispanic umps in the data, I would hesitate to draw ANY conclusions about race with respect to Hispanic umpires or any other race/ethnicity of umpires.
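
For a feel of how much noise sits in each raw cell, here is a small sketch that puts a plain binomial standard error on a few of the K% figures above. It assumes every pitch is an independent trial and it does not attempt to reproduce the batter/pitcher adjustment or the adjusted .7% figure; it only shows how quickly the error bars widen as N shrinks.

```python
from math import sqrt

def binomial_se(k_pct, n):
    """Standard error, in percentage points, of a rate observed on n pitches."""
    p = k_pct / 100
    return 100 * sqrt(p * (1 - p) / n)

# Raw K% and N for a few of the cells listed above.
cells = {
    "W/W   Q":  (32.03, 271_040),
    "B/B   Q":  (32.83, 264),
    "H/H   NQ": (32.79, 4_014),
    "H/N-H NQ": (31.53, 14_452),
}
for label, (k_pct, n) in cells.items():
    print(f"{label}: K% = {k_pct:.2f} +/- {binomial_se(k_pct, n):.2f}")
```

The W/W cell is pinned down to within about a tenth of a point, while the B/B cell in Q parks is uncertain by roughly three points either way.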


#56    MGL      (see all posts) 2007/08/20 (Mon) @ 23:19

The article that Tango refers to on MSNBC is filled with misleading information.  To wit:

Hamermesh added that even a slight bias by umpires will affect the kinds of pitches that pitchers make if they believe they are getting squeezed by the umps. Pitchers who are getting balls called too much might start throwing over the middle of the plate more, thus resulting in batters getting fat pitches to hit, Time said.

How does he know that?  I doubt that the kind of differences that the researchers found (if they are even “real”) are in any way shape or form noticed by pitchers.  Certainly not to the extent that they are going to significantly alter their pitching styles.  The above snippet makes it seem as if a pitcher who notices one “extra” ball in 300 is going to start throwing pitches right down the middle.  That is ridiculous.

“I expect that [MLB] will not be very happy about this, but the fact that with a little bit of effort this kind of behavior can be altered, that’s very gratifying.”

I’ve read that a few times.  How is he proposing that this “behavior” be altered?  Show the umpires the study and tell them to stop being biased?  If it is subconscious, you may not even be able to alter it.  I guess he means install Questec in all the parks. That is not an unreasonable idea in any case.  The whole point of Questec is two-fold.  One, to evaluate umpires, and two, to encourage them to call a more accurate and uniform strike zone.  If umpires are aware of which parks they are not in, that kind of partially defeats the purpose on both ends.

“One pitch called the other way affects things a lot,” Hamermesh said. “Baseball is a very closely played game.”

That is a meaningless and perhaps stupid statement without any context.  First of all, what does, “Baseball is a very closely played game” mean? I guess that is a variation on, “Baseball is a game of inches.” More importantly, “one pitch called the other way” per how many pitches?  The differences they found, in fact, as several posters here have pointed out, make very little difference in terms of runs scored and the outcome of the game.

They found that the lowest percentage of strikes were called when the pitcher was black and the umpire was white, Time said.

Uh, that does NOT mean that white umpires are biased against black pitchers, which that statement clearly implies.  You would have to look at all the numbers as well as the sample sizes (to see if the numbers have any “significance”).  In fact, in my data, and I think in theirs, little or NO bias was found in white umpires’ calls.

I am starting to think that the standards in journalism, even in the mainstream media, are pitifully low.


#57    Guy      (see all posts) 2007/08/21 (Tue) @ 08:02

“They found that the lowest percentage of strikes were called when the pitcher was black and the umpire was white,” Time said.

As you say, this is not evidence of bias.  But it’s worse than that:  in fact, the lowest called strike rates in their data are when Asian pitchers face Black or Hispanic umpires.  You have to wonder if he honestly missed such an obvious fact in his own data, or if he focused on the black/white comparison because it fits the usual narrative of racial bias better than a story that Black and Hispanic umps discriminate against Asian pitchers.


#58    tangotiger      (see all posts) 2007/08/21 (Tue) @ 08:10

I wouldn’t be surprised if there is the most strike calls with oversized pitchers, especially if those pitchers are CC Sabathia and David Wells.  Or the most strike calls with lefthanded 6’10” pitchers. 

The population of each sub group, in my examples and in the paper, is not the same.


#59    John Beamer      (see all posts) 2007/08/22 (Wed) @ 16:46

I think there’s an even more basic problem, which is failing to control for pitcher’s (or umpire’s) race.  As a result, their same-race variable becomes a rough proxy for pitcher=white, since pitcher is white in 98% of same-race matchups.

But in the regressions where the pitcher and umpire fixed effects are controlled for then this is effectively controlled for too. No?

However, the fact that pitcher and umpire weren’t controlled for in Table 3 is indefensible.


#60    Guy      (see all posts) 2007/08/22 (Wed) @ 23:46

I don’t think fixed effects would control for pitcher race, though I could be wrong about that.  Regression can’t “know” if white pitchers have a higher K% because of their race or some other characteristics.  So if you ran a regression with just pitcher race and fixed effects, I assume it would show race as a signif. predictor of K%.  Here, the authors include a same-race matchup variable, which is hugely correlated with pitcher=white (while in the mixed-race sample pitchers are predominantly Hispanic), but don’t control for pitcher race. 

If you look at table 2, you can see that white and Black pitchers both enjoy a same-race advantage of about .0015, while for Hispanic pitchers it’s about .0035.  Yet the authors conclude that the overall advantage is .0034.  I don’t see how you get there except by failing to account for white pitchers’ higher K%.

Perhaps the authors made the assumption that pitchers of all races would have the same K%, absent umpire bias.  If so, that assumption should be made explicit.
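
A quick sketch of why .0034 looks odd as an overall figure: any pitch-weighted average of the three same-race advantages has to land between .0015 and .0035, and it sits near .0015 if white pitchers dominate the weights. The weights below are an assumption, borrowed from the same-race pitch counts in MGL's 04-06 sample in #55 rather than from the authors' own sample.

```python
# Same-race advantages read off the paper's Table 2 (as quoted above).
advantage = {"white": 0.0015, "black": 0.0015, "hispanic": 0.0035}

# ASSUMED weights: same-race pitch counts from MGL's 2004-06 sample in #55
# (Q + NQ parks), not the authors' own sample.
pitches = {"white": 271_040 + 471_094, "black": 264 + 980, "hispanic": 2_794 + 4_014}

total = sum(pitches.values())
weighted = sum(advantage[g] * pitches[g] for g in advantage) / total
print(f"pitch-weighted same-race advantage: {weighted:.4f}")
```

With these weights the average stays at roughly .0015, well short of .0034.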


#61    John Beamer      (see all posts) 2007/08/23 (Thu) @ 00:58

So if you ran a regression with just pitcher race and fixed effects, I assume it would show race as a signif. predictor of K%.

I may be skating on thin statistical ice here but my understanding of what the authors have done with the pitcher fixed effects is control for each pitcher.

So in your example of a regression of pitcher race with pitcher fixed effects I think the pitcher race variable would be zero/insignificant because all the variance in the data is captured through the pitcher fixed effect variable. The R^2 of this regression would be 1 because effectively all pitchers have a separate variable controlling for them.

Agreed on table 2 ... that is a mystery. But that 0.0034 doesn’t include pitcher or umpire fixed effect. Some of the later regressions eg, Questec (sample size issues here though) and terminal count (relievers may well have different k%—though pitcher fixed effect should control for this) do.


#62    Guy      (see all posts) 2007/08/23 (Thu) @ 08:12

But John, if that were true then no variable in a fixed effects model would ever be significant.  The fixed effect variable will assign a coefficient to “Brandon Webb” that should represent his unique K%, but controlling for other factors in the model.  In my example, it would be his K%, controlling for his “whiteness.”

Also, the R^2 certainly would not be 1, because the model is trying to determine the outcome of each pitch.  In fact, if you look at their tables you’ll see the R^2 is always quite low, as you’d expect.  After all, the outcome on any given pitch is highly random.

That said, my knowledge of fixed effects regression is definitely limited, and others should weigh in.
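
The R^2 point can be checked with a small simulation. This is a sketch with made-up pitchers and rates, not the authors' model: it uses the fact that a least-squares regression on pitcher dummies alone fits each pitch with that pitcher's own observed strike rate, then computes R^2 at the pitch level.

```python
import random

rng = random.Random(42)

# Hypothetical pitchers with true called-strike rates between 28% and 36%.
true_rate = {f"pitcher_{i}": 0.28 + 0.08 * rng.random() for i in range(100)}
pitches = [(name, 1 if rng.random() < p else 0)
           for name, p in true_rate.items() for _ in range(1_000)]

# With only pitcher dummies on the right-hand side, the least-squares fit
# for each pitch is just that pitcher's own observed strike rate.
totals, counts = {}, {}
for name, strike in pitches:
    totals[name] = totals.get(name, 0) + strike
    counts[name] = counts.get(name, 0) + 1
fitted = {name: totals[name] / counts[name] for name in totals}

grand_mean = sum(strike for _, strike in pitches) / len(pitches)
ss_total = sum((strike - grand_mean) ** 2 for _, strike in pitches)
ss_resid = sum((strike - fitted[name]) ** 2 for name, strike in pitches)
r_squared = 1 - ss_resid / ss_total
print(f"pitch-level R^2 with pitcher fixed effects: {r_squared:.4f}")
```

Knowing who is pitching explains only a sliver of the pitch-by-pitch variance, so R^2 stays close to zero. With one row per pitcher instead of one row per pitch, the same dummies would reproduce every value exactly and R^2 would be 1.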


#63    John Beamer      (see all posts) 2007/08/23 (Thu) @ 10:49

if that were true then no variable in a fixed effects model would ever be significant

Agreed that others should weigh in on this, as I am not 100% sure of my ground, but it depends on what you are measuring.

Suppose you are running a regression of pitcher ERA vs k%, nothing more. If you control for pitcher fixed effects (which I interpret to be a dummy variable for each pitcher in the sample) then the R^2 of the regression is 1 because each dummy variable basically represents the ERA of each pitcher. Hence R^2 = 1.

In the example we are talking about above the dependent variable would be pitcher K% and the independent variable would be pitcher race + fixed effects. Suppose there are 100 pitchers in our sample then that is 100 fixed effect (dummy) variables. The race variable would be zero/insignificant as all the difference in K% would be captured by fixed effects.

Now the regressions that are being run by Hamermesh et al are a lot more complex. You are trying to work out the difference in K rates among pitchers when different/same race umpires are behind the plate. The pitcher fixed effect controls for every pitcher’s K rate independent of the umpire, but you should still see an impact on the UMP variable.

If you look at table 3 or 4 you will see that the number of fixed effects controlled for is in the 1000s.


#64          (see all posts) 2007/08/23 (Thu) @ 11:03

I think what’s happening is that the fixed effects are absolutely controlling for the K% of each pitcher.  But what’s being measured is not that K%, but *each individual pitch*.  And knowing that Roger Clemens is on the mound still doesn’t tell you whether his 27th pitch will be a strike.

So what this regression can tell you is that Clemens (or actually, any pitcher, adjusted for his own strike rate) is more likely to get a called strike when the umpire is his own race.  And since he’s already been adjusted for *his own* K-rate, his race is irrelevant.

So I think the method is fine, but I still don’t understand why the results came out significant.  One idea I’m toying with is that the regression (3d) assumed the probabilities are additive.  If they’re multiplicative (as seems more likely), maybe that could cause a false positive?  That is, suppose white pitchers throw 10% more strikes than black, but the study measures it as “one more per three innings” instead of 10%.  Clemens should “really” go from 50 per game to 55 per game, but the study has him go from 50 to 53.  Therefore, it looks like white umpires are favoring him.

The 3d “LPM” column didn’t use probits or logits or whatever, just a regular regression on %.  And since white pitchers throw more strikes than black, this could have made the difference.

Remember, the result was only significant to exactly 2 SD.  And, the 3x3 chart did show *some* same-race effect.  So the additive/multiplicative issue maybe could have changed the significance from 1.6 SD to 2.0 SD, or something.

It’s too bad the author didn’t use probits for (3d) like he did for (1d) or (2d), because then we’d know.
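
To see what a probit does differently from a straight regression on %, here is a tiny sketch using Python's standard-library normal distribution. The size of the shift (0.05 on the probit index) is an arbitrary assumption. A linear model would add the same number of percentage points to every pitcher; a probit adds the same amount on the index scale, which translates into a different change in probability depending on where the pitcher starts.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def probit_shift(base_rate, index_shift):
    """New probability after adding `index_shift` to the probit index of `base_rate`."""
    return std_normal.cdf(std_normal.inv_cdf(base_rate) + index_shift)

for base in (0.30, 0.50):
    new = probit_shift(base, 0.05)
    print(f"base {base:.2f} -> {new:.4f} (change {new - base:+.4f})")
```

Note that this is a third functional form, not the “turn X% of balls into strikes” model either; the only point is that the choice of scale changes how a same-race effect is spread across good and bad pitchers.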


#65    John Beamer      (see all posts) 2007/08/23 (Thu) @ 12:00

Phil

Can you enlighten me at all as to why probit is better for this rather than a regular regression? I assume it is because the dependent variable is a binary outcome (called strike or not). Right, that makes sense.

So, what I don’t understand is why they switched to a normal regression. It doesn’t make sense!


#66    John Beamer      (see all posts) 2007/08/23 (Thu) @ 12:04

One idea I’m toying with is that the regression (3d) assumed the probabilities are additive.  If they’re multiplicative (as seems more likely), maybe that could cause a false positive?

Phil—Why would the probabilities be multiplicative?


#67          (see all posts) 2007/08/23 (Thu) @ 12:33

What I meant to say was that if an umpire increases the strike percentage somehow, it probably does it in a way that involves multiplication.  That is, perhaps white/white turns (say) 0.03% of balls into strikes, NOT that white/white turns (say) one ball into a strike per X pitches.

Suppose a certain ump turns 10% of balls into strikes.  That means a Clemens who throws 50% strikes would go up to 55%, but a bad pitcher who throws 30% strikes would go to 37%. 
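
The arithmetic in that example, as a short sketch (the function name is made up for illustration):

```python
def boosted_strike_rate(strike_rate, share_of_balls_flipped):
    """Strike rate after an umpire turns a fixed share of would-be balls into strikes."""
    balls = 1 - strike_rate
    return strike_rate + share_of_balls_flipped * balls

for name, rate in [("Clemens", 0.50), ("bad pitcher", 0.30)]:
    new_rate = boosted_strike_rate(rate, 0.10)
    print(f"{name}: {rate:.0%} -> {new_rate:.0%} (+{100 * (new_rate - rate):.0f} points)")
# prints:
# Clemens: 50% -> 55% (+5 points)
# bad pitcher: 30% -> 37% (+7 points)
```

The same umpire behavior produces a 5-point gain for one pitcher and a 7-point gain for the other, because the pitcher with more balls has more calls available to flip.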

A regression that does this linearly would credit both pitchers with a