
THE BOOK--Playing The Percentages In Baseball


Wednesday, September 15, 2010

Either a horrible study or terrible journalism…

By , 10:32 PM

From this WSJ article, courtesy of Rob Neyer:

Erin Smith, a Ph.D. candidate at NYU’s Stern School of Business, co-wrote a paper that shows a 48% increase in a team’s attendance makes that team score an extra run every game.


#1          (see all posts) 2010/09/15 (Wed) @ 22:42

Also from the WSJ article:

“Smith’s study, which was published this year in the Journal of Quantitative Analysis in Sports, used a regression model to account for things like team ability, stadium size and weather.

By doing this, she showed that increased attendance does, in fact, help teams play well, instead of this simply being a matter of good teams drawing more fans.”

Can a regression distinguish the direction of the causality without doing some kind of longitudinal analysis?


#2          (see all posts) 2010/09/15 (Wed) @ 22:58

Here is a link to the actual paper, courtesy of someone at BBTF:

http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1495195


#3          (see all posts) 2010/09/15 (Wed) @ 22:59

And a fascinating comment from JC, also from the BBTF thread:

“2-3 years ago, I refereed this paper for another journal, which rejected the paper on my recommendation. I agree that the paper is flawed, but not necessarily for the reasons discussed above. They claim to tackle the correlation/causation problem (duh), but I’m not convinced that they did so properly. In fact, by the end of the paper I wasn’t really sure what the authors were doing, and I wouldn’t be surprised if the results were fabricated. Now, that’s a strong charge, why do I feel that way? Well, the authors lied about doing background research that they did not do, crediting me for research that I never did. If you take a look at the paper (http://www.bepress.com/jqas/vol6/iss1/4/), the erratum should stand out. When I busted them on it, one author lied to me as how this occurred (not realizing that I was the referee who had previously discovered this). Yet, even after being warned to remove the erroneous information from the paper by me and another editor, the authors simply passed the paper along to another journal. I was personally told by the JQAS editor that the paper was rejected, which is the proper thing to do when academic dishonesty is discovered. I put in an e-mail to the editor asking why the paper is still “published,” but he hasn’t gotten back to me.”


#4    Andy      (see all posts) 2010/09/15 (Wed) @ 23:00

The paper is available here, although it may be gated: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1495195

Here is the relevant statement from the abstract: “Our results indicate that a one standard deviation increase in attendance results in a 4% increase in the likelihood of a home team win.”

The authors (it is unclear why just Smith is credited in the WSJ) address the causality issue through a standard econometric technique of instrumental variables, which attempts to find a ‘random’ event that moves attendance without affecting home field advantage except through the attendance effect.

The validity of instruments is always up for debate, but the authors use several: day of the week, game time, game day promotions, and weather. These will obviously influence attendance, but their impact on home field advantage is less certain. Having several of these instruments helps a lot too.


#5    Andy      (see all posts) 2010/09/15 (Wed) @ 23:08

I should add that I find the paper unconvincing. What popped out at me after a decent skim is the lack of substantial (or any!) robustness checks. It is standard operating procedure in these sorts of papers to show multiple, slightly different specifications and confirm that results look broadly similar. Otherwise people worry that the presented tables are cherry-picked. That is, the authors ran 1000 regressions and reported the one with the best looking results. I am not accusing these authors of that, but I am worried about such deviations from best practices. With such strong and striking results, it is incumbent on authors to convince their readers that they have found a robust fact.

The pricing stuff is also not very convincing, but that’s beside the point right here.


#6    minesweeper      (see all posts) 2010/09/15 (Wed) @ 23:26

Yet again, an academic paper somehow manages to fail the sniff test at virtually every level, even when the authors have been found guilty of academic dishonesty.

I mean, such an outrageous claim should require a book’s worth of explication and analysis.  Not 16 pages double-spaced.


#7    Andy      (see all posts) 2010/09/15 (Wed) @ 23:32

I would hardly call this academic dishonesty. A lousy literature review, yes, and maybe some sketchy covering of their asses. But the erratum has nothing to do with the content of the paper. Nothing. They just whiffed on a summary. Not a big deal. I would be flabbergasted if that were the reason for any journal rejection.


#8          (see all posts) 2010/09/16 (Thu) @ 16:11

If I had run this study and found the results they found, I would have assumed I screwed up the regression. Regressions are dangerous things when we forget to logically consider what we’re doing. I didn’t read the paper (and probably won’t), but I assume the null hypothesis is that attendance has no effect on outcome. And the finding is that it does. I think that passes the smell test. Then they go on to say that a 48% increase in attendance equals 1 run of additional offense (no comment on defense). To make a lazy point, should we expect the Marlins to score a dozen runs per game if their fans actually showed up? After all, they could have their attendance increase by 48% several times. And if the Phillies, who have a maxed-out attendance, add Joe Mauer and Ryan Braun, should we not expect them to score more?

I understand the flaws in those examples, and the problem is probably mostly in the reported sentence ("makes"? really!?), but it’s not passing my internal smell test. I wouldn’t have ever thought to look for it because it just ‘seems’ quacky. Maybe I’m succumbing to some kind of group think…

Am I missing something? Can someone explain how this could pass a smell test please?


#9    Depot      (see all posts) 2010/09/16 (Thu) @ 17:54

MGL (1): I would actually consider this longitudinal analysis since they include ballpark fixed effects.  Identification, then, comes from within-ballpark variation in attendance.  And they use instruments to show this variation in (supposedly) exogenous ways.  Having more years would be nice, but you’d probably want to look within-ballpark-year anyway so they’re just losing power this way. 

A-Team (8): I think the motivation is fine.  Does more attendance lead to better performance?  I would believe an effect. 

Andy (4): I’ve heard this point made before, but it’s actually wrong.  All of your instruments have to be exogenous.  You can’t just include a ton of instruments and hope one of them is exogenous.  Your assumption when using 2SLS is that all the instruments are uncorrelated with the error term.  I think that’s the flaw of the paper.  You have to believe that adverse weather doesn’t independently affect home field advantage.  _And_ day of the week (troubling since don’t starters tend to get rested on Mon/Thurs?  This might vary based on whether you’re home or away). 

I guess, overall, I like the attempt and the instruments aren’t horrendous, but I wish there were a better instrument.  I can’t think of one off the top of my head though.


#10    MGL      (see all posts) 2010/09/16 (Thu) @ 18:14

Just the fact that the result of this study is CLEARLY wrong and no one is able to tell exactly what went wrong is reason #13 why I hate complex regressions for studies like this…


#11          (see all posts) 2010/09/16 (Thu) @ 18:48

I have not (nor will I) read the paper.  According to the excerpt from the abstract (Andy #4):  “Our results indicate that a one standard deviation increase in attendance results in a 4% increase in the likelihood of a home team win.”  Are they suggesting that the increase in attendance *causes* the increase in likelihood of a home team win?  If so, I would have thought the cause and effect are reversed.


#12          (see all posts) 2010/09/16 (Thu) @ 19:24

MGL/10,

I don’t know how clearly wrong it is, but I think the interpretation (or the regression used) could be incorrectly applied.

I skimmed the paper, so I didn’t catch it, but I assume they used a probit model with the IV approach.  If they did, then the marginal effect of attendance would diminish the further you move from the mean attendance level and the closer you get to the bounds of the probability of winning (for example, the Marlins adding 40,000 fans would not just be 40,000 times the increase reported).  So either they reported it in the abstract in a way to catch attention, or they didn’t specify the regression correctly.

As for the runs per game, I would hope they used a censored regression or tobit.  If they did, then the interpretation of the coefficient reported should be similar to a probit model, just applied to Runs instead of probabilities.  So as you increase the runs with attendance, the relationship gets smaller and smaller.

When reporting the coefficients on a probit or tobit regression, you’re either reporting the effect at the mean or the average effect across all values of the independent variable.  Not sure which they report there, but keep that in mind when assessing the ridiculousness of the claim for the coefficient on attendance.  That doesn’t mean I think it’s totally valid, but something to remember.
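A minimal sketch of the diminishing-effect point, assuming a probit of the form P(home win) = Phi(a + b * x) with x measured in standard deviations of attendance above the mean. The values of `a` and `b` are hypothetical, picked only so that the effect at the mean is roughly the 4% quoted from the abstract; they are not taken from the paper.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


# Hypothetical probit coefficients (assumptions, not the paper's estimates).
A = 0.10  # intercept: Phi(0.10) is about a 54% home win probability
B = 0.10  # slope per one SD of attendance


def win_prob(x):
    """Home win probability at x SDs of attendance above the mean."""
    return normal_cdf(A + B * x)


def marginal_effect(x):
    """Change in win probability per extra SD of attendance, evaluated at x."""
    return B * normal_pdf(A + B * x)


for x in (0, 5, 10, 20):
    print(x, round(win_prob(x), 3), round(marginal_effect(x), 4))

# Naive linear extrapolation of the at-the-mean effect, for comparison.
linear_guess = win_prob(0) + 20 * marginal_effect(0)
```

Under these assumed numbers the linear extrapolation overshoots a probability of 1, while the probit curve flattens out, which is why multiplying the reported coefficient by a large attendance change can overstate the implied effect.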

I think the idea that the relationship is cause-effect from number of fans to winning is a bit of a stretch.  But that’s mainly because I’m not a huge fan of IV approaches.  One instrument used (namely, weather) seems like a reasonable choice, and any correlation it may have with the error would seem to be awfully small (though, you can make a career out of showing how IVs being used are completely useless).  I’m not a fan of using day of the week, especially the way off-days are scheduled in MLB.


#13    tangotiger      (see all posts) 2010/09/16 (Thu) @ 19:45

Millsy, I think MGL is saying that any time you get a finding of this magnitude, it’s almost certainly wrong.

This is what the abstract says:

“Our results indicate that a one standard deviation increase in attendance results in a 4% increase in the likelihood of a home team win. We also find that if attendance as a percent of stadium capacity were to increase by 48%, we would expect the home team’s run differential to increase by one run. We show that the additional home-field advantage is driven by increased home team performance.”

One SD in attendance is 25%.  “Results in” is a causal effect.

Anyway, when I’ve done my testing in the past, it was that a 2% increase in attendance was linked to 1 extra win per season.  So, a 25% increase would be linked to 12.5 extra wins, or +.08 wins per game.  But the causal link is not necessarily attendance to wins, but almost certainly wins to attendance.

2% increase in attendance means that if you take in 100MM$ in revenue from attendance and TV, you get 2MM$ extra.  And 2MM$ = 1 win is totally believable.

But, it’s the claim of causal link that is the problem.  It goes the other way: winning breeds fans.  D’uh.

***

And it’s ludicrous to state that attendance causes run differential.  They said that the one SD of attendance (+25%) causes (really linked to) 0.64 runs per game.  We know that 0.64 runs per game is +.064 wins per game.

Well, I just told you my off the cuff estimate was .08 wins per game.
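The back-of-the-envelope arithmetic in this comment can be reproduced as a quick check. This sketch assumes a 162-game season and the rule of thumb of roughly 10 runs per win, which is what the "0.64 runs per game is +.064 wins per game" conversion implies; the variable names are illustrative.

```python
GAMES_PER_SEASON = 162  # MLB regular season length
RUNS_PER_WIN = 10.0     # assumed rule of thumb implied by 0.64 runs -> .064 wins

# Rule of thumb from the comment: +2% attendance is linked to +1 win per season.
attendance_sd_pct = 25.0
wins_per_season = attendance_sd_pct / 2.0
wins_per_game_rule_of_thumb = wins_per_season / GAMES_PER_SEASON

# Figure attributed to the paper in the comment: one SD of attendance ~ 0.64 runs per game.
wins_per_game_paper = 0.64 / RUNS_PER_WIN

print(round(wins_per_game_rule_of_thumb, 3), round(wins_per_game_paper, 3))
```

The two per-game figures land in the same neighborhood, which is the point being made: the size of the association is unremarkable, and the dispute is about which way the causation runs.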

Again, it’s the causal relationship that is not there.... it goes the other way!  The more you beat your opponent in scoring, the more you win, and the more fans show up.

CORRELATION IS NOT CAUSATION.

Isn’t that the oath that all researchers have to take?


#14          (see all posts) 2010/09/16 (Thu) @ 19:58

Tango,

I’m generally agreeing, I was just trying to point out that the way it is phrased may not be exactly how it should be interpreted, even if that finding were ‘correct’.  You guys are much more familiar with run values and how that may lead to a win than I, but even I would agree that the results seem quite extreme...and the idea that the effect runs the way it is inferred is pretty questionable.


#15    tangotiger      (see all posts) 2010/09/16 (Thu) @ 20:07

Right, I can agree with that, that the researchers did properly report the findings and that +48% attendance = +1 runs per game.

The problem is the interpretation.  This is where a subject matter expert would come in handy.  If you ask any baseball schmoe and say: “Hey!  Whitesox attendance rose by almost 50% and their win% went from .500 to .600!  What do you make of it?”

Undoubtedly, 100% of those baseball fans would say: “Well, put a winning team out there, and the fans will come!”

I mean, if I had run a poll asking people: “How much would you think attendance would increase if your .500 team would start playing .600?” I think 50% would sound just about right.

There’s a good article a year or two ago studying the relationship of attendance to wins (at a team by team level) in Birnbaum’s By The Numbers.  Hopefully, Phil will look at that study, and this one, and be able to easily reconcile the two, but come up with our interpretation here, and not that of the researchers.


#16    Depot      (see all posts) 2010/09/16 (Thu) @ 20:20

Yeah, it’s hard to defend a study that I don’t agree with, but I think the overall idea is pretty valuable.  This seems like a perfect instance when you do need to use complicated regression techniques.  At the very least, you have to use IV/2SLS with fixed effects.  The whole point of the study is to address the correlation/causation question.  That’s really the point of IV.  Millsy’s point - which I think is a good one - is that the point estimates might actually be more reasonable than they sound.  So, we might as well discuss the methodology instead of dismissing the results simply because they don’t mesh with our unverified beliefs.  If you wanted to see if attendance had an effect on runs or winning %, how would you do it?  I think this paper is a decent first step.  That’s kind of an insult to the paper, but I do think it deserves some credit.  For example, I like the idea of using promotions (and promotions only) as an instrument since I don’t think teams tend to alter their strategies based on promotions.


#17    Andy      (see all posts) 2010/09/16 (Thu) @ 20:28

@Tango #13, I cannot stress enough that the *entire* point of this paper was to address the “correlation not causation” point.

They 100% know that some causation runs the other direction.  The question is, can we use statistical techniques and some tricks of the world such as the fact that weather influences attendance but does not obviously influence home field advantage, to evaluate whether there is any causation running from attendance to runs or wins. Their results suggest yes.

@Depot, no instrument except actual randomization is perfect. The question is how bad the ones we’re stuck with are. Here is where my point about showing the results for different specifications (i.e., different instruments included/excluded) becomes important in evaluating the validity of the instruments.


#18    MGL      (see all posts) 2010/09/16 (Thu) @ 20:33

Of course I have no problem with the magnitude of the correlation.  It is the direction that is absurd.  There is ZERO chance that if you increased attendance at a stadium by 48% (of capacity I guess), that the expected runs scored of the home team would increase by 1 (as a result of the attendance gain).  Zero chance.  Not 1% chance.  Zero chance.

With all the very smart statisticians that contribute to this site, why is there any controversy over whether they state or imply causation rather than just correlation?  Do they or do they not?  Does their method get to the direction of the correlation or not?  Depot above in #9 said that it did.  Are they correct?


#19    Andy      (see all posts) 2010/09/16 (Thu) @ 20:33

Millsy #12 is spot on with probit analysis. Doing large out-of-sample forecasting is almost always a terrible idea in regressions, and is especially problematic in non-linear models (like probability models).

It is unfortunate that we can often only identify average effects of interest, but that is often a fact of life. To do more you typically need to impose a lot more structure or theory on the data generating process. For instance, the work done on run values within games is a great example of how good theory informs good data work. It’s not clear what the proper theory/structure to think about how attendance influences wins is, so we’re kind of stuck with averages.


#20    Andy      (see all posts) 2010/09/16 (Thu) @ 20:38

MGL, They are estimating causation, running from attendance to wins, using weather and game time as instruments for attendance.

The only way their story is dead wrong is if weather and game time have a large impact on home field advantage. This is plausible. This is what we should be arguing about.

(There is the secondary issue of whether the results are robust.)


#21    Tangotiger      (see all posts) 2010/09/16 (Thu) @ 20:42

In post 13, I quoted their abstract, and it’s 100% clear that the conclusion was unidirectional (and the wrong direction from our viewpoint).

Now, we can say f-ck you to their interpretation, and deal with the rest of the paper, and accept that there was a correlation such that 50% increase in attendance is linked to +.100 win% (god I love round numbers).

As someone else suggested, there may be a shared link, such that most, but not all, of the correlation is one way, and a bit is the other way.

As a lay person, please explain to me what the paper did that shows how we can interpret the direction of the link.

As for how *I* would do it, it’s pretty simple: use the Vegas odds as a proxy for true talent.  Is there a bias where Vegas short sells the wins based on games where the teams draw lots of fans compared to when the teams don’t draw lots of fans?

I will guess the answer is that there’s a 00.01% chance of bias.


#22    Depot      (see all posts) 2010/09/16 (Thu) @ 20:55

I’m a little bit confused about what the source of confusion is so I’m just going to say stuff and hope it’s relevant.

The point of the paper is to find the causal link between attendance and winning %.  The paper recognizes that the raw correlation between those two variables is _not_ causal. 

Here’s a nice experiment you’d want to have as a researcher.  You have data for each game on attendance and home team runs.  Again, the raw correlation is positive because both variables are determined simultaneously.  But you also know that on some randomly-chosen dates (picked out of a hat), the home team gave away free tickets.  You do the following: (1) Determine the relationship between attendance and “free ticket days.” Say it’s 5,000.  (2) Determine the relationship between home team runs and “free ticket days.” Say it’s 1.  You would conclude that the causal effect of a 1 person increase in attendance is 1/5000.  This strategy is nice because you’ve found a shock to attendance that has nothing to do with the run scoring ability of a team on that day...except through attendance.
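The free-ticket thought experiment amounts to dividing one difference in means by another. A minimal sketch with made-up games follows, where each tuple is (attendance, home team runs, free ticket day); the data are invented so that the two differences come out to the 5,000 and 1 used in the example.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def ratio_estimate(games):
    """Return (effect per extra fan, attendance difference, runs difference).

    games: list of (attendance, home_runs, is_free_ticket_day) tuples.
    """
    free = [g for g in games if g[2]]
    regular = [g for g in games if not g[2]]
    if not free or not regular:
        raise ValueError("need both free-ticket and regular games")
    # Step (1): how much free-ticket days move attendance.
    attendance_diff = mean(g[0] for g in free) - mean(g[0] for g in regular)
    # Step (2): how much free-ticket days move home team runs.
    runs_diff = mean(g[1] for g in free) - mean(g[1] for g in regular)
    if attendance_diff == 0:
        raise ValueError("free-ticket days do not move attendance")
    return runs_diff / attendance_diff, attendance_diff, runs_diff


# Made-up data, purely to exercise the function.
GAMES = [
    (20000, 4, False), (22000, 5, False), (18000, 3, False), (24000, 4, False),
    (25000, 5, True), (27000, 6, True), (23000, 4, True), (29000, 5, True),
]

effect_per_fan, attendance_diff, runs_diff = ratio_estimate(GAMES)
print(attendance_diff, runs_diff, effect_per_fan)
```

The ratio is only a causal estimate if the free-ticket dates really are unrelated to anything else that affects scoring, which is exactly the assumption questioned for weather and day of the week.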

So, such a great experiment doesn’t really exist and you have to ask yourself...how well do the researchers approximate such an experiment?  They use weather as their shock.  Can weather affect home team runs in ways other than attendance?  What about day of the week?

In summary, the general methodology is fine.  The shocks...much less so.


#23    Andy      (see all posts) 2010/09/16 (Thu) @ 21:02

Good example, Depot. Here is a decent, but slightly technical write-up of instrumental variable techniques in general: http://www.urban.org/toolkit/data-methods/instrumental.cfm

Economists love these because we cannot typically run experiments to answer questions. No one will let you randomize years of their child’s education. We’re forced to look for “quasi-random” variation in the world.

Freakonomics discusses these ideas in a very non-technical way, as well.


#24    MGL      (see all posts) 2010/09/16 (Thu) @ 21:54

"Is there a bias where Vegas short sells the wins based on games where the teams draw lots of fans compared to when the teams don’t draw lots of fans?”

No, Vegas does NOT in any way shape or form relate HFA to attendance.  And if anyone wants to do that, I will gladly take any amount of a bet.  The idea that if the Pirates are playing at home at an odd time on Thursday with only 11,000 fans, and then on Saturday, they are playing a contending team with a popular pitcher, say, Strasburg, and they expect 30,000 fans, that we expect them to score an extra run or so, is so preposterous, that to me, it is not even worth discussing.  Seriously.  And if it were true, anyone who knew that would be a multi-millionaire betting against the Vegas line.  Have we lost our minds here?

And would that be so hard to test without using a regression model?  I don’t think so. I would merely look at all games for each team and divide them up into 3 groups - low, medium, and high attendance. I would then compare the top and bottom groups for each team.  I would look at expected runs scored based on the pitchers, hitters, park, and perhaps weather.  If there is a large difference between actual and expected it should easily show up in a study like that.  It won’t.  I’ll bet any amount of money that it won’t…
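A rough sketch of the grouping test described here, assuming each game record carries a projection-based expected runs figure. The dictionary keys (`team`, `attendance`, `actual_runs`, `expected_runs`) and the sample numbers are hypothetical and say nothing about what real data would show.

```python
from collections import defaultdict


def residuals_by_attendance_group(games):
    """Mean (actual - expected) runs in each team's bottom and top attendance thirds."""
    by_team = defaultdict(list)
    for game in games:
        by_team[game["team"]].append(game)

    def mean_residual(group):
        return sum(g["actual_runs"] - g["expected_runs"] for g in group) / len(group)

    result = {}
    for team, team_games in by_team.items():
        ordered = sorted(team_games, key=lambda g: g["attendance"])
        third = len(ordered) // 3
        if third == 0:
            continue  # too few games to split into three groups
        result[team] = {
            "low": mean_residual(ordered[:third]),
            "high": mean_residual(ordered[-third:]),
        }
    return result


# Made-up games, purely to exercise the function.
SAMPLE = [
    {"team": "PIT", "attendance": 11000, "actual_runs": 3, "expected_runs": 4.0},
    {"team": "PIT", "attendance": 12000, "actual_runs": 5, "expected_runs": 4.5},
    {"team": "PIT", "attendance": 18000, "actual_runs": 4, "expected_runs": 4.5},
    {"team": "PIT", "attendance": 20000, "actual_runs": 5, "expected_runs": 4.5},
    {"team": "PIT", "attendance": 29000, "actual_runs": 6, "expected_runs": 5.0},
    {"team": "PIT", "attendance": 30000, "actual_runs": 4, "expected_runs": 4.5},
]

groups = residuals_by_attendance_group(SAMPLE)
print(groups)
```

Grouping within each team keeps a good team's generally higher attendance from being compared against a bad team's generally lower attendance.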


#25          (see all posts) 2010/09/16 (Thu) @ 22:21

"Can a regression distinguish the direction of the causality without doing some kind of longitudinal analysis?”

Nope. You can *assume* a direction for causation, and then do all kinds of cool stuff, but you always have to remember the analysis is based on an assumption, and if that assumption is wrong the analysis is junk. It’s plausible the attendance on one specific night could cause a change in winning or losing *on that same night*. It’s not so plausible that winning or losing on one specific night could cause a change in attendance *on that same night*, unless people are showing up after the game is over. Now, winning or losing can cause changes in attendance on *future* nights, but I don’t think this is an insuperable difficulty (though I’d need to think about it more).

(For what it’s worth, I don’t think the finding is accurate because I don’t think instrumental variable techniques are accurate.)


#26    Andy      (see all posts) 2010/09/16 (Thu) @ 22:25

Just because there is a small edge against Vegas does not make betting on it profitable, as you have to overcome the vig.

Seriously, though, instrumental variables is a regression approach! It’s just the estimate is Cov(Y,Z)/Cov(X,Z) where Z is the instrument. (The normal regression is Cov(Y,X)/Cov(X,X)). This is not (very) sophisticated statistics, just a slight complication of the basic model. As Depot’s example showed, it is just comparing groups based on weather, instead of just attendance.
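The two covariance ratios can be compared on simulated data. This is a sketch under assumed conditions: `z` plays the role of a weather-like shock that affects the outcome only through attendance, `u` is an unobserved factor that drives both attendance and performance, and the true effect `TRUE_B` and all coefficients are arbitrary choices for illustration.

```python
import random


def cov(a, b):
    """Population covariance of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


random.seed(0)
N = 20000
TRUE_B = 0.5  # assumed true causal effect of x on y

z = [random.gauss(0, 1) for _ in range(N)]  # instrument (e.g. a weather shock)
u = [random.gauss(0, 1) for _ in range(N)]  # unobserved confounder
x = [zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]  # attendance
y = [TRUE_B * xi + 2.0 * ui + random.gauss(0, 1) for xi, ui in zip(x, u)]  # outcome

ols_slope = cov(y, x) / cov(x, x)  # ordinary regression: Cov(Y,X)/Cov(X,X)
iv_slope = cov(y, z) / cov(x, z)   # instrumental variables: Cov(Y,Z)/Cov(X,Z)

print(round(ols_slope, 3), round(iv_slope, 3))
```

In this setup the ordinary slope is pulled well above the true effect by the confounder, while the instrument-based ratio stays close to it. That only holds because `z` was constructed to be unrelated to `u`; with real instruments that is the assumption under debate.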

MGL, in your approach, you are not eliminating the impact of reverse causation. Your groupings are endogenous - that is - if we expect low HFA that may cause lower attendance, which will show up in your groups as zero effect. We *must* have a reason to believe the attendance groupings are approximately random to estimate the impact of attendance on wins.


#27    Depot      (see all posts) 2010/09/16 (Thu) @ 22:28

Well, you’re making the exact mistake that Tango is worried about.  You’re just looking at raw correlations.  You don’t _have_ to use regression models, but you need to use exogenous shocks to attendance.  So, you can create categories based on those shocks.  Say you want to use promotions as the exogenous shock.  You have attendance and runs on promotion days and non-promotion days.  Your parameter of interest is (Runs_promotion - Runs_non)/(attendance_promotion - attendance_non)

The benefit of using a regression model is that you might think that promotions occur on certain days of the week and you want to separately control for the day of the week.  A regression just lets you do this more flexibly and increase the chances that your experiment is valid.


#28    Depot      (see all posts) 2010/09/16 (Thu) @ 22:30

Uh...I was responding to 24.


#29    Passing Statistician      (see all posts) 2010/09/16 (Thu) @ 23:18

"As for how *I* would do it, it’s pretty simple: use the Vegas odds as a proxy for true talent.  Is there a bias where Vegas short sells the wins based on games where the teams draw lots of fans compared to when the teams don’t draw lots of fans?”

This is kind of like fitting an instrumental variable model without the instrumental variables, and I don’t think it works.

Suppose the A’s play the Mariners, Angels, and Rangers at home. Suppose Vegas correctly sets the odds, based on the last few seasons, at a 60% win chance against the Mariners, 50% against the Angels, and 40% against the Rangers. Now, Vegas has got it right, so there’s no way that you can beat them.

But this doesn’t mean that attendance has no causal effect! It could be that both the high historical win rate against the Mariners and the high win chance against them in the next game are *caused* by high attendance: the fans like to watch Ichiro and show up to the games, and this causes the A’s to play well. Now, this is much less likely than the reverse causality. But the possibility means thinking about things in this way is a poor way to measure the effect of attendance.

An instrumental variable model avoids this difficulty, but only when the instruments are strong, the causal structure is well-specified with the directions right, and the conditional independence assumptions are correct. This is rarely the case, though.


#30    MGL      (see all posts) 2010/09/16 (Thu) @ 23:20

"MGL, in your approach, you are not eliminating the impact of reverse causation.”

And #27/28.

No.  If I compare actual RS to that expected by the long-term projection for the pitcher and the hitters, I do NOT need to do anything as far as direction of correlation.

If the thesis of the study is correct, then on any day where there is low attendance the actual RS will greatly underperform the expected RS based on projections, and on any day where the attendance is high, the actual RS will outperform the projected RS.  I think Tango can verify this.

Andy, as far as Vegas is concerned, again, if the thesis were correct, and the attendance effect were so large, someone could destroy the Vegas lines even with the vig. The vig is small in baseball (in most books).  Less than 2% or so.  Vegas always figures around 54% for the home team, more or less. If the true HFA for any game were even 56% (or 52%), and you knew that, that would be more than enough to bear the vig.
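The break-even claim can be sketched with a simple expected-value calculation. The way the vig is modeled here is an assumption for illustration: the book is taken to pay decimal odds of (1 - vig) / p_line, so that a bettor with no edge loses the vig on average. Real books quote moneylines and may structure the margin differently.

```python
def expected_profit(p_true, p_line, vig=0.02):
    """Expected profit per unit staked on a side the line prices at p_line.

    Assumes (hypothetically) that the payout is (1 - vig) / p_line in decimal odds.
    """
    if not 0.0 < p_line < 1.0:
        raise ValueError("p_line must be strictly between 0 and 1")
    decimal_odds = (1.0 - vig) / p_line
    return p_true * decimal_odds - 1.0


# No edge: the line is right, so the bettor just pays the vig.
ev_no_edge = expected_profit(0.54, 0.54)
# True home win chance is 56% against a 54% line: bet the home team.
ev_bet_home = expected_profit(0.56, 0.54)
# True home win chance is 52% against a 54% line: bet the road team (48% vs 46%).
ev_bet_road = expected_profit(0.48, 0.46)

print(round(ev_no_edge, 4), round(ev_bet_home, 4), round(ev_bet_road, 4))
```

Under this assumed 2% margin, a two-point edge in either direction gives a positive expected profit, consistent with the claim that such an edge would be enough to beat the vig.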


#31    Depot      (see all posts) 2010/09/16 (Thu) @ 23:56

Re: 30…

So, basically, you want to compare runs/game in high attendance games vs. low attendance games controlling for the lineup and opposing pitcher.  There are really 2 types of strategies with regressions to deal with causality - (1) IV with exogenous shocks; (2) control for everything else.  You’re suggesting (2) or a non-regression equivalent to (2).  So, now, the assumption is that once the players/pitchers are controlled for, attendance is totally exogenous.  (Sorry, I’m thinking while typing...) Honestly, I prefer approach (2) here over (1), i.e. MGL’s approach over the paper’s.  We don’t have good instruments so (1) is a problem and we could include dummies for each player and control for everything fairly flexibly.  I would be ok with that.  The only reason I would still consider an IV approach (assuming “promotions” are potentially a good instrument) is because attendance is likely measured with error and IV implicitly accounts for that while approach (2) would be biased.

I don’t completely understand this Vegas argument.  It’s basically, “Well, if you knew this, you could make lots of money.” That doesn’t really refute the argument.  If you know something Vegas doesn’t, you’d make money.  Or maybe Vegas already knows this and prices accordingly.  Or maybe they don’t, but predicting exogenous shocks to attendance is impossible.  I think the Vegas thing is irrelevant unless I’m misunderstanding.


#32    Andy      (see all posts) 2010/09/16 (Thu) @ 23:56

MGL, I think your technique does not correctly control for reverse causality, because when we see teams beat their projections AND have high attendance, we still don’t know which one caused the other.  I have to switch to math to think it through thoroughly, or I will get confused. WARNING: LONG, MATHY COMMENT FORTHCOMING, with some notational abuse.

Let Y be runs scored. Let Y_hat be long-run projected runs scored. As I understand your suggestion, you want to look at how (Y - Y_hat) changes as attendance changes.  If this difference is positive consistently when attendance is high that would be evidence that attendance is related to wins.

Ok. Let’s formalize a little. Let X be attendance. Your test is then effectively estimating an OLS regression of the form:

(Y - Y_hat) = B * X + e

where we want to test if B is positive. If so, we will conclude that attendance influences wins. We can do this with attendance groups too; all the ideas carry through but the notation is a lot messier.

What is e? It is the random luck that determines runs and wins. It is also potential contaminations - like people’s expectations that today will be a particularly good day so that both Y and X will be high. The problem with raw correlations can conveniently be summarized by saying that COV(X,e) is not equal to 0.

We gather data and estimate. What is the OLS estimate of B?  Call it B_hat. It is:

B_hat = COV(Y-Y_hat,X)/COV(X,X)

Plugging in the definition of Y-Y_hat, we get that

B_hat = COV(XB + e,X)/COV(X,X)

A little algebra and pulling constants out of COV() gives us

B_hat = COV(X,X)*B/COV(X,X) + COV(e,X)/COV(X,X)

which simplifies to

B_hat = B + COV(e,X)/COV(X,X)

So we’re there. We still have COV(e,X) not equal to 0, as discussed above. So the estimate B_hat from your method (at least, how I understand it) is still biased away from the true parameter B that we care about. Testing whether B_hat is positive or negative is uninformative about the true sign of B.

Why is this? It’s because, when we see teams perform above projections AND have higher attendance, we cannot know whether that was because the attendance caused the performance or because some other factor caused the performance, which in turn caused the attendance. Both are interesting, but we are trying to uniquely identify the former effect.
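The decomposition can be checked numerically. In this sketch the true effect of attendance is set to zero and an unobserved "hype" factor (a hypothetical stand-in for people's expectations about the day) feeds into both attendance and the error term; all coefficients are arbitrary assumptions.

```python
import random


def cov(a, b):
    """Population covariance of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / len(a)


random.seed(1)
N = 5000
TRUE_B = 0.0  # assume attendance has no causal effect at all

hype = [random.gauss(0, 1) for _ in range(N)]     # unobserved expectations
x = [h + random.gauss(0, 1) for h in hype]        # attendance
e = [0.5 * h + random.gauss(0, 1) for h in hype]  # error term, correlated with x
y_minus_y_hat = [TRUE_B * xi + ei for xi, ei in zip(x, e)]

estimated_b = cov(y_minus_y_hat, x) / cov(x, x)  # OLS estimate of B
bias_term = cov(e, x) / cov(x, x)                # COV(e,X)/COV(X,X)

print(round(estimated_b, 3), round(TRUE_B + bias_term, 3))
```

The estimate comes out clearly positive even though the true effect was set to zero, and it equals the true effect plus the bias term, matching the algebra above.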


#33    Andy      (see all posts) 2010/09/16 (Thu) @ 23:58

I actually find the fact that Vegas doesn’t adjust the odds based on promotions or weather and that these findings would let you beat the vig the strongest evidence against the authors’ results.


#34    Depot      (see all posts) 2010/09/17 (Fri) @ 00:03

Andy, I realized we posted simultaneously…

Just assume I can control for everything perfectly flexibly.  As in...I will see the same lineup face the same pitcher 1000 times.  During those 1000 times, the attendance will vary.  The MGL assumption would be that attendance variation is exogenous.  Which is really just saying that fans can’t predict when their team is going to score more runs given the lineup/pitcher or, even if they do, they don’t respond to it (oh, I have some knowledge that this lineup will score 0.2 more runs today for some reason, but that won’t affect my probability of attendance).  I don’t know...that seems reasonable to me.  Do you disagree with that, given this stylized example?


#35    Andy      (see all posts) 2010/09/17 (Fri) @ 00:10

Yeah - I do agree that if we can condition on enough variables that attendance is random then we will have a good estimate with just correlations. So I now think MGL’s approach may be better than I first thought. But it depends on the conditioning.

I am a little skeptical that we can condition on enough variables, although that might be the economist in me talking. Don’t assume you know more than the agents you’re studying! Did we include ALL the relevant information that *might* influence performance/attendance? Injuries? Usage? Last week’s press conference? And so on.  That stuff does seem like small potatoes though.

I will say that it’s really really hard to know if you’ve conditioned on enough stuff that whatever is left over and unobservable is independent. We’d only know that if we observed it!


#36    Depot      (see all posts) 2010/09/17 (Fri) @ 00:17

Yeah, Andy, I hate trying to control for everything too.  That’s why IV is so great.  I don’t know...you’ve kind of persuaded me back to the IV approach.  I’m flip-flopping a lot on this thread so maybe I’ll just go into cheerleader mode...everyone’s ideas are great!


#37    MGL      (see all posts) 2010/09/17 (Fri) @ 01:07

"MGL, I think your technique does not correctly control for reverse causality, because when we see teams beat their projections AND have high attendance, we still don’t know which one caused the other.”

Maybe so, but I don’t expect to find that. I expect that teams will NOT beat their projections in both the high and low attendance groups. In fact, I would stake just about anything on that result. In which case, that pretty much puts the kibosh on the authors’ thesis.

And how would good teams causing large attendance (which is true) create teams that beat their projections in my study?  I don’t see that. If the team is good, then the projections will be good.  You only get good attendance when the team has performed well in the past. If a team overperforms its projection for a month straight, the attendance might go up a lot, but they will NOT overperform in those high attendance games, unless the projection was not capturing something that is real, which I doubt.


#38    Michael K      (see all posts) 2010/09/17 (Fri) @ 11:54

General question: Does bad/wet/humid weather have an effect on run scoring?

If bad weather systematically reduces run scoring (regardless of HFA) then I think that might help explain the results of this paper.


#39    Jon      (see all posts) 2010/09/17 (Fri) @ 12:18

Heteroscedasticity and every other multisyllabic statistical term notwithstanding, the authors of the subject paper have done worse than manage to confuse correlation with causality. They have flat-out interposed independent and dependent variables.


#40          (see all posts) 2010/09/17 (Fri) @ 12:29

Hi All.  I just skimmed this paper.  Anyway, I’m looking for ideas for a thesis, and I thought redoing this analysis in a way that makes sense might be a good topic.  Any ideas on how I could do this?  Or can we never account for the variables well enough to get rid of the reverse causation problem? Thanks!

stevenellingson at hotmail dot com


#41    Andy      (see all posts) 2010/09/17 (Fri) @ 12:35

@Michael K: Weather affects performance. The authors control for this. The issue is whether weather affects HFA.

@Jon: They control for heteroskedasticity. What other statistical problems are there aside from instrumental validity? Your statement about interposing variables doesn’t make sense to me. Causality can run both directions. Instrumental variables helps us tease out one direction of that causality. No one is disputing that wins influence attendance, but that is not the issue here.


#42    MGL      (see all posts) 2010/09/17 (Fri) @ 18:40

"General question: Does bad/wet/humid weather have an effect on run scoring?”

Of course weather affects RS. Hot weather = higher RS.  Humid weather, no effect (lighter air probably cancels out less COR of ball).  Wet weather, higher RS.

BTW, if I do my study whereby I compare projected RS to actual RS for each team and two groups for each team (and then combine all teams), low and high attendance, I can get one of the following results:

(BTW, the projected RS is based on weather, the to-date (or pre-season) projection for the opposing pitcher and the batters in the lineup (hitting, defense, and base running).)

1) In both groups - high and low attendance, I get no difference between projected and actual and I get no difference in RS between high and low groups. That means that there is virtually NO correlation within season between RS and attendance, regardless of the direction.

2) In both groups, projected and actual are the same, but high attendance has a higher RS than low attendance. That means that there IS a correlation, but it is that better offense (or worse opponent pitching - which is unlikely) attracts more fans, but NOT the other way around.  Maybe on days where team offensive stars are not playing, fans do not show up.

3) High attendance has higher RS than low attendance, AND in the high attendance/high RS games, the actual is higher than the projected.  This means that the crowd is causing higher run scoring than expected (from the talent on the field).  This is what the study suggests is happening.  I’ll eat all of my hats if this is the result of my study…
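The three outcomes above can be sketched as a small decision rule. This is only an illustration of the logic, not MGL's actual code: the function name, the toy game data, and the `tol` cutoff for calling two averages "the same" are all assumptions, and a real study would use a proper significance test instead of a fixed cutoff.

```python
from statistics import mean

def classify_outcome(high, low, tol=0.05):
    """Map the comparison onto outcomes 1-3; return None if none fits.

    high, low: lists of (projected_rs, actual_rs) pairs, one pair per game.
    tol: hypothetical cutoff (runs per game) for treating averages as equal.
    """
    def gap(games):
        return mean(actual - projected for projected, actual in games)

    def avg_actual(games):
        return mean(actual for _, actual in games)

    high_gap, low_gap = gap(high), gap(low)
    rs_diff = avg_actual(high) - avg_actual(low)
    projections_hold = abs(high_gap) <= tol and abs(low_gap) <= tol

    if projections_hold and abs(rs_diff) <= tol:
        return 1   # no within-season link between RS and attendance
    if projections_hold and rs_diff > tol:
        return 2   # better offense draws fans, crowd adds nothing
    if rs_diff > tol and high_gap > tol:
        return 3   # teams beat projections in front of big crowds
    return None

# Toy data: projections hold in both groups, but high-attendance games score more
high_games = [(5.0, 5.1), (4.6, 4.5)]
low_games = [(4.0, 4.0), (4.2, 4.2)]
outcome = classify_outcome(high_games, low_games)
print(outcome)
```

On this toy data the rule returns outcome 2, since actual matches projected in both groups while the high-attendance games average more runs.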


#43    Depot      (see all posts) 2010/09/17 (Fri) @ 20:27

Steven (40): if you want to do this topic, just try to find good instruments.  I’d recommend trying something like comparing attendance at games when the city’s NBA/NFL team is playing at home vs. nights that they’re not playing at home within the same series.  That might be a nice shock if there’s enough overlap in seasons.

Why is there such a bias against finding _any_ effect on this dimension?  We have no evidence at all.


#44    Andy      (see all posts) 2010/09/17 (Fri) @ 22:03

If I were to write a paper on this, the first thing I would do is check whether these results could possibly let you beat Vegas. I am super curious if I can bet the home team on nights with promotional events and make money. I suspect the answer is NO, shouted from the rooftops, which makes me skeptical about these results.

But if the answer is “yes”, or even if it’s a marginal “no”, then I would feel much better about the findings.
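To put a number on "beating the vig", here is a small sketch of the break-even arithmetic. The -110 price and the 55% win probability are illustrative assumptions, not lines or results from the thread; baseball moneylines vary game to game.

```python
def break_even_prob(american_odds):
    """Win probability at which a bet at these odds has zero expected profit."""
    if american_odds < 0:
        risk, win = -american_odds, 100   # e.g. -110: risk 110 to win 100
    else:
        risk, win = 100, american_odds    # e.g. +120: risk 100 to win 120
    return risk / (risk + win)

def expected_profit(win_prob, american_odds):
    """Expected profit per unit staked at the given win probability."""
    if american_odds < 0:
        payout = 100 / -american_odds
    else:
        payout = american_odds / 100
    return win_prob * payout - (1 - win_prob)

threshold = break_even_prob(-110)          # 110/210, a bit over 52%
edge_at_55 = expected_profit(0.55, -110)   # profit per unit if the bet wins 55% of the time
print(f"break-even: {threshold:.4f}, edge at 55%: {edge_at_55:.4f}")
```

The point of the check is that a promotion-night effect would have to push the home team's true win probability past the break-even threshold implied by the posted line before it could be profitable.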


#45    Bob      (see all posts) 2010/09/18 (Sat) @ 12:39

The MLB-reported “attendance” data is actually “tickets sold”, not “turnstile count”. (Neither league has publicly reported turnstile count since the NL converted in 1993.  That’s why they announce “paid attendance” at the ballpark.) Since the no-show rate could be expected to vary by time of year, weather, a particular home team’s standings-- e.g. pre-season sales for a 4th place team-- turnstile count and tickets sold do not correlate perfectly.  Of course, empty seats, even if sold, do not cheer on the home team players.  This seems like a significant flaw in the data for the author not to even acknowledge (if he or she is aware of it).


#46    Andy      (see all posts) 2010/09/18 (Sat) @ 12:49

Bob,

That is another advantage of the instrumental variables approach the authors use, as it gets rid of bias arising from measurement error. You are absolutely right that this could be a problem with other approaches, though.
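A quick simulation sketch of the measurement-error point, under assumptions that should be stated loudly: all numbers are synthetic, the noise in "tickets sold" is classical (independent of the true turnstile count, the instrument, and the outcome), and the true effect of 0.5 is invented. If the no-show rate moves with the instrument itself (weather, say), this clean result would not carry over.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
true_b = 0.5                             # assumed effect of true attendance

z = rng.normal(size=n)                   # instrument (hypothetical attendance shock)
x_true = z + rng.normal(size=n)          # turnstile count (not observed)
x_obs = x_true + rng.normal(size=n)      # tickets sold = truth + independent noise
y = true_b * x_true + rng.normal(size=n) # outcome depends on the true count

# OLS on the noisy measure is pulled toward zero (attenuation)
b_ols = np.cov(y, x_obs)[0, 1] / np.var(x_obs, ddof=1)

# Simple IV (Wald) estimate: the noise drops out because it is unrelated to z
b_iv = np.cov(y, z)[0, 1] / np.cov(x_obs, z)[0, 1]

print(f"OLS: {b_ols:.3f}, IV: {b_iv:.3f}, truth: {true_b}")
```

With these variances OLS converges to 0.5 * 2/3, roughly 0.33, while the IV estimate stays near 0.5.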


#47    J. Cross      (see all posts) 2010/09/18 (Sat) @ 14:19

I think the idea here is possibly quite clever (depending on the implementation, I guess).  We have to think about the “shocks”:

*day of the week seems questionable for reasons mentioned earlier.

*game time?  Is HFA independent of game time?

*weather?  Is HFA independent of time during the season or greater in September than April?  If both teams score more on a hot day is the magnitude of HFA affected?

*promotions?  Are promotions offered more frequently against lousy opponents?  I think it would be tough to accurately control for team quality.

I like depot’s idea of looking at NBA/NFL teams in town.

Anyway, I think the idea here is interesting enough that we should give them a little credit even if the magnitude of the effect they found seems unreasonably large.


#48    dcj      (see all posts) 2010/09/19 (Sun) @ 00:03

I know very little about IV methods, so I’m trying to write down what I interpret the authors to be doing. I’d appreciate feedback as to whether I’m getting it right.

I am looking at pages 22-23 of the paper. The authors use the following variables:

- Temperature
- Weekend game (includes Friday night)
- Midweek day game (Mon-Fri)

---

- Rain (yes/no variable, yes if it rained there at any time that day; 38% of games are marked yes)
- Domed stadium
- Dome*Rain
- Home average to date
- Visiting average to date
- Visiting season record
- Home pitcher ERA (presumably, that season)
- Visiting pitcher ERA

They use the first three variables as instruments and the rest as controls. (Is “controls” standard terminology?) The controls are supposed to affect the score differential both directly and through changes in attendance, while the instruments are supposed to affect the score differential only through changes in attendance.

Now they run a 2SLS regression. In the first stage they write attendance (% of capacity) as a function of the instruments and the controls. In the second stage they write score differential as a function of the estimated attendance from the first formula and the controls.

The first formula looks like

attendance = a*temp + b*weekend + c*(midweek day) + (d_1,...,d_8)*controls + constant

and the second formula is

score diff = e*attendance + (f_1,...,f_8)*controls + constant.

Let me define

attend2 = a*temp + b*weekend + c*(midweek day) + constant,

that is, isolating the contribution of the instrumental variables to attendance. Then if you were to run a least-squares regression estimating score differential from attend2 and the controls, you would get

score diff = e*attend2 + (e*d_1+f_1,...,e*d_8+f_8)*controls + constant.

At least, I think so. The point is that the coefficient on attend2 should be equal to the coefficient on (estimated) attendance in the previous formula.

Question: is it a problem if the instruments correlate with the controls? For instance, temperature is probably inversely correlated with rain.

Other question: I ran a simple regression estimating score differential just as a function of attend2. I found essentially no relationship. Is this consistent with the authors’ results?
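The two-stage procedure described above, and the claim about attend2, can be sketched on synthetic data. Nothing here uses the paper's data: the three instruments, two controls, the confounder `u`, and every coefficient (including the 0.2 effect of attendance) are invented for illustration. The sketch does the two stages by hand with least squares and then checks that regressing on attend2 plus the controls gives the same attendance coefficient as the usual second stage.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Synthetic stand-ins (all hypothetical): 3 instruments, 2 controls, 1 confounder
Z = rng.normal(size=(n, 3))
C = rng.normal(size=(n, 2))
u = rng.normal(size=n)
attendance = (Z @ np.array([1.0, 0.5, 0.3]) + C @ np.array([0.4, -0.2])
              + u + rng.normal(size=n))
score_diff = (0.2 * attendance + C @ np.array([0.3, 0.1])
              - u + rng.normal(size=n))

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)

# Stage 1: attendance on instruments, controls, and a constant
X1 = np.column_stack([Z, C, ones])
g = ols(X1, attendance)
att_hat = X1 @ g

# Stage 2: score differential on fitted attendance and the same controls
e_2sls = ols(np.column_stack([att_hat, C, ones]), score_diff)[0]

# attend2: instrument part of the first stage only (plus its constant)
attend2 = Z @ g[:3] + g[-1]
e_attend2 = ols(np.column_stack([attend2, C, ones]), score_diff)[0]

# Plain OLS on observed attendance, for comparison
e_naive = ols(np.column_stack([attendance, C, ones]), score_diff)[0]

print(f"2SLS: {e_2sls:.3f}, via attend2: {e_attend2:.3f}, naive OLS: {e_naive:.3f}")
```

The coefficients on fitted attendance and on attend2 agree because the two regressions span the same columns. One caution: the standard errors from a hand-rolled second stage like this are not valid, so a real analysis should use a dedicated IV routine for inference.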


#49    Depot      (see all posts) 2010/09/19 (Sun) @ 01:23

dcj, I might not answer your questions so just tell me if I don’t.

With 2SLS, it is essential that you include the same control variables in both stages.  You get inconsistent results otherwise. 

You can do 2SLS without any controls in either stage.  In many cases, you think your instruments are exogenous regardless of inclusion of controls.  So, really, your estimates should be the same whether you use or don’t use controls.  It’s a good test.  Now, sometimes you have a really good reason for why your instrument is exogenous _only_ conditional on the covariates but, generally, I would say that’s not true.  Maybe.  In this case, I would hope that the results would be the same regardless of whether controls are used.  That’s actually a good test for whether the instruments are valid.
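The with/without-controls check can be tried on synthetic data. This is a sketch under invented assumptions: `w` is a hypothetical control that affects the outcome, `z_clean` is an instrument unrelated to it, and `z_tainted` is an instrument that is only valid once `w` is controlled for. The helper `tsls` is written for this example and uses the same controls in both stages.

```python
import numpy as np

def tsls(y, x, z, controls=None):
    """Two-stage least squares coefficient on x, with one instrument z."""
    const = np.ones(len(y))
    extra = [const] if controls is None else [controls, const]
    first = np.column_stack([z] + extra)
    x_hat = first @ np.linalg.lstsq(first, x, rcond=None)[0]
    second = np.column_stack([x_hat] + extra)
    return np.linalg.lstsq(second, y, rcond=None)[0][0]

rng = np.random.default_rng(3)
n = 100_000
w = rng.normal(size=n)            # control that also affects the outcome
u = rng.normal(size=n)            # unobserved confounder
z_clean = rng.normal(size=n)      # instrument unrelated to w
z_tainted = z_clean + 0.5 * w     # instrument correlated with w

def simulate(z):
    x = z + w + u + rng.normal(size=n)
    y = 0.2 * x + w - u + rng.normal(size=n)   # assumed true effect: 0.2
    return y, x

y1, x1 = simulate(z_clean)
clean_gap = tsls(y1, x1, z_clean) - tsls(y1, x1, z_clean, w)

y2, x2 = simulate(z_tainted)
tainted_with = tsls(y2, x2, z_tainted, w)
tainted_gap = tsls(y2, x2, z_tainted) - tainted_with

print(f"clean instrument gap: {clean_gap:.3f}, tainted instrument gap: {tainted_gap:.3f}")
```

With the clean instrument the two estimates agree closely; with the tainted one, dropping the control moves the estimate a lot, which is the warning sign the test is looking for.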


#50    dcj      (see all posts) 2010/09/19 (Sun) @ 17:41

Depot, thanks for the response.

“With 2SLS, it is essential that you include the same control variables in both stages.  You get inconsistent results otherwise.”

Yes. What I was doing with “attend2” didn’t break that rule, though I was probably unclear. In any case it doesn’t much matter.

“In this case, I would hope that the results would be the same regardless of whether controls are used.  That’s actually a good test for whether the instruments are valid.”

I ran the analysis without any controls. Here’s what I got.

First stage:
attendance = 6382*(weekend game) + 2150*(midweek day game) + 94.6*(temperature) + 19150

In the paper they have an average attendance of 29675, which is 60% of capacity. So for them, the average stadium capacity is around 49500. If we take my first stage equation and divide by 49500, we get:

attendance_fraction = .129*(weekend game) + .0434*(midweek day game) + .00191*(temperature) + .387

Compare this with the paper’s first stage equation including controls:

attendance_fraction = .133*(weekend game) + .0452*(midweek day game) + .00264*(temperature) + (controls and constant)

This is pretty good agreement.

My second stage equation:
score_diff = -.000000522*(attendance) + .131

or

score_diff = -.026*(attendance_fraction) + .131

with a standard error that completely swamps the difference from zero.

Their second stage equation:
score_diff = 2.615*(attendance_fraction) + (controls and constant)

So my result is very different from theirs.

Technical notes: like the authors, I am using seasons 1996-2005. They have 18711 games with full information; I have 23333. I got my weather data from a different place than they did, but the coefficients in the first-stage equation ended up being very comparable, so I doubt that’s the issue.
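The unit conversion in this comment can be re-checked with a few lines of arithmetic. The sketch uses only the figures quoted above; the variable names are made up, and treating every park as having the same 49500 capacity is the comment's own simplification, not something the paper states.

```python
avg_attendance = 29675          # paper's average attendance
avg_fraction = 0.60             # paper's average share of capacity
implied_capacity = avg_attendance / avg_fraction   # about 49458, rounded to 49500
capacity = 49500

# First-stage coefficients in fans per game
first_stage = {
    "weekend": 6382,
    "midweek_day": 2150,
    "temperature": 94.6,
    "constant": 19150,
}

# Dividing the dependent variable by capacity divides every coefficient by it
first_stage_fraction = {k: v / capacity for k, v in first_stage.items()}

# Second stage: rescaling the regressor multiplies its slope by capacity
slope_per_fan = -0.000000522
slope_per_fraction = slope_per_fan * capacity

for name, value in first_stage_fraction.items():
    print(f"{name}: {value:.5f}")
print(f"second-stage slope per unit of attendance_fraction: {slope_per_fraction:.3f}")
```

The rescaled values match the figures in the comment (.129, .0434, .00191, .387, and -.026), so the gap from the paper's 2.615 is not a units problem.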


#51    MGL      (see all posts) 2010/09/19 (Sun) @ 18:06

I am reading a book called “Wrong: Why Experts Keep Failing Us and How to Know When Not to Trust Them,” by David Freedman.

One of the themes of the book is that the number (percentage) of faulty scientific studies is astounding, even those published in the most prestigious journals, like Science and Nature, and that the default position should be that any given study is wrong for a variety of reasons (publication bias, fraud, poor methodologies, sloppy work by the researchers and their assistants, etc.).

If that is true (one of the ironies of the book is that the author supports his thesis by citing studies!), then it should not be surprising if this study were wrong.

Given that the results intuitively (if you are a sabermetrician) appear to be ridiculous, at least the magnitude of the results, I think that there is greater than a 95% chance that this study is just plain wrong/bad, for whatever reasons…


#52    Depot      (see all posts) 2010/09/19 (Sun) @ 18:12

dcj,

I’m impressed you’ve tried to replicate.  So, it doesn’t look like you included ballpark fixed effects, though it’s hard to tell.  Make sure you include those.  I think the authors would say that their instruments are exogenous conditional on ballpark fixed effects.


#53    Tangotiger      (see all posts) 2010/09/19 (Sun) @ 18:55

I agree with MGL.  As an interested observer, all I see is mathematical gymnastics, which may be fun in its own right.  But, I really won’t believe any results I see, other than the smallest of impacts.

It reminds me of when others do regression on runs scored, and they get doubles as .15 runs higher than a single (when the true number is close to .30 runs).  I really don’t care how, why and how well you can justify the .15.  It’s a wrong number, an unbelievable number, something that can be proven to be wrong.

Same here with attendance and wins.  If someone were to report a number one-tenth of what was published, I might still be skeptical.


#54    Depot      (see all posts) 2010/09/19 (Sun) @ 19:26

I just think that’s a bad attitude to have.  Regression and IV/2SLS have a somewhat complicated mathematical foundation, but that doesn’t mean regression results are wrong.  IV is a widely accepted method in empirical work and it’s extremely valuable.  If you like taking means, you should like regression.  Yes, regression can be misused but that’s never a reason to dismiss a method.  In this case, IV is exactly what you want to use and the only debate should be about valid instruments.

Similarly, dismissing results because you don’t agree with them is terrible form.  Address the methodology first.  If you think the methodology is ok, but the results still look strange, then try reinterpreting the results.  In this case, the methodology doesn’t look great so we can stop there.  Otherwise, I’d worry about what the results are saying since - as discussed before by Millsy - the authors didn’t do a great job of parametrization.  I’m just suggesting a better approach than everyone going with blind faith, “Nope, that paper is wrong since it doesn’t get a 0 effect and 0 is the right answer.”


#55    Tangotiger      (see all posts) 2010/09/19 (Sun) @ 19:38

I’m not qualified to speak to the technical aspects.

But, just because I don’t understand the process doesn’t mean I can’t have an opinion as to the results.  It also doesn’t mean I have to go ahead and learn the methodology so I can prove the b.s. of the interpretation.

If I see +50% in attendance = +.100 in win%, in no way can anyone believe that that relationship is unidirectional and in the cause-effect of attendance to wins.  I’ll let the people smarter than me prove that the smell test is right.

Had I seen +50% -> +.010, now that would have been more believable, and then I would have kept an open mind.  I close my mind when I see b.s., and +50% -> +.100 is b.s.


#56    J. Cross      (see all posts) 2010/09/19 (Sun) @ 20:16

“I am reading a book called “Wrong: Why Experts Keep Failing Us and How to Know When Not to Trust Them,” by David Freedman.”

I’ve got it on hold at the library so I should be reading it soon.

btw, I just read “Statistical Inference” by Michael Oakes, which gets into some of these issues as well as others we’ve hashed out on this blog. It’s a book I’d recommend.


#57    MGL      (see all posts) 2010/09/19 (Sun) @ 22:30

Ditto what Tango said in #55!

“I just think that’s a bad attitude to have.”

What, to be extremely skeptical of a result which appears on its face to be ridiculous?  That is exactly the attitude that should be had.

Dismiss the study?  I don’t know that anyone is “dismissing” it.  And if they are, so be it.

Maybe all studies should be dismissed whose results on their face appear not to be believable (not by lay persons but by experts themselves, like Tango and myself). If someone else independently gets similar or the same results, then we can start to take it seriously.


#58    kds      (see all posts) 2010/09/19 (Sun) @ 22:55

I don’t like temperature as a variable.  I think it is a poor substitute for “school in session”.  Yeah in April or September people might decide to not get tickets when it is very cold, but I don’t think many more are going in July or August when the temp goes over 90.  I think this shows sloppy thinking on their part.

