Tuesday, March 02, 2010
Attendance and winning
Sky takes a look. A similar study was done a year or three ago in By The Numbers. If Phil is around, maybe he can link to it.
Found it:
http://www.philbirnbaum.com/btn2003-08.pdf
Article by Darren Glass. Here’s how he started it:
Methodology
I looked at all seasons from 1973 until present. In particular, I looked at the correlation coefficients between the following variables:
• Average home attendance per game (ATT)
• Home attendance per game divided by Average Home attendance over all teams (to normalize for nation-wide trends) (ATT/AVE)
• Final place in divisional standings (PLACE)
• Winning Percentage. (WIN)
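As a rough sketch of what that setup amounts to, the snippet below divides each team's attendance by the league average for its season (ATT/AVE) and then computes a Pearson correlation with winning percentage. The rows are made-up numbers for illustration, not figures from Glass's article, and the helper names are hypothetical.

```python
from collections import defaultdict
from math import sqrt

# Made-up rows for illustration only: (season, team, ATT, WIN).
rows = [
    (1975, "A", 12000, 0.580),
    (1975, "B", 9000, 0.470),
    (1975, "C", 15000, 0.610),
    (2000, "A", 31000, 0.540),
    (2000, "B", 22000, 0.420),
    (2000, "C", 37000, 0.590),
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def normalized_attendance(rows):
    """ATT/AVE: each team's attendance over that season's league average."""
    by_season = defaultdict(list)
    for season, _team, att, _win in rows:
        by_season[season].append(att)
    league_avg = {s: sum(v) / len(v) for s, v in by_season.items()}
    return [att / league_avg[season] for season, _team, att, _win in rows]


att = [r[2] for r in rows]
win = [r[3] for r in rows]
att_ave = normalized_attendance(rows)

r_raw = pearson(att, win)       # mixes the winning effect with league-wide growth
r_norm = pearson(att_ave, win)  # league-wide growth divided out
print(f"corr(ATT, WIN)     = {r_raw:.3f}")
print(f"corr(ATT/AVE, WIN) = {r_norm:.3f}")
```

In this toy data the league-wide rise in attendance between the two seasons swamps the raw correlation, which is the nation-wide trend the ATT/AVE variable is meant to remove.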
I think the more recent seasons are key, by the way.
Interesting couple of links. I do like my approach better, but I guess I am biased.
Tom, I limited the data to 1981 and onward, and all of the results stayed pretty much constant. Coefficients changed a little, but all the conclusions remained the same. You also get similar results if you extend back to 1901. The forces that bring folks to the park have remained remarkably consistent.
Sky: whoa, very cool.
Instead of logs, what happens if you convert all the attendance figures to an index for that year, so that average = 100 each year?
I have a good reason to not like logs, which I’ll try to articulate later.
I’m not a big fan of regressing on a ratio like that. Also, I can’t create that index so easily right now, so that will have to wait.
I did try using the raw attendance figures on the 1982-2009 data. That’s not ideal, but I got pretty similar results, with the added bonus that coefficients are easier to interpret.
Why don’t you like logs? It’s a quite useful transformation.
Sky, for now, let me just say that you end up trying to minimize the error of the log. And that means extreme data points will have the same error distance as non-extreme data points, just because the log-fit is better.
But in reality, we want to minimize the square of the actual data points.
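One way to read the "error distance" point, as a minimal sketch with made-up attendance figures: a 10% miss gives the same residual on the log scale whether the crowd is 10,000 or 40,000, while the raw residuals differ by a factor of four. The numbers and function names here are hypothetical.

```python
from math import isclose, log


def raw_residual(actual, predicted):
    """Error in fans per game."""
    return actual - predicted


def log_residual(actual, predicted):
    """Error on the log scale, which is what a log-attendance fit minimizes."""
    return log(actual) - log(predicted)


# Hypothetical (actual, predicted) pairs; both predictions are 10% too high.
small_crowd = (10_000, 11_000)
large_crowd = (40_000, 44_000)

raw_small = raw_residual(*small_crowd)
raw_large = raw_residual(*large_crowd)
log_small = log_residual(*small_crowd)
log_large = log_residual(*large_crowd)

print(f"raw residuals: {raw_small} vs {raw_large}")
print(f"log residuals: {log_small:.4f} vs {log_large:.4f}")
print("log residuals equal:", isclose(log_small, log_large))
```

A least-squares fit on the log scale treats those two misses as equally bad, even though the second is off by four times as many fans.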
I’ll give you an example tomorrow if you can’t read my mind…
One thing I always wanted to see was some sort of table showing attendance by day of the week, month, day or night, divisional foe, opponent record and so on. I’ve been curious to see if interleague play actually helps attendance—since those games are played while school is out, they’d have a higher attendance expectation.
I left the following comment over at Baseball Analysts:
First, the analysis is interesting, and the results are more-or-less consistent with a lot of published attendance studies. But it’s not clear whether your analysis is a multiple regression analysis or a series of bivariate analyses. The results can change as you include more explanatory variables, for any number of reasons. (For example, multicollinearity between the explanatory variables.)
Second, the “ticket price effect” is complicated, because determining an appropriate ticket price is not straightforward. There are, essentially, two approaches. The first takes annual ticket revenue and divides it by annual attendance (defined as tickets sold, not people who actually show up). The second constructs a weighted average ticket price, where the weights are the percentage of seats available at a particular price, regardless of whether they sell or not. These will almost certainly yield different “ticket prices.” I believe the second approach is theoretically preferable, but there’s a lot of disagreement about this.
The first approach (ticket price as average ticket revenue per attendance), for example, is likely to yield “ticket prices” that decline as attendance rises, if fans tend to buy the best available seats first. So the declining-ticket-price-is-associated-with-higher-attendance finding is NOT lower-ticket-prices-lead-to-higher-attendance, BUT rather, higher-attendance-leads-to-lower-priced-tickets-being-purchased... the causation runs from attendance to ticket prices, not from ticket prices to attendance.
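To make the two definitions concrete, here is a small sketch. The seating tiers and prices are invented, and it assumes the "best" seats are simply the most expensive ones and that they sell first; none of this comes from an actual team's price list.

```python
# Hypothetical seating tiers: (price in dollars, seats available).
TIERS = [(60.0, 5_000), (30.0, 15_000), (10.0, 20_000)]


def revenue_per_ticket(tickets_sold, tiers=TIERS):
    """Approach 1: ticket revenue divided by tickets sold.

    Assumes the most expensive seats are bought first.
    """
    if tickets_sold <= 0:
        raise ValueError("tickets_sold must be positive")
    remaining = tickets_sold
    revenue = 0.0
    for price, seats in sorted(tiers, reverse=True):
        sold = min(remaining, seats)
        revenue += price * sold
        remaining -= sold
    if remaining > 0:
        raise ValueError("tickets_sold exceeds capacity")
    return revenue / tickets_sold


def capacity_weighted_price(tiers=TIERS):
    """Approach 2: average price weighted by seats available, sold or not."""
    total_seats = sum(seats for _price, seats in tiers)
    return sum(price * seats for price, seats in tiers) / total_seats


for sold in (10_000, 30_000, 40_000):
    print(f"{sold:>6} sold: revenue per ticket = {revenue_per_ticket(sold):.2f}")
print(f"capacity-weighted price = {capacity_weighted_price():.2f}")
```

With nothing changed on the price list, the first measure falls as the park fills up, while the second stays fixed. The two only agree at a sellout.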
The other ticket price weirdness is that a lot of published studies find that higher attendance is associated with higher, not lower, ticket prices…
The one factor that doesn’t seem to be mentioned (or even considered) is market size. Granted, part of the problem is getting historical market sizes (a good example is the change in the actual populations of Cleveland and Pittsburgh). I guess my assumption would be that, all things being equal, KC (29th in market size) will have a much harder time drawing than a mid-market team like Houston or Texas.
Oh, there was an article dealing with this topic in one of the last Big Bad Baseball Annuals. The author (I think it was Brock Hanke) looked at opening day payroll and found that to be a big deal as well.
Interesting. Payroll could be used as a proxy for “hope”, over and above whatever their wins were the past 2 years.
The market size problem that Tom Kniker mentioned in Post#9 also brings up the problem of confusing correlation with causation. A larger market size in all probability leads to larger average attendance which likely leads to more available funds for investing in better players which may lead to more playoff appearances and World Series wins. Such a scenario would negate some of the causal conclusions that Sky is drawing from the positive correlations in his regression analysis.
Thanks for the comments…
Tango/6, When you do the transformation, you are trying to reduce heteroskedasticity in the data. If you don’t do that, the data with a lot of variability (modern data) will get far too much weight compared to data with a small amount of variability (data from 1910).
You’re right, that you’re no longer minimizing the square of the actual points, but when the variability varies a lot, you don’t actually want to do that. For regression to work properly, you need the SD of the error to be constant across all points of your data, and the log transformation does this.
Donald/8, The analysis is just one multi-variate model - sorry if that wasn’t clear - otherwise the analysis would be worthless for reasons you mention. As for ticket prices, they shouldn’t have a dramatic effect (see the link in #1 showing that they are not statistically significant). As for finding the optimal ticket price, that’s a whole other issue.
Tim/9, Definitely market size would be a factor. Market size should be captured in the team “brand” random variable, so the omission shouldn’t confound the results, however. It would be interesting to look at that on its own. I would suspect your theory is right…
Peter, That would be a problem if I didn’t control for team. However, by including the team as a random variable, we avoid this possible confounder.
“For regression to work properly, you need the SD of the error to be constant across all points of your data, and the log transformation does this.”
Which is why I suggested indexing.
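A quick sketch of why either transformation can even out the spread. It assumes the later era is just the earlier era scaled up by a constant factor, which is an idealization, and the attendance numbers are made up.

```python
from math import log
from statistics import mean, pstdev

# Made-up per-game attendance for an early era, and a later era that is
# (by assumption) the same league scaled up sixfold.
early = [4_000, 5_000, 6_000, 3_000, 7_000]
late = [6 * x for x in early]


def logged(values):
    """Natural log of each attendance figure."""
    return [log(v) for v in values]


def indexed(values):
    """Rescale so the era's average equals 100."""
    avg = mean(values)
    return [100 * v / avg for v in values]


transforms = {"raw": list, "log": logged, "index": indexed}
spread = {
    name: (pstdev(f(early)), pstdev(f(late)))
    for name, f in transforms.items()
}

for name, (sd_early, sd_late) in spread.items():
    print(f"{name:>5}: early SD = {sd_early:.3f}, late SD = {sd_late:.3f}")
```

Under that assumption the raw SD grows with the level of attendance, while the logged and indexed series have the same SD in both eras. Real data will not scale this cleanly, so this only shows the mechanism.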
***
Thanks for reminding me, as I will give you an example of why logs don’t work. After lunch.
“Making the playoffs the year before raises a .500 team’s attendance by about 3,000 fans per game - a major boost. Obviously making the playoffs raises hype around the team, and this appears to manifest itself in the form of increased attendance.”
Making the playoffs the previous year, winning the league, winning the world series all matter (in a study I did some years ago). But it’s probably not just a matter of “raising hype.” It’s a matter of raising season ticket sales. Those sales are probably driven much more by previous season’s record (specifically making playoffs, etc.) than by current.
So a refined analysis—if data is available—would look for lagged effects of wins (making playoffs, etc.) on season ticket sales and for in-season effects of current wins on walk-up or game-day sales.
TT/11: Yes. The equation he had included wins this year, wins last year, and opening day payroll.
Sky:
You report the impact of each variable as a number of additional fans per game generated. But to be slightly more precise, don’t your coefficients really tell you the percentage impact? That is, when you say making the prior year’s playoffs is worth 3,000 fans per game, your equation is actually saying it boosts attendance by 13%. So it might be worth just 2,000 fans to the Pirates, but 4,000 fans to the Yankees. Isn’t that right?
However, the log model assumes the impact will be multiplicative (+13%) rather than additive (+3,000). So I was wondering if you’ve looked at the data to see which approach produces a better fit?
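For what it's worth, the arithmetic behind that distinction can be sketched as follows. The 13% and 3,000 figures are the ones quoted above. The two team baselines are hypothetical round numbers, and the coefficient is simply the value a log-attendance model would need in order to imply a 13% boost.

```python
from math import exp, log

# Coefficient on a "made the playoffs last year" dummy that implies +13%
# in a model of log(attendance). Derived from the 13% quoted above.
beta_playoffs = log(1.13)


def multiplicative_boost(base_attendance, beta=beta_playoffs):
    """Extra fans per game under the log model: base * (exp(beta) - 1)."""
    return base_attendance * (exp(beta) - 1.0)


def additive_boost(base_attendance, fans=3_000):
    """Extra fans per game under a linear model: same for every team."""
    return fans


# Baseline at which a 13% boost equals 3,000 fans per game.
implied_average = 3_000 / 0.13

# Hypothetical baselines for a low-drawing and a high-drawing team.
baselines = {"low-drawing team": 15_000, "high-drawing team": 31_000}

print(f"implied average baseline: {implied_average:,.0f}")
for label, base in baselines.items():
    mult = multiplicative_boost(base)
    add = additive_boost(base)
    print(f"{label}: log model +{mult:,.0f}, linear model +{add:,.0f}")
```

The log model ties the size of the boost to the team's baseline, so which model fits better comes down to whether the observed boosts actually scale that way.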
Doing it with # of fans is the most reasonable, it seems to me. When you use population size (in the metro area) as a predictor, you can also make an adjustment for whether there are competing ML teams in the region (LA, NY, Chicago, specifically). Don’t split the population, just put in a dummy for “two teams”.
Guy, You are right. I just converted them into numbers for the average team for 2009. Describing it the way you did probably would have been smarter.
But, yes, the way I have modeled it, the boost is worth more (in absolute terms) to the Yankees than the Pirates, and more to the average 2009 team than the average 1960 team. This makes intuitive sense to me which is why I went with that approach. I’ve tried running a regular linear model as well, although it’s difficult to directly compare the two different fits.
A multiplier effect seems more intuitive to me as well, but it would be nice to know for sure. (And I suppose the answer could be different for different variables.)
But I have a problem with the log model and the assumption of a multiplier effect when it’s used for player salary models (as it often is). It doesn’t make sense to assume that playing catcher, for example, increases salary by X% --the impact should be a fixed amount (the dollar value of the position adjustment). Same for any given quantity of offensive production, which has a specific dollar value—it doesn’t magnify the value of a player’s other contributions.
You young folks may also like these:
Voros
http://www.baseballthinkfactory.org/files/primate_studies/discussion/mccracken_2002-08-19_0/
Zumsteg
http://www.baseballprospectus.com/news/20020815zumsteg.shtml
Looks like the same results as this person got:
http://armchairgm.wikia.com/Predicting_MLB_Attendance:_Multiple_Regression_Analysis_of_MLB_Attendance_and_Ticket_Prices