THE BOOK--Playing The Percentages In Baseball

Friday, September 10, 2010

Full Bayesian way to do regression to the mean

By Tangotiger, 06:09 PM

I’ll look on Monday…

Hi Tom,

I came across an excellent series of posts (a year old, but new to me) on the fully Bayesian way to do regression to the mean. The example they use is batting average, and they provide some code to replicate what they’ve done in R. The last three posts in particular I thought were fantastic.

http://lingpipe-blog.com/2009/09/09/what-is-bayesian-statistical-inference/
http://lingpipe-blog.com/2009/09/11/batting-averages-bayesian-vs-mle-estimate/
http://lingpipe-blog.com/2009/09/15/moment-matching-empirical-bayes-beta-priors-batting-average/
http://lingpipe-blog.com/2009/09/23/bayesian-estimators-for-the-beta-binomial-model-of-batting-ability/
http://lingpipe-blog.com/2009/11/04/hierarchicalbayesian-batting-ability-with-multiple-comparisons/

Some of this relates to the recent posts by Vic Ferrari about the beta distribution formulation of the Marcels, as well as past work by people like Jim Albert and Brad Null.

- Eli


#1    2010/09/10 (Fri) @ 19:28

Although regression to the mean is a useful shortcut in many instances, it feels much “purer” to me to do the full-blown Bayesian inference.  Thanks for the links.  However, one question bugging me is: why use the beta distribution as the prior?  OK, I actually know the reason: to make the math much easier.  But is there any other justification for choosing such a prior?  I can’t think of any, but then again, I don’t have a better idea either.


#2    David Pinto    2010/09/10 (Fri) @ 21:51

Mickeyg13, major league baseball batting averages are not normally distributed.  The peak of the distribution is closer to the minimum, which is what a beta distribution looks like.


#3    Eli    2010/09/10 (Fri) @ 22:06

I found a few typos in the R/BUGS code if anyone is trying to replicate the results.

For the BUGS model code, change “for (j in 1:J) {” to “for (j in 1:N) {”.

And in the R code, change “list(ability=rep(N,0.30),” to “list(ability=rep(0.30,N),”.


#4    2010/09/10 (Fri) @ 23:08

@ #2 David Pinto

Yep, definitely not a normal distribution.  It looks like the tail end of the general population’s normal distribution.  It follows (in the NBA) the curve of the general population normal distribution until about league mean, and then breaks away before actually peaking a little below the mean.  See http://sonicscentral.com/apbrmetrics/viewtopic.php?p=31675#31675 for my chart of that.


#5    2010/09/10 (Fri) @ 23:13

#2, I understand that MLB batting averages are not necessarily normally distributed, and you described one property that the beta distribution shares with the actual distribution of MLB batting average talent.  I’d like something more to go on than that though (but as I said, I don’t have a better idea).


#6    Tangotiger    2010/09/11 (Sat) @ 09:20

http://www.tangotiger.net/talent.html


#7    DSMok1    2010/09/11 (Sat) @ 11:09

Wow, I didn’t know you had that!


#8    phorever    2010/09/11 (Sat) @ 13:57

ahhh… this stuff is what i meant when i told tango i found markov chains too simple over on backshegoes.com.  love the eventual use of gibbs rather than monte-carlo too, although i suspect that genetic or still more sophisticated algorithms for sampling of the model space will be needed once the dimensionality gets over 5 or so.  oh, and cluster computing would be nice.


#9    Sunny Mehta    2010/09/11 (Sat) @ 17:39

mickey/#1,

Batting average and other sports statistics like OBP, save percentage, etc are binomial proportions (i.e., two outcomes - success and failure which are divided by [success + failure]). Therefore we assume the inherent variance in them to be distributed binomially. The beta distribution is just the continuous form of the binomial distribution.


#10    2010/09/11 (Sat) @ 20:43

Sunny, I appreciate that the *observed performance* of batting average et al follow the binomial distribution.  What I’m saying is that it does not automatically follow that the prior distributions are beta distributions; the prior distributions are related to the spread in talent in the underlying population.  That the beta distribution is the conjugate prior of the binomial distribution makes for some convenient math, but it still feels a little like we are assuming the cow is a sphere.  I’ve made my fair share of such assumptions before so at the end of the day I’m OK with it, but it still feels a little unsettling.
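
For readers who want to see the "convenient math" of the conjugate prior spelled out: with a beta prior and binomial data, the posterior is another beta, and its mean works out to the familiar regression-to-the-mean weighted average. Here is a minimal Python sketch, not taken from the linked posts. The prior Beta(81, 219) (mean .270, weighted like 300 at-bats), the 30-for-100 batting line, and the function names are all assumptions chosen for illustration.

```python
def beta_binomial_update(alpha, beta, hits, at_bats):
    """Posterior Beta parameters after observing `hits` in `at_bats`."""
    return alpha + hits, beta + (at_bats - hits)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior: league talent centered at .270, worth 300 at-bats.
prior_alpha, prior_beta = 81.0, 219.0

# Hypothetical observed line: 30 hits in 100 at-bats (.300).
hits, at_bats = 30, 100

post_alpha, post_beta = beta_binomial_update(prior_alpha, prior_beta, hits, at_bats)
posterior_mean = beta_mean(post_alpha, post_beta)

# The same number written as regression to the mean:
# weight the observed average by AB / (AB + prior "at-bats").
weight = at_bats / (at_bats + prior_alpha + prior_beta)
regressed = weight * (hits / at_bats) + (1 - weight) * beta_mean(prior_alpha, prior_beta)

print(round(posterior_mean, 4), round(regressed, 4))
```

Both routes give the same estimate, which is why the beta prior is convenient. That convenience is separate from the question of whether the beta shape actually matches the spread of talent.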


#11    Sunny Mehta    2010/09/11 (Sat) @ 21:41

edit to my post #9:  second line should read “(i.e., two possible outcomes - success or failure which is divided by [success + failure])”

Also, to expound on the topic, while I’m pretty much okay with assuming the variance in these stats is distributed binomially, I agree with you that the decision to choose a beta curve to represent the ability distribution (i.e. prior) is certainly debatable.

For the models I’ve run on these types of stats, the beta prior works really well. But there’s nothing that says it HAS to work well. There could be a situation where I think the ability distribution is bimodal, multimodal, or shaped like a zebra. And maybe you might disagree totally and think it’s shaped normally, or uniformly, or whatever.

And that’s the beauty of it. We can each build our own models however we like. Perhaps I might say mine is more predictive than yours. And perhaps you might disagree, and we solve the dispute by betting on it. In time we’d come to know which model is better.

:)

(Or maybe we wouldn’t. Maybe one of us would be old and dying and convinced that our model is actually better but that we’ve just run really bad for the past 60 years. And who knows, maybe we’d be right.)


#12    2010/09/12 (Sun) @ 21:38

mickeyg13:

I agree wholeheartedly with everything you’ve said.  The assumption that ability is distributed in beta form is made for convenience only.  There is no reason on earth to think that nature will oblige (though by coincidence alone, it nearly does for several baseball stats).  To me it’s a red flag if an author doesn’t acknowledge that up front.

I think that using terms like ‘full Bayesian inference’, while correct, makes this seem more complicated than it is, though.  We’re just talking simple joint probability here.

The conversation probably reaches a wider audience, one with more insight into the game, if we think of ability in MLB as a histogram.  Then the results of any one player as another histogram (flip of a weighted coin, i.e. binomial).

Plot them out, multiply the heights of all overlapping bins ... voila, joint probability.  A Commodore 64 and tenth grade math is all that’s required from the technical side.  Understanding the game, along with trial and error, moves it forward from there.  Or so I think.
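
The histogram recipe above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the talent histogram is a made-up triangle peaked at .260 (not real MLB data), the batter's 30-for-100 line is hypothetical, and `grid_posterior` is an illustrative name. Because the prior is just a list of bin heights, any shape (bimodal, skewed, whatever) can be dropped in without changing the code.

```python
from math import comb


def grid_posterior(abilities, prior_heights, hits, at_bats):
    """Multiply prior bin heights by the binomial likelihood, then normalize."""
    likelihood = [
        comb(at_bats, hits) * p**hits * (1 - p) ** (at_bats - hits)
        for p in abilities
    ]
    joint = [w * l for w, l in zip(prior_heights, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]


# Ability bins from .200 to .350 in steps of .010.
abilities = [0.200 + 0.010 * i for i in range(16)]

# Hypothetical talent histogram: a triangle peaked at .260 (an assumption).
raw = [max(0.0, 1.0 - abs(p - 0.260) / 0.10) for p in abilities]
prior = [h / sum(raw) for h in raw]
prior_mean = sum(p * w for p, w in zip(abilities, prior))

# Hypothetical observed line: 30 hits in 100 at-bats.
posterior = grid_posterior(abilities, prior, hits=30, at_bats=100)
posterior_mean = sum(p * w for p, w in zip(abilities, posterior))

print(round(prior_mean, 4), round(posterior_mean, 4))
```

The posterior mean lands between the prior mean and the observed .300, which is the regression-to-the-mean effect without any beta assumption.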

