THE BOOK--Playing The Percentages In Baseball


Thursday, September 09, 2010

Using more specific populations for regression

By Tangotiger, 11:19 AM

Fantastic job by Hawerchuk, and applicable for anything you do. 


#1          (see all posts) 2010/09/09 (Thu) @ 11:43

I’m not sure I 100% understand ... he used the prior distribution of shooting talent to estimate the underlying talent, based on observed?

Shouldn’t that create *more* overlap between Kovalchuk and Shelley, rather than less?

Maybe I’ll read it again ...


#2    Tangotiger      (see all posts) 2010/09/09 (Thu) @ 12:02

He used the fact that Kovalchuk’s quantity of shots is so high that he must come from a more select set of shooters, presumably good ones.


#3          (see all posts) 2010/09/09 (Thu) @ 12:06

Ah, got it.  Thanks.


#4          (see all posts) 2010/09/09 (Thu) @ 14:23

Phil

He’s taking the joint probability of A & B, nothing more or less.  This while assuming A & B don’t covary.

In this case ‘A’ is the naive likelihood distribution of the shooter’s ability.  In baseball that’s typically determined binomially by bloggers, at least implicitly.  In this case Gabe went to the trouble of building a hypergeometric model to determine that (shades of post-Red-Sox Bill James).  That’s computationally intensive stuff.  I wouldn’t have thought of that, but I’m a fan of the reasoning.

And in this case ‘B’ is the distribution of shooting talent in the population.  In this case Gabe assumed it was distributed in Gaussian fashion and used the exact equivalent of the Z-scores method for determining the variance.  So that’s very dodgy; I’d advise against using this information for wagering.  Still, it gets us into the neighbourhood.


#5          (see all posts) 2010/09/09 (Thu) @ 15:20

Phil

A strikingly similar analogy, appropriate here because Gabe has provided visual aids for me in his linked post.  Marcel assumes the distribution of talent in the population is always:

g(p): p^{240η-1} * (1-p)^{240(1-η)-1}

where η is the population average.  This would be distribution ‘B’.  This for any hitting stat.  Marcel, somewhat questionably, assumes that the amount of luck in any hitting component (the integer valued at 240) is the same.  Be it singles/BIP or BB/PA or whatever.

Marcel then calculates the naive binomial likelihood for each season.  Which is wholly sensible imo.  He discounts the information from more distant seasons by pretending that they came from smaller samples.  That seems like a crazy idea at first blush, but since Marcel’s goal is to remain uber-simple in form, it’s actually a clever end run around a lot of other effects (managerial, ballpark, platoon, aging, selection bias, etc).

s1(p): p^{BB} * (1-p)^{PA-BB}

s2(p): p^{0.8*BB} * (1-p)^{0.8*(PA-BB)}

s3(p): p^{0.6*BB} * (1-p)^{0.6*(PA-BB)}

We’ll call these A1, A2 and A3 distributions.

Take the simple joint probability of all four distributions and you have the density function of the Marcel forecast.  Since all equations are of the beta form, p^{α-1} * (1-p)^{β-1}, it is trivial math.

Take the mean of that joint pdf, α/(α+β), and voila, you have the Marcel forecast exactly, each and every time.
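To make the bookkeeping concrete, here is a minimal Python sketch of the product described above: the prior g(p) multiplied by the three discounted season likelihoods, with the forecast read off as the mean of the resulting beta density. The function names, the season lines, and the league average η = 0.085 are all hypothetical values chosen for illustration; only k = 240 and the 1 / 0.8 / 0.6 discounts come from the comment.

```python
def marcel_posterior(seasons, eta, k=240.0, weights=(1.0, 0.8, 0.6)):
    """Return (alpha, beta) of the combined beta density.

    seasons: list of (successes, trials), most recent season first.
    eta:     population average rate (assumed known).
    k:       prior "sample size" (240 in the comment above).
    """
    # Prior g(p) is proportional to p^(k*eta - 1) * (1-p)^(k*(1-eta) - 1)
    alpha = k * eta
    beta = k * (1.0 - eta)
    # Each season contributes p^(w*BB) * (1-p)^(w*(PA-BB)),
    # so multiplying the densities just adds the exponents.
    for (successes, trials), w in zip(seasons, weights):
        alpha += w * successes
        beta += w * (trials - successes)
    return alpha, beta


def beta_mean(alpha, beta):
    """Mean of a beta density with parameters alpha and beta."""
    return alpha / (alpha + beta)


# Hypothetical walk totals (BB, PA) for the last three seasons.
seasons = [(60, 600), (50, 550), (40, 500)]
alpha, beta = marcel_posterior(seasons, eta=0.085)
forecast = beta_mean(alpha, beta)
print(f"alpha={alpha:.1f}, beta={beta:.1f}, forecast BB/PA={forecast:.4f}")
```

Note that alpha + beta always equals 240 plus the discounted PA total, which is why the mean collapses to a simple weighted average.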

So, as a simple example, forecasting the BB/PA of a rookie.  If he had 30 walks in 300 PA in his rookie season:

joint pdf: p^{240η+BB-1} * (1-p)^{240(1-η)+PA-BB-1}

So α = 240η+BB
and β = 240-240η+PA-BB
and α+β = 240+PA
and the mean = α/(α+β) = (240η+BB)/(240+PA)

Which is your Marcel forecast (I’m certain that Marcel would be faring better in his contest if he used the mode instead of the mean, but that’s neither here nor there).
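Plugging the rookie's 30 walks in 300 PA into those formulas needs a value for η, which the comment does not give. The sketch below assumes a league BB/PA of 0.085 purely for illustration. It computes the mean, the mode (using the standard beta-density formula (α-1)/(α+β-2), valid when both parameters exceed 1), and the equivalent "regress toward the league average" form.

```python
K = 240.0        # prior weight from the comment above
ETA = 0.085      # ASSUMED league-average BB/PA, for illustration only
BB, PA = 30, 300

alpha = K * ETA + BB
beta = K * (1.0 - ETA) + (PA - BB)

# Mean of the beta density: this is the Marcel-style forecast.
mean = alpha / (alpha + beta)

# Mode of the beta density (requires alpha > 1 and beta > 1).
mode = (alpha - 1.0) / (alpha + beta - 2.0)

# Same mean written as a weighted average of observed rate and league rate.
observed = BB / PA
regressed = (PA * observed + K * ETA) / (PA + K)

print(f"mean={mean:.4f}  mode={mode:.4f}  regressed={regressed:.4f}")
```

With a right-skewed density like this one the mode sits slightly below the mean, which is the gap the parenthetical about using the mode refers to.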

The real test of the reasoning that built this beast, or any other like it ... that would be the resiliency of the joint pdf under scrutiny.

Personally, I don’t give a fig about fantasy leagues or forecasting system results per se. Though you should have borrowed Jim Albert’s numbers from an article he wrote at your BTN publication and entered it in the forecasting contest, just for the helluvit.

I am interested in the thinking, and the thinkers, behind them. 

That all makes sense, no?

Feel free to correct me if I’ve made any mathematical or typographical errors above.


#6    DSMok1      (see all posts) 2010/09/09 (Thu) @ 15:28

“He used the fact that Kovalchuk’s quantity of shots is so high that he must come from a more select set of shooters, presumably good ones.”

I’m pretty sure that’s incorrect, unless I misunderstand you.  I believe he just used the Bayesian prior based on the normal distribution of players, and then multiplied by the observed distribution in a true Bayesian updating sense.  The normal “P(A|B)=[P(B|A)*P(A)]/[P(B)]”.  If the width of the prior distribution is much less than the width of the observed distribution, then P(A|B) is much narrower than the observed distribution.

I looked into some of the math here: http://sonicscentral.com/apbrmetrics/viewtopic.php?p=28221#28221
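The narrowing effect described in this comment is easy to see if both the prior and the observed distribution are treated as normal. That is a simplification made here for illustration (comment #4 says Gabe's likelihood was hypergeometric), and the shooting-percentage numbers below are hypothetical. For two normals the product is again normal, with precisions (inverse variances) adding.

```python
import math


def normal_update(prior_mean, prior_sd, obs_mean, obs_sd):
    """Combine a normal prior with a normal likelihood.

    Returns the mean and standard deviation of the posterior,
    which is proportional to prior density times likelihood.
    """
    prior_prec = 1.0 / prior_sd ** 2
    obs_prec = 1.0 / obs_sd ** 2
    post_var = 1.0 / (prior_prec + obs_prec)
    post_mean = post_var * (prior_prec * prior_mean + obs_prec * obs_mean)
    return post_mean, math.sqrt(post_var)


# Hypothetical numbers: league talent centred at 9% with a 2-point spread,
# and a shooter observed at 14% with a 5-point sampling spread.
post_mean, post_sd = normal_update(0.09, 0.02, 0.14, 0.05)
print(f"posterior mean={post_mean:.4f}, sd={post_sd:.4f}")
```

Because the precisions add, the posterior is always narrower than both inputs, and when the prior is much tighter than the observation the posterior hugs the prior.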


#7    Tangotiger      (see all posts) 2010/09/09 (Thu) @ 15:31

Marcel, somewhat questionably

Not only questionable, but wrong!  But, Marcel is looking to make things very easy.  And, he usually ends up in the middle, or maybe a bit better, in all these forecast tests.

Occam’s Razor.


#8    DSMok1      (see all posts) 2010/09/09 (Thu) @ 15:36

Nicely done, Vic!  I worked through similar math when working on projection systems for the NBA.  (My previous post, which seems to have been flagged as spam, has a link to the thread at APBRmetrics where I discussed this.)

I do note there is the issue that the populations are not normally distributed, since they are actually the tail end of the general population’s normal distribution.  A chart of that fact can be found here, for the NBA: http://sonicscentral.com/apbrmetrics/viewtopic.php?p=31675#31675


#9    DSMok1      (see all posts) 2010/09/09 (Thu) @ 15:43

“Marcel then calculates the naive binomial likelihood for each season.  Which is wholly sensible imo.  He discounts the information from more distant seasons by pretending that they came from smaller samples.  That seems like a crazy idea at first blush, but since Marcel’s goal is to remain uber-simple in form, it’s actually a clever end run around a lot of other effects (managerial, ballpark, platoon, aging, selection bias, etc).”

The “appropriate” way to do this would be to multiply by a transformation distribution that accounts for year-to-year variation--the propagation of error from that multiplied distribution would account for the greater observed variance as more years pass.  In other words, to translate from 1 year to the next, multiply the year’s observed distribution by 1 (or better, the aging factor) +/- a standard deviation that accounts for the yearly uncertainty.  Then multiply again for the next year-to-year transition, etc.
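One way to read that proposal, sketched here under stated assumptions rather than as DSMok1's actual method: treat the observed season as a random variable X and each year-to-year transition as multiplication by an independent factor F with mean equal to the aging factor and a standard deviation for the yearly drift. For independent X and F the moments of the product are exact: E[XF] = E[X]E[F] and Var(XF) = Var(X)Var(F) + Var(X)E[F]^2 + Var(F)E[X]^2. The drift of 0.1 and the season numbers are hypothetical, and turning the variances into relative weights by inverse variance is an extra assumption.

```python
def propagate_one_year(mean, var, aging=1.0, drift_sd=0.1):
    """Moments of X*F for independent X (mean, var) and F (aging, drift_sd**2)."""
    f_var = drift_sd ** 2
    new_mean = mean * aging
    new_var = var * f_var + var * aging ** 2 + f_var * mean ** 2
    return new_mean, new_var


def propagate(mean, var, years, aging=1.0, drift_sd=0.1):
    """Apply the one-year transition `years` times in a row."""
    for _ in range(years):
        mean, var = propagate_one_year(mean, var, aging, drift_sd)
    return mean, var


# Hypothetical season: observed rate 0.10 with sampling variance 0.0003.
obs_mean, obs_var = 0.10, 0.0003
variances = [propagate(obs_mean, obs_var, years)[1] for years in range(4)]

# Inverse-variance weights relative to the current season (an assumption).
relative_weights = [variances[0] / v for v in variances]
for years, (v, w) in enumerate(zip(variances, relative_weights)):
    print(f"{years} years back: variance={v:.6f}, relative weight={w:.2f}")
```

The widening variance is what produces declining weights for older seasons, which is the same qualitative effect as Marcel pretending older seasons came from smaller samples.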


#10    Tangotiger      (see all posts) 2010/09/09 (Thu) @ 16:05

I invite you to Forecasters Challenge 2011, and you can use whatever appropriate methods you like.


#11    DSMok1      (see all posts) 2010/09/09 (Thu) @ 16:14

Thanks, Tango!  I doubt I’ll have time to put together a projection system, however.  I’m focusing on trying to get something together for the NBA--there is a much larger void there.


#12    Tangotiger      (see all posts) 2010/09/09 (Thu) @ 16:29

Looks like I’m wrong about what Gabe did.  I thought he used a different prior for each.


#13    DSMok1      (see all posts) 2010/09/09 (Thu) @ 17:13

“Looks like I’m wrong about what Gabe did.  I thought he used a different prior for each.”

I do that all the time for the NBA--I construct a prior based on the MPG allotted by the coach and the efficiency differential of the team.  I would have included age in the said prior, but it wasn’t statistically significant (I had thought the younger players are given more playing time despite lower production).  I use the prior to stabilize the VORP calculations for players with few minutes played.  http://sonicscentral.com/apbrmetrics/viewtopic.php?p=32043#32043


#14          (see all posts) 2010/09/09 (Thu) @ 22:51

DSMok1 comment 9:

That strikes me as madass.

The question is:  Why do hitters and pitchers see their underlying abilities drift so significantly from September to April, yet appear to be very steady from April to September?

I don’t think we answer that with math effery.  We answer that by rolling up our sleeves and understanding the game better.  Read James, especially his recent stuff.  Tango is very good also (his math not so much, but we’re talking about the game now).  Wyers is also very good, again his expressed thinking and his mathematical representation of same are often at odds with each other.  But the thinking is good, and I can translate it to math myself.

Einstein has many great quotes about the universe, one of my favourites goes something like; ‘pure math is all well and good, but nature is leading us around by the nose’

Another terrific Einstein quote, and he has a whack of them in this vein, and again just by memory; ‘God doesn’t care about our mathematical convenience, he integrates empirically’.


#15          (see all posts) 2010/09/09 (Thu) @ 23:04

Tango

On the off chance that you’re still reading this thread ... do you agree with my explanation of Marcel in his simplest terms?

You can see how he is over-regressing (I hate the term ‘regressing’) in the example, no?  And how the right skew is helping Marcel?  Simple stuff, really.

Of course he under-regresses (cringe!) things like BABIP.  Plus k=240, Marcel’s universal understanding of nature, is awfully close to the spread of OBP (and presumably wOBA or any other linear weights model) so he’s swinging through the heart of the plate, even if he is blindfolded.


#16    DSMok1      (see all posts) 2010/09/09 (Thu) @ 23:42

@ Vic #14.

The “transformation” (wrong term, I’m sure) is the statistical method behind the use of the 5-4-3 or similar decreasing weighting.  The effects are the same, mathematically, but I prefer to look at it all in distribution terms.  Basically, the weights would look more like Oliver’s weights--1, 0.7, 0.7^2, 0.7^3, etc.

Basically, it accounts purely for normal year-to-year changes.  From a baseball perspective, I would presume the reasons for changes would be: aging, changes in team/park, changes in mechanics, changes in training, changes in role, psychological changes (management/teammates).  I prefer to roll these all together into a single distribution of changes, based on looking at year-to-year variation (it looks normally distributed to me).  I’m sure it could be possible to be more granular-- a distribution curve for variation from aging, a distribution curve for variation from changes in coaches, a distribution curve for changing teams, etc.  Probably too hard to derive each, though.


#17    Tangotiger      (see all posts) 2010/09/10 (Fri) @ 09:11

Vic, I’m not qualified to answer your first question.

And yes, while overall Marcel regresses the appropriate amount, he under-regresses things like BABIP, and over-regresses things like K/PA. 

***

He also should include things like park, have a more sensitive model for age, realize that the two leagues are somewhat distinct (especially for pitchers’ K/PA), and accept that minor league stats mean something.

Set against all that purposeful ignorance is its intelligent handling of the three main parameters (simple regression, simple aging, and simple past performance weighting).  Given how well he does, adding the 4th to nth parameter will hardly make a difference in practice, even though logically it should.

It is fun to do though, and I’ve done it, but it is a huge effort in return for almost no practical gain (other than the journey itself).


#18    Tangotiger      (see all posts) 2010/09/10 (Fri) @ 09:16

1, 0.7, 0.7^2, 0.7^3, etc.

I use 0.9994^DaysAgo for hitters.  For pitchers, I use 0.9990.  On an annual basis, this corresponds to 0.80 and 0.70, respectively.

Just a matter of doing a best-fit really.
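The daily-to-annual conversion in this comment can be checked directly. The sketch below raises each daily rate to the 365th power; the function and constant names are made up for illustration, and only the two daily rates come from the comment.

```python
def decay_weight(days_ago, daily_rate):
    """Weight for an observation made `days_ago` days before the forecast date."""
    return daily_rate ** days_ago


HITTER_DAILY = 0.9994    # from the comment above
PITCHER_DAILY = 0.9990   # from the comment above

hitter_annual = decay_weight(365, HITTER_DAILY)
pitcher_annual = decay_weight(365, PITCHER_DAILY)

print(f"hitters:  {hitter_annual:.3f} per year")
print(f"pitchers: {pitcher_annual:.3f} per year")
```

Both values land within a hundredth of the quoted 0.80 and 0.70, so the daily weights are a finer-grained version of the same yearly discounting.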


#19    DSMok1      (see all posts) 2010/09/10 (Fri) @ 09:21

@ Tango/18

Yep, the results are the same whether you’re looking at it from the “best fit” side or from a “transformation” side.  If you’re trying to hang on to all of the information on the width of the expected distribution, approaching everything from the “transformation” viewpoint is helpful.  The mean of the distribution behaves just as the “best fit” setup you show does.

