Monday, May 04, 2009
Additive park factors
Justin presents Patriot’s park multipliers as park addends. I’m pretty sure the correct answer will end up being odds ratios, but treating them as addends should be done just as often as treating them as multipliers.
I think that odds ratios will get you much closer to the correct way to apply a park factor to players than the additive method that Justin sort of advocates.
I also think that if you have a choice only between multiplicative and additive, the former is the better choice by far.
As much as I don’t like regression models for these sorts of things, I’d like to see one for predicting player stats, similar to the one that Dackle shows us above. That would give us a better idea as to whether additive, multiplicative, odds ratio, or some combination is best. I think you would have to play around with a best-fit equation though…
Good job.
Dackle are you actually producing a park+league factor?
The mean for his AL teams is -.03 runs, and -.35 for the NL teams. So, if you want to get the mean to be 0 for each league, subtract the appropriate baseline from each team, e.g., Texas is +.68 and San Diego is -.90.
***
Also, as discussed in the past, we should not look at runs, but at rate stats, like OBP. The reason is that “runs” is really a combination of OBP and SLG, and it works in a non-linear fashion. If LOB counted as partial runs, we wouldn’t have this issue.
That is, if a park adds +.01 HR, 3B, 2B, 1B, BB per PA, the net effect is not half as many runs as a park that adds +.02 of those events per PA.
So, only for runs per game do you actually want to use the multiplicative method. For the component stats, the additive method is better.
Well, crap.
So, your recommendation would be to either
a) use additive component factors, plus some kind of adjustment for PA’s in hitter vs. pitcher parks.
or
b) convert RAA data to absolute runs, apply the multiplicative park factor, and then convert back to RAA?
Just trying to figure out how to do this in an acceptable way.
-j
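For what it’s worth, option (b) might look something like the sketch below. The function name, the league runs-per-PA figure and the sample numbers are all made up for illustration, and the park factor passed in is assumed to be the one that applies to the player’s whole season (i.e., already diluted for road games), not the raw home-park number.

```python
def park_adjust_raa(raa, pa, league_runs_per_pa, park_factor):
    """Option (b): RAA -> absolute runs -> multiplicative park adjustment -> RAA.

    All inputs are hypothetical; park_factor is assumed to be the factor
    that applies to the player's full season of playing time.
    """
    # Runs an average hitter would produce in the same number of PA.
    league_runs = league_runs_per_pa * pa
    # Convert runs above average into absolute runs.
    absolute_runs = raa + league_runs
    # Apply the multiplicative factor to the absolute total.
    neutral_runs = absolute_runs / park_factor
    # Convert back to runs above average.
    return neutral_runs - league_runs


# Made-up example: +10 RAA in 600 PA, league at 0.12 runs per PA, factor of 1.05.
adjusted = park_adjust_raa(raa=10.0, pa=600, league_runs_per_pa=0.12, park_factor=1.05)
print(round(adjusted, 2))
```

Note that dividing the RAA figure itself by 1.05 would give a different answer, which is the reason for the round trip through absolute runs.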
Dackle are you actually producing a park+league factor?
Tango, yes, I just dumped the scores for every game last year into R, so it includes the interleague factors.
Another nifty thing you could do is dump in five years’ worth of games, and if the park changes then it’s considered a new park. In fact you could dump 30 years’ worth of game scores into R and maybe have 50 or 60 parks, and it would accurately compute the park factor relative to all other parks during its existence.
I don’t think that the additive method for components works better than the multiplicative one. For example, I tested it for home field advantage and the multiplicative (which is essentially the same as an odds ratio) method worked much better.
Brian Cartwright recently did work on this, showing park factors by player quality, and the additive ones turned out better.
We had a thread on this a month ago…
I remember, Tango, and the analysis he did is fraught with problems. I can easily show how multiplicative or odds ratio is not only better but much more appropriate, which I guess is the same thing. I’ll see if I can put together some numbers later on tonight…
mgl, is it mine that is fraught with problems? I do welcome constructive criticism. I have not yet finished that study - I see the effects on home runs (which I believe are a unique category) but I have not yet been able to find the math which will give me a useful way to say how much each player’s HR rates are affected by each park, by knowing how each and every batter performed in each park.
I prefer using log5, which is most similar to odds ratio.
Brian, what I meant to say was that any analysis that attempts to see how each park affects different classes of talent (like hi, low, med HR) is difficult and fraught with potential problems - not your analysis necessarily.
For example, let’s say that you looked at players who had hi HR in one year and you wanted to see how they were affected by parks with hi and low HR park factors. For one thing, those players will tend to play in hi HR parks in the first place, so you would want to use “out of sample” testing. Another problem is that when you test low HR (or low triples, for example) players, you have lots of players who hit zero HR’s and that really screws things up.
The Odds Ratio Method is certainly the best way to go.
Given the choice between the two inferior methods (Multiplicative, Additive), Additive is likely the much better approach. Yes, you pay a price for it with guys with 0 or close to 0 HR in a HR park (Juan Pierre, Coors).
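To make the comparison in this thread concrete, here is a rough sketch of the three ways of applying a HR park factor to a hitter’s HR/PA rate. Nothing here comes from the studies mentioned above: the league rate (0.03), the park factor (1.20) and the way the addend and the odds ratio are derived from the multiplicative factor are assumptions for illustration only.

```python
def multiplicative(rate, pf):
    """Scale the player's own rate by the park factor."""
    return rate * pf


def additive(rate, league_rate, pf):
    """Add the park's effect on a league-average hitter to every player."""
    return rate + league_rate * (pf - 1.0)


def odds_ratio(rate, league_rate, pf):
    """Apply the park's effect on the odds scale, then convert back to a rate."""
    def odds(p):
        return p / (1.0 - p)

    # Park odds ratio, defined here (an assumption) as what the park does
    # to a league-average hitter's odds.
    park_or = odds(league_rate * pf) / odds(league_rate)
    new_odds = odds(rate) * park_or
    return new_odds / (1.0 + new_odds)


LEAGUE_HR = 0.03  # hypothetical league HR per PA
PF = 1.20         # hypothetical HR park factor

for label, rate in [("no power", 0.0), ("average", 0.03), ("slugger", 0.07)]:
    print(label,
          round(multiplicative(rate, PF), 4),
          round(additive(rate, LEAGUE_HR, PF), 4),
          round(odds_ratio(rate, LEAGUE_HR, PF), 4))
```

The zero-HR hitter is the case where the methods split most visibly: the multiplicative and odds ratio versions leave him at zero, while the additive version hands him the same bump as everyone else.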
Baseball Hacks (page 400) has a nice method for computing power ratings and park factors using linear regression. I haven’t tried it in Excel, but I think you’d do it something like this: columns 1-30 for the offence, columns 31-60 for the defence and columns 61-90 for the park. Column 91 is the runs scored. Every game gets two rows. So if the Blue Jays beat the Orioles 4-3 in Toronto, then the Jays offence + the Orioles defence + Toronto’s park = 4 runs, and the Orioles offence + the Jays defence + Toronto’s park = 3 runs. So, in the first row, you’d put a 1 under Toronto in the offence section (columns 1-30) and zeros for the other 29 teams, a 1 under Baltimore in the defence section (columns 31-60) and zeros for the other teams, a 1 under Toronto in the park section (columns 61-90) and zeros for the other parks, and then a 4 in column 91. Do the same in the second row for the other half of the game (i.e., Orioles offence + Toronto’s defence + Toronto’s park = 3 runs). Repeat the process for all of the league’s games in the season, and then run your regression (although it’s probably easier to read the article in Baseball Hacks and install R on your computer).
The nice thing about this approach is that it takes care of schedule effects and the problem of the team not playing road games in its own park. Baseball Hacks calculates a multiplicative park factor, but I skipped the step of converting runs scored to logs, so the numbers presented below are additive park factors.
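For anyone who would rather see the setup as code than as spreadsheet columns, here is a rough sketch in plain Python. It is not the Baseball Hacks code and not the R script used for the numbers in this comment. The three-team league, the run effects and the noise-free scores are all made up, and dropping the first team from each dummy group (plus adding an intercept) is an assumption that mirrors a program using one park as the base.

```python
from itertools import permutations


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def additive_park_factors(games, teams):
    """Least-squares fit of runs = base + offence + defence + park.

    games: list of (home, away, home_runs, away_runs).
    Returns park effects in runs, relative to teams[0] (the base park).
    """
    others = teams[1:]  # teams[0] is the reference level in every group
    k = len(others)

    def row(offence, defence, park):
        r = [1.0]  # intercept
        r += [1.0 if t == offence else 0.0 for t in others]
        r += [1.0 if t == defence else 0.0 for t in others]
        r += [1.0 if t == park else 0.0 for t in others]
        return r

    x_rows, y = [], []
    for home, away, home_runs, away_runs in games:
        # Two rows per game, both played in the home team's park.
        x_rows.append(row(home, away, home))
        y.append(float(home_runs))
        x_rows.append(row(away, home, home))
        y.append(float(away_runs))

    p = 1 + 3 * k
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in x_rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(x_rows, y)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)

    park = {teams[0]: 0.0}
    for idx, team in enumerate(others):
        park[team] = beta[1 + 2 * k + idx]
    return park


# Toy league with invented effects (runs per game); scores are noise-free.
teams = ["ANA", "TOR", "BAL"]
true_off = {"ANA": 0.0, "TOR": 0.5, "BAL": -0.3}
true_def = {"ANA": 0.0, "TOR": -0.2, "BAL": 0.4}
true_park = {"ANA": 0.0, "TOR": 0.6, "BAL": -0.4}
BASE = 4.5

games = [
    (h, a,
     BASE + true_off[h] + true_def[a] + true_park[h],
     BASE + true_off[a] + true_def[h] + true_park[h])
    for h, a in permutations(teams, 2)
]

park_effects = additive_park_factors(games, teams)
print({team: round(effect, 3) for team, effect in park_effects.items()})
```

Because the toy scores are built from the effects with no noise, the fit should hand back the invented park effects. With real game scores the same setup gives estimates, and taking logs of runs before fitting would give the multiplicative version instead.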
Here are the rankings by league, with the 2008 park factors from the Bill James handbook for comparison (based on the classic runs scored at home by both teams divided by runs scored on the road). Note that the program uses Anaheim as a base, so the Angels park is rated 0.00 and every other team is scaled off that number.
AMERICAN LEAGUE
In the traditional approach, the park factor for a team like the White Sox is boosted because they play a disproportionate number of road games in pitchers’ parks (eg 18 games last year in KC and MIN, another 3 in Dodger Stadium), whereas a team like Toronto faces the opposite problem (27 games in Bos/NY/Bal, 10 in Texas and Chicago vs 3 “road games” in Tex/Chi for the White Sox—ie Chi doesn’t play a road game in its own park).
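For reference, the traditional calculation being compared against is roughly the following. The function name and the run totals are hypothetical, and published versions may add further adjustments on top of this basic ratio.

```python
def simple_park_factor(home_rs, home_ra, home_g, road_rs, road_ra, road_g):
    """Runs per game by both teams at home, divided by the same figure on the road."""
    home_rpg = (home_rs + home_ra) / home_g
    road_rpg = (road_rs + road_ra) / road_g
    return home_rpg / road_rpg


# Made-up team: 420 scored / 400 allowed at home, 380 / 370 on the road, 81 games each.
pf = simple_park_factor(420, 400, 81, 380, 370, 81)
print(round(pf, 3))
```

The road half of the ratio depends on which parks the team happened to visit, and never includes the team’s own park, which is the schedule problem described above.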
On a related note, if anyone is looking for the proper regression factors, I just wanted to throw this method out for consideration: dump into Excel all of the home and away runs scored and allowed data from Lahman, work out the simple park factors for each team/year, and then use LINEST to predict next year’s park factor based on this year’s park factor. I did this a few years back rather hastily, but came up with the following (Year1 is last year, e.g. 2008; Year2 is two years ago, e.g. 2007; etc.):
Based on one year of data: Park factor = (.519 x Year1) + .482
Based on two years of data: Park factor = (.367 x Year1) + (.291 x Year2) + .343
Based on three years of data: Park factor = (.319 x Year1) + (.228 x Year2) + (.167 x Year3) + .286
So, if the simple park factor was 1.12 in 2008 and 1.07 in 2007, the regressed park factor for 2009 would be (.367 x 1.12) + (.291 x 1.07) + .343 = 1.065. Using only one year of data, the regressed park factor would be (.519 x 1.12) + .482 = 1.063.
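The three equations are easy to wrap in a small helper. The coefficients below are copied from the formulas above; the function and variable names are just for illustration.

```python
# (weights for Year1, Year2, ...), intercept; keyed by number of years of data.
COEFFS = {
    1: ((0.519,), 0.482),
    2: ((0.367, 0.291), 0.343),
    3: ((0.319, 0.228, 0.167), 0.286),
}


def regressed_park_factor(recent_first):
    """Regressed park factor from 1 to 3 simple park factors, most recent first."""
    weights, intercept = COEFFS[len(recent_first)]
    return intercept + sum(w * pf for w, pf in zip(weights, recent_first))


two_year = regressed_park_factor([1.12, 1.07])
one_year = regressed_park_factor([1.12])
print(round(two_year, 3), round(one_year, 3))
```

Run on the 1.12 and 1.07 example, this should reproduce the 1.065 and 1.063 figures worked out by hand.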