Friday, October 03, 2008
Complete Linear Weights, 2008
Colin provides his data for easy access, along with his intro article.
Just to make sure we’re on the same page, are you using:
PA - H - K
to represent outs and
H - 2B - 3B - HR
to represent 1B?
Outs should be AB-H-K.
I think that’s probably the issue here--the abv avg figures don’t pass the smell test. Sizemore +3.5? Anybody -45?
Peter is correct that the park factors are applied to the totals, not the marginals.
***
Peter is incorrect in expecting the same rank ordering for RAA. You expect the same rank ordering for RAA per PA.
A guy with +1 in 1 PA will rank way lower than someone with -1 in 600 PA, when looking at “total” runs created.
I also agree that +42 for Pujols doesn’t pass the smell test, not in the slightest.
Patriot’s right - ABs versus PAs. (Blog post and spreadsheet have been updated.) Now I get 79.4 runs for Pujols. Unsure of where the remaining discrepancy between Peter and me is. (The only thing I know of that he and I are doing differently now is that he’s using the weights rounded to three places, whereas I’m using “unrounded” values.)
Colin - I am using PA -(Hits + Walks + HBP + SO + DP) for Outs as SO and DPs have been given their own linear weights.
Tango - I was incorrect about the rank order having to stay the same. However, I am not incorrect in criticizing adding a fixed value for every PA. It makes absolutely no sense for two players who have identical PAs and identical Linear weights above average to also have identical linear weights above 0 if one player makes more outs than the other. But by adding a fixed value per PA they will. This is clearly wrong.
Here is an illustration to show where Peter and I differ.
Player 1
0: runs above average
400: outs made
600: PA
Player 2
0: runs above average
450: outs made
600: PA
Obviously, the second player has a lot more home runs.
To get “runs created”, I simply add +.12 runs per PA, and I get both at 72 runs created. So, I get both players as: league average players, who came to bat 600 times each, and generated 72 runs each.
Peter on the other hand is suggesting adding say .36 runs per non-out PA. So, Player 1 gets an extra 72 runs to add to his zero to give him 72 runs created. Player 2 gets an extra 54 runs created.
Am I representing you correctly?
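A minimal sketch of the two conversions in this illustration, using the rates quoted above (.12 runs per PA, .36 runs per non-out PA). The function names are hypothetical and are not taken from anyone's spreadsheet.

```python
def rc_fixed_per_pa(raa, pa, runs_per_pa=0.12):
    """Runs created when a fixed run value is added for every PA."""
    return raa + runs_per_pa * pa


def rc_fixed_per_non_out(raa, pa, outs, runs_per_non_out=0.36):
    """Runs created when a run value is added only for PAs that are not outs."""
    return raa + runs_per_non_out * (pa - outs)


# The two players from the illustration: both are 0 runs above average in 600 PA.
players = {
    "Player 1": {"raa": 0, "outs": 400, "pa": 600},
    "Player 2": {"raa": 0, "outs": 450, "pa": 600},
}

for name, p in players.items():
    per_pa = rc_fixed_per_pa(p["raa"], p["pa"])
    per_non_out = rc_fixed_per_non_out(p["raa"], p["pa"], p["outs"])
    print(f"{name}: per-PA method {per_pa:.0f}, per-non-out method {per_non_out:.0f}")
```

Under the per-PA method both players come out equal; under the per-non-out method the player with more outs gets fewer runs created.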
He may also instead want to apply a multiplier to the positive run values. In that case, we have this:
Player 1
0: runs above average
400: outs made
600: PA
120: positive runs
120: negative runs
Player 2
0: runs above average
450: outs made
600: PA
135: positive runs
135: negative runs
So, we add 0.6 runs created per positive run generated. Player 1 gets an extra 72 runs, to add to 0, to get 72 runs created. Player 2 gets an extra 81 runs to add to 0 to get 81 runs created.
I’m not sure which way Peter is advocating.
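The multiplier version can be sketched the same way, assuming the 0.6 multiplier and the positive-run totals from the illustration. The function name is hypothetical.

```python
def rc_positive_multiplier(raa, positive_runs, multiplier=0.6):
    """Runs created when a multiplier is applied to the positive run values only."""
    return raa + multiplier * positive_runs


# Positive-run totals from the illustration: 120 for Player 1, 135 for Player 2.
for name, positive_runs in (("Player 1", 120), ("Player 2", 135)):
    rc = rc_positive_multiplier(0, positive_runs)
    print(f"{name}: {rc:.0f} runs created")
```

Here the player who makes more outs ends up with more runs created, because his positive events are larger in total.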
We cross-posted. However, “This is clearly wrong. “ I see it as “This is clearly right!”.
In any case, please tell me from my post 8 illustration which player gets more RC in your view.
Peter,
The correct way to do it would be to either use AB - H - K (as Patriot said) or to use:
PA -(Hits + Walks + HBP + SO)
given the way I constructed the double play weight. It’s not explicitly a separate term, like the K term, because technically some Ks and CSs are double plays as well. I’m pretty sure I mentioned how I did that in one of my previous articles.
I used PAs and subtracted BBs and HBP rather than using ABs because even though you didn’t include them in your spread sheet I thought you were treating SF and SH as outs.
If you rounded to three digits rather than truncated it shouldn’t make much of a difference.
Colin: “The correct way to do” it would be to use exactly the events you used to generate the equation. As Peter noted, however you treated SH and SF to generate the LWTS, that’s how you should treat them for each player.
Colin - The heading for DPs in your spreadsheet is GDP. I interpreted that as Ground ball double plays. Since you didn’t give values for other double plays in your spread sheet I assumed that you would be properly charging the extra lost value against the runner instead of the batter.
All hitting data is taken directly from Baseball-Reference.com - I scraped it off the league pages yesterday.
SF and SH were ignored in generating the LWTS; presumably all were coded as “2” by Retrosheet. So then the correct term (or at least as close as one can come, given the data available):
PA - H - BB - IBB - HBP - K
That means coding certain plays (reach on error, fielder’s choice) as generic outs - I used all Retrosheet event codes to generate the weights. That shouldn’t be a problem on the whole - I adjusted the values of the events I ended up using to sum to zero on the dataset, to compensate for the missing data. But a player with high ROE totals will be (slightly) underrated.
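As a rough sketch, the generic-out term above could be computed like this. It assumes the BB column holds unintentional walks only, so that subtracting IBB as well does not double count. The stat line in the example is made up for illustration.

```python
def generic_outs(pa, h, bb, ibb, hbp, k):
    """Generic outs as PA - H - BB - IBB - HBP - K.

    Assumes bb counts unintentional walks only. Reach-on-error and
    fielder's-choice plays stay in the total, as described above.
    """
    return pa - h - bb - ibb - hbp - k


# Hypothetical stat line, not a real player's totals.
print(generic_outs(pa=600, h=150, bb=50, ibb=5, hbp=5, k=90))
```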
Colin - Then I don’t understand how you are treating GIDP. Aren’t you using the value for DP that you have in your list of linear weights for them? If you are then you have to subtract them from the generic outs as well.
Peter, Retrosheet doesn’t have a separate event code for the double play. The vast majority of double plays are coded as 2, or “Generic Out.” Some are coded as strikeouts - some are even coded as singles and doubles.
The value of the DP is the value beyond that of an ordinary out. This is consistent with how formulas like Extrapolated Runs and Estimated Runs Produced handle the double play.
Peter, in case you missed it, I replied in post 8…
So in calculating RAA from the spreadsheet you added the value -.587 * 16 to Pujols’ other totals for his double plays?
Retrosheet does have a double play flag so you can create a double play event category or a grounded into double play event category (using Batted Ball Type) for calculating linear weights. I thought that you had done that. Similarly, you can create a separate ROE event.
To get Grounded Into Double Plays with Retrosheet data, set the DP_FLAG to “T” and set the EVENT_TX to *GDP*, asterisks included. Actually you don’t even need the DP_FLAG if you set the EVENT_TX to *GDP*.
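A small sketch of that filter, treating *GDP* as a wildcard pattern on the event text. The field names DP_FLAG and EVENT_TX are the ones given in the comment above; the sample rows are made up for illustration and are not real play-by-play records.

```python
from fnmatch import fnmatchcase

# Hypothetical parsed event rows; a real file would have many more fields.
rows = [
    {"EVENT_TX": "64(1)3/GDP/G6", "DP_FLAG": "T"},  # grounded into double play
    {"EVENT_TX": "S7/L", "DP_FLAG": "F"},           # single
    {"EVENT_TX": "8/F", "DP_FLAG": "F"},            # fly out
    {"EVENT_TX": "K+CS2(26)/DP", "DP_FLAG": "T"},   # strikeout double play, not a GDP
]


def is_gdp(row):
    """True when the event text matches the wildcard pattern *GDP*."""
    return fnmatchcase(row["EVENT_TX"], "*GDP*")


gdp_rows = [row for row in rows if is_gdp(row)]
print(len(gdp_rows), "GDP event(s) found")
```

In this sample the strikeout double play carries the double play flag but does not match the pattern, which is why the flag on its own would not isolate GDPs.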
This
>Also, you can’t just multiply a park factor times a negative value (like below average RAA values) because it alters the RAA value in the wrong direction.
still seems to be a problem
One can also consider the possibility that the park impact should be additive to PA, not multiplicative to the runs created for reasons explained in the second half of the article here:
That’s how I derived my initial values for the DP term; then I subtracted the weighted average value of the underlying event from the DP term.
As for the park factor issue - I’m open to suggestions as to a fix for that. Would it be correct to adjust the inputs separately before applying the LWTS?
Since I already broke down the Retrosheet data by the official statistics, let me see if I can use Baseruns to generate LW and RC for 2008. I’ll use the empirical Linear Weights from the AL 1993-2007 and NL 1993-2007 to generate my Baseruns equations.
If you make the park impact figures additive, then it won’t matter when you apply it.
Let me see if I’m understanding this right. League average is roughly .12 runs per PA. So let’s say that league average at Coors Field is .16 runs per PA. (Which is a number I just made up.) So then the park factor would be -.04 per PA?
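A minimal sketch of an additive park factor along these lines, using the .12 league rate and the made-up .16 Coors rate from the comment above. The function names and the example player are hypothetical, and the sketch ignores the question of how many of a player's PA actually came in the park.

```python
def additive_park_factor(league_r_per_pa, park_r_per_pa):
    """Runs per PA to add to a hitter's total for PAs taken in this park."""
    return league_r_per_pa - park_r_per_pa


def park_adjust(runs, pa_in_park, factor):
    """Apply an additive per-PA park factor to a run total (RAA or runs created)."""
    return runs + factor * pa_in_park


factor = additive_park_factor(0.12, 0.16)
print(f"park factor: {factor:+.2f} runs per PA")

# Hypothetical hitter: -10 RAA before adjustment, 300 PA in the park.
print(f"adjusted RAA: {park_adjust(-10, 300, factor):+.1f}")
```

Because the adjustment is added rather than multiplied, it moves a below-average total in the same direction as an above-average one.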
Actually it would be a pain to calculate LW for 2008 using Baseruns as I’m not at the computer that has all my Baseruns work on it.
Thanks Tango. I had missed your post #8 in the flurry of responses. It will take me a while to formulate my response. For some reason I find this a difficult problem.
Colin: right. Unless you can show us that someone with .20 runs per PA is affected more than someone with .10 runs per PA.
Easiest ways to do it that I can think of:
* Take the R/PA for the league and figure each park’s R/PA using the multiplicative park factors I used.
* Parse the following out of the PBP data for each park:
R/PA(HOME) - R/PA(ROAD)
Any thoughts on this?
Also, I’m considering doing replacement level with this, and here’s what I’ve come up with so far, just doodling with pen and paper.
Tango’s replacement level is -2.25 wins above average, or -23.625 RAA, per 700 PA (you can get more specific by league). Still using .12 R/PA on average:
((700*.12)-23.625)/700 = .086 R/PA
Using the VORP baseline, 80% of league average, gives me .096 R/PA for replacement level. (I actually use different values for each league, but they’re within .01 runs of each other, I believe, so .12 is close enough for explaining.)
So instead of adding .12 per PA to get Runs Above Zero, I should be able to add .086 (or .096) to get Runs Above Replacement, right?
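The two replacement-level rates above can be reproduced with a short sketch, assuming the .12 R/PA league average, the -23.625 RAA per 700 PA figure, and the 80% baseline quoted in the comment. The function names are hypothetical.

```python
LEAGUE_R_PER_PA = 0.12  # rough league average used for explanation above


def replacement_rate_from_raa(raa_per_season, pa=700, league_r_per_pa=LEAGUE_R_PER_PA):
    """Replacement-level R/PA from a runs-above-average figure per `pa` PAs."""
    return (league_r_per_pa * pa + raa_per_season) / pa


def replacement_rate_from_pct(pct_of_league, league_r_per_pa=LEAGUE_R_PER_PA):
    """Replacement-level R/PA as a fraction of the league-average rate."""
    return pct_of_league * league_r_per_pa


print(f"{replacement_rate_from_raa(-23.625):.3f} R/PA from -23.625 RAA per 700 PA")
print(f"{replacement_rate_from_pct(0.80):.3f} R/PA from the 80% baseline")
```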
The 2.25 doesn’t necessarily translate to 23.625. Depends on the run environment.
I use around 74-75% or so. Patriot uses 73%, which is what Clay uses I think. MGL might use 80%. Woolner may say he uses 80%, but if you add up his VORP, he uses something close to 75%, since his VORP matches Clay’s RARP.
Otherwise, yes, what you said.
As a rough guideline, we can use 75% for nonpitchers, 125% for starters and 105% for relievers.
For a 4.67 run per game environment (84 runs per 700 PA, 4.30 ERA), that sets the replacement levels as:
- 63 runs per 700 PA or .09 runs per PA
- 5.40 ERA for starter
- 4.50 ERA for reliever
The total runs above replacement per team:
nonpitcher: (84-63)*9= 189 runs
pitcher: (5.4*.65+4.5*.35 - 4.3)*162= 127 runs
Total RAR = 189+127 = 316
The nonpitcher/pitcher split is 60/40.
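The arithmetic above can be checked with a short sketch that uses the figures as quoted in the comment. Reading the .65 and .35 as the starter and reliever shares of innings is an assumption, as are the variable names.

```python
LEAGUE_RUNS_PER_700_PA = 84   # 4.67 runs per game environment
LEAGUE_ERA = 4.30
LINEUP_SLOTS = 9
GAMES = 162

repl_hitter_runs = 0.75 * LEAGUE_RUNS_PER_700_PA  # 63 runs per 700 PA
repl_starter_era = 5.40                           # about 125% of league ERA
repl_reliever_era = 4.50                          # about 105% of league ERA
starter_share, reliever_share = 0.65, 0.35        # assumed innings split

nonpitcher_rar = (LEAGUE_RUNS_PER_700_PA - repl_hitter_runs) * LINEUP_SLOTS
repl_staff_era = repl_starter_era * starter_share + repl_reliever_era * reliever_share
pitcher_rar = (repl_staff_era - LEAGUE_ERA) * GAMES
total_rar = nonpitcher_rar + pitcher_rar

print(f"nonpitcher: {nonpitcher_rar:.0f} runs")
print(f"pitcher: {pitcher_rar:.0f} runs")
print(f"total: {total_rar:.0f} runs")
print(f"nonpitcher share: {nonpitcher_rar / total_rar:.0%}")
```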
Colin - I used your linear weights and spread sheet and tried to replicate your RAA values and got numbers that were significantly larger than yours (63.8 instead of 42.7 for Pujols). I am in the process of double checking but you may want to as well. Especially since it seems like Pujols should have closer to three times as many RAA as Manny instead of less than twice as many.
Which brings me to another point. You didn’t bother to sum Manny’s and Teixeira’s production for both teams they played for (and likewise for others who played for more than one team), and that left them out of the top ten, where they clearly should be.
Also, you can’t just multiply a park factor times a negative value (like below average RAA values) because it alters the RAA value in the wrong direction.
Finally, the changes in rank order from RAA to ABS demonstrate the absurdity of Tango’s decision to add a fixed run value to all PAs. This should be a simple transformation with no changes in rank order. The only proper method is what I have been arguing for all along: adding a positive run value to only the batting events that don’t result in an out.