THE BOOK--Playing The Percentages In Baseball


Monday, September 29, 2008

End of year Sabermetric stats

By Tangotiger, 03:21 PM

Courtesy of Patriot.

I don’t really have much to add.  Patriot noted that he uses a 73% offensive replacement level, likening it to a .350 OW%.  Using PythagenPat and 4.5 RPG per team, I get .364.  No biggie.  Just wanted to point out that he should probably be saying .360 not .350.  However, what if you look at it as one replacement guy with 8 average guys?  In this case, the team has a .486 win%, putting our replacement level at -.014 wins per game (or, more accurately, per one-ninth-of-a-game slice).  Adjusted to a per-game basis, that’s -.014 times 9, which equals -.126, or a .374 win%.

That is, rather than presuming 9 replacement-level hitters with a team of average defense, we presume 1 replacement-level hitter, 8 average hitters, and average defense.  That gives you a .486 win%.  The marginal impact is .014, which you “annualize” by multiplying by 9.  Kinda like ERA for relievers.  Anyway, to get it to my replacement level, I’d use 74% or 75%.  We’re pretty much in agreement here.
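A rough sketch of the arithmetic above, for anyone who wants to reproduce it. The PythagenPat exponent constant (0.287 here) is an assumption, since the post does not say which value was used; values near 0.28 to 0.29 are common and move the results by a point or two in the third decimal.

```python
def pythagenpat_win_pct(runs_scored, runs_allowed, pat_exponent=0.287):
    """Win% from per-game runs, with the exponent tied to the run environment.

    pat_exponent is an assumed constant; the post does not state one.
    """
    rpg = runs_scored + runs_allowed      # combined runs per game, both teams
    x = rpg ** pat_exponent               # PythagenPat exponent
    return runs_scored ** x / (runs_scored ** x + runs_allowed ** x)


LEAGUE_RPG = 4.5      # runs per game per team, from the post
REPLACEMENT = 0.73    # replacement hitters produce 73% of league-average runs

# Nine replacement-level hitters with average defense.
nine_repl = pythagenpat_win_pct(REPLACEMENT * LEAGUE_RPG, LEAGUE_RPG)

# One replacement-level hitter, eight average hitters, average defense.
one_slot_runs = LEAGUE_RPG * (8 + REPLACEMENT) / 9
one_repl = pythagenpat_win_pct(one_slot_runs, LEAGUE_RPG)

# Marginal loss from one lineup slot, "annualized" by multiplying by 9.
annualized = 0.5 - 9 * (0.5 - one_repl)

print(f"nine replacement hitters:  {nine_repl:.3f}")
print(f"one replacement hitter:    {one_repl:.3f}")
print(f"annualized from one slot:  {annualized:.3f}")
```

The gap between the first and third numbers is the point of the exercise: because the win% curve is not linear in runs, stacking nine replacement hitters costs more than nine times what one does.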

With starters, if we repeat this process, but presume 5.4 IP per replacement start, and the bullpen gives him average support, then Patriot’s 125% gives you a starter win% of .390.  To make it .380, you’d want 127% or 128%.  So, 125% is perfectly fine.

For relievers, it’s the same process as hitters, if you presume 1 IP per replacement relief.  You’d want 106% or 107% of league average.
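The pitcher versions can be sketched the same way. This is one reading of the process, not necessarily the exact calculation behind the numbers above: the pitcher allows a multiple of league-average runs over his share of a 9-inning game, average teammates cover the rest, and the marginal loss is scaled up by 9 divided by his innings. The exponent constant and the scaling step are assumptions.

```python
def pythagenpat_win_pct(runs_scored, runs_allowed, pat_exponent=0.287):
    """Win% from per-game runs; pat_exponent is an assumed constant."""
    x = (runs_scored + runs_allowed) ** pat_exponent
    return runs_scored ** x / (runs_scored ** x + runs_allowed ** x)


def pitcher_slice_win_pct(ra_multiplier, innings, league_rpg=4.5):
    """Win% implied by a pitcher who allows ra_multiplier times league-average
    runs over `innings` of a 9-inning game, with average support otherwise."""
    share = innings / 9.0
    team_ra = league_rpg * (share * ra_multiplier + (1.0 - share))
    team_wpct = pythagenpat_win_pct(league_rpg, team_ra)
    # Scale the marginal loss from the pitcher's share up to a full game.
    return 0.5 - (0.5 - team_wpct) / share


starter_125 = pitcher_slice_win_pct(1.25, 5.4)
starter_127 = pitcher_slice_win_pct(1.27, 5.4)
reliever_106 = pitcher_slice_win_pct(1.06, 1.0)

print(f"starter at 125%:  {starter_125:.3f}")
print(f"starter at 127%:  {starter_127:.3f}")
print(f"reliever at 106%: {reliever_106:.3f}")
```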

Anyway, basic core agreement, with just a smidge of disagreement on the peripherals.


#1    Patriot      (see all posts) 2008/09/29 (Mon) @ 16:24

You are absolutely right about the OW%; it’s around .350 if you use a fixed exponent of 2, which we both know is wrong but I’m still in the habit of using 2 as the default value for an unspecified run context.

If anyone is solely interested in the results, don’t bother clicking the link, because they won’t be there for a couple days at least.


#2    MGL      (see all posts) 2008/09/29 (Mon) @ 20:09

Very nice job by Patriot.  Very nice.  Don’t bother reading the whole article (an explanation of his various methodologies) unless you have lots of time and a good background in these kinds of sabermetrics though.

One minor quibble, if you can even call it that, with the PF’s.  While it is nice that he uses multi-year PF’s (up to 5 years, as he states) and then regresses them, I don’t like the idea of blindly regressing all PF’s to 1.00.  We know lots of things about parks and we shouldn’t be regressing them all to 1.00.  That would be like regressing every hitter’s HR rate to league average, whether they be Eugenio Velez (about 150 pounds, soaking wet) or Prince Fielder (about 250 pounds, after spending a week in a sauna).

There are lots of things he can do to use different means for the regressions.  For example, compute the total area of the park and then assign different means to different categories of size (say .98, 1, 1.02 for a large, medium, and small park).  Do the same for altitude or average temperature.  Wing it.  Etc.

For example, do you really want to regress Coors Field sample PF to 1.0?  Or even TEX stadium?  Of course not.

No big deal, but he will end up “compressing” the PF’s a little too much and unnecessarily so. 

(When I do my component park factors, I do indeed use the “category method.” I use the foul territory size to regress foulout rates toward different means.  The area of the OF sections and fence heights for HR and Dbls rates.  The altitude and average temp for HR and Dbls rates.  Etc.)
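To make the "category method" concrete, here is a minimal sketch using the .98 / 1.00 / 1.02 means suggested above for large, medium, and small parks. The weight on the observed factor is a hypothetical parameter chosen only for illustration, and the function name is made up.

```python
# Prior means by park size, taken from the example in the comment above.
CATEGORY_MEANS = {"large": 0.98, "medium": 1.00, "small": 1.02}


def regress_toward_category(observed_pf, weight_on_observed, size_category):
    """Shrink an observed park factor toward the mean for its size category.

    weight_on_observed is hypothetical: 1.0 keeps the sample value as is,
    0.0 replaces it with the category mean.
    """
    prior = CATEGORY_MEANS[size_category]
    return prior + weight_on_observed * (observed_pf - prior)


# A small park with a sample PF of 1.10, keeping 80% of the sample signal.
to_category = regress_toward_category(1.10, 0.8, "small")
to_neutral = regress_toward_category(1.10, 0.8, "medium")   # mean of 1.00

print(f"regressed toward 1.02: {to_category:.3f}")
print(f"regressed toward 1.00: {to_neutral:.3f}")
```

Regressing toward 1.00 pulls the small park a little further in than regressing toward its own category mean, which is the extra "compression" being described.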


#3    Patriot      (see all posts) 2008/09/29 (Mon) @ 20:38

I’m certainly not going to disagree with MGL on regression.  It would be preferable to use the park’s characteristics as part of the process.

In lieu of evidence to the contrary, though, I think that park factors with regression to a mean of 1 are better than park factors with no regression (particularly for less data, of course).  And, again, inferior to those using different means.

Also, when I apply the PFs, I round to two decimal places.  Going back to last year’s PFs, there were 28 parks for which I was using five years of data (there were probably some dimension changes that I missed somewhere along the line).  When rounding to two places, only 9 of the parks have the regression change the PF, and all by .01 towards 1.

I suppose that another refinement could be to weight the recent data more heavily; I weight all five years equally.


#4    MGL      (see all posts) 2008/09/29 (Mon) @ 22:50

I don’t see any reason to weight the years for the PF’s.  Unless there were some subtle changes in the parks (like when PHI and SD slightly changed the OF dimensions), and even then it is probably not worth it and there are better ways to do it.  Or global warming trends.

Anyway, I agree that using 1.00 to regress towards is just fine. It just rubs me a little the wrong way to regress the COL PF to 1.00 when I KNOW that the true PF cannot be 1.00.

BTW, how did you come up with the regression amounts for 1,2,3,4, and 5 years of data?


#5          (see all posts) 2008/09/29 (Mon) @ 23:04

5,487 words of nerdy goodness.


#6    Patriot      (see all posts) 2008/09/29 (Mon) @ 23:38

Believe it or not, they are based on a rule of thumb offered by you on FanHome many years ago.  So maybe the better question is how did you come up with the regression amounts? (grin) Or why haven’t I studied this myself?

I am actually shocked that I did not mention that in the post.  On my webpage, the explanation of the stats (see link) does include an acknowledgment that your rule of thumb was my source.

This is what you posted then:

Here’s a decent rule of thumb set of formulas for regressing. For 1-year stats, true PF(TPF)=1-(1-PF)*.6, 2-year stats, TPF=1-(1-PF)*.7, 3-year stats, TPF=1-(1-PF)*.8, and for 4-year or more stats, TPF=1-(1-PF)*.9.
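The rule of thumb quoted above translates directly into a small lookup. The function and table names are made up for this sketch; the multipliers are the ones in the quote.

```python
# Multipliers from the quoted rule of thumb; 4 or more years uses 0.9.
RULE_OF_THUMB = {1: 0.6, 2: 0.7, 3: 0.8}


def regress_park_factor(pf, years):
    """Estimate the true park factor as TPF = 1 - (1 - PF) * multiplier."""
    if years < 1:
        raise ValueError("need at least one year of data")
    multiplier = RULE_OF_THUMB.get(years, 0.9)
    return 1 - (1 - pf) * multiplier


for years in (1, 2, 3, 4, 5):
    print(years, round(regress_park_factor(1.10, years), 3))
```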


#7    terpsfan101      (see all posts) 2008/09/30 (Tue) @ 00:59

Is MGL’s regression equation for the actual “Park Factor” or the “Park Adjustment” that you apply to the stats? For instance, if Fenway has a PF of 1.10, you apply a “park adjustment” of 1.05.

The biggest gray area with Park Factors is knowing what constitutes a “subtle change” as MGL put it. KJOK’s Park Database has dimensions for each park, but I noticed his database misses a few of these changes. For instance, Comiskey Park moved their fences in 5 or 6 years ago, and KJOK still has the original dimensions for the New Comiskey Park. I compiled a list of ballpark changes from my old Total Baseball CD-ROM and Ballparks.com a few years ago. If I ever find it, I’ll combine my work with KJOK’s.


#8    terpsfan101      (see all posts) 2008/09/30 (Tue) @ 02:08

The regression gets applied to the park factor. I don’t know why I was confusing this with the park adjustment factor.


#9    Patriot      (see all posts) 2008/09/30 (Tue) @ 10:09

It doesn’t matter.  Taking your example, if we regressed 1.1 we would get

1 - (1-1.1)*.9 = 1.09, for an “adjustment factor” of 1.045.

Or we could take

1 - (1-1.05)*.9 = 1.045

Of course, 1 - (1-PF)*x can be simplified to x*PF + (1-x).
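A quick check that the order really does not matter. The sketch assumes the "adjustment factor" is the average of the park factor and 1.00 (half the games at home), which matches the 1.10 to 1.05 example; the function names are made up.

```python
def regress(value, x):
    """x * value + (1 - x), the simplified form of 1 - (1 - value) * x."""
    return x * value + (1 - x)


def adjustment_factor(pf):
    """Assumed definition: half the games are at home, half in neutral parks."""
    return (pf + 1) / 2


pf, x = 1.10, 0.9
regress_then_halve = adjustment_factor(regress(pf, x))
halve_then_regress = regress(adjustment_factor(pf), x)

print(round(regress_then_halve, 4), round(halve_then_regress, 4))
```

Both steps are linear and both leave 1.00 unchanged, so they commute.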


#10    MGL      (see all posts) 2008/09/30 (Tue) @ 11:36

I think I just looked at y-t-y correlations to approximate the regression amounts.

The White Sox changed the dimensions at US Cellular in 2001.

I usually use the Clem or Munsey web sites for dimensions. Munsey usually catches all the changes but not always.

I don’t think the subtle changes are all that big a deal, and I definitely would not use a weighted average (thus reducing effective sample size) for any park.  There is NO reason to do that for a park that has not changed.  If, for example, I have a regressed factor of .92 for Petco and they move a part of the fence in (6 feet in RC), as they did in 06, I make it .93 or .94, or something like that.  Sure, after 09, you might use only 4 years of data rather than 5, including 05, but it depends on how subtle or not the change is.

Sometimes when a park changes its dimensions considerably, you have to treat it as a whole new park.  For example, when the Dodgers completed the new seats behind home plate, thus drastically reducing the amount of foul territory, it went from one of the best pitchers parks in BB, pre-renovation, to near neutral, after the renovation was completed.  For that, you either have to treat it as a new park, make some serious adjustments, or weight the years, I guess.


#11    Tangotiger      (see all posts) 2008/09/30 (Tue) @ 11:45

I just want to say something about player representation.  It is not a given that each park has a representative sample of the population of MLB players.  Indeed in SF, we know that the population of LH power hitters was severely overrepresented.  And, we know that in the 1970s/80s, gazelles abounded.  The % of players that are RHH or switch hitters or LHP changes over the years.

And so, when we look at 1yr or 5yr or 20yr park factors, there could be a reason that you would want to over or underweight those years.

The players in that park in those years are not necessarily representative of the players who do play in 2008 in that park.


#12    terpsfan101      (see all posts) 2008/09/30 (Tue) @ 15:23

Thanks for the Park Factor advice guys.


#13          (see all posts) 2008/09/30 (Tue) @ 18:47

I wrote about park factors a few months ago at Seamheads.
Article http://seamheads.com/blog/2008/06/10/truer-park-factors/
Table of results http://seamheads.com/db/ML_Park%20Factors_1954-2007.xls

MGL - when changes are made to a stadium, I treat it as new, by adding a “version” field.

Tango - I used matched pairs of ballparks, year by year, from the Retrosheet pbp, so I hope that addressed the problems of which players were producing the stats.

In the article, there’s a section where I did an rms test for the NL 1985-1991, a seven year period when there were no changes in teams, schedules or ballparks. I used the seven year totals as true, and then compared to that single year values, any two consecutive years, and any three consecutive years. I believe it would be valid to select any random combination of years as well, but at the time this was in Excel, so I had limitations. Anyway, there’s a table that shows the rms for each component based on the sample size in units of seasons.

In the example of Petco, I have that they moved in CF 9 ft, from 411 to 402. The v1 (2006) HR pf was 0.74, and the v2 (2007) HR pf was 0.90. Both had a sample size of 1 season, for which I have a rms of 0.15. When I add in 2008, I’ll have a better estimate for the current v2.


#14          (see all posts) 2008/09/30 (Tue) @ 19:56

Checking raw hr totals for Petco, not as a pct of balls hit fair, there were 136 at home, 183 on the road, factor of 0.74. Combined with 0.90 from 2007, then v2 is approx 0.82 for two seasons. v1 was 0.74 for one season.


#15    Tangotiger      (see all posts) 2008/09/30 (Tue) @ 21:40

Brian, why not match by ballplayers as well, similar to how I did the HR year-by-year factors?


#16          (see all posts) 2008/09/30 (Tue) @ 22:06

I’ve thought of that.

My original make table query from the Retro pbp did not include batters or pitchers, it was adhoc for park factors. I thought later that by adding batter, bat hand, pitcher and pitch hand to the same query I could have everything I need for all the common splits, plus take another look at park factors.

After I have done this, one of the things I’ll be looking at is HR park factors broken down by HR% of the batter, probably a TTL estimate for that season. Theory is that high HR% batters are not as affected by the ballpark, because they hit the ball further.

So many ideas, so little time...and I’ll probably be back on overtime after Thanksgiving.


#17    MGL      (see all posts) 2008/09/30 (Tue) @ 23:34

After I have done this, one of the things I’ll be looking at is HR park factors broken down by HR% of the batter,

Careful with this!  You are going to have all kinds of selective sampling problems with that approach.  Players with high HR% are going to tend to come from high HR parks in the first place, etc.

I have tried to do this kind of thing for years and have never had much success.

On the other hand, I and other researchers have shown that there is a tendency for parks to affect HR both in an additive and a multiplicative manner.  IOW, if a player normally hits 5 HR (in all parks) and he plays in a high HR park, he gets another HR plus a percentage boost.  Now whether that means that a high HR park (or low one) is “better” or “worse” for a high or low HR hitter is a matter of semantics.

Plus, what we really want to do is to see how parks affect fly ball distance and then apply that to a player’s park neutral fly ball distribution. Even that does not work exactly, as players change their approach, and pitchers change their approach against certain batters, in different parks.  For example, if you pitch to Jason Kendall in any park, you don’t fear a HR so you pitch him normally.  If you pitch to Dunn, you might pitch him quite differently in SF on a cold night and in Wrigley with the wind blowing out.


#18          (see all posts) 2008/10/01 (Wed) @ 00:41

Well, it’s something I want to play with and see if any trend can be found.

Some preliminary work suggested that players with high HR% had a smaller home/road ratio than all players.

In an anecdotal case which illustrates the principle, in 2007 Jimmy Rollins hit 30 HRs, none longer than 410 feet. Prince Fielder hit 50, 30 longer than 410 feet. If we had a hypothetical ballpark 410 all around, Fielder would keep 60% of his HRs, Rollins 0%.

I have discussed this with Greg at hittracker, and may have some part in his now listing how many parks each homer would have been out of. Also, Colin ran some numbers for me in order to determine the GameDay hit location coordinates for each ballpark, and that has progressed well.

So, slowly taking looks from several angles, but right now I want to know what’s the most accurate way to apply a homerun park factor to a batter.


#19    terpsfan101      (see all posts) 2008/10/10 (Fri) @ 02:22

I had a couple of questions about calculating Park Factors.

1. I used to calculate the “Innings Pitched Corrector” (IPC) using the method described in Total Baseball. If the IPC was greater than 1, then estimated innings on the road were higher, and if less than 1, estimated innings at home were higher.

Seeing that you guys (Tangotiger, Patriot, Brian, MGL) don’t use it, I guess that I won’t use it either. Unless you guys are using Runs/Inning for the years that we have PBP data. Then, there is no need for an IPC. Unfortunately I deleted all my PBP data after I compiled it.

2. I’ve done this next part before, but I wasn’t happy with the way I did it. I used a new park factor when the park changed dimensions and it affected run-scoring. This was determined in an almost purely subjective fashion. I’d like to be a little more objective this time around. Say a team only moves their RF fence from 340 to 335 and nothing else changes. Should I use a new park factor here? Is there some rule of thumb I can follow? Like 5 ft., no change in PF; 10 ft., new park factor.


#20          (see all posts) 2008/10/10 (Fri) @ 02:50

I’m not familiar with the innings pitched corrector. So far I’ve been working with batting data, but will soon be processing pitching, so it might be something I need to be aware of.

As for whether a change is important, assume it is and run the query to see if there’s an effect. Logically, I would say that 340 to 335 probably would because that’s an area where balls are likely to be hit. The White Sox park had a large upward move in its HR factor by pulling in the corners. In contrast, Miami moved in their cf 10 or 20 feet, from huge to really big, and it didn’t seem to make any difference, as there probably just weren’t any balls hit that far to that location.


#21    terpsfan101      (see all posts) 2008/10/10 (Fri) @ 15:45

The rule that I’m following is that any change in dimensions, will result in a new Park Factor. So if a team moves a fence back 2 ft., I’d use a new Park Factor.


#22          (see all posts) 2008/10/10 (Fri) @ 18:11

Theoretically, yes, but practically it takes about 3 seasons of data to get an accurate reading on all components. If there’s a change, test it, but you may have to overlook some smaller changes to avoid having a string of one year factors when it’s not absolutely necessary.


#23    terpsfan101      (see all posts) 2008/10/10 (Fri) @ 20:15

OK, I’ll look at the minute changes more closely.


#24    terpsfan101      (see all posts) 2008/10/10 (Fri) @ 21:36

Brian,

Do you mind sharing your Park Versions for years prior to 1954?


#25          (see all posts) 2008/10/11 (Sat) @ 00:56

I can send that to you.

I started with the KJOK ballparks database, and proofed it with other sources, catching some changes that were not noted.

However, since I used Retro PBP for the factors, I don’t have any of those prior to 1954. I will update with the latest Retro releases as soon as I have a chance.


#26    terpsfan101      (see all posts) 2008/10/11 (Sat) @ 03:19

Thanks for sending the table Brian. It will save me a great deal of time.


#27    terpsfan101      (see all posts) 2008/10/15 (Wed) @ 04:25

It took me 4 days to calculate Park Factors for every team from 1871-2008. Most of this time was spent assigning PF Versions. A team would get a new PF Version when they moved to a new stadium or the dimensions of their current stadium were significantly altered. Brian Cartwright’s Configuration table was a big help here. I calculated PF’s for Runs and HR. The denominator was Games. I probably should have used Outs for the Retrosheet Years. However, I would have then been mixing Outs and Games when a PF version overlapped into both the Retrosheet and Non-Retrosheet Years. I considered using Total Baseball’s Innings Pitched Corrector (IPC) as well, but decided against it. The PF’s were regressed using MGL’s regression equations.

Once I get everything organized, I’ll post the results on Google Docs. I’ll also post the formulas I used as well. If you haven’t figured it out yet, I am very unorganized, especially when it comes to presenting data. I will try to keep things minimal, but I like to present all of my work for credibility purposes.


#28    Tangotiger      (see all posts) 2008/10/15 (Wed) @ 10:15

You have to use Outs, even if it means mixing games and outs for the overlapping years.  You use the best information you can.

And the correction factor would be a plus.  Whether that correction factor is the best one, I don’t know.  It would be easy enough to test, by simply running it against the Retrosheet years, and see if there is a bias or not.


#29    terpsfan101      (see all posts) 2008/10/15 (Wed) @ 14:24

The formula for the IP Corrector is:

(18.5 - Wins at home / Games at home)
-------------------------------------
(18.5 - Losses on road / Games on road)

“If it is greater than 1, this means the innings pitched on the road are higher because the other team is batting more often in the last of the ninth.”

“Note: 18.5 is the average number of half-innings per game if the home team always bats in the ninth.”

So I guess the best course of action would be to use Outs from 1956 to 2007. Then use the IPC for pre-1956 seasons and 2008.
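For reference, the IPC formula as written above comes out as a one-liner. The sketch assumes each division happens before the subtraction from 18.5, which is the natural reading of the layout; the team records in the example are hypothetical.

```python
def innings_pitched_corrector(home_wins, home_games, road_losses, road_games):
    """IPC as quoted: (18.5 - W_home/G_home) / (18.5 - L_road/G_road)."""
    return (18.5 - home_wins / home_games) / (18.5 - road_losses / road_games)


# Hypothetical team: 50 wins in 81 home games, 41 losses in 81 road games.
ipc = innings_pitched_corrector(50, 81, 41, 81)
print(round(ipc, 4))

# When the home win rate equals the road loss rate, the corrector is exactly 1.
print(innings_pitched_corrector(40, 80, 40, 80))
```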


#30    terpsfan101      (see all posts) 2008/10/15 (Wed) @ 15:38

The reason I didn’t use the IPC was that I didn’t understand the formula. If somebody could explain to me what the formula is doing, then I will look into testing it and making it more accurate. Is it a hypothetical formula? Is there even such a thing as a hypothetical formula?


#31    Tangotiger      (see all posts) 2008/10/15 (Wed) @ 16:36

The idea is that if you win at home, then you won’t bat in the bottom of the last inning that much.

So, say that you win 50% of your home games.  Then, in the games that you lose, you will come to bat almost one more inning than in the games that you win.  “Almost” because sometimes you win in the bottom of the inning with 1 or 2 outs.

Really, all you need to do is run a regression on:
parm1 = outs in game when at home
parm2 = win or loss at home

So, you can have:
outs, winloss
54, 0
53, 1
58, 1
51, 1

etc, etc.

Do the same for when on the road.

Should be pretty straightforward to be able to infer number of outs made per game, based on whether you won a game or not.
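A bare-bones version of that regression, with no libraries. With a single 0/1 predictor, ordinary least squares just recovers the two group means: the intercept is the average outs in losses, and intercept plus slope is the average outs in wins. The game rows below are hypothetical, not Retrosheet data.

```python
def fit_outs_on_win(rows):
    """Least-squares fit of outs = intercept + slope * win, with win in {0, 1}.

    rows is a list of (outs_in_game, win_flag) pairs.
    """
    n = len(rows)
    mean_win = sum(win for _, win in rows) / n
    mean_outs = sum(outs for outs, _ in rows) / n
    sxy = sum((win - mean_win) * (outs - mean_outs) for outs, win in rows)
    sxx = sum((win - mean_win) ** 2 for _, win in rows)
    slope = sxy / sxx
    intercept = mean_outs - slope * mean_win
    return intercept, slope


# Hypothetical home games: (total outs in the game, 1 if the home team won).
home_games = [(54, 0), (57, 0), (54, 0), (51, 1), (53, 1), (52, 1), (58, 1)]

intercept, slope = fit_outs_on_win(home_games)
print(f"outs per home loss: {intercept:.2f}")
print(f"outs per home win:  {intercept + slope:.2f}")
```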


#32    terpsfan101      (see all posts) 2008/10/15 (Wed) @ 16:58

Tangotiger,

I should be able to group the data in the way you have suggested. Once I group the data, perhaps you could help me with the regression aspect.


#33    terpsfan101      (see all posts) 2008/10/15 (Wed) @ 18:00

Here are the Home and Away W-L records grouped by OUTS/G:

http://spreadsheets.google.com/pub?key=pzy9IhjJPqavPXI6WTAwoeg

Note: Data was taken from Retrosheet Game Logs from 1956 to 2007.


#34    terpsfan101      (see all posts) 2008/10/15 (Wed) @ 18:20

Ignoring tie games, here are the results:

OUTS/G: 53.70
OUTS/HOME WIN: 52.38
OUTS/HOME LOSS: 55.25
OUTS/AWAY WIN: 55.25
OUTS/AWAY LOSS: 52.38


#35    terpsfan101      (see all posts) 2008/10/15 (Wed) @ 18:35

Tie games should have been included:

OUTS/G: 53.70
OUTS/HOME WIN: 52.38
OUTS/HOME LOSS: 55.25
OUTS/TIE: 52.72
OUTS/AWAY WIN: 55.25
OUTS/AWAY LOSS: 52.38

I guess that these will be the numbers I use to estimate innings for pre-Retrosheet years.
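One way those averages could be turned into an outs estimate for a season with no play-by-play data. This is a sketch only: the function name and the example record are hypothetical, and the figures are treated as total outs per game for both teams combined, which is what the 53.70 overall average suggests.

```python
# Average outs per game from the 1956-2007 game logs, as listed above.
OUTS_HOME_WIN = 52.38
OUTS_HOME_LOSS = 55.25
OUTS_ROAD_WIN = 55.25
OUTS_ROAD_LOSS = 52.38
OUTS_TIE = 52.72


def estimated_outs(wins, losses, ties, at_home):
    """Estimate total outs played in a team's home (or road) games from W-L-T."""
    if at_home:
        per_win, per_loss = OUTS_HOME_WIN, OUTS_HOME_LOSS
    else:
        per_win, per_loss = OUTS_ROAD_WIN, OUTS_ROAD_LOSS
    return wins * per_win + losses * per_loss + ties * OUTS_TIE


# Hypothetical team: 50-31 at home, 35-46 on the road, no ties.
home_outs = estimated_outs(50, 31, 0, at_home=True)
road_outs = estimated_outs(35, 46, 0, at_home=False)

print(round(home_outs, 2), round(road_outs, 2))
print(f"home/road outs ratio: {home_outs / road_outs:.4f}")
```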


#36    terpsfan101      (see all posts) 2008/10/16 (Thu) @ 06:38

Park Factors 1871-2008:

http://spreadsheets.google.com/pub?key=pzy9IhjJPqasyNfGRqHZrUQ

Raw Park Data 1871-2008:

http://spreadsheets.google.com/pub?key=pzy9IhjJPqavYvMm-3w5-Rg

Any input on the methods I used here would be greatly appreciated.


#37    terpsfan101      (see all posts) 2008/10/16 (Thu) @ 17:37

I was thinking of expanding on MGL’s regression equations. Obviously, you wouldn’t want to regress a 4 year PF the same amount as a 20 year PF like I did. And I should probably regress some of the 19th century PF even more.


#38    terpsfan101      (see all posts) 2008/10/16 (Thu) @ 18:23

I’m implementing a single season IP Outs Corrector as well.


#39    Tangotiger      (see all posts) 2008/10/16 (Thu) @ 18:37

The way you do a regression is simply add a park factor of 100 for x number of years, for each park.  Whether x is 1 or 3 or 10, you have to figure that out.


#40    terpsfan101      (see all posts) 2008/10/16 (Thu) @ 19:20

Your method would obviously be much more accurate than what I just did. I just tacked on .01 for each additional year after 4 years using MGL’s equations:

5 yrs: 1-(1-PF)*.91
6 yrs: 1-(1-PF)*.92
13 yrs: 1-(1-PF)*.99
14+ yrs: No regression

I don’t know why regression is such a difficult concept for me to understand. It really doesn’t sound all that complicated.


#41    Tangotiger      (see all posts) 2008/10/16 (Thu) @ 19:42

If you add 0.5 years, then you get this amount of regression for your years:

5 yrs: 0.5/5.5 = 9%
6 yrs: 0.5/6.5 = 8%
13 yrs: 0.5/13.5 = 4%

You have to figure out how many years of 100 PF to add.  0.5 simply best-fits your data.
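In code, "adding x years of 100 PF" is a weighted average. The sketch below uses the 0.5 years from this comment; the function names are made up.

```python
def regress_with_ballast(pf, years, ballast_years, mean_pf=100.0):
    """Blend the observed PF with `ballast_years` of a league-average park."""
    return (years * pf + ballast_years * mean_pf) / (years + ballast_years)


def regression_share(years, ballast_years):
    """Fraction of the result that comes from the league-average ballast."""
    return ballast_years / (years + ballast_years)


for years in (5, 6, 13):
    share = regression_share(years, 0.5)
    print(f"{years} yrs: {share:.0%} regression, "
          f"PF 110 becomes {regress_with_ballast(110, years, 0.5):.1f}")
```

The ballast stays fixed; only its share of the total shrinks as more real seasons accumulate.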


#42    David Gassko      (see all posts) 2008/10/16 (Thu) @ 20:45

Based on MGL’s numbers, sounds like you want to add about 0.67 years of average park factor.


#43    terpsfan101      (see all posts) 2008/10/16 (Thu) @ 23:10

Yes, MGL is using 0.67 years of avg park factor. Say we have a PF of 1.5:

(1.5+0.67)/1.67 = 1.3

or

1-(1-1.5)*.6 = 1.3


#44    terpsfan101      (see all posts) 2008/10/17 (Fri) @ 01:04

Tangotiger,

“Whether X is 1 or 3 or 10, you have to figure that out.”

Do I look at the variance of PF grouped by number of seasons to figure out X?


#45    terpsfan101      (see all posts) 2008/10/17 (Fri) @ 06:15

After how many seasons of data would regression not be necessary for PF’s? If someone could give me an educated guess on this, then I could just adjust MGL’s regression equations.


#46    David Gassko      (see all posts) 2008/10/17 (Fri) @ 12:10

You always regress. Just after 1 year, you regress 40%, whereas after 10 (using Mickey’s equation), you regress 6%, and after 100, 0.7%. But you always regress.


#47          (see all posts) 2008/10/17 (Fri) @ 17:04

The easiest way to program the regression is to take the actual performance and add to it a fixed number of league mean performance. The amount of league mean to use can be calculated from population size, mean and variance, or empirically testing for the values which minimize the total error.

Say that for SO%, we add 100 PAs. If a player has 100 PAs himself, then 50% of the result is from him, 50% from the league (we don’t know much about the player, so we assume he will tend to be league average). If the player later has 900 PAs, then we do know much more about him, so 90% of the result will be his, 10% league. After 3000, it’s about 97% player.

Even with BABIP, which takes in the neighborhood of 1000 PAs of regression, if a player can maintain an outlier performance for several seasons, he can still keep 80% or so of his performance after regression.

Always regress, but only once. Don’t regress each year and then add them up. Add them up first, then regress.
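A small sketch of why the order matters, using the 100 PA of league-average ballast from the SO% example. The player seasons and the league rate are hypothetical.

```python
def regress_rate(successes, opportunities, league_rate, ballast):
    """Add `ballast` opportunities of league-average performance, then take the rate."""
    return (successes + ballast * league_rate) / (opportunities + ballast)


def player_weight(opportunities, ballast):
    """Share of the regressed rate that comes from the player's own record."""
    return opportunities / (opportunities + ballast)


LEAGUE_SO_RATE = 0.17     # hypothetical league strikeout rate
BALLAST_PA = 100          # from the SO% example above

# Hypothetical player: (strikeouts, PA) in three seasons.
seasons = [(150, 600), (140, 600), (160, 600)]

# Right: add the seasons up first, then regress once.
total_so = sum(so for so, _ in seasons)
total_pa = sum(pa for _, pa in seasons)
pooled = regress_rate(total_so, total_pa, LEAGUE_SO_RATE, BALLAST_PA)

# Wrong: regress each season separately, then average the results.
per_season = [regress_rate(so, pa, LEAGUE_SO_RATE, BALLAST_PA) for so, pa in seasons]
averaged = sum(per_season) / len(per_season)

print(f"raw rate:              {total_so / total_pa:.3f}")
print(f"pooled, then regressed: {pooled:.3f}")
print(f"regressed, then pooled: {averaged:.3f}")
print(f"player weight at 100, 900, 3000 PA: "
      f"{player_weight(100, 100):.2f}, {player_weight(900, 100):.2f}, "
      f"{player_weight(3000, 100):.3f}")
```

Regressing season by season applies the ballast three times, so the result is pulled toward the league mean much harder than the evidence justifies.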


#48    Tangotiger      (see all posts) 2008/10/17 (Fri) @ 17:11

What Brian said.


#49    terpsfan101      (see all posts) 2008/10/17 (Fri) @ 17:33

David,

I appreciate your help here. How did you figure out that you regress 6% for 10 years and 0.7% for 100 years?


#50    David Gassko      (see all posts) 2008/10/17 (Fri) @ 17:53

I did what Brian said, roughly. If we’re adding .67 years of an average park factor, regression to the mean after 10 years = 1- 10/(10 + .67) = 6%. After 100 years, it’s 1 - 100/(100 + .67) = .7%.


#51    terpsfan101      (see all posts) 2008/10/17 (Fri) @ 19:19

Thanks to you guys, I think I have figured this out. Using MGL’s 4+ yr regression equation, you add (4/9) of avg. PF for each season after the 4th yr:

4/(4+(4/9)) = .90 multiplier (10% regression)

Adding (4/9) for 5 years we get:

5/(5+(4/9)) = .918 multiplier (8.2% regression)

For 10 years we get:

10/(10+(4/9)) = .957 multiplier (4.3% regression)

Thanks again for the help. Once I update the regression values in my database and re-run the queries, I’ll update the PF spreadsheet I posted on Google Docs.


#52    tangotiger      (see all posts) 2008/10/17 (Fri) @ 20:32

You need to add the SAME number of years of PF, for ANY group of years.

If you have 4 years of PF, you add 0.5 years of 100 PF.  If you have 40 years of PF, you add 0.5 years of PF.

You don’t even need to say how much regression you are adding as a rate.  You simply add the 0.5 years of 100 PF as if that actually happened.

If you are implying that MGL adds different number of years of PF for different total number of years of a park, then I’d like to hear him explain it.

If you have Pujols with 600 PA, you add 200 PA of league average performance.  If you have Pujols with 2000 PA, you add 200 PA of league average performance.  If you have Pujols with 10,000 PA, you add 200 PA of league average performance.

Now, if you NEED to express the amount of regression, that would be 200/800, 200/2200, and 200/10200, respectively.  It’s not necessary to calculate the rate, since simply adding in the 200 PA as I’m showing it is sufficient.

BTW, 1 minus the above rates is the “reliability” figure in Marcel.


#53    terpsfan101      (see all posts) 2008/10/17 (Fri) @ 21:23

For the 4 equations MGL gives, this is the amount of avg PF being added:

Yrs, Avg PF Added
1: 2/3
2: 6/7
3: 3/4
4+: 4/9

For years greater than 4, MGL is adding the same amount of avg PF for each year. However, for years 1 through 3, the amount of avg PF being added changes.

I have no idea why MGL chose to add different amounts of PF for years 1 through 3. I’m sure he had a good reason for doing it. We should also keep in mind that he gave these formulas as “a decent rule of thumb.”

Anyway, I trust MGL’s work and his precision probably more than I do any other sabermetrician.
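The table above can be reproduced by solving multiplier = years / (years + ballast) for the ballast. A short sketch with exact fractions (the function name is made up):

```python
from fractions import Fraction


def implied_ballast_years(years, multiplier):
    """Years of average PF implied by a rule-of-thumb multiplier.

    Solves multiplier = years / (years + ballast) for ballast.
    """
    return years * (1 - multiplier) / multiplier


# Multipliers from the rule of thumb: .6, .7, .8, and .9 for 4+ years.
for years, tenths in [(1, 6), (2, 7), (3, 8), (4, 9)]:
    multiplier = Fraction(tenths, 10)
    print(years, implied_ballast_years(years, multiplier))
```

The implied ballast is not constant across the four rows, which is consistent with the formulas being offered as a rule of thumb rather than derived from one fixed amount of regression.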


#54    terpsfan101      (see all posts) 2008/10/17 (Fri) @ 23:17

I now see Tango’s, David’s, and Brian’s point that the regression should be a fixed number. That is why we call it “linear regression.”


#55    terpsfan101      (see all posts) 2008/10/18 (Sat) @ 13:01

I updated the PF spreadsheet in post 36 with the new regression amounts. I also fixed an error in the way I was figuring out the Park Adjustment Factor.


#56    Colin Wyers      (see all posts) 2008/10/18 (Sat) @ 14:11

You’re confusing a linear model with regression to the mean - a linear regression is called such because it’s a linear equation.

The reason that you regress by a fixed number (using this method) is because as the number of PAs (outs, games - whatever your denominator is) goes up, the total amount of regression goes down.

Essentially

(1.5+0.67)/1.67 = 1.3

is a simplified form of

(1*(1.5)+.67*(1))/(1+.67) = 1.30

or a weighted average. If you have two years of park factor data, you would do:

(2*(1.5)+.67*(1))/(2+.67) = 1.37

You hold the regression amount constant, and it becomes smaller as a proportion of the overall years in sample.


#57    terpsfan101      (see all posts) 2008/10/18 (Sat) @ 17:03

At least now I understand why you use regression and how you implement it. Though, I still wouldn’t know how to calculate regression, what variables you would use, and to what degree you would want to regress.


#58    Colin Wyers      (see all posts) 2008/10/18 (Sat) @ 17:25

You too?

I believe the method Tango uses (at least as a shortcut) is to figure the year-to-year correlation of a measure (as r) at x number of opportunities, and then figure:

x * (1-r)/r

Someone can correct me if I’m wrong.


#59    tangotiger      (see all posts) 2008/10/18 (Sat) @ 18:31

Right.

So, let’s say that if you have 200 PA in one time period and another 200 PA in another time period, and the correlation between the two samples for K/PA is r=.667, then we know that r=.500 at PA = 200*.333/.667 = 100

And this becomes
r = PA / (PA + 100)

So, at PA = 200, correlation is .667.  At PA = 100, correlation is r=.500

That’s why all we need to do is report at how many opps r=.50, and we have a general regression equation for anything you want to do.  It’s as easy as that.
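The same shortcut as a pair of helper functions, using the K/PA numbers from this comment. The function names are made up for the sketch.

```python
def ballast_from_correlation(opportunities, r):
    """Opportunities at which r would equal .50, given r at `opportunities`."""
    return opportunities * (1 - r) / r


def reliability(opportunities, ballast):
    """Expected correlation at a given sample size: PA / (PA + ballast)."""
    return opportunities / (opportunities + ballast)


# Two 200-PA samples of K/PA that correlate at r = .667.
ballast = ballast_from_correlation(200, 0.667)
print(f"ballast: {ballast:.0f} PA")

for pa in (100, 200, 600):
    print(f"r at {pa} PA: {reliability(pa, round(ballast)):.3f}")
```

The ballast that comes out is the number of league-average opportunities to add when regressing, so the one reported figure serves both purposes.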

