THE BOOK--Playing The Percentages In Baseball

Thursday, December 09, 2010

Tangotiger challenge of the day

By Tangotiger, 08:03 PM

From mettle:

I’m struck by: “The above players, in the season preceding those 5 years, had 5.8 WAR.” Am I right to assume the 5.8 => 4.4 1-year drop is what we’d expect from regression?

Yes.  Remember, I selected on high WAR.  By definition, that group includes a lot more good luck than bad luck.

If you take 70% of 5.8 and 30% of 2.0 you get 4.66, roughly close to what we have here.

And you can do that with ANYTHING.  Go ahead and try it and report back on the results.  I’ll give you a few to start with:

1. Take the top 10 in SLG in each of the last 10 years, and tell me what the overall average SLG of these 100 players was in the following year

2. Repeat with top 10 in OBP

3. Repeat with top 10 in HR

4. Repeat with top 10 in ERA

5. Repeat with top 10 in K

I haven’t done this myself, but I already know the answer for year T+1: it’s going to be roughly 70% of whatever their average was in year T, plus 30% of whatever the league average is.

Something close to that.  Prove me wrong…
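
The 70/30 rule of thumb above can be written as a one-line blend. This is only a minimal sketch: the function name and the `keep` parameter are illustrative, and the inputs are the WAR figures from the post (5.8 observed, 2.0 as the mean being regressed toward).

```python
def regress_to_mean(observed, league_mean, keep=0.70):
    """Blend an observed average with the league mean.

    keep is the share of the observed value that is retained; 0.70 is
    the post's rule of thumb, not a fixed constant.
    """
    return keep * observed + (1.0 - keep) * league_mean


# The WAR example from the post: 70% of 5.8 plus 30% of 2.0.
print(round(regress_to_mean(5.8, 2.0), 2))  # 4.66
```

The same call should work for any of the five challenges, as long as the observed value and the league mean are in the same units.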


#1    2010/12/09 (Thu) @ 22:42

I just ran the test for the OBP leaders during the past 10 years. Here’s what I got:

Top 10 Average OBP .434
Year T+1 Average: .408

Over the past 10 years, the T+1 OBP for the top 10 is about 73.3% Year T average and 26.7% League Average OBP.

I’ll run SLG/ERA if I have time later.

For HR and K, what do you suggest using for league average?


#2    Tangotiger    2010/12/09 (Thu) @ 23:09

Great job Leo!  We need guys like you doing the work.  I can say this stuff until I’m blue in the face, but until the new people out there do the work, it’ll just be part of the inner club.

So, thanks for joining the fray.

***

For HR, maybe do HR/AB or HR/PA for the league leaders, and treat it as you would OBP or SLG.


#3    Ryan JL    2010/12/10 (Fri) @ 03:32

Top 10 Average SLG: 0.6215
Year T+1 Average: 0.5576

I am not sure how to calculate the last step.

By the way, here is a(n ugly) query that immediately gets you the top-10 players and their subsequent years from BDB.  Just modify the SLG to OBP or HR/AB or whatever you’re looking for.

Someone can probably simplify; this was brute force.

select T1.playerID, T1.yearID, T1.SLG, T2.yearID, T2.SLG from
(
  (select playerID, yearID, (H + `2B` + (`3B`*2) + (HR*3))/AB as SLG
   from batting
   where (AB+BB+SF+SH+HBP) >= 502 and yearID = 2009
   order by SLG desc limit 10)
  union
  (select playerID, yearID, (H + `2B` + (`3B`*2) + (HR*3))/AB as SLG
   from batting
   where (AB+BB+SF+SH+HBP) >= 502 and yearID = 2008
   order by SLG desc limit 10)
  union
  (select playerID, yearID, (H + `2B` + (`3B`*2) + (HR*3))/AB as SLG
   from batting
   where (AB+BB+SF+SH+HBP) >= 502 and yearID = 2007
   order by SLG desc limit 10)
  union
  (select playerID, yearID, (H + `2B` + (`3B`*2) + (HR*3))/AB as SLG
   from batting
   where (AB+BB+SF+SH+HBP) >= 502 and yearID = 2006
   order by SLG desc limit 10)
  union
  (select playerID, yearID, (H + `2B` + (`3B`*2) + (HR*3))/AB as SLG
   from batting
   where (AB+BB+SF+SH+HBP) >= 502 and yearID = 2005
   order by SLG desc limit 10)
  union
  (select playerID, yearID, (H + `2B` + (`3B`*2) + (HR*3))/AB as SLG
   from batting
   where (AB+BB+SF+SH+HBP) >= 502 and yearID = 2004
   order by SLG desc limit 10)
  union
  (select playerID, yearID, (H + `2B` + (`3B`*2) + (HR*3))/AB as SLG
   from batting
   where (AB+BB+SF+SH+HBP) >= 502 and yearID = 2003
   order by SLG desc limit 10)
  union
  (select playerID, yearID, (H + `2B` + (`3B`*2) + (HR*3))/AB as SLG
   from batting
   where (AB+BB+SF+SH+HBP) >= 502 and yearID = 2002
   order by SLG desc limit 10)
  union
  (select playerID, yearID, (H + `2B` + (`3B`*2) + (HR*3))/AB as SLG
   from batting
   where (AB+BB+SF+SH+HBP) >= 502 and yearID = 2001
   order by SLG desc limit 10)
  union
  (select playerID, yearID, (H + `2B` + (`3B`*2) + (HR*3))/AB as SLG
   from batting
   where (AB+BB+SF+SH+HBP) >= 502 and yearID = 2000
   order by SLG desc limit 10)
) as T1
join
(select playerID, yearID, sum(H + `2B` + (`3B`*2) + (HR*3))/sum(AB) as SLG
 from batting
 group by playerID, yearID) as T2
on T1.playerID = T2.playerID
and T1.yearID = T2.yearID - 1
order by T1.yearID desc, T1.SLG desc


#4    tangotiger    2010/12/10 (Fri) @ 08:56

Let’s say lgSLG = .410

Then:

.558 = .662 * (x) + .410 * (1-x)

Solve for x.  So, that simplifies to:

x = (.558-.410) / (.662-.410)

x = 59%


#5    Tangotiger    2010/12/10 (Fri) @ 11:51

David Pinto did it for SLG as well:
http://baseballmusings.com/?p=63091

His 10 year average in year T was .622
And in T+1 was .559

And if you did 70% of year T and 30% of league average, Pinto reports this result:
.561

***

So, yes, that illustrates it rather perfectly, doesn’t it?  If he used 69% for year T and 31% for amount of regression, it would have nailed it perfectly.


#6    Tangotiger    2010/12/10 (Fri) @ 12:06

This is what regression is all about.  This is why we have regression.  Because when you select on something, and you purposefully select the outlying observations, those observations are much more likely to have benefitted from good luck than bad luck.

The “true” rate of those particular entities is going to be somewhere between what you observed and what the population mean is.

And that’s what we’re seeing here: the true rate is about 70% of the observed and 30% of the population mean.  Put another way, you regress the observed rate 30% of the way toward the population mean.

***

This 70% is not fixed.  It depends on a bunch of things.  But for illustrative purposes, 70% does the trick.  Once everyone is happy that we need regression and that regression works, we can show how to figure out the “70%” for whatever metric you are using.


#7    Tangotiger    2010/12/10 (Fri) @ 12:12

Regression is actually the single most important concept to understand when dealing with observed data.  Fangraphs, BPro, THT, BtB, etc., should all be taking up this challenge to show their readers that part of any data they look at is noise.  These simple examples show the concept plainly, whereas mathematical gyrations will simply make the common person ignore it.

Regression, regression, regression ~= location, location, location

I remember MGL hammering this point home on the old BaseballBoards.  It took me quite a while to understand and accept it.  And really, it makes no sense to do any analysis without getting this.


#8    2010/12/10 (Fri) @ 13:35

Somebody just asked me to project Sidney Crosby’s points for the season yesterday.  I said you should assume that he’ll post something like a 3-2-1 weighted average of his last three seasons, and so he’s looking at something like 120 points if he doesn’t get injured.

This answer was not well-received, but I said I could not in good conscience allow anyone to make future projections without regressing :)


#9    Tangotiger    2010/12/10 (Fri) @ 14:25

Even 3-2-1 won’t have enough regression.

Also, I presume you forecasted his remaining games, and then added his already earned points?


#10    Ryan JL    2010/12/10 (Fri) @ 14:47

.558 = .662 * (x) + .410 * (1-x)

Solve for x.  So, that simplifies to:

x = (.558-.410) / (.662-.410)

x = 59%

I said .622 not .662

.5576 = .6215x + .410(1-x)
x = (.5576-.410)/(.6215-.410)
x = 69.8%

Neat! Thanks Tango.


#11    Tangotiger    2010/12/10 (Fri) @ 14:59

I guessed on .410.  According to Pinto, it was .420.  Your numbers and his were similar.  So, using your numbers:
x = (.5576-.420)/(.6215-.420) = 68.3%
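
The "solve for x" step used in this thread can be wrapped in a small helper. A minimal sketch, assuming the function name `retained_share` (illustrative, not from the thread) and using Ryan's SLG figures with both league averages mentioned so far:

```python
def retained_share(year_t, year_t1, league_mean):
    """Solve year_t1 = x*year_t + (1 - x)*league_mean for x."""
    gap = year_t - league_mean
    if gap == 0:
        raise ValueError("year T average equals the league mean; x is undefined")
    return (year_t1 - league_mean) / gap


# Ryan's SLG numbers, first with the guessed .410 and then with Pinto's .420.
for league_slg in (0.410, 0.420):
    x = retained_share(0.6215, 0.5576, league_slg)
    print(f"lgSLG {league_slg:.3f}: x = {x:.1%}")
```

With these inputs the two lines should print 69.8% and 68.3%, matching the hand calculations above.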

***

Anyway, is everyone a believer in regression toward the mean?  Is there anyone NOT on the regression train?


#12    Martin Monkman    2010/12/10 (Fri) @ 17:34

Tango, I took you up on #1 of the challenge, but went off on a bit of a tangent to add another car (or two, when I finish the next piece) to the regression train.

Summary: 2007-2008, top 25 SLG across both leagues, 66% (and it would have been lower if not for Pujols).

http://bayesball.blogspot.com/2010/12/slugging-regression.html


#13    Ryan JL    2010/12/10 (Fri) @ 17:49

Average HR/AB of league leaders: 0.0812
League Average (roughly): 0.036
Average HR/AB of leaders in T+1: 0.0681

That is 71% of Year T, and 29% league average.

I know I’m doing nothing but belaboring the point, but I just find it very cool.


#14    Tangotiger    2010/12/10 (Fri) @ 18:05

Please, keep belaboring!

The only way we can go on to the next town is if everyone gets on the regression train.


#15    Ryan JL    2010/12/10 (Fri) @ 19:22

ERA of top-10 ERA leaders, 2000-2009: 2.86
ERA of that group of pitchers in T+1: 3.79
Average ERA over that timespan: 4.06

Can this be highlighted in the same sort of way? Something tells me it’s different.


#16    Zach    2010/12/10 (Fri) @ 19:37

Ryan/15:

ERA minus league ERA in T: -1.20
ERA minus league ERA in T+1: -0.27

-1.20 * x + 0 * (1-x) = -0.27

x = 22.5%

ERA in T+1 = 22.5% of ERA in T + 77.5% of league ERA


#17    tangotiger    2010/12/10 (Fri) @ 21:04

There’s a good reason for the ERA to behave differently, however:

Average ERA over that timespan: 4.06

That can’t be right.  ERA average must be around 4.30 or so.  And, since you are limiting the pitchers to starters only, probably more like 4.40.


#18    Ryan JL    2010/12/10 (Fri) @ 21:25

Oops, you are correct.  That was the average of pitchers who qualified for the ERA title.  The average of pitchers with 100IP is 4.35 and the total aggregate average is 4.40.

That would give us an x of 37%, if I didn’t botch anything else.
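
How much the ERA answer depends on the choice of league average can be checked directly. A rough sketch using the three baselines quoted in this thread; the helper name and the dictionary labels are illustrative assumptions:

```python
def retained_share(year_t, year_t1, league_mean):
    """Solve year_t1 = x*year_t + (1 - x)*league_mean for x."""
    gap = year_t - league_mean
    if gap == 0:
        raise ValueError("year T average equals the league mean; x is undefined")
    return (year_t1 - league_mean) / gap


ERA_T, ERA_T1 = 2.86, 3.79  # top-10 ERA leaders, year T and year T+1

# League-average candidates mentioned in the thread.
baselines = {"ERA qualifiers": 4.06, "100+ IP": 4.35, "all pitchers": 4.40}

shares = {label: retained_share(ERA_T, ERA_T1, lg) for label, lg in baselines.items()}
for label, x in shares.items():
    print(f"{label}: x = {x:.1%}")
```

Under these inputs the retained share moves from roughly 22% to roughly 40% as the baseline rises from 4.06 to 4.40, so the population you regress toward matters a great deal.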


#19    Martin Monkman    2010/12/11 (Sat) @ 01:19

Regression to the SLG mean, 75+ ABs—evidence (not particularly robust, but a start) that the top hitters regress down, the weaker sluggers regress up.  Read more at
http://bayesball.blogspot.com/2010/12/slugging-regression-ii.html


#20    tangotiger    2010/12/12 (Sun) @ 09:03

Pinto looks at OBP:
http://www.baseballmusings.com/?p=63107

Year T: .434
Year T+1: .407
Lg Avg: .333

Maintain rate: 73%


#21    Tyler    2010/12/13 (Mon) @ 05:44

I understand the numbers suggest a 30% regression towards the league mean for the top 10 players in a given category, but what about when evaluating players outside the top ten? In a larger sample you will certainly get players who performed at, or even below, their true talent levels. How do you know to what degree to modify the regression level?


#22    Tangotiger    2010/12/13 (Mon) @ 11:43

You apply regression to EVERY player; the amount depends on his PA.

This does NOT mean that you will get the exact true talent level for the player, only a best estimate, with some uncertainty around it.
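
One common way to tie the amount of regression to PA is to add a fixed number of league-average PA to every player, so that a player with few PA is pulled hard toward the mean and a full-time player much less. The sketch below assumes that approach; it is not spelled out in this thread, and the ballast of 300 PA, the .400 observed rate, and the .333 league rate are all hypothetical placeholders.

```python
def keep_share(pa, ballast_pa):
    """Share of the observed rate retained for a player with `pa` plate appearances."""
    return pa / (pa + ballast_pa)


def estimate_true_rate(observed, pa, league_mean, ballast_pa):
    """Best-estimate rate: observed blended with the league mean by PA."""
    keep = keep_share(pa, ballast_pa)
    return keep * observed + (1.0 - keep) * league_mean


HYPOTHETICAL_BALLAST_PA = 300  # placeholder, not an estimated constant

for pa in (100, 300, 600):
    est = estimate_true_rate(0.400, pa, 0.333, HYPOTHETICAL_BALLAST_PA)
    print(f"{pa} PA: keep {keep_share(pa, HYPOTHETICAL_BALLAST_PA):.0%}, estimate {est:.3f}")
```

The right ballast differs by metric and would have to be estimated from data; the point here is only that the retained share rises with PA.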


#23    German dude    2010/12/23 (Thu) @ 08:46

I don’t know if this is interesting but I did a little study for my own amateur league in Germany (7 teams with only 24 games per season).

First of all I selected all players with at least 1 PA in both 2009 & 2010. I then split all players into two groups, one with all those players whose AVG. was above league average (.269) in 2009 and the other ones with an AVG. below league average.

Group 1: AVG .354 (500-1413)
Group 2: AVG .186 (272-1459)

Then I checked the AVG. for both groups in 2010.

Group 1: AVG .319 (425-1333)
Group 2: AVG .265 (384-1447)

Is this way of doing this exercise valid? If so, I guess it points to “regression towards the mean”, correct?

Thanks for any quick feedback!


#24    tangotiger    2010/12/23 (Thu) @ 08:55

As long as you weight each player the same in the two pools.  Usually, we use the smaller of each player’s two PA totals as his weight in both years.

Otherwise: yes!

Don’t expect the 70% to hold though.  The 70% is specific to MLB and to one season.
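
The weighting described above can be sketched as follows. Each player gets one weight, the smaller of his two PA totals, and that same weight is applied to his rate in both years. The function name and the sample rows are made up for illustration and are not data from the German league.

```python
def paired_group_averages(players):
    """Return (avg_year_t, avg_year_t1) for a group of players.

    players: iterable of (rate_t, pa_t, rate_t1, pa_t1) tuples.
    Each player is weighted by min(pa_t, pa_t1) in both years.
    """
    total_weight = 0.0
    sum_t = 0.0
    sum_t1 = 0.0
    for rate_t, pa_t, rate_t1, pa_t1 in players:
        weight = min(pa_t, pa_t1)
        total_weight += weight
        sum_t += weight * rate_t
        sum_t1 += weight * rate_t1
    if total_weight == 0:
        raise ValueError("no plate appearances to weight by")
    return sum_t / total_weight, sum_t1 / total_weight


# Made-up rows: (AVG year T, PA year T, AVG year T+1, PA year T+1)
sample = [(0.300, 100, 0.250, 50), (0.200, 20, 0.300, 80)]
avg_t, avg_t1 = paired_group_averages(sample)
print(f"year T: {avg_t:.3f}, year T+1: {avg_t1:.3f}")
```

Using one weight per player keeps a part-timer in one season from counting as a regular in the other, so both pool averages describe the same mix of players.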


#25    German dude    2010/12/23 (Thu) @ 09:03

Cool. Thanks for the quick reply!

