THE BOOK--Playing The Percentages In Baseball

Sunday, July 27, 2008

Minor to major correlations

By , 12:00 AM

I ran year-to-year correlations for various stats for players who played in AA or AAA in one year and in the majors the next year.  They had to have at least 300 PA in both years.  I regressed minors (AA or AAA only) 00 on majors 01, minors 01 on majors 02, etc., from 00 to 07.  The minor league stats are MLE’s and are park neutral.  The major league stats are park (and opponent) neutral also.  Here are the results:


Keep in mind that my MLE coefficients are based on players from pre-2000 (that played in the minors and majors in the same year and in consecutive years), so that I am not “cheating” in calculating these correlations, although since they are linear coefficients, I am not sure that would matter.  Anyway....

The average number of minor league PA was 450, and the next year, the average number of major league PA was 464.

There were 195 data pairs (player seasons) in the regression.

BA .285
OBA .384
SA .388
OPS .331
Sngl per PA .604
Dbl per PA .219
Trp per PA .445
HR per PA .665
BB per PA .699
SO per PA .756

Those are some decent correlations ("r").

Here are the corresponding numbers for players who played in the majors in one year and the majors in the next year, again with a min of 200 PA per year (average of 503/496):

N=1875

BA .391
OBA .565
SA .607
OPS .593
Sngl per PA .580
Dbl per PA .230
Trp per PA .417
HR per PA .721
BB per PA .744
SO per PA .837

#1    studes    2008/07/27 (Sun) @ 09:25

Awesome, MGL.  Thanks.  Question: how much might sample bias affect these relative correlations?  Seems to me that players who are successful enough to rack up 300 PA in the majors in their first year are players who have been “living up” to expectations based on their minor league stats.

I know there’s no exact answer to a question like that, but do you think it’s a factor?


#2    Peter Jensen    2008/07/27 (Sun) @ 11:05

To follow up on Studes #1 with specific questions for you, how many minor league players whose MLE’s projected them to be better than a player that did get 300 PA’s with their major league team did not reach 300 PA’s?  How many PA’s did they get on average? What was their actual combined total major league performance and what was their total combined expected major league performance (each player’s actual PA’s times his expected MLE added together and then averaged)?


#3    MGL    2008/07/27 (Sun) @ 14:10

Studes, good questions.  Selective sampling always comes into play with players from the minors to the majors and also when setting minimum number of PA to include in your samples.

First of all, players who make the majors will tend to have gotten lucky in the previous year in the minors and in the minors in general.  So the minor league stats in these regressions (the independent variable, although either one can be called the independent variable in these 2 variable regressions) tend to bunch around a “good” number.  I am not really sure how that affects a correlation.  It would probably weaken it I would think.

Then, as you mention, in the second year, the majors, in order to rack up at least 300 PA, players probably have to get a little lucky, which would also bunch their numbers toward the “good”.  Maybe that improves the correlations since the same thing happened in the minors in the previous year (tended to be lucky).  I am not sure.

And, as you also mentioned, in order to rack up 300 PA in the majors as a rookie, not only do you have to be a little lucky, but you probably have to somewhat “match” what you did in the minors.  So a player who was a defensive specialist and/or plays C or SS or CF, probably does not need to get quite as lucky as a corner outfielder or 1B in order to rack up 300 PA.  This effect might tend to increase the correlations across the board.  Again, I don’t really know.  It is complicated.

To some extent, the same sort of selective sampling problems affect players already in the majors when you do the y-t-y regressions/correlations, as in my second set of numbers.  Players as a group who have back to back seasons of at least 300 PA tend to have gotten a little lucky in the first of the 2 years, etc.

I think we have to accept these correlations yet take them with a grain of salt.

Peter, #2, good questions.  I’ll have to work on that if I get the time.


#4    Colin Wyers    2008/07/27 (Sun) @ 17:45

Have you tried using a weighted correlation? In that event you don’t have to set any kind of a PA limit.

This is the formula I use in Excel:

=(SUM(CH_TWO*(ZR_ONE-SUMPRODUCT(CH_TWO*ZR_ONE)/SUM(CH_TWO))*(ZR_TWO-SUMPRODUCT(CH_TWO*ZR_TWO)/SUM(CH_TWO))))/(SQRT(SUM(CH_TWO*(ZR_ONE-SUMPRODUCT(CH_TWO*ZR_ONE)/SUM(CH_TWO))^2)*SUM(CH_TWO*(ZR_TWO-SUMPRODUCT(CH_TWO*ZR_TWO)/SUM(CH_TWO))^2)))

In this case, ZR_ONE and ZR_TWO are the splits, and CH_TWO are the weights. It’s important to use the smaller of the two weights - so if a guy has 2 PAs in the majors and 34 in the minors, 2 is the weight. It’s an array formula, so use CTRL-SHIFT-ENTER instead of just Enter when inputting the formula.
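
For readers who do not use Excel, here is a minimal Python sketch of the same weighted correlation. The function name `weighted_corr` and all of the sample numbers are hypothetical, chosen only for illustration; the weights follow the "smaller of the two PA" rule described above.

```python
import math

def weighted_corr(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    total_w = sum(w)
    # Weighted means: the SUMPRODUCT(...)/SUM(...) terms in the Excel formula.
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    # Weighted cross-product and squared deviations. They are left
    # unnormalized because the common 1/total_w factor cancels in the ratio.
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: a rate stat in year 1 and year 2, plus PA in each year.
year1 = [0.350, 0.330, 0.310, 0.300]
year2 = [0.340, 0.350, 0.315, 0.290]
pa1 = [300, 200, 50, 34]
pa2 = [20, 100, 5, 2]

# Use the smaller of the two PA counts as the weight for each pair.
weights = [min(a, b) for a, b in zip(pa1, pa2)]
r = weighted_corr(year1, year2, weights)
print(round(r, 3))
```

Because only the relative sizes of the weights matter, multiplying every weight by the same constant leaves the result unchanged.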


#5    tangotiger    2008/07/27 (Sun) @ 18:25

MGL, your “r” for the BA, OBP, SLG in the first set seem wrong.  They are certainly not consistent with the BB, H, etc numbers in the same group.  Were you reporting r-squared there?

You also said “min 300” in one case and “again min 200” in the other.

Finally, I’ll echo Colin, that you might as well just use all data, as I have with the Forecasting study I did a while earlier, weighting by the lesser of two PAs.


#6    GEO.    2008/07/27 (Sun) @ 20:07

We’ve always wanted to know what a minor league player must accomplish as a hitter or pitcher in order to be considered major league ready.

Why not study the stats of major leaguers on rehab assignment (obviously major league ready) and apply it to past and present minor leaguers to determine applicability of the formula?

The individual samples may be small; however, the combined totals (rates or percentages) should give us a good idea.


#7    MGL    2008/07/27 (Sun) @ 20:07

No, my BA, OBP, and SA are all “r”.  I thought it was odd for them to be so low for the minors to majors compared to the majors to majors and for the individual components to be so similar. I’ll recheck.

Plus, I’ll use all of the data and weight by the min of the two PA’s.

Actually, I am not sure how to weight in a regression/correlation formula.  I don’t use Excel.  I use the formula for “r”:

r=(n*txy-tx*ty)/sqr((n*tx2-tx^2)*(n*ty2-ty^2))

where txy is the sum of all the x*y’s, and tx is the sum of all the x’s, ty is the sum of all the y’s, tx2 is the sum of all the x*x, and ty2 is the sum of all the y*y.

Anyone know what the formula would be if I weighted each data pair by PA?
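
One possible answer, offered as a sketch rather than a definitive method: keep the same running-sum formula, but multiply every term by the pair's weight and let the sum of the weights stand in for n. The function name `r_from_sums` and the toy numbers are hypothetical. With every weight equal to 1 it reduces to the unweighted formula.

```python
import math

def r_from_sums(x, y, w=None):
    """Pearson r from running sums; w holds optional per-pair weights."""
    if w is None:
        w = [1.0] * len(x)  # unweighted case: every pair counts once
    n = sum(w)                                              # replaces the count n
    tx = sum(wi * xi for wi, xi in zip(w, x))               # weighted sum of x
    ty = sum(wi * yi for wi, yi in zip(w, y))               # weighted sum of y
    txy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))  # weighted sum of x*y
    tx2 = sum(wi * xi * xi for wi, xi in zip(w, x))         # weighted sum of x*x
    ty2 = sum(wi * yi * yi for wi, yi in zip(w, y))         # weighted sum of y*y
    return (n * txy - tx * ty) / math.sqrt((n * tx2 - tx ** 2) * (n * ty2 - ty ** 2))

# Toy data, not baseball numbers.
x = [1, 2, 3, 4]
y = [2, 1, 4, 3]
print(r_from_sums(x, y))                # unweighted
print(r_from_sums(x, y, [2, 1, 1, 1]))  # first pair weighted double
```

In practice the weight for each pair would be a PA figure, such as the smaller of the two seasons' PA.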


#8    greenback06    2008/07/27 (Sun) @ 20:10

Do young players correlate as well from y-t-y as middle age or older players?


#9    2008/07/27 (Sun) @ 20:43

Very interesting!

Wouldn’t you expect majors/minors correlations to be pretty much the same as majors/majors correlations?  That is, if (for instance) there was a fixed 20% drop in performance from minors to majors, that would affect the regression coefficients, but not the correlation.  Right?
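
The point about a fixed drop can be checked numerically. The sketch below uses made-up rates (all values and function names are hypothetical): scaling every major league number by 0.8 changes the regression slope by the same factor but leaves r untouched.

```python
import math

def pearson_r(x, y):
    """Plain (unweighted) Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    return cov / vx

# Hypothetical rates for five players.
minors = [0.300, 0.340, 0.360, 0.320, 0.380]
majors = [0.310, 0.330, 0.370, 0.300, 0.390]
majors_dropped = [0.8 * v for v in majors]  # a uniform 20% drop

r_same, r_drop = pearson_r(minors, majors), pearson_r(minors, majors_dropped)
b_same, b_drop = slope(minors, majors), slope(minors, majors_dropped)
print(r_same, r_drop)  # the two correlations agree
print(b_same, b_drop)  # the second slope is 0.8 times the first
```

So a uniform drop alone cannot explain a lower correlation; the drop would have to vary from player to player.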

So why the difference?  Selective sampling is one possibility.  Different players having different majors/minors drops (different skillsets?) could be another.


#10    dq    2008/07/27 (Sun) @ 21:45

#6 the problem with looking at major league players on rehab is they are not major league ready - they have an injury (or situation) that makes them perform at less than 100% - probably not very quantifiable. So, you will have major leaguers performing at 60-90% efficiency.

Their last performances would probably be major league ready, but their 1st ones may or may not.


#11    Colin Wyers    2008/07/27 (Sun) @ 21:54

The other issue with looking at exclusively rehab assignments is that players on a rehab assignment aren’t “playing to win,” necessarily, just working on maybe one or two things. If a pitcher is rehabbing, for example, he may exclusively throw one of his breaking pitches that he’s trying to get a feel for. No pitcher in MLB exclusively throws his worst pitch repeatedly, to the point where you know - not just suspect - that he’s going to throw it.

As far as the weighted correlation - I’d have to double check the formula, and I’ve got an hour to prepare for being a fill-in Dungeon Master (I am your stereotype!), but I’m pretty sure that you use the same formula as you posted, just using weighted averages instead of sums. The full formula I use is from here:

http://ije.oxfordjournals.org/cgi/content/full/34/4/837


#12    2008/07/27 (Sun) @ 22:28

Bill James said a long time ago that minor league batting performance predicts major league performance as well as major league batting performance does.

I have always disagreed with that and still do.  At least as far as we can measure both.

The reason is multi-fold.

For instance:

One, almost everyone agrees that there are some players who are AAA players and that there is not a smooth transition from minors to majors.  Although I disagree with that notion, you can’t have both.

Two, there is so much “slop” in minor league performance, because of park factors, league factors, and opponent factors, we could NEVER get a correlation from minors to majors to equal majors to majors.  Never.

I re-ran the minors to majors correlations with no min PA in either year (actually had to have at least 10 PA, which is inconsequential).  I weighted the correlations by PA simply by assuming that a pair weighted by 200 PA “counted” twice as much as a pair with 100 PA.  So if I had only two pairs of data (2 players), one weighted by 200 and the other weighted by 100, it would be like I had 3 players, the first one counted twice and the second one counted once. I think that is the way to do it (the weighting), or at least one way.
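
The "counted twice" idea can be sketched in Python as follows. The three players and their rates are hypothetical, and the helper `pearson_r` is not from the post. Copying each pair in proportion to its weight and running a plain correlation gives the same r as a weighted correlation on the original pairs.

```python
import math

def pearson_r(x, y, w=None):
    """Pearson r; optional weights w act like repeat counts for each pair."""
    if w is None:
        w = [1] * len(x)
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(vx * vy)

# Hypothetical MLE rate in year 1, major league rate in year 2, and PA weight.
minors = [0.330, 0.350, 0.310]
majors = [0.350, 0.340, 0.315]
pa = [200, 100, 100]

# "Counted twice": expand each pair into pa/100 copies, then run a plain r.
copies = [p // 100 for p in pa]
x_rep = [v for v, c in zip(minors, copies) for _ in range(c)]
y_rep = [v for v, c in zip(majors, copies) for _ in range(c)]

r_replicated = pearson_r(x_rep, y_rep)       # 4 rows, no weights
r_weighted = pearson_r(minors, majors, pa)   # 3 rows, weighted by PA
print(r_replicated, r_weighted)
```

The weighted form avoids building the expanded list and also handles weights that are not whole multiples of one another.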

Minors to majors

N=1738
Av PA used for weighting = 126

BA .223
OBA .258
SA .257
OPS .229
S .415
D .139
T .240
HR .474
BB .501
SO .665

Majors to majors

N=3295
Av PA used for weighting = 295

BA .409
OBA .556
SA .599
OPS .586
S .576
D .234
T .411
HR .709
BB .724
SO .814

Minors to minors

N=4310
Av PA used for weighting = 255

BA .307
OBA .378
SA .422
OPS .371
S .488
D .265
T .334
HR .607
BB .592
SO .746

Remember these are MLE’s that are being regressed.

Here are some of the same correlations broken down by age categories:

Minors to majors

26 or younger in year 1 (minors)

N=819
Av PA used for weighting = 152

BA .230
OBA .264
SA .287
OPS .245
S .448
D .168
T .244
HR .518
BB .536
SO .692

Minors to majors

From 26-29 in first year (minors)

N=594
Av PA used for weighting = 118

BA .241
OBA .311
SA .268
OPS .262
S .422
D .084
T .217
HR .503
BB .515
SO .667

Minors to majors

30+ in first year (minors)

N=325
Av PA used for weighting = 75

BA .168
OBA .144
SA .142
OPS .118
S .279
D .156
T .159
HR .246
BB .362
SO .525

Majors to majors

Under 26 in first year

N=1002
Av PA used for weighting = 252

BA .366
OBA .467
SA .529
OPS .506
S .529
D .204
T .401
HR .670
BB .664
SO .784

Majors to majors

26-29 in first year

N=1406
Av PA used for weighting = 314

BA .416
OBA .546
SA .573
OPS .559
S .595
D .256
T .367
HR .702
BB .700
SO .828

Majors to majors

30+ in first year

N=1377
Av PA used for weighting = 314

BA .439
OBA .605
SA .653
OPS .646
S .587
D .226
T .385
HR .731
BB .755
SO .823

Majors to majors

35+ in first year

N=389
Av PA used for weighting = 288

BA .445
OBA .642
SA .721
OPS .713
S .549
D .300
T .380
HR .772
BB .749
SO .783


#13    Rally    2008/07/28 (Mon) @ 00:18

MGL,

Can you express those numbers using Tango’s formula showing how many PA are needed to regress 50%?

Looking at the first 2 sets of data, I don’t know what is better, an r = .665 at 126 average PA or an .814 at 295 PA.

Plugging the numbers in I get 68 and 64 respectively, so given your sample size, the r = .665 for minors to majors strikeouts is a slightly stronger correlation than for the majors to majors.


#14    tangotiger    2008/07/28 (Mon) @ 00:49

Doing just the first two:

Minors to majors

N=1738
Av PA used for weighting = 126

BA .223
OBA .258
SA .257
OPS .229

I find it hard to believe btw that the OPS correlation would be less than OBP and SLG.

Anyway, PA at r=.50 is (1-.229)/.229*126 = 424

So, 126 / (126+424) = .229

At r=.50, the PA for the majors to majors is:
(1-.586)/.586*295 = 208

For what it’s worth, I always use x=200 in my regression formula for:
r = PA / (PA+x)

That MGL is getting 208 is very comforting.

I’ll leave it to you guys to do the other PA at r=.50.
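
For anyone working through the rest, the arithmetic above can be wrapped in two small Python helpers. The function names are hypothetical; the formulas are the ones in this comment, x = (1-r)/r * PA and r = PA / (PA + x).

```python
def pa_at_half(r, avg_pa):
    """PA at which r would equal .50, given an observed r at avg_pa."""
    return (1 - r) / r * avg_pa

def expected_r(pa, x):
    """Expected correlation at a given PA, for regression constant x."""
    return pa / (pa + x)

# OPS figures quoted above.
x_minors_to_majors = pa_at_half(0.229, 126)
x_majors_to_majors = pa_at_half(0.586, 295)
print(round(x_minors_to_majors), round(x_majors_to_majors))

# Plugging x back in recovers the observed r.
print(round(expected_r(126, x_minors_to_majors), 3))
```

The same two calls work for any other stat in the tables: pass its r and the "Av PA used for weighting" for that group.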


#15    MGL    2008/07/28 (Mon) @ 01:04

I find it hard to believe btw that the OPS correlation would be less than OBP and SLG.

Why?  Ideally they should be about the same, no?  But with fluctuation, a little more or a little less?


#16    MGL    2008/07/28 (Mon) @ 01:14

As I’ve said before, I am not crazy about assuming that when you use every player in the regression, regardless of the number of PA in either of the two years, and then you look at the average number of PA that you used to weight each data pair, that this is the same thing as assuming that every data pair has exactly that number of PA.

For example, let’s say that you have these data pairs in your regression:

Player A year 1: 300 PA .350 OPS
Player A year 2: 20 PA .340 OPS

Player B year 1: 200 PA .330 OPS
Player B year 2: 100 PA .350 OPS

Player C year 1: 50 PA .310 OPS
Player C year 2: 5 PA .315 OPS

So we use 20, 100, and 5 PA for the weights.  Which gives us an “average number of PA” (that we used for the weightings) of 41.7.

Will the resultant correlation using the weightings of 20, 100, and 5 be around the same as if we had 3 players whose PA in each of the 2 years was exactly 41.7?  I don’t know.  I would have to run some “sims” to check this. 

This is essentially what we are doing.  Assuming that whatever we get for our “r” applies to a player who has 41.7 PA in one time period.

So we are assuming that if we had in our regression 20 of these players:

Player A year 1: .350 OPS
Player A year 2: .340 OPS

100 of these players:

Player B year 1: .330 OPS
Player B year 2: .350 OPS

and 5 of these players:

Player C year 1: .310 OPS
Player C year 2: .315 OPS

That the resultant “r” if we ran a regression of these 125 data pairs

Would be exactly the same as if we had 3 players, each with 41.7 PA and if we used a weighted average of each of the above players’ year 1 and year 2 OPS.  Or something like that.

Maybe it is close enough, but I am not really sure and I never got a satisfactory answer from any of the statisticians like Pizza or Andy.
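
A rough sketch of the kind of "sim" that could test this is below. Everything in it is an assumption made for illustration: the talent distribution (mean .330, SD .030), the normal approximation to binomial luck, and the use of the three PA patterns from the example above repeated over many simulated players. It makes no claim about what the answer is; it only sets up the comparison between the mixed-PA weighted r and the r for players who all have the average weight.

```python
import math
import random

def weighted_corr(x, y, w):
    """Weighted Pearson correlation."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(vx * vy)

TALENT_MEAN = 0.330  # assumed league-average rate
TALENT_SD = 0.030    # assumed spread of true talent

def simulate(pa_pairs, seed=1):
    """Weighted year-to-year r for simulated players with given (PA1, PA2)."""
    rng = random.Random(seed)
    noise_var_per_pa = TALENT_MEAN * (1 - TALENT_MEAN)  # variance of one PA
    xs, ys, ws = [], [], []
    for pa1, pa2 in pa_pairs:
        talent = rng.gauss(TALENT_MEAN, TALENT_SD)
        # Normal approximation to the luck around each player's true rate.
        xs.append(talent + rng.gauss(0, math.sqrt(noise_var_per_pa / pa1)))
        ys.append(talent + rng.gauss(0, math.sqrt(noise_var_per_pa / pa2)))
        ws.append(min(pa1, pa2))  # weight by the smaller PA
    return weighted_corr(xs, ys, ws)

# The three PA patterns from the example, repeated for 21,000 players.
mixed = [(300, 20), (200, 100), (50, 5)] * 7000
avg_weight = sum(min(a, b) for a, b in mixed) / len(mixed)  # about 41.7
equal = [(avg_weight, avg_weight)] * len(mixed)

r_mixed = simulate(mixed)
r_equal = simulate(equal)
# Theory for the equal-PA case only: r = PA / (PA + x).
x = TALENT_MEAN * (1 - TALENT_MEAN) / TALENT_SD ** 2
r_theory = avg_weight / (avg_weight + x)
print(round(r_mixed, 3), round(r_equal, 3), round(r_theory, 3))
```

Comparing `r_mixed` with `r_equal` across several seeds and PA mixes would show how far the "average weight" shortcut is from the equal-PA case under these assumptions.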


#17    tangotiger    2008/07/28 (Mon) @ 07:49

Andy I think does 1/PA, take the average, and then do 1/average.  Harmonic mean I believe it is called.


#18    MGL    2008/07/28 (Mon) @ 11:24

Sure, I occasionally mention that taking the min of two numbers, which we often do, is a shortcut (and is more “conservative”) for the harmonic mean.


#19    Tangotiger    2008/07/28 (Mon) @ 12:02

If it wasn’t clear, I was responding to this:

So we use 20, 100, and 5 PA for the weights.  Which gives us an “average number of PA” (that we used for the weightings) of 41.7.

Will the resultant correlation using the weightings of 20, 100, and 5 be around the same as if we had 3 players whose PA in each of the 2 years was exactly 41.7?  I don’t know.

The harmonic mean is 11.5, as opposed to the regular mean of 41.7.

So, the suggestion is that 3 players of 11.5 would be the same as a 5, 20, and 100 in terms of the reliability of the correlation.
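
The two averages quoted here can be reproduced with a few lines of Python, following the "1/PA, take the average, then 1/average" recipe. The variable names are hypothetical.

```python
weights = [20, 100, 5]  # the min-PA weights from the example

# Regular (arithmetic) mean.
arithmetic_mean = sum(weights) / len(weights)

# Harmonic mean: average the reciprocals, then take the reciprocal.
mean_of_reciprocals = sum(1 / w for w in weights) / len(weights)
harmonic_mean = 1 / mean_of_reciprocals

print(round(arithmetic_mean, 1), round(harmonic_mean, 1))
```

The harmonic mean is pulled strongly toward the smallest value, which is why a single 5 PA season drags it so far below the regular mean.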


#20    MGL    2008/07/28 (Mon) @ 22:13

Tango, I don’t think that is right. He (Andy) is not suggesting taking the harmonic mean of all the “min PA’s.” He is suggesting using the harmonic mean (of the two PA’s) for each data pair rather than the min of the two PA’s.  Why are you computing the harmonic mean of 20, 100, and 5?

Unless I have that wrong.  Are you suggesting that we use the harmonic mean of the two PA numbers for each data pair, and THEN weight each data pair be the harmonic mean of all the harmonic means?  I don’t think that is right.
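
To make the distinction concrete, here is a small sketch of the per-pair version, using the PA figures from the earlier three-player example. The variable names are hypothetical. Each data pair gets its own weight, either the min of its two PA counts or their harmonic mean.

```python
# (year 1 PA, year 2 PA) for players A, B, and C in the example.
pairs = [(300, 20), (200, 100), (50, 5)]

# Weight option 1: the smaller of the two PA counts.
min_weights = [min(a, b) for a, b in pairs]

# Weight option 2: the harmonic mean of the two PA counts.
harmonic_weights = [2 / (1 / a + 1 / b) for a, b in pairs]

for (a, b), m, h in zip(pairs, min_weights, harmonic_weights):
    print(a, b, m, round(h, 1))
```

For two numbers the harmonic mean always falls between the min and twice the min, which is the sense in which the min is the more conservative weight.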


#21    MGL    2008/07/29 (Tue) @ 01:35

be the harmonic mean of all the harmonic means?

“be” should be “by”


#22    tangotiger    2008/07/29 (Tue) @ 07:33

Your second paragraph.

I don’t know if that is correct but it should be easy enough to test.

