THE BOOK--Playing The Percentages In Baseball


Thursday, September 25, 2008

Who wants to run a regression for me?

By Tangotiger, 05:52 PM

Here’s the data:
Age PA1 PA2
24 2250 2100
28 2250 1900
32 2250 1400
36 2250 1000
24 1500 2000
28 1500 1300
32 1500 700
36 1500 350

Your job is to use the first two columns to estimate the third.  Ideally, we want no bias.

I was thinking of something on the order of:
a*(Age-b) + c*(PA^d)
I’ve tried
a=100, b=28, c=.83, d=1
a=100, b=28, c=36, d=0.5

I’m hoping someone can come up with the best combination of the above form.  Or, if you think you need to add an extra parameter, like PA*Age, or PA/Age, by all means, do so.
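
One way to compare the two trial parameter sets is to score each on the eight summary rows. The sketch below is only illustrative: it assumes the age term is read as a*(b - Age), the sign convention the author settles on in the first reply, and the helper names (`predict`, `score`) are made up for this example. It reports the mean error (bias) and the RMSE for each set.

```python
import math

# Summary rows from the post: (age, PA1, PA2).
ROWS = [
    (24, 2250, 2100), (28, 2250, 1900), (32, 2250, 1400), (36, 2250, 1000),
    (24, 1500, 2000), (28, 1500, 1300), (32, 1500, 700), (36, 1500, 350),
]

def predict(age, pa1, a, b, c, d):
    """Proposed form, with the age term read as a*(b - age) (an assumption)."""
    return a * (b - age) + c * pa1 ** d

def score(params):
    """Return (bias, rmse) of the predictions over the eight summary rows."""
    errors = [predict(age, pa1, *params) - pa2 for age, pa1, pa2 in ROWS]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return bias, rmse

# The two trial parameter sets (a, b, c, d) listed in the post.
TRIALS = {
    "linear (d=1)": (100, 28, 0.83, 1),
    "square root (d=0.5)": (100, 28, 36, 0.5),
}

RESULTS = {name: score(params) for name, params in TRIALS.items()}
for name, (bias, rmse) in RESULTS.items():
    print(f"{name}: bias={bias:.1f} PA, rmse={rmse:.1f} PA")
```

A bias near zero with a large RMSE would mean the errors cancel on average but are individually big, which is why both numbers are worth looking at.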


#1          (see all posts) 2008/09/25 (Thu) @ 18:56

I ran a non-linear regression and used:
a=-116.875, b=30.0225, c=1.01097, d=0.954205

This gives the following predicted third-column values (actual PA2 in parentheses):
2301 (2100)
1834 (1900)
1366 (1400)
899 (1000)
1789 (2000)
1321 (1300)
854 (700)
386 (350)


#2    tangotiger      (see all posts) 2008/09/25 (Thu) @ 20:15

Great job, thanks.

I should have written the age term as (b-Age), but that just means flipping the sign of a.

Interesting that you got b=30.  I was using 28.  Good job there.  Nice symmetry that we can just make c=1.  d is pretty close to 1 as well.  That’s good.

The killer is age 24.  At that age, PA1 evidently matters little, since both groups end up with similar PA2 (2100 vs. 2000).  In your fit (as in mine), there’s a big gap between them.

I’m wondering if we can do something like PA1/(Age-e) or some combination of the sort.

Again, thanks much.

Looking forward to seeing whether anyone else can bridge the gap at age 24 (without cheating and saying “if age = 24”).
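
The age-24 problem follows from the additive form itself: with an age term plus a separate PA1 term, the predicted gap between the two PA1 levels is c*(2250^d - 1500^d) at every age. A small sketch using the fitted values from comment #1 (the function and variable names are illustrative, not from the thread) compares that constant gap with the gaps in the summary data:

```python
# Fitted values reported in comment #1 for a*(Age - b) + c*PA1^d.
A, B, C, D = -116.875, 30.0225, 1.01097, 0.954205

# Actual PA2 from the summary table, keyed by (age, PA1).
ACTUAL = {
    (24, 2250): 2100, (28, 2250): 1900, (32, 2250): 1400, (36, 2250): 1000,
    (24, 1500): 2000, (28, 1500): 1300, (32, 1500): 700, (36, 1500): 350,
}

def predict(age, pa1):
    """Additive model: one term for age, one for PA1, no interaction."""
    return A * (age - B) + C * pa1 ** D

AGES = (24, 28, 32, 36)
predicted_gaps = [predict(age, 2250) - predict(age, 1500) for age in AGES]
actual_gaps = [ACTUAL[(age, 2250)] - ACTUAL[(age, 1500)] for age in AGES]

for age, model_gap, data_gap in zip(AGES, predicted_gaps, actual_gaps):
    print(f"age {age}: model gap {model_gap:.0f} PA, data gap {data_gap} PA")
```

Because the model gap cannot vary with age, a term that involves both age and PA1 is a natural thing to try.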


#3    terpsfan101      (see all posts) 2008/09/25 (Thu) @ 21:10

So regression can be used to estimate missing data?  This is probably a stupid question, but it’s been 8 or 9 years since I had Algebra II in high school.  Typically, when estimating missing data, I use a fixed rate, such as Reached on Error = .63 * Errors.  Would regression be helpful in cases like this?


#4          (see all posts) 2008/09/25 (Thu) @ 21:10

I tried adding a PA1/(Age-e) term and the values didn’t budge.  I played around with some other terms, and tried adding an e*Age*PA1 term.  That seems to have reined in the age-24 predictions somewhat without too much sacrifice at the other ages.

a=226.25, b=29.78, c=-0.633, d=1.061, e=0.0583

new eqn.  (PA2) [old eqn.]
2170 (2100) [2301]
1790 (1900) [1834]
1410 (1400) [1366]
1030 (1000) [899]
1920 (2000) [1789]
1365 (1300) [1321]
810 (700) [854]
255 (350) [386]


#5    tangotiger      (see all posts) 2008/09/25 (Thu) @ 21:18

Ooohhh… sweet.  Let me play around with that.  Thanks again.


#6    salb918      (see all posts) 2008/09/25 (Thu) @ 22:45

Here’s another option: get rid of the “d” parameter and you get:

a = -116.875
b = 30.5348
c = 0.6833

with 95% confidence intervals (lower, upper)

a: -149.1755 -84.5745
b: 24.2309 36.8387
c: 0.2981 1.0685

No singularities this time.


#7    salb918      (see all posts) 2008/09/25 (Thu) @ 22:48

That comment was meant as a follow-up to this, which didn’t post the first time:

I can somewhat confirm those numbers.  I get:

a = -116.8750
b = 30.2359
c = 0.8624
d = 0.9728

The predicted PA2 are:
2301.3
1833.8
1366.3
0898.8
1788.8
1321.3
0853.7
0386.2

However, my fitting algorithm warns that the Jacobian is ill-conditioned.  What that means is that the confidence intervals on the parameter estimates will be very large.  Indeed, the 95% confidence intervals (for a, b, c, and d, in that order) are:

-158.1613 -75.5887
-139152281769.4312 139152281829.903
-91021776753.2635 91021776754.9884
-12333398007.9523 12333398009.8978

So basically, we have an idea for what a is, but no idea for b, c, and d.

Tango, do you have a rationale behind the model?  These models work best when they mimic some sort of physical reality so we can place upper and lower bounds on the parameters prior to running the regression.
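
One way to see why the intervals explode: the summary table contains only two distinct PA1 values (2250 and 1500), so the data can pin down the age slope and one level for each PA1 group, three quantities in all, while the model has four parameters. Very different (b, c, d) combinations can therefore fit equally well. The sketch below (helper names are illustrative) checks this with the estimates from comments #1, #6 and #7:

```python
# (age, PA1) pairs from the summary table.
ROWS = [(24, 2250), (28, 2250), (32, 2250), (36, 2250),
        (24, 1500), (28, 1500), (32, 1500), (36, 1500)]

# (a, b, c, d) as reported in the thread; comment #6 drops d, i.e. d = 1.
FITS = {
    "comment #1": (-116.875, 30.0225, 1.01097, 0.954205),
    "comment #7": (-116.875, 30.2359, 0.8624, 0.9728),
    "comment #6": (-116.875, 30.5348, 0.6833, 1.0),
}

def predict(age, pa1, a, b, c, d):
    """Model form from the post: a*(Age - b) + c*PA1^d."""
    return a * (age - b) + c * pa1 ** d

predictions = {
    name: [predict(age, pa1, *params) for age, pa1 in ROWS]
    for name, params in FITS.items()
}

# Largest difference between any fit and the comment #1 fit, over all rows.
reference = predictions["comment #1"]
max_spread = max(
    abs(p - r)
    for values in predictions.values()
    for p, r in zip(values, reference)
)
print(f"largest disagreement between the three fits: {max_spread:.2f} PA")
```

The three parameter sets differ noticeably in b, c and d yet agree on every prediction to within about one PA, which is the practical meaning of the warning.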


#8    salb918      (see all posts) 2008/09/25 (Thu) @ 23:03

Adam/4 - neat idea.

If you use a similar model but remove the exponent:

PA2 = a*(age-b) + c*(PA1) + d*PA1*age

(kinda like a Taylor expansion)

You get

a = -226.2500
b = 30.2762
c = -1.0667
d = 0.0583

and predicted response

2170
1790
1410
1030
1920
1365
0810
0255

with confidence intervals:

-351.0695 -101.4305
27.8043 32.7482
-3.0466 0.9133
-0.0069 0.1236

The PA*age cross term is definitely very small.
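
Without the exponent the model is linear in its coefficients, since a*(age-b) + c*PA1 + d*PA1*age expands to k + a*age + c*PA1 + d*PA1*age with k = -a*b, so ordinary least squares applies. The sketch below is one way to check the estimates on the eight summary rows using only the Python standard library; it solves the normal equations with exact fractions to sidestep rounding, and the function names are invented for this example.

```python
from fractions import Fraction

# Summary rows from the post: (age, PA1, PA2).
ROWS = [
    (24, 2250, 2100), (28, 2250, 1900), (32, 2250, 1400), (36, 2250, 1000),
    (24, 1500, 2000), (28, 1500, 1300), (32, 1500, 700), (36, 1500, 350),
]

def solve(matrix, rhs):
    """Solve a small square linear system exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot_row = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [x / pivot for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def fit_cross_term(rows):
    """Least squares for PA2 = k + s*age + c*PA1 + d*PA1*age."""
    X = [[Fraction(1), Fraction(age), Fraction(pa1), Fraction(pa1 * age)]
         for age, pa1, _ in rows]
    y = [Fraction(pa2) for _, _, pa2 in rows]
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(4)] for i in range(4)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(4)]
    return solve(xtx, xty)

k, s, c, d = fit_cross_term(ROWS)
a, b = s, -k / s  # rewrite k + s*age as a*(age - b)
print(f"a={float(a):.4f} b={float(b):.4f} c={float(c):.4f} d={float(d):.4f}")
```

For a 4000-row file, a floating-point least-squares routine from a numerical library would be the more usual choice; exact fractions are used here only because the problem is tiny.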


#9    will      (see all posts) 2008/09/26 (Fri) @ 06:43

What happens if we have an exponent on the age term?


#10    Tangotiger      (see all posts) 2008/09/26 (Fri) @ 10:09

Great job guys. 

I presume you guys are using some specialized software?  If I give you a dataset with 4000 records (same columns), is that something the software can handle easily?

What I gave you was a summary of 8 lines.  When I try it out on the full data set, aiming for “nice” numbers, I get the following through trial and error for the 5 parameters (a, b, c, d, e), respectively:
250 29.25 -1 1 0.0608

Let me know if Adam, Sal or anyone else wants to give it a go.

Thanks…


#11          (see all posts) 2008/09/26 (Fri) @ 10:41

Tom, I was using Mathematica which has a non-linear regression function.  If you want to email me your data set to the linked address I can give it a shot, probably tomorrow.  I can’t make any promises as I inputted your 8 summary lines manually, although I can probably figure out an easier way.


#12    Pizza Cutter      (see all posts) 2008/09/26 (Fri) @ 10:51

Eight data points aren’t going to construct much of a regression, no matter what techniques you use.  Even then, no one has reported a goodness of fit index.  That’s the real test.

I’m not at my “stat” computer (i.e. my laptop) right now, but when I get home tonight, I might give it a whirl.  Tom, I have software that can handle 4000x3 easy, if you’d like to send it over.  (I have it running the full Retrosheet data file for six years right now...)


#13    Tangotiger      (see all posts) 2008/09/26 (Fri) @ 11:38

I sent the file to Pizza, Sal, and Adam.  Let’s see what they come back with.  I’ll tell you what I did after the results…


#14    salb918      (see all posts) 2008/09/26 (Fri) @ 13:06

Got the file, thanks Tom.  I’m using MATLAB, which can run 4000x3 no problem.  I’ll give it a try tonight (actually - GASP - working at work today).

PC, what’s the preferred “goodness of fit” metric?  MSE?

I meant to point out that a 3 or 4 parameter model with only eight data points is going to give you a (close to) garbage answer pretty much no matter what.  PC beat me to it.


#15    Tangotiger      (see all posts) 2008/09/26 (Fri) @ 13:13

Actually, the 8 data points I gave you were not a random sample, but group-level data based on age and PA levels.  So, I am quite confident (positive, in fact) that the results you get from the 3000+ record sample I gave you will be very, very close.

For example, I can ask you to run a regression of FIP to ERA, and all I have to do is give you 8 data points, each a group based on ERA rank (the top 1/8th in ERA, the 2nd 1/8th, and so on), and you will get an r=.99 or so, with a slope of 1 and an intercept of 0.

Anyway, I’m sure (hope anyway) you guys agree with me that your claim applies only to randomly sampled data, and not to selected-group data like mine.

***

As for the best fit, I think minimizing the squared differences (i.e., the RMSE) would fit the bill.  Alternatively, find the lowest “x” such that 68.6% of the data points are within “x” PA of the prediction; that’d be nice too.  I would expect that “x” to match the RMSE.  However, because of the skew in the data, maybe that “x” can be a great deal lower.
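
Both yardsticks are easy to compute from a list of prediction errors. As a rough illustration (not the author's code), the sketch below scores the predictions posted in comment #8 against the eight summary rows. `coverage_radius` is a hypothetical helper that returns the smallest x such that at least the requested share of absolute errors is within x PA, and the 0.68 share is a round-number stand-in for the roughly-two-thirds figure.

```python
import math

# Actual PA2 from the summary table, in row order.
ACTUAL = [2100, 1900, 1400, 1000, 2000, 1300, 700, 350]
# Predicted PA2 as posted in comment #8 (cross-term model, no exponent).
PREDICTED = [2170, 1790, 1410, 1030, 1920, 1365, 810, 255]

def rmse(predicted, actual):
    """Root mean squared error of the predictions."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def coverage_radius(predicted, actual, share):
    """Smallest x such that at least `share` of the absolute errors are <= x."""
    abs_errors = sorted(abs(p - a) for p, a in zip(predicted, actual))
    needed = math.ceil(share * len(abs_errors))
    return abs_errors[needed - 1]

bias = sum(p - a for p, a in zip(PREDICTED, ACTUAL)) / len(ACTUAL)
fit_rmse = rmse(PREDICTED, ACTUAL)
radius = coverage_radius(PREDICTED, ACTUAL, 0.68)

print(f"bias = {bias:.1f} PA")
print(f"RMSE = {fit_rmse:.1f} PA")
print(f"x for 68% coverage = {radius} PA")
```

With only eight rows the coverage figure is crude; on the full data set the same two functions would give a more meaningful comparison of x against the RMSE.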


#16    salb918      (see all posts) 2008/09/26 (Fri) @ 18:24

Here goes…

Using the full dataset, and using the following model:

Y = a*(b-age) + c*(PA) + d*PA.*age

I get

a = 203.0559
b = 28.9264
c = -0.4198
d = 0.0418

with uncertainties

172.0032 234.1087
28.3323 29.5205
-0.9321 0.0924
0.0254 0.0583

and RMSE 690.


#17    salb918      (see all posts) 2008/09/26 (Fri) @ 18:28

Using the full dataset, and using the following model:

Y = a*(b-age) + c*(PA^d)

I get

a = 125.812
b = 34.1585
c = 0.0002
d = 2.025

with uncertainties

118.6128 133.0112
32.2130 36.1040
-0.0007 0.0011
1.4287 2.6213

and RMSE 691.


#18    tangotiger      (see all posts) 2008/09/26 (Fri) @ 21:52

Cool, thanks.  Can you take your second one and add:
+ e*PA*age

The b=34 seems out of place, and maybe adding this extra component will clean it up.

It’s also interesting to note that the RMSE of both stayed the same.


#19          (see all posts) 2008/09/26 (Fri) @ 22:07

using Y = a*(b-age) + c*(PA^d) + e*PA*age

a = 205.969
b = 29.8
c = -1.96217
d = 0.834579
e = 0.0434345

with uncertainties
174.811, 237.126
23.2367, 36.3633
-20.5213, 16.597
-0.269965, 1.93912
0.0269306, 0.0599385

and RMSE 690.34, compared to the RMSE of 690.22 without the exponent “d”.

I have been playing around for a while using all sorts of other functions and have not been able to get the RMSE much below 690.


#20    salb918      (see all posts) 2008/09/26 (Fri) @ 22:09

Model: PA2 = a*(b-age) + c*PA1.^d + e*PA1.*age

a 218.5957
b 126.7953
c -1.4747e+004
d 0.0572
e 0.0505

RMSE 689

confidence intervals

187.4929 249.6985
-1698.4 1952.0
-3.8661e+005 3.5712e+005
-0.9766 1.0909
0.0340 0.0670


#21    Pizza Cutter      (see all posts) 2008/09/26 (Fri) @ 23:55

I ran a model of

a*age + b*age^2 + c*pa1 + d*pa1^2 + e*age*pa1 + f

Got an RMSE of 687 something.  However, let’s look at this from a little different standpoint.  Clearly, Tom, you’re trying to predict playing time.  The guys who had zero in the PA2 column clearly retired.  Maybe they went out on their own terms, or maybe they tried desperately to catch on with someone in spring training, but they were clearly not involved in the next season.

So, I knocked out everyone who had zero PA2’s.  Using the same model as above, that knocked down the RMSE to 648 (notably, with fewer data points).  You may want to run a two-step process.  First model the chances that the player will have a job in the next year (or two), and then model what his playing time is expected to be if he comes back.


#22    tangotiger      (see all posts) 2008/09/27 (Sat) @ 08:31

What were the values of your coefficients?

***

Good point about the attrition.  I was hoping to start off with a relatively clean model first, and then start tweaking it.

The data I am looking at is: total PA for the 4 years prior to the age in question, for all players born since 1895.  I compare that with the total number of PA for the 4 years starting at the age in question.

So, a guy who is 24 years old, and had 2200 PA over the last 4 years got 2000 PA (or whatever the number actually was) for the 4 subsequent years.  A guy who is 36 years old in the same condition ended up with 1000 PA (or whatever the number is).

Once I have a general model like this, I was going to break down the 4 year prior PA into something more intelligent.  After all, the 36yr old can be dinged if he was actually out of baseball at age 35 (managed to get 2200 PA somehow from age 33-35).  So, I was going to include PA1, PA2, PA3, PA4 for the 4 prior years and run a regression based on that.

But in order to get a handle on the age/PA relationship, I started as I did.

I also find that working with 4 year sets, I can get a better handle on the data.

Finally, I was going to include position and quality of play as a further determinant.

So, finally, I’ll be able to say: a 27-year-old catcher who had 600 PA last year and 1400 PA in the three previous years, with a true talent level of 120% of league average, will play x number of PA over the next 4 years.

I hope there’s at least one other person as excited as I am as to the prospect…


#23    Pizza Cutter      (see all posts) 2008/09/27 (Sat) @ 11:07

With the PA2 > 0 restriction,

-413.488 * age + -1.432 * PA1 + .037 * PA1 * age + 0.000285 * PA1^2 + 3.683 * age^2 + 10039.667

RMSE = 648.84

Without the PA2 > 0 restriction,

-437.320 * age + -1.099 * PA1 + .033 * PA1 * age + 0.000248 * PA1^2 + 4.043 * age^2 + 10058.074

RMSE = 687.37


#24    tangotiger      (see all posts) 2008/09/27 (Sat) @ 16:21

Cool, I’m going to play around with these results on Monday.


#25    Tangotiger      (see all posts) 2008/10/01 (Wed) @ 11:59

OK, let’s do the two-step process.  I took players’ total PA over a 4-year period (T-4 through T-1), keeping only those who had at least one PA in T-1.  Then, among these players, I figured out how many PA they had in the next 4 years.  If they had 100 or fewer total PA over that 4-year span (T+0 through T+3), then I considered them out of baseball.

This attrition rate is:
-1.1 + PA/3000 + Age/20 - Age*PA/60000

So, a 30 year old (at T0) nonpitcher with 1500 PA in T-4 through T-1 (and who played in T-1) is expected to be out of baseball 15% of the time.

Works reasonably well.  You have to force a floor of 0% for the young players with lots of PA.
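
A direct transcription of the attrition formula, with the 0% floor applied, might look like the following. The function name is invented here, and no upper cap is applied because the post does not mention one.

```python
def attrition_rate(age, pa_prior):
    """Estimated chance of being out of baseball over T+0 through T+3.

    age: age at T0; pa_prior: total PA in T-4 through T-1.
    Uses the formula from the post, floored at 0 as the post suggests.
    """
    raw = -1.1 + pa_prior / 3000 + age / 20 - age * pa_prior / 60000
    return max(0.0, raw)

# The worked example from the post: a 30-year-old with 1500 prior PA.
print(f"{attrition_rate(30, 1500):.0%}")
# A young player with lots of PA, where the raw formula goes negative.
print(f"{attrition_rate(22, 2400):.0%}")
```

In a two-step estimate this rate would be paired with a separate model of playing time for the players who do stay in the game.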

The next parameter I need to add is number of PA at T-1.

