THE BOOK--Playing The Percentages In Baseball


Tuesday, August 24, 2010

Regression equations for pitcher events

By Tangotiger, 03:19 PM

Sit down, because I’m going to need your patience and attention.


In an excellent article at THT, Harry gives us the correlation (r) for various components at various numbers of trials.

For example, he took the first 250 PA for each of 678 pitchers, split them into even and odd halves of 125 PA each, and got a correlation for BB/PA of r=.336 between the two halves.  What can we do with this?  Well, we can come up with a general regression equation, which is simply done as:

A = (1-.336)/.336 * 125 = 247

r = PA / (PA + 247)

That simply means that if you had two groups of pitchers each with 247 PA in each pool, and you ran a correlation between the two groups, you’d get r=.500.
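The arithmetic above can be sketched in a few lines of Python. The function names are hypothetical and only for illustration; the two formulas are the ones given above.

```python
def half_point(r, pa):
    """PA per pool at which the split-half correlation is estimated to hit .50."""
    return (1 - r) / r * pa


def expected_r(pa, a):
    """Estimated correlation between two pools of `pa` PA each, given A = `a`."""
    return pa / (pa + a)


a = half_point(0.336, 125)           # Harry's 125-PA pools, r = .336
print(round(a))                      # 247
print(round(expected_r(247, a), 2))  # 0.5
```

Feeding in the 250-PA pools (r=.534) gives an A of about 218 from the same function.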

In fact, Harry also ran it for pitchers through the first 500 PA (meaning 250 PA in each group), and the r for that pool was r=.534.

So, what I did was simply run the equation to figure out the “A” for each pool of PA he ran, and I ended up with the following for BB/PA:

n: number of pitchers in pool
PA: number of PA in each pool for each pitcher
A: the number of PA at which r is estimated to be r=.50

n PA A
931 25 227
846 50 231
782 75 253
726 100 248
678 125 247
648 150 280
606 175 269
578 200 235
549 225 230
525 250 218
494 275 230
472 300 260
454 325 261
429 350 271
404 375 257
385 400 253
363 425 239
340 450 262
315 475 221
291 500 224
276 525 220
265 550 216
251 575 234
242 600 224
222 625 223
208 650 208
197 675 197
179 700 215
168 725 217
162 750 215
154 775 218
146 800 228
145 825 220
140 850 221
137 875 218
136 900 201
130 925 208
128 950 208
122 975 210
120 1000 212
112 1025 181
107 1050 190
104 1075 189
100 1100 197
93 1125 191
92 1150 171
89 1175 188
87 1200 206
86 1225 195
85 1250 227
81 1275 236
78 1300 278
73 1325 292
71 1350 314
66 1375 384
63 1400 372
63 1425 394
59 1450 441
57 1475 415
53 1500 431
52 1525 424
48 1550 421
43 1575 399
41 1600 390
38 1625 370
36 1650 333
34 1675 329
33 1700 306
32 1725 319
28 1750 197
24 1775 227
24 1800 229
23 1825 222
23 1850 219
20 1875 245
17 1900 262
16 1925 318
16 1950 319
15 1975 291
13 2000 346

As you can see, the numbers pretty much hover around 250 PA.  Therefore, we can create the following equation to regress BB/PA for any number of PA:

regression rate = 250 / (250 + PA)

(Note also that the regression rate = 1 - r.)
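Here is a minimal sketch of how that regression rate might be applied to one pitcher. The pitcher's line and the league BB/PA figure are made-up numbers for illustration, and regressing toward the league rate is an assumption consistent with the "league average" language used in the comments.

```python
def regression_rate(pa, a=250):
    """Share of the estimate that comes from the mean: A / (A + PA)."""
    return a / (a + pa)


def regress(observed, pa, league_rate, a=250):
    """Pull an observed rate toward the league rate by the regression rate."""
    w = regression_rate(pa, a)
    return w * league_rate + (1 - w) * observed


# Hypothetical pitcher: 12% BB/PA over 250 PA, assumed league rate of 8.5%.
print(round(regress(0.12, 250, 0.085), 4))  # 0.1025
```

At exactly 250 PA the observed rate and the league rate get equal weight, which is what r=.50 means here.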

How well does it do?  I’ll add the observed correlation (r) and the estimated r based on the above equation:
n PA A r est
931 25 227 0.10 0.09
846 50 231 0.18 0.17
782 75 253 0.23 0.23
726 100 248 0.29 0.29
678 125 247 0.34 0.33
648 150 280 0.35 0.37
606 175 269 0.39 0.41
578 200 235 0.46 0.44
549 225 230 0.49 0.47
525 250 218 0.53 0.50
494 275 230 0.54 0.52
472 300 260 0.54 0.54
454 325 261 0.56 0.56
429 350 271 0.56 0.58
404 375 257 0.59 0.60
385 400 253 0.61 0.61
363 425 239 0.64 0.63
340 450 262 0.63 0.64
315 475 221 0.68 0.65
291 500 224 0.69 0.67
276 525 220 0.71 0.68
265 550 216 0.72 0.69
251 575 234 0.71 0.70
242 600 224 0.73 0.71
222 625 223 0.74 0.71
208 650 208 0.76 0.72
197 675 197 0.77 0.73
179 700 215 0.76 0.74
168 725 217 0.77 0.74
162 750 215 0.78 0.75
154 775 218 0.78 0.76
146 800 228 0.78 0.76
145 825 220 0.79 0.77
140 850 221 0.79 0.77
137 875 218 0.80 0.78
136 900 201 0.82 0.78
130 925 208 0.82 0.79
128 950 208 0.82 0.79
122 975 210 0.82 0.80
120 1000 212 0.83 0.80
112 1025 181 0.85 0.80
107 1050 190 0.85 0.81
104 1075 189 0.85 0.81
100 1100 197 0.85 0.81
93 1125 191 0.86 0.82
92 1150 171 0.87 0.82
89 1175 188 0.86 0.82
87 1200 206 0.85 0.83
86 1225 195 0.86 0.83
85 1250 227 0.85 0.83
81 1275 236 0.84 0.84
78 1300 278 0.82 0.84
73 1325 292 0.82 0.84
71 1350 314 0.81 0.84
66 1375 384 0.78 0.85
63 1400 372 0.79 0.85
63 1425 394 0.78 0.85
59 1450 441 0.77 0.85
57 1475 415 0.78 0.85
53 1500 431 0.78 0.86
52 1525 424 0.78 0.86
48 1550 421 0.79 0.86
43 1575 399 0.80 0.86
41 1600 390 0.80 0.86
38 1625 370 0.81 0.87
36 1650 333 0.83 0.87
34 1675 329 0.84 0.87
33 1700 306 0.85 0.87
32 1725 319 0.84 0.87
28 1750 197 0.90 0.87
24 1775 227 0.89 0.88
24 1800 229 0.89 0.88
23 1825 222 0.89 0.88
23 1850 219 0.89 0.88
20 1875 245 0.88 0.88
17 1900 262 0.88 0.88
16 1925 318 0.86 0.88
16 1950 319 0.86 0.89
15 1975 291 0.87 0.89
13 2000 346 0.85 0.89

Pretty good, right?
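To see where the fit is tight and where it drifts, here is a rough sketch that computes observed minus estimated r for a handful of rows copied from the table. The choice of rows is arbitrary, and the helper names are hypothetical.

```python
A = 250  # the rounded constant used in the equation above

# (PA per pool, observed r) for a few rows copied from the table.
SAMPLE = [(25, 0.10), (250, 0.53), (1000, 0.83), (1450, 0.77), (2000, 0.85)]


def estimated_r(pa, a=A):
    """Estimated split-half correlation at `pa` PA per pool."""
    return pa / (pa + a)


def residuals(rows, a=A):
    """Observed r minus estimated r, rounded to two places, for each row."""
    return [(pa, round(r - estimated_r(pa, a), 2)) for pa, r in rows]


for pa, diff in residuals(SAMPLE):
    print(f"{pa:>5} {diff:+.2f}")
```

The largest miss in this sample is at 1450 PA, a stretch of the table where each pool has fewer than 70 pitchers.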

Repeating the above steps for K/PA, we get this regression equation:
regression rate = 83 / (83 + PA)

Simply put, you just need 83 PA in order to regress the K/PA rate by 50%.  That is, the K/PA metric stabilizes very very fast.

I’ll show the chart like this:
PA, Event
83, K
250, BB

What’s beautiful about showing it like that is you are told TWO things:
1. r=.50 is reached at that number of PA for each event
2. the regression equation is n/(n+PA), where n is the above number

Let’s add in HBP/PA:
PA, Event
83, K
250, BB
1800, HBP

Basically, the HBP “skill” takes a long time to find.

Harry also included GB, FB, LD, and PU.  Let’s see how those rate as skills:
PA, Event
83, K
83, GB
245, FB
250, BB
433, PU
630, LD
1800, HBP

So, yes, there absolutely is a line drive skill.
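As a sketch, the chart can be held as a lookup table and turned into a regression rate for any event at any number of PA. The dictionary and function names are hypothetical; the values are the per-PA figures listed above.

```python
# PA at which r = .50 for each event, all with PA as the denominator.
HALF_POINT = {"K": 83, "GB": 83, "FB": 245, "BB": 250,
              "PU": 433, "LD": 630, "HBP": 1800}


def regression_rate_for(event, pa):
    """n / (n + PA), where n is the event's r = .50 point."""
    n = HALF_POINT[event]
    return n / (n + pa)


# How much to regress each rate after 500 PA.
for event in HALF_POINT:
    print(f"{event:>3} {regression_rate_for(event, 500):.2f}")
```

After 500 PA, K and GB are regressed by about 14%, while LD is still regressed by more than half.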

Are there any problems here?  Well, one, and it might be big, it might not.  Harry did not select random samples, but sequential samples.  So, he might be correlating not only on pitcher, but park, and opponent too.  Ideally, (and maybe he is doing this through his intraclass correlation process), he would randomly select PA for each pool.

Great stuff.

#1    Harry Pavlidis      (see all posts) 2010/08/24 (Tue) @ 21:57

this is copy-and-paste of a comment I just left on the article at THT

BTW, re-running based on two changes
1) discovered some pre-2007 snuck in ... this will reduce sample sizes (we won’t get to 4000 at all) but will get rid of some yuk
2) randomized the plate appearance sequencing

so far, it looks like reliability is being dampened but it’s only run thru the 100 BF group level (long way to go)


#2    DSMok1      (see all posts) 2010/08/25 (Wed) @ 09:10

Brilliant work, Harry.

One question: would it not be better to look at the outcomes as part of a branching tree?  In other words: first look at K and BB and HBP per BF.  Then look at batted-ball type per BIP, rather than per BF.  Otherwise, the higher reliability of K and BB would bleed into the BIP data, wouldn’t it?  In terms that some pitchers have more BIP than others?

I like that you did outs and HR per BIP.  That’s really helpful.  I just think batted ball type should also be per BIP.


#3    DSMok1      (see all posts) 2010/08/25 (Wed) @ 09:14

Wait… shouldn’t outs and HR be per specific type (like outs vs. GB)?  Or is that how it was done?  It’s unclear.


#4    Harry Pavlidis      (see all posts) 2010/08/25 (Wed) @ 09:24

@DSMok1 It’s a good question, and I can run that, too.

The outs and HR in the 2nd set of charts at THT are based on specific batted ball type.


#5    DSMok1      (see all posts) 2010/08/25 (Wed) @ 10:25

I think it should be done as components, right?  Check out this old thread:

http://www.insidethebook.com/ee/index.php/site/comments/archives_component_regression/

http://www.insidethebook.com/ee/index.php/site/article/best_fit_equations_for_component_aging_curves/
(Gives a view of components to use)

I can’t find Tango’s really old threads about the appropriate way to branch an at-bat into components.


#6    Tangotiger      (see all posts) 2010/08/25 (Wed) @ 11:19

Definitely, it should be done as components.  As for the “right way”, there is no right way.  Pretty much anything half-reasonable will work.

Here’s the Voros method:
http://www.tangotiger.net/agepatterns.txt


#7    DSMok1      (see all posts) 2010/08/25 (Wed) @ 12:06

I guess walks and strikeouts can be split out concurrently (I don’t know why you’d really choose to do walks before the strikeouts).  HBP also.  Then the batted-ball types concurrently?  Should grounders be done first, since they are typically considered to be more skill-dependent?

I’m still trying to get my head around the order of such components, and why.


#8    Harry Pavlidis      (see all posts) 2010/08/25 (Wed) @ 12:19

For tRA, Graham does K/BF, BB/BF, HBP/BF, GB/BIP, LD/BIA, PU/BIA in that order IIRC


#9    DSMok1      (see all posts) 2010/08/25 (Wed) @ 12:30

I guess the whole point of the order of components is to:
1) Remove first results that directly influence the number of opportunities for others (i.e. K, BB, HBP before batted-ball results)
2) Remove from concurrent results the most consistent ones first, since they influence the chances for the less-consistent concurrent results.  I’m not sure this should always be done.

I like the order that Graham does them: concurrently K,BB,HBP, then GB, then LD, PU, and Fly concurrently.  Right?


#10    Harry Pavlidis      (see all posts) 2010/08/25 (Wed) @ 12:34

http://statcorner.com/tRAabout.html


#11    DSMok1      (see all posts) 2010/08/25 (Wed) @ 12:39

So it’s actually IFF before LD.  I think I agree with that.

The problem is the last few things are so collinear--it’s going to be very difficult to disentangle the correlations.  Would the best way be to look at them concurrently, and then choose the one with the highest correlation to run first?  Or would running LD, IFF, and OFF concurrently work just fine?


#12    Harry Pavlidis      (see all posts) 2010/08/25 (Wed) @ 12:45

My thinking, incomplete and subject to change, on this topic amounts to two things. First, simple is good to start, then wander towards complexity if you’re not past the point of diminishing returns already. That’s why I started with per BF.

Where I think it should go? Regress BIP:NIP (balls in play [gb ld fb pu] to not in play [k bb hbp]), then work within NIP to regress K, BB and HBP per NIP. With BIP, regress GB:BIA (balls in air). Within BIA, regress two of the rates, let the third fall-out. Once you have regressed the GB/LD/FB/PU you can then start regressing the outcomes by batted ball type (HR per FB, Outs per GB etc etc)


#13    Harry Pavlidis      (see all posts) 2010/08/25 (Wed) @ 18:17

Random sequencing makes a difference.

Here are revised values, running five (or more) random sequencing of plate appearances. There’s some variance in the point where r=.5 by random sequence, so the numbers below may be a little bit squishy.

Here are Tango’s values from above along with the new value
PA, Event, NEW
83, K , 120
83, GB , 170
245, FB , 330
250, BB , 430
433, PU , 500
630, LD , 1600
1800, HBP , 1800


#14    Harry Pavlidis      (see all posts) 2010/08/25 (Wed) @ 18:21

So, here’s what happens when you find r=.5 and use that to estimate at each # of batters faced

http://flic.kr/p/8vsJqW


#15    Tangotiger      (see all posts) 2010/08/25 (Wed) @ 19:25

Harry,

I just want to point out that I did not just look at r=.50.  I looked at all your data points.  If you send me your tabular data again, I can give you my estimate of the best fit of r=.50


#16    Harry Pavlidis      (see all posts) 2010/08/25 (Wed) @ 19:32

Just sent it.

Running LD,FB,PU per BIA now


#17    Brian Cartwright      (see all posts) 2010/08/26 (Thu) @ 01:19

>For tRA, Graham does K/BF, BB/BF, HBP/BF, GB/BIP, LD/BIA, PU/BIA in that order IIRC

Which is just about how I do Oliver - BB, SO & HP, then HR on balls contacted, base hits on balls in play, doubles on base hits, triples on doubles.

Now with the four types of batted balls, the rates have to sum to 1.00. As GB stabilizes much quicker (170) than the other three, I’d want to do that first. Then FB as a pct of air balls, PU as a pct of PU and LD, then LD is whatever’s left over.

I feel this confirms why I didn’t like concentrating on LD rate in evaluating batters and pitchers. I’d much rather start with GB rate, then regress FB, PU & LD to the pct of each per airball for pitchers of that given GB rate.

To illustrate, all mlb pitchers 2005-2010, grouped by gb rate, LD, FB, PU expressed as a pct of air balls

GB%  LD  FB  PU
.30 .28 .54 .18
.35 .30 .53 .17
.40 .32 .53 .15
.45 .35 .52 .13
.50 .37 .51 .12
.55 .40 .49 .11
.60 .43 .48 .09

So if a pitcher has a GB% of .30, regress his LD% to .28 - if the GB% is .60, regress the LD% to .43 (of air balls).
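A rough sketch of that lookup follows. The table values are the ones above; the linear interpolation between rows and the clamping outside the .30 to .60 range are assumptions made for illustration, not something described in the comment.

```python
# GB%, then LD / FB / PU as a share of air balls (from the table above).
TARGETS = [
    (0.30, 0.28, 0.54, 0.18),
    (0.35, 0.30, 0.53, 0.17),
    (0.40, 0.32, 0.53, 0.15),
    (0.45, 0.35, 0.52, 0.13),
    (0.50, 0.37, 0.51, 0.12),
    (0.55, 0.40, 0.49, 0.11),
    (0.60, 0.43, 0.48, 0.09),
]


def air_ball_targets(gb_rate):
    """Return (LD, FB, PU) regression targets for a pitcher's GB rate."""
    # Assumption: clamp to the table's range rather than extrapolate.
    gb_rate = min(max(gb_rate, TARGETS[0][0]), TARGETS[-1][0])
    for lo, hi in zip(TARGETS, TARGETS[1:]):
        if lo[0] <= gb_rate <= hi[0]:
            t = (gb_rate - lo[0]) / (hi[0] - lo[0])
            return tuple(a + t * (b - a) for a, b in zip(lo[1:], hi[1:]))


ld, fb, pu = air_ball_targets(0.475)
print(round(ld, 3), round(fb, 3), round(pu, 3))
```

Because every row of the table sums to 1.00, the interpolated targets do too.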


#18    Harry Pavlidis      (see all posts) 2010/08/26 (Thu) @ 07:44

@Brian thanks for sharing that


#19    Tangotiger      (see all posts) 2010/08/26 (Thu) @ 10:12

Harry gave me a new run of his data using the randomization, and I get the following best-fit estimates of the PA at which r=.50 (I used all data points):

PA, Event
70, GB
80, K
180, FB
195, BB
250, PU
880, LD
1120, HBP

As discussed, it all depends on the sequencing (what you put in the denominator).  If all the above use PA as the denominator, then they are directly comparable in terms of what stabilizes faster.

If on the other hand, it’s broken down into binary components, then they are not directly comparable.  One way to do the binary component is to remove from PA the events in this order:
HBP
BB
SO
GB
PU
LD

So, you would do HBP/PA
Then you would do BB/(PA-HBP)
And so on.

You try to logically remove those things that occur before other things should occur, so that you are trying to compare things to their “true” opportunities.  Obviously, it’s not so easy in all cases.  The above is one possible sequence, not even necessarily a suggested sequence.
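The "and so on" can be sketched like this: each event's rate is taken over the PA left after removing the events earlier in the order. The pitcher line is made up for illustration, and the order is just the one possible sequence listed above.

```python
ORDER = ["HBP", "BB", "SO", "GB", "PU", "LD"]


def sequenced_rates(pa, counts, order=ORDER):
    """Rate of each event per opportunity remaining after earlier events."""
    rates = {}
    remaining = pa
    for event in order:
        rates[event] = counts[event] / remaining if remaining else 0.0
        remaining -= counts[event]
    return rates


# Hypothetical pitcher line over 500 PA.
counts = {"HBP": 5, "BB": 45, "SO": 100, "GB": 150, "PU": 25, "LD": 60}
for event, rate in sequenced_rates(500, counts).items():
    print(f"{event:>3} {rate:.3f}")
```

Here BB is 45/495 rather than 45/500, and GB is 150/350, since HBP, BB, and SO have already been taken out of the denominator.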


#20    DSMok1      (see all posts) 2010/08/26 (Thu) @ 15:01

Wow.  In a per-PA sense, groundball rate stabilizes FAST!

Once it’s done in a sequence, we’ll have a really good feel for how to predict batted-ball types.


#21    Tangotiger      (see all posts) 2010/08/26 (Thu) @ 15:09

I should say that none of this is really news.  We’ve talked about this a few years ago, and I came up with numbers along those lines.  I don’t remember exactly what it was.

I remember having thread discussions with David Gassko I think, and probably studes and maybe Patriot.  Probably in 2007.  Check the “Batted_ball” category link.


#22    Harry Pavlidis      (see all posts) 2010/08/26 (Thu) @ 15:11

Tom, say r=.50 at 800 BF, meaning two split halves of 400 correlate at .5 (in short). You’re reporting/estimating an “A” of approx half that, roughly 400 in this hypothetical, not 800. Why is that? The half is not reliable, but the whole is.

For example. With 150 batters faced, the split half r for K/BF is (in one random seq.) .52, which predicts A of 138. This is close to what other observed r values at other BF predict. Your method would spit out 80, which is not a reliable sample nor is it half (at larger BF levels it gets closer to half the actual sample).

That’s why my A values are larger than yours. What am I missing here?


#23    DSMok1      (see all posts) 2010/08/26 (Thu) @ 15:48

@ Tango/21

I haven’t seen split half correlation tests for batted ball types, properly sequenced.  IIRC, Pizza’s StatSpeak article on when things stabilized was per PA, rather than sequenced.



#25    Tangotiger      (see all posts) 2010/08/26 (Thu) @ 16:29

Harry, think of it if you do year-to-year correlation, with 400 PA in each season.

If you get an r=.80, what does this mean?

Well, you would need to add 100 PA of league average performance to the 400 PA of year X in order to estimate the performance of year X+1.

In your case, you are taking the 800 PA and doing your magic to get r=.80.  It’s still 400 PA in each sample, and so, that’s how I have to handle it.

Do you disagree?

I’m not 100% sure I’m right, but I think I’m 99% sure.
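The year-to-year example in #25 can be sketched as follows. The function names, the strikeout rate, and the league rate are hypothetical; the relationship between r, PA, and the league-average PA to add is the one from the post.

```python
def league_pa_to_add(r, pa):
    """PA of league-average performance to add to `pa` observed PA."""
    return pa * (1 - r) / r


def projected_rate(observed, pa, league_rate, r):
    """Blend the observed PA with the added league-average PA."""
    extra = league_pa_to_add(r, pa)
    return (observed * pa + league_rate * extra) / (pa + extra)


print(round(league_pa_to_add(0.80, 400)))             # 100
# Hypothetical: 20% K/PA in 400 PA, assumed league rate of 17%.
print(round(projected_rate(0.20, 400, 0.17, 0.80), 3))  # 0.194
```

Adding 100 league-average PA to 400 observed PA is the same thing as a regression rate of 100/(100+400), or 20%.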


#26    DSMok1      (see all posts) 2010/08/26 (Thu) @ 16:41

I think Tango’s right on this one.  I’m 99% sure also.


#27    Harry Pavlidis      (see all posts) 2010/08/26 (Thu) @ 16:46

@TT/25 I see. Going with a temporally sequenced example, say I collect 30 plate appearances for Johnny Strongarm. He strikes-out 10 guys, as he is “teh awesome”. His next 30, he K’s 9 guys. An amazingly high correlation. Turns out, that stat is reliable around 30 PA, and going out to 60 PA proved it (across a larger sample blah blah blah). The point is, the 2nd half, as it were, shows how reliable the first half was or wasn’t.


#28    Tangotiger      (see all posts) 2010/08/26 (Thu) @ 16:51

Right, I think we are in agreement then?


#29    Harry Pavlidis      (see all posts) 2010/08/26 (Thu) @ 17:00

Indeed we are.

Next question: how are you “fitting” A? It seems right around average or median of the collection of estimated A (I’m running a fresh batch now, btw, will be done after I get home tonight), but not a weighted average but n .... ?


#30    Tangotiger      (see all posts) 2010/08/26 (Thu) @ 17:20

I’m weighting by n*PA.  So, if you have 100 pitchers at 250 PA, that’s a weight of 25000.

I don’t know if that is correct, but, it’s better than a simple average.  Maybe I should do sqrt(n)*PA as the weight.

Plus, I do a little extra thing of trial and error that I should try to figure out mathematically at some point.  We’re not trying to best-fit “A”, but the estimate of r compared to actual r.
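A sketch of the weighting described in #30 is below, using a few rows from the BB/PA table in the post. The weighted average follows the n*PA weighting (with sqrt(n)*PA as the alternative mentioned). The grid search that minimizes the weighted squared error in r is only a guess at what the "trial and error" step could look like, not the method actually used.

```python
import math

# (n pitchers, PA per pool, observed r) for a few rows of the BB/PA table.
ROWS = [(931, 25, 0.10), (678, 125, 0.34), (525, 250, 0.53),
        (291, 500, 0.69), (120, 1000, 0.83)]


def weighted_mean_a(rows, weight=lambda n, pa: n * pa):
    """Weighted average of each row's A = (1 - r) / r * PA."""
    num = den = 0.0
    for n, pa, r in rows:
        w = weight(n, pa)
        num += w * (1 - r) / r * pa
        den += w
    return num / den


def best_a_by_r_error(rows, candidates=range(50, 1001)):
    """Assumed approach: pick the A whose estimated r best matches observed r."""
    def err(a):
        return sum(n * pa * (r - pa / (pa + a)) ** 2 for n, pa, r in rows)
    return min(candidates, key=err)


print(round(weighted_mean_a(ROWS)))
print(round(weighted_mean_a(ROWS, weight=lambda n, pa: math.sqrt(n) * pa)))
print(best_a_by_r_error(ROWS))
```

The second function targets the error in r directly, which is the distinction drawn in the last sentence of the comment.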

