Tuesday, August 24, 2010
Regression equations for pitcher events
Sit down, because I’m going to need your patience and attention.
In an excellent article by Harry at THT, he gives us the correlation (r) for various components at various numbers of trials.
For example, he had 678 pitchers with their first 250 PA, which he split into even and odd halves of 125 PA each, and he got a correlation for BB/PA of r=.336 between the two groups. What can we do with this? Well, we can come up with a general regression equation, which is simply done as:
A = (1-.336)/.336 * 125 = 247
r = PA / (PA + 247)
That simply means that if you had two pools with 247 PA per pitcher in each pool, and you ran a correlation between the two pools, you’d get r=.500.
In fact, Harry also ran it for pitchers through the first 500 PA (meaning 250 PA in each group), and the r for that pool was r=.534.
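The arithmetic for "A" can be sketched in a few lines of Python. The function names are illustrative labels, not anything from Harry's article; the inputs are the two correlations quoted in this post.

```python
def stabilization_point(r, pa):
    """PA at which the split-half correlation is estimated to reach .50.

    r  : observed correlation between the two groups
    pa : PA per pitcher in each group
    """
    return (1 - r) / r * pa


def expected_r(pa, a):
    """Estimated correlation at a given PA, for stabilization point a."""
    return pa / (pa + a)


print(round(stabilization_point(0.336, 125)))  # 247
print(round(stabilization_point(0.534, 250)))  # 218
print(expected_r(247, 247))                    # 0.5
```

The second call gives the A=218 that goes with 250 PA per group.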
So, what I did was simply run the equation to figure out the “A” for each pool of PA he ran, and I ended up with the following for BB/PA:
n: number of pitchers in pool
PA: number of PA in each pool for each pitcher
A: the number of PA at which r is estimated to be r=.50
n PA A
931 25 227
846 50 231
782 75 253
726 100 248
678 125 247
648 150 280
606 175 269
578 200 235
549 225 230
525 250 218
494 275 230
472 300 260
454 325 261
429 350 271
404 375 257
385 400 253
363 425 239
340 450 262
315 475 221
291 500 224
276 525 220
265 550 216
251 575 234
242 600 224
222 625 223
208 650 208
197 675 197
179 700 215
168 725 217
162 750 215
154 775 218
146 800 228
145 825 220
140 850 221
137 875 218
136 900 201
130 925 208
128 950 208
122 975 210
120 1000 212
112 1025 181
107 1050 190
104 1075 189
100 1100 197
93 1125 191
92 1150 171
89 1175 188
87 1200 206
86 1225 195
85 1250 227
81 1275 236
78 1300 278
73 1325 292
71 1350 314
66 1375 384
63 1400 372
63 1425 394
59 1450 441
57 1475 415
53 1500 431
52 1525 424
48 1550 421
43 1575 399
41 1600 390
38 1625 370
36 1650 333
34 1675 329
33 1700 306
32 1725 319
28 1750 197
24 1775 227
24 1800 229
23 1825 222
23 1850 219
20 1875 245
17 1900 262
16 1925 318
16 1950 319
15 1975 291
13 2000 346
As you can see, the numbers pretty much hover around 250 PA. Therefore, we can create the following equation to regress BB/PA for any number of PA:
regression rate = 250 / (250 + PA)
(Note also that the regression rate = 1 - r.)
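Here is a minimal sketch of how the regression rate could be applied, assuming the regression is toward a population mean. The pitcher's walk rate and the mean of .090 are hypothetical numbers chosen for illustration, not figures from Harry's data.

```python
def regression_rate(pa, a=250):
    """Share of the observed rate to regress; equals 1 - r."""
    return a / (a + pa)


def regress(observed_rate, pa, mean_rate, a=250):
    """Blend the observed rate with the mean, weighted by the regression rate."""
    reg = regression_rate(pa, a)
    return (1 - reg) * observed_rate + reg * mean_rate


# Hypothetical pitcher: .120 BB/PA over 250 PA, with an assumed mean of .090.
print(regression_rate(250))                # 0.5
print(round(regress(0.120, 250, 0.090), 3))
```

At 250 PA the observed rate and the mean get equal weight, so the estimate lands halfway between them.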
How well does it do? I’ll add the observed correlation (r) and the estimated r based on the above equation:
n PA A r est
931 25 227 0.10 0.09
846 50 231 0.18 0.17
782 75 253 0.23 0.23
726 100 248 0.29 0.29
678 125 247 0.34 0.33
648 150 280 0.35 0.37
606 175 269 0.39 0.41
578 200 235 0.46 0.44
549 225 230 0.49 0.47
525 250 218 0.53 0.50
494 275 230 0.54 0.52
472 300 260 0.54 0.54
454 325 261 0.56 0.56
429 350 271 0.56 0.58
404 375 257 0.59 0.60
385 400 253 0.61 0.61
363 425 239 0.64 0.63
340 450 262 0.63 0.64
315 475 221 0.68 0.65
291 500 224 0.69 0.67
276 525 220 0.71 0.68
265 550 216 0.72 0.69
251 575 234 0.71 0.70
242 600 224 0.73 0.71
222 625 223 0.74 0.71
208 650 208 0.76 0.72
197 675 197 0.77 0.73
179 700 215 0.76 0.74
168 725 217 0.77 0.74
162 750 215 0.78 0.75
154 775 218 0.78 0.76
146 800 228 0.78 0.76
145 825 220 0.79 0.77
140 850 221 0.79 0.77
137 875 218 0.80 0.78
136 900 201 0.82 0.78
130 925 208 0.82 0.79
128 950 208 0.82 0.79
122 975 210 0.82 0.80
120 1000 212 0.83 0.80
112 1025 181 0.85 0.80
107 1050 190 0.85 0.81
104 1075 189 0.85 0.81
100 1100 197 0.85 0.81
93 1125 191 0.86 0.82
92 1150 171 0.87 0.82
89 1175 188 0.86 0.82
87 1200 206 0.85 0.83
86 1225 195 0.86 0.83
85 1250 227 0.85 0.83
81 1275 236 0.84 0.84
78 1300 278 0.82 0.84
73 1325 292 0.82 0.84
71 1350 314 0.81 0.84
66 1375 384 0.78 0.85
63 1400 372 0.79 0.85
63 1425 394 0.78 0.85
59 1450 441 0.77 0.85
57 1475 415 0.78 0.85
53 1500 431 0.78 0.86
52 1525 424 0.78 0.86
48 1550 421 0.79 0.86
43 1575 399 0.80 0.86
41 1600 390 0.80 0.86
38 1625 370 0.81 0.87
36 1650 333 0.83 0.87
34 1675 329 0.84 0.87
33 1700 306 0.85 0.87
32 1725 319 0.84 0.87
28 1750 197 0.90 0.87
24 1775 227 0.89 0.88
24 1800 229 0.89 0.88
23 1825 222 0.89 0.88
23 1850 219 0.89 0.88
20 1875 245 0.88 0.88
17 1900 262 0.88 0.88
16 1925 318 0.86 0.88
16 1950 319 0.86 0.89
15 1975 291 0.87 0.89
13 2000 346 0.85 0.89
Pretty good, right?
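The "est" column can be recomputed directly from PA / (PA + 250). This sketch does it for a handful of rows copied from the table above; the helper name is illustrative.

```python
def estimated_r(pa, a=250):
    """Estimated correlation from the BB/PA regression equation."""
    return pa / (pa + a)


# (PA per group, observed r) pairs copied from the table.
rows = [(25, 0.10), (125, 0.34), (250, 0.53), (500, 0.69),
        (1000, 0.83), (1500, 0.78), (2000, 0.85)]

for pa, observed in rows:
    est = estimated_r(pa)
    print(f"{pa:5d}  observed={observed:.2f}  est={est:.2f}  diff={observed - est:+.2f}")
```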
Repeating the above steps for K/PA, we get this regression equation:
regression rate = 83 / (83 + PA)
Simply put, you need just 83 PA in order to regress the K/PA rate by 50%. That is, the K/PA metric stabilizes very, very fast.
I’ll show the chart like this:
PA, Event
83, K
250, BB
What’s beautiful about showing it like that is you are told TWO things:
1. r=.50 is reached at that number of PA for each event
2. the regression equation is n/(n+PA), where n is the above number
Let’s add in HBP/PA:
PA, Event
83, K
250, BB
1800, HBP
Basically, the HBP “skill” takes a long time to find.
Harry also included GB, FB, LD, and PU. Let’s see how those rate as skills:
PA, Event
83, K
83, GB
245, FB
250, BB
433, PU
630, LD
1800, HBP
So, yes, there absolutely is a line drive skill.
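Since the chart doubles as a set of regression equations, it can be kept as a simple lookup. This is a sketch only: the dictionary and function names are made up for illustration, and the values are the ones from the chart.

```python
# PA at which r = .50 for each event, taken from the chart.
STABILIZATION_PA = {
    "K": 83, "GB": 83, "FB": 245, "BB": 250,
    "PU": 433, "LD": 630, "HBP": 1800,
}


def regression_rate(event, pa):
    """n / (n + PA), where n is the chart value for the event."""
    n = STABILIZATION_PA[event]
    return n / (n + pa)


# How much each rate would be regressed after 250 PA.
for event in STABILIZATION_PA:
    print(f"{event:>3}: regress {regression_rate(event, 250):.0%} at 250 PA")
```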
Are there any problems here? Well, one, and it might be big or it might not. Harry did not select random samples, but sequential samples. So, he might be correlating not only on pitcher, but on park and opponent too. Ideally (and maybe he is doing this through his intraclass correlation process), he would randomly select PA for each pool.
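To make the difference concrete, here is a rough sketch of the two ways to build the pools for one pitcher. It is not Harry's code; the function names and the tiny PA log are hypothetical.

```python
import random


def split_even_odd(outcomes):
    """Sequential split: alternating PA go into each pool, in game order."""
    return outcomes[0::2], outcomes[1::2]


def split_random(outcomes, seed=None):
    """Random split: shuffle the PA order first, then cut into two equal pools."""
    shuffled = list(outcomes)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:2 * half]


# Hypothetical pitcher: 1 = walk, 0 = no walk, in chronological order.
pa_log = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1]

pool_a, pool_b = split_random(pa_log, seed=1)
print(sum(pool_a) / len(pool_a), sum(pool_b) / len(pool_b))
```

With the even-odd split, neighboring PA (same game, same park, same opponent) always land in opposite pools; the shuffle breaks that pairing.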
Great stuff.


this is copy-and-paste of a comment I just left on the article at THT
BTW, re-running based on two changes
1) discovered some pre-2007 snuck in ... this will reduce sample sizes (we won’t get to 4000 at all) but will get rid of some yuk
2) randomized the plate appearance sequencing
so far, it looks like reliability is being dampened but it’s only run thru the 100 BF group level (long way to go)