
THE BOOK--Playing The Percentages In Baseball


Friday, November 16, 2007

Reliability of statistics

By Tangotiger, 04:44 PM

Print, read, put aside, re-read then come back here.

I really wish Pizza had included the mean, not just the minimum, for each stat, but he said he’d get back to that later.  The “intraclass correlation”-type equation I use is:
r = PA/(PA+x), where x is unique to each metric.
For things like OBP or wOBA, x is 200.  This means that to get an r=.50, you need 200 PA.  Pizza likes to use r=.70, which means you need a mean of 467 PA.  Pizza showed us that you need a minimum of 350 PA (which, in the context that he chose, means a range of 350 to 700-odd PA), and that likely supports the standard equation that I use.
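
As a rough sketch, the equation and its inverse look like this in Python. The function names are illustrative only; x=200 is the OBP/wOBA value quoted above.

```python
def reliability(pa, x):
    """Expected correlation r for a sample with mean `pa` PA, given the metric's constant x."""
    return pa / (pa + x)


def pa_needed(r, x):
    """Invert r = PA/(PA+x): the mean PA required to reach a target r."""
    return r * x / (1.0 - r)


if __name__ == "__main__":
    x_obp = 200  # constant quoted in the post for OBP or wOBA
    print(reliability(200, x_obp))          # r at a mean of 200 PA
    print(round(pa_needed(0.70, x_obp)))    # mean PA needed for r = .70
```

Any other metric only needs a different x; the shape of the curve stays the same.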

It’s a great post that he did, and a great service that he’s doing.  But, I take exception to this part:


Context Neutral wins (sum (WPA/LI)) - never did.  at 650 PA, it was at .588

The implication here is that you get an r=.588 at around a mean of x=675PA or so, meaning you’d get a correlation equation of r=PA/(PA+480).  And that’s ridiculous.  The sum of WPA/LI should be virtually identical to wOBA or OPS or LWTS or anything else in terms of reliability. 

Here are standard year-to-year correlations from Fangraphs, shown in the main blog entry plus the 4th comment (data from 05/06):

AVG: .12
WPA: .27
BRAA: .35
OBP: .36
OPS: .36
WPA/LI for 2005 to 2006 was .36. For Clutch, it’s .01, as suspected.
SLG: .38

Those numbers are r-squared, not r. As you can see, WPA/LI was at the same level as OPS.  I definitely think that Pizza made a calculation goof somewhere.

That rant aside, great work.

As for calculating the mean, most people will just take the straight mean.  But, as Andy has shown me, what you really want to do is take the average of 1/PA, and then take 1/that.  This has to do with the variance, and if you play around with it, you’ll see that it makes sense.  Which may be the reason that Pizza didn’t present the mean, because he wants to describe something like that.

#1          (see all posts) 2007/11/16 (Fri) @ 17:57

Tom, what do you think of my suggestions for this topic that I wrote in the comments section? (same username)


#2    Chris Long      (see all posts) 2007/11/16 (Fri) @ 18:33

You wrote:

As for calculating the mean, most people will just take the straight mean.

Is this referring to mean number of PAs until you reach the 0.70 correlation point?  It’d be better to use random subsets for cross-validation than odd and even PAs, and not any more difficult to code.

-Chris


#3    MGL      (see all posts) 2007/11/16 (Fri) @ 19:32

I haven’t read Pizza’s post yet, but I was going to post this anyway and it seems like it is relevant:

Let’s say we have a bunch of players all with 100 PA (or AB) each.  We want to know the expected variance by chance, assuming that they all have the same true whatever (BA, OBA).  That is easy.  We use the binomial variance formula, p*q/n.  So let’s assume that they all have a true OBA of .5 just to make the p*q simple.

So the SD (square root of the variance) is sqrt(p*q/n) or .05, which is 50 points of OBA.

Now, we can use that to do Tango’s (or Andy’s or whoever) thing where you look at a sample of real players with different true OBA and compute the observed variance and whatever the difference is between that and the one expected by chance (.0025 or .05 squared) is the variance of true talent among the players.  Of course your result is subject to sample error, so the larger the group of players the better.  And of course knowing this variance in true talent is A LOT of information and you can do a lot with it, not the least of which is figuring out how much to regress the sample OBA for x number of PA for one player in order to estimate HIS true talent OBA.
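
The subtraction described above can be sketched in a few lines of Python. This is only an illustration under the equal-PA assumption from the example; the function name is made up, and the three OBA values in the usage lines are invented numbers, not real player data.

```python
from statistics import mean, pvariance


def true_talent_variance(observed_rates, pa):
    """Observed variance minus binomial variance, for players who all have `pa` PA."""
    p = mean(observed_rates)
    expected_by_chance = p * (1.0 - p) / pa   # binomial p*q/n
    return pvariance(observed_rates) - expected_by_chance


if __name__ == "__main__":
    # Hypothetical group: three players, 400 PA each.
    sample_oba = [0.30, 0.35, 0.40]
    print(true_talent_variance(sample_oba, 400))
```

With a small group the estimate can come out negative, which is just the sample error mentioned above showing up.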

Now, when you do this (trying to figure out true talent variance from observed variance) in practice for anything (e.g., OBA, BA, UZR, wOBA), you rarely if ever have a nice sample of players, all with the same number of opportunities (in this case, PA).  Even if you establish a minimum number of opps to include in your sample, you will have numbers like 130, 400, 600, or whatever.  Not to mention the question: is there any reason to use a minimum number of opps to include in your sample, other than the fact that, for a part-time player, his sample OBA may not reflect his true talent OBA?  If he is primarily a pinch hitter, for example, it might represent his true talent as a pinch hitter, but if most of the other players are full-time players, then you are mixing apples and oranges in your samples, which is not a good idea (you probably want to figure out the variance of true talent for either full-time players or pinch hitters, but not a mixture).

But I digressed a little.  And I’ll get back to whether you should have a minimum PA for each player in your sample of players.

Anyway, as I said, you rarely have a nice sample of players, each with 100 PA.  I suppose you could randomly sample each player’s PA’s 100 times and then you would.  Of course, you would be eliminating a lot of good information, possibly for no reason.  But, you would know exactly what the expected by chance is if all the players had exactly 100 PA, right?  .05 or 50 points in OBA.

In reality, you have all kinds of numbers of PA, each with its own expected variance by chance, p*q/n, with n being different for each player.

I have always wondered how the heck to compute the expected variance of a bunch of samples, each with a different n, using the binomial formula, which is what we need to do for these types of exercises.  What I usually did was just look at the average n in my group.  Say I had 3 players, one with 50 PA, another with 70 PA and another with 180.  My average PA is 100 of course, so I just assumed that the expected SD by chance was .05 (a variance of .0025), what it would be if all the players had 100 PA.  I didn’t like it, and I tried not to have a wide difference in number of PA, which is why I usually establish a minimum and use the same time frame for each player, but I did it anyway.  I kind of assumed that the variance was a little higher than that if everyone had the same PA.  Or maybe lower, but definitely not the same.  I suppose I could have taken the average variance.  So, for example, the player with 180 PA would be .00139, the one with 50 would be .005 and the one with 70 would be .00357, for a simple average of .00332, a little higher than the variance if they all had 100 PA, which would be .0025.

Of course, it probably makes no sense to use a simple average since if one player in my sample had 2 PA, that would really screw things up.  If we do a weighted average (by each player’s PA), we get .0025, which is exactly the same as if we assumed all of them had 100 PA.  Could that be right?  Is the variance in OBA for a group of players who each have a different number of PA (all with the same true OBA of course) exactly the same as if they all had the same number of PA, with that number being the average of everyone’s PA?

IOW, if we have 10 players, all with the same true OBA, like thus:

A 23 PA
B 112 PA
C 240 PA
D 190 PA
E 10 PA
F 143 PA
G 36 PA
H 445 PA
I 87 PA
J 714 PA

is the expected variance the same as if all players had exactly 300 PA (the simple average of the above is 300)?

I had no idea until I ran a mini-sim on my computer.  As it turns out, it does NOT matter what the number of PA of each player is.  Just take the average number of PA and use the binomial formula and you get the exact expected variance by chance for your entire group of players!  That is the same result, BTW, as taking the weighted average of all the expected variances for each player’s PA.

Some of you probably knew this, but I sure didn’t!  I always got bent out of shape when I had a group of players with vastly different number of PA.  That is one reason why I did y-t-y correlations rather than the “observed variance minus expected variance by chance” method.  It gave me a better opportunity to get PA’s that were closer in value for all my players.

I will not do that again.  I see no reason to do y-t-y correlations when you can simply look at the observed variance for all the players you can muster in your sample.  That is going to be a lot more robust.

I also see no reason to not use everyone, even those with a few PA’s, other than the aforementioned “pinch hitter” (full-time/part-time) reason.

Now, one more thing.  Let’s say that every player did in fact have the same true OBA.  IOW, there is zero spread of talent in OBA in the majors.  If you look at any large sample, you are likely to get an observed variance equal to expected variance by chance, right?  Wrong.  Why not?  Because the true OBA of each player, even though every player has the same one overall, will fluctuate from PA to PA and in particular from game to game (different pitchers, etc.).  I always wondered if that increased the “expected” variance.  I am pretty sure it does, which means that we are getting estimates of true variance in skill among the population that are too high.

We need to know how to adjust for this.  I have never seen any numbers.  I am going to employ my mini-sim and fluctuate the OBA for each player a little from PA to PA or 5 PA to 5 PA, or something like that (something to simulate different pitchers, parks, weather, etc.), and see how much it affects the expected variance even if each player has the same true OBA.  We really need to know that before we can declare what the variance of true talent is in hitting or pitching (it is not so important in other things like pitching), or among whole teams, or before we can come up with a reliable x/(PA+x) formula for regression.  Has anyone else ever considered this and tried to make the adjustment?


#4    MGL      (see all posts) 2007/11/16 (Fri) @ 20:24

O.K., I read the whole thing and it is a nice primer for those not familiar with this aspect of statistics and how it applies to baseball analysis.

And even for those who are familiar with much of the “theory” it is nice to see a comprehensive list of how reliable for a given number of opps each stat is, although I would like to know what the “opps” are for each stat (e.g., is that HR per PA or HR per BIP?).

A few things I don’t like.  I’ve said this before, and PC did explain it well, but I HATE the .7 r “reliable” threshold.  Absolutely hate it.  One can talk all they want about what r or r squared means and how it relates to reliability and underlying sample sizes, etc., but there is NEVER any reason to talk about a certain magic number like .7. It is a meaningless number, just like the 95% or 99% confidence intervals for sample means or type I or II errors are meaningless in and of themselves.

IOW, you can say all you want about something that had a .7 correlation, or something that is more than 2 or 2.5 standard deviations away from the null hypothesis, but please, please, do not say that something is “reliable” because it is more than .7 in r or “significant” because it is 2 or 2.5 SD away from the mean.  That is ridiculous.  If you do, that implies very specifically that if something is .69 r it is NOT reliable, yet if it is .7 r it IS reliable, which is silly of course.

I also wish he would have talked more about how to take that r and use its value to regress each sample value to estimate true talent, rather than just looking at a sample and saying or thinking whether it is “reliable” or not or how reliable it is.  That kind of completes the whole discussion.  Without that I think the discussion is incomplete and leaves some readers scratching their heads or, even worse, making some erroneous assumptions, such as: since the .7 threshold is the beginning of “reliability,” if I see a sample stat, say BA, that is below this threshold, I should ignore it, and if I see one above the threshold, I should “accept” it.  That is a bad dichotomy for someone to think exists.  People should always think in terms of a regressed value, with the regression being an exact function of the r.
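
To make “regression as an exact function of r” concrete, here is a minimal sketch. It assumes the r = PA/(PA+x) form from the main post; the function name and the .340 league OBA in the usage lines are illustrative assumptions only.

```python
def regress_to_mean(observed, pa, x, league_mean):
    """Weight the observed rate by r = PA/(PA+x) and the league mean by (1 - r)."""
    r = pa / (pa + x)
    return r * observed + (1.0 - r) * league_mean


if __name__ == "__main__":
    # Hypothetical hitter: .400 OBA in 200 PA, x = 200, assumed league mean of .340.
    print(regress_to_mean(0.400, 200, 200, 0.340))
```

There is no cutoff anywhere in that function: a sample at .69 r and one at .71 r get nearly identical treatment.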

Of course, as I have always said, social scientists like PC, tend to think in terms of these dichotomies rather than smooth functions and regressions.  Maybe I am wrong there.

Finally, I take exception to this statement:

Add that to the list of reasons why it’s silly to give out a batting title to the highest batting average in the league (apologies to Magglio Ordonez).

Unless you think that a batting title means “the best batter in baseball in a true talent sense” rather than “who hit the best this year in baseball,” there is nothing silly about it.  These awards are about who DID the best, performance-wise, not who IS the best or who is likely to be the best next year.  I am pretty sure everyone is aware of that and accepts it, although analysts do tend to forget it every once in a while.  For example, when an analyst (or writer, like Neyer) starts to quote context neutral stats, like VORP, in talking about MVP’s, now THAT is silly.

Oh, and a question for Pizza or anyone:  I don’t understand the distinction between the correlations for batters in general and for individual batters.


#5    tangotiger      (see all posts) 2007/11/16 (Fri) @ 22:45

Your simple average was 200, not 300.

***

The way Andy does it is to take 1/PA for each of those 10 hitters, average them, and then take the reciprocal of that.  In your sample, that’s 47.

In short, the variance you would expect from your 10 players is exactly the same as you would expect from 10 players with 47 PA each.
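
A quick check of those two numbers, as a sketch. It assumes “expected variance” means the unweighted average of each player’s binomial variance around a shared true rate; the helper name is made up for illustration.

```python
from statistics import harmonic_mean, mean

# PA totals for the 10 players listed in comment #3.
pa = [23, 112, 240, 190, 10, 143, 36, 445, 87, 714]


def average_binomial_variance(pa_list, p):
    """Unweighted average of p*q/n over players with different n."""
    return mean([p * (1.0 - p) / n for n in pa_list])


if __name__ == "__main__":
    print(mean(pa))                      # simple average
    print(harmonic_mean(pa))             # 1 / (average of 1/PA)
    # The two quantities below agree: averaging p*q/n is the same as
    # using the harmonic mean of n in the binomial formula.
    print(average_binomial_variance(pa, 0.5))
    print(0.25 / harmonic_mean(pa))
```

The agreement follows directly from the algebra, since p*q factors out of the average of 1/n.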

I’m pretty sure that’s what he did, and he showed me the proof.  I’ll have to look again, though I suppose I can simply sim it.


#6    Phil Birnbaum      (see all posts) 2007/11/16 (Fri) @ 23:04

That “average the 1/x’s, then 1/x it” is the Harmonic Mean, right? 

Here’s another application: suppose checkout lady A takes 5 minutes to handle a customer, and checkout lady B takes 3 minutes.  What’s the expected customer time?

The average isn’t 4 minutes, because, B, being faster, handles more customers than A.

Instead, you average 1/5 and 1/3, giving 4/15, then invert, giving 15/4.  The average customer takes 3.75 minutes.

Throwing that out there in case someone can make the leap from checkout lines to variances and give us an intuitive proof.


#7    MGL      (see all posts) 2007/11/16 (Fri) @ 23:35

I don’t know if Andy said that with regard to variance (he said something like that with regard to weighting a pair of data, rather than taking the min of the two, which we usually do as a shortcut), but I think that is wrong.  As I said, if you sim it, you will see that the simple average is the correct way AND the same as doing a weighted average of the variances assuming each player’s PA is the “n”.

I am now going to see how much variation from PA to PA affects the variance.  Any guesses (you’ll have about 10 minutes from the time I post this)?  Say that we have a true OBA of .350, so a variance in 300 PA of .00076.  I am going to make the true fluctuate…

Now I realize that is easy to figure out.  The expected variance should be the binomial variance assuming a static p (.350) PLUS the variance of the fluctuating p due to pitchers, parks, weather, etc.  I think.  I also think that is going to be a lot, isn’t it?  I mean the range has got to be .260 to .440 depending on the context.  Call that 3 SD.  That would make one SD 30 points which is going to double the overall variance, assuming that we just add the variances.  That doesn’t sound right.  It won’t double the variance.

Well, any guesses?  I am going to sim it just throwing out some reasonable fluctuations every 4 PA or so, modeling a different pitcher and sometimes park every game.


#8    MGL      (see all posts) 2007/11/16 (Fri) @ 23:42

Time is up for the guesses.  Actually I forgot to click submit on my last post.

For the sake of simplicity, I did this:  Every 4th PA, I changed the true OBA.  The average OBA for all players was .350.  If they have a true .350 OBA every PA for 300 PA, the variance is .0007583 of course.  That is the binomial (p*q/n) variance.

If every 4th PA I set the true OBA to either .300, .400 or .350, with an equal chance of each (so the mean stays at .350), it only increases the variance to .000771, a lot less than I would have guessed.  That is a 1.7% increase.

If every 4 games, I make the true OBA either .250, .350, or .450, a pretty large fluctuation I would think, it raises the expected overall variance to .00083, a 9.5% increase, still pretty small.
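
These sim results can be sanity-checked in closed form. The sketch below rests on an assumption not stated outright above: the true OBA is redrawn independently for every block of 4 PA, with each listed value equally likely. Under that assumption the law of total variance gives (p*q + (block - 1)*Var(p))/n, where p is the mean OBA. The function name is made up for illustration.

```python
from statistics import mean, pvariance


def blocked_binomial_variance(p_values, block, n):
    """Variance of an n-PA rate when the true p is redrawn every `block` PA.

    p_values: equally likely true rates for a block (an assumption of this sketch).
    """
    p_bar = mean(p_values)
    var_p = pvariance(p_values)
    return (p_bar * (1.0 - p_bar) + (block - 1) * var_p) / n


if __name__ == "__main__":
    static = 0.35 * 0.65 / 300
    small = blocked_binomial_variance([0.30, 0.35, 0.40], 4, 300)
    large = blocked_binomial_variance([0.25, 0.35, 0.45], 4, 300)
    print(static, small, large)
    print(small / static - 1.0, large / static - 1.0)  # relative increases
```

The two cases work out to about .000775 and .000825, in the same neighborhood as the .000771 and .00083 from the sim.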

So, when we do these calculations, I would think that we would want to raise our expected variance by chance by around 1.5% for pitchers and batters to account for the PA to PA and game to game fluctuation in true OBA (or whatever we are measuring), otherwise we are going to overstate by a little the variance in average true talent among players (although I suppose the overstated number is in fact the variance in true talent within each player as opposed to across players).

How does that sound?


#9    MGL      (see all posts) 2007/11/16 (Fri) @ 23:45

I mean that the overstated variance would be the variance within AND across players.

Of course, we are also measuring the variance within players across time, if we use large (multi-year) samples, or even as the season progresses (some players get tired).  So maybe 2% is closer to what we should use to adjust the expected variance assuming that we really want to know the variance of true talent ACROSS players and not across AND within players, which IS usually what we want to know and which will get us the proper x/(PA+x) regression formula.


#10    tangotiger      (see all posts) 2007/11/16 (Fri) @ 23:52

You definitely want to add the variance due to the opposing pitcher and park, but as far as the pitchers are concerned, it will be almost zero.

You have to start from the standpoint of the hitter.  Imagine each hitter always faces either Cormier or Peavy.  So, rather than a mean of .340, 1 SD = .030 for your population of hitters, half the time you will have a mean of .240, 1 SD = .030, and half the time will be a mean of .440, 1 SD = .030.

But what are your observations of your hitters going to be?  No different than if they faced a league average pitcher 100% of the time.

You are taking the variance of the mean OBP of each batter, and not the variance of each hitter’s sample OBP for each game.


#11    Pizza Cutter      (see all posts) 2007/11/17 (Sat) @ 02:13

A few responses to issues brought up:

Tango/main post: Means can be had (I’ll put them in a separate post), I suppose, although my goal was actually more to establish minima, since that’s the way that I (and others?) usually conduct research.  Also, I’m not familiar with the derivation of your PA/(PA+X) model for correlation.  I understand it’s an asymptotic model, but how are the X-factors determined?

Chris/2: Random subset sampling would have been a little better (although most of what I sought to solve in even-odd splitting would be just as solved by random sampling).  It was a technical decision based on the limits of my software.

MGL: As a therapist, I can appreciate gray areas.  Dichotomizing anything takes away from the richness of the data.  I realize that .7 is, in the end, an arbitrary point, although I think it’s the most defensible one.  The problem with sitting around and appreciating all the nuances is that it’s hard (though not impossible) to live that way.  “Can I trust this stat?” is a yes/no question.  Now, your suggestion of incorporating this into a regressed projection is a good idea.  In something like the Marcels, or anything else that has a regression to the mean component in it (and Tom’s right, they all should), this type of methodology can give a better idea of how much to regress to the mean.


#12    tango      (see all posts) 2007/11/17 (Sat) @ 02:52

Pizza, you calculate r as you normally would, with your intra-class correlation.

For that r, you need the mean PA for your sample.  Let’s say you have r=.70 for mean PA of 500.

Solve for:
r = PA/(PA+x)
.70 = 500 / (500+x)

x = 214

So, your general equation becomes:
r = PA/(PA+214)

So, if you happened to have 1200 PA, then we would expect an r of .85.
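
The same steps as a short sketch, using the r=.70 at 500 PA example above (the function name is illustrative):

```python
def solve_x(r, mean_pa):
    """Solve r = PA/(PA+x) for x, given an observed r at a known mean PA."""
    return mean_pa * (1.0 - r) / r


if __name__ == "__main__":
    x = solve_x(0.70, 500)
    print(round(x))                       # the metric's constant
    print(round(1200 / (1200 + x), 2))    # expected r at 1200 PA
```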


#13    Pizza Cutter      (see all posts) 2007/11/17 (Sat) @ 04:10

The means: remember a few things about them.  In the article, when I say minimum 50 PA, what I really mean is that the player had 100 PA (two 50 PA samples to compare to one another) over a two year period (not necessarily 50 one year, 50 the next… all 100 could be in year 1).  So the means here can be best read as “When I say minimum __ PA, that represents an average of two sets of __ PA per player in the sample.”  (Did that make sense?)

Min PA   Mean   Harmonic Mean   N
> 50     314    168             1720
> 100    382    283             1345
> 150    412    335             1205
> 200    449    393             1042
> 250    479    437             922
> 300    507    478             810
> 350    533    512             712
> 400    555    539             626
> 450    582    572             516
> 500    606    601             422
> 550    630    626             329
> 600    659    657             214
> 650    684    684             127


#14    Guy      (see all posts) 2007/11/17 (Sat) @ 08:37

So who’s right about the expected variance in a sample of varying PAs?  Has big implications for calculating true talent levels.

My intuition is that Andy is right.  I can’t see why weighting by PAs—which is what MGL is doing—will give you the right answer, since each actual player result will get equal weight when you do your comparison.

Also, if you calculate the average variance for a set of PAs (sum of variances / # of players) and then take the SQRT, you get the Andy result.  I think that’s correct, but not certain.


#15    tangotiger      (see all posts) 2007/11/17 (Sat) @ 10:14

Pizza, can you post your WPA/LI and OPS of the players with a “min of 500 PA” (the 422 players).

Also, how did you calculate LI?  Is it mine, or are you deriving it based on empirical-only data?


#16    MGL      (see all posts) 2007/11/17 (Sat) @ 17:15

Guy, I am 99% certain it is me that is right, assuming we are talking about the same thing.  I set up a mini-sim and that gave me the answer.  Just for the record, I am not sure that Andy DID say that the variance of a group of players with different number of PA equals the variance expected by the binomial theorem if all players had the same PA and that number of PA were the harmonic mean of all the PA.  In fact, I don’t think he said that.

And just to be clear, what I am saying is that:

the variance of a group of players with different number of PA equals the variance expected by the binomial theorem if all players had the same PA and that number of PA were the simple average (mean) of all the PA.

I am also saying that is ALSO equal to the weighted average of the variances for each player using each player’s PA as the denominator in the binomial variance equation.  But there is no need to do it this way.

Try writing a simple computer sim.  Make sure it works by first having X (where X is large, at least 10,000 or so) players with 100 PA each.  Set the true p to whatever you want, it is easier to set it at .5.  Now have the computer spit out the variance of 100 PA OBA sampled X times.  It should be .05 of course.  If it is, your sim is working correctly.  (Obviously what you need to do to generate the 100 PA OBA’s is spit out a random number 100 times.  I use a random number between 1 and 100 and if it is less than 51, it is a “hit”, otherwise an out.)

Anyway, now do the same thing but have some players bat x number of times (rather than 100) and some players bat 200-y number of times, so that the average number of PA is 100.  Now compute the variance again.  It was still .05 when I did it.  I tried different combinations of player PA’s.  The variance was always the same as it would be if all players had the same PA and that PA were the average of all the players’ PA.

Tom, I am not sure I follow you with the pitcher thing.  I will have to think about it.


#17    MGL      (see all posts) 2007/11/17 (Sat) @ 17:16

The variance for 100 PA is .0025 and not .05 of course.  I am used to working with SD and not variances.


#18    Guy      (see all posts) 2007/11/17 (Sat) @ 17:32

MGL:  Think about this example:  19 players with 50 PAs and one with 1,050 PAs, avg = 100.  It seems clear to me that the variance will be little different than if you had 50 PAs for all 20 players—the very small variance for one guy in 20 isn’t going to have much impact. Heck, you could assign him the true mean and it wouldn’t matter.  And Andy’s method says this is equivalent to having 20 samples of 52.5 PAs.  Doesn’t that seem right, as opposed to saying this is like having 100 PAs per player?

I think there’s something wrong with how you’re doing the sim, but not sure what.


#19    Guy      (see all posts) 2007/11/17 (Sat) @ 18:18

Also, for the variance to be that of the mean sample size would imply that variance is proportional to sample size.  But of course it isn’t:  each additional increase of 50 (or whatever) PAs buys you less accuracy (a smaller reduction in SD) than the previous 50 PAs.


#20    MGL      (see all posts) 2007/11/17 (Sat) @ 20:05

Guy, you are right!  It always helps to look at an extreme situation, which you did.  To carry that further, what if one player had an infinite number of PA.  The average number of PA would be infinite, implying an overall variance of 0.  But only one of the 20 would have a variance of zero, the rest would have larger variances.

Anyway, I did in fact screw up the sim. I fixed it and ran your scenario, 19 players with PA of 50 and 1 with 1050.  The overall variance was .0047.  If they all had the same PA, that would imply a PA of 53.85.

If Andy’s method implies a harmonic mean of 52.5, this is close but still not exact.  I think I ran enough trials in the sim (100,000) to get the standard error down enough so that we can “trust” the .0047 variance.

It does not appear as if the harmonic mean is the exact solution but it may be close enough.  I would have to look at different examples to see.  It probably is.
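
For anyone who wants to rerun this scenario, here is a minimal Monte Carlo sketch. It is not the sim described above, just one way to set it up: it measures the average squared deviation of each player’s OBA from the known true rate, and the function name, trial count and seed are arbitrary choices.

```python
import random


def simulate_pooled_variance(pa_list, p=0.5, trials=500, seed=1):
    """Average squared deviation of sample rates from the true rate p."""
    rng = random.Random(seed)
    total_sq_dev = 0.0
    count = 0
    for _trial in range(trials):
        for n in pa_list:
            # One player-season: n independent PA, each a success with probability p.
            successes = sum(1 for _pa in range(n) if rng.random() < p)
            total_sq_dev += (successes / n - p) ** 2
            count += 1
    return total_sq_dev / count


if __name__ == "__main__":
    pa = [50] * 19 + [1050]                # the scenario from comment #18
    print(simulate_pooled_variance(pa))    # simulated value
    print(0.25 / 52.5)                     # harmonic-mean prediction
```

Measured this way, the analytic target is .25/52.5, about .00476. A sim that instead takes deviations from the sample mean of the group, or runs fewer trials, can land a little off that.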

I would still like to know the exact solution.  Surely Andy or someone else (who is a bona fide statistician) knows.

I think I overstated (with the 99%) my chances of not having a bug in my sim.


#21          (see all posts) 2007/11/19 (Mon) @ 11:13

The Retrosheet BIP data comes from Stats, no?  I was very impressed with the rapidity of convergence in the LD rate Pizza Cutter reported, in light of the difficulty in defining precisely what a line drive is.  I wonder if the BIS data would produce the same kind of convergence.


#22    Rally      (see all posts) 2007/11/19 (Mon) @ 13:02

The line drive league totals from STATS are very stable year to year.  BIS is not very stable.

I’ve also noticed huge differences in popups.  I think the difference is that what I’m getting from Hardball Times is only infield popups, while STATS classifies a good portion of outfield flies as popups.


#23    Tangotiger      (see all posts) 2007/11/26 (Mon) @ 12:11

Pizza sent me his file, and I responded:

==========================================

Can you tell me how you calculate these two columns:
offense_wpa_sum..00 context_neutral_wins_sum..00

For example, let’s look at this hitter: oxspc001 (who happens to be a pitcher).  In 2005, he went 0 for 2.  You have his WPA as PLUS .03 wins, and his WPA/LI as PLUS .11 wins. Clearly impossible.

I’m going to guess that you are not using Markov-generated win expectancy numbers, but rather empirical based for the year in question?  You can’t do that, for obvious reasons like above.

I’m not sure what LI you are using, but again, you need to use Markov-generated ones, like on my site.

Furthermore, since OBP and SLG are “rate” stats, you have to set WPA and WPA/LI to rate stats status.  And that means dividing each of these by PA.

To be clear:
sum(WPAi/LIi)/sum(PAi)

Where you divide each WPA by each LI (on a PA-by-PA basis), and then you divide by the total number of PA.

Some evidence that you are doing it the wrong way is when I look at Shea Hillenbrand.

http://www.fangraphs.com/statss.aspx?playerid=196&position=DH

His WPA/LI for 2005 is +0.18, and in 2006 it’s -0.50.  So, very close to average.  However, you have him as -1.2 and -1.8 (for the even games of those two years, and odd games).


#24    Pizza Cutter      (see all posts) 2007/11/26 (Mon) @ 14:12

You’re most of the way right on your assumptions.  I do derive my WPA and LI values empirically, although for these, I’ve used the aggregated file of all events from 2000-2006, some 1.3 million events.  The real reason that I don’t use Markov models is that I have very little experience in working with them.  I understand the concept well enough, but I don’t (think I) have the programming capability to do one in the stat program I use.

I looked into the Oxspring incident (sounds like a book I read in high school), and it seems to be symptomatic.  Oxspring was a pitcher whose two plate appearances came when his Padres were being thoroughly whacked by the Brewers in 2005.  Oxspring came to bat with his team down 7 in the 3rd, and then down 11 (after a run diff of 8 or more, I just lumped everything together) in the 6th.  Those are fairly rare situations, and it’s rarer still that a team would come back from there, but one little uppity team at some point (The Indians on 8/5/2001?  I was at that game!) could cause a hiccup in the WPA’s.  I’m guessing that the differences between the two methods can be traced to that.  Shea Hillenbrand I’m at a loss to explain.  I’ll do a deeper root canal later on.

On WPA as a rate stat, I’ve used it before as a rate stat and it didn’t much make a difference when I’ve done y-t-y.  Maybe it makes a difference here.  I’ll check it out.


#25    Tangotiger      (see all posts) 2007/11/26 (Mon) @ 15:00

The significance to the third digit occurs when you have 1 million samples *for that state*.  That is, with say man on 3B, 2 outs, down by 1, in the bottom of the 8th, you need 1 million occurrences to get significance to the third digit.
1 SD = sqrt(.5*.5/1000000)= .0005.
At 2 SD, that gives you +/-.001. 

Since there haven’t been 1 million MLB games played in its history, you won’t get it for any one state.

The “.5*.5” would be whatever win probability is for that state, like “.9*.1”.  It doesn’t change the basic premise much.
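
The arithmetic above, as a small sketch (the function name is illustrative):

```python
from math import sqrt


def win_prob_se(p, n):
    """Binomial standard error of an empirical win probability from n occurrences."""
    return sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    for p in (0.5, 0.9):
        se = win_prob_se(p, 1_000_000)
        print(p, se, 2 * se)   # 1 SD and the 2 SD band
```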

***

Are you also saying that you are correctly doing sum(WPAi/LIi), as opposed to the incorrect sum(WPAi)/avg(LIi)?  I’d bet that’s where your issue is.
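
The difference between the two calculations can be sketched as follows. The three (WPA, LI) pairs are invented for illustration and the function names are hypothetical; nothing here reflects Pizza’s actual file.

```python
def wpa_li_rate(events):
    """sum(WPAi/LIi)/sum(PAi): divide PA by PA, then average over PA.

    events: list of (wpa, li) pairs, one per plate appearance.
    """
    return sum(wpa / li for wpa, li in events) / len(events)


def wpa_over_avg_li_rate(events):
    """The incorrect version: total WPA divided by average LI, per PA."""
    total_wpa = sum(wpa for wpa, _li in events)
    avg_li = sum(li for _wpa, li in events) / len(events)
    return total_wpa / avg_li / len(events)


if __name__ == "__main__":
    # Hypothetical PA: a high-leverage success, a low-leverage failure, an average one.
    events = [(0.02, 2.0), (-0.01, 0.5), (0.03, 1.0)]
    print(wpa_li_rate(events))
    print(wpa_over_avg_li_rate(events))
```

The two only agree when every PA has the same LI, which never happens over a real season.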


#26    Pizza Cutter      (see all posts) 2007/12/01 (Sat) @ 19:19

Haven’t forgotten about this.  It looks like I’ve got a problem with my WPA numbers (which I then used to calculate my leverage numbers.) I isolated Hillenbrand’s numbers and found that there were 48 of his hits over the 2005-2006 period that resulted in a net loss of WPA for his team.  This, of course, makes no sense.

For the time being, the WPA and CNW numbers are an open question.  Disregard what I wrote about WPA and LI in the original article for now.  More investigations to come.  The rest of the article (on the rest of the stats) should still stand.


#27    tangotiger      (see all posts) 2007/12/01 (Sat) @ 22:16

Thanks for keeping on top of this.

There are two places where you have to be careful with win probability.  One is when you get to 3 outs (you get to the next team’s inning).  The other is the home/away win probability (perspective of home team or batting team).

Looking at your overall numbers, I don’t think this is the case, but I thought I’d point out the things that I have to do “special” programming for.

It may also simply be the case of using empirical data, instead of Markov.


#28    Tangotiger      (see all posts) 2008/01/07 (Mon) @ 11:34

Pizza checks in with pitchers:
http://mvn.com/mlb-stats/2008/01/06/on-the-reliability-of-pitching-stats/


#29    Tangotiger      (see all posts) 2008/01/07 (Mon) @ 15:45

I have a back-and-forth with Pizza starting at post 19 here:
http://www.baseball-fever.com/showthread.php?t=72016

I’ll reprint my comments, along with Pizza’s final response:

Pizza, consider making your threshold r=.50, rather than r-squared.

The reason is that at r=.50, regression toward the mean is 50%. It provides a convenient benchmark to say that at that level, you want half the player and half the league.
..
Yes, it is of course a matter of taste. However, 2 points.

1. At the higher r-squared, you will necessarily have fewer data points.
2. You can flip between the two rather easily.
r = PA/(PA+x), where x is the threshold point

If, for example, you find that at r-squared = .50 (r=.707) the threshold level is 600 PA (by which I mean the mean threshold level, not the minimum as you are doing it), that is enough for me to tell you the required mean threshold level for r=.50:

.707 = 600 / (600 + x)
Solve for x as 249

This means that if I wanted an r=.50, your mean threshold level would be PA=249.

You can work it out with your sample and you’ll see this to be true.

So, either way (whether you establish your threshold for r=.71 or r=.50), you end up with the exact same equation. Given the sample size issue, and the double-meaning of x=249, I find it much easier to explain and show based on r=.50.
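
A minimal sketch of the flip described above, assuming only the r = PA/(PA+x) form from the comment; `solve_x` and `expected_r` are illustrative names, not anything from Pizza’s or Tango’s actual code.

```python
def solve_x(r: float, pa: float) -> float:
    """Solve r = PA/(PA + x) for x, given an observed r at a mean PA."""
    return pa * (1.0 - r) / r

def expected_r(pa: float, x: float) -> float:
    """Reliability predicted at a given PA for a metric with constant x."""
    return pa / (pa + x)

# r-squared = .50 (r = .707) found at a mean of 600 PA.
x = solve_x(0.707, 600)
print(round(x))                     # the "x" for this metric
print(round(expected_r(x, x), 2))   # r at PA = x
```

Because r = .50 exactly when PA = x, the same number serves as both the constant in the equation and the r=.50 threshold.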
...
I don’t like Pizza’s “minimum threshold” to begin with. It only works within the confines of his dataset. For example, if you had a range of 0 to 1000 PA, his minimum threshold might be say 600 PA. But, if you had a range of 0 to 2000 PA, I’ll guarantee you his minimum threshold will be less than 600 PA. And if it was 0 to 10,000 PA, his minimum threshold will be much much lower.

What you want to know is the mean PA (or the harmonic mean PA).

So, I think Pizza does two things imprecisely:
1. using “minimum” thresholds, that can only be “minimum” within the confines of the range in question (i.e., he should say “min 600 PA, if your range is 0 to 1000 PA, with a distribution of 25% of the PA greater than 700 PA”, or whatever it is). Most people will simply only look at the “minimum” and forget the remaining qualifiers.

2. using such a high r level to establish the threshold, since that lowers your sample size unnecessarily

Selected quote from Pizza:

Your first critique on people’s assumptions on minimum thresholds is well-taken. You’re right that my numbers work only within the realm of seasonal data sets, although the player-season is the most common unit of analysis out there. I could re-run the same type of analysis if we wanted to look at career numbers or 5 season periods or whatever. My goal in the minimum threshold part of the piece was simply to be a bit more scientific when throwing out the “cup of coffee” guys in studies. The harmonic mean would be nice to know, but that’s generally not how people conduct studies.

You’re technically correct on the sample-size issue with the higher cutoff, although that issue can be inflated away by sheer volume. Sure, there are fewer guys who get at least 600 PA than there are 300, but if you string together a few years in your data set, you can find 200-300 player-years fairly easily. Selective sampling issues aside (which I think is the bigger critique of a higher cutoff), I’d argue that you can put together a fine analysis with 200-300 player-years. The difference in statistical power between an N of a few hundred and an N of a thousand is relatively minimal. Plus, I would make the argument that having a few hundred pieces of data and being conservative on the reliability side will produce much better results than a larger sample with a more liberal criteria for reliability. Measurement fidelity is something that is vastly overlooked both in Sabermetrics and in psychometrics.


#30    tangotiger      (see all posts) 2008/01/08 (Tue) @ 11:36

Cross-posted to Pizza’s blog:

====================================

Pizza, ok great.

At 750 “PA”, you’ve got K/PA with an r=.873

Using the equation:
r=PA/(PA+x)
.873=750/(750+x)
we get an x=109

So, our general equation for correlation of K/PA is:
r=PA/(PA+109)

So, if you have 300 PA, we can estimate the likely r.  Using the above equation, we get r=.73.  Your sample data shows .82.  I’m not happy with this difference.

***

Let’s continue with BB/PA.

Using 750 PA, the equation is
r=PA/(PA+201)

So, at 300 PA, we’d expect r=.60.  Your sample shows r=.60!  Bingo!

***

We CANNOT do K/BB.  That is a ratio of two independent events.  Unless you take the log, you need to reform it as: K/(K+BB), and then run the regression.  You need to create a rate stat if you want to apply linear regression.  Otherwise, why not do BB/K?  You’ll actually get different results.
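
A toy illustration of why the choice of ratio matters. The (K, BB) totals below are invented for three hypothetical pitchers in two split halves, and the helper names are made up; the point is only that K/BB and BB/K give different correlations, while K/(K+BB) and BB/(K+BB) are mirror images and must agree.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Hypothetical (K, BB) totals for three pitchers in two halves; not real data.
half_a = [(10, 10), (20, 10), (40, 10)]
half_b = [(10, 10), (20, 10), (30, 10)]

def split_half_r(stat):
    """Correlate a stat computed on half A with the same stat on half B."""
    return pearson([stat(k, bb) for k, bb in half_a],
                   [stat(k, bb) for k, bb in half_b])

r_k_per_bb = split_half_r(lambda k, bb: k / bb)
r_bb_per_k = split_half_r(lambda k, bb: bb / k)
r_k_share = split_half_r(lambda k, bb: k / (k + bb))
r_bb_share = split_half_r(lambda k, bb: bb / (k + bb))

print(round(r_k_per_bb, 3), round(r_bb_per_k, 3))  # these two differ
print(round(r_k_share, 3), round(r_bb_share, 3))   # these two match
```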

***

At 750 PA, the equation for HR/PA becomes:
r=PA/(PA+1572)

If you have 300 PA, we expect r=.16.  Your sample shows r=.26.  Again, not happy here.

***

At 750 PA, the equation for 1B/PA becomes:
r=PA/(PA+679)

If you have 300 PA, we expect r=.31.  Your sample shows r=.34.  Pretty close.

***

At 750 PA, the equation for XBH/PA becomes:
r=PA/(PA+2415)

At 300 PA, r=.11.  Your sample shows r=.22.  This is very inconsistent.  Your r is very close at both PA=300 and PA=750.  This certainly makes little sense, and you have some sort of bias in the data here, be it park, or whatnot.

***

Here is what the batted ball info looks like:

Event         r at 750 PA   the “x”   expected r at 300 PA   sample r at 300 PA   result
Line drives   0.936         51        0.85                   0.86                 Bingo!
Ground balls  0.905         79        0.79                   0.82                 Pretty close
Fly balls     0.862         120       0.71                   0.78                 Eh, not bad
Pop ups       0.764         232       0.56                   0.59                 Pretty close
HR/FB         0.207         2,873     0.09                   0.15                 a bit off

I find presenting the general “r” equation as I am doing provides what you need for *any* level of PA.
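
For anyone who wants to reproduce the batted ball numbers, here is a rough sketch. It takes the r at 750 PA and the sample r at 300 PA from the figures above and assumes nothing beyond r = PA/(PA+x); the function and variable names are illustrative.

```python
def solve_x(r, pa):
    """Solve r = PA/(PA + x) for x."""
    return pa * (1.0 - r) / r

def expected_r(pa, x):
    """Reliability predicted at a given PA for a metric with constant x."""
    return pa / (pa + x)

# (r at 750 PA, sample r at 300 PA), copied from the batted ball figures above.
batted_ball = {
    "Line drives": (0.936, 0.86),
    "Ground balls": (0.905, 0.82),
    "Fly balls": (0.862, 0.78),
    "Pop ups": (0.764, 0.59),
    "HR/FB": (0.207, 0.15),
}

rows = {}
for event, (r_750, sample_300) in batted_ball.items():
    x = solve_x(r_750, 750)
    predicted_300 = expected_r(300, x)
    rows[event] = (round(x), round(predicted_300, 2))
    print(f"{event:<13} x={x:7.0f}  expected r={predicted_300:.2f}  sample r={sample_300:.2f}")
```

Swapping 300 for any other PA in `expected_r` gives the estimate at that level, which is the sense in which one x covers every PA.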

***

Here’s another way to think about the BB/PA.  I took all the players with at least 2000 BFP, from 2001-2006.  That’s 158 pitchers.  I figure the zScore for each pitcher, from Brad Radke’s -11 standard deviations to Ishii’s +12 SDs.  The standard deviation of all those zScores was 4.408.  The average PA was 3596.

The r (which is likely the same intraclass correlation that Pizza is talking about) is r=1-(1/4.408)^2= .9485

Plugging this into:
r=PA/(PA+x)
and we get:
.9485=3596/(3596+x)
Solving for x=195

So, our BB/PA equation is:
r=PA/(PA+195)

At PA=300, we’d expect r=.606.  Pizza’s sample says r=.597.
At PA=750, we’d expect r=.794.  Pizza’s sample says r=.789.

That’s a huge bingo!

The advantage here is that it’s a supersnap to do in Excel.  Plus, you get an actual regression equation based on PA (or whatever your denominator is).
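
The same steps outside Excel, as a hedged sketch. The z-score definition here (observed rate minus league rate, divided by the binomial SD at that pitcher’s PA) and the use of a population standard deviation are assumptions about how the calculation was done, and all function names are made up for illustration.

```python
import math
import statistics

def z_scores(events, pa, league_rate):
    """z-score of each player's rate vs. the league rate, using binomial SD at his PA."""
    out = []
    for k, n in zip(events, pa):
        sd = math.sqrt(league_rate * (1.0 - league_rate) / n)
        out.append((k / n - league_rate) / sd)
    return out

def reliability_from_z(sd_z):
    """r = 1 - 1/SD(z)^2. Only meaningful when SD(z) is greater than 1."""
    return 1.0 - 1.0 / sd_z ** 2

def solve_x(r, mean_pa):
    """Solve r = PA/(PA + x) for x at the sample's mean PA."""
    return mean_pa * (1.0 - r) / r

def x_from_counts(events, pa, league_rate):
    """Full pipeline: counts -> z-scores -> SD of z -> r -> x (assumed method)."""
    sd_z = statistics.pstdev(z_scores(events, pa, league_rate))
    return solve_x(reliability_from_z(sd_z), statistics.mean(pa))

# Plugging in the summary numbers from the comment: SD of z = 4.408, mean PA = 3596.
r = reliability_from_z(4.408)
x = solve_x(r, 3596)
print(round(r, 4), round(x))
```

`x_from_counts` is there only to show how the pieces would chain together on real per-pitcher BB and BFP totals; the comment’s own data is not reproduced here.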


#31    Eli      (see all posts) 2008/01/08 (Tue) @ 16:36

Isn’t the “x” event that you calculate going to be different depending on which level of PA you’re looking at? If so, how can just picking one level of PA at random and calculating the “x” event give you a “general” formula?


#32    Tangotiger      (see all posts) 2008/01/08 (Tue) @ 16:50

If you follow the method from “Here’s another way to think about the “, you will NOT need any minimum PA qualifier.


#33    Eli      (see all posts) 2008/01/08 (Tue) @ 17:16

I’m kind of just talking about the r=PA/(PA+x) method more generally, which I never really understood. Am I right in saying that x is supposed to be the break-even number of PA, where r = .5, and var(rand) = var(true) = var(obs)/2? If so, then to calculate x…

r = .5 = 1 - var(rand)/var(obs) = 1 - (p*(1-p)/x)/var(obs)
.5 = 1 - (p*(1-p)/x)/var(obs)
(p*(1-p)/x)/var(obs) = .5
p*(1-p)/x = .5*var(obs)
x = 2*p*(1-p)/var(obs)

But there is no single var(obs) - it will be different depending on what level of PA you are looking at. So how do you know which PA level to use to calculate x to get the general r=PA/(PA+x) equation?


#34    Anthony      (see all posts) 2008/01/08 (Tue) @ 17:35

If you use a minimum of 2000 PA, you might get r=.9, with an average of 3000 PA per player.

.9 = 3000 /(3000 + x)
x = 333

Lower the minimum PA and the correlation goes down. If you use 500 PA as your minimum, you might get r=.7, with an average of 750 PA per player.

.7 = 750 / (750 + x)
x = 321

The results will be virtually identical.


#35    Tangotiger      (see all posts) 2008/01/08 (Tue) @ 17:50

Right, at r=.50, then var(true)=var(random)=var(obs)/2
r=0.5=var(random)/var(obs)

var(random)=(rate*(1-rate)/PA)

And r=PA/(PA+x)
r=0.5 implies PA=x

So....

var(obs)=2*var(random)
var(obs)*PA = 2*rate*(1-rate)

As the number of PA goes up, the variance goes down, in the EXACT same proportion.
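
One way to see why a single x falls out: if observed variance is true-talent variance plus binomial noise, then r = var(true)/var(obs) simplifies to PA/(PA+x) with x = rate*(1-rate)/var(true), which has no PA in it. The sketch below checks that numerically. The league rate and true-talent variance are hypothetical values chosen for illustration, and it assumes every PA group shares the same true-talent spread.

```python
def observed_variance(var_true, rate, pa):
    """var(obs) = var(true) + var(random), with binomial random variance."""
    return var_true + rate * (1.0 - rate) / pa

def implied_x(var_true, rate, pa):
    """Back out x from the r you would see at a given PA level."""
    var_obs = observed_variance(var_true, rate, pa)
    r = var_true / var_obs
    return pa * (1.0 - r) / r

RATE = 0.08        # hypothetical league BB/PA
VAR_TRUE = 0.0004  # hypothetical true-talent variance (SD = 0.02)

xs = [implied_x(VAR_TRUE, RATE, pa) for pa in (300, 750, 3000)]
print([round(x, 3) for x in xs])
```

var(obs) does fall as PA rises, but r rises with it in exactly the way that leaves x unchanged. If the groups differ in true-talent spread (say, relievers dropping out at higher PA), that assumption breaks and the x values will drift.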


#36    Eli      (see all posts) 2008/01/08 (Tue) @ 18:05

"As the number of PA goes up, the variance goes down, in the EXACT same proportion.”

But then how do you get one single x to put in the r=PA/(PA+x) equation? If you start from data on players at one PA level you will get one x, while if you use players at a higher PA level, you will get a higher x (because var(obs) will be lower).


#37    Pizza Cutter      (see all posts) 2008/01/08 (Tue) @ 18:46

Tango, might the discrepancy be accounted for by the fact that when I jump to 750 PA, the sample changes as well (it’s a select group of pitchers who get to face 750, rather than 300 batters)? 

Mariano Rivera is in the 300 group, but not in the 750, whereas Pedro Martinez is represented in both groups.  I could run the numbers so that the sample itself would stay consistent.


#38    tangotiger      (see all posts) 2008/01/08 (Tue) @ 19:03

Eli/36: your sample will always change.  There’s no reason to use a cutoff, since you would simply weight the one with fewer observations less.

Pizza/37: I wouldn’t be surprised if the sample changed somewhat drastically, especially if all the relievers end up dropping out.

