THE BOOK--Playing The Percentages In Baseball

Friday, October 26, 2007

Productive Outs - Finally!

By Tangotiger, 08:02 PM

I was hoping somebody would do this.  The sum of each event’s WPA/LI tells you exactly the impact of an event, without the effect of the leverage.  If someone can move someone over, WPA/LI will capture it.  And thankfully, Pizza was the one who took the plunge.  Next time someone talks about “productive outs”, Pizza has the list as to who can do it.  Hopefully, he can share a more extended list, with numbers.


#1    Pizza Cutter      (see all posts) 2007/10/27 (Sat) @ 00:45

If you want, I can post the whole list for 2006.  I need a Retrosheet file to make this happen, so the 2007 list won’t be available until that file goes up.


#2    Pizza Cutter      (see all posts) 2007/10/27 (Sat) @ 01:07

Done.  I provided an update on StatSpeak.  The link is at the bottom of the article.


#3    MGL      (see all posts) 2007/10/27 (Sat) @ 14:18

If you want “context-neutral” why not just use change in RE rather than WE divided by LI?  That is a minor point though.

I left this comment on his web site:

Pizza, before anyone gets all giddy about there being SOME skill at moving runners over, don’t you want to control for handedness, K rate, and perhaps GB/FB ratio to see if there is any “productive out” skill AFTER accounting for these things?

For example, if the “skill” is completely (or mostly) a result of a player’s K rate (IOW, players who K a lot have a low index and players who don’t have a high index, which they obviously do) then I would be quite reticent to call that a “skill” with respect to making productive outs.  That is semantic though.  However, if someone says, “So-and-so is good at moving runners over,” I would want them to know, “Well, sure, but that is just because he hardly ever strikes out.”

I used to include “moving runners over” in my Superlwts numbers, but I found that a rigorous batting lwts formula already takes care of that, by using different out rates for K, fly ball outs, and ground out rates by handedness (a ground out by a left-hander is worth more because of moving runners up).  IOW, I don’t think there is ANY “productive out” skill to speak of, once you account for the above variables.


#4    tangotiger      (see all posts) 2007/10/27 (Sat) @ 17:10

The reason that WPA/LI is superior to deltaRE is that it takes into account the game situation.

Just as a HR with the bases loaded and 2 outs in the bottom of the 9th of a tie game is really no different than a single, you have to account for what it is that the event is actually doing.


#5    MGL      (see all posts) 2007/10/27 (Sat) @ 17:21

Then Pizza should not be calling it “context-neutral” and why divide by LI then?

In any case, dividing by LI seems to reduce it to the same thing as Delta RE.  Where am I going wrong?


#6    MGL      (see all posts) 2007/10/27 (Sat) @ 17:23

The only difference between WE and RE is inning and score, right?

Once you divide by LI, don’t you in effect remove the inning and score?


#7    tangotiger      (see all posts) 2007/10/27 (Sat) @ 18:30

You depress the leverage impact, but keep its context-specificity.

Think back to the example I gave about bases loaded, bottom of the 9th, tie game, 2 outs.  The LI is HUGE.

But, relative to the value of the out, the win impact of the single, double, triple, HR, and walk are all EXACTLY the same, and all at around 2.0.  That is, if you were to construct a wOBA equation for this situation, it would be:
(1*1B + 1*2B + 1*3B + 1*HR + 1*BB) / PA

i.e., wOBA = OBP.

But, if the bases were empty, 2 outs, tie game, the win impact of the HR would shoot up.

And, if it was bases loaded, 0 outs, the win impact of a flyball out would also shoot up.

So, WPA/LI is sensitive to the game situation, without having the extra oomph that LI gives the situation.

Think of WPA/LI as exactly being wOBA for the game state.
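As a rough illustration of the bookkeeping, here is a minimal Python sketch that sums WPA/LI over a batter's plate appearances. The win-expectancy changes and leverage values are made-up numbers, and `context_neutral_wins` is a hypothetical helper name, not something from the article.

```python
def context_neutral_wins(events):
    """Sum WPA/LI over (wpa, li) pairs, one pair per plate appearance."""
    return sum(wpa / li for wpa, li in events)


# Hypothetical events: (change in win expectancy, leverage index at the time).
events = [
    (0.10, 2.0),   # big swing, but in a high-leverage spot
    (-0.03, 1.0),  # routine out at average leverage
    (0.02, 0.5),   # small swing in a low-leverage spot
]

total = context_neutral_wins(events)
print(round(total, 3))
```

Each event keeps the sign and relative size it had in its own game state, but dividing by LI stops the high-leverage event from dominating the total.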


#8    Pizza Cutter      (see all posts) 2007/10/27 (Sat) @ 19:23

MGL, I looked at two of the three issues about which you were concerned.  I re-ran the index, and put it alongside each player’s K rate and GB/FB ratio.  Because of some technical issues in my data base, I couldn’t do handedness.

Minimum 25 relevant PA’s, Correlation between professional hitting index and:

K rate = -.012
GB/FB ratio = .173 (guys who hit more GB’s are better “professional hitters”)

Still, those correlations are less than inspiring.  I tossed them into a regression and you’re picking up about 3% of the variance.  I saved the residuals and the residuals still have an ICC around .11, which is fairly close to the original ICC.  The factors which you suggest account for some of what’s going on, but not a lot.


#9    MGL      (see all posts) 2007/10/27 (Sat) @ 20:03

I am not too surprised.  I actually think that there are some batters who are more adept at going to right field than others, and certainly power hitters are less inclined to try and do so, whereas light hitting batters are expected to do so.

Also, I think that the biggest factor is handedness and that is the one you have not looked at, right?


#10    Pizza Cutter      (see all posts) 2007/10/28 (Sun) @ 00:15

I solved the handedness problem.  The big problem I was having was with switch hitters.  I solved it by splitting them into two people.  So, Victor Martinez gets an entry for his LH at-bats and RH at-bats in each year.

There was no significant difference between RH and LH when the stat was normalized per at bat, although there was one for seasonal performance (righties were a bit better).  I dummy coded handedness and tossed it into the regression with K rate and GB/FB.  R-squared is still around 3%.  The residuals from that equation check in with an ICC around .086.

There’s not much skill to begin with here (I’m personally of the “I like to see correlations above .70” school), but it looks like what little of it is there can’t be fully explained by K-rate, GB/FB, and handedness.


#11    MGL      (see all posts) 2007/10/28 (Sun) @ 01:55

I don’t know how ICC’s work, but the magnitude of year to year correlations depends entirely on the sample sizes of each data point (e.g., how many PA are in each data point per year).  If those sample sizes are small, an “r” of .20 can indicate tremendous skill.  If the sample sizes are large, then .2 could indicate very little skill.

IOW, I don’t see how you can say anything about the level of skill from looking at the “r” without comparing that to what the “perfect r” would be given the sample sizes.

I am of the school of the “.70 correlations mean NOTHING” unless taken in context.

Not to mention the fact that the “level of skill” means nothing unless it is in the context of the sample size I am talking about.

If there is NO skill, then the “r” will be zero regardless of the sample sizes.  Of course, as the sample sizes get smaller, the more it will fluctuate and give us “false” readings (plus or minus something).

If there is ANY skill at all, then as the samples get large, the “r” will approach 1.  So I don’t know how you or anyone else can say that the magnitude of the “r” tells us anything about the level of skill, independent of the sample sizes.

And of course, with these types of regressions, there are 2 kinds of sample sizes.  One is the number of data points.  That is not the one I am talking about.  The size of that will only affect the uncertainty of the “r” not the magnitude.  The sample size I am talking about is the number of measurements with respect to the underlying thing we are measuring, like PA, or BIP, or whatever.  As I said, the size of the resultant “r” is a function of the size of THIS sample.

You are virtually never going to get “r’s” in the .70 range when you do year to year correlations even for things that involve lots of skill, like batter HR rate or pitcher K or BB rate.  Of course, if you do 2 year to 2 year regressions or odd/even career regressions (or any other ones which increase the underlying samples), then the correlations will be a lot higher.  That doesn’t change the level of skill in the thing we are measuring though.  It DOES change the level of skill with respect to the sample sizes you are working with.

As I said, if you are working with year to year correlations in baseball, you are going to wait all your life to get correlations in the .7’s.
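To make the sample-size point concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption chosen for illustration: 200 players, a true talent spread of .03 around a .33 rate, and the helper names `pearson_r`, `observed_rate`, and `year_to_year_r`. The talent spread is held fixed while only the number of PA per "season" changes.

```python
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def observed_rate(true_rate, pa, rng):
    """Simulate pa independent trials and return the observed success rate."""
    return sum(rng.random() < true_rate for _ in range(pa)) / pa


def year_to_year_r(pa, n_players=200, talent_sd=0.03, seed=1):
    """Correlate two simulated seasons of pa trials each for the same players."""
    rng = random.Random(seed)
    talents = [min(max(rng.gauss(0.33, talent_sd), 0.0), 1.0)
               for _ in range(n_players)]
    year1 = [observed_rate(t, pa, rng) for t in talents]
    year2 = [observed_rate(t, pa, rng) for t in talents]
    return pearson_r(year1, year2)


# Same players, same talent spread; only the PA per season changes.
for pa in (50, 500, 2000):
    print(pa, round(year_to_year_r(pa), 2))
```

Setting `talent_sd=0.0` gives the no-skill case, where the correlation should hover around zero at any number of PA.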

Again, when you do your ICC’s (which I am not familiar with), I don’t know what underlying sample sizes the regression is using and what the analogue (twin) would be if you were doing y-t-y’s.

I know it is probably the same thing, but don’t you just want to do 2 regressions, one for LHB and one for RHB, rather than doing one big regression and putting in dummy variables for handedness?  Isn’t that a lot easier, more accurate (maybe not) and easier for someone to understand?


#12    Pizza Cutter      (see all posts) 2007/10/28 (Sun) @ 09:32

You can think of ICC’s as year-to-year correlations writ large.  Where yty can only take into account two data points at a time (since it’s a bivariate correlation), ICC can take on an infinite number of data points, in theory.  (In practice, that’s a bad idea, but it could happen.) You’re correct that sample sizes play a role in this type of analysis, although once the data points are based on enough observations (In this case a minimum of 25 balls in play), and if the sample size (number of players under study) is above 100 or so (in this case, it’s around 300), correlations generally tend to stabilize.  There are tests that can give you exactly how much wobble we can expect in correlation at various N’s.  (Fisher’s Transformation comes to mind.) We can generally get this kind of sample size with just about anything we do in baseball.
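The "wobble" from the number of players can be sketched with Fisher's transformation. This is a minimal Python example that treats the value as an ordinary Pearson correlation, which is an assumption (an ICC has its own interval formulas). The .11 and the 300 players are the figures quoted in this thread; `fisher_ci` is a hypothetical helper name.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson r from n data points."""
    z = math.atanh(r)                # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)      # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# The interval narrows as the number of players (data points) grows.
for n in (30, 100, 300):
    lo, hi = fisher_ci(0.11, n)
    print(n, round(lo, 2), round(hi, 2))
```

Note that this interval only reflects how many players are in the sample. It says nothing about how many PA sit behind each player's number.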

.70 is my bias as a psychometrician.  When I do test-retest analysis (aka year-to-year) on a measure of depression or attitudes toward child abuse or coping strategies, I need to get .70 or people will laugh at me.  That’s the industry standard where I work.  But, it also makes sense.  It gives an R-squared of 49% (roughly half) so we know that anything above .70 means that the majority of the variance is consistent from year to year, rather than being due to chance or other unexplained factors.  With that said, it’s a matter of how much variance you want consistent across time.  I’m not dogmatic about .70, but I feel a lot better when I see it.

For what it’s worth, pitcher walk rate has an ICC of .609, K rate is at .754, pitcher HR rate is at .364 (not park adjusted).  Speed scores, both James’ and my own, had ICC’s of .70 and .77 respectively.

I think the question here is what we mean by “skill.” I generally use it to mean “replicable performance.” (In which case, a perfect R is 1.00) For example, if a batter hits .300 one year, he’ll hit pretty close to .300 the next year, and we can bank on that pretty well.  A correlation of .20 means that 4% of the variance is shared from time one to time two.  It also means that standard errors of estimation from year to year will be high.  Sure, it means that someone’s only regressing back 80% to the mean next year at an aggregate level, but on an individual level that standard error is going all over the place.

I think what it tells us is that baseball is much more of a game of chance than anyone would like to admit.
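The arithmetic behind "4% of the variance" and "regressing back 80% to the mean" can be sketched in a few lines of Python. The .300 average and .270 league mean are hypothetical inputs, the helper names are made up for this example, and the projection assumes the two years have similar spread.

```python
def shared_variance(r):
    """Fraction of variance shared between time one and time two."""
    return r ** 2


def regressed_estimate(observed, league_mean, r):
    """Keep a fraction r of the deviation from the mean; regress the rest."""
    return league_mean + r * (observed - league_mean)


r = 0.20
print(round(shared_variance(r), 2))
print(round(regressed_estimate(0.300, 0.270, r), 3))
```

With r = .20, only a fifth of the 30-point gap above the league mean carries over to the estimate for next year.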


#13    MGL      (see all posts) 2007/10/28 (Sun) @ 16:20

I’m sorry, I don’t follow what you mean by the “correlations will stabilize.”

As I said, the correlations are purely a function of the sample size of the underlying data within each data point in the regression.

If I am asked what is the correlation for a pitcher’s K rate, there is no answer to that unless they specify for how many TBF.  If I run a regression from one year to another (y-t-y), I might get .5.  If I run 10 years to 10 years, I might get .7.  If I run month to month, I might get .2.

So, again, I ask, how can you tell me that the ICC correlation is “x” without telling me for how many TBF that correlation represents?

There is no such thing as simply a “correlation” for a pitcher’s K rate, or BB rate, or HR rate, or BABIP.  That is like saying, “What is a pitcher’s K rate if he has 142 K?”

You can only ask for a correlation of a pitcher’s K rate for 100 TBF versus 100 TBF, or 100 TBF versus 1000 TBF, or however you got each data point in the regression.  When we say a “y-t-y” regression or correlation, even then it makes no sense unless we specify around how many TBF or PA (observations) per year.  You tend not to have that situation in the social sciences and psychological sciences.

I realize that you know 1000 times more than I about statistics Pizza, but you cannot tell me that a correlation such as this, which is based on a certain number of observations (PA, etc.) for each data point in the regression, “stabilizes,” unless you tell me how many observations you are talking about.  Giving me a correlation of “x” in an ICC cannot possibly mean anything unless you specify the exact or approximate or average number of observations in each data point.

Maybe Tango can join in here, because I think he knows what I am talking about.


#14    tangotiger      (see all posts) 2007/10/28 (Sun) @ 19:26

Right, I’m with MGL here.

If I understand what Pizza is doing, it’s exactly what I do with correlations:
var(observed) = var(true) + var(error)

r= 1 - var(error)/var(observed)

You get an implied year-to-year correlation by using only one year of data.  For example the standard deviation of historical team wins per game is .072.  Anyone can figure this out, just by using sample data points.

The standard deviation of the binomial (luck) is .039.  Anyone can figure this out as sqrt(.5*.5/162).

The correlation, r, is simply 1 minus (.039/.072)^2 = .70

That is, if you looked at team records in one year and looked at it in another year, you will find a correlation, year-to-year, of .70.  And we can figure this WITHOUT looking at multiple years.

However, and this is what MGL is talking about, it’s completely dependent on the number of games (162) and not the number of data points (30 or what have you).

If I had 1620 games in a season, the r would be much higher. 
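The team-wins arithmetic above can be checked with a short Python sketch. The .072 observed standard deviation, the .500 win probability, and the 162-game season come from the comment; the function names are hypothetical, and the 1620-game figure assumes the spread in true team talent stays the same while only the binomial noise shrinks.

```python
def implied_r(sd_observed, p, n_games):
    """r = 1 - var(error) / var(observed), with binomial error variance."""
    var_error = p * (1 - p) / n_games
    return 1 - var_error / sd_observed ** 2


def r_at_other_length(sd_observed, p, n_games, new_games):
    """Hold var(true) fixed and recompute r for a different season length."""
    var_true = sd_observed ** 2 - p * (1 - p) / n_games
    var_error_new = p * (1 - p) / new_games
    return var_true / (var_true + var_error_new)


print(round(implied_r(0.072, 0.5, 162), 2))                # about .70
print(round(r_at_other_length(0.072, 0.5, 162, 1620), 2))  # about .96
```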

And in fact, you take two of anything, if they have the slightest possibility of any kind of relationship whatsoever (like clutch hitting), and you can get a correlation to approach r=1.0, if your sample size is large enough.  And you can take two things that are very very highly correlated (say player’s pure speed and the number of stolen bases), and if your sample size is small enough, your correlation r will approach zero.

That’s why, when I give a correlation equation, I always do something like
r = PA/(PA+200)

That tells me that if I have 200 PA, my correlation, year-to-year will be r=.50.  If I had 1800 PA, r=.90.  That’s why it’s absolutely critical to tell the reader what your sample size is.  An equation like above makes it abundantly clear.
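A minimal Python sketch of that equation, plus its inverse. The 200 is the illustrative constant from the formula above, not a fitted value for any particular stat, and `reliability` and `pa_needed` are hypothetical names.

```python
def reliability(pa, half_point=200):
    """Expected correlation r = PA / (PA + half_point)."""
    return pa / (pa + half_point)


def pa_needed(target_r, half_point=200):
    """Invert the formula: PA required to reach a target r."""
    return half_point * target_r / (1 - target_r)


print(reliability(200))        # 0.5
print(reliability(1800))       # 0.9
print(round(pa_needed(0.70)))  # PA needed for r = .70 with this constant
```

Written this way, a threshold such as .70 turns into a statement about how much data is needed rather than a statement about the stat itself.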


#15    MGL      (see all posts) 2007/10/28 (Sun) @ 20:06

Yes, you explained it better than I could.  So, as I said, wanting a correlation of .70 or above means nothing unless you specify an underlying sample size.  And even then, what constitutes a “good” or “strong” correlation?  If the underlying sample size is small, even for something which is very much skill related, the correlation will be small and if that sample is large, the correlation will be large even if there is very little skill (but some).

So we CANNOT talk about absolute correlations, and what is strong or weak, unless we are comparing it to something else.

Pitcher BABIP is going to look pretty strong if we do odd/even year career correlations for pitchers with at least 15 years in their career.

Conversely, pitcher K rate is going to look weak if we did a month to month regression (regardless of how many pitchers we use - as Tango said, the number of data points in the regression is only related to the uncertainty of the resultant “r”, not the magnitude of the “r”).

So all I ask is that, if Pizza is doing an ICC or Tango is doing an “intra-class” variance analysis as he explains above, they tell us what the “time period” (year, month, etc.) or more accurately the underlying sample size (BIP, TBF, PA) is for the data sample. Even if you have different sizes for each data point (player), which is usually the case, give us the average or so. Otherwise the correlation means nothing.  It only means that there is SOME correlation, the magnitude of which will depend on that underlying sample size.


#16    Pizza Cutter      (see all posts) 2007/10/29 (Mon) @ 00:32

When I say stabilize, I mean that once you get 100 players in your sample, the correlation is much less influenced by outliers than if it was, say, 5 players.  Plus, the confidence interval for the correlation shrinks (good thing).  The ICC’s I reported above are all at the year-to-year level.

The exact numbers of PA/BF for each player aren’t as important (although MGL’s request for listing the sampling window is well-taken), as it’s easy enough to use rate stats, (and here’s the big caveat) assuming a decent minimum of PA, again to bring that confidence interval down to appropriate levels.  What exactly that minimum should be might be a matter of some future investigations (split half reliability anyone?).  I generally pick my minima using the “sounds about right” test, which isn’t the best way to do such things, but as I am wont to say, we must sometimes sacrifice precision for direction in science.

I believe the issue here is that you’re wondering how many PA’s constitute an accurate measurement of a player’s performance/skill within that year (or any specified time period), which is a fair critique.  The answer is that I don’t know and I’m making (hopefully rational) assumptions.  Again, precision and direction.

But as to how strong a correlation is, I’ll grant there is something to it being a matter of opinion or comparison.  My cutoff for sleeping well at night is .70, for the reasons I stated earlier.  It’s an arbitrary cutoff, but one with (I think) some sound logic behind it.  If you’re happy with saying that X statistic is more stable than Y statistic, you can line up the R values and see which is bigger.  But while .15 is greater than .10, neither one gets my blood going, especially once we get into the business of trying to predict future performance with any accuracy.  The entire rationale behind DIPS is that because those year to year correlations were so low (in the .10-.20 range), BABIP was much less skill than luck, but K rate with its very high (around .7-.75) correlation was much more skill than luck.  There seems to be an underlying understanding that .15 is a low correlation while .70 is high and there’s some sort of qualitative difference between them.  It’s just a matter of figuring out some agreeable cutoff system.

I’m also confused as to how more observations would lead to a correlation approaching 1.0 in something that was, in fact, uncorrelated.  For any stat, there is some “true” amount of replicable skill (that is some “true” value for the correlation).  Central limit theory says that as you incorporate more observations, the closer the observed value approaches the true value (whatever it happens to be).  If two things are slightly correlated (let’s say the true value is .05), then more observations (at the player level) will lead to a correlation approaching .05.

As you increase the sample size in a correlation, the chances of finding _statistical significance_ do increase, but that’s because confidence intervals start shrinking.

Correlation is basically a measure of “how well do these data points fit into a nice straight line?” Two data points have a correlation of 1.0, by definition, as two points geometrically define a line.  Even if you had ten points that all fell perfectly on a straight line (r = 1.0), if you add in another point which didn’t fall exactly on the line, the correlation, by definition, will now have to drop away from 1.0.  Placing a single outlier in a small data set will increase the correlation coefficient, but a larger data set modulates the effect of outliers.  What you’re saying seems to run opposite of central limit theory.

I can think of one possible way that this explanation would shake out, and Tom let me know if I’m going in the right direction.  Let’s say that players got a million PA per year and we measured something that was totally luck (clutch?).  In other words, when you say sample size, you’re referring to the window for measurement (within subjects), rather than at the “how many players do we have data for” (between subjects) level.  Eventually over a million PA, every player would have a clutch rating of zero (using the usual WPA - WPA/LI framework).  So, correlating everyone who has zero-zero at time one and time two would lead to a correlation of 1.0.  In this case, you’d be making the argument that 500 or 700 PA isn’t a reliable time sampling to give a true estimation of a player’s clutch “ability” that year.  That would be a fair statement, but not really all that practical.  We usually get 700 PAs with which to evaluate players.  Plus, while the correlation might not shake out at the within-subject level, it certainly shakes out at the between-subjects level.


#17    MGL      (see all posts) 2007/10/29 (Mon) @ 01:07

If there is ANY correlation for the things we are talking about (things that involve an underlying sample size, like PA, not the number of data points), then as the underlying sample size approaches infinity, the correlation will approach 1.0.  If there is no true correlation, then no matter how large or small the underlying sample sizes are (each data point has its own sample size, right?), the true correlation is always zero.

When you say that if the true correlation is .05 then the more observations you have, the correlation will still be .05 (with less and less variance in the observed correlation), that is true.

I have not been talking about the number of observations.  I have been talking about the sample of data within each observation (100 PA, 700 PA, whatever).  I already warned against confusing the two.  We are talking about 2 different things.  You are not in disagreement of ANYTHING I have said.  You are just misunderstanding what I have said.

You said that your ICC are year to year.  That is fine and that answers my question.  But you say that the exact number of PA or BF are not important, but they ARE important.  Without that number (it does not have to be exact), the “r” has no meaning.  You can use rate stats if you want, but you are going to get completely different “r’s” depending upon the PA or BF for each observation in your regression or ICC or whatever it is.

When you do a regression, whether it be a normal linear regression with 2 variables and N observations, or your ICC, which I don’t completely understand (although I think I have a good idea), if each observation has a different number of “sub-observations” (PA/BF), I don’t know how that affects the resultant “r”.

IOW, I KNOW that if I regress a zillion (infinite) BA in one year and all data points had exactly 500 AB with another year with a zillion data points and each data point also had exactly 500 AB, I can confidently (100% confidence) state that the y-t-y correlation in BA for players with exactly 500 AB is .50 or whatever “r” I came up with in the regression.  But if I do the same regression with a zillion observations in each year, but some of those observations had 100 AB, some had 500, some had 700, etc., and I come up with an “r” of .5, I don’t really know what to say.  Maybe Tango can answer that question.


#18    Pizza Cutter      (see all posts) 2007/10/29 (Mon) @ 03:31

Ah ha!  The question that you are pondering is one of measure reliability (that’s the technical term from work), and it comes down to “At what number of PA does (insert name of stat here) become stable?” For example, if we picked 250 random PA and they did a decent enough job predicting a separate 250 PA in the same season, then we can say anybody above 250 PA has a stable enough sample.

For what it’s worth, in general I use 100 PA or BF as the minimum for my stuff unless I mention otherwise.  Again, that’s based on a “feels right” criterion.  I really should be doing the above analysis.

ICC goes kinda like this.  A correlation is found by taking the covariance of the two variables in question over the pooled variance of the two.  An ICC expands that a bit.  If I have four data points, they form a nice little covariance matrix.  I take how much variance is shared over time among the time points over the total pooled variance.  It also corrects for the fact that there will be a certain amount of auto-regression.  How exactly it does that last part, I have no idea.  I’d have to look in my book.
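The description above does not say which form of ICC is being used, so the following Python sketch is only an assumption: the textbook one-way random-effects ICC, computed from between-player and within-player mean squares, with no correction for auto-regression. The rates are hypothetical and `icc_oneway` is a made-up name.

```python
def icc_oneway(rows):
    """One-way ICC for n subjects (rows) measured k times each (columns)."""
    n = len(rows)
    k = len(rows[0])
    grand = sum(sum(row) for row in rows) / (n * k)
    means = [sum(row) / k for row in rows]
    # Variance between player means, scaled by the number of seasons.
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    # Variance of each player's seasons around that player's own mean.
    ms_within = sum((x - m) ** 2
                    for row, m in zip(rows, means)
                    for x in row) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical rates for four players over three seasons.
rates = [
    [0.20, 0.22, 0.21],
    [0.30, 0.28, 0.31],
    [0.25, 0.27, 0.24],
    [0.18, 0.17, 0.19],
]
print(round(icc_oneway(rates), 2))
```

Like a year-to-year r, the result depends on how many PA sit behind each season's rate, which this function never sees.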


#19    Tangotiger      (see all posts) 2007/10/29 (Mon) @ 09:56

Darn it, I had a nice long post, and I guess it blew up.

What MGL is talking about is this:
var(observed) = var(true) + var(error)

The first and third terms are trivial to calculate.

The correlation, r, is equal to 1 minus var(error)/var(observed)

I think this is what Pizza is talking about with his intra-class correlation.  It works like this: historically, the standard deviation of team winning percentage observed is .072.  The binomial says the error for 162 GP is .039.  Realizing that variance is standard deviation squared, you get:
r = 1-(.039/.072)^2 = .70

So, without even having year-to-year data, you can infer that the correlation would be .70.

var(true) is solved as .060^2. 

Now, when would r=.50?  var(error) would have to be also .060^2.  And that happens when GP = 69.  That is, SD = sqrt(.5*.5/69)=.060

The correlation equation is therefore:
r = GP / (GP+69)

So, what MGL is talking about is that he wants to know the number of GP (or PA or whatever).  If GP approaches infinity, r approaches 1.  If GP approaches 0, r approaches 0.  You see, the correlation tells you NOTHING, absolutely NOTHING, unless you know the sample size.  (Not the number of teams, which gives you the reliability of the estimate, but the number of data points, G, PA, IP, for each sample.)

That’s why what people should do is present a regression equation as I do above.
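For reference, the constant in that equation can be derived directly from the two standard deviations. This is a minimal Python sketch using the numbers quoted above (.072 observed, .500 win probability, 162 games); `half_point` and `r_for_games` are hypothetical names.

```python
def half_point(sd_observed, p, n_games):
    """Games at which var(error) equals var(true), so that r = .50."""
    var_true = sd_observed ** 2 - p * (1 - p) / n_games
    return p * (1 - p) / var_true


def r_for_games(games, k):
    """r = GP / (GP + k)."""
    return games / (games + k)


k = half_point(0.072, 0.5, 162)
print(round(k))                       # about 69
print(round(r_for_games(162, k), 2))  # about .70
```

The same two functions work for any binomial rate stat, provided the observed spread and the average number of trials per data point are known.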

