Friday, January 22, 2010
ICC or Intra Class Correlation
Warning! Long post!
Many of you are aware of some of the comments and assertions that Tango and I (and others) have made about regression correlations ("r") as a measure of the amount of “talent” that a particular metric reflects. For example, if you do a regression of a group of pitchers’ BABIP in one year on another year, especially if you limit your sample to pitchers who change teams from one year to the other (in order to filter out defense and park effects) you will get a very low correlation or “r”. What that means (assuming that our sample of pitchers is large enough so that we are pretty confident in the magnitude of that “r") is that there is very little “talent” associated with a pitcher’s BABIP. At least that is how we couch the “effect” in English.
What Tango and I have stated many times is that what we really mean by “very little talent” is that the spread of talent among the population that we drew our sample from (presumably in a random fashion) is small. What we mean by “small,” since words like small and large only have meaning relative to something else (a large cat may be the same size as a small dog, right?), I am not sure. Maybe it is relative to other pitching metrics - maybe it is relative to something else.
What I do know is that “there is little talent in BABIP among pitchers” is probably not the best way to couch the situation, since at some level below MLB, there could be a much larger spread of BABIP talent but that by the time MLB pitchers get to the show, they all have around the same talent, but not necessarily “no talent”, if you know what I mean. It is likely that there is a small spread of BABIP among pitchers at all levels, at least as compared to other baseball talent, for pitchers and hitters, otherwise we would probably see a larger spread at the MLB level, but that is not a given.
One other thing related to these kinds of regressions and correlations that Tango and I have written repeatedly about is the magnitude of the correlation or “r” (I’ll just call it “r” from now on) as a function of the underlying sample size. What we have said is that, for whatever skill or metric (the metric is actually a sample measure of that skill) we are investigating, the “r” that we get when running a regression is a function of two things: One, the underlying sample size of each element in the regression, and two, the spread (variance) of that skill or metric in the population that our elements are drawn from. One thing that is potentially confusing is the use of the term “underlying sample size.” When we do these kinds of regressions, there are two sample sizes: One is the number of players or elements in the regression, usually indicated as “i”. That affects the reliability of the results of the regression, including the “r”. That is not the sample size I am talking about throughout this post. When I say the underlying sample size I mean the number of opportunities that go into each player’s metric, such as AB for BA, or BIP for a pitcher’s BABIP. So if our regression contains 100 pitchers, each with a certain BABIP in one time period and then another time period, one “sample size” is “i” or 100, the number of elements in the regression. The other “sample size” or “underlying sample size” as I call it, is the number of underlying BIP for each of those 200 BABIP values (100 pitchers times two time periods, where time period 1 is the independent variable and time period 2 is the dependent variable, or vice versa). Obviously throughout this post, I am usually referring to the latter sample size or sizes (since usually there are many different sizes). In most social science research, the underlying sample size is one for every element in the regression.
Anyway, one of the more interesting things about the relationship between the size of the underlying sample, the spread of talent in the population with respect to what we are measuring (BA, BABIP, etc.), and the resulting “r” is this:
If there is no spread of skill in the population (i.e. all players have the same true talent with respect to what we are measuring), the “r” will be zero when we run a regression, and:
If there is any spread of skill in the population, no matter how large or small, given a large enough underlying sample of performance, the resultant “r” from the same regression will approach 1.
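Both limits fall out of a simple model, offered here as a sketch under assumptions rather than a full treatment: each observed value is the player's fixed true talent plus independent random noise, the noise variance shrinks in proportion to the number of opportunities n, and both time periods have the same n. The symbols below (sigma_t for the spread of true talent, sigma_e for the noise in a single opportunity) are introduced only for this illustration.

```latex
% x = t + e_1 and y = t + e_2 are one player's observed values in the two periods.
\operatorname{Cov}(x, y) = \sigma_t^2,
\qquad
\operatorname{Var}(x) = \operatorname{Var}(y) = \sigma_t^2 + \frac{\sigma_e^2}{n}

% Expected correlation between the two periods:
r = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_e^2 / n}

% The two limiting cases:
\sigma_t^2 = 0 \;\Rightarrow\; r = 0,
\qquad
\sigma_t^2 > 0 \text{ and } n \to \infty \;\Rightarrow\; r \to 1
```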
Let me give a few examples:
Let’s say we are measuring baseball players’ true speed in 100 meters. We’ll start with the assumption that every player has a “true” speed (that is their “true talent” in speed) and that it doesn’t change over time. If everyone had the same true speed, whether we measured that speed once or 1000 times, if we ran a regression with, say 500 players, and we regressed one set of “times” (either 1 or 1000 or any other underlying sample size) on another set of times (again, either 1 or 1000 or any other underlying sample size), the “r” will always be zero.
Now, let’s say that among the population that those 500 players were drawn from, the spread of true speed were small (the variance is small). Again, when I say “small,” it depends on “compared to what,” but it doesn’t matter right now. Let’s say that one SD were .1 seconds, and that the mean were 10 seconds, so that 95% of all players in our population and presumably in our 500 player sample were between 9.8 and 10.2. That is indeed a narrow range. In reality, it is probably closer to a mean of 10.5 and a SD of .5 seconds (or whatever). Anyway, if we measured everyone once, we are going to get a lot of random (we’ll assume it is random at least) variation around everyone’s true time, because of the accuracy of the person with the stop watch, the weather conditions, the mood and condition of the person doing the running at the time the measurement is taking place, the condition of the surface being run on, etc., especially if those things are (randomly) different every time we do a measurement.
So if we measure everyone once AND the spread in true talent among the players is small, the random variability around each measurement is going to “swamp” the true variability in talent among the players, and the resultant “r” when we run the regression is going to be low.
However, if the spread of true talent is large, let’s say instead of a SD of .1 seconds it is 1 second, then even with the random variability surrounding each measurement, it will be fairly easy to distinguish the fast from the slow players, and the resulting “r” will be a lot larger even if we make only one measurement per player.
Now, suppose we do 10000 measurements per player in all kinds of conditions and we average them. The assumption is that the resulting mean time will be very close to each player’s true time. If that be the case, then whether the spread of true talent in the population is .1 or 2 seconds, guess what the “r” is going to be if we run a regression of one set of times (several thousand) on another? Close to 1 of course.
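The sprint example can be checked with a small simulation. This is a minimal sketch, not a measurement of anything real: the single-timing noise of 0.5 seconds (`noise_sd`) is an assumed value, and as a shortcut the average of many timings is drawn directly from a normal distribution with SD `noise_sd / sqrt(n_timings)` rather than simulated one timing at a time.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_r(talent_sd, n_timings, n_players=500, noise_sd=0.5, seed=1):
    """Correlate two sets of averaged 100 m times for the same players.

    noise_sd (SD of one timing around the true time) is an assumed value.
    The mean of n_timings independent timings is drawn directly as
    Normal(true_time, noise_sd / sqrt(n_timings)).
    """
    rng = random.Random(seed)
    true_times = [rng.gauss(10.0, talent_sd) for _ in range(n_players)]
    avg_sd = noise_sd / math.sqrt(n_timings)
    first = [rng.gauss(t, avg_sd) for t in true_times]
    second = [rng.gauss(t, avg_sd) for t in true_times]
    return pearson_r(first, second)


for talent_sd in (0.1, 1.0):
    for n_timings in (1, 10_000):
        r = simulate_r(talent_sd, n_timings)
        print(f"talent SD {talent_sd}, {n_timings} timings each: r = {r:.3f}")
```

With one timing per player the narrow-spread case should come out near zero and the wide-spread case much higher, while with 10000 timings both should land very close to 1.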
The same exact thing is true for BA or for BABIP among pitchers. If the spread is small or large, if we only measure it a few times - e.g., the number of AB in the BA is 5 or 10, or the number of BIP in the BABIP for pitchers is also 5 or 10 - if we run a regression of one time period on another, even assuming that true talent does not change over those time periods, the resultant “r” will be small. If the spread in true talent is small, the “r” will be smaller, but given a small underlying sample size, like 5 or 10 AB or BIP, the “r” will still be small whether the true talent spread is small or large. It might be .01 (small spread in true talent) compared to .03 (large spread in true talent) or something like that.
Now, if we sample BA or BABIP for each player over a large number of opportunities, say 500 or 1000 (the equivalent of a season or so for a full- time player), now our resultant “r” is going to be a lot larger whether the spread in talent is large or small. It might be .2 and .4, whereas before it was .01 and .03.
In fact, regardless of the spread of true talent in the population, for example, whether players’ true talent BA were between .260 and .280 only, or .200 and .350, if we sample them tens or hundreds of thousands of times once and then tens or hundreds of thousands of times again (the dependent and independent variables in our regression - one time period on another), the resultant “r” will be close to 1.
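For a rate stat like BA or BABIP, the same idea can be written down without simulating anything. The sketch below assumes every AB is an independent success or failure at the player's fixed true rate and that both time periods have the same number of AB. The league mean of .265 and the two talent spreads are hypothetical values chosen for illustration, so the printed numbers are not meant to reproduce the rough figures quoted above.

```python
def expected_r(n, mean_rate, talent_sd):
    """Expected period-to-period correlation of a rate stat over n trials.

    Assumes independent success/failure trials at each player's fixed true
    rate, and the same n in both periods.
    """
    talent_var = talent_sd ** 2
    # Average binomial noise variance of an observed rate over n trials:
    # E[p * (1 - p)] / n, where E[p * (1 - p)] = mean * (1 - mean) - talent_var.
    noise_var = (mean_rate * (1 - mean_rate) - talent_var) / n
    return talent_var / (talent_var + noise_var)


# Hypothetical spreads of true-talent BA around an assumed mean of .265.
for talent_sd in (0.005, 0.030):
    for n in (10, 500, 100_000):
        r = expected_r(n, 0.265, talent_sd)
        print(f"talent SD {talent_sd:.3f}, {n:>6} AB: expected r = {r:.3f}")
```

The pattern to look for is the one described in the text: tiny values at 10 AB for either spread, a clear gap between the two spreads at 500 AB, and both values heading toward 1 at 100000 AB.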
The moral of the story is that when I tell you that I ran a regression of a certain metric (and that metric has an underlying sample size, like AB, PA or BIP, or even the number of times I timed each player in a 100 meter dash) and I say that I got an “r” of .05 or .5, that number should mean nothing to you unless I tell you the average or the distribution of opportunities (underlying sample size) in my sample of players.
Even then, the meaning of the magnitude of the “r” doesn’t mean a whole lot unless it is compared to something else. For example, if I say that I am working with BABIP for batters and I get an “r” of .4 and my average underlying sample size is 300 BIP for each element in my regression, does that mean that BABIP for batters reflects a lot of talent, a little bit of talent, or a medium amount of talent? I don’t know. It looks like it is “medium,” but if my underlying sample size were 100 rather than 300, the “r” might be .2, which looks small, even though the spread of talent in the population or the “amount” of talent in BABIP for batters has not changed. And if the underlying sample size were 2000 BIP (3 or 4 years worth of data), the “r” might be .7 which “looks” large.
But, if I said that for 300 BIP for batters, the “r” was .4, and for 300 PA of OBP, the “r” was .6, but for 300 BIP, the “r” for BABIP for pitchers was .05, you might have a different opinion regarding the size of that .4 since you can compare it to other things we are measuring. You would probably say that there is more “talent” (the spread of talent is greater) in OBP than BABIP for batters, and a lot more talent in BABIP for batters than for pitchers. Everything is relative.
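One way to put these comparisons on a footing that does not depend on the underlying sample size is to assume that "r" follows the form r = n / (n + k), where k is the number of opportunities at which "r" would equal .5. That form is an assumption (it holds when the noise variance shrinks in proportion to n and both periods have the same n), and the helper names below are made up for this sketch. The inputs are the hypothetical figures from the last two paragraphs, and the guesses given there for 100 and 2000 BIP were rough, so the outputs need not match them exactly.

```python
def half_point(r, n):
    """Opportunities k at which r would equal 0.5, assuming r = n / (n + k)."""
    return n * (1 - r) / r


def r_at(n, k):
    """Expected correlation at n opportunities under the same assumption."""
    return n / (n + k)


# The (r, n) pairs are the hypothetical figures quoted in the text.
examples = {
    "batter BABIP": (0.40, 300),
    "OBP": (0.60, 300),
    "pitcher BABIP": (0.05, 300),
}
for name, (r, n) in examples.items():
    k = half_point(r, n)
    print(f"{name}: k = {k:.0f}; r at 100 = {r_at(100, k):.2f}, "
          f"r at 2000 = {r_at(2000, k):.2f}")
```

A smaller k means a wider spread of talent relative to the noise, so the ordering of the k values says the same thing as the ordering of the "r" values at 300, but without needing the "at 300" qualifier.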
So, what does this have to do with the title of this thread? Well, what if we have a bunch of players in our regression and each player has a different underlying sample size? Let’s say that we are using BA as our metric and some players have 50 AB and other players have 500 AB, both in one time period and the other time period (again, the dependent and independent variables)? Well, that’s a mess. We can do the regression, but regardless of the spread of talent with respect to BA in the population, the players with 50 AB in either one time period or both time periods will help to create a very small “r” and the ones with 500 AB in both time periods will help to result in a much larger “r”. So the overall “r” when we run the regression will be somewhere in between and we can report that number and then report the average size of the underlying sample. I don’t know enough about statistics to know whether the resultant “r” for a group of players where the underlying sample sizes vary a lot but average to 300 AB will be the same as a group of players who all have underlying sample sizes of 300 AB. I suspect that they will be similar but not exactly the same and I also suspect that the ones that vary in the number of AB will have a smaller “r” even though both groups have the same mean number of AB per player.
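The suspicion in the last sentence can be explored with a back-of-the-envelope calculation. Under a simple model (r = n / (n + k) for a single sample size, each player having the same AB count in both periods, and true talent unrelated to playing time), the covariance between periods does not depend on AB but the pooled noise variance is proportional to the average of 1/AB. All of those are assumptions, the last one is clearly not true of real rosters, and k = 450 and the 50/550 split are hypothetical values picked so that both groups average 300 AB.

```python
def pooled_r(ab_counts, k):
    """Expected pooled correlation when players have different AB counts.

    Assumes r = n / (n + k) for a single sample size, the same AB count in
    both periods for each player, and talent unrelated to playing time.
    """
    mean_inverse = sum(1 / n for n in ab_counts) / len(ab_counts)
    return 1 / (1 + k * mean_inverse)


k = 450  # hypothetical: gives r = 0.40 when everyone has exactly 300 AB
uniform = [300] * 100
mixed = [50] * 50 + [550] * 50  # also averages 300 AB per player

print(f"all players at 300 AB:   r = {pooled_r(uniform, k):.3f}")
print(f"half at 50, half at 550: r = {pooled_r(mixed, k):.3f}")
```

In this sketch the mixed group behaves as if everyone had the harmonic mean of the AB counts rather than the arithmetic mean, which is why its "r" comes out lower even though the average AB is the same.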
I suppose we could just eliminate the players with small underlying sample sizes, but if we do that, we are deliberately ignoring valuable information such that our regression and resultant “r” is going to be cleaner but less reliable (have a larger standard error) because we are effectively reducing the sample size of the regression.
Or, what if we just have a bunch of players from all different time periods and we want to determine some kind of correlation with respect to a particular metric? Do we try and create a regression of one time period to another? What if we have some players with 3 or 4 time periods and others with only 2? Again, what if some of those time periods are comprised of 10 AB and others are 1000 AB? What if we have the results of every AB (very valuable information) for a very long time for some players and not so long for other players? Again, we can aggregate the data and then run one or more regressions of one time period on another. Regardless of what we do, it is still a mess.
Typically, what a lot of researchers do, including myself, is organize and aggregate the data so that we run a regression of one time period on another for several time periods while restricting the underlying samples to some minimum number. For example, we’ll take 10 years of BA data and run a regression of 2000 on 2001, 2002 on 2003, 2004 on 2005, etc. for all players with a minimum of 300 AB per all of those seasons.
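As a rough sketch of that procedure, the code below pairs up seasons, drops any pair in which either season falls short of the AB minimum, and correlates what is left. The data layout (`player -> {year: (hits, at_bats)}`), the function names, and the records themselves are all made up purely to show the shape of the calculation.

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def season_pairs(ba_by_player, year_pairs, min_ab=300):
    """Collect (BA in first year, BA in second year) for qualifying players.

    ba_by_player maps player -> {year: (hits, at_bats)}; hypothetical layout.
    """
    xs, ys = [], []
    for seasons in ba_by_player.values():
        for y1, y2 in year_pairs:
            if y1 in seasons and y2 in seasons:
                h1, ab1 = seasons[y1]
                h2, ab2 = seasons[y2]
                # Keep the pair only if both seasons meet the minimum.
                if ab1 >= min_ab and ab2 >= min_ab:
                    xs.append(h1 / ab1)
                    ys.append(h2 / ab2)
    return xs, ys


# Made-up records, only to show the shape of the input.
data = {
    "player_a": {2000: (150, 500), 2001: (160, 520)},
    "player_b": {2000: (120, 450), 2001: (110, 430),
                 2002: (30, 120), 2003: (140, 480)},
    "player_c": {2000: (95, 350), 2001: (100, 340)},
}
xs, ys = season_pairs(data, [(2000, 2001), (2002, 2003)])
print(len(xs), "qualifying pairs, r =", round(pearson_r(xs, ys), 3))
```

Note that player_b's 2002 to 2003 pair is thrown away because of the 120 AB season, which is exactly the kind of discarded information discussed next.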
But, again, we are eliminating useful data (on the other hand, we are limiting our population to full-time or almost full-time players - if we include all players then our population might have a different spread of talent). Why not use all players with all number of AB per season or even the result of every AB for every player over any time period? Why do year to year regressions and resulting correlations? How would we do that?
Enter ICC or intra-class correlation, which is a favorite of Russell Carleton, formerly known as Pizza Cutter, who is a PhD psychologist and psychometrician, and well-versed in statistical techniques (much more so than I). ICC is apparently able to take all that data, no matter how it is grouped and come up with an “r”. I am not surprised that it can do that, but I have one gigantic question which I have asked Russell on at least one occasion, probably more, and I have yet to receive or understand an adequate answer.
Since I just spent about 1000 words explaining why “r” depends on the underlying sample size of what you are measuring (PA, AB, BIP, number of times you time a runner, etc.), how can an ICC give you an “r” for a group of players with all different underlying sample sizes and what does that “r” mean if those underlying sample sizes are not reported along with the “r”?
And if I give Russell (or anyone else who knows how to do ICC - I don’t) a bunch of players’ BA and each player in each time period has around 100 AB, will that result in the same ICC as if I gave him a bunch of players with 1000 AB in each time period? If the answer to that is “yes,” then what does that “r” mean, since I told you that doing a traditional regression will ALWAYS result in an “r” of between 0 and 1, depending on the size of the underlying samples?
If the answer is “no,” then don’t you have to give some kind of indication of the average underlying sample size or the distribution of sample sizes, when you present your ICC? I have never seen Russell do that. O.K., maybe not never, but not always.
For example, Russell recently wrote an article (an excellent one, BTW) on BP that looked at how often MLB managers ran their base stealers as a function of base stealing opportunities and the ability of the base stealer, and he compared each manager to the league average to see if they were aggressive or conservative. He wrote this:
To test this, I used one of my favorite techniques, the AR(1) intra-class correlation. It’s somewhat like the year-to-year correlation, but it enables the inclusion of more than just two time points. It can be read, however, like any old correlation. Over the seven years in the study, the ICC was a nifty .538.
Now, here he admits that ICC is like a y-t-y correlation, but that it enables you to do a correlation without splitting the data up into two time points for each player. He doesn’t say so, but it enables you to also use players (or managers in this case) with different underlying samples of performance, with respect to what you are measuring (although you CAN do that with y-t-y correlations, but as I also explained, it is a mess, as players with small underlying sample sizes bring the “r” down and players with large ones bring it up).
Here the underlying sample size is the number of base stealing opportunities for each manager in whatever time periods he is aggregating the data into - I assume one season at a time (you have to have at least two time periods for every element whether you are doing a traditional regression to get your “r” or an ICC).
Anyway, he gives us an “r” of .538. But, what does that mean? Obviously there is some spread of skill in how conservative or not managers are in terms of sending potential base stealers. Probably it is pretty large. But if we do a traditional regression, the size of the resultant “r” will, as usual, depend on whether we are regressing one set of 100 opps per manager on another set of 100 opps (say, one half-year to another half-year) or 1000 (say 5 years to 5 years). So does the ICC depend on that as well? If it does, doesn’t Russell have to tell us the size or the range of the underlying sample sizes? If not, then how do we interpret the .538?
In the comments section of the article, Russell, another fellow named Ben Solow, and I had this dialogue:
Pizza, we may have discussed this before in another venue, but since “r” is always a function of (the underlying) sample size (not the number of pairs in the regression), in your intra-class correlations, how do we/you know the sample size associated with your “r”? For example, if I were working with the same data you are, and I regressed first half on second half, I might get an “r” of .4, if I regressed one whole year on another year, I might get an “r” of .5 or .6, if I regressed 5 years of manager data on another 5 years, I might get .8, etc. In this instance, you mention that the “r” was .538. Without knowing how many games (or steal opportunities or whatever the “unit” is) that represents, I have no idea whether .538 is “consistent” or not.
Ben Solow
.538 doesn’t refer to the r^2 of the logit regression, though. If I’m understanding the grouping decision correctly, that value (the ICC) is calculated as the ratio of the variance across managers to the sum of the variance across managers and the variance of managers over time. .538 means that (variance of managers) = .538*(variance of managers + variance over time), or that the variance between managers is equal to roughly 1.16 times the variance of a randomly selected manager over time, meaning managers are relatively more consistent over time than they are across individuals. I’m not as familiar with ICC as others (Eric and Russell both, for sure), but it seems that if sample size entered the equations for estimated variance it wouldn’t have much of an effect.
BP staff member Russell A. Carleton
Mr. Solow’s response is mostly right. ICC is a measure of consistency across the years. I did toss out most of the interim managers who only had a few games at the helm when I ran that ICC, specifically for sample size reasons. (He had to call for at least 50 SB attempts.)
Think of ICC like year-to-year. If I only had five observations per year, then I’d probably get a lot of random variation and so not a lot of consistency within managers over the years. My choice of inclusion cutoff was somewhat arbitrary, but based more on the realities of what we’re observing. We look at managers based on the season-to-season level, so I evaluated them as such.
MGL
“If I only had five observations per year, then I’d probably get a lot of random variation and so not a lot of consistency within managers over the years.”
Do you mean managers with 5 SB opportunities or 5 managers per year? I am talking about the former, of course, when I am talking about sample size. The number of observations will NOT affect the correlations, only the standard error.
You always say, “Think of an ICC as like a y-t-y correlation.” But, as I originally said, the magnitude of a y-t-y correlation specifically depends on the number of “opportunities” in each year and without knowing that number, it means nothing. If I regress OBP on OBP from one year to the next, and I only include players with 100 or less PA each year, I might get a correlation of .25. If I only include players with PA greater than 400, I might get .60. So just saying, “My y-t-y ‘r’ for OBP was .5” means nothing unless I know the number of PA per year in my sample. (It is also nice to know the number of players or “observations” as that will help me to figure my standard error around the correlation.)
So if I have a bunch of players in a bunch of years, and you tell me the ICC for OBP, again, that means nothing to me unless I know the range or distribution of PA in the sample, right?
Maybe I have it wrong. Maybe the ICC is sort of a combination of “r,” as when we do a y-t-y “r” and the underlying sample size. For example, if you have a bunch of players with samples of 400 PA and you do an ICC for OBP and you have a bunch of players with samples of only 100 PA, will you come up with the same ICC?
Ben Solow
The magnitude of a year to year correlation does NOT necessarily depend on the sample size either over time or within a given year. Your estimate of the population correlation may be more accurate, but the value of that estimate is not a function of sample size. There’s some noise in these estimates, which means increasing sample size is always a good thing, but as long as there’s enough sample that the law of large numbers holds, you’re probably pretty safe.
BP staff member Russell A. Carleton
I meant 5 SB opportunities as well. I think we’re on the same page methodologically. You are correct in that the number of PA/BF/opportunities can affect ICC, much in the same way that it would affect yty. However, as Mr. Solow points out, so long as you set your inclusion criteria high enough, it’s not going to make a big difference. In this case, I actually upped the criteria a bit and didn’t get much improvement in ICC. It’s something of an asymptotic relationship.
In this particular case, there are two different questions that one can ask. One is, “How reliable is this stat year to year?” (which I chose to ask, .538) The other is “How many PA/BF/opps does it take before this stat becomes reliable?” I haven’t run that one yet.
Again, I was wholly unsatisfied with their responses. Either I don’t understand the issue well enough, we are talking past one another, or they are not using or explaining ICC correctly with respect to these baseball situations (where there are different underlying sample sizes which is not usually the case in the real world - for example in the real world you give a bunch of students a test ONE TIME and then you may repeat the test several times over several time periods, but you don’t give one student a test 100 times in one time period and another student 10 times in one time period, and then retest both students in another time period, one of them 30 times and the other 120 times).
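For anyone who wants to poke at the question directly, here is a minimal sketch of how the underlying sample size could be tested against an ICC. It uses the plain one-way random-effects ICC on balanced data, which is NOT the AR(1) variant Russell describes, so it cannot say how his .538 should be read. Every number in the simulation (300 subjects, 7 seasons, a talent SD of .03, a per-opportunity noise SD of .45, a mean rate of .30) is an assumption chosen for illustration.

```python
import math
import random


def icc_oneway(groups):
    """One-way random-effects ICC for balanced data (same count per subject)."""
    m = len(groups)          # number of subjects
    k = len(groups[0])       # observations (seasons) per subject
    grand = sum(sum(g) for g in groups) / (m * k)
    means = [sum(g) / k for g in groups]
    ms_between = k * sum((mu - grand) ** 2 for mu in means) / (m - 1)
    ss_within = sum((x - mu) ** 2 for g, mu in zip(groups, means) for x in g)
    ms_within = ss_within / (m * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def simulate_icc(opps_per_season, n_subjects=300, n_seasons=7,
                 talent_sd=0.03, trial_sd=0.45, seed=7):
    """ICC of simulated season rates; every default here is an assumption."""
    rng = random.Random(seed)
    # Noise in a season rate shrinks with the square root of opportunities.
    season_sd = trial_sd / math.sqrt(opps_per_season)
    groups = []
    for _ in range(n_subjects):
        true_rate = rng.gauss(0.30, talent_sd)
        groups.append([rng.gauss(true_rate, season_sd)
                       for _ in range(n_seasons)])
    return icc_oneway(groups)


for opps in (100, 1000):
    print(f"{opps} opportunities per season: ICC = {simulate_icc(opps):.3f}")
```

In this toy setup the spread of talent is identical in both runs and only the opportunities per season change, so any difference between the two printed values comes from the underlying sample size alone.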
Anyway, would anyone else like to chime in on this issue? Did anyone even get this far?


I’m 90% sure that the sample size in terms of games managed matters. I suspect that if Pizza (or do we have to call him Russell now? I like Pizza. mmm...pizza /homer) did the same ICC using month by month data his r would be smaller, though I can’t confirm as I’m not sure how to do an ICC.
The other sample size, number of managers used, I don’t think has a predictable relationship to r, but the more you use the better as you’re more likely to get the true value you’re trying to measure. Having a small sample of managers could give you a larger or smaller r, depending on the luck of your draw. For example, player weight correlated to power would give you a decent r, but if you picked a small sample of players and wound up with Joe Morgan, Jimmy Wynn, Ken Oberkfell, and Casey Kotchman you’d see a negative relationship.
Just a nitpick, r can be from -1 to 1, with a negative r showing an inverse relationship.