THE BOOK--Playing The Percentages In Baseball


Friday, January 22, 2010

ICC or Intra Class Correlation

By , 06:56 AM

Warning!  Long post!

Many of you are aware of some of the comments and assertions that Tango and I (and others) have made about regression correlations ("r") as a measure of the amount of “talent” that a particular metric reflects.  For example, if you do a regression of a group of pitchers’ BABIP in one year on another year, especially if you limit your sample to pitchers who change teams from one year to the other (in order to filter out defense and park effects) you will get a very low correlation or “r”.  What that means (assuming that our sample of pitchers is large enough so that we are pretty confident in the magnitude of that “r") is that there is very little “talent” associated with a pitcher’s BABIP.  At least that is how we couch the “effect” in English.

What Tango and I have stated many times is that what we really mean by “very little talent” is that the spread of talent among the population that we drew our sample from (presumably in a random fashion) is small.  What we mean by “small,” since words like small and large only have meaning relative to something else (a large cat may be the same size as a small dog, right?), I am not sure.  Maybe it is relative to other pitching metrics - maybe it is relative to something else.

What I do know is that “there is little talent in BABIP among pitchers” is probably not the best way to couch the situation.  At some level below MLB there could be a much larger spread of BABIP talent, and by the time pitchers get to the show they all have around the same talent, which is not the same as “no talent,” if you know what I mean.  It is likely that there is a small spread of BABIP talent among pitchers at all levels, at least as compared to other baseball talents for pitchers and hitters (otherwise we would probably see a larger spread at the MLB level), but that is not a given.


One other thing related to these kinds of regressions and correlations that Tango and I have written repeatedly about is the magnitude of the correlation or “r” (I’ll just call it “r” from now on) as a function of the underlying sample size.  What we have said is that for whatever skill or metric we are investigating (the metric is actually a sample measure of that skill), the “r” that we get when running a regression is a function of two things:  One, the underlying sample size of each element in the regression, and two, the spread (variance) of that skill or metric in the population that our elements are drawn from.

One thing that is potentially confusing is the use of the term “underlying sample size.” When we do these kinds of regressions, there are two sample sizes: One is the number of players or elements in the regression, usually indicated as “i”.  That affects the reliability of the results of the regression, including the “r”.  That is not the sample size I am talking about throughout this post.  When I say the underlying sample size I mean the number of opportunities that go into each player’s metric, such as AB for BA, or BIP for a pitcher’s BABIP.  So if our regression contains 100 pitchers, each with a certain BABIP in one time period and then another time period, one “sample size” is “i” or 100, the number of elements in the regression.  The other “sample size” or “underlying sample size” as I call it, is the number of underlying BIP for each of those 200 pitcher-periods (100 pitchers times two time periods, where time period 1 is the independent variable and time period 2 is the dependent variable, or vice versa).  Obviously throughout this post, I am usually referring to the latter sample size or sizes (since usually there are many different sizes). In most social science research, the underlying sample size is one for every element in the regression.

Anyway, one of the more interesting things about the relationship between the size of the underlying sample and the spread of talent in the population with respect to what we are measuring (BA, BABIP, etc.) is this:

If there is no spread of skill in the population (i.e. all players have the same true talent with respect to what we are measuring), the “r” will be zero when we run a regression, and:

If there is any spread of skill in the population, no matter how large or small, given a large enough underlying sample of performance, the resultant “r” from the same regression will approach 1.

Let me give a few examples:

Let’s say we are measuring baseball players’ true speed in 100 meters.  We’ll start with the assumption that every player has a “true” speed (that is their “true talent” in speed) and that it doesn’t change over time.  If everyone had the same true speed, whether we measured that speed once or 1000 times, if we ran a regression with, say 500 players, and we regressed one set of “times” (either 1 or 1000 or any other underlying sample size) on another set of times (again, either 1 or 1000 or any other underlying sample size), the “r” will always be zero.

Now, let’s say that among the population that those 500 players were drawn from, the spread of true speed were small (the variance is small).  Again, when I say “small,” it depends on “compared to what,” but it doesn’t matter right now.  Let’s say that one SD were .1 seconds, and that the mean were 10 seconds, so that 95% of all players in our population and presumably in our 500 player sample were between 9.8 and 10.2.  That is indeed a narrow range. In reality, it is probably closer to a mean of 10.5 and a SD of .5 seconds (or whatever).  Anyway, if we measured everyone once, we are going to get a lot of random (we’ll assume it is random at least) variation around everyone’s true time, because of the accuracy of the person with the stop watch, the weather conditions, the mood and condition of the person doing the running at the time the measurement is taking place, the condition of the surface being run on, etc., especially if those things are (randomly) different every time we do a measurement.

So if we measure everyone once AND the spread in true talent among the players is small, the random variability around each measurement is going to “swamp” the true variability in talent among the players, and the resultant “r” when we run the regression is going to be low.

However, if the spread of true talent is large, let’s say instead of a SD of .1 seconds it is 1 second, then even with the random variability surrounding each measurement, it will be fairly easy to distinguish the fast from the slow players, and the resulting “r” will be a lot larger even if we make only one measurement per player.

Now, suppose we do 10000 measurements per player in all kinds of conditions and we average them.  The assumption is that the resulting mean time will be very close to each player’s true time.  If that be the case, then whether the spread of true talent in the population is .1 or 2 seconds, guess what the “r” is going to be if we run a regression of one set of times (several thousand) on another?  Close to 1 of course.
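
The sprint example is easy to check with a quick simulation.  What follows is only a sketch under assumptions that are not in the example itself: true times are drawn from a normal distribution around 10 seconds, every single timing carries normal noise with a SD of .5 seconds (the noise_sd default), and the default of 5000 players is larger than the 500 above just to keep the simulated “r” steady.  The function names are made up for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_r(talent_sd, n_timings, n_players=5000, noise_sd=0.5, seed=1):
    """Correlate two sets of averaged 100 m times for the same players.

    Assumed, not taken from the post: normal true times around 10 s and
    independent normal noise with SD noise_sd on every single timing.
    """
    rng = random.Random(seed)
    # Averaging n_timings noisy timings shrinks the noise SD by sqrt(n_timings).
    avg_noise_sd = noise_sd / math.sqrt(n_timings)
    first, second = [], []
    for _ in range(n_players):
        true_time = rng.gauss(10.0, talent_sd)
        first.append(true_time + rng.gauss(0.0, avg_noise_sd))
        second.append(true_time + rng.gauss(0.0, avg_noise_sd))
    return pearson_r(first, second)


def expected_r(talent_sd, n_timings, noise_sd=0.5):
    """Theoretical r: talent variance divided by total observed variance."""
    talent_var = talent_sd ** 2
    return talent_var / (talent_var + noise_sd ** 2 / n_timings)


if __name__ == "__main__":
    for talent_sd in (0.1, 1.0):
        for n_timings in (1, 10000):
            print(talent_sd, n_timings,
                  round(simulate_r(talent_sd, n_timings), 3),
                  round(expected_r(talent_sd, n_timings), 3))
```

With one timing per player the narrow-spread group should show an “r” near zero and the wide-spread group a large one, and with 10000 timings both should sit very close to 1, which is the pattern described above.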

The same exact thing is true for BA or for BABIP among pitchers.  If the spread is small or large, if we only measure it a few times - e.g., the number of AB in the BA is 5 or 10, or the number of BIP in the BABIP for pitchers is also 5 or 10 - if we run a regression of one time period on another, even assuming that true talent does not change over those time periods, the resultant “r” will be small.  If the spread in true talent is small, the “r” will be smaller, but given a small underlying sample size, like 5 or 10 AB or BIP, the “r” will still be small whether the true talent spread is small or large.  It might be .01 (small spread in true talent) compared to .03 (large spread in true talent) or something like that.

Now, if we sample BA or BABIP for each player over a large number of opportunities, say 500 or 1000 (the equivalent of a season or so for a full-time player), our resultant “r” is going to be a lot larger whether the spread in talent is large or small.  It might be .2 and .4, whereas before it was .01 and .03.
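
For BA the same idea can be put into a few lines of arithmetic.  This is a rough sketch, assuming each AB is an independent trial so that the luck variance in a sample is about p(1-p)/AB.  The league average of .270 and the talent SDs of .010 and .030 are assumed values for illustration, so the output will not match the made-up numbers above exactly.

```python
def expected_r_ba(n_ab, talent_sd, mean_ba=0.270):
    """Expected period-to-period r for batting average.

    Assumes every AB is an independent Bernoulli trial, so the luck
    variance in a sample of n_ab is roughly mean_ba * (1 - mean_ba) / n_ab.
    """
    talent_var = talent_sd ** 2
    luck_var = mean_ba * (1.0 - mean_ba) / n_ab
    return talent_var / (talent_var + luck_var)


if __name__ == "__main__":
    # Small spread vs. large spread of true talent (both assumed values).
    for talent_sd in (0.010, 0.030):
        for n_ab in (10, 500, 100000):
            print(f"talent SD {talent_sd:.3f}, {n_ab:>6} AB: "
                  f"r = {expected_r_ba(n_ab, talent_sd):.3f}")
```

The thing to look at is the shape, not the exact values: tiny “r” at 10 AB for either spread, a much larger “r” at 500 AB, and something close to 1 once the AB run into the tens of thousands.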

In fact, regardless of the spread of true talent in the population, for example, whether players’ true talent BA were between .260 and .280 only, or .200 and .350, if we sample them tens or hundreds of thousands of times once and then tens or hundreds of thousands of times again (the dependent and independent variables in our regression - one time period on another), the resultant “r” will be close to 1.

The moral of the story is that when I tell you that I ran a regression of a certain metric (and that metric has an underlying sample size, like AB, PA or BIP, or even the number of times I timed each player in a 100 meter dash) and I say that I got an “r” of .05 or .5, that number should mean nothing to you unless I tell you the average or the distribution of opportunities (underlying sample size) in my sample of players.

Even then, the meaning of the magnitude of the “r” doesn’t mean a whole lot unless it is compared to something else.  For example, if I say that I am working with BABIP for batters and I get an “r” of .4 and my average underlying sample size is 300 BIP for each element in my regression, does that mean that BABIP for batters reflects a lot of talent, a little bit of talent, or a medium amount of talent?  I don’t know.  It looks like it is “medium,” but if my underlying sample size were 100 rather than 300, the “r” might be .2, which looks small, even though the spread of talent in the population or the “amount” of talent in BABIP for batters has not changed.  And if the underlying sample size were 2000 BIP (3 or 4 years worth of data), the “r” might be .7 which “looks” large.

But, if I said that for 300 BIP for batters, the “r” was .4, and for 300 PA of OBP, the “r” was .6, but for 300 BIP, the “r” for BABIP for pitchers was .05, you might have a different opinion regarding the size of that .4 since you can compare it to other things we are measuring.  You would probably say that there is more “talent” (the spread of talent is greater) in OBP than BABIP for batters, and a lot more talent in BABIP for batters than for pitchers.  Everything is relative.

So, what does this have to do with the title of this thread?  Well, what if we have a bunch of players in our regression and each player has a different underlying sample size?  Let’s say that we are using BA as our metric and some players have 50 AB and other players have 500 AB, both in one time period and the other time period (again, the dependent and independent variables).  Well, that’s a mess.  We can do the regression, but regardless of the spread of talent with respect to BA in the population, the players with 50 AB in either one time period or both time periods will help to create a very small “r” and the ones with 500 AB in both time periods will help to result in a much larger “r”.  So the overall “r” when we run the regression will be somewhere in between and we can report that number and then report the average size of the underlying sample.  I don’t know enough about statistics to know whether the resultant “r” for a group of players where the underlying sample sizes vary a lot but average to 300 AB will be the same as for a group of players who all have underlying sample sizes of 300 AB.  I suspect that they will be similar but not exactly the same, and I also suspect that the group whose AB vary will have a smaller “r” even though both groups have the same mean number of AB per player.
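
That last suspicion can at least be poked at with a simulation.  The sketch below is not a proof and rests on assumptions of my choosing: normally distributed true BA (mean .270, SD .025), a normal approximation to the luck in each sample, and a mixed group where half the players have 50 AB and half have 550 AB in both periods (so the mean is still 300).

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_r(ab_choices, n_players=20000, mean_ba=0.270,
               talent_sd=0.025, seed=7):
    """r between two periods when each player's AB is picked from ab_choices.

    Assumptions: normal talent, and a normal approximation to binomial
    luck with SD sqrt(p * (1 - p) / AB) for a player whose true BA is p.
    """
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_players):
        p = rng.gauss(mean_ba, talent_sd)
        n_ab = rng.choice(ab_choices)  # same AB count in both periods
        luck_sd = math.sqrt(p * (1.0 - p) / n_ab)
        first.append(p + rng.gauss(0.0, luck_sd))
        second.append(p + rng.gauss(0.0, luck_sd))
    return pearson_r(first, second)


if __name__ == "__main__":
    print("everyone at 300 AB:  ", round(simulate_r([300]), 3))
    print("half 50, half 550 AB:", round(simulate_r([50, 550]), 3))
```

Under these assumptions the mixed group should come out with the smaller “r”.  The reason is that the luck variance goes as 1/AB, and the average of 1/AB over a mixed group is larger than 1 over the average AB, so the 50 AB players add more noise than the 550 AB players take away.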

I suppose we could just eliminate the players with small underlying sample sizes, but if we do that, we are deliberately ignoring valuable information such that our regression and resultant “r” is going to be cleaner but less reliable (have a larger standard error) because we are effectively reducing the sample size of the regression.

Or, what if we just have a bunch of players from all different time periods and we want to determine some kind of correlation with respect to a particular metric?  Do we try and create a regression of one time period to another?  What if we have some players with 3 or 4 time periods and others with only 2?  Again, what if some of those time periods are comprised of 10 AB and others are 1000 AB?  What if we have the results of every AB (very valuable information) for a very long time for some players and not so long for other players?  Again, we can aggregate the data and then run one or more regressions of one time period on another.  Regardless of what we do, it is still a mess.

Typically, what a lot of researchers do, including myself, is organize and aggregate the data so that we run a regression of one time period on another for several time periods while restricting the underlying samples to some minimum number.  For example, we’ll take 10 years of BA data and run a regression of 2000 on 2001, 2002 on 2003, 2004 on 2005, etc. for all players with a minimum of 300 AB per all of those seasons.

But, again, we are eliminating useful data (on the other hand, we are limiting our population to full-time or almost full-time players - if we include all players then our population might have a different spread of talent).  Why not use all players with all number of AB per season or even the result of every AB for every player over any time period?  Why do year to year regressions and resulting correlations?  How would we do that?

Enter ICC or intra-class correlation, which is a favorite of Russell Carleton, formerly known as Pizza Cutter, who is a PhD psychologist and psychometrician, and well-versed in statistical techniques (much more so than I).  ICC is apparently able to take all that data, no matter how it is grouped, and come up with an “r”.  I am not surprised that it can do that, but I have one gigantic question which I have asked Russell on at least one occasion, probably more, and I have yet to receive or understand an adequate answer.

Since I just spent about 1000 words explaining why “r” depends on the underlying sample size of what you are measuring (PA, AB, BIP, number of times you time a runner, etc.), how can an ICC give you an “r” for a group of players with all different underlying sample sizes and what does that “r” mean if those underlying sample sizes are not reported along with the “r”?

And if I give Russell (or anyone else who knows how to do ICC - I don’t) a bunch of players’ BA and each player in each time period has around 100 AB, will that result in the same ICC as if I gave him a bunch of players with 1000 AB in each time period?  If the answer to that is “yes,” then what does that “r” mean, since I told you that doing a traditional regression will ALWAYS result in an “r” of between 0 and 1, depending on the size of the underlying samples?

If the answer is “no” then don’t you have to give some kind of indication of the average underlying sample size or the distribution of sample sizes, when you present your ICC?  I have never seen Russell do that.  O.K., maybe not never, but not always.
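
One way to explore the 100 AB versus 1000 AB question is to simulate it with the plain textbook one-way ICC.  To be clear about what this is: it is NOT the AR(1) variant that Russell uses, it is the simple balanced one-way random-effects ICC built from between-player and within-player mean squares, and the talent distribution (mean .270, SD .025, constant over four periods) is assumed purely for illustration.

```python
import math
import random


def one_way_icc(groups):
    """One-way random-effects ICC for balanced data.

    groups is a list of per-player lists, each holding k period-level values.
    """
    n_groups = len(groups)
    k = len(groups[0])
    grand_mean = sum(sum(g) for g in groups) / (n_groups * k)
    group_means = [sum(g) / k for g in groups]
    ms_between = k * sum((m - grand_mean) ** 2
                         for m in group_means) / (n_groups - 1)
    ms_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means)
                    for x in g) / (n_groups * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def simulate_icc(ab_per_period, n_players=3000, n_periods=4,
                 mean_ba=0.270, talent_sd=0.025, seed=11):
    """ICC of period-level BA when every period has ab_per_period AB.

    Assumptions: each player's true BA is fixed across periods, and luck
    is approximated as normal with SD sqrt(p * (1 - p) / ab_per_period).
    """
    rng = random.Random(seed)
    groups = []
    for _ in range(n_players):
        p = rng.gauss(mean_ba, talent_sd)
        luck_sd = math.sqrt(p * (1.0 - p) / ab_per_period)
        groups.append([p + rng.gauss(0.0, luck_sd) for _ in range(n_periods)])
    return one_way_icc(groups)


if __name__ == "__main__":
    for n_ab in (100, 1000):
        print(n_ab, "AB per period: ICC =", round(simulate_icc(n_ab), 3))
```

For this simple version, at least, the answer looks like “no”: the same population of hitters should give a much lower ICC at 100 AB per period than at 1000 AB per period.  Whether the AR(1) version behaves the same way is the question for Russell.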

For example, Russell recently wrote an article (an excellent one, BTW) on BP that looked at how often MLB managers ran their base stealers as a function of base stealing opportunities and the ability of the base stealer, and he compared each manager to the league average to see if they were aggressive or conservative.  He wrote this:

To test this, I used one of my favorite techniques, the AR(1) intra-class correlation. It’s somewhat like the year-to-year correlation, but it enables the inclusion of more than just two time points. It can be read, however, like any old correlation. Over the seven years in the study, the ICC was a nifty .538.

Now, here he admits that ICC is like a y-t-y correlation, but that it enables you to do a correlation without splitting the data up into two time points for each player. He doesn’t say so, but it enables you to also use players (or managers in this case) with different underlying samples of performance, with respect to what you are measuring (although you CAN do that with y-t-y correlations, but as I also explained, it is a mess, as players with small underlying sample sizes bring the “r” down and players with large ones bring it up).

Here the underlying sample size is the number of base stealing opportunities for each manager in whatever time periods he is aggregating the data into - I assume one season at a time (you have to have at least two time periods for every element whether you are doing a traditional regression to get your “r” or an ICC).

Anyway, he gives us an “r” of .538.  But, what does that mean? Obviously there is some spread of skill in how conservative or not managers are in terms of sending potential base stealers.  Probably it is pretty large.  But if we do a traditional regression, the size of the resultant “r” will, as usual, depend on whether we are regressing one set of 100 opps per manager on another set of 100 opps (say, one half-year to another half-year) or 1000 (say 5 years to 5 years).  So does the ICC depend on that as well?  If it does, doesn’t Russell have to tell us the size or the range of the underlying sample sizes?  If not, then how do we interpret the .538?

In the comments section of the article, Russell, I, and another fellow named Ben Solow had this dialogue:

Pizza, we may have discussed this before in another venue, but since “r” is always a function of (the underlying) sample size (not the number of pairs in the regression), in your intra-class correlations, how do we/you know the sample size associated with your “r”? For example, if I were working with the same data you are, and I regressed first half on second half, I might get an “r” of .4, if I regressed one whole year on another year, I might get an “r” of .5 or .6, if I regressed 5 years of manager data on another 5 years, I might get .8, etc. In this instance, you mention that the “r” was .538. Without knowing how many games (or steal opportunities or whatever the “unit” is) that represents, I have no idea whether .538 is “consistent” or not.

Ben Solow

.538 doesn’t refer to the r^2 of the logit regression, though. If I’m understanding the grouping decision correctly, that value (the ICC) is calculated as the ratio of the variance across managers to the sum of the variance across managers and the variance of managers over time. .538 means that (variance of managers) = .538*(variance of managers + variance over time), or that the variance between managers is equal to roughly 1.16 times the variance of a randomly selected manager over time, meaning managers are relatively more consistent over time than they are across individuals. I’m not as familiar with ICC as others (Eric and Russell both, for sure), but it seems that if sample size entered the equations for estimated variance it wouldn’t have much of an effect.

BP staff member Russell A. Carleton

Mr. Solow’s response is mostly right. ICC is a measure of consistency across the years. I did toss out most of the interim managers who only had a few games at the helm when I ran that ICC, specifically for sample size reasons. (He had to call for at least 50 SB attempts.)

Think of ICC like year-to-year. If I only had five observations per year, then I’d probably get a lot of random variation and so not a lot of consistency within managers over the years. My choice of inclusion cutoff was somewhat arbitrary, but based more on the realities of what we’re observing. We look at managers based on the season-to-season level, so I evaluated them as such.

MGL

“If I only had five observations per year, then I’d probably get a lot of random variation and so not a lot of consistency within managers over the years.”

Do you mean managers with 5 SB opportunities or 5 managers per year? I am talking about the former, of course, when I am talking about sample size. The number of observations will NOT affect the correlations, only the standard error.

You always say, “Think of an ICC as like a y-t-y correlation.” But, as I originally said, the magnitude of a y-t-y correlation specifically depends on the number of “opportunities” in each year and without knowing that number, it means nothing. If I regress OBP on OBP from one year to the next, and I only include players with 100 or less PA each year, I might get a correlation of .25. If I only include players with PA greater than 400, I might get .60. So just saying, “My y-t-y ‘r’ for OBP was .5” means nothing unless I know the number of PA per year in my sample. (It is also nice to know the number of players or “observations” as that will help me to figure my standard error around the correlation.)

So if I have bunch of players in a bunch of years, and you tell me the ICC for OBP, again, that means nothing to me unless I know the range or distribution of PA in the sample, right?

Maybe I have it wrong. Maybe the ICC is sort of a combination of “r,” as when we do a y-t-y “r” and the underlying sample size. For example, if you have a bunch of players with samples of 400 PA and you do an ICC for OBP and you have a bunch of players with samples of only 100 PA, will you come up with the same ICC?

Ben Solow

The magnitude of a year to year correlation does NOT necessarily depend on the sample size either over time or within a given year. Your estimate of the population correlation may be more accurate, but the value of that estimate is not a function of sample size. There’s some noise in these estimates, which means increasing sample size is always a good thing, but as long as there’s enough sample that the law of large numbers holds, you’re probably pretty safe.

BP staff member Russell A. Carleton

I meant 5 SB opportunities as well. I think we’re on the same page methodologically. You are correct in that the number of PA/BF/opportunities can affect ICC, much in the same way that it would affect yty. However, as Mr. Solow points out, so long as you set your inclusion criteria high enough, it’s not going to make a big difference. In this case, I actually upped the criteria a bit and didn’t get much improvement in ICC. It’s something of an asymptotic relationship.

In this particular case, there are two different questions that one can ask. One is, “How reliable is this stat year to year?” (which I chose to ask, .538) The other is “How many PA/BF/opps does it take before this stat becomes reliable?” I haven’t run that one yet.

Again, I was wholly unsatisfied with their responses.  Either I don’t understand the issue well enough, we are talking past one another, or they are not using or explaining ICC correctly with respect to these baseball situations (where there are different underlying sample sizes which is not usually the case in the real world - for example in the real world you give a bunch of students a test ONE TIME and then you may repeat the test several times over several time periods, but you don’t give one student a test 100 times in one time period and another student 10 times in one time period, and then retest both students in another time period, one of them 30 times and the other 120 times).

Anyway, would anyone else like to chime in on this issue?  Did anyone even get this far?

#1    Rally      2010/01/22 (Fri) @ 10:30

I’m 90% sure that the sample size in terms of games managed matters.  I suspect that if Pizza (or do we have to call him Russell now?  I like Pizza.  mmm...pizza /homer) did the same ICC using month by month data his r would be smaller, though I can’t confirm as I’m not sure how to do an ICC.

The other sample size, number of managers used, I don’t think has a predictable relationship to r, but the more you use the better as you’re more likely to get the true value you’re trying to measure.  Having a small sample of managers could give you a larger or smaller r, depending on the luck of your draw.  For example, player weight correlated to power would give you a decent r, but if you picked a small sample of players and wound up with Joe Morgan, Jimmy Wynn, Ken Oberkfell, and Casey Kotchman you’d see a negative relationship.

Just a nitpick, r can be from -1 to 1, with a negative r showing an inverse relationship.


#2    Tangotiger      2010/01/22 (Fri) @ 10:48

If the answer is “no” then don’t you have to give some kind of indication of the average underlying sample size or the distribution of sample sizes, when you present your ICC?  I have never seen Russell do that.  O.K., maybe not never, but not always.

I’d say that Pizza usually, if not almost always, reports the n in addition to the r.

However, MGL is correct that readers are so focused on the r, that they forget about the n.

Take for example in BPRo’s book where they report an r=.33 or so in the Clutch chapter.  Wow, you might think, finally an r above .10 for a clutch study.  Buuuuuuuuuuuuuuut, that was based on around 2500 PA in each sample (they looked at even-odd years for a career).

Since r = PA / (PA+x) is a good shorthand, then we can see that the “x” value being 5000 would make r=.33

And, if you only have 550 PA, then r=.10.

So, of all the values I’d prefer to see reported, if you are going to report just ONE number, report the “x”.

Or, similarly, answer this question: how many PA (or opps) does this metric need in order to get an r=.50.  That is it.  You do that, and then we are all on the same page.  Because when you report r=.50, you are also reporting on the “x” value.  And once you have the “x” value, you are now able to report on the r for ANY number of PA.  It’s fantastic!

For example, if I say x=50 for K, x=100 for walks, x=150 for HR, x=300 for BABIP, x=700 for reaching on error, then you know EXACTLY how persistent each metric is.  It’s clear, it’s concise, and you can report it in one number. 

(All numbers for illustration purposes only.)
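
The shorthand is simple enough to put into a few helper functions.  This is just a sketch of r = PA / (PA + x) and its two rearrangements. The function names are made up, and the inputs in the demo are the illustrative numbers from this comment, not real estimates.

```python
def r_from_opps(opps, x):
    """Expected r for a given number of opportunities: r = opps / (opps + x)."""
    return opps / (opps + x)


def x_from_r(opps, r):
    """Solve the shorthand for x given an observed r: x = opps * (1 - r) / r."""
    return opps * (1.0 - r) / r


def opps_for_r(target_r, x):
    """Opportunities needed to reach target_r: opps = x * r / (1 - r)."""
    return x * target_r / (1.0 - target_r)


if __name__ == "__main__":
    x = 5000  # illustrative value from the clutch example
    print(round(r_from_opps(2500, x), 2))   # r at 2500 PA
    print(round(r_from_opps(550, x), 2))    # r at 550 PA
    print(opps_for_r(0.5, x))               # PA needed for r = .50
```

Note that opps_for_r(0.5, x) always returns x itself, which is why reporting “x” and reporting “how many PA to get r=.50” are the same thing.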


#3    Tangotiger      2010/01/22 (Fri) @ 10:57

Having read the Pizza/Ben/MGL exchange, I have to say that MGL is 100% right here.

Pizza’s response was very unsatisfactory.  I have to believe he didn’t understand MGL’s point.

If Pizza limited himself to just 81 games or 810 games, he’s going to get a far different r.  That’s a given.  Pizza’s response makes it seem like once he gets an “appropriate” level of games, then it doesn’t matter how many more games.

Ben’s response I think is missing the point to what we are discussing here.  Which is fine, because I used to think like he did.

Getting back to my previous post, if let’s say the average number of games included in Pizza’s sample is 500, and if his r is .538, then the “x” value in this equation:

r = G / (G + x)

is 429

(All numbers for illustration purposes only.)

So, if Pizza used 81 games, his r would come in at r=.16.

If he used 600 games, then r=.583

So, we can see why he would say the extra games don’t change the r much.  He added 100 games, and he gets an r from .54 to .58.

But that’s entirely a function of the “x” value.


#4          2010/01/22 (Fri) @ 11:06

Lost me a little when you got to the ICC stuff… but the first half of this post should be taught to every undergrad in a stats class.  Really important concept, and you guys explain it really well.


#5    bsball      2010/01/22 (Fri) @ 14:33

I think at least part of the confusion is that there are two things going on here.  There is an ICC process, and an AR(1) process. They are mixed together.

The ICC process (that I learned long ago as ANOVA) tries to tell whether there is a difference between two or more groups by comparing the variance within groups (using group mean) to the variance between groups (using overall mean).  One of the assumptions commonly used in these techniques is that all groups have the same variances (using group means), even though they may have different means.  In other words you aren’t looking at a sample variance for manager A compared to manager B, you are looking at the variance for all managers (compared to their own means).

There’s another part of the analysis that is the AR(1) part.  AR(1) is an auto-regressive process.  This is the part that’s similar to year-to-year.  AR(1) says:

X(t) = c + A * X(t-1) + e

The current result is made up of one part constant, one part equal to some portion of the prior result, and one part error (mean = 0). Current steal % = constant steal % + A * prior steal %

The long-term mean for this process would equal c / (1-A).  This type of process would imply that a manager’s decisions on stealing have a long-term average, but they are also influenced somewhat by whether his teams have been stealing recently.

That’s how I read it anyway.
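
A small simulation of the AR(1) equation above shows the long-term mean claim in action.  This is only a sketch: the values c = .04 and A = .5, the normal error term, and its SD of .01 are all assumed for illustration and have nothing to do with the actual manager data.

```python
import random


def simulate_ar1(c, a, n_steps, noise_sd=0.01, seed=3):
    """Generate X(t) = c + a * X(t-1) + e with normal errors (requires |a| < 1)."""
    rng = random.Random(seed)
    x = c / (1.0 - a)  # start at the theoretical long-term mean
    path = []
    for _ in range(n_steps):
        x = c + a * x + rng.gauss(0.0, noise_sd)
        path.append(x)
    return path


def lag1_autocorr(path):
    """Correlation between each value and the one right before it."""
    mean = sum(path) / len(path)
    num = sum((path[i] - mean) * (path[i - 1] - mean)
              for i in range(1, len(path)))
    den = sum((v - mean) ** 2 for v in path)
    return num / den


if __name__ == "__main__":
    path = simulate_ar1(c=0.04, a=0.5, n_steps=20000)
    print("simulated mean:  ", round(sum(path) / len(path), 4))
    print("c / (1 - A):     ", 0.04 / (1.0 - 0.5))
    print("lag-1 autocorr:  ", round(lag1_autocorr(path), 3))
```

The simulated mean should settle near c / (1 - A), and the lag-1 autocorrelation should land near A, which is the piece that resembles a year-to-year correlation.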


#6    Pizza Cutter      2010/01/22 (Fri) @ 16:03

I’ll probably have to cut this short, due to my schedule today, but I will add as I’m able.  And feel free to call me Pizza.  It’s still a little jarring when people leave comments and call me Russell.  The issues that you bring up are ones that I quietly struggle with as I’m conducting my research and writing up my articles.

A little bit of terminology to clear things up.  Sample size refers to how many players (or usually, player-seasons) you are looking at.  Sampling _frame_ is how many PA/BF/opps you have for each player in your sample.  Usually, in this type of research, everyone’s sampling frame is a different number (Larry had 345 PA, Curly had 322 PA, Moe had 498).

Also, I want to distinguish two different types of variance: between subjects and within subjects.  Between subjects variance is “Larry is faster than Moe.” His “true” time is less than Moe’s.  Within subjects variance is “I ran a personal best today” meaning that I ran faster than I usually do, even if I was still in last place compared to the rest of the field.

ICC (and yty correlation) is interested in within-subjects variance.  And you are quite right that sampling frame (the number of observations per person present in determining the measure) is key to how they will turn out.  ICC’s advantage over yty is that you can take more than two years into account.  But it does suffer from the same sampling frame issues that you would expect from yty.

To show how sampling frame can have a big effect, let’s take the extreme case of looking at batter OBP on a nightly basis (a sampling frame of 4-5 PA) for consistency.  If the datapoints for Larry Larfelschnarger were .200, .500, .000, and .400 (and we had similar data points for all other MLB players), we’d get a low ICC, because those numbers are jumping around all over the place.

Now, more observations = better estimate of true talent, but this is subject to the laws of diminishing marginal returns.  Going from 5 to 15 PA will make a measure a lot better than going from 450 to 460.  Eventually, we’ll make it to 1.00 on ICC, but that might take thousands of observations, and often we don’t have that luxury.

When I write, I often try to take advantage of this by setting some reasonable minimal inclusion criteria in the knowledge that bumping up a little bit more won’t make much of a “clinical” difference.  There’s also the real issue of the tradeoff between sample size and sampling frame.  Relatively fewer players get 600 PA than 300 PA (and those who get 600 are a select bunch).

I don’t normally report the N more for stylistic reasons.  That level of gory details usually bores people.  However, when people ask, if I have it handy, I’ll give it to them.

More later.


#7    Tangotiger      (see all posts) 2010/01/22 (Fri) @ 16:22

When I write, I often try to take advantage of this by setting some reasonable minimal inclusion criteria in the knowledge that bumping up a little bit more won’t make much of a “clinical” difference.

I follow you all the way.  But, here’s the problem, in the bold.  Basically, the reasonable minimum is whatever it is that lets you get to r=.50 to r=.75, right?

If your r is under .25, you won’t bother writing an article.  And if r is over .90, then that’s overkill.

So, the ONLY suggestion being made here is that the reader MUST know what the sample_frame is, with respect to r.  Basically, no matter what you do, you will ALWAYS be able to report an r=.538… you simply have to look for a “reasonable” sampling frame.

The story is not the r, but the sampling frame itself!

I will reiterate, as I always do, that the more interesting thing is to report the sampling frame that lets you get r=.50.
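One way to sketch "the sampling frame that lets you get r=.50", again assuming the r = n/(n+x) form attributed to Tango later in the thread. Solving for n gives n = x * r / (1 - r), so at r = .50 the answer is simply x. The value of `X_ASSUMED` below is hypothetical.

```python
def trials_needed(target_r, x):
    """Invert r = n / (n + x): the n at which reliability reaches target_r."""
    if not 0 < target_r < 1:
        raise ValueError("target_r must be strictly between 0 and 1")
    return x * target_r / (1 - target_r)

X_ASSUMED = 200  # hypothetical constant for some rate stat

for r in (0.25, 0.50, 0.75, 0.90):
    print(f"r = {r:.2f} needs about {trials_needed(r, X_ASSUMED):.0f} trials")
```

Read this way, reporting x (the frame at which r = .50) describes the stat itself, while a bare r describes the stat and the frame mixed together.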


#8    Pizza Cutter      (see all posts) 2010/01/22 (Fri) @ 16:47

Re-reading the original post now that I don’t have a crabby infant on my lap, the question seems to be either “Isn’t sampling frame important?” and my answer is yes, for several of the reasons which you pointed out.  I think you (MGL) have got the math underpinnings down right.  I would argue that with a little responsible statistical practice, those issues can be minimized.  That’s one of those case-by-case issues that go with the debate.

The other possibility is “Why not report the sampling frame size (or give a range/distribution/etc.)?” My answer is that it just doesn’t make for interesting reading.  It does fall a bit into the black box/"trust me, I’m an expert” trap, and I’ll own that.  The reason that I do it is that if people want the technical details, my e-mail is at the bottom of the article for the handful of people who are interested.

The sampling frame that I usually use is the player-season (plus a decent filter for minimum PA/BF), because that’s how people usually think about baseball, plus it answers the question “OK, so in 2009, his WXYZ was this… how good of a predictor is that for his WXYZ in 2010?” which is the question on most people’s minds.  If the ICC is low (or high), it’s the ICC that answers the most relevant question.  In this case, .538 was the ICC at the manager-season level.  If I had stretched things out to take two-year or three-year samples, it probably would have gone up some.  It’s an interesting question, just not the one that I pursued.

I have done a lot of the nitty gritty research extending beyond one year to see at what level of PA that (insert stat here) reaches an appropriate level of reliability.  It’s cool stuff for me as a nerd, but it’s probably really boring to people looking for a casual read at work.  That’s a philosophy that everyone might not share, but it’s mine.


#9    Pizza Cutter      (see all posts) 2010/01/22 (Fri) @ 16:52

bsball/5: there are a number of techniques that go by the name intra-class correlation.  The one that I specifically use is AR(1) rho, which is different from the ANOVA one that you learned.  (In fairness, both are called ICC’s in textbooks, which makes things really confusing.)

A lot of times AR(1) matrices are used with time series data with much much much shorter periods of re-sampling than one year.  In this case, I’m using it as a measure of year to year variance/similarity within a manager.
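For readers who only know the ANOVA flavor, here is a minimal sketch of the one-way random-effects ICC for balanced data (every subject has the same number of observations). This is the textbook ANOVA version, not the AR(1) rho described above, and the two need not agree on the same data. The first row uses the nightly OBP figures from earlier in the thread; the other two rows are made-up players added only so there is something to compare against.

```python
def icc_oneway(rows):
    """One-way ANOVA ICC for balanced data; rows[i] holds k observations for subject i."""
    n = len(rows)
    k = len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    means = [sum(r) / k for r in rows]
    # Mean square between subjects and mean square within subjects.
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum((v - m) ** 2 for r, m in zip(rows, means) for v in r) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

nightly = [
    [0.200, 0.500, 0.000, 0.400],  # the nightly OBP example from the thread
    [0.500, 0.000, 0.250, 0.400],  # hypothetical second player
    [0.000, 0.400, 0.500, 0.250],  # hypothetical third player
]
steady = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]  # no within-subject noise at all

print(f"noisy nightly frame: {icc_oneway(nightly):.3f}")
print(f"perfectly steady:    {icc_oneway(steady):.3f}")
```

When the within-subject noise swamps the between-subject spread, this estimator can even come out negative, which is just a loud way of saying "no reliability at this frame."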


#10    Pizza Cutter      (see all posts) 2010/01/22 (Fri) @ 17:10

Tom/7.  Actually, if I found an ICC over .9, I’d write a book on whatever topic it was!

“Reasonable” in this context is usually just a matter of screening out cup-of-coffee callups so that guys with really really small sampling frames don’t work their way into the overall sample.  Sometimes a little more stringent, as things are called for.

Much depends on the question under investigation.  If I want to find out “Where does this stat cross the magic boundary that we’ve set for reliability?” then I’d use different methods.  If I needed a purer sample for some reason, I’d take specific care.  But if it’s more of a conceptual piece, the real star is whatever new methodology or stat I’ve dreamed up, and I use ICC as something of a throw-in to let people know that there really is something worth looking at here (or nothing worth looking at, depending on the number.) And the default frame for people to think in is the seasonal level.


#11    bsball      (see all posts) 2010/01/22 (Fri) @ 19:33

Pizza,

Thanks for the clarification. I’m having trouble finding a website that describes an AR(1) ICC method. 

Does this AR(1) rho method you use estimate rho in the following equation?

x(t) = rho * x(t-1) + e

Or is it something else?
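To make the equation concrete, here is a minimal sketch that simulates a series from x(t) = rho * x(t-1) + e and recovers rho by least squares. It only illustrates the equation as written; whether a mixed-model AR(1) covariance structure estimates rho this way is not something this sketch claims. The function names and the rho = 0.6 setting are illustrative.

```python
import random

def simulate_ar1(rho, n, seed=0):
    """Generate n points of x(t) = rho * x(t-1) + e with standard normal e."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1)]
    for _ in range(n - 1):
        x.append(rho * x[-1] + rng.gauss(0, 1))
    return x

def estimate_rho(x):
    """Least-squares slope (no intercept) of x(t) on x(t-1)."""
    num = sum(a * b for a, b in zip(x[1:], x[:-1]))
    den = sum(b * b for b in x[:-1])
    return num / den

series = simulate_ar1(rho=0.6, n=5000)
print(f"estimated rho: {estimate_rho(series):.3f}")
```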


#12    MGL      (see all posts) 2010/01/22 (Fri) @ 22:32

Got it Pizza, and everyone else.  Thanks for the clarification.  I had originally thought that an ICC was essentially equivalent to a y-t-y (or whatever time frame the sampling frame represents) but that it has the added benefit of allowing you to use different numbers of time frames for each element in the sample, and not just two.  Apparently that is correct. 

However, after reading someone like Pizza giving an “r” with no reference to a time frame or a “sampling frame” I was beginning to wonder if maybe that wasn’t correct.

For the record, I would NEVER think of presenting an “r” whether it be generated from a traditional regression or an ICC, without giving the sampling frame as well, for obvious reasons.  Never, ever, ever.

And thank you for telling me what the real term is for what I had been calling the “underlying sample size” to distinguish it from “N” - the number of pairs in the regression.  One more question:  Why are you guys putting an underscore in between the words “sampling” and “frame?” Are we writing computer code here?


#13    Tangotiger      (see all posts) 2010/01/22 (Fri) @ 22:58

Well, I am a programmer, and I’m used to doing that.

Yes, I like the term “sampling frame”, so thanks to Pizza for saying that.


#14    Mike      (see all posts) 2010/01/22 (Fri) @ 23:43

So, it seems that the next logical question is this: how can we run these ICC regressions ourselves?  What program should we use?  Can anybody walk us through an example?

Pizza, this may take more effort than you’re willing to put in, but perhaps you’d be willing to point us to another website that explains the nuts and bolts? 

To me, this would be tremendously helpful.


#15    Pizza Cutter      (see all posts) 2010/01/23 (Sat) @ 01:16

I use SPSS (now PASW… IBM bought the company that makes SPSS and changed the name), although that’s a program that few outside of social sciences use. 

You can do it in SAS (I know I did at one point, although I would have to look up the code.)

The process that I use is hierarchical linear modeling (also called mixed linear modeling).  To really break it down would take a full course on the subject, which is how I learned it to begin with.  There’s a book by Raudenbush and Bryk ("Hierarchical Linear Models"… catchy title, eh?) that’s comprehensive, although barely readable.

Here’s a good website on the topic:
http://faculty.chass.ncsu.edu/garson/PA765/multilevel.htm

I think I might even take a look back through that page.


#16    Colin Wyers      (see all posts) 2010/01/23 (Sat) @ 03:33

For those without the money for a nice, expensive stats package, GNU R is generally pretty good. I don’t understand what exactly Pizza is doing here, but there is an ICC function for R, for instance:

http://www.personality-project.org/r/html/ICC.html

There’s also several packages that do HLM as well, I’m sure, although I haven’t done anything with them.


#17          (see all posts) 2010/01/23 (Sat) @ 15:12

[This is the first paragraph of much more. I am not getting Preview or the option to post.]

Let me suggest that Russell/Pizza is still halfway between Russell talking to guys who want a quick read at work and Pizza talking to you guys. --or maybe worse it’s a three-corner stool with Professor Carleton teaching statistical computing for psychology students.


#18    MGL      (see all posts) 2010/01/23 (Sat) @ 15:43

"[This is the first paragraph of much more. I am not getting Preview or the option to post.]”

I have had the same problem. Sometimes a long cut and paste does not work.  Some kind of anti-spam feature.  Especially if you have HTML tags in the text.  Try getting rid of any HTML tags first. If that does not work, limit the length of the post.


#19    Steve Sommer      (see all posts) 2010/01/23 (Sat) @ 15:45

Just wanted to chime in and say I’ve found the discussion so far very educational.  It’s something I’ve only peripherally thought of before.


#20          (see all posts) 2010/01/23 (Sat) @ 16:02

#12
“”
And thank you for telling me what the real term is for what I had been calling the “underlying sample size” to distinguish it from “N” - the number of pairs in the regression.  One more question:  Why are you guys putting an underscore in between the words “sampling” and “frame?” Are we writing computer code here?
“”

The answer may be yes. A “sampling frame” isn’t quantitative. It’s more abstract. You may be looking for one quantitative property of a sampling frame, or of one specialized class of sampling frames, as “sample size” is the quantitative property of a sample. Or you may be looking for a quantitative property of the data that are commonly collected(?) using a sampling frame.

A computer program, and by habit its regular users, may use “sample_frame” with underscore to specify properties (parameters) of a sampling frame and/or the data associated with it.

Some computer programs, including GNU R for statistical computing and many neighboring tasks, use equations such as
: iso2008= sample_frame(30, ...)
to specify some previously defined arrangement of data (observations) with 30 teams at the highest level, ...

Even if this is true, it’s also true that Russell has not yet more than alluded in a shorthand way to the quantitative feature of the datasets (of glorified proportions?) that concerns you guys. It’s a shorthand that will come back and bite you if you use it inside a few of the wrong earshots.

P.S.
Statistical entries at Wikipedia are generally very good. “Sample frame” is not one of the very good ones but it’s good enough to support the main point.


#21          (see all posts) 2010/01/23 (Sat) @ 16:05

Mitchel’s “underlying sample size” isn’t bad. The first and third words are ok. The question is, when you get down to the lowest level of the structure what is the extent or scope (or “size”?) of aggregation. Suppose the lowest level is the year-team-player and down at that level you have OPS+ data that is derived from --well, OPS+ can’t be derived down at that level and maybe we’ll get back to that.

When you do have quotients “observed” at the lowest level, such as slugging average or pitcher winning percentage or BB/K or K/BB, then you are concerned with the statistics of ratios (ratio estimation, ratio estimator, etc). If the quotients are shares such as pitcher winning percentage, where the denominator and numerator count a set and one of its subsets, then you are concerned with the statistics of so-called proportions. (To me ordinary proportions include other ratios but some others do think shares are integer counts such as your 10000 shares of General Motors.)

When you are interested in true shares/proportions but you do have the more basic count data “observed” at the lowest level, it is simple and commonplace to use uppercase N for the units of observation or experiment 1...N and the vector lowercase n = n1...nN for the sizes of those sets with interesting subsets, ie for the denominators in calculation.


#22          (see all posts) 2010/01/23 (Sat) @ 16:41

It may be useful to call lowercase n the numbers of “trials” a term commonly used regarding Binomial and Multinomial probabilities or models. The trials are the more fundamental somethings whose different outcomes you have summarized by counting over many trials.

Example
27 “trials” (pitcher decisions, Pedro Martinez)
23 “successes” or count of one trial outcome
4 “failures” or count of another trial outcome, the only other outcome in binomial or 2-outcome case

23-4 W-L
23/27 Winning percentage

Because batter Plate Appearances PA and pitcher Batters Faced BF practically exhaust the baseball events from one perspective, they are quite similar to multinomial trials in probability and statistics. (Call them pitcher-batter “faces” because “faceoffs” would borrow too much from hockey?)

Batter At Bats and pitcher Decisions fit multi-outcome trials (multinomials) with a little more stretching.

For ratios of two outcome counts such as BB/K, it’s beyond the crudest metaphor to call the strikeouts trials.

For OPS+, a very complicated derivation from the counts of kinds of outcomes of kinds of trials, let’s not go there yet.


#23    Colin Wyers      (see all posts) 2010/01/23 (Sat) @ 20:02

If you want to make BB/K behave somewhat like a binomial, you can do:

BB/(BB+K)

If you just sit there and tell yourself that the denominator is “non-contact outcomes” you don’t feel quite as guilty.
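A minimal sketch of that transformation, treating each walk or strikeout as one "non-contact" trial so the usual binomial standard error applies. The season line below is hypothetical, and the function name is just for illustration.

```python
import math

def non_contact_walk_share(bb, k):
    """Share of non-contact outcomes (BB + K) that were walks, with a binomial standard error."""
    n = bb + k                        # trials: every BB or K
    p = bb / n                        # "successes": walks
    se = math.sqrt(p * (1 - p) / n)   # binomial standard error of the share
    return p, se

p, se = non_contact_walk_share(bb=60, k=120)  # hypothetical season line
print(f"BB/(BB+K) = {p:.3f} +/- {se:.3f}")
print(f"implied BB/K = {p / (1 - p):.3f}")    # the original ratio is recoverable as p / (1 - p)
```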


#24    Matty D      (see all posts) 2010/01/26 (Tue) @ 23:44

Well, a lot of this is over my head, but I can speak to something way up toward the top of MGL’s post- before the ICC talk.

He was saying that the spread in talent of MLB pitchers on BIP is small, but wasn’t sure what it was small in relation to, though he alluded to the answer later in the post. It’s in relation to the random variance in results for a single player (with constant talent)-- the within variance Pizza mentions in the comments. The higher the within variance or the lower the variance in skill between players, the less weight you would give to a player’s results for a given number of trials.

If player skills are normally distributed, the x in Tango’s r = n/(n+x) equation will equal the ratio of the within variance to the variance in skill.
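That relationship can be checked with a small simulation. The sketch below gives each simulated player a fixed, normally distributed talent, draws two independent samples of n binomial trials, and compares the observed correlation with n/(n+x), where x is the per-trial within variance p(1-p) divided by the variance in talent. All of the numbers (mean .330, talent SD .030, 300 trials, 2000 players) are illustrative assumptions, not estimates for any real stat.

```python
import random

def pearson(a, b):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

def simulate_split_r(n_trials, mean_p, sd_talent, n_players=2000, seed=0):
    """Correlate two independent n_trials samples per player, talent held constant."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_players):
        p = min(max(rng.gauss(mean_p, sd_talent), 0.0), 1.0)
        first.append(sum(rng.random() < p for _ in range(n_trials)) / n_trials)
        second.append(sum(rng.random() < p for _ in range(n_trials)) / n_trials)
    return pearson(first, second)

MEAN_P, SD_TALENT, N_TRIALS = 0.33, 0.03, 300     # hypothetical settings
x = MEAN_P * (1 - MEAN_P) / SD_TALENT ** 2        # within variance / variance in skill
predicted = N_TRIALS / (N_TRIALS + x)
observed = simulate_split_r(N_TRIALS, MEAN_P, SD_TALENT)
print(f"x = {x:.0f} trials, predicted r = {predicted:.3f}, simulated r = {observed:.3f}")
```

Shrinking `SD_TALENT` (a tighter spread of skill) or raising the within variance pushes x up, which is the sense in which the spread of talent is "small": small relative to the noise in one player's results.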


#25    Vic Ferrari      (see all posts) 2010/01/27 (Wed) @ 20:27

mgl:

Usually I find wiki to be a bit confusing on statistical matters.  It seems to be aimed at people who do it for a living, and I don’t.

In this case, however, it’s fairly explicit.
http://en.wikipedia.org/wiki/Intraclass_correlation

I just can’t see the value in breaking each pair of seasons into a class.  I mean the year to year method has its warts, especially the problem with the pairs with lower underlying sample sizes driving the r more than we’d like.  But most people reading (certainly me) have a sense of what a year to year correlation should be, depending on how many plate appearances there are on average, and where the lower cutoff is ... it gives us an arrow, a sense of the size of it and the direction it points.

ICC is going to lump the pairs together, so that will eliminate some of the frame size issue I’m sure.  But what’s the point if few people have a frame of reference for the “r” value that it churns out?

BTW, the link to random effects models near the bottom of that wiki page is well worth following.  Enlightening stuff.  I read a paper the other day that used this methodology ... reading that though, I can’t see a practical reason for its use with the baseball issue in question.  Perhaps a familiarity with the software came into play, or perhaps it was an intentional effort to obscure the fundamental thinking, I dunno.

Having checked the STATA software input variables, the operator has a lot of leeway.  Too much for my sensibilities, unless the operator had previously demonstrated an intimate knowledge of the underlying math as well as the subject matter.

Wiki puts it well:
A number of different ICC statistics have been proposed, not all of which estimate the same population parameter. There has been considerable debate about which ICC statistics are appropriate for a given use, since they may produce markedly different results for the same data

