THE BOOK--Playing The Percentages In Baseball


Friday, November 12, 2010

r v r-squared

By Tangotiger, 04:00 PM

Let me explain something with a simple example.  Open up Excel, and put =rand() in cells A1 through A1000, and in B501 through B1000.  In cell B1, put =A1.  Copy that to B500 (where it will show =A500).  So, for half the records, the data in columns A and B are identical, and in the other half, there is no relationship whatsoever.

Suppose you were asked to explain the data in English, without resorting to correlation or variances.  Well, you would say that exactly 50% of the data points in column A are perfectly explained by those in column B, while the other 50% of the data points are completely random.  Now, how do you think you would represent that as a number?  Well, something like: myNumericTranslationOfWhatISee = .50.  Or, more simply, r=.50.

If you run a regression using the Data Analysis package, what will be the result?  r=.50 and r-squared=.25.
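For anyone without Excel handy, the same setup can be sketched in a few lines of Python. This is only an illustration: the seed, the larger row count (to cut sampling noise) and the helper name pearson_r are assumptions, not part of the spreadsheet described above.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(2010)  # arbitrary seed so the run is repeatable
n = 100_000        # more rows than the 1000 in the post, to reduce noise

col_a = [random.random() for _ in range(n)]
# First half of column B copies column A; second half is fresh random numbers.
col_b = col_a[: n // 2] + [random.random() for _ in range(n // 2)]

r = pearson_r(col_a, col_b)
print(f"r = {r:.3f}, r-squared = {r * r:.3f}")
```

With this many rows, r should land close to .50 and r-squared close to .25.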

Now, we just figured out what r is.  What does r-squared represent? r-squared, or the coefficient of determination:

The squared correlation coefficient (r2) is the proportion of variance in Y that can be accounted for by knowing X. Conversely, it is the proportion of variance in X that can be accounted for by knowing Y. The squared correlation coefficient is also known as the coefficient of determination. It is one of the best means for evaluating the strength of a relationship. For example, we know that the correlation between height and weight is approximately r=.70 If we square this number to find the coefficient of determination - r-squared=.49 Thus, 49 percent of one’s weight is directly accounted for one’s height and vice versa.

Variance is the square of standard deviation.  Standard deviation is something that we can understand.  Variance is the square of that.  r-squared is based on the variance of the data points.  Using variance sounds like bullsh!t to me. 

Specifically, how is 49% of the weight directly accounted for?  It’s not.  49% of the variance is accounted for.  However, who cares about the variance?

In my simple example, what numerical value explains the data, 0.50 or 0.25?  I don’t see any need to have to be shown r-squared.

***

I have similar misgivings about taking the log of salary, best-fitting against the log, and then talking about the result as if the correlation were against salary itself.  Well, it’s not.  It’s against the log of salary and we don’t care about that! Why do researchers use the log of salary?  Because they don’t want to best-fit against a wildly curved (exponential) line (which is what they should do) and instead want to best-fit against a more controlled line, because they are afraid of how much the extreme (unlogged) data point would affect the shape of the best-fit line.

Fine.  But what the researchers are doing is best-fitting to the log of the salary.  They are knowingly accepting that they may miss big on the unlogged salaries at the extreme end.

***

These two things I’ve always thought of as bullsh!t.  Tell me I’m wrong.


#1          (see all posts) 2010/11/12 (Fri) @ 17:36

Not sure exactly what you are getting at with the log of salary thing.  Let me give an example and you can let me know if I am on the right track.  Suppose variable y is exponentially related to variable x:  y=a*exp(b*x).  I measure y as a function of x, then do a non-linear least-squares fit to find the best values of a and b.  I do a weighted fit.  Suppose for the sake of argument that all the points are equally weighted (i.e., sig_y=C, a constant).  Then I don’t even need to do a weighted fit.

Suppose now that I fit ln(y) to the function A+b*x
(where A=ln(a)).  I claim I will get exactly the same answer as before provided I appropriately adjust the weighting factors.  In particular,
sig_ln(y)=C/y.  More generally, sig_ln(y)=sig_y/y.
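
The weighting rule comes from first-order error propagation. A sketch of that step, assuming the uncertainties are small compared with y (so the two fits agree to first order rather than identically):

```latex
\sigma_{\ln y} \approx \left|\frac{d(\ln y)}{dy}\right| \sigma_y = \frac{\sigma_y}{y},
\qquad
\chi^2 = \sum_i \frac{\bigl(\ln y_i - A - b\,x_i\bigr)^2}{\sigma_{\ln y,i}^2}
       = \sum_i \frac{y_i^2 \bigl(\ln y_i - A - b\,x_i\bigr)^2}{\sigma_{y,i}^2}
```

So each point in the log fit is weighted by y^2/sig_y^2, which keeps the large-y points from being under-weighted after the transformation.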


#2          (see all posts) 2010/11/12 (Fri) @ 17:38

Well, I won’t tell you that you are wrong. Sometimes it’s convenient to run a regression on a log of salary only because it makes it so that you can interpret the coefficients as something like percentage change. I think using logged dependent variables for such expository purposes is inoffensive, but otherwise is often bullsh!t.


#3    Matt Swartz      (see all posts) 2010/11/12 (Fri) @ 17:47

Okay: you are wrong.

The log of salary is taken because of the theory of the human capital production function which is that output (Y) is a function of capital (K) and labor (L) of the functional form:

Y = A * (K^a) * (L^b)

Where A describes the state of technology, and a and b are scalar exponents.  This production function explains the production of the economy very well.

Under the assumption that workers are paid their marginal product of labor, the derivative of product with respect to labor ( = b*A*(K^a)*(L^(b-1)) ), we find the approximate wage of a unit of labor.

Researchers have found that the best expression of human capital is to consider higher paid workers to have more units of labor that they are selling per unit of time, and so L is viewed as a product of human capital units (Z) and hours worked (H). 

Thus, we get wages by

MPL = wage = b*A*(K^a)*(Z^(b-1))*(H^(b-1)).

To find the amount of human capital, Z, you take the log of wages and get the following formula:

log(wage) = log(b*A*(K^a)) + (b-1) * log(Z) + (b-1) * log(H)

This allows you to determine Z as a function of education, experience, etc., by running a regression on log wages.

The functional form also neatly explains why wages are log-normally distributed.

*Note: I might have screwed up some of the math because I’m re-deriving this all from memory from like 5+ years ago.  I’m not going to be able to derive it further than this, but the point I’m making is that the log-earnings regression is done because of the human capital production function, on which there has been a ton of literature showing why this approximates things best.  It’s not just to avoid explaining the tails or whatever.


#4    Tangotiger      (see all posts) 2010/11/12 (Fri) @ 18:01

I didn’t mean to introduce economic theory here.

Let’s say you have MLB player data: talent and salary.  And let’s say that you have one data point where the player has a bit more talent, but his salary is 100x what it should be (he’s paid $2,500 million).

If you best fit that using the log, you’ll probably have a similar equation whether you use the normal data, or the one where you changed the data point to 100x his should-be salary.  And that’s because you best fit the talent against the log of salary.  It’s not a best fit of talent to salary.

Now, if instead you best-fitted to salary, I am suggesting you will get a much different equation, because how else are you going to minimize the difference other than to make sure the line is going to come pretty close to that last data point (the 100x one)?
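
A rough way to see the size of the effect is a small Python sketch. Everything in it is made up for illustration: the pay scale should_be_salary, the talent grid, and the choice of straight-line least-squares fits (the argument above allows any curve) are all assumptions.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def should_be_salary(talent):
    """Hypothetical pay scale in $ millions, exponential in talent."""
    return 0.5 * math.exp(0.39 * talent)

talent = [0.5 * k for k in range(21)] + [10.2]  # last player has a bit more talent
clean = [should_be_salary(t) for t in talent]
outlier = clean[:-1] + [100 * clean[-1]]         # same data, last player paid 100x

raw_clean, _ = fit_line(talent, clean)
raw_outlier, _ = fit_line(talent, outlier)
log_clean, _ = fit_line(talent, [math.log(s) for s in clean])
log_outlier, _ = fit_line(talent, [math.log(s) for s in outlier])

raw_ratio = raw_outlier / raw_clean  # how much the salary-fit slope moves
log_ratio = log_outlier / log_clean  # how much the log-salary-fit slope moves
print(f"slope change, fit to salary:      x{raw_ratio:.1f}")
print(f"slope change, fit to log(salary): x{log_ratio:.2f}")
```

In this toy data set the slope of the fit to salary changes by more than an order of magnitude, while the slope of the fit to log(salary) moves by well under half.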


#5    Matt Swartz      (see all posts) 2010/11/12 (Fri) @ 18:08

Well, for one thing, if you are trying to do an OLS regression, you are supposed to assume that the errors are normally distributed, so presuming “best-fit line” = “OLS regression estimated equation”, then you are making a bad assumption.  The error is clearly not normally distributed. 

I’m not sure what the point you’re making is now.  If you want to confine yourself to a straight line, but want to minimize the square error of the estimate versus the actual, of course having a big outlier is going to affect your estimate.  When log-salary regression is done, it’s typically to estimate the effect of education, experience, or whatever other variable on human capital, which typically has a percentage rather than an absolute increase.

What is the goal that is failing here?  Maybe that will focus the issue.  I just don’t think you are making your r vs. r^2 point by talking about log-salary regression, because log-salary is based on something else entirely.


#6    Colin Wyers      (see all posts) 2010/11/12 (Fri) @ 18:15

Right, OLS comes with certain sets of assumptions about the data. In this case, anything that is distributed log-normally needs to be translated into the normal distribution for OLS to “work.”


#7          (see all posts) 2010/11/12 (Fri) @ 18:20

I’ll begin with the caveat that perhaps I’m out of my depth here, and if someone knows statistics better than me, they are welcome to correct me.

However, I think, Tom, that your R vs. R-squared example breaks down as soon as you have more than one source of random variation.

If you have two things that each explain 50% of the deviation, you have not explained 100% of the total deviation.  You’ve explained ~70% of the total deviation.

Using the variance will make that obvious because random errors add quadratically.  You can simply add up the individual variances and take the square root of the sum.  You cannot simply add up the correlations from multiple things and get anything with a real meaning.
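
A small simulation makes the arithmetic concrete. This is a sketch under stated assumptions: two independent, equally weighted sources x1 and x2 plus independent noise, scaled so that each source alone correlates at about .5 with y. The variable names and the seed are illustrative.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(7)
n = 100_000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 2 ** 0.5) for _ in range(n)]  # variance 2
y = [a + b + e for a, b, e in zip(x1, x2, noise)]      # total variance 4

r1 = pearson_r(x1, y)  # each source alone: about .5
r2 = pearson_r(x2, y)
# With equal weights and independent sources, x1 + x2 is the best linear predictor.
r_both = pearson_r([a + b for a, b in zip(x1, x2)], y)

print(f"r1 = {r1:.3f}, r2 = {r2:.3f}, r1 + r2 = {r1 + r2:.3f}")
print(f"combined r = {r_both:.3f}, sqrt(r1^2 + r2^2) = {(r1**2 + r2**2) ** 0.5:.3f}")
```

The two r values sum to about 1.0, but the combined correlation comes out near .71, which is the square root of the summed r-squareds.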

I believe, though, that you are correct about your complaint against the wording of the “49 percent of one’s weight is directly accounted for one’s height” statement.


#8    anon      (see all posts) 2010/11/12 (Fri) @ 20:33

I think I’m rehashing a lot of what appears above, but maybe from another perspective. 

One of the main reasons to work on the log scale is that it makes the data appropriate for ordinary least squares.  If you work on the exponential scale, you have to do extra work to ensure that data at the “high” end of the curve, which typically is more variable on an absolute scale, is weighted appropriately compared with data at the “low” end.

On the log scale, this problem often (not always!) sorts itself out, and you can use ordinary least squares without weighting.  Then you can transform back if need be.  If you do this, you’ve done relatively simple procedures whose validity can be easily assessed. 

As Alan suggests, both approaches should give you similar answers, but the latter gives you an implementation with fewer places for user error and better diagnostics.


#9    James      (see all posts) 2010/11/12 (Fri) @ 20:41

You are wrong on the variance issue, because when you drew the line you did so to minimise the square of the distance from each point to the line, not the absolute distance. R2 measures how far you are from a perfect fit compared with a line parallel to the x axis through the mean of the values.  If you hate variance, you will have to redraw all your lines without using the square of the difference, as it doesn’t make much sense to proclaim r as the better measure when you actually did the calculation to find r2. Also, r has no meaning if your line isn’t straight, but r2 still tells you how much of what you are trying to reduce you have reduced.  I know r2 is less intuitive. But it is what determines where the line goes.  You can’t dismiss r2 as unreal when the same approach generated r.


#10    Colin Wyers      (see all posts) 2010/11/12 (Fri) @ 21:09

Tango, try this instead.

In cells A1:A1000, use the function RAND(). In cell B1, use AVERAGE(A1,RAND()), and so on for every row through B1000. You should see a correlation of about .7, for an r-squared of .49 (or so).

Or you could save yourself the trouble, and just look at this:

Sheet1 is the way Tango did it, Sheet2 is the way I did it. The scatterplots should explain what’s going on.
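
For readers who cannot open the spreadsheet, here is a minimal Python sketch of the second setup, where every value in column B is the average of the column A value and a fresh random number. The seed, row count and helper name pearson_r are illustrative assumptions.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(10)  # arbitrary seed so the run is repeatable
n = 100_000

col_a = [random.random() for _ in range(n)]
# Every row of B is half signal (the A value) and half fresh noise.
col_b = [(a + random.random()) / 2 for a in col_a]

r = pearson_r(col_a, col_b)
print(f"r = {r:.3f}, r-squared = {r * r:.3f}")
```

Here half of the variance of B comes from A in every row, so r-squared should come out near .50 and r near .71.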


#11    Tangotiger      (see all posts) 2010/11/12 (Fri) @ 21:17

Matt, the r v r-squared is a separate matter from the salary.

And, no, I’m not making the best fit a straight line.  Just any straight or curved line.


#12          (see all posts) 2010/11/12 (Fri) @ 21:21

Agree with you on the first point, obviously.


#13    Colin Wyers      (see all posts) 2010/11/12 (Fri) @ 21:30

Bah, forgot to include the link:

http://www.editgrid.com/user/cwyers/randr


#14          (see all posts) 2010/11/12 (Fri) @ 22:39

Phil/12, you may feel it’s obvious, but I have the same objection to your assertion that I did to Tango’s.  Your “obvious” way things work breaks down in the multi-variate case, i.e., the real world that we have to deal with in baseball.  That some people misuse the terminology is true; you have to use the terms correctly, of course.


#15    Matt      (see all posts) 2010/11/13 (Sat) @ 02:07

Seems like a lot of bother about nothing for R v R^2. Since one is a direct transformation of another, what difference does it make which one you use? It’s like arguing that saying “0.1” is confusing but “1/10” is not.

Plus, R^2 is always sort of a red herring anyways in the multivariate case, because adding more variables never decreases your R^2 (and in practice almost always increases it). So you can’t rely on a higher R^2 to mean a “better” model or more explanatory power.

Maybe that’s my statistician’s perspective, that the distinction is not a big deal—the people who are writing are not exactly statisticians but are economists or social scientists or vague analysts, and linear regressions and R or R^2 is what they know, and “higher is better.” But in any case, I like James #9, r^2 is the direct calculation because you are minimizing the sum of squares.


#16    Tangotiger      (see all posts) 2010/11/13 (Sat) @ 02:50

But in any case, I like James #9, r^2 is the direct calculation because you are minimizing the sum of squares.

I don’t like that argument.  That’s like saying you prefer seeing MSE rather than RMSE.


#17    Tangotiger      (see all posts) 2010/11/13 (Sat) @ 02:58

Matt: it’s based on what it actually means.  Why not r-cubed, or r-root, since they are all derivable?  It’s what the number itself intrinsically means.

Mike: yes, summing the r-squares is what you would do, though I’m not sure that’s a reason to do anything.

Colin: hmmmmm.... excellent.  Gives me something to think about.


#18          (see all posts) 2010/11/13 (Sat) @ 07:12

Mike/14: I meant it was obvious that I agreed with Tango, not that it was obvious that we’re right.


#19          (see all posts) 2010/11/13 (Sat) @ 07:18

My impression of why academics use the log of salary:

Suppose you’re trying to figure out whether being tall is linked to a higher salary.  And so you try controlling for education, country, position, and everything you can think of, to see if there’s any residual effect for height.

But when you think of height affecting salary, you’d expect the effect to be *proportional*—a *percentage* of salary.  That is, maybe being tall is a 1% advantage.  If you’re working at Walmart where the average is $10 an hour, maybe you earn $10.10 if you’re tall.  If you’re a VP of sales, maybe you earn $100K if you’re normal height, but $101K if you’re tall.

Using log(salary) is how you make the regression do that.  Otherwise, you’d be averaging the VP and the Walmart guy, and you’d wind up with far less meaningful results, because they don’t match reality.
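
In symbols, with a 0/1 indicator for being tall, the two specifications behave as follows. This is a sketch; the coefficient names are illustrative, and beta is chosen to match the 1% example above.

```latex
\log(\text{salary}) = \alpha + \beta \cdot \text{tall} + \dots
\quad\Longrightarrow\quad
\frac{\text{salary}_{\text{tall}}}{\text{salary}_{\text{not tall}}} = e^{\beta}

\beta = \ln(1.01) \approx 0.00995:
\qquad \$10.00 \times e^{\beta} = \$10.10,
\qquad \$100{,}000 \times e^{\beta} = \$101{,}000

\text{salary} = \alpha' + \gamma \cdot \text{tall} + \dots
\quad\Longrightarrow\quad
\text{salary}_{\text{tall}} - \text{salary}_{\text{not tall}} = \gamma
```

The log model gives everyone the same percentage bump, while the plain-salary model gives everyone the same dollar bump.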

In baseball, it doesn’t work that way: one extra double isn’t worth X% of your salary—it’s worth a fixed amount, $500K or something.  And that’s why, in baseball, we use just plain salary instead of log(salary).


#20    Guy      (see all posts) 2010/11/13 (Sat) @ 10:44

"In baseball, it doesn’t work that way: one extra double isn’t worth X% of your salary—it’s worth a fixed amount, $500K or something.  And that’s why, in baseball, we use just plain salary instead of log(salary).”

It depends on the independent variables being used.  In some of the academic models, variables that may have a multiplicative impact are included, like free agency status or playing time.  Unfortunately, these models usually ALSO include variables that should have a fixed value, like OPS or ERA.  See for example the Hakes/Sauer Moneyball paper, http://hubcap.clemson.edu/~sauerr/working/moneyball-v2.pdf, Table 2.  Here log(salary) is predicted by variables with a presumed fixed value (position, OBP, SLG) and also those whose impact is likely multiplicative (PA, FA status). 

This is done all the time, but it seems to me to be a mistake to include both types of variables in a log(salary) model.  However, a lot of you guys understand the methodology better than I do.  Am I off base?


#21    Colin Wyers      (see all posts) 2010/11/13 (Sat) @ 12:45

I think there’s a lot of confusion here about the point of taking the logs of variables being used in a regression. It has to do with the way Ordinary Least Squares works, nothing more or less.

One of the assumptions OLS makes is homoscedasticity - the idea that errors are distributed the same way throughout the data set. The opposite condition is heteroscedasticity:

http://en.wikipedia.org/wiki/Heteroscedasticity

There’s a great scatter plot showing what’s going on there.

If you have data like that, your standard errors will be biased, and by extension anything else that relies on standard errors, like p-values. One way to get around the problem is to take the log of variables that grow exponentially.


#22    Tangotiger      (see all posts) 2010/11/13 (Sat) @ 13:23

Colin, would it make a difference if salary is the independent variable (x) or dependent variable (y)?

After all, the x should have no variance for something like salary, right?  Unlike something like OBP.

It’s also not clear that the log will standardize the variance. Does it?

If I understand your link, it matters if the x parameter is something that is measured as being a sample or the actual thing with no measurement error.  Am I off base here?

***

Alan: yes, your equations match, but I don’t think that would necessarily apply in a regression.  Let me try to construct an example on Monday.


#23          (see all posts) 2010/11/13 (Sat) @ 13:48

I ain’t no Ph.D. statistician, but it seems to me that:

1.  Homoskedasticity is the result of having the right model.  The reason log(salary) is more likely to give you homoskedastic residuals is because the model works better. 

Two ways to look at it: you can say chemotherapy works because it kills the cancer cells while leaving the good cells untouched.  Or, you can say chemotherapy works because it makes the patient live longer.  But the second statement is true because of the first.

Similarly, “the residuals are homoskedastic” is true because “the model is properly specified and fits the data well” is true.

2.  Homoskedasticity is nice, but it’s only an assumption that’s necessary for confidence intervals.  You can still get a usable and useful equation of the relationship without it.

3.  If the real-life relationship is linear, and NOT proportional, then using log(salary) will CREATE heteroskedasticity.  That’s how you can guess it’s the wrong model.


#24          (see all posts) 2010/11/13 (Sat) @ 13:53

For instance, try to predict “total bill” based on number of Cappuccinos, number of Lattes, and number of butter tarts.

Then, try to predict “log(total bill)” based on the same variables.

I predict you will get homoskedasticity in the first case, but heteroskedasticity in the “log” case.  Reason: the first model makes sense, but the second does not.
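
Here is one way to try that prediction, as a sketch only. The prices, the fixed-dollar noise (think tips or rounding), the order sizes and the small-versus-large comparison are all assumptions made for illustration; fit_ols is a hand-rolled helper that solves the normal equations, not a library call.

```python
import math
import random
import statistics

def fit_ols(rows, ys):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, ys))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def residuals(rows, ys, coef):
    return [y - (coef[0] + sum(c * x for c, x in zip(coef[1:], r)))
            for r, y in zip(rows, ys)]

def spread_ratio(rows, res):
    """Residual SD for small orders (<= 3 items) over large orders (>= 12 items)."""
    small = [e for r, e in zip(rows, res) if sum(r) <= 3]
    large = [e for r, e in zip(rows, res) if sum(r) >= 12]
    return statistics.pstdev(small) / statistics.pstdev(large)

random.seed(24)
PRICES = (3.50, 4.00, 2.50)  # hypothetical: cappuccino, latte, butter tart
orders, bills = [], []
while len(orders) < 20_000:
    counts = tuple(random.randint(0, 5) for _ in PRICES)
    if sum(counts) == 0:
        continue  # skip empty orders so log(bill) is defined
    noise = random.gauss(0, 0.30)  # assumed fixed-dollar noise
    orders.append(counts)
    bills.append(sum(p * c for p, c in zip(PRICES, counts)) + noise)

log_bills = [math.log(b) for b in bills]
linear_ratio = spread_ratio(orders, residuals(orders, bills, fit_ols(orders, bills)))
log_ratio = spread_ratio(orders, residuals(orders, log_bills, fit_ols(orders, log_bills)))
print(f"small/large residual spread, total bill:      {linear_ratio:.2f}")
print(f"small/large residual spread, log(total bill): {log_ratio:.2f}")
```

A ratio near 1 means the residual spread is about the same for small and large orders (homoskedastic); a ratio well away from 1 means it is not. In this setup the plain-bill model should sit near 1 and the log model well above it.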


#25    anon      (see all posts) 2010/11/13 (Sat) @ 18:44

Tango 16/17:  Except for sign, r and r^2 provide exactly the same information, so whichever you use basically comes down to personal preference and experience.  If you’ve built good intuition for one, there’s no reason to switch to the other.  You could equivalently build intuition for r^3.  Nobody does this because r^3 doesn’t have a useful mathematical interpretation, but both r and r^2 do.


#26    Matt Swartz      (see all posts) 2010/11/13 (Sat) @ 20:03

The issue with using OLS to determine a salary equation is that while it may predict salaries reasonably well, it will screw up the coefficients by biasing them towards zero.  It’s called attenuation bias, whereby if the X variable (or any of the variables in the X vector) are measured with error, the coefficient on X’s will be biased towards zero.  This is pretty easy to see with one variable.  If you run a regression of salary based on WAR, and WAR is not predicted perfectly by teams setting salaries (which is obviously true-- you can’t predict WAR with perfect certainty before the season), you would get the equation as:

sum(war(i) * salary(i)) / sum(war(i)^2),
for all individuals “i”

but since salary is based on expected performance, any deviation of actual from expected performance would make the equation:

sum( (war(i)+error(i)) * salary(i) ) / sum( (war(i)+error(i))^2 )

even though in expectation error=0, the expectation of error^2 > 0, meaning that the expected value of the whole expression is approximately

sum(war(i)*salary(i)) / sum(war(i)^2 + error(i)^2)

so the extra error(i)^2 in the denominator biases it towards zero.
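
The size of the bias can be checked with a short simulation. This is a sketch with made-up numbers: the true slope, the spread of expected WAR and the size of the error are all assumptions, and the regression is through the origin to match the formulas above.

```python
import random

def slope_through_origin(xs, ys):
    """Regression slope with no intercept: sum(x*y) / sum(x^2)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

random.seed(26)
n = 50_000
true_slope = 5.0  # assumed $ millions of salary per expected win

expected_war = [random.gauss(2.0, 1.5) for _ in range(n)]
errors = [random.gauss(0, 1.5) for _ in range(n)]
salary = [true_slope * w for w in expected_war]            # teams pay on expectation
observed_war = [w + e for w, e in zip(expected_war, errors)]  # what we actually measure

slope_expected = slope_through_origin(expected_war, salary)
slope_observed = slope_through_origin(observed_war, salary)

# Attenuation factor from the derivation: sum(war^2) / sum(war^2 + error^2).
sum_w2 = sum(w * w for w in expected_war)
sum_e2 = sum(e * e for e in errors)
predicted = true_slope * sum_w2 / (sum_w2 + sum_e2)

print(f"slope using expected WAR: {slope_expected:.2f}")
print(f"slope using observed WAR: {slope_observed:.2f}")
print(f"slope predicted by the attenuation formula: {predicted:.2f}")
```

The slope on observed WAR should come out well below the true value and close to what the attenuation formula predicts.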

The issue works better when you regress WAR as a function of salary, but since salary of multi-year deals is not really distributed based on expected performance, that is still measured with error.

That’s the issue with regressing to determine log salary in baseball.  It’s not about logs or r’s or r^2’s.  Regressing on regular non-logged salary is going to be the wrong model and hence cause heteroskedasticity.


#27    Depot      (see all posts) 2010/11/14 (Sun) @ 01:27

I’m actually not sure what the issue really is so I might be missing the point here.  With OLS, you’re constrained by linearity.  Log-transformations just assume a different type of specification (as said in 1): salary=a*exp(x’b).  There’s nothing “right” or “wrong” about salary=a+x’b.  You’re just using a different model.  With the log transformation, the R^2 tells you how much the estimated a*exp(x’b) explains of the salary variable.  I don’t see why this is strange.


#28    Guy      (see all posts) 2010/11/14 (Sun) @ 10:18

The issue that Phil and I are raising (which may or may not be related to Tango’s original point) is that log(salary) models yield conclusions such as:  “a 10-point increase in OBP increases salary by 3%” (#s for illustration).  However, this can’t (or shouldn’t) be true:  a 10-point increase in OBP is worth the same regardless of a player’s other contributions.  For a variable like Free Agency, it might well be true that it increases salary by 120% (rather than a fixed $$ amount).  But for performance variables, the impact should be a fixed amount, not a multiplier effect.

Can someone explain why this is not in fact a problem for the log models?


#29    Tangotiger      (see all posts) 2010/11/14 (Sun) @ 11:54

Colin, your example is fascinating, and I see why r-squared, in that example, makes sense in terms of describing what it means.

Basically, it comes down to whether what you are measuring has any measurement error to it or not.

In my example, there is no measurement error.  I could have for example a table of Fahrenheit and Celsius, and I make sure that half of them translate perfectly, and the other half are random.  And so, the r=.5, not the r-squared=.25, explains the relationship.

But in your example, a more common example, where you have data points that have measurement errors, it’s the r-squared of .50 and not the r of .70 that explains the relationship.

Basically, you have to know something about what the independent variable represents in order to understand whether r or r-squared represents what we see in the data (in English).

***

I disagree with the contention that r and r-squared are the same thing because one derives the other.  The question on the table is if you see r-squared=.25, what does it mean in English.  In my example, who cares that 25% of the “variance” is explained by the independent variables.  What we care about in my example is that half of the dependent values are perfectly explained by the independent values.

In Colin’s excellent example, he has a measurement error around the independent variable.  And so, the r-squared figure captures what he is after.

People are treating the r and r-squared mostly in terms of a scale of -1 to 1 or 0 to 1, and inferring something based on the value itself, as if it kinda means something, but only in terms of what experience tells them.  “Hey, I got an r=.50!”.  Well, what does that mean in English?  And depending on what the independent variables are (measured, or real), it’ll mean two totally different things.  r=.50 is not r=.50 is not r=.50.


#30          (see all posts) 2010/11/15 (Mon) @ 18:51

I appreciate a desire to simplify statistics into plain English that is more consumable by the general public.  However, I sometimes feel like there are as many mistakes made by doing this as there are benefits.  For instance, someone might publish a great study saying that R^2 = .5 at 500 PA for OPS (numbers for illustration), and that gets translated by some to “OPS doesn’t stabilize until 500 PA”.  This then wrongly gives the impression that stability is some sort of binary condition rather than a continuum.  We still have meaningful results at 400 PA, and we of course have even MORE meaningful results at 600 PA.

The issue is not limited to R and R^2.  I could go on and on about “statistical significance” derived from a p-value less than the ubiquitous 0.05…


#31    Professor Spiegel      (see all posts) 2010/11/15 (Mon) @ 23:58

Don’t poormouth variance out of hand. It ties in more intuitively with eigenvalues/eigenvectors, or something like that.


#32    Tangotiger      (see all posts) 2010/11/16 (Tue) @ 01:04

"professor”: I didn’t do anything “out of hand”. Sheesh.  I thought I was pretty clear what I am talking about.


#33    James      (see all posts) 2010/12/15 (Wed) @ 08:08

Tango/16
It’s not that I prefer MSE to RMSE, it’s that I prefer SE to E.
You appear to me to be arguing that using the reduction in SE is meaningless to describe how “good” a regression line is, but you appear to still use minimizing SE as the method to draw the line in the first place.

If you argued that we should draw regression lines minimizing absolute error rather than minimizing SE, then your argument would be consistent (although then there would be a stats discussion to have).

In other words, consider the following two statements:
a) Draw your regression line to minimize the squares of the errors between the datapoints and your line.
b) Describe how well the line fits the data by calculating how much of the squared errors your regression line has reduced compared to a straight line through the average parallel to the x axis (the r2).

I agree with both a and b. You disagree with b, but you also seem to agree with a, which appears contradictory to me.
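
The link between statements a and b can be checked numerically. In this sketch (made-up data; the variable names are illustrative), the line is drawn by least squares as in a, and the share of squared error it removes relative to the flat line through the mean, as in b, matches the square of r.

```python
import random

random.seed(33)
xs = [random.uniform(0, 10) for _ in range(500)]
ys = [2.0 + 0.5 * x + random.gauss(0, 1.5) for x in xs]  # made-up noisy line

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# (a) The least-squares line.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# (b) Squared error left over, versus the flat line through the mean of y.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
sst = syy
reduction = 1 - sse / sst

r = sxy / (sxx * syy) ** 0.5
print(f"share of squared error removed = {reduction:.4f}")
print(f"r = {r:.4f}, r squared = {r * r:.4f}")
```

For a straight line fitted by least squares with an intercept, the two printed r-squared figures agree up to floating-point rounding.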

James

