Friday, November 12, 2010
r v r-squared
Let me explain something with a simple example. Open up Excel, and put =rand() in cells A1 through A1000, and in B501 through B1000. In cell B1, put =A1. Copy that to B500 (where it will show =A500). So, for half the records, the data in columns A and B are identical, and in the other half, there is no relationship whatsoever.
Suppose you were asked to explain the data in plain English, without resorting to correlations or variances. Well, you would say that exactly 50% of the data points in column A are perfectly explained by those in column B, while the other 50% of the data points are completely random. Now, how do you think you would represent that as a number? Well, something like: myNumericTranslationOfWhatISee = .50. Or, more simply, r=.50.
If you run a regression using the Data Analysis package, what will be the result? About r=.50 and r-squared=.25 (the exact figures shift a little every time the random numbers recalculate).
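For anyone without Excel handy, here is a rough sketch of the same experiment in Python. The function names (pearson_r, simulate), the seeds, and the choice to average over 200 simulated sheets are my own assumptions, and Python's random module stands in for =rand().

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate(n=1000, seed=0):
    """Column A is all random; column B copies A for the first half only."""
    rng = random.Random(seed)
    a = [rng.random() for _ in range(n)]
    b = a[: n // 2] + [rng.random() for _ in range(n - n // 2)]
    return pearson_r(a, b)

# Average over many simulated sheets so one sheet's sampling noise washes out.
rs = [simulate(seed=s) for s in range(200)]
mean_r = sum(rs) / len(rs)
mean_r2 = sum(r * r for r in rs) / len(rs)
print(f"mean r = {mean_r:.3f}, mean r-squared = {mean_r2:.3f}")
```

Any single sheet will wobble around these values, but the averages should land near .50 and .25.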
Now, we just figured out what r is. What does r-squared represent? r-squared, or the coefficient of determination:
The squared correlation coefficient (r²) is the proportion of variance in Y that can be accounted for by knowing X. Conversely, it is the proportion of variance in X that can be accounted for by knowing Y. The squared correlation coefficient is also known as the coefficient of determination. It is one of the best means for evaluating the strength of a relationship. For example, we know that the correlation between height and weight is approximately r=.70. If we square this number to find the coefficient of determination - r-squared=.49. Thus, 49 percent of one’s weight is directly accounted for by one’s height and vice versa.
Variance is the square of standard deviation. Standard deviation is something that we can understand. Variance is the square of that. r-squared is based on the variance of the data points. Using variance sounds like bullsh!t to me.
Specifically, how is 49% of the weight directly accounted for? It’s not. 49% of the variance is accounted for. However, who cares about the variance?
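To put the height and weight numbers back into standard-deviation terms, here is a small arithmetic sketch. It assumes an ordinary straight-line regression, where the scatter left over around the line has a standard deviation of sqrt(1 - r-squared) times the original; the function name summarize is just for illustration.

```python
import math

def summarize(r):
    """Translate a correlation r into variance and standard-deviation terms."""
    r_squared = r * r                        # share of the variance accounted for
    leftover_sd = math.sqrt(1 - r_squared)   # residual SD as a share of the SD of Y
    return r_squared, leftover_sd

for r in (0.50, 0.70):
    r2, sd_left = summarize(r)
    print(f"r={r:.2f}: r-squared={r2:.2f}, leftover SD is {sd_left:.2f} of the original SD")
```

So with r=.70, "49% of the variance" corresponds to a leftover scatter that is still about 71% as wide as before, measured in standard deviations.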
In my simple example, what numerical value explains the data, 0.50 or 0.25? I don’t see any need to have to be shown r-squared.
***
I have similar misgivings when researchers take the log of salary, best-fit against the log, and then talk about it as if the correlation is against salary itself. Well, it’s not. It’s against the log of salary and we don’t care about that! Why do researchers use the log of salary? Because they don’t want to best-fit against a wildly curved (exponential) line (which is what they should do) and instead want to best-fit against a more controlled line, because they are afraid of how much the extreme (unlogged) data points would affect the shape of the best-fit line.
Fine. But what the researchers are doing is best-fitting to the log of the salary. They are knowingly accepting that they may miss big on the unlogged salaries at the extreme end.
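Here is a rough sketch of that trade-off on made-up data. Everything in it is an assumption for illustration: the "salary" curve 30000*exp(0.3*x), the multiplicative noise, the function names, and the crude grid search that stands in for a proper curve-fitting routine. It fits the same exponential curve two ways, once to log(salary) and once to salary itself, and scores both fits on both scales.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def sse_dollars(xs, ys, a, b):
    """Sum of squared misses in dollars for the curve a * exp(b * x)."""
    return sum((y - a * math.exp(b * x)) ** 2 for x, y in zip(xs, ys))

def sse_log(xs, ys, a, b):
    """Sum of squared misses on the log scale for the same curve."""
    return sum((math.log(y) - math.log(a) - b * x) ** 2 for x, y in zip(xs, ys))

def fit_dollars(xs, ys, b_guess, steps=400):
    """Crude grid search over b; for each b the best a has a closed form."""
    best = None
    for k in range(steps + 1):
        b = b_guess * (0.5 + k / steps)   # scans 0.5x to 1.5x the guess
        e = [math.exp(b * x) for x in xs]
        a = sum(y * ei for y, ei in zip(ys, e)) / sum(ei * ei for ei in e)
        err = sse_dollars(xs, ys, a, b)
        if best is None or err < best[0]:
            best = (err, a, b)
    return best[1], best[2]

# Made-up data: "salary" grows exponentially with x, with multiplicative noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(300)]
ys = [30000 * math.exp(0.3 * x) * math.exp(rng.gauss(0, 0.5)) for x in xs]

ln_a, b_log = fit_line(xs, [math.log(y) for y in ys])   # best fit to log(salary)
a_log = math.exp(ln_a)
a_dir, b_dir = fit_dollars(xs, ys, b_log)               # best fit to salary itself

for name, a, b in (("log fit", a_log, b_log), ("dollar fit", a_dir, b_dir)):
    print(f"{name}: a={a:.0f}, b={b:.3f}, "
          f"dollar SSE={sse_dollars(xs, ys, a, b):.3e}, log SSE={sse_log(xs, ys, a, b):.2f}")
```

Each fit wins on the scale it was fitted on: the log fit has the smaller log-scale error, and the dollar fit has the smaller error in actual dollars.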
***
These two things I’ve always thought of as bullsh!t. Tell me I’m wrong.


Not sure exactly what you are getting at with the log of salary thing. Let me give an example and you can let me know if I am on the right track. Suppose variable y is exponentially related to variable x: y=a*exp(b*x). I measure y as a function of x, then do a non-linear least-squares fit to find the best values of a and b. I do a weighted fit. Suppose for the sake of argument that all the points are equally weighted (i.e., sig_y=C, a constant). Then I don’t even need to do a weighted fit.
Suppose now that I fit ln(y) to the function A+b*x (where A=ln(a)). I claim I will get exactly the same answer as before provided I appropriately adjust the weighting factors. In particular, sig_ln(y)=C/y. More generally, sig_ln(y)=sig_y/y.
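A quick numerical check of this, as a sketch only. The true values (a=5, b=0.4), the noise level C=0.2, the function names, and the hand-rolled Gauss-Newton loop are all assumptions of mine, not anything from the comment. Since sig_ln(y)=sig_y/y is a first-order error propagation, the sketch only checks that the two fits land very close to each other when the noise is small relative to y.

```python
import math
import random

def weighted_line(xs, ys, ws):
    """Weighted least squares for y = intercept + slope * x."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    slope = sxy / sxx
    return my - slope * mx, slope

def gauss_newton(xs, ys, a, b, iterations=20):
    """Equal-weight nonlinear least squares for y = a * exp(b * x)."""
    for _ in range(iterations):
        s_aa = s_ab = s_bb = g_a = g_b = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            resid = y - a * e
            da, db = e, a * x * e   # partial derivatives of the model wrt a and b
            s_aa += da * da
            s_ab += da * db
            s_bb += db * db
            g_a += da * resid
            g_b += db * resid
        det = s_aa * s_bb - s_ab * s_ab
        step_a = (s_bb * g_a - s_ab * g_b) / det   # solve the 2x2 normal equations
        step_b = (s_aa * g_b - s_ab * g_a) / det
        a, b = a + step_a, b + step_b
    return a, b

# Made-up data: y = 5 * exp(0.4 * x) plus noise with constant sig_y = C.
rng = random.Random(2)
C = 0.2
xs = [5 * i / 49 for i in range(50)]
ys = [5.0 * math.exp(0.4 * x) + rng.gauss(0, C) for x in xs]

# Fit ln(y) = A + b*x with weights 1/sig_ln(y)^2, where sig_ln(y) = C / y.
weights = [(y / C) ** 2 for y in ys]
A, b_w = weighted_line(xs, [math.log(y) for y in ys], weights)
a_w = math.exp(A)

# Fit y = a * exp(b*x) directly, starting from the weighted log-fit answer.
a_nl, b_nl = gauss_newton(xs, ys, a_w, b_w)

print(f"weighted log fit: a={a_w:.4f}, b={b_w:.5f}")
print(f"direct fit:       a={a_nl:.4f}, b={b_nl:.5f}")
```

With larger noise relative to y, the two answers would be expected to drift further apart, since the C/y weighting rests on the small-error approximation.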