The closer a regression line is to its SD line, the closer the correlation coefficient is to 1 or -1

The strength of the linear association between two variables is quantified by the correlation coefficient.
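Given a set of observations (x1, y1), (x2, y2), ..., (xn, yn), the formula for computing the correlation coefficient is given by

\[r=\frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)\]

where \(\bar{x}\) and \(\bar{y}\) are the sample means and \(s_x\) and \(s_y\) are the sample standard deviations of the two variables.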
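
As a minimal sketch of this calculation, the function below computes the correlation coefficient as the sum of products of standardized values divided by n - 1. The function name `correlation` and the height and weight values are hypothetical, chosen only for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r: sum of products of standardized values, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample SDs (these divide by n - 1)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Hypothetical illustrative data, not taken from any study.
heights = [60, 62, 65, 68, 70, 72]
weights = [115, 120, 140, 155, 160, 180]

r = correlation(heights, weights)
print(round(r, 3))
```

Note that the sketch fails with a division by zero if either variable is constant, since its standard deviation is then 0 and the correlation is undefined.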

The correlation coefficient always takes a value between -1 and 1, with 1 or -1 indicating perfect correlation (all points would lie along a straight line in this case). A positive correlation indicates a positive association between the variables (increasing values in one variable correspond to increasing values in the other variable), while a negative correlation indicates a negative association between the variables (increasing values in one variable correspond to decreasing values in the other variable). A correlation value close to 0 indicates little or no linear association between the variables.

Since the formula for calculating the correlation coefficient standardizes the variables, changes in scale or units of measurement will not affect its value. For this reason, the correlation coefficient is often more useful than a graphical depiction in determining the strength of the association between two variables.
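
The following sketch illustrates this invariance by computing the correlation for the same observations in two unit systems. The helper `pearson_r` and the data are hypothetical, used only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustrative data.
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [115, 120, 140, 155, 160, 180]

# The same observations expressed in centimetres and kilograms.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]

r_imperial = pearson_r(heights_in, weights_lb)
r_metric = pearson_r(heights_cm, weights_kg)
print(math.isclose(r_imperial, r_metric, rel_tol=1e-9))
```

The invariance holds for any positive rescaling and for shifts of origin. Multiplying one variable by a negative constant flips the sign of the correlation but leaves its magnitude unchanged.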


Correlation in Linear Regression

The square of the correlation coefficient, r², is a useful value in linear regression. This value represents the fraction of the variation in one variable that may be explained by the other variable. Thus, if a correlation of 0.8 is observed between two variables (height and weight, for example), then a linear regression model attempting to explain either variable in terms of the other will account for 0.8² = 0.64, or 64%, of the variability in the variable being explained.
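
One way to check this interpretation numerically is to fit a least-squares line and compare the fraction of variation it explains, 1 - SSE/SST, with r². The sketch below does this in both directions. The function names and the data are hypothetical, used only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fraction_explained(xs, ys):
    """1 - SSE/SST for the least-squares line predicting ys from xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx              # least-squares slope
    a = y_bar - b * x_bar      # least-squares intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst

# Hypothetical illustrative data.
heights = [60, 62, 65, 68, 70, 72]
weights = [115, 120, 140, 155, 160, 180]

r = pearson_r(heights, weights)
print(r ** 2)
print(fraction_explained(heights, weights))  # weight explained by height
print(fraction_explained(weights, heights))  # height explained by weight
```

Up to floating-point rounding, all three printed values agree, which is why the statement holds for "either variable in terms of the other".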

The correlation coefficient also relates directly to the regression line Y = a + bX for any two variables, where

\[b=r\,\frac{s_y}{s_x}\quad\text{and}\quad a=\bar{y}-b\,\bar{x}.\]

Because the least-squares regression line will always pass through the means of x and y, the regression line may be entirely described by the means, standard deviations, and correlation of the two variables under investigation.
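
To illustrate, the sketch below builds the regression line from the five summary values alone, using slope b = r·s_y/s_x and intercept a = ȳ - b·x̄, and compares it with a direct least-squares fit. The function names and the data are hypothetical.

```python
import math
from statistics import mean, stdev

def line_from_summary(x_bar, y_bar, s_x, s_y, r):
    """Intercept a and slope b of y = a + b*x from summary statistics."""
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

def least_squares(xs, ys):
    """Intercept and slope from the raw data, for comparison."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical illustrative data.
heights = [60, 62, 65, 68, 70, 72]
weights = [115, 120, 140, 155, 160, 180]

x_bar, y_bar = mean(heights), mean(weights)
s_x, s_y = stdev(heights), stdev(weights)  # sample SDs
r = sum((x - x_bar) * (y - y_bar)
        for x, y in zip(heights, weights)) / ((len(heights) - 1) * s_x * s_y)

a_summary, b_summary = line_from_summary(x_bar, y_bar, s_x, s_y, r)
a_direct, b_direct = least_squares(heights, weights)
print(a_summary, b_summary)
print(a_direct, b_direct)
```

The two printed lines should match up to rounding, and evaluating the fitted line at x̄ returns ȳ.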

For some good examples of studies using correlation, see the correlation index in the Statlib Data and Story Library (DASL).

The correlation coefficient r is directly related to the coefficient of determination r² in the obvious way. If r² is represented in decimal form, e.g. 0.39 or 0.87, then all we have to do to obtain r is to take the square root of r²:

\[r= \pm \sqrt{r^2}\]

The sign of r depends on the sign of the estimated slope coefficient b1:

  • If b1 is negative, then r takes a negative sign.
  • If b1 is positive, then r takes a positive sign.

That is, the estimated slope and the correlation coefficient r always share the same sign. Furthermore, because r² is always a number between 0 and 1, the correlation coefficient r is always a number between -1 and 1.
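
The sign rule can be sketched in a few lines. The helper name `r_from_r_squared` is hypothetical, and the input values below are illustrative only.

```python
import math

def r_from_r_squared(r_squared, slope):
    """Recover r from r-squared, taking its sign from the estimated slope."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r_squared must lie between 0 and 1")
    if slope == 0:
        return 0.0  # a zero slope goes with a zero correlation
    return math.copysign(math.sqrt(r_squared), slope)

print(r_from_r_squared(0.64, slope=2.5))   # positive slope, positive r
print(r_from_r_squared(0.64, slope=-2.5))  # negative slope, negative r
```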

One advantage of r is that it is unitless, allowing researchers to make sense of correlation coefficients calculated on different data sets with different units. The "unitless-ness" of the measure can be seen from an alternative formula for r, namely:

\[r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}\]

If x is the height of an individual measured in inches and y is the weight of the individual measured in pounds, then the units for the numerator are inches × pounds. Similarly, the units for the denominator are inches × pounds. Because they are the same, the units in the numerator and denominator cancel each other out, yielding a "unitless" measure.

Another formula for r that you might see in the regression literature is one that illustrates how the correlation coefficient r is a function of the estimated slope coefficient b1:

\[r=\frac{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}}{\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}\times b_1\]

We are readily able to see from this version of the formula that:

  • The estimated slope b1 of the regression line and the correlation coefficient r always share the same sign.
  • The correlation coefficient r is a unitless measure.
  • If the estimated slope b1 of the regression line is 0, then the correlation coefficient r must also be 0.
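
As a rough check of the slope-based formula, the sketch below computes the least-squares slope b1 and then obtains r by multiplying it by the ratio of the square roots of the two sums of squares. The function name `slope_and_r` and both data sets are hypothetical.

```python
import math

def slope_and_r(xs, ys):
    """Least-squares slope b1, and r obtained from b1 via the ratio formula."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    r = math.sqrt(sxx) / math.sqrt(syy) * b1
    return b1, r

# Hypothetical data with a clear upward trend.
b1_up, r_up = slope_and_r([60, 62, 65, 68, 70, 72],
                          [115, 120, 140, 155, 160, 180])
print(b1_up, r_up)

# Hypothetical data whose fitted slope is zero.
b1_flat, r_flat = slope_and_r([1, 2, 3, 4], [2, 1, 1, 2])
print(b1_flat, r_flat)
```

Because the multiplying ratio is always positive, r inherits the sign of b1, and a slope of 0 forces r to be 0.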

That's enough with the formulas! As always, we will let statistical software such as R or Minitab do the dirty calculations for us. For the skin cancer mortality and latitude example (skincancer.txt), the correlation between skin cancer mortality and latitude is -0.825. The order in which you specify the variables doesn't matter, so the correlation between latitude and skin cancer mortality is also -0.825.

What does this correlation coefficient tell us? That is, how do we interpret the Pearson correlation coefficient r? Unlike r², r has no nice practical operational interpretation. You can only use r to make a statement about the strength of the linear relationship between x and y. In general:

  • If r = -1, then there is a perfect negative linear relationship between x and y.
  • If r = 1, then there is a perfect positive linear relationship between x and y.
  • If r = 0, then there is no linear relationship between x and y.

All other values of r tell us that the relationship between x and y is not perfect. The closer r is to 0, the weaker the linear relationship. The closer r is to -1, the stronger the negative linear relationship. And, the closer r is to 1, the stronger the positive linear relationship. As is the case for the r² value, what is deemed a "large" correlation coefficient r value depends greatly on the research area.
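
The three reference cases can be reproduced with small constructed data sets. In the sketch below, the helper `pearson_r` and all values are hypothetical. The last case uses y = x² on x values symmetric about 0 to show that r = 0 rules out only a linear relationship, not every relationship.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-2, -1, 0, 1, 2]
y_up = [2 * x + 1 for x in xs]      # points exactly on a rising line
y_down = [-3 * x + 4 for x in xs]   # points exactly on a falling line
y_curve = [x ** 2 for x in xs]      # exact, but nonlinear, relationship

r_up = pearson_r(xs, y_up)
r_down = pearson_r(xs, y_down)
r_curve = pearson_r(xs, y_curve)
print(r_up, r_down, r_curve)

# The order of the two variables does not matter.
print(math.isclose(pearson_r(xs, y_down), pearson_r(y_down, xs)))
```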

So, what does the correlation of -0.825 between skin cancer mortality and latitude tell us? It tells us:

  • The relationship is negative. As the latitude increases, the skin cancer mortality rate decreases (linearly).
  • The relationship is quite strong (since the value is pretty close to -1).

The most commonly used techniques for investigating the relationship between two quantitative variables are correlation and linear regression. Correlation quantifies the strength of the linear relationship between a pair of variables, whereas regression expresses the relationship in the form of an equation.

What does it mean when a correlation coefficient is close to?

How do you interpret a correlation coefficient? Correlations range from -1.00 to +1.00. The correlation coefficient (expressed as r ) shows the direction and strength of a relationship between two variables. The closer the r value is to +1 or -1, the stronger the linear relationship between the two variables is.

What does it mean if the points are close to the regression line?

Here's the basic idea: The closer the line is to all of the data points, the better the line summarizes the pattern in the data. Notice that when the line is close to the data points, it gives better predictions.

Is correlation coefficient the same as standard deviation?

Standard deviation is a measure of the dispersion of data from its average. Covariance shows whether the two variables tend to move in the same direction, while the correlation coefficient measures the strength of that relationship on a normalized scale, from -1 to 1.
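
The link between the three quantities can be sketched as follows: the correlation coefficient is the covariance divided by the product of the two standard deviations. The function names and data are hypothetical, and the sample versions (dividing by n - 1) are assumed throughout.

```python
import math
from statistics import mean, stdev

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance rescaled by both sample standard deviations."""
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))

# Hypothetical illustrative data.
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [115, 120, 140, 155, 160, 180]
heights_cm = [h * 2.54 for h in heights_in]

# Covariance depends on the units of measurement; correlation does not.
print(covariance(heights_in, weights_lb), covariance(heights_cm, weights_lb))
print(correlation(heights_in, weights_lb), correlation(heights_cm, weights_lb))
```

Changing inches to centimetres multiplies the covariance by 2.54 but leaves the correlation unchanged, which is what "normalized scale" means here.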