Explorations in statistics: correlation

Douglas Curran-Everett

Abstract

Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This sixth installment of Explorations in Statistics explores correlation, a familiar technique that estimates the magnitude of a straight-line relationship between two variables. Correlation is meaningful only when the two variables are true random variables: for example, if we restrict in some way the variability of one variable, then the magnitude of the correlation will decrease. Correlation cannot help us decide if changes in one variable result in changes in the second variable, if changes in the second variable result in changes in the first variable, or if changes in a third variable result in concurrent changes in the first two variables. Correlation can help provide us with evidence that study of the nature of the relationship between x and y may be warranted in an actual experiment in which one of them is controlled.

  • Sir Francis Galton
  • R
  • regression
  • spurious correlation

I can only say that there is a vast field of topics that fall under the laws of correlation, which lies quite open to the research of any competent person who cares to investigate it. Sir Francis Galton (1890)

This sixth paper in Explorations in Statistics (see Refs. 4–8) explores correlation, a technique that estimates the magnitude of a straight-line relationship between two variables. By its very nature, correlation epitomizes the difference between statistical significance and scientific importance (5, 6, 10). Although basic textbooks of statistics (3, 30, 40) discuss correlation, the value of correlation is limited (23, 25, 40). Nevertheless, correlation has a rich history, and it can help provide us with evidence of a relationship between two variables (30).

A Brief History of Correlation

Sir Francis Galton (Fig. 1) pioneered correlation (21, 35, 36, 39a, 42, 43). Galton, a cousin of Charles Darwin, did a lot: he studied medicine, he explored Africa, he published in psychology and anthropology, he developed graphic techniques to map the weather (39a, 42). And, like others of his era, Galton strove to understand heredity (13, 14, 17, 20).

Fig. 1.

Sir Francis Galton, sometime during the 1870s. [Reproduced with permission from Gavan Tredoux and http://Galton.org/.]

In 1877, Galton unveiled reversion, the earliest ancestor of correlation, and described it like this (13): Reversion is the tendency of that ideal mean type to depart from the parent type, reverting towards what may be roughly and perhaps fairly described as the average ancestral type.

The empirical fodder for this observation? The weights of 490 sweet peas. Nine years later, Galton (14) summarized his sweet pea observations in this way: It appeared from these experiments that the offspring did not tend to resemble their parent seeds in size, but to be always more mediocre than they–to be smaller than the parents, if the parents were large; to be larger than the parents, if the parents were very small.

In Galton's subsequent writings (14, 17, 20), reversion evolved into regression.

It was in 1888 that Galton (15) first wrote about correlation: Two variable organs are said to be co-related when the variation of the one is accompanied on the average by more or less variation of the other, and in the same direction … It is easy to see that co-relation must be the consequence of the variations of the two organs being partly due to common causes. If they were wholly due to common causes, the co-relation would be perfect, as is approximately the case with the symmetrically disposed parts of the body. If they were in no respect due to common causes, the co-relation would be nil. The statures of kinsmen are co-related variables; thus, the stature of the father is correlated to that of the adult son … ; the stature of the uncle to that of the adult nephew, … and so on; but the index of co-relation, which is what I there [Ref. 14] called regression, is different in the different cases.

By 1889, Galton was writing co-relation as correlation (42), and he had become fascinated by fingerprints (16, 19). Galton's 1890 account of his development of correlation (18) would be his last substantive paper on the subject (43).

Karl Pearson, Galton's colleague and friend, and father of Egon Pearson, pursued the refinement of correlation (33, 34, 37) with such vigor that the statistic r, a statistic Galton called the index of co-relation (15) and Pearson called the Galton coefficient of reversion (36), is known today as Pearson's r.

With this brief history, we are ready to begin our exploration of correlation.

R: Basic Operations

In the first paper (4) of this series, I summarized R (38) and outlined its installation. For this exploration, there are two additional steps: download Advances_Statistics_Code_Corr.R (see footnote 1) to your Advances folder and install the extra package MASS (see footnote 2).

To install MASS, open R and then click Packages | Install package(s)… (see footnote 3). Select a CRAN mirror (see footnote 4) close to your location and then click OK. Select MASS and then click OK. When you have installed MASS, you will see package ‘MASS’ successfully unpacked and MD5 sums checked in the R console.
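If you prefer the console to the menus, the same installation can be done with two commands. This is a minimal sketch of the standard R approach, not part of Advances_Statistics_Code_Corr.R:

```r
# Install MASS from CRAN and load it into the current session.
# (Run install.packages() once; run library() in each new R session.)
install.packages("MASS")
library(MASS)
```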

To run R commands.

If you use a Mac, highlight the commands you want to submit and then press ⌘+Return (the Command and Return keys). If you use a PC, highlight the commands you want to submit, right-click, and then click Run line or selection. Or, highlight the commands you want to submit and then press Ctrl+R.

The Simulations: Observations and Sample Statistics

In our early explorations (4–6) we drew 1000 random samples of 9 observations from a standard normal distribution with mean μ = 0 and standard deviation σ = 1. For this exploration, we want to draw instead a single sample of 100 observations from each of 9 bivariate normal distributions (Figs. 2–4).
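As a minimal sketch of this sampling step, the MASS function mvrnorm can draw one such sample; the value of rho and the seed below are illustrative choices, and the actual commands live in Advances_Statistics_Code_Corr.R:

```r
# Draw a single sample of 100 observations from a bivariate normal distribution
# with means of 0, standard deviations of 1, and a correlation rho of -0.2.
library(MASS)

rho   <- -0.2                                  # population correlation
Sigma <- matrix(c(1, rho,
                  rho, 1), nrow = 2)           # covariance matrix (SDs of 1)

set.seed(1)                                    # arbitrary seed, for repeatability
obs <- mvrnorm(n = 100, mu = c(0, 0), Sigma = Sigma)
x <- obs[, 1]                                  # the 100 values of x
y <- obs[, 2]                                  # the 100 values of y
```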

Fig. 2.

A bivariate normal distribution. The distribution of both X and Y is a standard normal distribution with mean μ = 0 and standard deviation σ = 1, and the correlation between X and Y is 0.

For each sample of 100 observations, we calculate the sample statistics listed in Table 1. These are the statistics for the 9 samples:

    > #      rho   Ave_x    SD_x   Ave_y    SD_y       r       P Intrcpt   Slope  R_Sqrd
    > SampleStats
            [,1]    [,2]    [,3]    [,4]    [,5]    [,6]    [,7]    [,8]    [,9]   [,10]
     [1,]   -0.1  0.0632  0.8412  0.1336  1.0031 -0.1391  0.1676  0.1441 -0.1658  0.0193
     [2,]   -0.2  0.0015  0.9516  0.0798  0.9296 -0.2599  0.0090  0.0794 -0.2539  0.0675
     [3,]   -0.3  0.0065  1.0200  0.0207  1.0235 -0.4155  0.0000  0.0234 -0.4159  0.1718
     [4,]   -0.4  0.1682  1.0899  0.1371  1.0971 -0.4650  0.0000  0.0584 -0.4681  0.2162
     [5,]   -0.5  0.1638  1.0647  0.1737  0.9940 -0.5678  0.0000  0.0869 -0.5301  0.3224
     [6,]   -0.6  0.1400  1.0805  0.1598  1.1692 -0.6214  0.0000  0.0657 -0.6724  0.3861
     [7,]   -0.7  0.1258  0.9924  0.0682  1.0684 -0.7300  0.0000  0.0307 -0.7859  0.5329
     [8,]   -0.8  0.0303  1.0157  0.0018  1.0714 -0.8141  0.0000  0.0242 -0.8588  0.6628
     [9,]   -0.9  0.1976  0.9259  0.1527  0.9237 -0.9076  0.0000  0.0262 -0.9055  0.8237

Table 1.

Sample statistics calculated for each random sample

The commands in lines 25–60 of Advances_Statistics_Code_Corr.R compute these statistics. Because the samples are drawn at random, the values of your statistics will differ.
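For a single sample, the statistics in Table 1 can be computed with a few base R functions. This is a minimal sketch that assumes the vectors x and y from the sampling sketch above; the script itself stores one such row per population in the matrix SampleStats:

```r
# Sample statistics for one pair of vectors x and y (cf. Table 1).
ave_x <- mean(x);  sd_x <- sd(x)               # mean and SD of x
ave_y <- mean(y);  sd_y <- sd(y)               # mean and SD of y

r <- cor(x, y)                                 # Pearson's r
p <- cor.test(x, y)$p.value                    # P value for H0: rho = 0

fit <- lm(y ~ x)                               # first-order regression of y on x
b0  <- coef(fit)[1]                            # intercept
b1  <- coef(fit)[2]                            # slope
rsq <- summary(fit)$r.squared                  # R^2

round(c(ave_x, sd_x, ave_y, sd_y, r, p, b0, b1, rsq), 4)
```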

With these 9 sets of sample observations and statistics, we are ready to explore correlation.

Correlation

As we know, correlation estimates the magnitude of the straight-line relationship between two variables. The usual correlation statistic is Pearson's r, defined as

$$ r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \, \sum_{i}(y_i - \bar{y})^2}} , \qquad (1) $$

where xi and yi are the values of x and y for observation i and where x̄ and ȳ are the sample means of x and y (11, 30, 39, 40) (see footnote 5). By virtue of the Cauchy-Schwarz inequality (28), the value of the correlation coefficient r can vary from −1 to +1 (see footnote 6). If r = −1, then the observations fall on a straight line whose slope is negative. If r = +1, then the observations fall on a straight line whose slope is positive. Figure 4 depicts the straight-line relationship between x and y for sample correlation coefficients that range from −0.14 to −0.91.
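As a quick check on Eq. 1, we can compute r directly from its definition and compare the result with R's built-in cor function. This minimal sketch assumes the vectors x and y from the earlier sampling sketch:

```r
# Pearson's r computed directly from Eq. 1 ...
r_eq1 <- sum((x - mean(x)) * (y - mean(y))) /
         sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# ... and from the built-in function: the two values agree.
r_eq1
cor(x, y)
```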

In our first exploration, we generated theoretical distributions for two sample statistics: the standard deviation and the mean (Ref. 4, Figs. 3 and 5). We can do the same thing for the sample correlation coefficient r (Fig. 5).

Fig. 3.

The populations. For each bivariate normal distribution, the distribution of both X and Y is a standard normal distribution with mean μ = 0 and standard deviation σ = 1. The correlation between X and Y varies from −0.9 (top left) to +0.9 (bottom right).

Fig. 4.

The samples. Each sample of 100 observations was drawn at random from a bivariate normal distribution with a correlation coefficient ρ (top left) that varied from −0.1 to −0.9 (see Fig. 3). For each sample, the correlation coefficient r (top right) and its corresponding P value (bottom right) are listed. Axes (gray) are at x = 0 and y = 0. The commands in lines 25–74 of Advances_Statistics_Code_Corr.R create this data graphic. To generate this data graphic, highlight and submit the lines of code from Figure 4: first line to Figure 4: last line.

Fig. 5.

The theoretical distribution of the sample correlation coefficient for 100 observations drawn from a bivariate normal distribution with a correlation ρ of 0, 0.3, 0.5, 0.7, or 0.9. Inset: the theoretical distribution of the sample correlation coefficient for 10 observations drawn from a bivariate normal distribution with a correlation ρ of −0.7, −0.5, or 0. Calculated from Refs. 12 and 45.

If we restrict x or y in some way, then the correlation coefficient will reflect not only the relationship between x and y but also our restriction (30). For example, if we decrease the variability of x, then the magnitude of the correlation coefficient will decrease (Fig. 6); this is a general phenomenon (Fig. 7). The message: correlation is meaningful only when x and y are random variables (23, 30, 40).
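A minimal sketch of this restriction, assuming the vectors x and y from the earlier sampling sketch: keep only the observations whose x values fall between the 25th and 75th percentiles and recompute r (cf. Figs. 6 and 7).

```r
# Restrict the sample to the middle 50% of x values and recompute r.
q      <- quantile(x, probs = c(0.25, 0.75))   # 25th and 75th percentiles of x
middle <- x >= q[1] & x <= q[2]                # observations in the middle 50%

cor(x, y)                                      # r for all observations
cor(x[middle], y[middle])                      # r for the restricted sample:
                                               # usually smaller in magnitude
```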

Fig. 6.

The impact of decreased variability on the correlation coefficient. For the 100 original observations (black and gray circles), the sample correlation coefficient r = −0.91 (see Fig. 4, bottom right). If we estimate the sample correlation coefficient using only those observations from the middle 50% of x values (black circles), the correlation coefficient r decreases to −0.70. Axes (gray) are at x = 0 and y = 0. The commands in lines 85–129 of Advances_Statistics_Code_Corr.R generate this simulation. To generate this data graphic, highlight and submit the lines of code from Figure 6: first line to Figure 6: last line.

Fig. 7.

The impact of decreased variability on the expected value of the correlation coefficient. We draw at random 1000 samples of n observations from a bivariate normal distribution with a correlation coefficient ρ that varies from 0.1 to 0.9. For each sample, we estimate the correlation coefficient rn. Then, we recalculate the sample correlation coefficient, now denoted rn/2, using only those observations from the middle 50% of x values (see Fig. 6). Last, we compute the change Δr in the sample correlation coefficient as Δr = rn/2rn. If Δr is negative, then the sample correlation coefficient r decreases when we restrict the values of x. In general, regardless of sample size, the expected value–the average Δr–of the sample correlation coefficient decreases. The lone exception: the average Δr is 0 when the correlation coefficient ρ = 0.1 and when the sample size n = 100. The most pronounced decrease occurs when the population correlation coefficient ρ is close to 0.75.
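A minimal sketch of the simulation behind Fig. 7, for one population correlation ρ and one sample size n; the figure repeats this over ρ = 0.1 to 0.9 and several sample sizes, and the values below are illustrative choices:

```r
# Average change in r when each sample is restricted to the middle 50% of x.
library(MASS)

rho   <- 0.75                                  # population correlation
n     <- 100                                   # sample size
nsim  <- 1000                                  # number of random samples
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)

set.seed(1)                                    # arbitrary seed, for repeatability
delta_r <- replicate(nsim, {
  obs <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
  x <- obs[, 1];  y <- obs[, 2]
  q <- quantile(x, probs = c(0.25, 0.75))
  middle <- x >= q[1] & x <= q[2]
  cor(x[middle], y[middle]) - cor(x, y)        # delta r = r_(n/2) - r_n
})

mean(delta_r)                                  # negative: restriction lowers r
```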

Moreover, the apparent meaning of a correlation can be distorted by a single observation (Fig. 8). Applets (22, 29, 41) are useful tools with which to explore this effect.

Fig. 8.

The impact of a single point on the correlation coefficient. For the 11 original observations (black circles), the sample correlation coefficient r = +0.56 (P = 0.07). Data are a subset of those reported by Pearson and Lee (37) and used by Snedecor and Cochran (40). If the point (65, 59) becomes instead (65, 69), the correlation coefficient r decreases to −0.02 (P = 0.96). The commands in lines 138–163 of Advances_Statistics_Code_Corr.R generate this simulation. To generate this data graphic, highlight and submit the lines of code from Figure 8: first line to Figure 8: last line.
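To see how much a single observation can matter, here is a minimal sketch with invented data; these 11 points are illustrative only and are not the Pearson and Lee subset plotted in Fig. 8.

```r
# Ten clustered points with essentially no relationship, plus one extreme point.
x1 <- c(1.0, 1.2, 0.8, 1.1, 0.9, 1.3, 0.7, 1.0, 1.1, 0.9, 5.0)
y1 <- c(2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9, 2.2, 2.0, 6.0)

cor(x1, y1)         # close to +1: the extreme point creates the correlation

y1[11] <- 2.0       # move the extreme point from (5.0, 6.0) to (5.0, 2.0)
cor(x1, y1)         # close to 0: the apparent relationship disappears
```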

Relationship to Regression

Just as we can use correlation to explore a relationship between two variables, so too can we use regression. Regression estimates a straight-line relationship between x and y as

$$ \hat{y} = b_0 + b_1 x , $$

where ŷ is the predicted value of the response y, b0 is the y intercept, and b1 is the slope, the change in y when x increases by 1 unit. The statistic R² characterizes the proportion of the variation in y that is accounted for by the regression (11). As you might expect, correlation is related to regression.

The relationship between the correlation coefficient r and the slope b1 is

$$ b_1 = r \, \frac{s_y}{s_x} , $$

where sy and sx are the sample standard deviations of y and x (11, 40). In this exploration, we defined the standard deviation σ of X and Y to be 1 (see Fig. 3). This means we expect sy and sx to be similar. As a result, the slope b1 will approximate the correlation coefficient r. We can use our sample statistics to confirm this empirically. Suppose we use the statistics from when the correlation ρ was −0.2:

$$ b_1 = r \, \frac{s_y}{s_x} = -0.2599 \times \frac{0.9296}{0.9516} = -0.2539 . $$

This derived value matches the value of the slope we obtained from regression (Table 1, column 9). Regardless of sy and sx, if the correlation coefficient r is 0, then the slope b1 will be 0.

The relationship between r and the regression statistic R² is direct (11, 40):

$$ R^2 = r^2 . $$

We can use our sample correlation coefficient r = −0.2599 to confirm this:

$$ R^2 = (-0.2599)^2 = 0.0675 . $$

This derived value matches the value we obtained from regression (Table 1, column 10).
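A minimal sketch of both relationships, assuming the vectors x and y from the earlier sampling sketch:

```r
# Compare r and its square with the slope and R^2 reported by lm().
fit <- lm(y ~ x)                  # first-order regression of y on x
r   <- cor(x, y)                  # Pearson's r

r * sd(y) / sd(x)                 # b1 = r(sy/sx) ...
coef(fit)[2]                      # ... matches the slope from lm()

r^2                               # r squared ...
summary(fit)$r.squared            # ... matches R^2 from lm()
```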

With this background, the next question is, how do we interpret a correlation?

Interpretation

Suppose we want to estimate ρ, the correlation between two physiological things we care about. We define the null and alternative hypotheses, H0 and H1, as

$$ H_0\colon \rho = 0 \qquad H_1\colon \rho \neq 0 . $$

That is, the sample observations are consistent with having come from a bivariate population that has a correlation ρ of 0, or the sample observations are consistent with having come from a bivariate population that has a correlation ρ other than 0 (6).

To test this null hypothesis, suppose we draw at random 100 observations from a bivariate normal distribution with a correlation ρ = −0.2 (see Fig. 3, middle left). The scatterplot in Fig. 4, top middle, depicts the relationship we observe between x and y, the things we care about.

When we estimate the sample correlation coefficient r, we obtain r = −0.2599 ≐ −0.26. Associated with this value of r is a P value of 0.009. What do we do with r = −0.26? We interpret it within the context of a true null hypothesis: if the true correlation is 0, how usual is this value of r? If the null hypothesis is true, we expect to observe a value of |r| at least as big as 0.26 just 9 times in 1000 (P = 0.009) (see footnote 7). Suppose we established beforehand a benchmark of α = 0.01 (9). If the null hypothesis is true, the sample correlation |r| = 0.26 is more unusual than our benchmark. As a result, we reject the null hypothesis and conclude that the sample observations are consistent with having come from a bivariate population that has a correlation ρ other than 0.
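In R, cor.test carries out this test; a minimal sketch, assuming the vectors x and y from the earlier sampling sketch:

```r
# Test H0: rho = 0 against H1: rho != 0 for the sample (x, y).
test <- cor.test(x, y)

test$estimate        # the sample correlation coefficient r
test$statistic       # t statistic on n - 2 degrees of freedom (see footnote 7)
test$p.value         # compare with the benchmark alpha = 0.01
```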

The statistical conclusion is clear: changes in y are related to changes in x. The scientific question is, how well do changes in x account for changes in y? We can answer this question if we square r to get R²:

$$ R^2 = (-0.2599)^2 = 0.0675 \doteq 0.07 . $$

How well do changes in x account for changes in y? Not well: changes in x account for 7% of the variation in y. The scatterplot in Fig. 4, top middle, confirms this weak relationship.

Bear in mind that a correlation may not be meaningful even when changes in x account for most of the changes in y (Fig. 9). The message: a scatterplot is essential if you want to interpret a correlation.

Fig. 9.

Scatterplots of 2 samples of 11 observations. For each, the fitted first-order regression model is ŷ = 3 + 0.5x, the correlation coefficient r is 0.82, and R² is 0.67 (1, 10). For only one sample (gray circles), however, is the correlation coefficient meaningful. For the second (black circles), a second-order model of the form Y = β0 + β1X + β2X² + ε describes the relationship between X and Y, and the correlation coefficient underestimates the magnitude of the actual relationship between X and Y.

A correlation is not meaningful if x and y are related through computation. For some physiological response, suppose we consider the initial value yi and the subsequent change Δy to be drawn at random from a normal distribution with mean μ = 0 and standard deviation σ = 1. There is no relationship between the change Δy and the initial value yi (Fig. 10). But if we calculate the final value yf as yf = yi + Δy, we create an obvious relationship between the final value yf and the initial value yi. This phenomenon is one example of spurious correlation, recognized by Pearson (34) and condemned by Neyman (31): “Spurious correlations have been ruining empirical statistical research from times immemorial.”
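A minimal sketch of this computational coupling (cf. Fig. 10); the seed is an arbitrary choice:

```r
# The initial value and the change are independent draws, yet the computed
# final value is correlated with the initial value.
set.seed(1)                                    # arbitrary seed, for repeatability
y_initial <- rnorm(1000, mean = 0, sd = 1)     # initial values
delta_y   <- rnorm(1000, mean = 0, sd = 1)     # independent changes
y_final   <- y_initial + delta_y               # final value, computed from the two

cor(y_initial, delta_y)                        # near 0: no real relationship
cor(y_initial, y_final)                        # near 0.7: correlation created by
                                               # the computation alone
```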

Fig. 10.

Scatterplots of the change in some response (top) and the final value (bottom) against the initial value for 1000 observations. Despite the absence of a relationship between the change and the initial value (r = −0.02), there is an obvious relationship between the final value and the initial value (r = 0.70). This is the phenomenon of mathematical coupling. The commands in lines 170–208 of Advances_Statistics_Code_Corr.R create this data graphic. To generate this data graphic, highlight and submit the lines of code from Figure 10: first line to Figure 10: last line.

A correlation is also not meaningful if x and y are values that result from different ways of measuring the same thing (2, 23, 26, 27, 40). In this situation, x and y will be related simply because they estimate the same quantity.

Imagine this: you discover that changes in y are related to changes in x (P = 0.01) and, even better, that changes in x account for 90% of the variation in y. You like this. A lot. But what do you make of this result? You have three choices:

  • 1. Changes in x result in changes in y.

  • 2. Changes in y result in changes in x.

  • 3. Changes in z result in changes in x and y.

From correlation alone, you have no idea what is going on.

When I teach my statistics course, I use these examples (24) to illustrate the tenuous nature of correlation:

  • Lice make you healthy. You say you want evidence? On islands in the South Pacific, healthy people have lice. Sick people do not. What more do you need? The explanation? When someone gets sick, they often develop a fever. Lice do not like higher temperatures.

  • There is a tight relationship between the salaries of Presbyterian ministers in Massachusetts and the price of rum in Havana. The difficult question is, do the ministers support the rum trade, or do they profit from it? The answer is neither: worldwide inflation affects salaries and the price of rum.

In the first example, it is not that lice (x) make you healthy (y). Rather, it is that being sick with a fever (y) drives off lice (x). In the latter example, worldwide inflation (z) affects salaries (x) and the price of rum (y).

Summary

As this exploration has demonstrated, correlation estimates the magnitude of a simple straight-line relationship between two variables, and it embodies the fundamental concepts of statistical significance and scientific importance (5, 6, 10). Still, correlation cannot help us decide if changes in x result in changes in y, if changes in y result in changes in x, or if changes in a third variable result in simultaneous changes in x and y. Correlation can help provide us with evidence that study of the nature of the relationship between x and y may be warranted in an actual experiment in which one of them is controlled.

In the next installment of this series, we will explore regression, a technique that, like correlation, estimates the nature of a straight-line relationship between two variables. Unlike correlation, however, regression can estimate the nature of different kinds of relationships between two variables. Because of this versatility, regression is a far more useful technique.

DISCLOSURES

No conflicts of interest, financial or otherwise, are declared by the author(s).

ACKNOWLEDGMENTS

I thank John Ludbrook (Department of Surgery, The University of Melbourne, Melbourne, Victoria, Australia) and Lori Silveira and Matthew Strand (National Jewish Health, Denver, CO) for their helpful comments and suggestions.

Footnotes

  • 1 This file is available through the Supplemental Material link for this article at the Advances in Physiology Education website.

  • 2 MASS was so named because its functions support Modern Applied Statistics with S (44).

  • 3 The notation click A | B means click A, then click B.

  • 4 CRAN stands for Comprehensive R Archive Network. A mirror is a duplicate server.

  • 5 Pearson's r can be written in other ways (23, 30, 39, 40).

  • 6 The Cauchy-Schwarz inequality dictates that |x·y| ≤ |x|·|y|. In Eq. 1, this means the magnitude of the numerator is less than or equal to the magnitude of the denominator. As a result, |r| ≤ 1.

  • 7 One test statistic with which we can assess H0: ρ = 0 is the familiar t statistic. Because the distribution of the sample correlation coefficient r is complex (Fig. 5), we compute t as $t = \sqrt{r^2 (n-2) / (1 - r^2)}$, where n is the number of sample observations (30, 40). The degrees of freedom for this t distribution is n − 2.

REFERENCES
