Key Points

Correlation and regression are used with continuous variables

Plot the variables in correlation and regression relationships to aid interpretation

An association between two continuous measurements is assessed by correlation

Regression describes and quantifies a relationship between an independent factor and a dependent variable; prediction is also possible

Few biological relationships are truly linear

Regression can be distorted by outlying values

Absence of a linear regression does not mean a relationship is not present

Regression is very frequently misused and misapplied
Our previous article in this series considered analysis of variance (4), which is applied to measurements made on samples from different groups. That approach considers the variation of the data, and specifically the variation between samples, which reflects the different levels of the factor that defines each group. The variation between samples is contrasted with the residual variation, found within the samples. We consider the possibility that the groups could differ because of the different conditions of a factor. This is as far as the analysis can extend: the consideration is restricted to groups characterized by the different categories of the factor being considered. For example, in Figure 1A, the factor we had considered in our samples of jumping frogs is the State from which they were sampled. The data were categorized according to the geographic origin of the frogs. However, in many biological experiments, the factor considered may not be just a simple category: it may be expressed in terms of order, or even as a continuous variable. If this can be done, then other helpful and powerful methods of statistical analysis can be used.
Ranked tests.
If the categories can be ranked, then tests can exploit this ranking. These include the Mann-Whitney (also known as the Wilcoxon rank sum) test, and Kendall's and Spearman's rank correlation tests, and simple examples are clearly described by Moses and colleagues (7). By arranging the levels in a way that allows a factor to be logically graded (such as not present, mild, moderate, or severe), the ordered levels can be related to the measurements. For example, in Figure 1B we have ranked the geographic origin from east to west, and there appears to be a possible association.
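To make the idea of a rank correlation concrete, here is a minimal Python sketch of Spearman's coefficient, computed with the classic formula 1 − 6Σd²/(n(n² − 1)). It assumes no tied values; the function name and data are purely illustrative, not from the article.

```python
def spearman_rho(x, y):
    """Spearman's rank correlation for two equal-length samples.

    Assumes no tied values within a sample. Uses the classic formula
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    where d_i is the difference between the ranks of x_i and y_i.
    """
    n = len(x)
    # Rank each sample: 1 for the smallest value, n for the largest.
    rank_x = {v: i + 1 for i, v in enumerate(sorted(x))}
    rank_y = {v: i + 1 for i, v in enumerate(sorted(y))}
    d_squared = sum((rank_x[a] - rank_y[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))
```

A perfectly monotone increasing relationship gives rho = 1 and a monotone decreasing one gives rho = −1, regardless of whether the relationship is linear, which is exactly why rank methods suit ordered but non-numeric gradings.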
Quantitative variables.
In many other experiments, we use quantitative variables rather than ranked categories, and relate the values observed in one quantitative variable to another associated quantitative variable. The linkage, or association, between them can be mathematically determined by their correlation. When correlation is calculated, the strength of the association indicates how much of the variation of the two features occurs in the 'same direction'. A simple example might be body weight and height: if the correlation between these measures in a sample of adult females were perfect, the strength of the association, indicated by the correlation coefficient, would be 1, indicating that these two factors are exactly linked. The correlation coefficient can range from −1 (a perfect inverse association) to 1 (a perfect positive association). The mathematics of correlation are explained by Curran-Everett (2).
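A minimal, pure-Python sketch of the Pearson correlation coefficient just described. The height and weight figures are invented for illustration only; the function name is ours, not from the cited reference.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples.

    r = sum((xi - mx)*(yi - my)) / sqrt(sum((xi - mx)^2) * sum((yi - my)^2))
    Ranges from -1 (perfect inverse) to +1 (perfect positive association).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Invented illustration: heights (cm) and weights (kg) in a small sample
heights = [155, 160, 165, 170, 175, 180]
weights = [52, 58, 61, 67, 70, 78]
r = pearson_r(heights, weights)
```

Because the numerator pairs the deviations of the two measures, r is positive when large values of one tend to accompany large values of the other, and negative when they move in opposite directions.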
When using correlation procedures, a simple plot should be made and inspected first. In biology, truly linear associations are probably rare. Many interactions between factors may tend to a maximum, some are reciprocal, others could be logarithmic; forcing a straight line through such relationships is possible but illogical. The common correlation method used, Pearson's method, is only valid if at least one and preferably both of the measures are normally distributed. A dot plot of the association may show the distribution and prevent inappropriate use of the method. A dot plot can also prevent illogical conclusions. Figure 2 shows some of the patterns that may be found.
Correlation is generally only applicable if the variation in the measurements is uniform. Example B in Figure 2 shows a pattern that is not uncommon in biology, where the variability of the measures gets greater as the values increase. This feature (heteroscedasticity) makes simple analysis unwise, although there are means to correct for it. In some instances (E and F), a straight line is not the most appropriate way to describe the link; in some the relationship may be more complex (two different groups in panel G); and in others, the correlation may be the result of outlying points (panel H) that deserve careful consideration before conclusions are drawn.
Correlation is very frequently abused. If many variables are measured, and correlations sought, then some of them, often unrelated, will correlate by chance. In particular, possibly false associations may be drawn from a time series, such as changes in the prevalence of obesity and the manufacture of fashion clothing over a number of years. A correlation may result from data that are biased: if, for example, we only measured the frogs we could catch easily, we might not appreciate that Ohio frogs were even better jumpers. We have already considered how the effects of another factor such as sex (a covariate) could skew the measurements: finding young female frogs and older males could cause a false association. Correlation is not necessarily the best or only means to assess agreement between two methods of measurement. Mathematical linkage between two measurements causes spurious correlation (1). Obvious links could be a change in weight in relation to starting weight, or relating a part to the whole. Others are more subtle: for example, when oxygen delivery and oxygen consumption are both calculated using the same measure of cardiac output and of arterial oxygen content, a false association is generated (8).
'Equal' association in two measures, assessed by correlation, is not often sought in experimental situations. More often in the laboratory, measurements are considered to be affected by a factor that is not only quantifiable, but one that can be adjusted or predetermined, rather than randomly occurring. For this type of analysis, linear regression is often used. This should be approached as a separate statistical method from correlation, although it is often considered in the same chapter of the statistical textbooks.
Linear regression.
What is linear regression? When we considered analysis of variance, we attributed some of the variation in a measurement to factors that were classified as categories: the State where the frog was found, for example. In the theory of regression analysis, variation is attributed not to a specific category, but to an input factor that varies continuously. It aims to describe mathematically the association between the measured value and this input factor. Variation in the 'dependent' measure is explained in part by the magnitude of this input value, which is termed the 'independent' variable. In a simple regression analysis, the '% explained' is given by the square of the correlation coefficient, R², expressed as a percentage. Thus if R² were 0.7, in a simple regression analysis we can say that 70% of the variation in the dependent values is attributable to the independent value. This approach can be extended by considering several factors that could 'explain' the variation.
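The '% explained' can be made concrete with a short sketch. Here R² is computed as the fraction of the total variation in y removed by the fitted least-squares line; in simple regression this equals the square of the Pearson correlation coefficient. The numbers used in the test are invented, not the article's data.

```python
def r_squared(x, y):
    """R^2 for a simple least-squares line fitted to (x, y).

    Computed as 1 - SS_residual / SS_total, i.e. the fraction of the
    variation in y that is 'explained' by the fitted straight line.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx                    # least-squares slope
    intercept = my - slope * mx          # least-squares intercept
    ss_res = sum((yi - (slope * xi + intercept)) ** 2
                 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot
```

If the points lie exactly on a line, ss_res is zero and R² = 1; scattered points leave residual variation and pull R² toward zero.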
In our example, we wish to explore the association between origin and jumping ability. We could express the origin of the frogs in terms of the longitude of where they were found. We choose to use longitude as a convenient continuous variable to 'quantify' origin. (Strictly speaking, this violates one of the assumptions of regression, which is that the independent variable should be normally distributed. In our example we have data from three States, which do not meet this assumption, as can be seen from the data plotted in Figure 1C, showing the jump length of frogs found at different longitudes.) We wish to derive what association there is between jump length and the longitude that the frog comes from, so that we may attribute some of the variation in jump length to this factor. A general, simple equation can be based on the classic equation for a straight line. If the independent variable is X, and the dependent variable is Y, then

Y = mX + b

and in our example

jump length = m × longitude + b

The constants m and b are calculated to minimize the differences between the observed jumps and the jumps that would be predicted at that longitude. Indeed, the equation allows us to predict, to some extent, the value of the dependent variable, for a given value of the independent variable. Extrapolation is unwise: in our example, the relationship is only likely to apply between Ohio and California in the USA, not least because there are few frogs to be found in the Pacific Ocean. The mathematical background of regression is described by Curran-Everett (3).
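The least-squares calculation of m and b can be sketched in a few lines. The longitude and jump-length values below are invented purely to illustrate the mechanics; they are not the article's data, and the prediction at the end is only meaningful within the sampled range, as the text cautions.

```python
def fit_line(x, y):
    """Least-squares estimates of slope m and intercept b in y = m*x + b.

    m and b minimise the sum of squared differences between the observed
    y values and the values the fitted line predicts at each x.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx
    return slope, intercept

# Invented illustration: longitude (degrees W) and jump length (cm)
longitude = [82, 84, 97, 99, 118, 121]
jump_cm = [41, 44, 48, 47, 52, 55]
m, b = fit_line(longitude, jump_cm)
predicted_at_100w = m * 100 + b   # prediction *within* the sampled range only
```

The closed-form slope here, Σ(x − x̄)(y − ȳ)/Σ(x − x̄)², is exactly the quantity that minimises the vertical residuals; no iterative fitting is needed for a single predictor.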
One approach, often used in preliminary analysis, is to average the jumps from each State (as is done in ANOVA). ANOVA showed us that there was indeed a difference between Ohio and the other States. However, this approach neglects what we consider an important feature of the data, which is the ‘variation' in origin, and which is present in the samples for each State. In Figure 3A we show the calculated linear regression line with the 95% confidence limits of this line.
It is unlikely that we would have obtained these data if there were no difference between the distances that the frogs jumped, in relation to their longitude of origin. However, the R² value is small: although 12% of the variation in jump length can be explained by longitude, there remains a lot of variation at each longitude that cannot be attributed to this factor.
One of the reasons that we were able to pick out this small signal (longitude has a small effect) is that we used large samples, and travelled far to collect our frogs. A smaller sample, or a sample with less variation in longitude, might not have shown this effect. Figure 3B shows that using the same analysis within a single State, even a large one like Texas, fails to detect this effect. Clearly, caution is needed when interpreting such data: is it biologically plausible that longitude is important? Maybe it's the Great Divide, or genes, or rainfall, or latitude; correlation and regression do not automatically indicate cause and effect. Guidelines for reporting these tests are available (5).
The form of linear regression analysis we have just applied is almost universal, but is not always appropriate. Correlation considers variation in both measures by relating the pair of values in each set to their distance from the mean of the measures (Figure 4A). However, linear regression generally only considers variation in the dependent variable (plotted on the Y axis) and fits a line to minimize the difference between (in our example) jump length observed and jump length predicted (Figure 4B). With global positioning satellites, it may be justifiable to believe that we can estimate longitude exactly enough, but what if we had been using a sextant and a chronometer to measure longitude? We would then be less certain of the accuracy of longitude, and should use analysis in which both X and Y values can be considered to be variable. Unfortunately, biological signals may vary in both values, particularly if measurements are being compared. In such circumstances an alternative means of linear regression should be used (6).
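One such alternative that allows for variability in both X and Y is Deming regression; the sketch below assumes equal error variance in the two measures (lam = 1, the orthogonal-regression case). This is offered as an illustration of the idea, not necessarily the specific method of the cited reference (6), and the function name is ours.

```python
import math

def deming_slope(x, y, lam=1.0):
    """Slope of a Deming regression line fitted to (x, y).

    Unlike ordinary least squares, which minimises vertical distances
    only, Deming regression allows for measurement error in both x and y.
    lam is the assumed ratio of the y to x error variances (1.0 treats
    the two measures as equally noisy).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return (syy - lam * sxx
            + math.sqrt((syy - lam * sxx) ** 2 + 4 * lam * sxy ** 2)) / (2 * sxy)
```

With lam = 1 the fit is symmetric: swapping which measurement is plotted on X and which on Y simply inverts the slope, which is exactly the property one wants when neither variable can claim to be the error-free 'independent' one.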
DISCLOSURES
No conflicts of interest, financial or otherwise, are declared by the author(s).
AUTHOR CONTRIBUTIONS
G.B.D. and S.L.V. prepared figures; G.B.D. and S.L.V. drafted manuscript; G.B.D. and S.L.V. edited and revised manuscript; G.B.D. and S.L.V. approved final version of manuscript.
Footnotes

This article is covered by a nonexclusive license between the authors and the Nutrition Society (London, UK) and is being simultaneously published in 2011 in The Journal of Physiology, Experimental Physiology, British Journal of Pharmacology, Advances in Physiology Education, Microcirculation, and Clinical and Experimental Pharmacology and Physiology as part of a collaborative initiative among the societies that represent these journals.
 Copyright © 2012 the American Physiological Society
Licensed under Creative Commons Attribution CC-BY 3.0: the American Physiological Society.