## Abstract

Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This seventh installment of *Explorations in Statistics* explores regression, a technique that estimates the nature of the relationship between two things for which we may only surmise a mechanistic or predictive connection. Regression helps us answer three questions: does some variable *Y* depend on another variable *X*; if so, what is the nature of the relationship between *Y* and *X*; and for some value of *X*, what value of *Y* do we predict? Residual plots are an essential component of a thorough regression analysis: they help us decide if our statistical regression model of the relationship between *Y* and *X* is appropriate.

- Sir Francis Galton
- R
- residual analysis
- residual plots
- George Udny Yule

This seventh paper in *Explorations in Statistics* (see Refs. 3–8) explores regression, a technique–like correlation–that can estimate the magnitude of a straight-line relationship between two variables. Unlike correlation, regression can also estimate more complicated relationships between two variables.

Some relationships are not estimated using regression but are defined by physical laws. Two classic examples: Einstein's equation *E* = *mc*^{2}, which defines the relationship among the energy *E* and the mass *m* of some body and the speed of light *c* (13), and Poiseuille's law

$$Q = \frac{\pi r^4 \, \Delta P}{8 \eta l},$$

which defines the flow *Q* of a fluid with viscosity η through a tube of radius *r* and length *l* to be proportional to the pressure drop Δ*P* across that length *l* (29).

In physiology, we are often interested in exploring the nature of the relationship between two things for which we cannot define a physical connection: we may want to investigate the relationship between alveolar surface area and body weight (33) or the relationship between venous capacitance and mean circulatory filling pressure in trout (22). Regression helps us do that.

### A Brief History of Regression

In our last exploration (7), we learned that Galton used regression in order to understand heredity (14–17). Galton conceived of regression, however, not as a tool with which to estimate the relationship between two variables but as a concept to account for the observation that variability in some characteristic–for example, the size of sweet peas–was stable from one generation to the next (7, 27, 28). Just as Karl Pearson refined correlation, so too did George Udny Yule advance regression (27, 28, 30–32).

In his early papers (30, 31), Yule wrote explicitly about solving his regression problems with a technique that minimized the sum of squared error terms: he used the method of least squares. The method of least squares, popularized by Legendre in 1805 (20),^{1} had been used as a tool that combined observations in order to determine the orbits of comets or the center of gravity of several celestial bodies (18, 23, 27, 28). It was the perfect complement to regression.

With this brief history, we are almost ready to begin our exploration of regression. First, we need to review the software we will use to help us learn about regression.

### R: Basic Operations

In the first paper (3) of this series, I summarized R (24) and outlined its installation. For this exploration, there is just one additional step: download Advances_Statistics_Code_Regr.R^{2} to your Advances folder.

#### To run R commands.

If you use a Mac, highlight the commands you want to submit and then press Command+Enter. If you use a PC, highlight the commands you want to submit, right-click, and then click Run line or selection. Or, highlight the commands you want to submit and then press Ctrl+R.

### Regression: an Overview

Suppose we want to explore the nature of the relationship between *Y*, the physiological thing^{3} we care about, and some variable *X*, a factor we believe might impact *Y*. To do this, we must posit a provisional idea–a statistical model–of the relationship between *Y* and *X*.

The most basic statistical model of the relationship between *Y* and *X* is a straight line, written formally as

$$Y = \beta_0 + \beta_1 X + \varepsilon, \tag{Eq. 1}$$

where β_{0} represents the *Y* intercept, the value of *Y* when *X* = 0, β_{1} represents the slope of the straight-line relationship between *Y* and *X*, the amount *Y* changes when *X* increases by 1 unit, and ε represents random error (Fig. 1). A more complicated model is

$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon, \tag{Eq. 2}$$

in which the β_{2}*X*^{2} term gives curvature to the relationship between *Y* and *X* (see Fig. 1).^{4}

Despite the curved contour depicted by the second model (*Eq. 2*), each of these models defines a problem in linear regression: the adjective *linear* refers not to the contour of the relationship between *Y* and *X* but to the role of the coefficients β_{0}, β_{1}, and β_{2}. In my course, I say that in linear regression there is nothing fancy about the role of the coefficients.

In nonlinear regression, however, the coefficients play a role other than the generic one typified by *Eqs. 1* and *2*. The model

$$Y = \beta_0 e^{\beta_1 X} + \varepsilon,$$

for example, is nonlinear: the coefficient β_{1} appears in an exponent.
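
To see what an iterative nonlinear fit looks like in practice, here is a minimal Python sketch with SciPy's `curve_fit`; the exponential form, parameter values, and simulated data are illustrative assumptions, not taken from the article:

```python
import numpy as np
from scipy.optimize import curve_fit

# Assumed nonlinear model: Y = beta0 * exp(beta1 * X) + error.
# Because beta1 sits inside an exponent, the fit must be iterative.
def model(x, beta0, beta1):
    return beta0 * np.exp(beta1 * x)

rng = np.random.default_rng(5)
x = np.linspace(0.0, 2.0, 40)
y = model(x, 2.0, 1.5) + rng.normal(0.0, 0.2, x.size)  # true beta0 = 2, beta1 = 1.5

# curve_fit iterates from the starting guesses p0 to the estimates
(b0, b1), _ = curve_fit(model, x, y, p0=[1.0, 1.0])
print(b0, b1)
```

The starting guesses `p0` matter in nonlinear regression: a poor guess can leave the iterative routine stranded far from the least-squares solution.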

To further simplify our lives, suppose the true relationship between *Y* and *X* is a straight line (see *Eq. 1*). In our early explorations (3–5), we drew observations at random from our population in order to estimate μ, the mean of that population. In regression, we want to estimate the true relationship between *Y* and *X*: that is, we want to estimate the coefficients β_{0} and β_{1}. To do this, we choose levels of *X* and measure the response *Y*. Just as our sample observations differed because the underlying population was distributed over a range of possible values (3–5), so also do values of *Y* at each level of *X* (Fig. 2).

Suppose the true relationship between *Y* and *X* is defined by the straight line

$$Y = \beta_0 + \beta_1 X + \varepsilon, \tag{Eq. 3}$$

where β_{0} = 0, β_{1} = 1, and–at each level of *X*–the error ε is distributed normally with mean μ = 0 and standard deviation σ = 1 (Fig. 3). If we use regression in R to estimate the true relationship between *Y* and *X*, then we obtain the fitted regression equation

$$\hat{y} = -1.31 + 1.27x.$$

In this estimate of *Eq. 3*, *b*_{0} and *b*_{1} estimate the population parameters β_{0} and β_{1}. Your values will differ slightly.
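
The article runs this simulation in R; as an illustrative analogue, here is a minimal Python sketch (the levels of *X*, the number of replicates, and the random seed are my choices, not the article's):

```python
import numpy as np

rng = np.random.default_rng(0)

# True model (Eq. 3): Y = 0 + 1*X + eps, with eps ~ Normal(0, 1)
x = np.repeat(np.arange(1.0, 11.0), 3)   # 10 chosen levels of X, 3 replicates each
y = 0.0 + 1.0 * x + rng.normal(0.0, 1.0, x.size)

# Straight-line least-squares fit: the analogue of R's lm(y ~ x)
b1, b0 = np.polyfit(x, y, deg=1)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")   # estimates of beta0 and beta1
```

Each run of the simulation yields slightly different estimates: *b*_{0} and *b*_{1} are statistics with their own sampling variability.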

Now the question is, how on earth did R produce the estimates *b*_{0} = −1.31 and *b*_{1} = 1.27? The answer: by the method of least squares. For *observation i*, the error *e*_{i} is the difference between the observed value *y*_{i} and the predicted value *ŷ*_{i}:

$$e_i = y_i - \hat{y}_i. \tag{Eq. 4}$$

The squared error^{5} for *observation i* is *e*_{i}^{2}. The method of least squares finds the values of *b*_{0} and *b*_{1} such that the squared error for all *n* observations,

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,$$

is as small as possible. The resulting estimates are

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \quad \text{and} \quad b_0 = \bar{y} - b_1 \bar{x},$$

where *ȳ* and *x̄* are the average values of *y* and *x* for the *n* observations.
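
The closed-form least-squares solutions can be verified numerically; a short Python check (the simulated data and seed are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.repeat(np.arange(1.0, 11.0), 3)
y = x + rng.normal(0.0, 1.0, x.size)     # Eq. 3 with beta0 = 0, beta1 = 1

xbar, ybar = x.mean(), y.mean()

# Least-squares estimates from the closed-form formulas
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

# They agree with a general-purpose least-squares routine
b1_check, b0_check = np.polyfit(x, y, deg=1)
print(b0, b1)
```

The agreement is no accident: for a straight-line model, `polyfit` minimizes the same sum of squared errors the formulas solve exactly.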

At this point, we have estimated the true relationship between *Y* and *X*: when *X* increases by 1 unit, *Y*, the physiological thing we really care about, increases by 1.27 ≐ 1.3 units. Before the experiment, we would have constructed the null and alternative hypotheses, *H*_{0} and *H*_{1}, as
*H*_{0}: There is no relationship between *Y* and *X*.

*H*_{1}: There is a straight-line relationship between *Y* and *X*.

In our exploration of hypothesis tests (5), we discovered that a test statistic compares what we observe in some experiment to what we expect if the null hypothesis is true. How do we adapt this concept to regression? By using sums of squares.

For *observation i*, the distance between its measured value and the sample mean, *y*_{i} − *ȳ*, has two segments: the distance between the measured value and the value predicted by regression, *y*_{i} − *ŷ*_{i}, and the distance between the value predicted by regression and the sample mean, *ŷ*_{i} − *ȳ* (Fig. 4). We can write this as

$$y_i - \bar{y} = \left( \hat{y}_i - \bar{y} \right) + \left( y_i - \hat{y}_i \right).$$

For all *n* observations, the total sum of squares partitions into a regression sum of squares and a residual sum of squares:

$$\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 = \sum_{i=1}^{n} \left( \hat{y}_i - \bar{y} \right)^2 + \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2.$$

The regression sum of squares represents variation in the response that is accounted for by the regression model. The residual sum of squares represents variation in the response that is unaccounted for: it represents the error of the regression model. Each sum of squares has associated with it some number of degrees of freedom: *k*, the number of regression coefficients other than β_{0}, and *n* − *k* − 1, the error degrees of freedom. When we divide a sum of squares by its degrees of freedom, we get a mean square.^{6}

One test statistic with which we can assess whether our observations are consistent with no relationship between *Y* and *X* is *F*, which, in this situation, is the ratio of the regression mean square to the residual mean square:

$$F = \frac{\text{regression SS} / k}{\text{residual SS} / (n - k - 1)}.$$

If the value of *F* is large, then we doubt the null hypothesis that there is no relationship between *Y* and *X*.

In our simulation, *F* = 102.98.

How do we interpret an *F* value of 102.98? We interpret it within the context of a true null hypothesis: if the null hypothesis is true–if there is no relationship between *Y* and *X*–how usual is this value of *F*? The answer is, not very. If the null hypothesis is true, we expect to observe a value of *F* at least as big as 102.98 just 8 times in 1,000,000 (*P* = 0.000008).^{7} Your values will differ slightly.
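
The sum-of-squares partition and the *F* ratio can be computed directly; here is a Python sketch under the same assumed simulation (your *F* and *P* will differ from the article's 102.98 and 0.000008):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = np.repeat(np.arange(1.0, 11.0), 3)
y = x + rng.normal(0.0, 1.0, x.size)

b1, b0 = np.polyfit(x, y, deg=1)
yhat = b0 + b1 * x
n, k = y.size, 1                            # k = 1 coefficient other than beta0

ss_total = np.sum((y - y.mean()) ** 2)      # total SS
ss_regr  = np.sum((yhat - y.mean()) ** 2)   # regression SS
ss_resid = np.sum((y - yhat) ** 2)          # residual SS

# F = regression mean square / residual mean square
F = (ss_regr / k) / (ss_resid / (n - k - 1))
p = stats.f.sf(F, k, n - k - 1)             # upper-tail P value
print(F, p)
```

The partition is exact: total SS equals regression SS plus residual SS, which is what makes the *F* ratio a clean comparison of explained to unexplained variation.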

We now have convincing statistical evidence of a relationship between *Y* and *X*. As satisfying as that might be, we really want to ask, is the relationship between *Y* and *X* of potential scientific importance (4, 7, 9, 11)? One way we can answer that question is to compute the statistic *R*^{2},

$$R^2 = \frac{\text{regression SS}}{\text{total SS}},$$

the proportion of variation in the response *Y* that is accounted for by the fitted regression equation.

In our simulation, *R*^{2} = 0.93: the fitted regression equation accounts for 93% of the variation in the response *Y*. Whether you are impressed by a relationship that accounts for 93% of the variation in some response will depend, in part, on scientific context.
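
*R*^{2} follows directly from the same sums of squares; a minimal Python sketch (simulated data, so the value will hover near, not equal, the article's 0.93):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.repeat(np.arange(1.0, 11.0), 3)
y = x + rng.normal(0.0, 1.0, x.size)

b1, b0 = np.polyfit(x, y, deg=1)
yhat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)
ss_regr  = np.sum((yhat - y.mean()) ** 2)

# Proportion of variation in the response accounted for by the fit
r_squared = ss_regr / ss_total
print(round(r_squared, 2))
```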

Although we have explored the concepts of statistical significance and scientific importance using a straight-line relationship between *Y* and *X* (see *Eq. 1*), these concepts apply to any kind of problem in linear regression.

### A Classic Example in Regression

Now that we have a sense of how to assess the estimated relationship between *Y* and *X*, we might think we are pretty much done with our exploration of basic linear regression. We are not.

Imagine this scenario: a neurological syndrome results from impaired production of some neurotransmitter.^{8} *Drugs A* and *B*, derived from the same parent compound, boost production of this neurotransmitter. One of the drugs stimulates neurotransmitter production over its entire therapeutic range. At higher doses, the second drug becomes less effective and causes neurotoxicity. Table 1 lists the drug concentrations, *x*, and the measured increases in neurotransmitter production, *y*. If we rely solely on the regression statistics in Table 1, which drug is which? If we are unfortunate and happen to have this hypothetical syndrome, then our choice assumes added importance.

From the regression statistics alone, we cannot differentiate the drugs. Their identities are plain when the data are plotted (Fig. 5). *Drug A* increases neurotransmitter production over the entire range of drug concentrations. The increase in neurotransmitter production begins to fall at higher concentrations of *drug B*.

It is clear that a scatterplot of the original observations (see Fig. 5) is essential to a careful regression analysis. Another set of data graphics is also essential: residual plots.

### Residual Plots

Residual plots help us decide if our provisional statistical model of the relationship between *Y* and *X* is appropriate (12). For *observation i*, the residual *e*_{i} is the difference between the observed value *y*_{i} and the value *ŷ*_{i} predicted by the regression model (see *Eq. 4*). Typical residual plots include the residuals plotted against *ŷ*, the predicted values, or against *x*, the values of some predictor variable. If our statistical model is appropriate, then there is no obvious pattern to the residuals (Fig. 6).

Residual plots confirm that the first-order model of *Eq. 1* is appropriate for *drug A* (Fig. 7) but inappropriate for *drug B* (Fig. 8).
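
A numerical sketch of why the residual plot exposes *drug B*: the data below are hypothetical (a noise-free quadratic, not the article's Table 1), but the tell-tale arch in the residuals is the same.

```python
import numpy as np

# Hypothetical drug-B-like response: the benefit falls off at high doses
x = np.arange(1.0, 11.0)
y = 2.0 * x - 0.15 * x ** 2      # curved truth; noise omitted for clarity

# Fit the (inappropriate) first-order model and examine the residuals
b1, b0 = np.polyfit(x, y, deg=1)
resid = y - (b0 + b1 * x)

# Systematic pattern: residuals are negative at the extremes of x
# and positive in the middle, an arch that betrays the missing curvature
print(np.sign(resid[[0, 4, 9]]))
```

The regression statistics for this fit would look respectable; only the residuals, or the original scatterplot, reveal that the straight-line model is wrong.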

### Practical Considerations

As we have just seen, in a typical regression analysis, we choose the levels of *X* and measure the response *Y*. In contrast, we discovered that correlation is useful only when the predictor variable *X* varies randomly (7). This raises a practical question: what happens if we want to estimate a straight-line relationship between *Y* and *X* in a situation where we allow *X* to vary randomly or in a situation where there is measurement error in *X*? If we use the regression approach we have explored here, we will obtain misleading estimates of the slope β_{1}, the amount *Y* changes when *X* increases by 1 unit, and *R*^{2}, the proportion of variation in the response *Y* that is accounted for by the fitted regression equation (see Fig. 9 and the appendix). That is the bad news. The good news is that all is not lost: there are regression techniques we can use to estimate the relationship between *Y* and *X* when we do not control *X* (see Refs. 21 and 26).

### Summary

As this exploration has demonstrated, regression estimates the nature of a relationship between variables even when we can only guess at the actual connection between them. Regression helps us answer these questions (25):
- Does some variable *Y* depend on another variable *X*?
- If so, what is the nature of the relationship between *Y* and *X*?
- For some value of *X*, what value of *Y* do we predict?

In my statistics course, I announce that there are four rules for any statistical analysis:

*1.* Plot the data.
*2.* Study the data.
*3.* Analyze the data.
*4.* Analyze the analysis.

In regression, *rules 1* and *2* provide a sense of our provisional statistical model.^{9} *Rule 3* is the regression analysis itself. *Rule 4* is the residual analysis: the residual plots. Residual plots help us decide if our provisional statistical model is appropriate; they are essential to a thorough regression analysis.

In the next installment of this series, we will explore permutation tests. A permutation test, like the bootstrap (6), provides an empirical approach with which we can make an inference about some experimental result when the statistical theory is uncertain or even unknown.

## APPENDIX

If the observed value of some predictor variable *X* includes measurement error, then our estimate of the relationship between the response *Y* and *X* will be affected. Although the actual nature of the effect is complex (2), we can see one facet of the phenomenon using the first-order model *Y* = β_{0} + β_{1}*X* + ε (*Eq. 1*).

Suppose the predictor variable *X* is distributed normally with some mean and a standard deviation σ_{X}. Suppose also that the random error associated with the measurement of *X* is distributed normally with a mean of 0 and a standard deviation σ_{ξ}. In this setting, the statistical model of the relationship between *Y* and *X* is not *Eq. 1* but

$$Y = \gamma_0 + \gamma_1 X + \varepsilon,$$

a straight line with intercept γ_{0} but where the slope is attenuated:

$$\gamma_1 = \beta_1 \, \frac{\sigma_X^2}{\sigma_X^2 + \sigma_\xi^2}.$$

Because σ_{X}^{2}/(σ_{X}^{2} + σ_{ξ}^{2}) < 1, the estimated magnitude of the straight-line relationship between *Y* and *X* will be diminished.

If *X* includes measurement error, then there is also more variability in the observed values of the response *Y*. At any measured value of *X*, the variance of the response *Y*, Var{*Y*}, is

$$\text{Var}\{Y\} = \sigma^2 + \beta_1^2 \, \frac{\sigma_X^2 \sigma_\xi^2}{\sigma_X^2 + \sigma_\xi^2},$$

which exceeds σ^{2}, the error variance of *Eq. 1*.
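
The attenuation described in this appendix is easy to demonstrate by simulation; a Python sketch (the parameter values and sample size are my choices):

```python
import numpy as np

rng = np.random.default_rng(4)

beta0, beta1 = 0.0, 1.0
sigma_x, sigma_xi = 2.0, 1.0       # SDs of X and of its measurement error
sigma = 0.5                        # SD of the random error in Eq. 1
n = 100_000

x_true = rng.normal(10.0, sigma_x, n)
x_obs = x_true + rng.normal(0.0, sigma_xi, n)   # X measured with error
y = beta0 + beta1 * x_true + rng.normal(0.0, sigma, n)

# Regressing Y on the error-laden X attenuates the slope
b1, b0 = np.polyfit(x_obs, y, deg=1)
lam = sigma_x**2 / (sigma_x**2 + sigma_xi**2)   # attenuation factor (0.8 here)
print(b1, beta1 * lam)
```

With these values, the estimated slope settles near 0.8 rather than the true β_{1} = 1: no amount of additional data removes the bias, because it is built into the error-laden predictor.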

## DISCLOSURES

No conflicts of interest, financial or otherwise, are declared by the author(s).

## ACKNOWLEDGMENTS

I dedicate this exploration to the memory of Dale Benos. Although Dale and I never met, we talked as if we had known each other for years. I am pleased that my name will remain linked with Dale's by virtue of our collaborations and by our efforts to improve the use and reporting of statistics within physiology.

I thank Gerald DiBona (University of Iowa College of Medicine, Iowa City, IA) and John Ludbrook (Department of Surgery, The University of Melbourne, Melbourne, Victoria, Australia) for their helpful comments and suggestions.

## Footnotes

↵1 In 1822, Harvey published the first English translation of Legendre's treatise (19).

↵2 This file is available through the Supplemental Material link for this article at the *Advances in Physiology Education* website.

↵3 For example, l-ascorbic acid transport, differential gene expression, TNF-α, or venous capacitance in trout (see Ref. 3).

↵4 *Equations 1* and *2* are known also as first- and second-order polynomial models.

↵5 This error is also called the residual.

↵6 In statistical output, sums of squares are often abbreviated SS.

↵7 If we wanted to report this *P* value, we would report it as *P* < 0.001 (10).

↵8 I have used this scenario before (11), and I use it in my course.

↵9 Had we considered the scatterplot for *drug B* (see Fig. 5), we would have dismissed a first-order model for these data.

- Copyright © 2011 The American Physiological Society