## Abstract

Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This fourth installment of *Explorations in Statistics* explores the bootstrap. The bootstrap gives us an empirical approach to estimate the theoretical variability among possible values of a sample statistic such as the sample mean. The appeal of the bootstrap is that we can use it to make an inference about some experimental result when the statistical theory is uncertain or even unknown. We can also use the bootstrap to assess how well the statistical theory holds: that is, whether an inference we make from a hypothesis test or confidence interval is justified.

- Central Limit Theorem
- R
- sample mean
- software
- standard error

This fourth paper in *Explorations in Statistics* (see Refs. 3–5) explores the bootstrap, a recent development in statistics (8, 11–16) that evolved from the jackknife (25, 28, 36). Despite its brief history, the bootstrap is discussed in textbooks of statistics (26) and used in manuscripts published by the American Physiological Society (2, 10, 17–24, 35, 37).

The bootstrap^{1} gives us an empirical approach to estimate the theoretical variability among possible values of a sample statistic.^{2} In our previous explorations (3–5) we calculated the standard error of the mean, our estimate of the theoretical variability among sample means, as SE{*ȳ*} = *s*/√*n*, where *s* was the sample standard deviation and *n* was the number of observations in the sample. This computation was grounded in theory (7). In our last exploration (4) we derived a confidence interval for the population mean. This too was grounded in theory. The beauty of the bootstrap is that it can transcend theory: we can use the bootstrap to make an inference about some experimental result when the theory is uncertain or even unknown (14, 26). We can also use the bootstrap to assess how well the theory holds: that is, whether an inference we make from a hypothesis test or confidence interval is justified.
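As a quick numerical sketch of the standard error formula (written in Python rather than the article's R, with made-up observations, not the article's data), SE{*ȳ*} is just the sample standard deviation divided by the square root of the sample size:

```python
import math
import statistics

# Hypothetical sample of n = 9 observations (placeholders, not the article's data).
y = [0.4, 1.1, -0.3, 0.8, 1.8, -0.1, 0.6, 1.2, 0.2]

n = len(y)
s = statistics.stdev(y)   # sample standard deviation (n - 1 in the denominator)
se = s / math.sqrt(n)     # SE{y-bar} = s / sqrt(n)

print(round(se, 4))
```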

### R: Basic Operations

In the first article (3) of this series, I summarized R (29) and outlined its installation.^{3} For this exploration, there are two additional steps: download Advances_Statistics_Code_Boot.R^{4} to your Advances folder and install the extra package boot.

To install boot, open R and then click Packages | Install package(s)….^{5} Select a CRAN mirror^{6} close to your location and then click OK. Select boot and then click OK. When you have installed boot, you will see

`package ‘boot’ successfully unpacked and MD5 sums checked` in the R Console.

#### To run R commands.

If you use a Mac, highlight the commands you want to submit and then press Command+Enter. If you use a PC, highlight the commands you want to submit, right-click, and then click Run line or selection. Or highlight the commands you want to submit and then press Ctrl+R.

### The Simulation: Data for the Bootstrap

In our early explorations (3–5) we drew random samples of 9 observations from a standard normal distribution with mean μ = 0 and standard deviation σ = 1. These were the observations, the data, for *samples 1*, *2*, and *1000*:

Each time we drew a random sample we calculated some sample statistics. These were the statistics for *samples 1*, *2*, and *1000*:

In contrast to our early explorations that used 1000 samples from our standard normal population, our exploration of the bootstrap uses just the 9 observations from *sample 1*:
In our previous exploration (4) we used these observations to calculate a 90% confidence interval for the population mean:

With this brief review of the observations from our first sample, we are ready to explore the bootstrap.

### The Bootstrap

In our previous exploration (4) we used the standard error of the mean, our estimate of the theoretical variability among possible values of the sample mean, to calculate a confidence interval for the mean of the underlying population. Rather than use theory (7) to develop the notion of the standard error of the mean, we used a simulation: we drew 1000 random samples from our population and, for each sample, we calculated the mean (3). When we treated those 1000 sample means as observations, we calculated their standard deviation SD{*ȳ*}, our empirical estimate of the standard error SE{*ȳ*}. The bootstrap estimates the standard error of a statistic using not a whole bunch of random samples from some theoretical population but the actual sample observations.

Suppose we want to bootstrap the mean using the observations from the first sample: 0.422, 1.103, … , 1.825. How do we do this? We draw at random, with replacement,^{7} a sample of size 9 from these 9 actual observations (Table 1). We then repeat this process until we have drawn a total of *b* bootstrap samples.^{8} For each bootstrap sample, we calculate its mean. We use the notation *ȳ**_{*j*} to denote the mean of bootstrap sample *j* (11, 16).
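The resampling step can be sketched as follows (in Python rather than the article's R; the middle observations of *sample 1* are elided in the text, so the values below are placeholders for illustration):

```python
import random
import statistics

# Stand-ins for the 9 observations of sample 1: the text gives only
# 0.422, 1.103, ..., 1.825, so the middle values here are invented.
sample = [0.422, 1.103, 0.950, 0.712, 1.327, 0.284, 0.561, 1.604, 1.825]

b = 10_000                       # number of bootstrap replications
random.seed(1)

# Each bootstrap sample: draw n = 9 values WITH replacement from the
# actual observations, then record its mean (the replication y-bar*_j).
boot_means = [
    statistics.mean(random.choices(sample, k=len(sample)))
    for _ in range(b)
]

print(len(boot_means))
```

Because sampling is with replacement, any observation can appear more than once (or not at all) in a given bootstrap sample, exactly as footnote 7 describes.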

Suppose we generate 10,000 bootstrap replications of the sample mean (Fig. 1). If we treat these 10,000 sample means as observations, we can calculate their average Ave{*ȳ**} and standard deviation SD{*ȳ**}. The standard deviation SD{*ȳ**} describes the variability among the *b* bootstrap means and estimates the standard deviation of the theoretical distribution of the sample mean (16). The commands in *lines 85–86* of Advances_Statistics_Code_Boot.R return these values. Your values will differ slightly.

We can do more than just calculate Ave{*ȳ**} and SD{*ȳ**}: we can assess whether the distribution of these 10,000 bootstrap means is consistent with a normal distribution. How? By using a normal quantile plot (Fig. 2). It turns out that these 10,000 bootstrap means are not consistent with a normal distribution (Fig. 3). On the other hand, 10,000 bootstrap replications of the sample mean using the observations from *sample 1000* are consistent with a normal distribution (Fig. 4).
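A rough way to automate what a normal quantile plot shows: sort the bootstrap means, pair them with standard normal quantiles, and see how close the pairs fall to a straight line. A sketch (in Python rather than R, with simulated stand-in data, not the article's bootstrap means):

```python
import math
import random
import statistics

random.seed(1)

# Stand-in "bootstrap means": here drawn from a normal distribution,
# so the quantile plot should be close to a straight line.
boot_means = sorted(random.gauss(0.9, 0.2) for _ in range(1_000))

n = len(boot_means)
nd = statistics.NormalDist()    # standard normal, for theoretical quantiles

# Theoretical normal quantiles at plotting positions (i + 0.5)/n.
q = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]

# Correlation between the sorted data and the normal quantiles:
# near 1 when the data are consistent with a normal distribution.
mx, my = statistics.mean(q), statistics.mean(boot_means)
r = (sum((a - mx) * (b - my) for a, b in zip(q, boot_means))
     / math.sqrt(sum((a - mx) ** 2 for a in q)
                 * sum((b - my) ** 2 for b in boot_means)))
print(round(r, 3))
```

A markedly lower correlation, as with the skewed bootstrap means of Fig. 3, signals that the normal approximation is suspect.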

And last, we can use bootstrap replications of the sample mean to estimate different kinds of confidence intervals for the population mean: for example, normal-theory, percentile, and bias-corrected-and-accelerated confidence intervals (16, 26).

#### Normal-theory confidence interval.

In our last exploration (4) we calculated a 100(1 − α)% confidence interval for the population mean as

[*ȳ* − *a*, *ȳ* + *a*],

where the allowance *a* was

*a* = *z*_{α/2} · SD{*ȳ*},

where *z*_{α/2} is the 100[1 − (α/2)]th percentile from the standard normal distribution and SD{*ȳ*} is the standard deviation of the sample means.

A normal-theory confidence interval based on bootstrap replications of the sample mean is similar. We just replace *ȳ* with Ave{*ȳ**}, *a* with *a**, and SD{*ȳ*} with SD{*ȳ**}:

[Ave{*ȳ**} − *a**, Ave{*ȳ**} + *a**],

where the allowance *a** is

*a** = *z*_{α/2} · SD{*ȳ**}.

Suppose we want to calculate a 90% bootstrap confidence interval for the population mean. In this situation, α = 0.10 and *z*_{α/2} = 1.645. Therefore, the allowance *a** is *a** = 1.645 · SD{*ȳ**}. The commands in *lines 196–199* of Advances_Statistics_Code_Boot.R return these values. Your values will differ slightly.
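Putting the pieces together, a normal-theory bootstrap interval can be sketched like this (in Python rather than the article's R, where boot.ci() in the boot package does this work; the data below are placeholders, not the article's sample):

```python
import random
import statistics

# Placeholder data standing in for sample 1 (middle values are invented).
sample = [0.422, 1.103, 0.950, 0.712, 1.327, 0.284, 0.561, 1.604, 1.825]

random.seed(1)
boot_means = [statistics.mean(random.choices(sample, k=len(sample)))
              for _ in range(10_000)]

ave = statistics.mean(boot_means)     # Ave{y-bar*}
sd = statistics.stdev(boot_means)     # SD{y-bar*}

z = 1.645                             # z_{alpha/2} for a 90% interval
a_star = z * sd                       # the allowance a*
ci = (ave - a_star, ave + a_star)     # normal-theory bootstrap interval

print(ci)
```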

#### Percentile confidence interval.

In the bootstrap distribution of 10,000 sample means (see Fig. 1), 90% of the means are covered by the interval [0.43, 1.14], bounded by the 5th and 95th percentiles of the bootstrap distribution. The commands in *lines 196–199* of Advances_Statistics_Code_Boot.R return these values.
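The percentile interval needs nothing more than sorting the bootstrap replications and reading off the 5th and 95th percentiles. A sketch (in Python rather than R, with placeholder data standing in for sample 1):

```python
import random
import statistics

# Placeholder data standing in for the article's sample 1.
sample = [0.422, 1.103, 0.950, 0.712, 1.327, 0.284, 0.561, 1.604, 1.825]

random.seed(1)
boot_means = sorted(
    statistics.mean(random.choices(sample, k=len(sample)))
    for _ in range(10_000)
)

# 90% percentile interval: the 5th and 95th percentiles of the
# sorted bootstrap distribution.
lo = boot_means[int(0.05 * len(boot_means))]
hi = boot_means[int(0.95 * len(boot_means)) - 1]

print((lo, hi))
```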

#### Bias-corrected-and-accelerated confidence interval.

If the number of observations in the actual sample is too small, if the average Ave{*ȳ**} of the bootstrap sample means differs from the sample mean *ȳ*, or if the bootstrap distribution of the sample mean is skewed, then normal-theory and percentile confidence intervals are likely to be inaccurate (16, 26). A bias-corrected-and-accelerated confidence interval adjusts percentiles of the bootstrap distribution to account for bias and skewness (16, 26). In this exploration, the 90% bias-corrected-and-accelerated confidence interval, returned by the commands in *lines 196–199* of Advances_Statistics_Code_Boot.R, is shifted slightly to the left of the percentile interval of [0.43, 1.14]. In many situations, a bias-corrected-and-accelerated confidence interval will provide a more accurate estimate of our uncertainty about the true value of a population parameter such as the mean (16, 26).

#### Limitations.

As useful as the bootstrap is, it cannot always salvage the statistical analysis of a small sample. Why not? If the sample is too small, then it may be atypical of the underlying population. When this happens, the bootstrap distribution will not mirror the theoretical distribution of the sample statistic. The trouble is, it may not be obvious how small too small is. A statistician (see Ref. 6, *guideline 1*) and an estimate of power can help.

In our second and third explorations (4, 5) we concluded that the observations from *sample 1* were consistent with having come from a population that had a mean other than 0. Only because we had defined the underlying population (see Ref. 3) could we have known that we had erred. When we do a single experiment, we can have enough unusual observations so that it just appears the observations came from a different population (5). A small sample size exacerbates the potential for this phenomenon. This is what happened when we bootstrapped the sample mean using observations from the first sample.

We know the theoretical distribution of the sample mean is exactly normal (3), but the distribution of those bootstrap means is clearly not normal (see Fig. 3). This happens because the observations from the first sample are atypical of the underlying population. The message? All bets are off if you bootstrap a sample statistic using observations from a sample that is too small.

### The Bootstrap in Data Transformation

In our previous exploration (4) we delved into confidence intervals by drawing random samples from a normal distribution. That we chose a normal distribution for our population was no accident. For normal-theory confidence intervals to be meaningful, one thing must be approximately true: if the random variable *Y* represents the physiological thing we care about, then the theoretical distribution of the sample mean *Ȳ* with *n* observations must be distributed normally with mean μ and standard deviation σ/√*n*.^{9} In our simulations (3–5) this assumption was satisfied exactly (7). In a real experiment we never know if this assumption is satisfied, but we know it will be satisfied at least roughly, regardless of the population distribution from which the sample observations came, as long as the sample size *n* is big enough (27).

What happens, however, if we doubt that the theoretical distribution of the sample mean is consistent with a normal distribution? Our suspicion would be natural: this distribution is, after all, a theoretical one. One approach is to transform the sample observations. Common transformations include the logarithm and the inverse. Box and Cox (1) described a family of power transformations in which an observed variable *y* is transformed into the variable *w* using the parameter λ:

*w* = (*y*^{λ} − 1)/λ if λ ≠ 0, or *w* = ln *y* if λ = 0.

Draper and Smith (9) summarized the steps needed to estimate λ and its approximate 100(1 − α)% confidence interval. These steps include, for each trial value of λ, the calculation of the maximum likelihood *ℓ*_{max}:

*ℓ*_{max} = −(*n*/2) ln (SS_{residual}/*n*), (*Eq. 1*)

where SS_{residual} is the residual sum of squares in the fit of a general linear model to the observations. The optimum estimate of λ maximizes *ℓ*_{max} (Fig. 5).
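Assuming a mean-only linear model, the scan over trial values of λ can be sketched like this (in Python rather than R; the observations are invented, and the normalized Box-Cox form with the geometric mean is used so that *ℓ*_{max} values are comparable across λ):

```python
import math

# Hypothetical positive-valued observations (not the C-reactive protein data).
y = [1.2, 3.5, 0.8, 2.2, 5.1, 1.9, 0.6, 4.4, 2.8, 1.1]

n = len(y)
g = math.exp(sum(math.log(v) for v in y) / n)   # geometric mean of y

def l_max(lam):
    # Normalized Box-Cox transform, so likelihoods are comparable
    # across trial values of lambda.
    if abs(lam) < 1e-12:
        w = [g * math.log(v) for v in y]
    else:
        w = [(v**lam - 1) / (lam * g**(lam - 1)) for v in y]
    wbar = sum(w) / n
    ss_residual = sum((v - wbar) ** 2 for v in w)   # mean-only model
    return -(n / 2) * math.log(ss_residual / n)

# Scan trial values of lambda; the optimum maximizes l_max.
lams = [i / 10 for i in range(-20, 21)]
best = max(lams, key=l_max)
print(best)
```

Plotting l_max against the trial values of λ reproduces the kind of profile shown in Fig. 5.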

Without question, transformation can be useful (1, 9). But really, who wants to identify the optimum transformation by solving *Eq. 1*? The bootstrap provides another way.

The textbook I use in my statistics course provides the data: measurements of C-reactive protein (Table 2). *Problem 7.26* (Ref. 26, p. 442–443) asks students to study the distribution of these 40 observations and to calculate a 95% confidence interval for the true C-reactive protein mean. The real value of *problem 7.26* is that it then asks students if a confidence interval is even appropriate for these data.

Cursory examinations of the histogram and normal quantile plot reveal that the C-reactive protein values are skewed and inconsistent with a normal distribution (Fig. 6, *top*). Still, if the theoretical distribution of the sample mean is roughly normal, that is, if the Central Limit Theorem holds, then the 95% confidence interval [4.74, 15.33] mg/l will be meaningful. But we have an intractable problem: we have no way of knowing if the sample size of 40 is big enough for the theoretical distribution of the sample mean to be roughly normal.

*Problem 7.27* (Ref. 26, p. 443) tells students a log transformation^{10} decreases skewness and asks them to transform the actual C-reactive protein observations by adding 1 to each observation and taking the natural logarithm of that number: *w* = ln (*y* + 1). *Problem 7.27* then asks students to study the distribution of the 40 transformed observations and to calculate a 95% confidence interval for the true transformed C-reactive protein mean.

A histogram and normal quantile plot show that the transformed values are less skewed but still inconsistent with a normal distribution (Fig. 6, *bottom*). As my students tell me, “Better, but not great.” The 95% confidence interval [1.05, 1.94] reverts to [*e*^{1.05} − 1, *e*^{1.94} − 1] ≐ [1.86, 5.96] mg/l on the original scale.
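Back-transforming the interval endpoints is a one-liner: since *w* = ln (*y* + 1), the inverse is *y* = *e*^{*w*} − 1. A sketch in Python:

```python
import math

# 95% confidence interval on the transformed scale w = ln(y + 1), from the text.
lo_w, hi_w = 1.05, 1.94

# Invert the transformation: y = e^w - 1.
lo_y = math.exp(lo_w) - 1
hi_y = math.exp(hi_w) - 1

print((round(lo_y, 2), round(hi_y, 2)))   # roughly (1.86, 5.96) mg/l, as in the text
```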

At this point, we have another problem: we have assumed the log transformation ln (*y* + 1) is useful, that it gives us a meaningful confidence interval, but what evidence do we have that it really is? You guessed it: the bootstrap.

A bootstrap distribution estimates the theoretical distribution of some sample statistic. In this problem, the bootstrap sample means from the actual observations are inconsistent with a normal distribution (Fig. 7, *top*). This means the sample size of 40 is not big enough for the theoretical distribution of the sample mean to be roughly normal; as a result, the confidence interval [4.74, 15.33] mg/l is misleading. On the other hand, the bootstrap sample means from the transformed C-reactive protein observations are consistent with a normal distribution (Fig. 7, *bottom*): the confidence interval [1.05, 1.94] ≐ [1.86, 5.96] mg/l is a useful tool for inference.

### Summary

As this exploration has demonstrated, the bootstrap gives us an approach we can use to assess whether an inference we make from a normal-theory hypothesis test or confidence interval is justified: if the bootstrap distribution of a statistic such as the sample mean is roughly normal, then the inference is justified. But the bootstrap gives us more than that. We can use a bootstrap confidence interval to make an inference about some experimental result when the statistical theory is uncertain or even unknown (14, 26). Although we have explored the bootstrap using the sample mean, we can use the bootstrap to make an inference about other sample statistics such as the standard deviation or correlation coefficient.

In the next installment of this series, we will explore power, a concept we mentioned in our exploration of hypothesis tests (5). Power is the probability that we reject the null hypothesis given that the null hypothesis is false. The notion of power is integral to hypothesis testing, confidence intervals, and grant applications.

## ACKNOWLEDGMENTS

I thank John Ludbrook (Department of Surgery, The University of Melbourne, Victoria, Australia) and Matthew Strand (National Jewish Health, Denver, CO) for giving helpful comments and suggestions, Sarah Kareem (Department of English, University of California, Los Angeles, CA) for humoring my questions and for graciously providing a lot of information about Baron Munchausen, and Bernhard Wiebel (Munchausen Library, Zurich, Switzerland) for searching more than 80 English editions of the Baron's adventures for mention of bootstraps.

## APPENDIX

In 1993, Efron and Tibshirani (16) wrote that *bootstrap*, a term coined originally by Efron in 1979 (11), was inspired by the colloquialism *pull yourself up by your bootstraps*, a nonsensical notion generally attributed to an escapade of the fictitious Baron Munchausen (30). When I learned this, I embarked on a search for the escapade. With the generous assistance of Dr. Sarah Kareem (Department of English, University of California, Los Angeles, CA) and Bernhard Wiebel (Munchausen Library, Zurich, Switzerland), this is what I discovered.

The complete title of the Baron's adventures (30), written by Rudolf Erich Raspe, was *Baron Munchausen's Narrative of His Marvellous Travels and Campaigns in Russia. Humbly Dedicated and Recommended to Country Gentlemen; and, If They Please, to be Repeated as Their Own, After a Hunt at Horse Races, in Watering-Places, and Other Such Polite Assemblies; Round the Bottle and Fire-Side.*

In chapter VI, the Baron throws a silver hatchet at two bears in hopes of rescuing a bee.^{11} The hatchet misses both bears and ends up on the moon. To retrieve the hatchet, the Baron grows a Turkey-bean and, once it has attached itself to the moon, promptly climbs it. He finds his hatchet but then discovers the bean has dried up, so he braids a rope of straw to use for the descent. The Baron is partway down the straw rope when the next catastrophe hits. The details of the Baron's escape differ according to the account you happen to read:
> I was still a couple of miles in the clouds when it broke, and with such violence I fell to the ground that I found myself stunned, and in a hole nine fathoms under grass, when I recovered, hardly knowing how to get out again. There was no other way than to go home for a spade and to dig me out by slopes, which I fortunately accomplished, before I had been so much as missed by the steward.

Refs. 30 (1785), 33 (1948), and 34 (1952)

> I was four or five miles from the earth at least, when it broke; I fell to the ground with such amazing violence, that I found myself stunned, and in a hole nine fathoms deep at least, made by the weight of my body falling from so great a height: I recovered, but knew not how to get out again; however, I dug slopes or steps with my [finger] nails (the Baron's nails were then of forty years' growth), and easily accomplished it.

Refs. 31 (1786) and 32 (2001)

It was from this 9-fathom-deep hole that the Baron is rumored to have extricated himself by his bootstraps.

Although the notion of pulling yourself up by your bootstraps is entirely consistent with Baron Munchausen's flair for the dramatic, I failed to find any evidence that the Baron availed himself of this life-saving technique.

## Footnotes

1. The appendix reviews the origin of the name *bootstrap*.

2. For example, a sample mean, a sample standard deviation, or a sample correlation.

3. I developed the scripts for the early explorations (3–5) using R-2.6.2. I developed the script for this exploration using R-2.8.2 (deployed 22 Dec 2008), but it will run in R-2.6.2.

4. This file is available through the Supplemental Material for this article at the *Advances in Physiology Education* website.

5. The notation *click* A | B means *click* A, then *click* B.

6. CRAN stands for the Comprehensive R Archive Network. A mirror is a duplicate server.

7. After an observation is drawn from the pool of 9 values, it is returned to the pool. The consequence: the observation can appear more than once in a bootstrap sample.

8. The number of bootstrap replications can vary from 1,000 to 10,000 (11, 16, 26).

9. In our second exploration (5) we used the test statistic *t* to investigate hypothesis tests, test statistics, and *P* values. A *t* statistic shares this assumption.

10. The Box and Cox method identifies a log transformation of the C-reactive protein values as the optimal transformation (see Fig. 5).

11. It is a long story.

- Copyright © 2009 the American Physiological Society