STAYING CURRENT
Division of Biostatistics and Bioinformatics, National Jewish Health, and Department of Preventive Medicine and Biometrics and Department of Physiology and Biophysics, School of Medicine, University of Colorado Denver, Denver, Colorado
Address for reprint requests and other correspondence: D. Curran-Everett, Div. of Biostatistics and Bioinformatics, M222, National Jewish Health, 1400 Jackson St., Denver, CO 80206 (e-mail: EverettD{at}njc.org)
Abstract
Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This series in Advances in Physiology Education provides an opportunity to do just that: we will investigate basic concepts in statistics using the free software package R. Because this series uses R solely as a vehicle with which to explore basic concepts in statistics, I provide the requisite R commands. In this inaugural paper we explore the essential distinction between standard deviation and standard error: a standard deviation estimates the variability among sample observations whereas a standard error of the mean estimates the variability among theoretical sample means. If we fail to report the standard deviation, then we fail to fully report our data. Because it incorporates information about sample size, the standard error of the mean is a misguided estimate of variability among observations. Instead, the standard error of the mean provides an estimate of the uncertainty of the true value of the population mean.
Key words: R; software; uncertainty; variability
IN 2004, Dale Benos and I proposed concise guidelines for reporting statistics (12).¹ We included a brief explanation or example to clarify each guideline, and we provided additional resources readers could use in concert with the guidelines. Nevertheless, we recognized that the guidelines–even with explanations and examples–were inadequate for readers who wanted to learn about fundamental concepts in statistics. Why are fundamental concepts in statistics important? They form the cornerstone of scientific inquiry. If we fail to understand these fundamental concepts, then the scientific conclusions we reach are more likely to be wrong. And wrong conclusions based on faulty reasoning are shoddy science (14).
Two series have been published in an effort to help readers learn about statistics: Altman's "Statistics and Ethics in Medical Research" (1–7) in the British Medical Journal and Healy's "Statistics from the Inside" (18–33) in the Archives of Disease in Childhood. These series are helpful, but, as you can imagine, they are entirely didactic. What do you expect, right? The problem is that most of us learn statistics like we learn science: by doing it.
When I teach and write about statistics, I want to engage my audience. To do this I use simulations as thought experiments my audience can see (11–14). From my perspective, the only thing better would be if my audience could run the simulations on their own. This series in Advances in Physiology Education provides an opportunity to do just that: we will investigate basic aspects of statistics using a free software package. Among the concepts we will examine in future installments are the concepts behind P values and confidence intervals. My goals are to provide a theoretical framework for and a vehicle with which to illustrate each concept.
In this inaugural paper we explore the distinction between standard deviation and standard error, a distinction that has been discussed already (8, 10, 12–14, 16, 17). Before we begin that exploration, however, we need to learn a little about the software we will use to help us learn about concepts in statistics.
R: Software to Explore Concepts
In my statistics course, I use the freeware package R (34). R is a system–a language and an environment–for statistical analysis and data graphics. The R environment is a command-line environment in which > represents the command line. You can submit an R command in two ways: type the command in the interactive R Console, or submit the command from a script.²
Because this series relies on R merely to explore fundamental concepts in statistics, I provide a script of the requisite R commands.
Regardless of whether you use a Mac or a PC, there are three preliminary steps to perform:
Installation and basic operations.
If you use a Mac, download R from
http://cran.us.r-project.org/bin/macosx/
After you have installed R, double-click on Advances_Statistics_Code.R to open it. To submit particular commands in Advances_Statistics_Code.R, highlight the commands you want to submit and then press
If you use a PC, download R from
http://cran.us.r-project.org/bin/windows/base/
After you have installed R, a shortcut for R will exist on your Desktop. To simplify the process of starting R from within your Advances folder, move this shortcut into your Advances folder, right-click on the shortcut, and then click on Properties. Paste the full address (path) of your Advances folder
C:\Documents and Settings\...\Desktop\Advances
into the Start in: location (Fig. 1) and then click OK. Now double-click on the R shortcut to open R. To open Advances_Statistics_Code.R, click File|Open script...⁴ or click the Open script icon
Basic syntax.
The script Advances_Statistics_Code.R contains comments in addition to commands. Comments define sections of the script and explain many of the commands. The character # denotes a comment: all text after the first # on a line is a comment. For example, lines 5–8 of Advances_Statistics_Code.R are
# --Define population parameters and sample numbers------
#
PopMean <- 0 # population mean
PopSD <- 1 # population standard deviation
The two commands in these lines of code, PopMean <- 0 and PopSD <- 1, assign values to the variables PopMean and PopSD, the population mean and standard deviation.
If you highlight and then submit these lines of code, this is what you see in the R Console:
> # --Define population parameters and sample numbers------
> #
> PopMean <- 0 # population mean
> PopSD <- 1 # population standard deviation
>
It is not obvious the commands have done a thing. If you type and then submit each variable name⁵ in the R Console, however, you see that the commands have assigned the values of 0 and 1 to PopMean and PopSD:
> PopMean
[1] 0
> PopSD
[1] 1
>
The Simulation: Observations and Sample Statistics
If we want to explore the distinction between standard deviation and standard error, we need some data. When I teach a class on regression, I introduce a data set in this way:
It is difficult to choose an example that is relevant to everyone. So instead, I want to use an example that is relevant to no one: cement.
I then proceed to discuss a 1932 study that examined the impact of the composition of cement on the heat released by the cement as it hardened (15).
Suppose the random variable Y represents not the heat from cement but the physiological thing you study: L-ascorbic acid transport, differential gene expression, TNF-α, or venous capacitance in trout. Assume that your Y is distributed normally with mean µ and standard deviation σ. We now have an example that is relevant to everyone. Unfortunately, we now also have a problem: different responses have different means and standard deviations. We can circumvent this problem if we consider the distribution of each response to be a standard normal distribution with mean µ = 0 and standard deviation σ = 1 (Fig. 2). This standard normal distribution is the population from which we will obtain our simulated sample observations–our data.
Suppose we want to estimate µ = 0 and σ = 1, the mean and standard deviation of our population (see Ref. 14). To do this, we draw at random a sample of n observations from the population. For simplicity, suppose we limit the sample to nine observations. This is the R command (Advances_Statistics_Code.R, line 36) that generates the sample and rounds each value to three decimal places:
The sample size is defined by the command nObs <- 9 (Advances_Statistics_Code.R, line 10).
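A command of roughly the following shape would do what the text describes. This is a minimal sketch, not the line from Advances_Statistics_Code.R: the variable name theSample and the call to set.seed are assumptions added for illustration, while PopMean, PopSD, and nObs are the names used in the script.

```r
# Population parameters and sample size, using the names from the script
PopMean <- 0   # population mean
PopSD   <- 1   # population standard deviation
nObs    <- 9   # number of observations in the sample

set.seed(1)    # assumption: a fixed seed so the sketch is repeatable

# Draw nObs observations at random from the normal population
# and round each value to three decimal places.
# theSample is a hypothetical name chosen for this sketch.
theSample <- round(rnorm(nObs, mean = PopMean, sd = PopSD), 3)
theSample
```

If you remove the set.seed line, each run produces a different sample, which is the behavior the simulation relies on.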
Because we had so much fun taking 1 random sample, we repeat the process until we have drawn a total of 1000 random samples, each with 9 observations, from our population. Mercifully, the command for (i in 1:nSamples) in line 35 of Advances_Statistics_Code.R does this for us. These are the observations–the data–for samples 1, 2, and 1000:
> # Sample Observations
Your sample observations will differ.
We have our data, but if we want to really understand the distinction between standard deviation and standard error, we also need some sample statistics.⁶ So each time we draw a sample of 9 observations, we calculate the sample statistics listed in Table 1. These are the statistics for samples 1, 2, and 1000:
With these 1000 sets of sample observations and statistics, we are ready to explore the essential distinction between standard deviation and standard error.
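The repeated sampling can be sketched as a loop that stores a few statistics for each sample. This is an illustration under stated assumptions: the matrix name theStats and its three columns (mean, standard deviation, standard error of the mean) are hypothetical and stand in for whatever the script and Table 1 actually hold.

```r
PopMean  <- 0      # population mean
PopSD    <- 1      # population standard deviation
nObs     <- 9      # observations per sample
nSamples <- 1000   # number of samples to draw

set.seed(1)        # assumption: fixed seed for a repeatable sketch

# One row per sample; the column names are chosen for this sketch
theStats <- matrix(NA_real_, nrow = nSamples, ncol = 3,
                   dimnames = list(NULL, c("Mean", "SD", "SE")))

for (i in 1:nSamples) {
  y <- round(rnorm(nObs, mean = PopMean, sd = PopSD), 3)
  # sample mean, sample standard deviation, standard error of the mean
  theStats[i, ] <- c(mean(y), sd(y), sd(y) / sqrt(nObs))
}

# Statistics for samples 1, 2, and 1000
theStats[c(1, 2, nSamples), ]
```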
Standard Deviation
In each of our 1000 samples, the 9 observations differ because the underlying population (see Fig. 2) is distributed over a range of possible values. The typical measure of the variability among experimental measurements is the sample standard deviation s:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \]
where n is the number of observations in the sample, yᵢ is an individual observation, and ȳ is the sample mean. The sample standard deviation characterizes the dispersion of observations about the sample mean and estimates the population standard deviation σ. For example, the standard deviation of the observations in sample 1, 0.422, 1.103,..., 1.825, is s = 0.702, which estimates σ = 1. The empirical distribution of the 1000 sample standard deviations is centered at 0.966, slightly less than the actual value of 1 (Fig. 3). The command in line 104 of Advances_Statistics_Code.R returns this value. Your value will differ slightly.
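The center of the empirical distribution falls a little below 1 because, for normally distributed observations, the expected value of s is slightly smaller than σ. The sketch below computes that expected value for n = 9 and compares it with a fresh simulation. The names c4, expectedS, and simS are introduced here for illustration and do not come from the script.

```r
nObs  <- 9   # observations per sample
PopSD <- 1   # population standard deviation

# For normal data, E[s] = c4 * sigma, where
# c4 = sqrt(2 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)
c4 <- sqrt(2 / (nObs - 1)) * gamma(nObs / 2) / gamma((nObs - 1) / 2)
expectedS <- c4 * PopSD
expectedS   # about 0.969 when nObs is 9

# Compare with the average of 1000 simulated sample standard deviations
set.seed(1)  # assumption: fixed seed for a repeatable sketch
simS <- replicate(1000, sd(rnorm(nObs, mean = 0, sd = PopSD)))
mean(simS)
```

The simulated average will vary from run to run, but it should sit close to the theoretical value rather than close to 1.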
Standard Error of the Mean
What is the standard error of the mean, SE {ȳ}? I ask this question of my students on the first day of class. Often students can explain in words how to calculate the standard error: divide the standard deviation by the square root of the sample size. Seldom can a student explain the concept behind the standard error: if I repeat an experiment a whole bunch of times–and each time I calculate a sample mean–then the standard deviation of those sample means will be the standard error of the mean (12–14). The standard error of the mean answers a theoretical question: if I repeat an experiment a whole bunch of times, by how much will a typical sample mean differ from the population mean? By virtue of our simulation, we have 1000 sample means (Fig. 5). If we treat these 1000 sample means as observations, we can calculate their average and standard deviation:
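The calculation can be sketched in a few lines of R. This is an illustration, not the script's code: theMeans, aveMeans, and sdMeans are hypothetical names, and the numbers you obtain will differ from those in the paper because the random samples differ.

```r
PopMean  <- 0      # population mean
PopSD    <- 1      # population standard deviation
nObs     <- 9      # observations per sample
nSamples <- 1000   # number of samples

set.seed(1)        # assumption: fixed seed for a repeatable sketch

# Draw nSamples samples and keep only the mean of each one
theMeans <- replicate(nSamples, mean(rnorm(nObs, mean = PopMean, sd = PopSD)))

# Treat the sample means as observations
aveMeans <- mean(theMeans)   # should be near the population mean
sdMeans  <- sd(theMeans)     # should be near PopSD / sqrt(nObs)

c(Average = aveMeans, SD = sdMeans, Theory = PopSD / sqrt(nObs))
```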
Suppose we draw from our population (see Fig. 2) an infinite number of samples, each with n = 9 observations. The infinite number of sample means, ȳ₁, ȳ₂,..., ȳ∞, will be distributed normally with mean µ and standard deviation σ/√n.⁸
In other words, the average of the sample means, Ave {ȳ}, will be the population mean µ, but the standard deviation of the sample means, SD {ȳ}, will be smaller than the population standard deviation σ by a factor of 1/√n:
\[ \text{Ave}\{\bar{y}\} = \mu \quad \text{and} \quad \text{SD}\{\bar{y}\} = \frac{\sigma}{\sqrt{n}} \]
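The factor of 1/√n follows from the variance of an average of independent observations. A short derivation, assuming the n observations are independent draws from a population with variance σ² (the operator Var is introduced here for illustration):

```latex
\begin{align*}
\operatorname{Var}\{\bar{y}\}
  &= \operatorname{Var}\!\left\{\frac{1}{n}\sum_{i=1}^{n} y_i\right\}
   = \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}\{y_i\}
   = \frac{n\sigma^{2}}{n^{2}}
   = \frac{\sigma^{2}}{n}, \\
\operatorname{SD}\{\bar{y}\}
  &= \sqrt{\operatorname{Var}\{\bar{y}\}}
   = \frac{\sigma}{\sqrt{n}}.
\end{align*}
```

For the population in this simulation, σ = 1 and n = 9, so SD {ȳ} = 1/3.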
If the sample size n increases, then the standard deviation of the theoretical distribution of the sample mean will decrease: the more sample observations we have, the more certain we will be that the sample mean ȳ is near the actual population mean µ (Fig. 6).
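A small simulation can make the effect of sample size concrete. The sketch below is an illustration under assumptions: the sample sizes 9, 36, and 144 are chosen only because their square roots are convenient, and sampleSizes, simSD, and theorySD are hypothetical names.

```r
PopMean  <- 0      # population mean
PopSD    <- 1      # population standard deviation
nSamples <- 1000   # samples drawn for each sample size

sampleSizes <- c(9, 36, 144)   # assumption: sizes chosen for illustration

set.seed(1)        # assumption: fixed seed for a repeatable sketch

# For each sample size, simulate nSamples sample means
# and compute the standard deviation of those means
simSD <- sapply(sampleSizes, function(n) {
  sd(replicate(nSamples, mean(rnorm(n, mean = PopMean, sd = PopSD))))
})

# Theoretical standard deviation of the sample mean
theorySD <- PopSD / sqrt(sampleSizes)

data.frame(n = sampleSizes, Simulated = simSD, Theory = theorySD)
```

Each fourfold increase in n should roughly halve the standard deviation of the sample means.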
Summary
As this exploration has demonstrated, the standard deviation and standard error of the mean estimate quite different things: a standard deviation estimates the variability among individual observations in a sample–it can also estimate the variability in the underlying population–but a standard error of the mean estimates the theoretical variability among sample means.
In a sample, the observations–the data–differ because the population from which they were drawn is distributed over a range of possible values. The standard deviation describes the dispersion of these sample observations about the sample mean. If we fail to report the standard deviation, then we fail to fully report our data. Because it incorporates information about sample size, the standard error of the mean is a misguided estimate of variability among observations. Instead, the standard error of the mean provides an estimate of the uncertainty of the true value of the population mean.
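The contrast can be seen by computing both statistics for single samples of increasing size. In this sketch, which assumes the same standard normal population, the function describeSample and the sample sizes 9, 90, and 900 are hypothetical choices made for illustration.

```r
PopMean <- 0   # population mean
PopSD   <- 1   # population standard deviation

set.seed(1)    # assumption: fixed seed for a repeatable sketch

# Return the sample size, sample standard deviation, and
# standard error of the mean for one random sample of size n
describeSample <- function(n) {
  y <- rnorm(n, mean = PopMean, sd = PopSD)
  c(n = n, SD = sd(y), SE = sd(y) / sqrt(n))
}

result <- t(sapply(c(9, 90, 900), describeSample))
result
```

The SD column stays in the neighborhood of 1 however large the sample, because it describes the spread of the observations. The SE column shrinks as n grows, because it describes the uncertainty about the population mean.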
In the next installment of this series, we will explore some of the concepts behind hypothesis testing: test statistics and P values.
APPENDIX
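The probability density function–the theoretical distribution of possible values–of the sample standard deviation s is
\[ f(s) = \frac{2\,(n^{*}/\sigma^{2})^{n^{*}}}{\Gamma(n^{*})}\; s^{\,n-2}\, e^{-n^{*} s^{2}/\sigma^{2}}, \quad s > 0 \qquad \text{(A1)} \]
where n* = (n – 1)/2 and the gamma function Γ(n*) is
\[ \Gamma(n^{*}) = \int_{0}^{\infty} t^{\,n^{*}-1} e^{-t}\, dt \]
If n* is a positive integer greater than 1, then Γ(n*) = (n* – 1)!. For example, Γ(4) = 3! = 3·2·1 = 6.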
Figure 7 depicts the probability density function of the sample standard deviation for 5, 10, 20, 30, 40, 50, and 100 observations.
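The density can be checked numerically. The sketch below assumes the standard form of the density of s for normally distributed observations, written with n* = (n – 1)/2; the function name dSampleSD, the grid spacing h, and the upper limit of 5 are choices made for this illustration.

```r
# Density of the sample standard deviation s for normal data,
# written with nStar = (n - 1) / 2 (assumed standard form)
dSampleSD <- function(s, n, sigma = 1) {
  nStar <- (n - 1) / 2
  2 * (nStar / sigma^2)^nStar / gamma(nStar) *
    s^(n - 2) * exp(-nStar * s^2 / sigma^2)
}

# Fine grid of possible values of s; the density is negligible beyond 5
h <- 0.001
s <- seq(0, 5, by = h)

# The area under the density should be close to 1 for each sample size
sampleSizes <- c(5, 10, 20, 30, 40, 50, 100)
areas <- sapply(sampleSizes, function(n) sum(dSampleSD(s, n)) * h)
areas

# Mean of the density for n = 9, to compare with the simulated center
meanS9 <- sum(s * dSampleSD(s, 9)) * h
meanS9

# The gamma function at a positive integer is a factorial
gamma(4) == factorial(3)
```

To draw curves like those in Figure 7, you could pass dSampleSD to curve() for each sample size.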
Acknowledgments
I thank Matthew Strand (National Jewish Medical and Research Center, Denver, CO) for deriving the probability density function for the sample standard deviation.
Footnotes
The costs of publication of this article were defrayed in part by the payment of page charges. The article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
¹ These guidelines can be accessed through the American Physiological Society "Information for Authors" (9).
² A script is a file composed of R commands.
³ This file is available through the Supplemental Material link for this article on the Advances in Physiology Education website.
⁴ The notation click A|B means click A, then click B.
⁵ To do this, click within the R Console, type the variable name, and then press Enter.
⁶ A statistic is a quantity calculated from the sample observations.
⁷ We will use the statistics in columns 4–7 in subsequent explorations.
⁸ I derive these results for the mean and standard deviation in Ref. 14. The Central Limit Theorem states that the theoretical distribution of the sample mean will be approximately normal regardless of the distribution of the original observations. If the distribution of the original observations happens to be normal, then the theoretical distribution of the sample mean will be exactly normal.
Received for publication March 21, 2008. Accepted for publication May 6, 2008.
REFERENCES