## Abstract

Learning about statistics is a lot like learning about science: the learning is more meaningful if you can actively explore. This eighth installment of *Explorations in Statistics* explores permutation methods, empiric procedures we can use to assess an experimental result–to test a null hypothesis–when we are reluctant to trust statistical theory alone. Permutation methods operate on the observations–the data–we get from an experiment. A permutation procedure answers this question: out of all the possible ways we can rearrange the observations we got, in what proportion of those arrangements is the sample statistic we care about at least as extreme as the one we got? The answer to that question is the *P* value.

- permutation test
- R
- randomization test
- Sir Ronald Fisher

This eighth paper in *Explorations in Statistics* (see Refs. 4–10) explores permutation methods, basic procedures for hypothesis testing against which more familiar methods of hypothesis testing have been validated (1, 20, 29, 46–48, 55). Although the very phrase may be novel to many researchers, permutation methods are detailed in textbooks of statistics (3, 16, 24, 36, 38) and used in reports published by the American Physiological Society (2, 23, 25, 30, 32, 37, 42, 43, 45, 50, 52).

Permutation methods^{1} give us an empiric approach to estimate the theoretic distribution of some statistic and to get the *P* value associated with that statistic. When we explored correlation (8) we calculated a sample correlation coefficient of −0.26 for which the associated *P* value was 0.009. We glossed over it, but we derived that *P* value from a theoretic distribution of the test statistic *t* (see Ref. 8, *footnote 7*). Like the bootstrap, permutation methods can transcend theory: we can use them to assess the result of an experiment when the theoretic distribution of some test statistic may mislead us. Some statisticians advocate that we use permutation methods whenever we can (17, 18, 24, 28, 35, 53).

### A Brief History of Permutation Methods

In 1935, 10 years after he published *Statistical Methods for Research Workers* (19), Sir Ronald Fisher used Charles Darwin's measurements of the heights of cross- and self-fertilized corn (14) to illustrate the analysis of *quantitative measures* using a *t* test (20).^{2} For 15 pairs of plants, Fisher calculated the difference in height between the cross- and self-fertilized member of each pair.^{3} He defined the null hypothesis *H*_{0} as

*H*_{0}: μ_{d} = 0,

where μ_{d} is the mean of the population of height differences.^{4} Or, from a practical standpoint, the heights of cross- and self-fertilized corn plants are the same. Using the 15 differences, Fisher calculated *t* = 2.148 (see Ref. 6). If the null hypothesis is true, we expect to observe a value of |*t*| at least as big as 2.148 about 1 time in 20 (*P* = 0.0497 ≐ 0.05). Fisher concluded that cross-fertilized corn is barely taller than self-fertilized corn.

When we explored the bootstrap (7) we discovered that the result of a *t* test is meaningful only if the theoretic distribution of the sample mean is roughly normal. One way this can happen is if our observations are drawn from a population that is distributed normally (4). In *The Design of Experiments* (20), Fisher mused about the extent to which his conclusion would be impacted if the observed differences came from something other than a normal distribution:

> It has been mentioned that Student's *t* test … is appropriate to the null hypothesis that the two groups of measurements are samples drawn from the same normally distributed population … There has, however, in recent years, been a tendency for theoretical statisticians, not closely in touch with the requirements of experimental data, to stress the element of normality … as though it were a serious limitation to the test applied. It is, indeed, demonstrable that … the exactitude of Student's *t* test is absolute. It may, nevertheless, be legitimately asked whether we should obtain a materially different result were it possible to test the wider hypothesis which merely asserts that the two series are drawn from the same population, without specifying that this is normally distributed.
>
> In these discussions it seems to have escaped recognition that the physical act of randomisation, which, as has been shown, is necessary for the validity of any test of significance, affords the means, in respect of any particular body of data, of examining the wider hypothesis in which no normality of distribution is implied. The arithmetical procedure of such an examination is tedious, and we shall only give the results of its application in order to show the possibility of an independent check on the more expeditious methods in common use.

Fisher then proceeded to detail the results of his tedious check: in 1,726 of the 32,768 possible ways of arranging the heights was the sum of the differences at least as extreme as the one Darwin actually observed. Therefore, *P* = 1,726/32,768 = 0.05267 ≐ 0.05,

> a result very nearly equivalent to that obtained using the *t* test with the hypothesis of a normally distributed population.
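Fisher's tedious arithmetic is trivial for a computer. Under the null hypothesis the sign of each paired difference is arbitrary, so there are 2^{15} = 32,768 equally likely sign arrangements. This sketch, written in R like the rest of this exploration's code but not part of Advances_Statistics_Code_Perm.R, enumerates them using the 15 height differences from Darwin's experiment, in eighths of an inch, as reported by Fisher (20):

```r
# Darwin's 15 height differences (cross- minus self-fertilized),
# in eighths of an inch, as reported by Fisher (20)
d <- c(49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48)

observed <- sum(d)    # the sum Darwin actually observed: 314

# Each of the 2^15 = 32,768 sign assignments is equally likely under
# the null hypothesis, so enumerate every combination of signs
signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), length(d))))
sums  <- signs %*% d  # sum of the differences for each arrangement

# Two-sided count: arrangements at least as extreme as the observed sum
extreme <- sum(abs(sums) >= abs(observed))
extreme               # 1726
extreme / 2^15        # P = 1726/32768 = 0.05267
```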

In the 1930s, statisticians used permutation methods to validate the more convenient normal-theory procedures (21, 44). In a Herculean permutation simulation that predated the initial publication of *The Design of Experiments*, Eden and Yates confirmed that, in some situations at least, Fisher's analysis of variance could be applied meaningfully to skewed data (15).^{5} In contrast, Pitman used an approximation to circumvent the need for laborious calculations and demonstrated that permutation methods could be applied to analyze–for samples from any kind of population–the difference between sample means, correlation, and analysis of variance (46–48). Using the same kind of approximation, Welch showed that permutation methods could be applied also to more complicated experimental designs (55).

In a memorial essay (56), Yates argued that Fisher regarded as untenable the routine use of permutation methods. Fisher himself added this brief section to his seventh edition (22) of *The Design of Experiments*:

> In recent years tests using the physical act of randomisation … have been largely advocated under the name of [nonparametric] tests … The [Darwin] example of this Chapter, published in 1935, was by many years the first of its class. The reader will realise that it was in no sense put forward to supersede the common and expeditious tests based on the Gaussian theory of errors. The utility of such nonparametric tests consists in their being able to supply confirmation whenever, rightly or, more often, wrongly, it is suspected that the simpler tests have been appreciably injured by departures from normality.

With this brief history, we are almost ready to begin our exploration of permutation methods. First, we need to review the software we will use to help us learn about them.

### R: Basic Operations

In the first article (4) of this series, I summarized R (49) and outlined its installation. For this exploration there are two additional steps: download Advances_Statistics_Code_Perm.R^{6} to your Advances folder and install the packages beeswarm, coin, and gtools.^{7}

To install these three packages, open R and then click Packages | Install package(s) … ^{8} Select a CRAN mirror close to your location and then click OK. Select beeswarm, coin, and gtools and then click OK. R reports in the console when each package has been installed successfully.
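If you prefer typing to menus, the same installation can be done from the R console; this is a standard alternative to the menu sequence above, not a step from the article's script:

```r
# Install the three packages used in this exploration from CRAN
install.packages(c("beeswarm", "coin", "gtools"))

# Load them so their functions are available in the current session
library(beeswarm)
library(coin)
library(gtools)
```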

#### To run R commands.

If you use a Mac, highlight the commands you want to submit and then press Command+Enter. If you use a PC, highlight the commands you want to submit, right-click, and then click Run line or selection. Or, highlight the commands you want to submit and then press Ctrl+R.

### Permutation Methods: an Overview

When we explored hypothesis tests and *P* values (6), we developed the notion of a theoretic distribution of the test statistic *t* by drawing 1,000 random samples from a standard normal distribution with mean μ = 0. To assess whether the observations in our first sample were consistent with having come from this population, we calculated *t* = 3.407. We based our interpretation of this value of 3.407 on a theoretic distribution of *t* (see Ref. 6, Fig. 3). What did we conclude? Because the test statistic *t* = 3.407 (*P* = 0.005) was more unusual than our critical significance level of α = 0.10, we concluded that the sample observations were consistent with having come from a population that had a mean other than 0.

This simulation illustrates the *population model* of statistical inference, a model advanced by Neyman and Pearson (41) and labeled by Lehmann (31). The academic difficulty with this model is that it is almost never realized in practice.

Instead, we have at hand subjects or experimental units–people, sheep, rabbits, cell cultures–that we assign at random to different experimental groups.^{9} This *randomization model* of statistical inference was defined explicitly by Kempthorne (28). In the strict application of this model, the theoretic distribution of some test statistic is not relevant. So the question becomes, in this situation, how do we obtain that ubiquitous *P* value?

Suppose we have six rabbits in which we want to examine the impact of some intervention on changes in brain blood flow. We randomly assign three rabbits to *group 1*, the control group, and the three remaining rabbits to *group 2*, the experimental group. We define our null hypothesis to be that changes in brain blood flow are similar in the two groups. As it turns out, the rabbits in *group 2* had the biggest changes in brain blood flow: on average, brain blood flow increased 3 ml/min more than in *group 1* (Table 1, *arrangement 1*).

How do we use the observations from the six rabbits to assess our null hypothesis? If the null hypothesis is true–if changes in brain blood flow are similar–then the observations are exchangeable (24). This means that out of all the ways we can rearrange our observations, the arrangement we got is a typical one (24, 36). How do we implement this abstract idea to get a *P* value? For each of the 19 other ways we can rearrange our 6 observations (see Table 1), we calculate the difference between sample means and then ask this question: out of all the possible ways we can rearrange the observations we got, in what proportion of those arrangements is the difference between sample means at least as extreme as the one we got? The answer to that question is our *P* value.

In 2 of the 20 possible arrangements of these observations was the magnitude of the difference between sample means at least as big as 3, the one we got (Table 1, *arrangements 1* and *20*). Therefore, *P* = 2/20 = 0.10. The appendix extends this example.
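The rabbit example can be sketched in a few lines of R. Because Table 1 is not reproduced here, the six observations below are hypothetical values chosen only so that the two groups do not overlap and their means differ by 3 ml/min, as in the text:

```r
# Hypothetical changes in brain blood flow (ml/min) for the six rabbits;
# invented for illustration so that group 2's mean exceeds group 1's by 3
flow   <- c(0, 1, 2, 3, 4, 5)
group1 <- c(0, 1, 2)   # control
group2 <- c(3, 4, 5)   # experimental
observed <- mean(group2) - mean(group1)   # 3

# All 20 ways to assign 3 of the 6 observations to group 1
assignments <- combn(6, 3)
diffs <- apply(assignments, 2, function(idx) {
  mean(flow[-idx]) - mean(flow[idx])
})

# In what proportion of arrangements is the difference between sample
# means at least as extreme as the one we got?
P <- mean(abs(diffs) >= abs(observed))
P   # 2/20 = 0.10
```

The exhaustive enumeration with `combn` is practical here because there are only 20 arrangements; larger samples call for the random sampling of arrangements used later in the article.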

### A More Realistic Two-Sample Example

Imagine that we want to study the impact of diet on plasma cholesterol in women.^{10} We randomly assign seven women to *group 1*, a diet of fish, and five women to *group 2*, a diet of meat. We define our null hypothesis to be that, after 1 year on a diet of fish or meat, plasma cholesterol levels will be similar in the two groups, and we establish a critical significance level–a benchmark for uncommonness–of α = 0.05. Table 2 lists the observations from this fictitious experiment (see also Fig. 1).

In our second exploration we mentioned that a one-sample *t* statistic

*t* = (*ȳ* − μ) / SE{*ȳ*},

where SE{*ȳ*} = *s*/√*n*, *s* is the sample standard deviation, and *n* is the number of observations in the sample, could be adapted to assess whether two sets of sample observations were consistent with having come from the same population (see Ref. 6, *footnote 8*). To do this, we simply replace the single sample mean *ȳ* with the difference between sample means, *ȳ*_{2} − *ȳ*_{1}:

*t* = (*ȳ*_{2} − *ȳ*_{1}) / SE{*ȳ*_{2} − *ȳ*_{1}}.  (*Eq. 1*)

To calculate *t*, we opt for the more versatile form of SE{*ȳ*_{2} − *ȳ*_{1}},

SE{*ȳ*_{2} − *ȳ*_{1}} = √(*s*_{1}^{2}/*n*_{1} + *s*_{2}^{2}/*n*_{2}).

For the moment, we will set aside concerns that the theoretic distribution of *t* as defined in *Eq. 1* may not apply under this randomization model.^{11} In this case, *s*_{2}^{2} = 3.658, *s*_{1}^{2} = 0.392, and SE {*ȳ*_{2} − *ȳ*_{1}} = 0.8875. Therefore, *t* = 2.017. The commands in *lines 61–79* of Advances_Statistics_Code_Perm.R return these values. If the null hypothesis is true–if plasma cholesterol levels are similar–how usual is this value of *t*? If the null hypothesis is true, we expect to observe a value of |*t*| at least as big as 2.017 about 1 time in 10 (*P* = 0.104 ≐ 0.10). This is more usual than our benchmark of 0.05. As a result, we fail to reject the null hypothesis and conclude that–although there is a suggestion that diet affects plasma cholesterol levels (see Ref. 11, Table 1)–the sample observations are consistent with having come from the same population: that is, plasma cholesterol levels are similar in the two groups.
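As an arithmetic check, the unpooled standard error used above can be recomputed directly from the variances and sample sizes reported in the text:

```r
# Sample variances and sizes from the text (Table 2 is not reproduced here)
s1_sq <- 0.392; n1 <- 7   # group 1: fish diet
s2_sq <- 3.658; n2 <- 5   # group 2: meat diet

# Unpooled standard error of the difference between sample means
SE <- sqrt(s1_sq/n1 + s2_sq/n2)
round(SE, 4)              # 0.8875
```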

But we know from our exploration of the bootstrap (7) that this conclusion is meaningful only if the theoretic distribution of the difference between sample means is roughly normal. The bootstrap distribution provides clear evidence it is not (Fig. 2). Great. Now what? You guessed it: a randomization test.

The first question is, in how many ways can we rearrange our 12 observations? The answer is, in

12! / (5! · 7!) = 792

ways.^{12} In 7 of these 792 possible arrangements is the difference between sample means at least as extreme as the one we got. Therefore, *P* = 7/792 ≐ 0.009.^{13} This is more unusual–less likely to occur–than our benchmark of 0.05. In contrast to our conclusion above, we reject the null hypothesis and conclude that plasma cholesterol levels differ between these two groups of women. We revisit this discrepancy in *Practical Considerations*.
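This randomization test follows the same pattern as the rabbit example, only with 792 arrangements instead of 20. The sketch below wraps the procedure in a reusable function; the cholesterol values shown are hypothetical placeholders, because Table 2 is not reproduced here, so the *P* value they yield will not match the article's:

```r
# Exhaustive two-sample randomization test on the difference between means
perm_test <- function(y1, y2) {
  y  <- c(y1, y2)
  n1 <- length(y1)
  observed <- mean(y2) - mean(y1)
  # Every way to choose which n1 of the observations form group 1
  idx   <- combn(length(y), n1)
  diffs <- apply(idx, 2, function(i) mean(y[-i]) - mean(y[i]))
  mean(abs(diffs) >= abs(observed))   # two-sided P value
}

# Hypothetical plasma cholesterol values (mmol/l), 7 fish and 5 meat
fish <- c(5.4, 5.6, 5.9, 6.0, 6.1, 6.2, 6.4)
meat <- c(6.2, 6.8, 7.4, 8.5, 9.9)
perm_test(fish, meat)
```

Substituting the actual observations from Table 2 should reproduce *P* ≐ 0.009.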

In our overview of permutation methods, we said that if the null hypothesis is true, then the observations are exchangeable. This means we assume the observations came from populations that share a common standard deviation. In some situations, a randomization test can detect a difference in standard deviation rather than a difference in mean, but this is not inevitable (Fig. 4), and methods exist to deal with this phenomenon (40).

### An Example in Correlation

When we explored correlation, we used data from Snedecor and Cochran (51) to show that one observation can distort the magnitude of a correlation coefficient (see Ref. 8, Fig. 8). We can use the same data (Fig. 5) to demonstrate the application of permutation methods to correlation.

In the setting of correlation, if there is no relationship between two variables, then every possible arrangement of those variables is equally likely (16, 24). Now the question is, how do we rearrange our observations? We fix the *x* values and permute the *y* values. For these data, there are 11! = 39,916,800 possible arrangements of the *y* values (Table 4 and appendix). Rather than calculate a correlation coefficient for each of those 11! arrangements, we calculate a correlation coefficient for 10,000 of those arrangements chosen at random.^{14}

In how many of the 10,000 arrangements is the correlation coefficient *r* at least as extreme as |0.56|, the actual magnitude we observed? The answer is, in 770 of them (Fig. 6). Therefore, if the null hypothesis is true, the probability that we would have observed a correlation at least as big as |0.56| is 770 times in 10,000 (*P* = 770/10,000 ≐ 0.08). The commands in *lines 186–202* of Advances_Statistics_Code_Perm.R return these values. Your values will differ slightly.
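This random-sampling version of a permutation test is easy to sketch in R: hold the *x* values fixed, shuffle the *y* values, and record the correlation coefficient for each shuffle. Because the Snedecor and Cochran data are not reproduced here, the *x* and *y* values below are hypothetical placeholders:

```r
# Monte Carlo permutation test for a correlation coefficient
set.seed(1)   # so the random arrangements are reproducible
perm_cor <- function(x, y, n_perm = 10000) {
  observed <- cor(x, y)
  # Fix x, permute y, and recompute r for each random arrangement
  r_perm <- replicate(n_perm, cor(x, sample(y)))
  # Proportion of arrangements at least as extreme as the observed |r|
  mean(abs(r_perm) >= abs(observed))
}

# Hypothetical data: 11 paired observations, as in the article's example
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
y <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12)
perm_cor(x, y)
```

Sampling 10,000 of the 11! = 39,916,800 arrangements at random, rather than enumerating them all, is why repeated runs return slightly different *P* values.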

### Practical Considerations

In our examples, we used the same strategy (38):

*1*. Define the problem–the null hypothesis–we care about.

*2*. Identify and then calculate a sample statistic that is relevant to the null hypothesis.

*3*. Rearrange the observations in a way that is consistent with the null hypothesis. For each arrangement, calculate the sample statistic.

*4*. Calculate the proportion of sample statistics in the permutation distribution that are at least as extreme as the value we observed in the actual arrangement of the observations.

It is the final step that produces the *P* value. This strategy applies to a broad spectrum of experimental designs (16, 24, 36).

In the 1930s, when Eden, Yates, Fisher, Pitman, and Welch pioneered permutation methods (15, 20, 46–48, 55), the computational effort was prohibitive. Today, that effort is trivial. Although we can write R scripts to do these procedures, commercial software packages such as StatXact (13) simplify the process.^{15}

But still, are permutation methods really worth the effort? When John Tukey writes

> No other class of approach provides significance information of comparative quality.

(Ref. 53, p. 18) and when Brad Efron writes

> When there *is* something to permute … it is a good idea to do so, even if other methods like the bootstrap are also brought to bear.

(Ref. 17, p. 218) and when other statisticians (18, 24, 28, 35) also endorse the procedures, it is worth our while to pay attention.

You are starting to warm to permutation methods. But when do you use them? Although some statisticians advocate that we use a permutation method for the actual statistical analysis whenever we can (17, 18, 24, 28, 35, 53), we can also use a permutation method as we did the bootstrap (7): to assess whether an inference we make from a more traditional hypothesis test is justified. How do we do this? If our conclusion from permutation matches our conclusion from the traditional test, then the assumptions for the traditional procedure are reasonably well met (16, 20, 24, 36).

We have one last question to address: what happens if our conclusion from permutation conflicts with our conclusion from the traditional test? We should suspect that the assumptions for the traditional procedure have not been met, as in our two-sample example. It almost goes without saying that we want to opt not for the statistical procedure that produces the result of our dreams but for the statistical procedure that has its assumptions best satisfied.

### Summary

As this exploration has demonstrated, permutation methods give us an approach we can use to assess an experimental result when we are reluctant to trust statistical theory alone (16, 24, 36). Like the bootstrap, permutation methods are empiric: they operate on the observations–the data–we get from an experiment. Unlike the bootstrap, the primary utility of a permutation procedure is in the test of a scientific null hypothesis. Although we have explored permutation methods using a basic two-sample problem and a problem in correlation, we can use them to reach conclusions about scientific results in other experimental designs (16, 24, 36).

In the next installment of this series, we will explore the analysis of ratios. As researchers, we use ratios to normalize a numerator to some denominator: to control for differences in the denominator when the thing we really care about is the numerator. In physiology, ratios are quite common. The problem is that the analysis of ratios is quite complex. In the next exploration, we will see why.

## DISCLOSURES

No conflicts of interest, financial or otherwise, are declared by the author(s).

## ACKNOWLEDGMENTS

I thank John Ludbrook (Department of Surgery, The University of Melbourne, Melbourne, Victoria, Australia), Gerald DiBona (University of Iowa College of Medicine, Iowa City, Iowa), and Bryan Manly (Western EcoSystems Technology, Cheyenne, Wyoming) for helpful comments and suggestions, and I thank Cyrus Mehta and Cytel Software for graciously granting me the opportunity to explore StatXact.

## APPENDIX

This appendix reviews nomenclature related to permutation methods and illustrates a practical extension of our overview of permutation methods.

#### Nomenclature

To be honest, I struggled with the nomenclature associated with permutation methods. In part I struggled because the nomenclature is divorced from the mechanics of the procedures. A *permutation test* refers to the possible rearrangements of observations among groups of subjects when those subjects have been drawn at random from their populations (16, 29). A *randomization test* refers to the possible rearrangements of observations among groups of subjects when those subjects have been drawn not at random but simply assigned at random to different experimental groups (16, 29).

To make matters worse, I had vague memories from high school of the definitions (54) of permutations and combinations. In permutations, order matters. Suppose we have *n* things and we want to ask, in how many ways can we arrange these *n* things? The first time we choose, we have *n* things from which to choose. After we have chosen one of those *n* things, we have *n* − 1 things from which to choose. This continues until we have chosen *n* − 1 things, at which point we have just one thing from which to choose. So, in how many ways can we arrange *n* things? In *n*(*n* − 1)(*n* − 2) … 1 = *n*! ways, or in *n*! permutations. These are the mechanics we used with correlation (see Table 4).

In combinations, order does not matter. Suppose we have *n* things and we want to ask, in how many ways can we choose *r* of these *n* things? This means we will have two groups: one with *r* things and one with *n* − *r* things. We can calculate the number of ways–the number of combinations–as *n* choose *r*:

C(*n*, *r*) = *n*! / [*r*! (*n* − *r*)!].

Regardless of whether we use permutations or combinations to do a *permutation test* or a *randomization test,* the concept is the same: out of all the possible ways we can rearrange the observations we got, in what proportion of those arrangements is our sample statistic at least as extreme as the one we got?
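R's built-in `factorial()` and `choose()` functions implement these formulas directly, which makes the counts used throughout this exploration easy to verify:

```r
factorial(11)   # 11! = 39,916,800 arrangements of the y values (correlation)
choose(6, 3)    # 20 ways to split 6 rabbits into groups of 3 and 3
choose(12, 5)   # 792 ways to split 12 women into groups of 7 and 5
choose(8, 4)    # 70 ways to split 8 observations into groups of 4 and 4
```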

#### A Practical Extension

In our overview, we used two groups of three observations to illustrate the framework of permutation (see Table 1). We discovered that in 2 of the 20 possible arrangements of those nonoverlapping observations, the magnitude of our sample statistic was at least as extreme as the actual difference we got (*P* = 2/20 = 0.10).

I have reviewed papers in which the authors analyzed two groups of three observations using a *t* test. This is reckless. The only way a *t* test can be applied meaningfully to two groups with so few observations is if the physiological thing we care about is distributed normally. With three observations in each group, there is simply no way to know. Using a permutation procedure, the smallest *P* value we can achieve with two groups of three observations is *P* = 0.10.

Suppose we have two groups of four observations. We still have no good way of knowing if the physiological thing we care about is distributed normally, and so we still dismiss the *t* test. If we use a permutation approach, there are

8! / (4! · 4!) = 70

possible arrangements of the observations. If the observations in the two groups do not overlap, the smallest *P* value we can achieve is *P* = 2/70 ≐ 0.03.

## Footnotes

↵1 The appendix reviews nomenclature related to permutation methods.

↵2 Fisher concurrently detailed a permutation method for categorical data, a procedure we know today as Fisher's exact test.

↵3 Darwin's experiment did not truly warrant this pairing of observations (27).

↵4 Recall that Fisher defined a null hypothesis but no alternative hypothesis (6).

↵5 Although Fisher was known for fierce exchanges with other statisticians (see Ref. 6), it is unlikely he felt affronted by Eden and Yates: they worked with him at Rothamsted Experimental Station.

↵6 This file is available through the Supplemental Material link for this article at the *Advances in Physiology Education* website.

↵7 Functions of the coin package perform conditional inference, inference based on actual observations (26).

↵8 The notation click *A* | *B* means click *A*, then click *B*.

↵9 For example, a control group and an intervention or treatment group.

↵10 Ludbrook and Dudley (35) used this example. Assume that the women have the same plasma cholesterol value beforehand.

↵11 Some statisticians say that *t* tests apply meaningfully to the randomization model of inference (see Ref. 39).

↵12 The appendix explains this calculation.

↵13 The commands in *lines 84–139* of Advances_Statistics_Code_Perm.R list the seven arrangements and do this randomization test.

↵14 The number of replications can vary from 1,000 to 10,000 (16, 17, 24, 36).

- Copyright © 2012 the American Physiological Society