Key Points

Comparisons propose no difference, and then ask “How probable?”

Misclassification is inevitable from time to time: false conclusions result

Families of observations are best tested only once

The more comparisons, the more likely is misclassification

For several comparisons in one family, test criteria should be more stringent

Scientists frequently want to answer the question "has this treatment had an effect?" Most are unaware that the tests they usually use do not directly address this question. These tests usually pose a different question, based on the possibility that nothing has happened. The question becomes: "how probable are these data, if there were NO difference between the original populations from which the data have been randomly drawn?" (In fact, for most laboratory experiments this supposition is patently false: the experiment has been conducted using a preordained sample, possibly randomly divided into treatment and control groups, but certainly not randomly sampled.)
However, if we continue to consider the usual analysis that is used, we have to assume that we have random samples, from the same population. Such samples will always differ to some extent. Occasionally, the difference might be substantial, large enough to suspect that they might not have come from the same source population. The usual context in which we use this test is that the data are already “under suspicion”: we usually don't want to believe the null hypothesis at all, and we are testing to see if the data are unlikely to be consistent with this hypothesis. To assess how “suspicious” our results can be, we estimate how frequently we might obtain results like ours:

if the “null hypothesis” were true,

if we were to repeatedly sample the population, and

if the results were the workings of chance alone.
Generally, we reject the null hypothesis if chance alone could yield data like ours less than 1 time in 20 (i.e., less than 5 times in 100), an arbitrary and probably unnecessarily inflexible value (6). We believe our suspicions are justified, and we can then accept the alternative hypothesis: the samples are not from the same population. We rarely employ the same cautious vocabulary as the statistician, who might qualify this interpretation. The researcher wrongly takes a probability of 0.05 (i.e., 5%, or 1 in 20) to indicate that the null hypothesis is false, that there is therefore a genuine difference, and that the result can be reliably replicated (1). The more cautious statistician would argue that the findings are consistent with that conclusion, but are not unequivocal. Indeed, this is so: if a single experiment just meets the level of significance, it is just as likely NOT to give a significant result if the same experiment were repeated. It's a bit like exams: the marks of students who just fail are often considered carefully, in case the examiners have been too severe, but the students who scrape through are illogically allowed to pass without further scrutiny. In testing our results, we accept that we may conclude that an intervention has had a "real effect" when in fact we may be wrong 5% of the time: this is the type I error rate (Figure 1).
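The type I error rate can be made concrete with a small simulation, a sketch in which every "experiment" compares two samples drawn from the same population, so the null hypothesis is true by construction. All numbers here are invented for illustration; the critical t of 2.101 is the two-tailed 5% value for 18 degrees of freedom.

```python
import random
import statistics

def two_sample_t(x, y):
    """Pooled two-sample t statistic."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * statistics.variance(x)
           + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / (sp2 * (1/nx + 1/ny)) ** 0.5

rng = random.Random(42)                   # fixed seed so the sketch is reproducible
trials, false_positives = 5000, 0
for _ in range(trials):
    # Both samples come from the SAME population: any "significant" result is a false positive.
    a = [rng.gauss(0, 1) for _ in range(10)]
    b = [rng.gauss(0, 1) for _ in range(10)]
    if abs(two_sample_t(a, b)) > 2.101:   # critical t, 18 df, two-tailed P = 0.05
        false_positives += 1
print(false_positives / trials)           # close to 0.05
```

Run repeatedly with different seeds, the proportion of false positives settles near the nominal 5%.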
Usually scientists do one experiment at a time, or at least they think they do: the experiment asks the question “has this treatment had an effect?”, and if P < 0.05 we accept that we have found an effect. However, in many experiments a variety of factors such as time, expense, resources and concern about animal use may lead to a single study asking several questions. Each of these may require statistical testing. As soon as more answers are sought from the same data, the trustworthiness of the answers can change. The error rate in the experiment as a whole (experimentwise error rate) will increase, and become greater than the error rate in each comparison (comparisonwise error rate).
However, the context of experiments and tests can vary, and context affects the logic of the tests. Yet again, statistical terminology can be confusing, and clear definitions are often lacking. Suppose we go back to our Californian frogs, and choose to study samples supplied by a dealer. She assures us that these are random samples from seven different counties: Alpine, Butte, the well-known Calaveras county, Del Norte, El Dorado, Fresno and Humboldt (California lacks a county with the initial G). We measure how far they can jump, and use analysis of variance to assess the possibility that performance may differ, according to origin (Figure 2).
The ANOVA (analysis of variance) test assesses the possibility that all these samples have come from a single population. This is an "omnibus" test, considering all the results together and comparing the variation (and potential differences) between and within all the groups. It's a good name: we load all the sets of data into the omnibus and test them together. Subsequently we may choose to make a "family" of comparisons between the data sets present in the "omnibus". "Family" is a more difficult statistical concept, and often loosely used, with different authors expressing different opinions. Ludbrook suggested "A family of hypotheses is all those actually tested on the results of a single experiment" and also that a family is "all those experimental observations that could be analysed statistically by a global procedure" (such as an omnibus test) (5). Perhaps it's as well to bear in mind that data families, like social ones, can breed trouble.
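The arithmetic behind the omnibus test can be sketched in a few lines: the F statistic is the ratio of between-group to within-group variance. The jump distances below are invented purely to show the mechanics; a real analysis would compare the resulting F value against the F distribution to obtain a P value.

```python
from statistics import mean

def one_way_anova_F(groups):
    """F statistic for a one-way ANOVA: between-group vs. within-group variance."""
    k = len(groups)                                  # number of groups
    N = sum(len(g) for g in groups)                  # total observations
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (N - k))

# Invented jump distances for three counties, just to show the calculation:
print(one_way_anova_F([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))  # → 3.0
```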
Looking at our results, we're surprised that ANOVA suggests there is no evidence of a difference between the groups. We could have sampled data like this, or even more extreme, about 40% of the time. This can happen: differences can be detected with further tests that are not shown up by an omnibus test. We conducted this study with the suspicion that there could be a difference here, because one group comes from Calaveras county. So we start to compare the groups in pairs. (We ignore more complex possibilities; it's possible that we might want to compare northern counties with southern counties, and so on.) There are 21 ways to conduct simple pairwise comparisons: the family is shown in Figure 3.
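The size of this family follows from simple combinatorics: with 7 groups there are 7 × 6 / 2 = 21 unordered pairs, which can be enumerated directly (county names as in our example).

```python
import itertools
import math

counties = ["Alpine", "Butte", "Calaveras", "Del Norte",
            "El Dorado", "Fresno", "Humboldt"]
pairs = list(itertools.combinations(counties, 2))   # each unordered pair once
print(len(pairs), math.comb(7, 2))                  # → 21 21
```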
Here we have 21 comparisons (the comparison of A vs. B yields the same result as B vs. A, so reciprocal pairs do not need separate comparisons), and repeated comparisons alter the overall error rate. Table 1 gives the results in order of P value.
Two of the comparisons in the left-hand column would be "significant" if they had been tested individually. In a single comparison, when we take P = 0.05 as a threshold probability, we know our conclusion could be wrong. Nevertheless, we accept this possibility since, in the long run, our decision would be wrong only once in 20 times. This is the "comparisonwise" error rate, and it is the same as the probability that we set as our threshold for "significance". In contrast, if we make 21 comparisons, the risk that at least one of these several comparisons could lead to an error increases substantially. The risk of error, in several comparisons, is shown in Figure 4.
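The shape of the curve in Figure 4 follows from a simple identity: if each of k independent comparisons carries a 0.05 error rate, the chance of at least one error is 1 − 0.95^k. (Real pairwise comparisons are not fully independent, so this is an approximation.)

```python
def familywise_error_rate(k, alpha=0.05):
    """Chance of at least one false positive in k independent tests."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 21):
    print(k, round(familywise_error_rate(k), 2))   # 21 tests → roughly two in three
```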
The likely error in a family of tests increases with each set of data, up to the point where, with seven sets of data, there are 21 tests and the overall risk of error is approximately 0.66. This is the experimentwise or familywise error rate. The usual solution proposed for the problem of multiple tests is to impose a more stringent threshold for "significance". The advantage is that we are less likely to make false positive classifications. The disadvantage is equally clear: with a more stringent criterion, we will fail to detect occasions where the null hypothesis is not "true". In other words, false negatives will become more common (Figure 5).
In the case we are considering here, we predict that the Calaveras frogs will be better jumpers, so we conduct a complex comparison between Calaveras versus all the others. This would be an "a priori" test. Some authorities would consider a comparison of this sort to be acceptable with no change to the test criterion. We move into considerations of design, motive and the need to balance the risks of confirming bias, missing interesting or important new information, or making a decision with insufficient evidence. In many instances, scientific papers are brimful of two-way comparisons, and we cannot be sure that we are not simply being presented with a chance finding that has resulted from a succession of comparisons, as the authors search for something positive to report.
Naturally, many studies present several experiments: for example, there may be experiments to show that a gene has been deleted, that it does not generate RNA, that the receptor protein is absent, and that stimulation is ineffective. These are all separate experiments and can legitimately be assessed without adjusting the test criterion. Indeed, a single "a priori" test is a rare event.
However, multiple testing can often be avoided. There are sensible steps that can be taken to avoid conducting a plethora of comparisons. One is to combine data into a summary statement or expression. A simple and obvious example might be a dose-response curve (3, 5), or a growth trajectory, where separate groups of data can be summarized into a single relationship. More subtly, this is the principle found in the interaction term in the MANOVA (multivariate analysis of variance) procedure. However, MANOVA itself is a "multiple" test, since it can yield several F values (each of which represents a test result). Another safeguard is to set a composite hypothesis that excludes contradictory findings: for example, both A and B have to be better, or A has to be better and B not worse. (The latter is a non-inferiority comparison.)
A different approach is to use a procedure that does not control the familywise error rate but instead controls the false discovery rate: the rate at which positive conclusions turn out to be false. These simple procedures use the P values generated by each test, and are particularly useful because the results from different types of tests can be considered together. The primary comparisons are well sustained, even if a lot of additional tests are done (2).
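One widely used procedure of this kind, named here as an illustrative choice since the text does not specify one, is the Benjamini-Hochberg step-up method: sort the P values, and reject every hypothesis up to the largest rank i whose P value is at most (i/m) × q, where m is the number of tests and q the chosen false discovery rate.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected while controlling the false discovery rate at q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices, smallest P first
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff = rank          # largest rank still under its threshold
    return [order[r] for r in range(cutoff)]

print(benjamini_hochberg([0.01, 0.396, 0.02, 0.205, 0.165]))  # → [0, 2]
```

Note that it works on the P values alone, so results from t tests, ANOVA and nonparametric tests can be pooled into one list.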
Look again at the P values in Table 1. The right-hand column shows more rigorous P value thresholds. The threshold P value chosen for a single test has been divided by the total number of tests: for all 21 comparisons, the corrected value is 0.05/21, which is ∼0.0024. If the smallest P value in the left-hand column were less than this, the next larger P value would be compared with a threshold corrected using N − 1, i.e., 20, and so on. This gives a sequence of thresholds in the right-hand column that is progressively more lenient. We find that there are no significant comparisons in our table. The final threshold is of course 0.05 (0.05/1), but this is far less than the P value of the 21st comparison, which is 0.99. (This method is a Ryan-Holm step-down procedure.)
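The step-down logic can be sketched as a plain Holm procedure (the Ryan-Holm variant discussed by Ludbrook (5) follows the same scheme; any further adjustment for correlated comparisons is omitted here): the smallest P value is tested against α/m, the next against α/(m − 1), and testing stops at the first failure.

```python
def holm_step_down(p_values, alpha=0.05):
    """Number of hypotheses rejected by the Holm step-down procedure."""
    m = len(p_values)
    rejected = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        # Smallest P vs. alpha/m, next vs. alpha/(m - 1), ... last vs. alpha.
        if p <= alpha / (m - rank + 1):
            rejected += 1
        else:
            break   # once one test fails, all larger P values fail too
    return rejected

print(holm_step_down([0.001, 0.04, 0.01, 0.03]))  # → 2
```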
If such techniques are not used, then one of the many methods for dealing with multiple comparisons will be required to reduce the impact of an elevated false discovery rate. There are many of these, too many to describe in a short introduction, and there is far from total agreement over which tests are best. Some books suggest that with unplanned comparisons the conclusions should be graded, according to the differences found, into occasions where the null hypothesis should be retained, occasions where it should be rejected, and an intermediate group of "not proven" verdicts. Others, more lenient, suggest that if the data are really being "explored" then correction for multiple tests may not be needed, or that if there is an "a priori" proposal, that particular test need not be corrected, although further tests should be. The most stringent verdict is that all comparisons should be corrected and analysis should be conducted "independent of expectation". Horton exemplified the pitfalls of raking through the coals of an experiment to find undiscovered treasures (4). The authors of a paper were asked to conduct an unplanned comparison in subgroups of subjects, because the assessors thought there were features of interest. They agreed, on the understanding that the first feature they would analyse was the star sign of the participants, and they showed how this could be statistically interpreted as an important factor in drug response.
We have now discovered that our frog dealer was a fraud: all the frogs in our example were sampled randomly from the same population. We should have realized!
DISCLOSURES
No conflicts of interest, financial or otherwise, are declared by the author(s).
AUTHOR CONTRIBUTIONS
Author contributions: G.B.D. prepared figures; G.B.D. and S.L.V. drafted manuscript; G.B.D. and S.L.V. edited and revised manuscript; G.B.D. and S.L.V. approved final version of manuscript.
Footnotes

This article is covered by a non-exclusive license between the authors and the Nutrition Society (London, UK) and is being simultaneously published in 2011 in The Journal of Physiology, Experimental Physiology, British Journal of Pharmacology, Advances in Physiology Education, Microcirculation, and Clinical and Experimental Pharmacology and Physiology as part of a collaborative initiative among the societies that represent these journals.
 Copyright © 2012 The American Physiological Society
Licensed under Creative Commons Attribution CC BY 3.0: the American Physiological Society.