Some disturbing news was reported in the March 2005 issue of *Significance*, a publication of the Royal Statistical Society (7). Based on a variety of audits, it was found that 38% of the articles published in 2001 in *Nature* contained some statistical “incongruence”: a disparity between reported test statistics (*t*-tests, *F*-tests, etc.) and their corresponding *P* values. The *British Medical Journal* (BMJ) fared a little better, with 25% of the articles containing at least one incongruence. A subsequent audit of *Nature Medicine* for 2000 “showed clear evidence that the authors did not even understand the meaning of *P* values.”

With understated regret, the *Significance* article notes that, although all *Nature* journals will eventually adopt guidelines regarding the use of statistics, “the journal will not introduce a statistical review process.” The approach taken by BMJ's Deputy Editor seems to reflect philosophical resignation: “Research done at the BMJ shows that peer reviewers identify only a minority of major errors in a manuscript–so what hope is there of them identifying these minor ones?” This seems only a short step away from saying “Why bother? And, besides, it doesn't matter anyway.”

Well, of course it does matter. Faulty statistical analyses can result in wasted research resources and, worse still, compromise the health of research animals, human subjects, and the ultimate recipients of therapies. The American Physiological Society (APS) is therefore to be commended for taking on this issue. What precisely should be done about the matter is a greater challenge, however. In 2004, Curran-Everett and Benos leapt into the fray by providing 10 guidelines regarding statistical approaches (4); they now report that nobody appears to be heeding their advice but that, at least, they heard no grousing from statisticians. Of course, the fact that they heard no grousing doesn't mean that it doesn't exist, and so I'll go on record: although I believe that their intentions are good, in some ways I fear that they have just muddled the issue further.

They're just guidelines, right? So what's the problem? Let me digress for a moment. For the most part, I think statistics is poorly taught. Not that the instructors are bad, necessarily, but rather that we aim too high. Realistically, we might hope that a student who has finished a one-semester, introductory course is able to read reports and statistics cited in a newspaper more critically. With loftier goals, we try to jam in more information, and this leads to a formulaic, algorithmic approach: if the data look like this, then the analysis should be that; if the *P* value falls below this particular value, then you should reject the null hypothesis; if you have two groups, it's a *t*-test; three groups and it's ANOVA, etc. We bypass the complexities and leave students believing that, upon the completion of the semester, they are now qualified to “do statistics.” That's hopelessly naïve; most experiments yield data far more complex than can be handled by a semester or two of statistics.

Of course, an algorithmic approach is easier to learn and teach (and grade), but it ultimately does not serve the practitioner well. Caveats notwithstanding, I fear Curran-Everett and Benos are reinforcing this perspective. Too many of these guidelines have a nearly algorithmic, dictatorial sound: “the right thing is to do *x*.” This is too easily transmuted into rules that take the form of “if you don't do *x*, then it's wrong,” followed by a citation to the guidelines. I have, on a number of occasions, had precisely this happen: a reviewer, not knowing a lot of statistics, but knowing the guidelines well, will reject a manuscript because the statistics are “wrong.” Well, no, they're not wrong, but they involve subtleties that the guidelines don't cover. In an attempt to clarify and simplify, the guidelines actually prove to be an annoying, if not worse, tool in the hands of the nonexpert.

Despite this potential for annoyance, I must say that some of the guidelines I absolutely endorse. I can't, for example, object to *guideline 1* (Consult a statistician), although I think a critical point of having a statistician involved in project and experimental design is that it can lead to simpler, clearer analyses and interpretations. (I also appreciate and agree with the notion that the statistician “can help;” I'm sorry that some statisticians take it upon themselves to be officious, instead.) *Guideline 3* (Identify your methods, references, and software), *guideline 7* (Report precise *P* values), and the interpretations of *P* values in Table 1 all sit well. [Some of these points can also be found in the elegant, less prescriptive 1988 offering by the preeminent statisticians Mosteller and Bailar (2).]

From this point the waters get deep. Curran-Everett and Benos describe their guidelines as representing best practices in statistics. I cannot agree; certainly they would not receive uniform support among statisticians. That great pioneer of statistics, Fisher, would certainly appreciate *guideline 7* but would abhor *guideline 2* (Use a predetermined critical value). *Guideline 2* reflects more the philosophical approach of Neyman and Pearson, also giants in the field, who specifically advocated the use of critical values like α. [There was mutual animosity between Fisher and the Neyman-Pearson school, as evidenced by their occasionally intemperate remarks on these issues (5, 6).] In a curious ecumenicism, however, the authors' interpretation that “if the *P* value is less than α, then the experimental effect is likely to be real” fits nicely with the Bayesian school but is a patently incorrect interpretation of a *P* value from either a Neyman-Pearson or Fisherian view.

*Guideline 10* (Interpret each main result with a confidence interval and precise *P* value) repeats the error of interpretation and compounds the problem with poor advice: “If either bound of the confidence interval is important from a scientific perspective, then the experimental effect may be large enough to be relevant.” The “may be” leaves the authors an out, but the critical issue is that, if the sample is too small, then the confidence interval will be large, thus *increasing* the chance that one of the bounds of the confidence intervals is, apparently, “important from a scientific perspective.” This confusion underlies *guideline 6* (Report uncertainty about scientific importance using a confidence interval) as well. Although I appreciate the technical accuracy of the definition of a confidence interval, the authors oversimplify. It is correct to say that a confidence interval characterizes uncertainty about the true value of a parameter but incorrect to say that it provides an assessment of scientific importance.
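The point above can be made concrete with a small sketch (the effect size, SD, and sample sizes here are invented for illustration): holding the estimated effect and its SD fixed, shrinking the sample widens the 95% confidence interval, so one of its bounds can cross a “scientifically important” threshold purely because the study was small.

```python
# Sketch: a small sample inflates the CI, so a bound can look "important"
# even when the estimated effect itself is modest. Numbers are hypothetical.
import math
from statistics import NormalDist

def ci_95(mean, sd, n):
    """Approximate 95% confidence interval for a mean (normal quantile)."""
    z = NormalDist().inv_cdf(0.975)      # ~1.96
    half_width = z * sd / math.sqrt(n)
    return (mean - half_width, mean + half_width)

# Same estimated effect (mean 1.0, SD 5.0); only the sample size differs.
small = ci_95(1.0, 5.0, 4)      # wide interval from n = 4
large = ci_95(1.0, 5.0, 100)    # narrow interval from n = 100
print(small)
print(large)
```

With n = 4 the upper bound reaches nearly 6, far beyond the estimate of 1; with n = 100 it stays below 2. A bound that is “important from a scientific perspective” may thus reflect imprecision, not a relevant effect.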

Both in their advice regarding critical values and confidence intervals, Curran-Everett and Benos have overstated what these things actually provide, perhaps in favor of what they wish they would provide. The discussion of *guideline 6* leads to another issue, by the way: for a data graphic, they advise using a confidence interval. Not everyone agrees; it depends in part on the intent, as noted by Andrews et al. (1).

I have a minor quibble with *guideline 4* (Control for multiple comparisons), whose principle I support completely. However, in 1973, Carmer and Swanson made it pretty clear that the Newman-Keuls procedure is inferior to the protected least-significant difference procedure (3). Mentioning an outmoded technique risks being read as an endorsement of it.

We're left with *guideline 5* (Report variability using the SD), which I understand raised a lot of concerns among some APS members. That's understandable: it appears to go against tradition. My concern is that it might equally go against logic. Yes, the SD does indicate variability of a single observation. But it's actually the rare instance where we want to know such a thing. Such an instance can arise when we want to develop normal ranges for some diagnostic result: it's useful to know the SD of systolic blood pressure, because then we can calculate a 95% interval and, more importantly, know when someone's blood pressure lies outside the “normal range.”
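The normal-range use of the SD described above can be sketched as follows (the blood-pressure mean and SD here are invented for illustration): if a measurement is roughly normally distributed, mean ± 1.96 × SD brackets about 95% of individuals, and a reading outside that band is flagged as unusual.

```python
# Sketch of a "normal range" built from the SD (hypothetical numbers):
# assume systolic blood pressure ~ Normal(mean=120, SD=10).
from statistics import NormalDist

bp = NormalDist(mu=120, sigma=10)                 # assumed population
low, high = bp.inv_cdf(0.025), bp.inv_cdf(0.975)  # central 95% range

def outside_normal_range(reading):
    """True if a single observation falls outside the 95% range."""
    return reading < low or reading > high

print((low, high))
print(outside_normal_range(145), outside_normal_range(125))
```

Here the SD is exactly the right quantity, because the question concerns a single observation relative to the spread of individuals, not the precision of a group mean.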

But, in most circumstances in reporting experimental results, we want to compare means of different groups, and the immediately relevant quantity is the SE. Yes, citing the mean ± SE only gives a 68% confidence interval, but a quick trick is that the mean ± 2 × SE gives a good approximation to a 95% confidence interval. And, although it's true that the SD and SE are related, performing the mental math to convert SD to SE is nontrivial. But why not make a more useful recommendation: report, on a graph or in text, the SD or SE, depending on what information you want to convey. If you want to compare means, use SE; if you want to look at variation among a group of individuals, use SD. For other purposes, you might use something else again (1).
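The SD-versus-SE distinction above can be shown with a few lines (the observations here are made up): the SD describes spread among individuals, the SE (SD/√n) describes uncertainty in the group mean, and mean ± 2 × SE approximates a 95% confidence interval for that mean.

```python
# Sketch: SD vs. SE for a small hypothetical sample of measurements.
import math
from statistics import mean, stdev

observations = [118, 124, 131, 115, 127, 122, 129, 120]  # invented data
n = len(observations)
m = mean(observations)
sd = stdev(observations)        # variability of a single observation
se = sd / math.sqrt(n)          # uncertainty of the group mean
approx_ci = (m - 2 * se, m + 2 * se)   # rough 95% CI for the mean

print(m, sd, se)
print(approx_ci)
```

The choice between reporting SD and SE then follows directly from the question being asked: comparing means calls for the SE, describing variation among individuals calls for the SD.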

I truly appreciate the authors' goals: to improve the quality of statistics in the journals and, of course, in the accompanying science. I also have sympathy for authors, reviewers, and editors. Who doesn't want a short, simple set of rules? Unfortunately for the nonspecialist, the field of statistics is deep, complex, and evolving. In each of the last several years, for example, the journal *Statistics in Medicine* has been publishing about 4,000 pages of articles focused principally on the development of new methods for analyzing data. Over the last 20 years or so, the manual for the statistical analysis component of SAS has expanded from about 1,000 pages to roughly 5,000 pages. Aspects of apparently routine methods such as regression and ANOVA are under continuous refinement, and methods employed today are often quite different from those used even a few decades ago. It is difficult to keep up.

So what to do? I have two suggestions. First, I think the APS editorial board should indicate in their instructions that authors are responsible for ensuring that the statistics in articles are correct, appropriate, follow modern practices, and are well presented, and that the statistics are subject to review. And mean it. This is effectively what the Food and Drug Administration does in drug trials and what the National Institutes of Health does in evaluating grant applications involving human subjects. The burden is on the author, of course, to ensure that there is adequate quality in the work.

Second, I think we need to move away from the idea that statistics is a technical tool, like a pH meter, and recognize that it is a scientific discipline, requiring considerable training, skill, and practice. We recognize more and more the need for interdisciplinary teams to solve research problems, and so, for example, a given project might benefit from the input of a neurochemist, a muscle physiologist, an expert in ion channel function, and a proteomicist. Each contributes their own expertise, each acknowledges their own limitations, and the combined effort is a superior product. Journals do not need to provide detailed guidelines on how to conduct or report the research insofar as high standards result from the review process. So why not treat statisticians on an equal footing, both as collaborators and reviewers? All too often there is an asymmetry in the way that statistics are treated in a research project compared with, say, biochemistry, genomics, etc.

Another curious asymmetry arises in how we handle data. It is curious to me that researchers might spend months collecting tissue samples, say, and then many more months performing assays on these samples, but the actual data analysis often gets short shrift. Why? Why, given the extensive resources and time it takes to collect the data, do some people expect to be able to do the analysis in an afternoon? Why would they want to?

In 2000, in its endorsement of the Mathematical Association of America *Guidelines for Programs and Departments in Undergraduate Mathematical Sciences*, the American Statistical Association noted that “Generic packages such as Excel are not sufficient even for the teaching of statistics, let alone for research and consulting” (http://www.amstat.org/Education/index.cfm?fuseaction=ASAendorsement). Numerous other articles document numerical errors, misstatements, and various weaknesses in that software. Why is there acceptance, therefore, of using mediocre analysis tools in statistics when high standards are held for the use of other tools and techniques in the sciences?

I don't think it is needlessly idealistic to think that statistics can be an equal partner in the sciences. I have seen this model work countless times, and I have had the good fortune to participate in it. Despite my quibbling with the details of the suggestions by Curran-Everett and Benos, I have complete sympathy with their efforts to improve the situation. However, rather than promulgating a handful of guidelines, I believe it will take support from APS members as a whole, and not just a few advocates, to make positive change.

- © 2007 American Physiological Society