
It will be much harder to call new findings ‘significant’ if this team gets its way

A megateam of reproducibility-minded scientists is renewing a controversial proposal to raise the standard for statistical significance in research studies. They want researchers to dump the long-standing use of a probability value (p-value) of less than 0.05 as the gold standard for significant results, and replace it with the much stiffer p-value threshold of 0.005.

Backers of the change, which has been floated before, say it could dramatically reduce the reporting of false-positive results—studies that claim to find an effect when there is none—and so make more studies reproducible. And they note that researchers in some fields, including genome analysis, have already made a similar switch with beneficial results.

“If we’re going to be in a world where the research community expects some strict cutoff … it’s better that that threshold be .005 than .05. That’s an improvement over the status quo,” says behavioral economist Daniel Benjamin of the University of Southern California in Los Angeles, first author on the new paper, which was posted 22 July as a preprint article on PsyArXiv and is slated for an upcoming issue of Nature Human Behaviour. “It seemed like this was something that was doable and easy, and had worked in other fields.”

But other scientists reject the idea of any absolute threshold for significance. And some biomedical researchers worry the approach could needlessly drive up the costs of drug trials. “I can’t be very enthusiastic about it,” says biostatistician Stephen Senn of the Luxembourg Institute of Health in Strassen. “I don’t think they’ve really worked out the practical implications of what they’re talking about.”

A fraught value

The p-value is a notoriously elusive concept for nonstatisticians. Too often, it is misinterpreted as the probability that the hypothesis being tested is true, says Valen Johnson, a statistician at Texas A&M University in College Station and an author on the new paper. The reality is more complicated. For a test of a new drug in a clinical trial, for example, a p-value of 0.05 really means the results observed—or even more extreme results—would occur in one in 20 trials if the drug really had no benefit over the current standard of care. But it’s often wrongly described as a 95% chance that the drug actually works.
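That "one in 20" definition can be checked directly by simulation. The sketch below is illustrative only (it is not from the paper): it runs many simulated trials in which the drug truly has no benefit, computes a two-sided z-test p-value for each, and counts how often p falls below 0.05 anyway—roughly 1 trial in 20.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(0)
norm = NormalDist()  # standard normal, for the z-test p-value

n, trials = 30, 10_000
false_positives = 0

for _ in range(trials):
    # One simulated "trial" where the drug truly has no effect:
    # n patient outcomes drawn from a standard normal (mean 0, sd 1).
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(sample) / n) * sqrt(n)      # z-statistic (known sd = 1)
    p = 2 * (1 - norm.cdf(abs(z)))       # two-sided p-value
    if p < 0.05:
        false_positives += 1

print(false_positives / trials)  # close to 0.05
```

Even with no real effect anywhere, about 5% of trials clear the .05 bar—which is exactly what the threshold promises, and why a single result at p = .05 is weaker evidence than it sounds.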

To explain to a broader audience how weak the .05 statistical threshold really is, Johnson joined with 71 collaborators on the new paper (which partly reprises an argument Johnson made for stricter p-values in a 2013 paper). Among the authors are some big names in the study of scientific reproducibility, including psychologist Brian Nosek of the University of Virginia in Charlottesville, who led a replication effort of high-profile psychology studies through the nonprofit Center for Open Science, and epidemiologist John Ioannidis of Stanford University in Palo Alto, California, known for pointing out systemic flaws in biomedical research.

The authors set up a scenario where the odds are one to 10 that any given hypothesis researchers are testing is inherently true—that a drug really has some benefit, for example, or a psychological intervention really changes behavior. (Johnson says that some recent studies in the social sciences support that idea.) If an experiment reveals an effect with an accompanying p-value of .05, that would actually mean that the null hypothesis—no real effect—is about three times more likely than the hypothesis being tested. In other words, the evidence of a true effect is relatively weak.

But under those same conditions, and assuming studies have 100% power to detect a true effect, requiring a p-value at or below .005 instead of .05 would make for much stronger evidence: It would reduce the rate of false-positive results from 33% to 5%, the paper explains.
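The arithmetic behind those two figures is short enough to verify. The sketch below (the function name is ours) uses the paper's stated assumptions—prior odds of 1:10 that a tested hypothesis is true, and 100% power—and asks what share of "significant" results are false positives at each threshold. For every true hypothesis there are 10 null ones; the nulls slip past the threshold at rate alpha, while the true effect is always detected.

```python
def false_positive_share(alpha, power=1.0, odds_true=1 / 10):
    """Share of 'significant' results that are false positives,
    given prior odds that a tested hypothesis is true and the
    power to detect a true effect."""
    nulls_per_true = 1 / odds_true        # 10 null hypotheses per true one
    false_hits = nulls_per_true * alpha   # nulls that cross the threshold
    true_hits = 1 * power                 # true effects detected
    return false_hits / (false_hits + true_hits)

print(round(false_positive_share(0.05), 2))    # 0.33
print(round(false_positive_share(0.005), 2))   # 0.05
```

At alpha = .05, the 10 nulls contribute 0.5 expected false positives against 1 true positive, so a third of all "discoveries" are spurious; tightening to .005 shrinks that to about 1 in 20.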

“The whole choice of .05 as a default is really a kind of numerology—there’s no scientific justification for it,” says Victor De Gruttola of the Harvard School of Public Health in Boston. The paper “exposes that there can be a false sense of security with the .05 default.” He doubts the results will be news to statisticians, “but I think a lot of investigators whose primary focus is not on these kinds of issues may be surprised.”

Significant, or just suggestive?

The authors are careful not to endorse the use of p-values as the ultimate measure of significance; many scientists have argued that they should be abolished altogether. But in the many fields where a p-value below .05 has become a gold standard, the authors propose a rule of thumb for new findings: “Significant” results should require a p-value below .005; results with p-values below .05 but above .005 should be called merely “suggestive.”

Even supporters of the study—and some of its authors—are wary of any absolute threshold. 

De Gruttola points out that the right cutoff for significance depends on what evidence already exists for the hypothesis being tested, and the relative consequences of acting on a false-positive or a false-negative result. “Would you be using the wrong toothpaste” if you act on a false result, he asks, “or would you be getting the wrong drug for a serious illness?” Still, he’s confident that a .005 significance threshold is preferable to .05.

But not everyone is on board. Psychologist Timothy Bates of the University of Edinburgh, in a response on publishing platform Medium, called the proposal “a risky distraction” from the root causes of irreproducible results. Downgrading a finding from “significant” to “suggestive” wouldn’t change which results get published or how they’re generally interpreted, he argued. And it wouldn’t address many other practices linked to irreproducible results: poor study design, a bias toward publishing positive results, and the practice of “p-value hacking”—fishing for significant-looking results from a huge number of hypotheses. (The authors acknowledge that their solution is only one step among many necessary to make published studies more reproducible.)

Researchers focused on drug development have another big misgiving: The new standard could force up the required size of a trial by as much as 70%, according to the authors’ estimates. “If you’re a pharma company … you’re going to miss quite a lot of reasonable drugs, maybe, simply because you’re not going to have the resources to look at as many drugs,” Senn says. “Anything you don’t study has a sample size of zero.”
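That roughly 70% figure can be reproduced from the standard sample-size formula for a two-sided z-test, where the required n scales with (z_alpha/2 + z_power)^2 for a fixed effect size. A sketch, assuming the conventional 80% power (the paper's own calculation may use different assumptions):

```python
from statistics import NormalDist

ppf = NormalDist().inv_cdf  # standard normal quantile function

def n_factor(alpha, power=0.8):
    """(z_{alpha/2} + z_{power})^2: proportional to the sample size
    a two-sided z-test needs to detect a fixed effect size."""
    return (ppf(1 - alpha / 2) + ppf(power)) ** 2

increase = n_factor(0.005) / n_factor(0.05) - 1
print(f"{increase:.0%}")  # roughly 70%
```

Moving the critical value from about 1.96 to about 2.81 inflates the squared sum by a factor of about 1.7—hence Senn's concern that the same research budget would cover substantially fewer trials.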

The authors, however, see the silver lining in that scenario: Fewer resources would be wasted on studies following up on false-positive results. And they’re careful not to argue that a p-value above .005 should be a death knell for publication or following up on a hypothesis. Their main message, Benjamin says, is that a p-value of 0.05 is much weaker evidence than most researchers realize. “If this paper helps to spread that message, then that is a major win for people’s understanding of empirical evidence.”