“It was incredible” to see how the online paper evolved, says Daniël Lakens, who led the effort. “It worked like a charm.”

Nearly 100 scientists spent 2 months on Google Docs to redefine the p-value. Here’s what they came up with

Psychologist Daniël Lakens of Eindhoven University of Technology in the Netherlands is known for speaking his mind, and after he read an article titled “Redefine Statistical Significance” on 22 July 2017, Lakens didn’t pull any punches: “Very disappointed such a large group of smart people would give such horribly bad advice,” he tweeted.

In the paper, posted on the preprint server PsyArXiv, 70 prominent scientists argued in favor of lowering a widely used threshold for statistical significance in experimental studies: The so-called p-value should be below 0.005 instead of the accepted 0.05, as a way to reduce the rate of false positive findings and improve the reproducibility of science. Lakens, 37, thought it was a disastrous idea. A lower α, or significance level, would require much bigger sample sizes, making many studies impossible. Besides, he says, “Why prescribe a single p-value, when science is so diverse?”
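To get a feel for the sample size objection, here is a minimal sketch using the textbook normal approximation for a two-sided comparison of two group means. The effect size (0.5 standard deviations) and the 80% power target are illustrative assumptions, not figures from either paper, and `n_per_group` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(alpha, power=0.80, effect_size=0.5):
    """Approximate participants needed per group for a two-sided,
    two-sample z-test (normal approximation).

    effect_size is the standardized mean difference; 0.5 and 80% power
    are assumed values chosen only for illustration.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the threshold
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for alpha in (0.05, 0.005):
    print(f"alpha={alpha}: about {n_per_group(alpha)} participants per group")
```

Under these assumptions, the stricter threshold calls for roughly 70% more participants per group (about 107 instead of 63). The exact numbers shift with the assumed effect size and power.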

Lakens and others will soon publish their own paper to propose an alternative; it was accepted on Monday by Nature Human Behaviour, which published the original paper proposing a lower threshold in September 2017. The content won’t come as a big surprise—a preprint has been up on PsyArXiv for 4 months—but the paper is unique for the way it came about: from 100 scientists around the world, from big names to Ph.D. students, and even a few nonacademics writing and editing in a Google document for 2 months. 

Lakens says he wanted to make the initiative as democratic as possible: “I just allowed anyone who wanted to join and did not approach any famous scientists.”

P-values are a notoriously difficult concept to grasp and are often misinterpreted, but the original paper’s message was clear: A p-value just below the 0.05 threshold, or α, is much weaker evidence that the results aren’t wrong than people think; lowering the threshold makes studies stronger. After the preprint came out, Lakens created a Google document titled “Justify Your Alpha: A Response to ‘Redefine Statistical Significance’” with 12 discussion points, including “Should we comment on or ignore this recommendation?” and “What are the potential negative effects of this redefinition of statistical significance?” Close to 150 scientists weighed in, and the document ballooned to 100 pages.
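One common way to illustrate why a 0.05 threshold can be weaker evidence than it looks is to ask what share of "significant" findings would be false positives. The sketch below is a simplified back-of-the-envelope calculation, not the analysis from the "Redefine" paper. It assumes that only 1 in 11 tested hypotheses is true and that studies have 80% power; both values, and the helper name `false_positive_risk`, are illustrative assumptions.

```python
def false_positive_risk(alpha, power=0.80, prior=1 / 11):
    """Share of significant results that are false positives.

    prior is the assumed fraction of tested hypotheses that are true;
    power is the assumed chance of detecting a true effect.
    Both defaults are illustrative, not taken from the article.
    """
    false_positives = alpha * (1 - prior)  # true nulls that pass the threshold
    true_positives = power * prior         # real effects that pass the threshold
    return false_positives / (false_positives + true_positives)


for alpha in (0.05, 0.005):
    risk = false_positive_risk(alpha)
    print(f"alpha={alpha}: {risk:.0%} of significant results are false positives")
```

With these assumed inputs, about 38% of significant results at the 0.05 threshold would be false positives, compared with about 6% at 0.005. A more optimistic prior narrows the gap considerably.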

The diversity among participants was striking, says Lakens, with less prestigious institutes well represented, and many contributors sharing their personal experiences. Some argued that they could not afford to set up the large studies needed to meet the new standard or were unable to recruit enough study participants. Some said the lower α could force researchers to resort to so-called “convenience samples,” such as undergrad students, or move studies online. Critics also noted that larger studies are less likely to be replicated, and a more stringent α could make researchers more risk-averse and less likely to take on hard questions.

But perhaps the main argument, the participants agreed, was that 0.005 is just as arbitrary as 0.05, and that the threshold depends on what is already known about a topic and the risks associated with getting a wrong answer. One might accept a higher chance of a false positive result in a preliminary study, for instance, whereas a drug trial might require a lower p-value.
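The idea that the right threshold depends on prior knowledge and on the cost of each kind of error can be sketched as a small optimization: pick the α that minimizes the expected cost of mistakes for a given study. This is one possible reading of "justifying" a threshold, not the procedure the authors prescribe. The costs, the prior, the sample size, the effect size, and the helper names below are all hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()


def power(alpha, n_per_group, effect_size):
    """Approximate power of a two-sided, two-sample z-test
    (ignores the negligible rejection probability in the opposite tail)."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(effect_size * (n_per_group / 2) ** 0.5 - z_crit)


def cheapest_alpha(cost_fp, cost_fn, prior=0.5, n_per_group=50, effect_size=0.5):
    """Grid-search the alpha in [0.001, 0.2] with the lowest expected error cost.

    cost_fp and cost_fn are the assumed costs of a false positive and a
    false negative; prior is the assumed probability that the effect is real.
    """
    def expected_cost(alpha):
        beta = 1 - power(alpha, n_per_group, effect_size)  # false negative rate
        return cost_fp * alpha * (1 - prior) + cost_fn * beta * prior

    grid = [i / 1000 for i in range(1, 201)]
    return min(grid, key=expected_cost)


# Preliminary study: both kinds of error are assumed to be equally costly.
print("exploratory:", cheapest_alpha(cost_fp=1, cost_fn=1))
# Drug trial: a false positive is assumed to be ten times as costly.
print("drug trial: ", cheapest_alpha(cost_fp=10, cost_fn=1))
```

In this toy setup, the exploratory study ends up with a threshold well above 0.05 and the drug trial with one well below it, which mirrors the participants' point that no single value fits every study.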

Lakens extracted the gist of the discussions in a new Google document that served as the basis for the paper. “It was incredible to see how the document evolved from there,” he says. “People adding, deleting, and adding again. New discussions appearing in the sidelines. It worked like a charm. People agreed to take on specific tasks, such as fixing the references or checking periods and commas. When we had to shorten the article, a couple of authors became like piranhas, removing everything that was unnecessary.” Lakens processed and integrated much of the new input in breaks from his regular work, during the early morning hours, or late at night. “At a certain moment I thought I was going crazy,” he says. As the draft approached its final version, a few participants dropped out, some because they disagreed with the text; 87 eventually agreed to be a co-author.

Daniel Bradford, a Ph.D. student in clinical psychology at the University of Wisconsin in Madison, was “excited about helping” with the paper. “I had been a longtime student of statistics and I had been joining the waves of discussion of methodological reform in psychology,” he says. Bradford was initially skeptical that the crowdsourced authorship process would work. “I have collaborated on papers with only five authors and often thought that things would be much more efficient if the author list was even shorter than that,” he says.

The paper recommends that the label “statistically significant” be dropped altogether; instead, researchers should describe and justify their decisions about study design and interpretation of the data, including the statistical threshold. “Sometimes, the α will be 0.05, sometimes 0.005, sometimes 0.10,” Lakens says.

Valen Johnson of Texas A&M University in College Station, who is the lead author of the original “Redefine” paper, says that won’t work. “It is not feasible to allow the authors of every paper to decide on their own definition of statistical significance,” he wrote in an email to Science. “There are simply not enough resources to allow a thorough and unbiased review of each proposed justification of alpha.” It’s unclear how “justifying your α” would work in practice, adds his co-author, Eric-Jan Wagenmakers of the University of Amsterdam.

Another prominent co-author of the original paper is milder. “The paper’s message is perfectly fine from my point of view, and not actually a critique of our paper,” says psychologist Brian Nosek at the University of Virginia in Charlottesville, who directs the Center for Open Science. The “Redefine” paper’s key message was quite limited, he says: The current threshold of 0.05 yields weaker evidence than many people realize, and if it’s going to be dropped, 0.005 is a reasonable alternative. “Other suggestions, such as eliminating significance testing altogether, justifying α, incorporating Bayesian reasoning, more replication, etc. would be very welcome improvements, too,” Nosek says.

The debate is set to continue, although perhaps not in Google Docs. The process was “superawesome” but not very efficient, Lakens says. “You shouldn’t do it when you have little time,” he says. “It’s intense. And we did leave out topics I would have included if I had been the sole author, because we couldn’t reach a consensus.”