In 2015, Nick Brown was skimming Twitter when something caught his eye. A tweet mentioned an article by Nicolas Guéguen, a French psychologist with a penchant for publishing titillating findings about human behavior: for example, that women with large breasts get more invitations to dance at nightclubs, or that blond waitresses get bigger tips. Now, Guéguen was reporting that men are less likely to assist women who tie up their hair.
Brown, a graduate student in psychology at the University of Groningen in the Netherlands, sent an email about the study to James Heathers, a postdoc in behavioral science at Northeastern University in Boston whom he had met a few years earlier. The description alone triggered a laughing spell in Heathers—not an uncommon reaction to science he finds risible.
Once the chuckling stopped, Brown and Heathers took a deeper look at the findings reported by Guéguen, who works at the University of Southern Brittany in Vannes, France. Many failed the duo’s homegrown test for statistical rigor. The pair also found odd data in nine other articles from Guéguen.
Soon, Brown and Heathers were asking Guéguen and the French Psychological Society about the numbers. The pair says Guéguen failed to adequately address their questions, and the society agreed that their critique seemed well-grounded. So late last year, the men did something that many scientists might find out of bounds: They went public, sharing their concerns with a reporter for Ars Technica, which published a story, and posting their critiques on a blog. (Guéguen declined to discuss the matter with Science; the society says a university panel is examining the papers.)
When it comes to correcting scientific literature, styles vary. Some scientists prefer to go through “proper channels,” such as private conversations or letters to the editor. Others leave anonymous comments on online forums, such as PubPeer, set up to discuss papers. Then there is the more public approach Brown and Heathers are taking.
The two watchdogs have been remarkably effective at uncovering problematic publications. So far, Brown estimates that the analyses he and Heathers have done, sometimes working independently and often with other collaborators, have led to corrections to dozens of papers and the full retraction of roughly 10 more. That total includes five papers retracted over the past year or so by Brian Wansink, a high-profile nutrition researcher at Cornell University.
The duo concedes that their assertive style might rub some scientists the wrong way. Heathers, who has called himself “a data thug,” notes that in academia “the squeaky wheel gets Tasered.” But other researchers laud the pair as the vanguard of a movement to make science more rigorous. “Without people like them actively scouring the literature [it is easy for] misbehavior to go unnoticed,” says psychologist Brian Nosek of the University of Virginia in Charlottesville and founder of the Center for Open Science, which promotes replication efforts. “I might see a paper and say, ‘I don’t believe that,’ and put it aside. They’re willing to pursue it … wherever it goes.”
As a result, Brown and Heathers have become go-to whistleblowers of a sort: Researchers now pepper the pair with tips about suspect papers. “It’s like a work tumor we’ve been infected with,” Heathers says. “You talk about these things in public and all of a sudden people start bringing these things to you.”
An unlikely duo
The partnership is unlikely, an odd-couple pairing made possible only by the sociological blender that is the internet.
Brown, 57, is a demure expat raised in Birmingham, the second-largest city in the United Kingdom. He started out as an engineering student at the University of Cambridge, but “my maths weren’t good enough,” he says, and he wound up working as a computer network manager.
That position led to one of two encounters that Brown says fostered his current pursuits. While attending a human resources conference he met a U.K. psychologist, Richard Wiseman, who in 2010 wrote an influential blog post critiquing the methods of a soon-to-be published study by Cornell psychologist Daryl Bem purporting to demonstrate the existence of extrasensory perception. Brown says the chance meeting “planted the seed” that led him to seek a master’s degree in psychology at the University of East London.
That coursework led to his first successful scientific debunking. In 2013, he teamed up with two other academics, one of whom was Alan Sokal of New York University in New York City, a mathematician and physicist who in 1996 perpetrated one of the most famous scientific hoaxes ever by getting a cultural studies journal to publish a gobbledygook paper. They published a critique of a leading paper on psychological theory, noting that the formulas in the paper relied on irrelevant equations from the field of fluid dynamics. Their article eventually triggered the paper’s partial retraction.
The episode showed Brown that well-crafted arguments about flawed research are hard for editors to ignore. In 2014, he waded into another controversy, this one surrounding Diederik Stapel, a notorious fraudster from Tilburg University in the Netherlands who fabricated dozens of studies in the field of behavioral psychology. This time, Brown jailbroke Stapel’s memoir by translating it into English and posting the document on a public website. The memoir offered Stapel’s take on the fraud, but was available only in Dutch behind an online paywall.
Brown’s second pivotal encounter came that same year with Heathers, through a Facebook group that had been discussing concerns with a paper on heart rate variability, a topic Heathers had once studied.
The two are temperamental opposites. Whereas Brown does not consider himself a crusader and has no trace of the swashbuckler, the Australian-born Heathers, 35, revels in acting like a dinner guest who farts loudly—and unapologetically—during grace. And he appears constitutionally unable to accept authority.
As an undergraduate at The University of Sydney (where he also earned master’s and doctorate degrees), Heathers began studying economics, but switched to physiology, “which I didn’t like because I got into fights with the lecturers all the time when they said stuff I didn’t agree with.” So, he shifted again, to psychology. Then his supervisor’s mother was murdered; the tragedy removed his mentor from the scene and effectively left Heathers with no one to guide him in the niceties of academia. “I spent most of the next 3 years trying to figure things out by myself,” Heathers says. “I had no sense of normal academic parameters or what I should be doing.”
Heathers drifted back to physiology, this time with an emphasis on what he loosely calls “data stuff.” Simply put, that means critiquing the results of other researchers. “My normal day at work, when I’m not doing metascience, is retrieving measurements from other people” and telling them where they’ve gone astray, he says. “I’m usually the bearer of bad news.”
When the two work together, usually one party is clearly in the lead. “We don’t turn up as the dynamic duo,” Brown says. “It’s usually an 80/20 or 90/10 split. Sometimes I’m doing most of the work and James is just there to put up with my profanity and check I’m not going down a rabbit hole, and vice versa.”
A mathematical approach
The rudiments of the duo’s mathematical approach began to take shape in 2015, as they began to jointly examine Guéguen’s papers.
One technique, which they describe in a May 2016 posting on Heathers’s blog, looks at what the researchers call the granularity-related inconsistency of means, or GRIM. The essence of the test is the disarmingly simple fact that the mean value in a collection of N integers must be a fraction whose denominator is N. For example, it might seem plausible if a researcher running a study involving 12 children aged 11 to 17 reports a mean age of 15.7. (After all, it’s less than 17 and greater than 11.) But the GRIM test reveals that value is mathematically impossible, because, with ages recorded in whole years, no sum of the ages divided by 12 gives exactly 15.7.
The GRIM test is little more than “glorified adding up,” says Heathers, and the two don’t see it primarily as a way of detecting misconduct. Rather, “We are looking for mistakes. That’s literally it.” The GRIM is ideal for spotting errors in psychology and other fields that report results from small samples, its creators note. But they readily acknowledge that it doesn’t work for large studies and more complex data sets.
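The "glorified adding up" can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' own code: the function name `grim_consistent` and the `decimals` parameter are assumptions, and the sketch relies on Python's built-in rounding, whereas a careful implementation would likely allow for both rounding conventions. Because published means are rounded, the check compares rounded values.

```python
import math


def grim_consistent(mean, n, decimals=2):
    """Return True if `mean`, as reported to `decimals` places,
    could be the mean of n integer values."""
    target = round(mean, decimals)
    approx_sum = mean * n
    # Only the integer sums nearest to mean * n can round back to the reported mean.
    for total in (math.floor(approx_sum), math.ceil(approx_sum)):
        if round(total / n, decimals) == target:
            return True
    return False


# Illustrative values, not taken from any paper discussed here.
print(grim_consistent(5.19, 28))              # False: 145/28 rounds to 5.18, 146/28 to 5.21
print(grim_consistent(15.70, 12))             # False: 188/12 rounds to 15.67, 189/12 to 15.75
print(grim_consistent(15.7, 12, decimals=1))  # True: 188/12 rounds to 15.7
```

The last two lines show why reported precision matters. A mean of 15.70 from 12 participants is flagged, but a mean reported only as 15.7 is not, because 188 divided by 12 rounds to 15.7 at one decimal place. The test bites hardest when the sample is small relative to the number of decimals reported.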
For those, Heathers came up with a somewhat more sophisticated test, which he and Brown call sample parameter reconstruction via iterative techniques, or SPRITE. In essence, SPRITE allows the researchers to do some reverse engineering: deriving statistically possible data sets from the means and standard deviations reported in a study.
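The reverse engineering can be illustrated with a simplified search. What follows is a sketch under stated assumptions, not Heathers's implementation: the function name `sprite_sketch`, its parameters, and the search strategy (shift one unit between two values so the mean stays fixed, and keep the move if the standard deviation gets closer to the reported one) are all chosen here for illustration. It assumes integer responses on a bounded scale and at least two participants.

```python
import random
import statistics


def sprite_sketch(mean, sd, n, lo, hi, decimals=2, max_iter=20_000, seed=0):
    """Search for n integers in [lo, hi] whose rounded mean and SD match.

    Returns one matching data set as a sorted list, or None if nothing is
    found within max_iter steps (which does not prove impossibility).
    """
    rng = random.Random(seed)
    total = round(mean * n)
    # The reported mean must be reachable from an integer sum inside the range.
    if round(total / n, decimals) != round(mean, decimals):
        return None
    if not lo * n <= total <= hi * n:
        return None
    # Start from the flattest data set that has the required sum.
    base, extra = divmod(total, n)
    data = [base + 1] * extra + [base] * (n - extra)
    current = statistics.stdev(data)
    for _ in range(max_iter):
        if round(current, decimals) == round(sd, decimals):
            return sorted(data)
        i, j = rng.sample(range(n), 2)
        if data[i] + 1 > hi or data[j] - 1 < lo:
            continue  # the move would leave the allowed range
        # Shifting one unit from data[j] to data[i] keeps the sum, and so the mean, fixed.
        data[i] += 1
        data[j] -= 1
        new = statistics.stdev(data)
        # Keep moves that bring the SD closer; occasionally keep others to avoid getting stuck.
        if abs(new - sd) < abs(current - sd) or rng.random() < 0.1:
            current = new
        else:
            data[i] -= 1
            data[j] += 1
    return None


# Hypothetical report: 20 responses on a 1-to-7 scale, mean 4.30, SD 1.66.
candidate = sprite_sketch(4.30, 1.66, n=20, lo=1, hi=7)
```

Running such a search many times with different seeds yields a family of possible data sets. If every one of them requires an implausible extreme value, the reported summary statistics deserve a closer look.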
SPRITE has figured heavily in the pair’s analyses of papers by Cornell’s Wansink, who received extensive media coverage for his studies of nutrition and eating habits. The test showed that the data in at least one of his studies—a 2012 article in the journal Preventive Medicine looking at carrot consumption among school children—appeared iffy. How so? Running the published data through SPRITE showed that at least one child in the sample would have had to have eaten roughly 60 carrots in a single sitting. That amount, Heathers quips, is more appropriate for a cart horse than a child. (Wansink, who declined to be interviewed for this story, recently published a lengthy correction to the paper, stating it measured “matchstick carrots,” four of which are equivalent to one baby carrot.)
Despite the charged nature of their work—after all, careers can be on the line—Brown and Heathers have attracted surprisingly little criticism from their peers in science. In part, that’s likely because of their strategy of gently but methodically ratcheting up the pressure on authors and journals. For example, the Wansink analysis, like others the pair has undertaken, began with “some very polite” emails asking for data from the researcher’s department, Brown says, as well as from Cornell’s Office of Research Integrity and Assurance. “But both of those stopped replying to us once—we assume—our questions became too awkward.” (The university has said it is investigating the papers.) At about the same time, Brown and two collaborators—Jordan Anaya and Tim van der Zee, a graduate student at Leiden University in the Netherlands—were working on a preprint that they posted in PeerJ titled “Statistical heartburn: An attempt to digest four pizza publications from the Cornell Food and Brand Lab.”
“Once we had our preprint online, we moved to blogging about new issues as they arose,” Brown recalls. “At one point I wrote three blog posts in as many weeks and someone commented, ‘Ah, I see it’s Wansink Wednesday.’” By early 2017, after the media picked up on the affair, Brown says, “We felt we had established enough questions to start writing to the journal editors.” But he says even those emails tended to take a sabers-sheathed tone: “‘Hey, we found these problems with this article in your journal, you might want to check it out.’ I thought that was less provocative than an outright demand for a retraction.”
Soon, however, Wansink had begun retracting five papers and correcting more than a dozen. That unusual turn of events prompted Brown to launch a Twitter poll, asking his followers whether he should add the results of his data sleuthing to his resume. The responses were decidedly mixed: Thirty-five percent said it was “cool,” whereas 24% chose “WTF.” The other options—“OK, I guess,” and “Cheesy”—earned 25% and 16% of the votes, respectively.
Editors are listening
Some editors and publishers are clearly paying attention. The Wansink episode made “clear to us … that online fora for postpublication discussion are a valuable part of the scientific record,” says Gearóid Ó Faoleán, the London-based ethics and integrity manager for Frontiers in Lausanne, Switzerland; one of its journals retracted a 2016 paper by Wansink after being alerted to the PeerJ preprint and Brown’s blog. The preprint also prompted editors at the Journal of Sensory Studies to investigate a 2014 Wansink paper they had published, and ask for a correction, which the researcher made last August.
Other would-be data whistleblowers are impressed by the duo’s success in getting journals to act. Paul Brookes, of the University of Rochester Medical Center in New York, briefly used his now-defunct blog, science-fraud.org, to highlight questionable papers anonymously. Brookes, who was outed amid legal threats in 2013 after only 6 months, says he would “routinely write dozens of emails [to journal editors], and it was common to have no response at all.”
Elisabeth Bik, now a science editor at uBiome in San Francisco, California, experienced similar silence. When Bik was a researcher at Stanford University in Palo Alto, California, she spearheaded an analysis of 20,000 papers, published in mBio in 2016, that concluded that 4%, or 800, contained inappropriately manipulated images. She contacted most of the journals involved more than 2 years ago, but only about one-third have responded.
It’s not clear why Brown and Heathers have gotten a better response—it may be that the overall climate has changed—but they say just about anyone with rudimentary math skills and a willingness to go public could replicate what they are doing. So why aren’t more scientists following suit?
One big obstacle, they say, is that many are reluctant to rock the boat. “Some people have a block on criticizing others, even to themselves,” Brown says. Their reaction to evident problems is to flinch, as if a scientific superego is saying: “Am I allowed to get this professor’s article and read it? And will something bad happen to me if I recalculate the mean?”
Another hurdle is an overabundance of trust. “Other people really sort of lack the mindset that this might even be necessary,” Heathers says. “There’s no guide to spotting errors. There’s no text that you can read. What we have done so far has been quite ad hoc.”
The pair also admits it enjoys luxuries—time and freedom—that many scientists with pressing grants and academic appointments do not have. “When you have funding, people expect reports on what you’ve done with their money,” Heathers says. “They also don’t expect you to investigate them.”
Still, Heathers believes the sleuthing efforts the two have developed are the “thin end” of a growing wedge of analytic techniques that, once refined, can be formalized and taught to anyone. Eventually, he would like to produce a scalable, online course to spread the methods. “Then things get really interesting,” he predicts, in part because traditional peer review has failed to catch so many problems. “In short, peer review misses all the hard stuff, and a worrying amount of the easy stuff.”
For the moment, however, the pair is content to have helped start a conversation about publicly confronting potentially problematic results, however uncomfortable it might be for some researchers—and science writ large. “The black flag has been hoisted,” Heathers has written. “It isn’t coming down.”
Adam Marcus and Ivan Oransky are co-founders of Retraction Watch. This story is the product of a collaboration between Science and Retraction Watch.
*Correction, 14 February 2018, 5:07 p.m.: An earlier version of this article misstated the estimated number of retracted papers. Nick Brown worked as the manager of computer networks. James Heathers is 35.