Hoax-detecting software spots fake papers

Andrey Voskressenskiy/iStock


It all started as a prank in 2005. Three computer science Ph.D. students at the Massachusetts Institute of Technology—Jeremy Stribling, Max Krohn, and Dan Aguayo—created a program to generate nonsensical computer science research papers. The goal, says Stribling, now a software engineer in Palo Alto, California, was “to expose the lack of peer review at low-quality conferences that essentially scam researchers with publication and conference fees.”

The program—dubbed SCIgen—soon found users across the globe, and before long its automatically generated creations were being accepted by scientific conferences and published in purportedly peer-reviewed journals. But SCIgen may have finally met its match. Academic publisher Springer this week is releasing SciDetect, an open-source program to automatically detect automatically generated papers.

SCIgen uses a “context-free grammar” to create word salad that looks like reasonable text from a distance but is easily spotted as nonsense by a human reader. For example:

Cyberneticists agree that semantic modalities are an interesting new topic in the field of programming languages, and theorists concur. This is a direct result of the development of web browsers. After years of compelling research into access points, we confirm the visualization of kernels. Amphibious approaches are particularly theoretical when it comes to the refinement of massive multiplayer online role-playing games.
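To illustrate how a context-free grammar can produce such text, here is a minimal sketch in Python. The grammar rules, the symbol names, and the functions `expand` and `generate_sentence` are all invented for this illustration; they are not taken from SCIgen, whose actual grammar is far larger.

```python
import random

# Hypothetical toy grammar (an assumption for illustration, not SCIgen's rules).
# Each uppercase symbol maps to a list of possible productions; any token that
# is not a key in the grammar is treated as a literal word.
GRAMMAR = {
    "SENTENCE": [["SUBJECT", "VERB", "OBJECT", "."]],
    "SUBJECT": [["cyberneticists"], ["theorists"], ["ADJECTIVE", "approaches"]],
    "VERB": [["confirm"], ["refine"], ["visualize"]],
    "OBJECT": [["the", "NOUN"], ["the", "ADJECTIVE", "NOUN"]],
    "ADJECTIVE": [["amphibious"], ["semantic"], ["compelling"]],
    "NOUN": [["kernels"], ["access points"], ["web browsers"]],
}


def expand(symbol, rng):
    """Recursively replace a symbol with a randomly chosen production."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: a literal word or punctuation mark
    production = rng.choice(GRAMMAR[symbol])
    words = []
    for part in production:
        words.extend(expand(part, rng))
    return words


def generate_sentence(seed=None):
    """Build one grammatical but meaningless sentence."""
    rng = random.Random(seed)
    words = expand("SENTENCE", rng)
    # Attach the final period to the last word and capitalize the first letter.
    text = " ".join(words[:-1]) + words[-1]
    return text[0].upper() + text[1:]


if __name__ == "__main__":
    for i in range(3):
        print(generate_sentence(seed=i))
```

Every sentence is grammatical because it follows the rules, and meaningless because the word choices are random. That combination is what makes the output look plausible from a distance.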

SCIgen also generates impressive-looking but meaningless data plots, flow charts, and citations. The trio named SCIgen in honor of the World Multi-Conference on Systemics, Cybernetics, and Informatics (WMSCI), an annual event that they suspected was fraudulently claiming to use human peer reviewers to vet submissions. Indeed, two of their nonsense papers were accepted by WMSCI.

The trio then put SCIgen online as a free service, encouraging researchers to “auto-generate submissions to conferences that you suspect might have very low submission standards.” And submit they did. Over the past decade, researchers have pulled numerous pranks on journals and conferences that claim to use human peer reviewers. Variations on SCIgen have appeared for other fields, from mathematics to postmodern theory. (This author continued the tradition, but using a different fake paper-generating method.)

The pranks were tolerated by publishers until 2013, when 85 SCIgen papers were discovered in the published proceedings of 24 different computer science conferences between 2008 and 2011. More were soon discovered, and 122 nonsense conference papers were ultimately retracted by Springer, the academic publishing giant based in Heidelberg, Germany, and by the Institute of Electrical and Electronics Engineers, based in New York City.

Rather than being created as pranks, it seems that many of the fake papers were coming from China, where they were “bought by academics and students” to pad their publication records, says the lead researcher behind the investigation, Cyril Labbé, a computer scientist at Joseph Fourier University in Grenoble, France. Later that year, an investigation by Science uncovered an underground market for fake academic credentials, in which some peddlers may have used SCIgen to save themselves the effort of writing “authentic” fake papers by hand.

In the wake of that public relations nightmare, Springer approached Labbé for help. His method for finding the nonsense papers was sophisticated, requiring a statistical technique similar to spam e-mail detection, but based on grammatical patterns rather than on keywords like “Viagra.” He agreed, for a price.
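The general idea of classifying text by grammatical patterns rather than keywords can be sketched in a few lines of Python. This is not SciDetect's algorithm, which the article does not describe in detail. It is a simplified stand-in that assumes a short, hypothetical list of function words and a nearest-neighbor comparison against labeled samples; the names `FUNCTION_WORDS`, `profile`, `distance`, and `nearest_label` are invented for illustration.

```python
from collections import Counter

# Hypothetical function-word list (an assumption); a real system would rely on
# a much richer set of grammatical features.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "we", "a", "for"]


def profile(text):
    """Relative frequency of each function word in the text."""
    words = [w.strip(".,;:()") for w in text.lower().split()]
    counts = Counter(words)
    total = max(len(words), 1)  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def distance(p, q):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def nearest_label(text, labeled_samples):
    """Return the label of the sample whose profile is closest to the text.

    labeled_samples is a list of (sample_text, label) pairs.
    """
    target = profile(text)
    closest = min(labeled_samples, key=lambda s: distance(target, profile(s[0])))
    return closest[1]
```

Because the profile ignores content words entirely, swapping “kernels” for “access points” does not change the result. Only the grammatical skeleton of the text is compared.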

The outcome of that deal was revealed in Springer’s 23 March press release. It announces the public release of SciDetect, a program created by Labbé’s research group to automatically detect papers created with SCIgen and similar programs. Its purpose, according to Springer, is to “ensure that unfair methods and quick cheats do not go unnoticed.” When asked how much money Springer paid Labbé’s team, a representative replied that “unfortunately we cannot provide you with financial figures,” but noted that it was enough to fund a 3-year Ph.D. student in Labbé’s lab.

But some see SciDetect as a tool for avoiding embarrassment rather than catching fraudsters. “As someone who used SCIgen to expose the lack of editorial and peer review of a suspect journal, anyone with a modicum of English language proficiency should be able to detect a paper written by SCIgen or similar software,” says Philip Davis, an independent researcher who consults for the publishing industry. “To me, this appears to be a move by a publisher to protect itself against the unwillingness of journal editors to weed out these fraudulent papers themselves.” Or as Paul Ginsparg, the founder of arXiv and developer of an already freely available algorithm for detecting gibberish, says, “It’s wonderful that Springer has moved to eliminate articles generated by software that intentionally produces nonsense, but what about unintentionally nonsensical articles produced by human authors?”

In an e-mail exchange with Science, the Springer representative wrote, “We agree with what Cyril Labbé says in his quote [in the press release]: ‘Software cannot replace peer reviews and academic evaluation, but SciDetect lends publishers an additional hand in the fight against fraud and fake papers.’ ” She added that no SCIgen gibberish articles have been submitted to Springer conferences or journals since the 2013 retractions.

As for the pranksters, they will just have to work harder, says Stribling, the SCIgen creator. “I’m willing to bet if someone wanted to declare an arms race, they could come up with another way to generate papers that would fool [SciDetect] again for a while.”
