Study of massive preprint archive hints at the geography of plagiarism

New analyses of the hundreds of thousands of technical manuscripts submitted to arXiv, the repository of digital preprint articles, are offering some intriguing insights into the consequences—and geography—of scientific plagiarism. It appears that copying text from other papers is more common in some nations than others, but the outcome is generally the same for authors who copy extensively: Their papers don’t get cited much.

Since its founding in 1991, arXiv has become the world's largest venue for sharing findings in physics, math, and other quantitative fields. It publishes hundreds of papers daily and is fast approaching its millionth submission. Anyone can send in a paper, and submissions don’t get full peer review. However, the papers do go through a quality-control process. The final check is a computer program that compares the paper's text with the text of every other paper already published on arXiv. The goal is to flag papers that have a high likelihood of having plagiarized published work.

"Text overlap" is the technical term, and sometimes it turns out to be innocent. For example, a review article might quote generously from a paper the author cites, or the author might recycle and slightly update sentences from their own previous work. The arXiv plagiarism detector gives such papers a pass. "It's a fairly sophisticated machine learning logistic classifier," says arXiv founder Paul Ginsparg, a physicist at Cornell University. "It has special ways of detecting block quotes, italicized text, text in quotation marks, as well as statements of mathematical theorems, to avoid false positives."

Only when there is no obvious reason for an author to have copied significant chunks of text from already published work, particularly if that previous work is not cited and has no overlap in authorship, does the software affix a "flag" to the article, including links to the papers with which its text overlaps. That standard "is much more lenient" than those used by most scientific journals, Ginsparg says.
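To make the flagging logic concrete, here is a minimal Python sketch of one common way to measure text overlap: comparing sets of word n-grams. It is not arXiv's classifier, which Ginsparg describes as a machine learning model with special handling for quotes and theorems. The function names, the 7-word window, the 20% threshold, and the simplified rule that citation or shared authorship always excuses overlap are all assumptions chosen for illustration.

```python
import re

def word_ngrams(text, n=7):
    """Return the set of n-word sequences in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(new_text, old_text, n=7):
    """Fraction of the new paper's n-grams that also appear in the older paper."""
    new = word_ngrams(new_text, n)
    if not new:
        return 0.0  # text shorter than n words has nothing to compare
    return len(new & word_ngrams(old_text, n)) / len(new)

def should_flag(new_text, old_text, cites_source, shares_author,
                threshold=0.2, n=7):
    """Hypothetical rule: flag heavy overlap only when there is no obvious excuse.

    Assumption: overlap is treated as innocent if the older paper is cited
    or has an author in common. The threshold value is arbitrary.
    """
    if cites_source or shares_author:
        return False
    return overlap_fraction(new_text, old_text, n) >= threshold
```

For instance, `should_flag(new, old, cites_source=False, shares_author=False)` returns `True` only when at least a fifth of the new text's 7-word sequences also occur in the older text.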

To explore some of the consequences of "text reuse," Ginsparg and Cornell physics Ph.D. student Daniel Citron compared the text from each of the 757,000 articles submitted to arXiv between 1991 and 2012. The headline from that study, published Monday in the Proceedings of the National Academy of Sciences (PNAS), is that the more text a paper poaches from already published work, the less frequently that paper tends to be cited. (The full paper is also available for free on arXiv.) The study also found that text reuse is surprisingly common. After filtering out review articles and legitimate quoting, about one in 16 arXiv authors were found to have copied long phrases and sentences from their own previously published work that add up to about the same amount of text as this entire article. More worryingly, about one out of every 1000 of the submitting authors copied the equivalent of a paragraph's worth of text from other people's papers without citing them.

So where in the world is all this text reuse happening? Conspicuously missing from the PNAS paper is a global map of potential plagiarism. Whenever an author submits a paper to arXiv, the author declares his or her country of residence. So it should be possible to reveal which countries have the highest proportion of plagiarists. The reason no map was included, Ginsparg told ScienceInsider, is that not all the text overlap detected in their study is necessarily plagiarism.

Ginsparg did agree, however, to share arXiv’s flagging data with ScienceInsider. Since 1 August 2011, when arXiv began systematically flagging for text overlap, 106,262 authors from 151 nations have submitted a total of 301,759 articles. (Each paper can have additional co-authors beyond the submitting author.) Overall, 3.2% (9591) of the papers were flagged. It's not just papers submitted en masse by a few bad apples, either. Those flagged papers came from 6% (6737) of the submitting authors. Put another way, one out of every 16 researchers who have submitted a paper to arXiv since August 2011 has been flagged by the plagiarism detector at least once.

The map above, prepared by ScienceInsider, takes a conservative approach. It shows only the incidence of flagged authors for the 57 nations with at least 100 submitted papers, to minimize distortion from small sample sizes. (In Ethiopia, for example, there are only three submitting authors and two of them have been flagged.)

Researchers from countries that submit the lion's share of arXiv papers (the United States, Canada, and a small number of industrialized countries in Europe and Asia) tend to be flagged less often than researchers elsewhere. For example, more than 20% (38 of 186) of authors who submitted papers from Bulgaria were flagged, more than eight times the proportion from New Zealand (five of 207). In Japan, about 6% (269 of 4759) of submitting authors were flagged, compared with over 15% (164 out of 1054) from Iran.
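The percentages quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the counts are the ones given in this article, while the variable and function names are made up for illustration.

```python
# (flagged authors, submitting authors) per country, as quoted in the article.
counts = {
    "Bulgaria": (38, 186),
    "New Zealand": (5, 207),
    "Japan": (269, 4759),
    "Iran": (164, 1054),
}

def flag_rate(flagged, total):
    """Share of submitting authors who were flagged, as a fraction of 1."""
    return flagged / total

rates = {country: flag_rate(f, t) for country, (f, t) in counts.items()}

# Print countries from highest to lowest flag rate.
for country, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country}: {rate:.1%}")

# How many times higher is Bulgaria's rate than New Zealand's?
ratio = rates["Bulgaria"] / rates["New Zealand"]
print(f"Bulgaria vs. New Zealand: {ratio:.1f}x")
```

Run as written, this gives roughly 20.4% for Bulgaria, 15.6% for Iran, 5.7% for Japan, and 2.4% for New Zealand, with Bulgaria's rate about 8.5 times New Zealand's.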

Such disparities may be due in part to different academic cultures, Ginsparg and Citron say in their PNAS study. They chalk up scientific plagiarism to "differences in academic infrastructure and mentoring, or incentives that emphasize quantity of publication over quality."

