Experimenting with plagiarism detection on the arXiv
March 2007, page 30
Starting this summer, submissions to the arXiv, the
online server where many physicists check daily for new preprints, will be compared with the server's
existing 400 000 (and counting) manuscripts to check for plagiarism.
When plagiarism is suspected, the submission
will be flagged, and the authors will get a message saying "your article has x% overlap with
article 'a.' Do you really want to do this?" says Cornell University physicist Paul Ginsparg, the
creator and overseer of the arXiv. The authors whose papers were copied from will not be notified.
"This will be a fun experiment,"
Ginsparg says. "Will we train people to be more clever and to make more word changes? Or will there
be a real change in their behavior?"
Behavior did change when
University of Virginia physicist Louis Bloomfield began using software to see if his students
were cheating. Checking new arXiv submissions is a good idea, Bloomfield says. "People should
know it's not okay to steal. It's not even okay to publish your own stuff over and over." After he reported
students who had copied, they were prosecuted. Forty-five students either left the university
or were found guilty, and three degrees were revoked. "I was immersed in seemingly endless honor
trials. Two years of my life were burned up. There's a lot of trouble when you open this can of worms.
Plagiarism shouldn't be tolerated, but you need a professional organization to handle the heat."
The arXiv's automated
scanning for overlapping text is a refinement of an algorithm used last year by Cornell computer
science graduate student Daria Sorokina to look at the server's then nearly 300 000 documents.
The algorithm assigns unique numbers to word sequences and then compares those numbers across
documents. Common phrases such as "this work was supported in part by" are excluded. "There is nothing
new about document fingerprinting," says Cornell computer scientist Johannes Gehrke, an adviser
on the project. "The novelty here was the application to the arXiv."
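The idea described above can be illustrated with a minimal Python sketch. It is not the arXiv's or Sorokina's actual code: the five-word window, the SHA-1 hashing, and the names `fingerprints` and `overlap_percent` are all assumptions made for illustration. Each run of consecutive words is hashed to a number, sequences that come from listed common phrases are discarded, and the remaining numbers are compared between two documents.

```python
import hashlib
import re

WINDOW = 5  # words per sequence; an assumed value, not taken from the article


def words(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fingerprints(text, window=WINDOW):
    """Return the set of numbers assigned to each run of `window` words."""
    toks = words(text)
    prints = set()
    for i in range(len(toks) - window + 1):
        seq = " ".join(toks[i:i + window])
        digest = hashlib.sha1(seq.encode("utf-8")).digest()
        # Keep the first 8 bytes of the hash as the sequence's number.
        prints.add(int.from_bytes(digest[:8], "big"))
    return prints


def overlap_percent(new_text, old_text, common_phrases=(), window=WINDOW):
    """Percentage of new_text's word sequences that also occur in old_text.

    Sequences drawn from `common_phrases` (hypothetical exclusion list)
    are ignored in both documents.
    """
    ignored = set()
    for phrase in common_phrases:
        ignored |= fingerprints(phrase, window)
    new = fingerprints(new_text, window) - ignored
    old = fingerprints(old_text, window) - ignored
    if not new:
        return 0.0
    return 100.0 * len(new & old) / len(new)
```

Calling `overlap_percent(submission, existing, common_phrases=["this work was supported in part by"])` gives a figure loosely analogous to the "x% overlap" in the message to authors, although the article does not say how the arXiv computes that percentage. A real system would also need an index over all stored fingerprints rather than pairwise comparison.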
In the study, about 10%
of arXiv manuscripts had text blocks that overlapped with other documents. After removing instances
of authors reusing parts of their own text, different collaborators on a single project using the
same text in separate conference abstracts, and other apparent false positives, less than 1% of
manuscripts were still suspect, says Sorokina.
Close examination of 20
pairs of documents with some of the highest levels of overlap exposed 16 as plagiarism. "In one case,
an author copied descriptions of five or six methods that he was comparing," says Sorokina. "He
didn't cite the sources. But the work of comparing was his own." One of the most common types of plagiarism
found was the lifting of introductory or background material, especially in PhD theses, says Ginsparg.
"The surprising thing is that people submit to the same database where they found [what they copied].
It's mind boggling, given the existence of Google, given the existence of searching on full text,
that people wouldn't have an intuition that they would be caught."
"Some of it is different
ethical norms," Ginsparg adds. "People in different countries, with different intellectual
backgrounds, will sometimes argue that what they are doing is completely correct." The reassuring
thing, he adds, "is that the most creative people, who are generating the ideas, don't have to start
from someone else's article as a template. We'd be very surprised if authors of prominence showed
up as perpetrators as opposed to victims."
Document fingerprinting
catches only word-for-word plagiarism. But work is under way in the data-mining community on author
identification and detection of the flow of ideas, says Gehrke. "Detecting content-based similarities
with more sophisticated methods on a macroscale will be the next step."
In addition to implementing
a check on new submissions to the arXiv, Ginsparg is talking to the editors of Physical Review
Letters about applying the method to that journal and other American Physical Society publications.
"More work needs to be done to include papers outside of the arXiv, and to go across journals," says
Marty Blume, the recently retired APS editor-in-chief. "We have 30 000 submissions a year.
We'll have to see how much [of the editors'] time it takes to run. And if we do it, what do we do with the
results?"