Computer programs can catch out copycat students but the smart internet plagiarist is still difficult to detect

September 3, 1999

You have 200 students in your year group. They are all connected to each other by email. They are all connected to the internet. They are all intelligent. How can you be sure that their intelligence is not being applied to lighten their load by sharing their work or what they find while surfing?

The answer is almost certainly that you cannot. With this number of students even double marking is not guaranteed to find strong resemblances between essays.

This is the reason for the development of CopyCatch. It can "read" all 200 essays in a very short time and report back where the vocabulary used is very similar. The level of similarity that counts as suspicious will vary with the type of assignment. For example, a laboratory report will inevitably contain a higher degree of intrinsic similarity than a general English essay topic. In the reporting or free discussion essays of up to 2,000 words that have been examined so far, the level of natural overlap appears to be 30-40 per cent. With the similarity threshold set at 70 per cent, any pair of essays that appears on the report should cause alarm bells to ring.

The program works by exploiting the fact that the majority of the words in an essay will be used only once. These words express the writer's view of the question. Since we expect each student to have their own viewpoint, even on technical or specialised subjects, if a large number of the words used once appear in two essays it seems reasonable to expect that not only the opinions but the phrases will be similar. If this is the case then the question can be asked "How did this happen?" Having read something similar before is what triggers suspicions of plagiarism in human readers. The computer is simply reproducing this.
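The idea of comparing words used only once can be sketched in a few lines of Python. This is an illustration only, not CopyCatch's actual code: the function names, the tokenising rule and the way the percentage is calculated (shared once-only words divided by the smaller of the two once-only vocabularies) are all assumptions made for the example.

```python
import re
from collections import Counter
from itertools import combinations


def once_only_words(text):
    """Return the set of words that occur exactly once in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {word for word, n in counts.items() if n == 1}


def similarity(text_a, text_b):
    """Share of once-only words common to both texts.

    Assumption: the overlap is measured against the smaller of the two
    once-only vocabularies. The article does not say how the real
    program computes its percentage.
    """
    once_a = once_only_words(text_a)
    once_b = once_only_words(text_b)
    if not once_a or not once_b:
        return 0.0
    return len(once_a & once_b) / min(len(once_a), len(once_b))


def suspicious_pairs(essays, threshold=0.70):
    """Compare every pair of essays and report those at or above the threshold.

    `essays` maps a student identifier to the essay text.
    """
    report = []
    for (name_a, text_a), (name_b, text_b) in combinations(essays.items(), 2):
        score = similarity(text_a, text_b)
        if score >= threshold:
            report.append((name_a, name_b, score))
    return report
```

With 200 essays this makes 19,900 pairwise comparisons, which is trivial for a computer but impractical for a team of markers.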

Any pair of essays caught in this way needs to be examined more closely. Words occurring more than once, but exactly the same number of times, are further indicators of potential similarity, but to be sure of this you need to be able to see the phrases side by side, so the program reports on those phrases that repeat either inside one essay or across essays.
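A report of repeated phrases might be built along the following lines. Again this is a hedged sketch rather than the program's real method: treating a "phrase" as a run of three consecutive words, and the names `phrases` and `repeated_phrases`, are assumptions chosen for the example.

```python
import re
from collections import defaultdict


def phrases(text, n=3):
    """Split a text into overlapping runs of n consecutive words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def repeated_phrases(essays, n=3):
    """Find phrases that repeat inside one essay or across essays.

    `essays` maps a student identifier to the essay text. Returns two
    dictionaries: `within` maps a phrase to the single essay that repeats
    it, and `across` maps a phrase to the sorted list of essays sharing it.
    """
    locations = defaultdict(list)
    for name, text in essays.items():
        for phrase in phrases(text, n):
            locations[phrase].append(name)

    within, across = {}, {}
    for phrase, names in locations.items():
        distinct = sorted(set(names))
        if len(distinct) > 1:
            across[phrase] = distinct      # shared between essays
        elif len(names) > 1:
            within[phrase] = distinct[0]   # repeated inside one essay
    return within, across
```

Keeping the two kinds of repetition apart matters because they call for different responses: one is a matter of writing style, the other a reason to set the essays side by side.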

If the repeat phrases are inside a single essay then the student is probably repeating the question data or repeating an opinion. This is probably poor writing but not suspicious.

If they come from different essays then they may be shared quotes. Initial studies suggest that even this is less common than might be thought. There is no particular reason why one quote should be seen as key or relevant by a number of students, even if a passage has been strongly underlined in textbooks by "helpful" predecessors.

If the phrases appear to be commentary on the question and there are a number of such instances, then there is strong evidence of collusion or plagiarism. This is even more probable if they are found to be in the same paragraph or sequence.

What a CopyCatch user decides to do with this information will vary, but at least the information is available, and rapidly. It is important to realise that this does not help with, or replace, the marking of students' work. It simply suggests that a closer comparison might be sensible before awarding a final mark.

It is also important to understand that this sort of program cannot find suspect material on the internet where only one student is suspected of making use of such data. In particular, many internet essays are available only by password or on payment, so there is little prospect of a full protection system as yet. Similarly, if a student obtains material by posting a question to a special interest group, there is little chance of finding this automatically unless it is shared with another student. This sort of plagiarism is normally spotted through the marker's unease with the style of writing or the sophistication of the content. It is also the source of most of the consultancy work in disputed cases.

Programs such as this are designed to counter a problem that has only recently become serious. If students can find material on the internet and share it, or send whole essays or answers to each other, then some of them assuredly will. For the protection of the majority who do not, and for assurance of the integrity of the degrees awarded, it seems necessary for universities to apply electronic solutions to electronic problems. Many already are, particularly in computer programming, but it needs to become the norm. Clear information about acceptable and unacceptable use of electronic forms of information is vital, but so is the back-up of electronic monitoring, because the scope of the problem exceeds the human controls currently in place.

David Woolls is software developer at CFL Software

copycatch.freeserve.co.uk
