Students are discovering that multiple-choice questions miss the mark. Richard Burton and David Miller explain why.
Ever-larger classes mean that Dr McQue is increasingly dependent on multiple-choice examinations, which he prepares and administers with skill and care. He has always been pleased with the precision, objectivity and intellectual variety of the questions he uses.
However, he was approached by two students who were bewildered. One had been graded B, the other D - despite the fact that they studied together and believed themselves to be identical in their knowledge of his course.
Dr McQue began to wonder if the fault might not be in their judgement, but in his examination. The tests consisted of 60 multiple-choice questions (MCQs), each with a choice of four possible answers. There was no deduction of marks for wrong answers, that is, none of the "negative" or "logical" marking that is used to discourage blind guessing. Dr McQue ensured that every student answered every question.
Most students would gain some marks by guessing, so each score would contain a random component. He had always made an allowance for this in interpreting scores, but all that he could calculate, for each number of known answers, was the average extra score to be obtained by guessing.
As none of his colleagues was concerned about the effects of guessing, Dr McQue had given it little thought.
But the students' complaint led him to consult Dr Everage, a statistician. After playing with students' scores at her PC, she told him that about one in three individuals was probably being allocated the wrong grade just through the vagaries of guessing.
Dr Everage proposed some more general measures of test unreliability, which she explained to Dr McQue.
As Dr McQue already realised, students knowing half of the 60 answers and guessing the others at random (with a one-in-four chance of being right) would score on average 37.5 (ie, 30 + 30/4) or 62.5 per cent. However, the anticipated range of scores is wide enough that one in five of those students would score less than 35 or more than 40.
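The spread described here follows from treating each guessed answer as an independent one-in-four trial, so that the number of lucky guesses is binomially distributed. The sketch below is illustrative only; the function names are invented for this example and the band limits are parameters. With 35 and 40 marks as the limits, the chance of falling outside the band comes to about one in five.

```python
from math import comb

def guess_pmf(n_guessed, p_correct):
    """Binomial probabilities of getting 0..n_guessed blind guesses right."""
    return [comb(n_guessed, k) * p_correct**k * (1 - p_correct)**(n_guessed - k)
            for k in range(n_guessed + 1)]

def prob_outside(known, n_questions, n_choices, low, high):
    """Probability that the total score is below `low` or above `high`
    when `known` answers are certain and the rest are blind guesses."""
    pmf = guess_pmf(n_questions - known, 1 / n_choices)
    return sum(p for k, p in enumerate(pmf) if not low <= known + k <= high)

# Half of the 60 answers known, the other 30 guessed among four choices.
mean_score = 30 + 30 / 4                   # 37.5 marks, i.e. 62.5 per cent
spread = prob_outside(30, 60, 4, 35, 40)   # about 0.20
```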
Such is this spread of likely scores for the same amount of knowledge that it is impossible to distinguish reliably even between students knowing 50 per cent of the answers and students knowing 60 per cent - which Dr McQue believes merit different grades.
How much less reliably could one discriminate between all the intervening levels of knowledge?
One way of quantifying the difficulty of distinguishing between students with 50 per cent and 60 per cent knowledge is to say that the frequency (or probability) distributions of scores for the two groups, given equal numbers in each, would overlap by almost one-third (31.2 per cent). Of those same students, half that percentage would be likely to achieve scores more appropriate to the wrong group (15.6 per cent, "measure A").
Alternatively, a student with 50 per cent knowledge has a probability of about one in ten (0.104) of equalling or exceeding the score of a given student with 60 per cent knowledge ("measure B").
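The article does not spell out Dr Everage's formulae, so the definitions in this sketch are assumptions: the overlap is taken as the sum, over all possible scores, of the smaller of the two groups' probabilities; measure A is half of that; and measure B is the probability that an independently drawn score from the 50 per cent group equals or exceeds one from the 60 per cent group. Under those assumptions the calculation gives about 31.2 per cent, 15.6 per cent and 0.104.

```python
from math import comb

def score_pmf(known, n_questions=60, n_choices=4):
    """Map total score -> probability when `known` answers are certain
    and every remaining question is a blind guess."""
    n_guessed = n_questions - known
    p = 1 / n_choices
    return {known + k: comb(n_guessed, k) * p**k * (1 - p)**(n_guessed - k)
            for k in range(n_guessed + 1)}

def overlap(pmf_low, pmf_high):
    """Shared area of two score distributions (sum of pointwise minima)."""
    return sum(min(p, pmf_high.get(score, 0.0)) for score, p in pmf_low.items())

def prob_low_matches_or_beats_high(pmf_low, pmf_high):
    """Chance that a score drawn from pmf_low is >= one drawn from pmf_high."""
    return sum(p_low * p_high
               for s_low, p_low in pmf_low.items()
               for s_high, p_high in pmf_high.items()
               if s_low >= s_high)

knows_half = score_pmf(30)    # 50 per cent knowledge
knows_sixty = score_pmf(36)   # 60 per cent knowledge

shared = overlap(knows_half, knows_sixty)
measure_a = shared / 2
measure_b = prob_low_matches_or_beats_high(knows_half, knows_sixty)
```

Because the number of questions and the number of answer choices are parameters of `score_pmf`, the same functions can be used to explore longer tests or questions with more options.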
Colleagues unwilling to confront the implications were quick to defend their practices. They suggested that Dr Everage's statistical model was based on a falsehood: students are not blindly guessing if they can rule out one or more of the four answers.
Dr Everage countered that such "partial knowledge", whose general effect must be to raise scores, increases the probability of guessing an answer correctly from one in four to one in three or one in two. That broadens the spread of likely scores and so reduces test reliability.
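The broadening can be seen from the binomial standard deviation, which grows as the chance of a correct guess moves from one in four towards one in two. A minimal sketch, assuming 30 guessed questions and the same number of options eliminated on every question (a simplification, with a function name invented for this example):

```python
from math import sqrt

def guess_mean_sd(n_guessed, options_left):
    """Mean and standard deviation of marks gained by guessing among
    `options_left` equally likely answers on each of `n_guessed` questions."""
    p = 1 / options_left
    return n_guessed * p, sqrt(n_guessed * p * (1 - p))

for options_left in (4, 3, 2):
    mean, sd = guess_mean_sd(30, options_left)
    print(f"{options_left} options left: mean {mean:.1f}, sd {sd:.2f}")
# 4 options left: mean 7.5, sd 2.37
# 3 options left: mean 10.0, sd 2.58
# 2 options left: mean 15.0, sd 2.74
```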
The merit of partial knowledge is a matter of debate. For example, there is value in knowing not to administer castor oil for a headache, but how much better is it to confidently choose an aspirin?
What of a student who knew no answers at all, but who could rule out two of the four possible answers for each question? The most probable score would be 50 per cent (perhaps leading to an acceptable grade), but higher scores would be likely too.
The reliability of the test could be improved by increasing either the number of possible answers per question or the length of the test.
However, merely to halve measures A and B, it would be necessary either to increase the number of answer choices to seven (for measure A) and six (for measure B), which is often impossible in practice, or else to increase the number of questions to about 114 and 92 respectively.
Dr McQue decided instead to minimise the guessing element by the well-established method of deducting marks for wrong answers (a third of a mark for each wrong answer in a four-choice test).
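The penalty of a third of a mark is the value that makes a blind guess worth nothing on average in a four-choice test: one chance in four of gaining a mark is balanced by three chances in four of losing a third. A short sketch of that arithmetic, with function names invented for this example:

```python
from fractions import Fraction

def expected_mark_per_guess(n_choices, penalty):
    """Average mark from one blind guess: +1 if right, -penalty if wrong."""
    p_right = Fraction(1, n_choices)
    return p_right * 1 - (1 - p_right) * penalty

def neutral_penalty(n_choices):
    """Penalty per wrong answer that makes blind guessing worth zero on average."""
    return Fraction(1, n_choices - 1)

four_choice_penalty = neutral_penalty(4)                     # Fraction(1, 3)
unpenalised_gain = expected_mark_per_guess(4, 0)             # Fraction(1, 4)
penalised_gain = expected_mark_per_guess(4, four_choice_penalty)  # 0
```

The same rule gives a penalty of one full mark per wrong answer for true/false questions, where there are only two choices.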
Then a colleague pointed out another problem. Dr McQue's 60 questions drew on only a tiny fraction of the facts and ideas in his course. Some students would be lucky in his selection and some would be unlucky.
With both random elements involved, no more than about half of the class could be correctly graded.
Measure A would rise from 15.6 per cent to 26.9 per cent and measure B would rise from 0.104 to 0.219. With the selection of questions as the only random element (that is, without guessing), the two measures would be 21.8 per cent and 0.155 respectively.
So Dr McQue decided not only to minimise guessing by means of "negative" marking, but also to use many more questions, so more of the course material could be included. These questions would be of the true/false type, so that the times required for preparing and answering the test would not be increased too much.
The use of true/false questions, testing just one concept at a time, would reduce most "partial knowledge" to simple uncertainty.
It would also provide unequivocal feedback to his teaching staff on what the students understood best and worst.
Retaining the same objectivity and variety of questions, the test would be much more of a precision instrument and the results would be fairer. Would the students see it that way and thank him and Dr Everage for their concern? That's anybody's guess.
Richard Burton is senior lecturer, and David Miller is reader, Institute of Biomedical and Life Sciences, University of Glasgow.