Mock REFs need a neutral referee

Consulting citation data would ameliorate the all-too-human shortcomings of departmental review, says a Russell Group professor

March 10, 2020

If you work at a UK university, your department will currently be using some form of internal review to identify which of your recent papers should be submitted to the research excellence framework later this year.

Unlike some, I don’t have any visceral objection to the REF. Good performance measures generate incentives that motivate staff and promote good work. Nor have I anything against the REF’s design. The criteria of originality, rigour and significance amount to a conceptual framework that is elegant in its simplicity and universal applicability. It is the implementation of that framework that is the problem.

Mock REFs involve departments' assigning colleagues' recent papers and impact case studies a score of one to four stars, with anything less than three considered ineligible for submission. But a commenter under a recent story in Times Higher Education encapsulates the problem: “I had two papers scored by four people last time (two internal, two external). The scores on both? 1, 2, 3 and 4 stars. One external gave two ones, the other two fours. Both were professors at Russell Group universities, in top ranked departments. Clearly my work divides opinion, but to determine someone’s career trajectory based on one score is grossly unfair.”

REF panellists are typically eminent scholars in their fields, but most departments can’t call on internal reviewers with anything like the same experience. The problem of variant scoring can be lessened by training, but the anonymity of the mock REF review process opens the door to huge biases given that the internal reviewer knows all the people being assessed and has various academic and personal relationships – supportive or adversarial – with them. There are other sources of distortion too, such as a shortage of internal social science reviewers with the expertise to assess quantitative work.

Reputations, egos and jobs are on the line, so the review processes are bizarrely politicised and emotive. It is relatively easy to push the score of a paper above or below a critical boundary; even if a second internal reviewer exists, they are probably less specialist and will not put up much of a fight over a well-put case. Moreover, the subtlety of unconscious biases means that sometimes even the reviewers may not realise that they are being more lenient towards someone because they attend departmental socials and smile in the corridor – or because they are close allies of the department head.

Many universities use external reviewers to promote accountability, but the way they are recruited and managed can also reflect bias. Some universities require all studies to be submitted for external review, but senior central staff may have little way of knowing whether departments conform. In other cases, only a sample of papers must be sent for external review, and favourites’ and mentees’ highly internally rated papers might be protected from such scrutiny. Remarkably, some departments even send their external reviewers the internals’ scores and comments, undermining their independence; externals are paid, so they are unlikely to bite the hand that feeds them.

But what if people who feel undermarked could call on the expertise of field experts from around the world to back them up? Universities, you would think, would routinely consider such evidence. And, to be fair, some have begun to incorporate field-weighted citation impact (FWCI) into their mock REFs. But too many have not.

FWCI gauges a publication’s overall significance, originality and rigour – the REF criteria – relative to other studies in the same field with similar publication dates, based on the citation behaviour of scholars everywhere. Some people object that citations are not always positive, but even a critical reference to someone’s work indicates their contribution because it demonstrates that the work is pushing boundaries (and academic spats are useful when they clear the air).
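
For readers unfamiliar with the metric, an FWCI is essentially a ratio: the citations a paper has received divided by the average citations received by comparable papers (same field, same publication type, similar publication date). A score of 1 means the paper is cited exactly as often as expected; a score of 3 means three times as often. The short Python sketch below illustrates the arithmetic only. The function name `fwci` and the citation counts are hypothetical, and Scopus's actual calculation relies on its own field classifications and citation windows, which are not reproduced here.

```python
from statistics import mean


def fwci(citations, comparable_citations):
    """Ratio of a paper's citations to the mean citations of comparable papers.

    `comparable_citations` is assumed to hold citation counts for papers in
    the same field, of the same type and from a similar publication date.
    """
    expected = mean(comparable_citations)
    if expected == 0:
        raise ValueError("expected citations are zero; the ratio is undefined")
    return citations / expected


# Hypothetical citation counts for five comparable papers (mean = 4).
comparable = [0, 2, 4, 6, 8]

# A paper with 12 citations is cited three times as often as expected.
paper_fwci = fwci(12, comparable)
print(paper_fwci)  # 3.0
```

Because every output can be given a score on the same scale, outputs from different fields can be ranked against their own field's norms rather than against each other's raw citation counts.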

REF panellists themselves are highly likely to use FWCIs to inform their own decisions, as they should. And while none of the citation indexes are perfect, they could be useful in lots of mock-REF situations.

Take the frustrated colleague whose paper was assigned a two-star rating internally (and was denied external review) despite receiving its journal’s annual award for best paper. Scopus shows an FWCI close to three times the global average, putting it in the 95th percentile. It must surely count as at least three-star.

Or take another colleague, whose major output from an award-winning, research council-funded collaboration with internationally renowned colleagues was internally awarded a “low three-star”, putting it on the borderline for possible REF inclusion. The study has an FWCI score 10 times the global average, putting it in the 99th percentile.

Such discrepancies are not minor, and they have implications for individual careers.

Department heads and REF leads enjoy additional influence from the current processes and will not easily relinquish it. But I’m not suggesting that internal review be completely abandoned. As with most research, triangulation from different sources and angles is always better. I do believe, however, that citation metrics such as FWCIs should be consulted, as in the REF itself – especially in borderline cases. The fact that all outputs can be ranked on FWCI would allow for clearer, evidence-based demarcation of where borders lie.

Moreover, used at the university level, FWCIs could identify departments whose proposed REF submission profile differs significantly from that which metrics would suggest is optimal. Departments found to have severe biases would have their internal review processes overhauled.

Used judiciously, FWCIs have the potential to improve the quality of REF submissions while – fingers crossed – reducing bias, bullying and unfairness along the way.

The author is a professor at a Russell Group university.


Print headline: Mini-REFs need a referee

Reader's comments (2)

While a Research Dean I had to deal with issues related to this. I was never a fan of peer assessment and actively fought against it despite the institution's demand that we needed to 'peer' assess every piece of work. To have articles go through several rounds of reviews to be accepted in a journal only to have yet more people spend time determining 'quality' was, in my view, a combination of overkill and demeaning. Also, it took what was fundamentally a group assessment -- the REF is a collective assessment of an area of work -- and made it an individual assessment of the quality of an individual's scholarship -- which it is not meant to be. It is also a huge waste of resources to no good purpose. It is economically, scientifically and practically invalid. From an economic perspective, institutions spend somewhere on the order of £500M worth of people's time doing REF related activities. These typically involve the most research active people who, frankly, have better things to do. We then create panels that go through the same process again, wasting even more time and money (although some people believe that being a REF panel member is apparently a good thing to have on one's resume if one is into wanting to appear involved in policy). Imagine if this time and money was actually spent doing productive scholarship. From a scientific perspective, it is well known that human judgement is fallible. The work of people like Einhorn and Hogarth from decades ago showed how expert judgement can be outdone by statistical modeling and that reliance on such expert judgement can lead to seriously sub-optimal outcomes. Yet institutions which clearly did not learn these lessons insist on relying on such fallacious judgement. To make matters worse, when relying on human judgement, one needs governance mechanisms to counter the bias introduced -- hence, we waste even more resources documenting things so that we can ensure that bias is minimized. Yet, as Don Moore's work has shown, such mechanisms invariably fail, sometimes leading to even more breaches when they come to fruition. Finally, peer assessment is not really practical and can easily be replaced. While the author argues for FWCI, this only really works when the half-life of citations is at the low end (as in many STEM fields). In my field (business & social sciences) the half-life of citations is > 10 years, so FWCI doesn't really work well. However, what does work well is recognizing that the REF is a collective exercise. What matters is not any one paper or any one individual but the collective distribution (remember each institution will be submitting many papers in a UOA). While working on the Australian equivalent of the REF as well as the last REF round, we found that if one took each article (excluding editorials and commentaries) and simply weighted it by the 5-year citation impact factor for the journal it was published in and summed that up according to the rules that applied, one could predict the outcomes with more than 90% accuracy. Indeed, such a rule allowed for optimization of the final metric (remember the point is to come up with a single score) and hence a complete removal of the need for peer assessment. I know that people will complain with arguments that articles are more than just the journal in which they are published. While that matters to you as a person, it does not matter to the collective assessment unless you believe that your group is being biased against such that all of your articles are better than the journals in which they are published. The rule I gave above simply assumes that every paper be treated as the average of the papers published in that journal in that year. Sure, your paper might be better than average, but it also might be worse. When you have hundreds of papers being submitted, the law of averages takes over. That is the beauty of models over people. Models don't feel slighted. Models don't have egos. The role of models is to predict. Let's let models do their job so that we can get back to doing ours.

REF was never about an honest and accurate assessment of research quality. If it was, there would have been effective steps undertaken to correct for this, including the use of metrics to supplement or adjust reviewer ratings, an assessment of individual reviewer bias/consistency and correction for it, double blind submission of outputs, inclusion of value for money adjustments, adjustments for type of research output and theoretical/applied output biases, adjustments based on gender and ethnic variability etc etc... All these are effective steps that can be undertaken to improve the accuracy and reduce bias but what did we get? Unconscious bias training - something with no documented evidence that it works. Personally, I think a lot of effort and wasted money could be saved by just simply going back to using a research metric system (e.g., citation counts). I think people are just too personally invested to be unbiased and independent reviewers for REF - doesn't matter how much equality/diversity training you subject them to.