Can the research excellence framework run on metrics?

An Elsevier analysis explores the viability of a ‘smarter and cheaper’ model

June 18, 2015

The current research excellence framework is “a bloated boondoggle” that “steals years, and possibly centuries, of staff time that could be put to better use, and includes so many outcome measures that every university can cherry-pick its way to appearing ‘top-ranking’ ”.

This is the view of Chris Chambers, professor of cognitive neuroscience at Cardiff University, and it is by no means unique.

So what if much of that labour could be replaced by automated metrics? Academics’ conviction that peer review is the gold standard of research assessment means that evaluating the viability of a replacement scheme hinges on how closely metrics could reproduce the conclusions reached by humans. The debate was invigorated in 2013 when Dorothy Bishop, professor of developmental neuropsychology at the University of Oxford, argued in a widely read blog that a psychology department’s h‑index – a measure of the volume and citation performance of its papers – during the assessment period of the 2008 research assessment exercise was a good predictor of how much quality-related funding it received based on its results. But the suggestion that this might also be true in other disciplines was dealt a blow earlier this year when a team of physicists found that departmental h-index failed utterly to predict the REF results.
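For readers unfamiliar with the measure, the h-index of a set of papers is the largest number h such that h of the papers have each been cited at least h times. The sketch below assumes a department's papers are pooled into a single list of citation counts; the function name and the sample counts are invented for illustration and are not taken from either study.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Invented citation counts for one department's papers (illustration only).
department_papers = [10, 8, 5, 4, 3]
print(h_index(department_papers))  # 4: four papers have at least 4 citations
```

One very highly cited paper barely moves the figure, which is why the h-index is said to reflect both volume and citation performance.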

Times Higher Education asked Elsevier, owner of the Scopus database and provider of citation data for our World University Rankings, to carry out its own analysis of correlations between citations data and quality scores in the REF.

The analysts reasoned that examining the metrics of only the papers submitted to the REF would fail to demonstrate the value of replacing the REF with an exercise in which university departments would be relieved of the labour of selecting which papers to submit.

So instead they examined the proportion of an institution’s entire output over the REF assessment period (2008 to 2013) that fell into the top 5 per cent of all articles by citation produced during that period by the fields covered by the relevant unit of assessment (papers were credited to the institution where they were produced even when the author had moved before the census date).

This figure was compared with the proportion of the institution’s outputs that were judged to be 4* (“world-leading”) in the REF. They found a reasonable “coefficient of determination” of 0.59, where 1 is a perfect linear relationship. However, as the graph below shows, there are plenty of outliers.
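As a rough illustration of what a coefficient of determination measures, the sketch below computes it for a simple linear fit, where it equals the square of the Pearson correlation. The function names and the five pairs of figures are invented for the example; they are not Elsevier's code or REF data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def r_squared(xs, ys):
    """Coefficient of determination for a simple linear fit of ys on xs."""
    return pearson_r(xs, ys) ** 2


# Invented figures, one pair per institution (illustration only).
share_top5_cited = [0.04, 0.07, 0.10, 0.12, 0.15]  # share of output in top 5% by citations
share_4_star = [0.10, 0.22, 0.18, 0.30, 0.33]      # share of outputs rated 4*

print(round(r_squared(share_top5_cited, share_4_star), 2))
```

A value of 0.59 would mean that, under a straight-line fit, about 59 per cent of the variation in one quantity is accounted for by the other, which leaves ample room for individual institutions to sit far from the line.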

The analysts then looked in more detail at correlations at subject level. For the 28 units of assessment with reasonable coverage on Scopus (most arts and humanities subjects currently do not fall into this bracket), they created a “bibliometrics grade point average” based on the proportion of each department’s total articles whose citation count fell within the top 5, 10, 25 and 50 per cent globally, and compared this with GPAs calculated in the standard way from scores in the outputs section of the REF (which accounted for 65 per cent of the total REF scores). As the graph at the end of this article shows, correlations varied from 0.76 in biological sciences to −0.04 in anthropology and development studies, where 1 is a perfect correlation and −1 is a perfect inverse correlation.
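The article does not spell out how the citation bands were weighted, so the following is only a sketch under a stated assumption: that papers in the global top 5 per cent score 4 points, those between the top 5 and top 10 per cent score 3, and so on down to 0 for papers outside the top 50 per cent, mirroring the way a REF GPA averages star ratings. The function names and percentages are invented for illustration.

```python
def ref_gpa(profile):
    """Grade point average from a quality profile.

    profile maps a star rating (4, 3, 2, 1, 0) to the percentage of
    outputs awarded that rating; the percentages should sum to 100.
    """
    return sum(stars * pct for stars, pct in profile.items()) / 100


def bibliometric_gpa(cumulative):
    """Hypothetical citation-based GPA.

    cumulative maps 5, 10, 25 and 50 to the percentage of a department's
    articles in the global top 5, 10, 25 and 50 per cent by citations.
    ASSUMPTION: the bands are scored 4, 3, 2, 1 and 0 points; the source
    does not state the weighting Elsevier used.
    """
    top5, top10, top25, top50 = (cumulative[k] for k in (5, 10, 25, 50))
    profile = {
        4: top5,
        3: top10 - top5,
        2: top25 - top10,
        1: top50 - top25,
        0: 100 - top50,
    }
    return ref_gpa(profile)


# Invented profiles for a single department (illustration only).
print(ref_gpa({4: 30, 3: 40, 2: 20, 1: 10, 0: 0}))       # 2.9
print(bibliometric_gpa({5: 10, 10: 18, 25: 40, 50: 70}))  # 1.38
```

The two GPAs sit on the same 0 to 4 scale, so a list of departments can be correlated directly on the pair of scores.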

[Graph: Institutional evaluations (18 June 2015)]

The analysts also looked at how REF GPA correlated with the “field-weighted citation impact” of each department’s papers over the assessment period. This is a widely used metric that accounts for differences in citation patterns between papers of different types, scopes and ages. The findings are similar to those generated from the bibliometrics GPA.
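In outline, field-weighted citation impact divides the citations a paper has received by the average for comparable papers worldwide, so that 1.0 means "cited as often as expected". The sketch below assumes a department's score is the mean of those per-paper ratios and that the expected values are supplied in a lookup table; the function name, the table and all numbers are invented for illustration.

```python
def field_weighted_citation_impact(papers, expected):
    """Mean ratio of actual to expected citations.

    papers   -- list of (citations, field, doc_type, year) tuples
    expected -- dict mapping (field, doc_type, year) to the world average
                citation count for comparable papers (hypothetical table)
    """
    ratios = [
        cites / expected[(field, doc_type, year)]
        for cites, field, doc_type, year in papers
    ]
    return sum(ratios) / len(ratios)


# Invented data (illustration only).
expected = {
    ("physics", "article", 2010): 8.0,
    ("physics", "article", 2012): 4.0,
    ("physics", "review", 2010): 20.0,
}
papers = [
    (12, "physics", "article", 2010),  # cited 1.5 times the expected rate
    (3, "physics", "article", 2012),   # cited 0.75 times the expected rate
    (20, "physics", "review", 2010),   # cited exactly as expected
]

print(round(field_weighted_citation_impact(papers, expected), 2))  # 1.08
```

Normalising each paper against its own field, type and year is what allows a review from 2010 and an article from 2012 to be averaged together.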

In both cases, correlations tend to be higher for the units of assessment (coloured yellow in the graph) whose assessment panels used citation data to inform their judgements – although some subjects using citations, such as physics, have low correlations, while others that did not use citations, such as business and management, have high ones.

Robert Bowman, director of the Centre for Nanostructured Media at Queen’s University Belfast, said that this reassured him that the physics assessment panel “likely read outputs and submissions and weren’t swayed by metrics alone”.

The conclusion of Nick Fowler, managing director of research management at Elsevier, is that metrics could be useful to “inform” REF evaluations in the units of assessment with the highest correlations, where citations “could to some extent predict results”.

But Ralph Kenna, professor of theoretical physics at Coventry University and one of the authors of the h‑index paper, said that this claim was “way too strong” because even overall correlations in excess of 0.7 could still disguise significant outliers, resulting in “misrepresentation of an individual unit’s place in the (already extremely dodgy) rankings and inflation or damage to its reputation”.

He said that in his study, which came up with roughly similar correlations to Elsevier’s, one submission was ranked 27th, compared with its seventh place in the REF. For this reason, Elsevier’s analysis was “another nail in the coffin for the idea of replacing REF by metrics”.

However, Professor Chambers disagreed. For him, the analysis demonstrates that “a simple metric can predict REF outcomes”. But a “smarter and cheaper” REF would need to be based on several metrics to prevent gaming by universities.

“Perhaps the real test of the UK’s research excellence is whether the collective minds of thousands of academics can generate an intelligent automated system of metrics to replace the current failed model,” he said.

paul.jump@tesglobal.com

[Graph: Correlation between REF 2014 results and bibliometrics]

POSTSCRIPT:

Article originally published as: Can metrics really replace reviewers in REF? (18 June 2015)

Reader's comments (9)

These analyses are predicated on the assumption that the current REF scores are the "right" ones, and that if metrics fail to predict them sufficiently well, it is the metrics that are wrong. But why is this necessarily the case? Maybe the outliers were given overly harsh/generous scores by the current system. At least in some (sub)fields, it may be that citation rates are a better measure of quality than the score assigned by REF panel members, who won't always be experts in all of the relevant areas.
The REF Physics Subpanel only "used" citations as a secondary source of information as a sanity check and to "inform", in the phrase of Nick Fowler, the peer review assessment. I can assure Robert Bowman that ALL physics outputs were read by at least two assessors and peer reviewed accordingly. My attitude as Chair of the Physics REF panel to citations was predicated on a study I did before REF began of the three main citation providers. On a randomly selected sample of around twenty physics papers submitted to RAE I found discrepancies, often large, between the citation numbers returned by each of the three providers. If the input data from the providers is discrepant, how on earth can we ever use citation-based metrics to predict anything other than the most broad-brush of conclusions? In addition, since citations patently cannot be used in Arts subjects and any REF-like exercise must treat all subject areas on as level a playing field as practicable, this whole discussion seems moot. I agree that REF has a very substantial cost in terms of staff time and effort but while the government insists on an assessment of university research, I can't think of a viable alternative to something very similar to REF. Of course I would support returning to the old system that some of us may remember of quinquennial grants; however I am not holding my breath until this Nirvana is announced.
The ability to teach and the ability to conduct research are two quite different skills and require different measurement of competence. In some cases citations have become more important than the lecturer being able to speak English clearly, and hence actually teach their students a technical subject. The current situation is a step backwards.
In reply to Ben Ambridge. You are right that the discussion seems to assume that peer review is correct in its outcome, and citation analysis does centre on trying to reproduce peer review outcomes. However, if we don't trust peer review to be some sort of "gold standard", then all the comparison with metrics is a meaningless waste of time anyway and we should just give up the effort and say we want to reward citations, whatever they mean. That way I can relax and get on with gaming the system with colleagues in other universities in a mutual "cite each other" pact – we all know it will happen.
Another point about metrics is that the analysis is carried out with whole departments, and even then there are significant outliers and no clear consensus. If such a system were to be brought in, how well would you be able to judge on an individual level? That is key because universities would certainly use these same metrics to appraise individuals.
Scopus is seriously deficient in assessing impact in the humanities because it does not cover books or book chapters, and even its coverage of humanities and qualitative social science journals is meagre. The lack of correlation between REF assessments and bibliometric outcomes is not surprising. It would be more useful to examine whether there is a correlation between Google Scholar citation records and REF scores.
Thank you all for your comments; it’s great to see such interest in this topic. I would like to offer some tentative replies to a few specific items: @ Brian Foster: Various databases and services indeed have different indexation and data curation policies, so that absolute numbers of citations for any measured entity may vary. However, the relative numbers of citations tend to be consistent across data sources: an article that is highly cited in one of them also appears highly cited in the others, and vice versa. Therefore we can expect the discrepancies to have a minimal (if any) effect on any decisions informed by the data, regardless of the data source(s) used. @ David Riley: Metrics can indeed be used for research assessment at various levels. There have been various calls for the use of metrics in research evaluation (e.g. DORA (http://www.ascb.org/dora-old/files/SFDeclarationFINAL.pdf), HEFCE independent review (http://www.hefce.ac.uk/rsrch/metrics/), Snowball metrics project (http://www.snowballmetrics.com/)). Metrics have been used in recent research evaluation exercises, supplementing or informing qualitative evaluation. Provided that simple methodological guidelines are followed, such as the right selection of metrics to measure a certain facet of research, that appropriate benchmarks and triangulation are applied, and that they are interpreted in the right context, metrics are a useful addition to research evaluation that allow scalable, systematic, and impartial analyses. @ Robert Cribb: Scopus is working on the Book Expansion Project (http://blog.scopus.com/posts/scopus-content-update-books-expansion-project) to increase the coverage of books in the database. Scopus now includes more than 86,000 books (see the book title list at http://www.elsevier.com/__data/assets/excel_doc/0016/91123/Scopus__books_29_4_15.xlsx), with 120,000 expected by the end of 2015. Books are indexed at both book level and chapter level.
The focus for the Book Expansion Project is on Arts and Humanities and Social Sciences, and indeed more than half of the books covered in Scopus are in these fields. In addition, more than 500 book series, with approximately 30,000 book volumes, are covered in Scopus. Over the years, Scopus coverage of serial publications in the field of humanities has grown to almost 3,500 humanities titles (4,200 when including humanities-related titles). Selection of these humanities journals was partly based on comparison with other reputable databases and lists in humanities and with the help of input from the community. As part of the Cited References Expansion Program (http://blog.scopus.com/posts/scopus-to-add-cited-references-for-pre-1996-content) we are also working on backfill and gapfill of journal content, including citation information, as far back as 1970.
Brian Foster is spot on. When asked by Paul for comment on the Elsevier report I gave some comment and that concluded "As a physicist it is reassuring to see that they couldn't predict outcomes in our discipline, despite a likely perception that this might be the case; good to see the UoA panel likely read outputs and submissions etc and weren’t swayed by metrics alone." Brian's comments will be welcomed by the community.
There may be a lot of reservations about the significance of using metrics as opposed to specialist opinion at subject and university level to evaluate research quality in the UK. However, here in Brazil there is practically nothing done to measure research excellence at the national level using either metrics or specialist opinion. So while the University of Sao Paulo has been consistently rated the top university in Latin America over the last decade, there are no meaningful analyses of the distribution of research quality within and between university departments and universities that lead to a meaningful redistribution of national research funding as now occurs in the UK. As a semi-permanent visiting professor here in Brazil at the University of Sao Paulo, I observe extreme differences in quality, not only in research but also in teaching, that continue largely unchanged because of a lack of standardised measurements to rectify the situation. Here, far too much latitude is granted under the motto of academic freedom, resulting in the majority of quality research and teaching falling on the shoulders of a minority of the academics, whilst the majority glide on into a pension amounting to 100% of their last salary without having achieved anything significant in their careers. If this is the best in Latin America then God help the rest. Given this major discrepancy in internal quality, a standardised introduction of metrics with all its possible pitfalls would be a major advance in Brazil. Peter Pearson