Metrics: how to handle them responsibly

Amid concerns about the growing use – and abuse – of quantitative measures in universities, a major new review examines the role of metrics in the assessment of research, from the REF to performance management

July 9, 2015

If you have recently received an email from human resources announcing that you are expected to publish three papers over the next year in journals with an impact factor of at least 20, there is one crumb of comfort. You will at least be able to enter the misguided missive for a new, annual “bad metrics” prize, modelled on the Literary Review’s Bad Sex in Fiction Award for cringeworthy descriptions of hanky-panky.

The award, which will go to “the most egregious example of an inappropriate use of quantitative indicators in research management”, is the last of a series of recommendations on the use of metrics in research assessment arising from a major new review, whose report is published today. James Wilsdon, chair of the steering group for the review, admits that the idea is a “bit silly”, but stresses that it illustrates the serious point at the heart of the review’s conclusions: the need for more “responsible metrics”.

The report notes that “the metric tide”, after which it is named, is being whipped up by “powerful currents” arising from, inter alia, “growing pressures for audit and evaluation of public spending on higher education and research; demands by policymakers for more strategic intelligence on research quality and impact; [and] competition within and between institutions for prestige, students, staff and resources”.

Metrics – numbers – give at least the impression of objectivity, and they have become increasingly important in the management and assessment of research ever since citation databases such as the Science Citation Index, Scopus and Google Scholar became available online in the early 2000s. Metrics are particularly popular in political circles. The government commissions a report on UK research strength every couple of years from Elsevier, owner of Scopus, and its most recent headline finding – that with just 3.2 per cent of global research spending and 4.1 per cent of the world’s researchers, the UK receives 11.6 per cent of all citations worldwide and produces 15.9 per cent of the most highly cited articles – is frequently trotted out as proof that the country punches above its weight.
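The "punches above its weight" claim rests on simple ratios between shares. As a rough illustration only, the sketch below divides the UK's share of citations and of highly cited articles by its share of spending and of researchers, using the four percentages quoted above. The variable and function names are invented here, and the resulting ratios are not figures taken from the Elsevier report.

```python
# Percentages quoted in the article (UK share of world totals).
spending_share = 3.2        # share of global research spending
researcher_share = 4.1      # share of the world's researchers
citation_share = 11.6       # share of all citations worldwide
highly_cited_share = 15.9   # share of the most highly cited articles


def share_ratio(output_share, input_share):
    """Return how many times larger the output share is than the input share."""
    return output_share / input_share


citations_per_spending = share_ratio(citation_share, spending_share)
citations_per_researcher = share_ratio(citation_share, researcher_share)
highly_cited_per_spending = share_ratio(highly_cited_share, spending_share)

print(f"Citations relative to spending:    {citations_per_spending:.1f}x")
print(f"Citations relative to researchers: {citations_per_researcher:.1f}x")
print(f"Highly cited relative to spending: {highly_cited_per_spending:.1f}x")
```

A ratio above 1 means the output share exceeds the input share. The calculation says nothing about why, which is one reason such headline figures need context.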

Within universities, too, metrics have been widely adopted, not merely for institutional benchmarking but also, increasingly, for managing the performance of academics. A recent study by Times Higher Education suggested that individual metrics-based targets of one form or another have been implemented at about one in six UK universities.

The Metric Tide: Report of the Independent Review of the Role of Metrics in Research Assessment and Management attributes this state of affairs to the increasing pressure on universities to be “more accountable to government and public funders of research”, and also to the financial pressures imposed on institutions by constrained funding and globalisation.

“Within this culture shift, metrics are positioned as tools that can drive organisational financial performance as a key part of an institution’s competitiveness,” the report notes.

It adds that metrics have “helped to make decision-making fairer and more transparent, and allowed institutions to tackle genuine cases of underperformance”. However, “any moves towards greater quantification of performance management are a cause for concern among some academics, who fear it will erode the traditional values of universities”.

A legitimate, recurrent concern, it says, is that research managers can become “over-reliant on indicators that are widely felt to be problematic or not properly understood…or on indicators that may be used insensitively or inappropriately”, and do not “fully recognise the diverse contributions of individual researchers to the overall institutional mission or the wider public good”.

Another concern is that metrics distort scientific priorities, especially among early career researchers. Wilsdon, who is professor of science and democracy at the University of Sussex, fears that young researchers are pushed to “publish certain sorts of things [only] in certain sorts of places” in pursuit of “the right numbers rather than the right questions. That is a tragedy if we want the best, brightest people to pursue the things that really matter.”

The report also flags up concerns that the use of bibliometrics unfairly disadvantages women (evidence suggests that men are reluctant to cite women) and interdisciplinary research (which tends to be cited less often than papers in the mainstream of disciplines).

The aforementioned use of metrics as targets for individual academics to achieve is another major concern of commentators. Critics fear that this increases incentives to cut corners or to cheat outright, undermining the integrity of research literature. There are concerns that such targets can even undermine the mental health of those struggling to meet the goals. The latter argument was voiced particularly vociferously at the end of last year when Stefan Grimm, a professor of toxicology at Imperial College London, committed suicide after being told that he was failing to bring in the level of grant income expected of an Imperial professor.

According to Wilsdon, grant income targets “at least have the advantage of relative simplicity compared to more algorithmically complex bibliometrics or altmetrics”. But “if they are applied insensitively or elevated above everything else that makes up a rounded research and teaching portfolio, they can be harmful. They also need to be applied in a contextual way; if you demand that all researchers increase grant income at a time when public spending on research is flatlining or falling, you are setting goals that many simply won’t be able to meet.”

A particular bugbear of metrics critics is journal impact factor. This is a measure of the average number of citations received by papers in a particular journal over a particular period (usually two years). It is often pointed out that this average masks a very wide variation among the citations received by individual papers, with a few garnering very high numbers and many never garnering any. Nevertheless, journal impact factor is often used by hiring, promotion and grant review committees as a proxy for the quality of a particular paper or author.
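The point about averages can be seen in a toy calculation. The sketch below is a deliberate simplification: it treats the impact factor as the plain mean of citations per paper, whereas the published figure is calculated over a defined citation window. The citation counts are invented for illustration and do not describe any real journal.

```python
from statistics import mean, median


def journal_impact_factor(citations_per_paper):
    """Simplified impact factor: the mean number of citations per paper."""
    return mean(citations_per_paper)


# Hypothetical citation counts for ten papers in one journal (invented data).
citations = [0, 0, 0, 1, 1, 2, 3, 5, 8, 180]

impact_factor = journal_impact_factor(citations)
typical_paper = median(citations)
share_below_average = sum(c < impact_factor for c in citations) / len(citations)

print("Impact factor (mean):", impact_factor)
print("Median citations per paper:", typical_paper)
print("Share of papers below the mean:", share_below_average)
```

In this invented case one heavily cited paper lifts the mean to 20, while the median paper has 1.5 citations and nine papers in ten fall below the journal's average. That is why the journal-level figure is a weak guide to any individual paper.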

For their part, journals argue that this is not their fault, but some observers blame them for talking up their impact factors and doing everything they can to keep them high, such as constraining the number of papers they publish.

Sir Philip Campbell, editor-in-chief of Nature and a member of the review’s steering group, says that scientists “value highly selective journals, so a massive expansion of Nature, decreasing its selectivity, in order to take the heat out of the abuse of metrics would be perverse”.

He insists that Nature selects papers on scientific merit alone and has repeatedly warned in editorials against misusing journal impact factor. But it is “one measure of a journal’s significance, and there is no reason not to publish its value when it is released, and to highlight any successes”, he adds.

Nevertheless, the review’s report calls on publishers to “reduce emphasis on the journal impact factor as a promotional tool” and to provide more article-level metrics and information on individual contributions to papers “to encourage a shift towards assessment based on the academic quality of an article”.

Another metrics-related concern highlighted in the report is universities’ obsession with their place in university league tables, whose claims to measure institutions against global standards are undermined by their “varying degrees of arbitrariness in the weighting of different components”. Wilsdon is particularly scathing of universities that adopt as a strategic aim the goal of reaching a certain position in rankings, which he sees as a dereliction of universities’ duty to come up with their own definitions of what constitutes quality.

For him, this is the primary illustration of the danger that a metric can become an end in itself: “People reach for things they can measure…rather than taking a step back and asking: ‘What kind of institution do we want to be? What kind of research profile and impact do we want to have, and what are the best targets and indicators to help us move in that direction?’

“It is pretty tragic to see some of our greatest universities abandoning their pretence of having a proper strategic plan and objectives and essentially handing that job over to rankers,” he adds. “You even hear of individual vice-chancellors having huge chunks of their pay packet pegged to performance in a league table. That is also an absolute abrogation of responsibility by governing councils, and whoever has signed off on that should be ashamed of themselves.”

Such abuses, and the opposition they whip up, may go some way towards explaining why, despite the trend towards management-by-numbers in the public sector, the research excellence framework – whose shape is largely fashioned by consensus among academics – is still a largely metrics-free zone. Nevertheless, concerns abound about the costs of such a huge peer review-driven exercise; a recent report by Rand Europe on the 2014 REF put the cost of the impact element alone at £55 million – and the total cost, yet to be announced, is likely to be well in excess of £100 million.

An attempt in 2006 by Gordon Brown, who was then chancellor, to turn the research assessment exercise into a metrics-driven exercise was eventually abandoned after a study by the Higher Education Funding Council for England concluded that citation information was “insufficiently robust to be used formulaically or as a primary indicator of quality”, the report notes. But REF panels were permitted to use metrics to inform their judgements, and the debate never quite went away.

Nor is it a debate that straightforwardly sets government officials against academics. It was reinvigorated in 2013 when Dorothy Bishop, professor of developmental neuropsychology at the University of Oxford, argued in a widely read blog that a psychology department’s h-index – a hybrid measure of the volume and citation performance of its papers – during the assessment period for the 2008 RAE was a good predictor of how much quality-related funding it subsequently received.
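For readers unfamiliar with the measure, the h-index is conventionally defined as the largest number h such that h papers have each been cited at least h times. That definition comes from the standard bibliometric literature rather than from the report. A minimal sketch of the calculation follows; the function name and the sample citation counts are illustrative assumptions, not data from Bishop's analysis.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the top `rank` papers all have at least `rank` citations
        else:
            break      # counts only fall from here, so h cannot grow further
    return h


# Hypothetical citation counts for one department's papers (invented data).
department_papers = [25, 8, 5, 3, 3, 1, 0]
print("h-index:", h_index(department_papers))
```

Because the index is capped by the number of papers and ignores how far the top papers exceed the threshold, it blends volume and citation performance into one figure. The same property makes it insensitive to a single outstanding paper.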

Meanwhile, the years after 2008 saw the rapid rise to prominence of “altmetrics”, which track all manner of non-standard supposed indicators of quality, such as mentions or ratings on blogs or social media sites; readership on publishers’ websites or digital libraries; citation in non-scholarly sources such as policy documents; and popularity in scientific social networks such as Mendeley and ResearchGate. As The Metric Tide notes, prominent publishers including Elsevier, Springer and Nature Publishing Group have all added altmetrics to articles in their digital collections.

Altmetrics have been talked up, in particular, as a potential way to quantify impact. Although assessment of impact by case studies is deemed to have been a success in the REF, in 2010 one of the first acts of the incoming universities and science minister, David Willetts, was to delay the REF by a year in order to assure himself that the methodology was robust. Given those doubts, and the pressure on public expenditure, it is perhaps not surprising that in one of his last major acts before stepping down as minister last summer, Willetts commissioned a second review, The Metric Tide, to examine the extent to which the time was now ripe for the greater use of metrics in research assessment in general, and the REF in particular.

However, on the use of metrics in the REF, the view set out in The Metric Tide remains in essence negative. Despite its acknowledgement of the various flaws and biases of peer review, the review’s steering group endorses the “common theme” that emerged from the 153 responses to its call for evidence: that peer review remains the “gold standard” of research assessment and that metrics can, at best, supplement it. Hence, although the limited use of citation data in the 2014 REF (typically in marginal cases or where there was disagreement) was “relatively successful” and there is scope for it to be “enhanced”, the report argues that peer review should remain the primary mechanism for assessment in the next exercise, which is likely to be in 2020.

“Bibliometricians generally see citation rates as a proxy measure of academic impact or of impact on the relevant academic communities. But…quality needs to be seen as a multidimensional concept that cannot be captured by any one indicator [and which] may vary by field and mission,” the report notes. This is particularly true in relation to altmetrics, which are “highly specific to the types of impact concerned”, while the definition of impact adopted by the REF is, rightly, very wide. So adoption of specific altmetrics to assess impact would create “a danger that the concept of impact might narrow and become too specifically defined by the ready availability of indicators…potentially constraining the overall diversity of the UK’s research base”. Worse still, most altmetrics are easy to manipulate and, given that more than £1 billion of funding a year is directly dependent on the REF results, almost inevitably would be.

A major impediment to the greater use of standard bibliometrics in the REF is the still scant coverage of the humanities and some social sciences in the existing databases. The report points out, for instance, that about half of citations for monographs occur in publications other than the journals that dominate such databases. One solution to that problem – already adopted in Italy and Australia (see ‘Global currents: how other countries use metrics’ box, below) – would be to assess science subjects on the basis of citations but keep peer review for humanities and social sciences. But, according to Steven Hill, head of research policy at Hefce and a member of the report’s steering group, the review was warned that such a “hybrid” approach could result in institutions and funders regarding the humanities as less worthwhile. That fear is echoed by Eleonora Belfiore, associate professor of cultural policy at the University of Warwick, who was also part of the steering group. She says that a “two-speed” solution would “push the arts and humanities further to the margin of the sector and of policy”, and “further disincentivise database providers from trying to be more inclusive of humanities publications”.

Besides, she adds, such “special pleading” would not be justified by the evidence because, beyond a few “particular challenges” for the humanities, she has been rather surprised to discover that the concerns around the use of metrics are largely common across all fields.

If peer review is agreed to be the gold standard of research assessment, then the extent to which metrics could replace peer review depends on the extent to which judgements generated by metrics mirror those arrived at through peer review. Bishop’s analysis was followed up last year with a paper by four physicists that made predictions about the ranking of departments in four fields in the REF – physics, chemistry, biology and sociology – on the basis of their h-indices. A paper earlier this year compared those predictions with REF results and revealed that although there were correlations between the h‑indices and the final scores, they were not nearly robust enough to replace peer review. At the time, one of the authors, Ralph Kenna, professor of theoretical physics at Coventry University, said that universities would get more accurate predictions of their likely movement in the REF rankings “by tossing dice”.

On the other hand, at the same March conference at which Wilsdon pre-announced the review’s conclusions about the REF, Nick Fowler, managing director of research management at Elsevier, drew attention to a high correlation between the amount of QR funding awarded to institutions in 2015-16 and the number of highly cited articles they produced during the assessment period. This was particularly true among smaller institutions; for larger ones, there was a stronger correlation between REF scores and departments’ “field-weighted citation impact”, which accounts for differences in citation between papers of different types, scopes and ages.

“Metrics are a type of evidence…Used well, they can be very powerful,” Fowler argued. He repeated the view that metrics could play a greater role in the REF in at least some subjects following a more recent analysis of metrics and the REF carried out by Elsevier for THE. The review steering group asked Hefce to look deeper into the issue by examining correlations between a whole raft of metrics and the peer review scores for individual outputs. Although it found some reasonable correlations (see ‘Analyse this: numbers struggle to predict REF scores’ box, below), it concluded, like Kenna, that these are not strong enough to justify any move to replace peer review with metrics.

For his part, Stephen Curry, professor of structural biology at Imperial and a member of the report’s steering group, dismisses the idea that peer review is a “gold standard”, and he is partly persuaded that, at a “high enough level of granularity”, such as entire departments, “the numbers do start to mean a bit more because you are accumulating some indicator of average behaviour”. Nevertheless, he still believes that research activity is “a complex business” that he cannot envisage ever being “reducible to [any] basket of metrics”. Besides which, he adds, carrying out the REF by peer review has its own value in that it demonstrates higher education’s willingness to put itself through “this rather gruelling assessment period”.

“The UK can wear that as a badge of honour [because] it demonstrates it is immensely self-critical,” he says.

Hill agrees that it is “hard to see how you would get sufficient accountability about the way in which public funding is being spent – rather than [merely] allocated – through an entirely metrics-driven approach”. Furthermore, “the kinds of metrics you might use to allocate funding at macro level might not provide the nuanced information institutions want about their research performance”. He is also sceptical of peer review’s status as a gold standard, but adds that this does not mean that metrics are superior.

“The idea we developed during the review is that different ways of measuring quality will inevitably come up with different answers…So the best you can do is get as rounded a view as you can by looking at a range of different [measurements]. This is at the heart of the responsible metrics argument: everything has to be seen in context, and the more information you have – as long as it is reliable – the better.”

According to the report, responsible metric use involves being transparent about the use of a range of robust metrics that are inclusive of all fields, while bearing in mind the potential wider effects of their use and “updating them in response”. Curry admits that this notion of responsibility is not a new one: it has already been pushed in recent declarations against the misuse of metrics, such as 2013’s San Francisco Declaration on Research Assessment and 2015’s Leiden Manifesto for Research Metrics.

The former, popularly known as Dora, “isn’t saying you can’t publish in journals with a high impact factor: it is saying that when you do assessment and hiring, you must look for a better way than just relying on the name of the journal – it is a fairly modest ask”. Despite this, he notes, very few UK universities have yet signed it. “That, to me, tells you an awful lot about the present culture,” Curry says. Universities see the use of journal impact factors as “the game we play, the one everybody understands so [the thinking goes] we would be fools not to because it would hurt our bottom line.”

Curry says that the steering group debated whether to call on universities to sign Dora but settled instead for suggesting that they might want to as part of a process of developing “a clear statement of principles on their approach to research management and assessment, including the role of quantitative indicators”, on which basis they should “carefully select quantitative indicators that are appropriate to their institutional context”.

Wilsdon admits that it might be “a bit idealistic” to hope that reflection will, in itself, end bad practice (in which academics themselves are sometimes complicit – by, for example, putting their h‑index in their email signatures). But he points out that the report also contains a raft of practical, technical measures to improve the accuracy of metrics. These revolve around an insistence that all institutions, outputs and individuals submitted to the next REF should have unique numerical identifiers to avoid ambiguities and to increase interoperability between the various systems used to capture research data, such as Researchfish, Gateway to Research and universities’ internal databases.

“If you talk about administrative burden and only focus on the REF, you are missing a lot,” Wilsdon says. “The real issue is how do you stop [academics] having to enter the same information 16 times? That is much more important, in a sense, than what we do at a policy level in the REF.”

More interoperability could also help to avoid a danger of which Wilsdon is keenly conscious: that by recommending even a modest rise in the use of metrics in the next REF without a corresponding drop in the peer review load, the report risks increasing still further the bureaucratic burden and cost of the exercise – particularly if, as in 2014, institutions still feel the need to manually check their citation data.

Hefce is not obliged to accept the review’s recommendations, but according to Hill, there is little in it that the funding council disagrees with. Hefce, he continues, will “feed thinking from it” into its own consultation on the shape of the next REF, which will open in the autumn, with decisions in late spring 2016.

Curry admits to being a bit disappointed that no “grand new vision” crystallised out of the steering group’s deliberations and is conscious that the report might come across as mere “common sense”.

“But people don’t always see what the common-sense thing to do is, and I hope that message comes across loud and clear that responsibility [is required]. It is very much a human, complex activity we are looking at, and we have to do our best by it: use tools available to us as best we can but don’t let them run amok.”

Wilsdon also admits that the report risks being criticised for recommending an excessively “incremental” approach. “But most people will recognise we have avoided proposing radical change for the sake of it, because we don’t think the evidence supports it.”

He hopes that, at the very least, the report will help the academy to move “beyond the situation where managers are uncritically defending the utility of metrics…and critics are saying that academia is so ineffably precious it can’t possibly be captured in any metrics – ‘leave us alone to get on with what we are doing and please give us more money’.”

For Wilsdon, metrics are no more intrinsically good or bad than controversial technologies such as genetically modified crops.

“You can have a debate about the positive ways they could be used [set against] how they could cause problems and the inappropriate ways they could be used,” he says. “We need that degree of sophistication in relation to research management. If we are entering a world drenched in big data, the research system isn’t going to sit outside that. The question is how we use the power of real-time data collection and analysis of our own activity to shape the kind of research culture and system we want, rather than allowing crude, inappropriate uses to steer us off in directions we don’t want to go.”

Global currents: how other countries use metrics

Metrics play varying roles in the research assessment exercises carried out by other countries, the review learned.

The quality element of the six-yearly reviews conducted for New Zealand’s Performance-Based Research Fund – assessing individuals rather than departments – is based on peer review.

By contrast, the Danish Bibliometric Research Indicator (BFI), on the basis of which a quarter of public university funding is allocated, is based on points assigned according to the types and publication destinations of research outputs. Every year expert panels draw up a list of publishers, journals and book and conference series that they consider to be reputable, and split them into two quality levels.

Meanwhile, hybrid approaches are adopted in both Italy and Australia, with the humanities and social sciences assessed on the basis of peer review, and the natural sciences and engineering judged by metrics.

Italy’s Evaluation of the Quality of Research (VQR) involves journal impact factor and citation counting.

Australia’s Excellence in Research for Australia (on which no funding rides) involves a basket of measures, some of which – patents, plant breeders’ rights, registered designs and research commercialisation income – are intended to capture wider impact. But the UK metrics steering group heard there were concerns about the “implied narrow definition of societal impact and the potential that focusing on a small number of metrics might significantly skew behaviour”, as well as about the hybrid approach’s “potential to lead to perceived hierarchies which may cause significant tension between disciplinary groups”.

Analyse this: numbers struggle to predict REF scores

The metrics steering group asked the Higher Education Funding Council for England to dig into the fine detail of correlations between bibliometrics and peer review scores in the research excellence framework by examining the relationship between a suite of 15 metrics and the scores given to individual outputs across all units of assessment in 2014.

The analysis finds that metrics had a relatively poor ability to predict which papers were given a 4* rating, although correlations are higher in certain fields, such as clinical medicine, biological sciences, chemistry and economics.

The metric with the highest overall correlation was SCImago Journal Rank – a citations-based measure of the importance of journals. However, the correlation was only 0.34, where 1 is a perfect correlation. The most useful metrics for predicting REF scores across a wide range of fields were SCImago Journal Rank and Google Scholar citations.
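To give a sense of how weak a correlation of 0.34 is, the sketch below computes a Pearson correlation coefficient from scratch and then squares 0.34 to obtain the share of variance that a simple linear fit would account for. The box does not say which correlation statistic Hefce used, so reading 0.34 as a Pearson coefficient is an assumption made here; the function name and sample data are also illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Sanity check on invented data: a perfectly linear relationship gives r = 1.
metric_scores = [1, 2, 3, 4]
panel_scores = [2, 4, 6, 8]
perfect = pearson_r(metric_scores, panel_scores)

# Assumption: treat the reported 0.34 as a Pearson coefficient.
reported_r = 0.34
variance_explained = reported_r ** 2

print(f"r for perfectly aligned scores: {perfect:.2f}")
print(f"Share of variance explained at r = 0.34: {variance_explained:.1%}")
```

On that reading, the best-performing metric would account for only about 12 per cent of the variation in peer review scores, leaving the great majority unexplained.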

Where metrics ratings were the same, papers by early career researchers under main panel A (biological sciences) were more likely to be deemed 4*, while those authored by early career researchers under main panel C (social sciences) were less likely to be deemed 4* than those written by more senior academics, especially in economics and social policy.

Female authors were less likely than men to have 4* papers, especially in main panels B (physical science) and C. However, female authorship did not correlate significantly with metrics scores.


Article originally published as: The weight of numbers (9 July 2015)


Reader's comments (2)

Many dozens of paragraphs here -- and not a single reference anywhere to any of the humanities. This by Paul Jump illustrates wonk speak perfectly. Just go on and on, emitting many abstractions, citing many others who speak but wonk speak only. But never -- ever -- mention for perspective, human perspective, any novel, poem, song, film, or other work of art. In the world of metrics, all is numbers, all are abstractions -- no one needs any human context. It's gone, dead, and the living dead, the metric-depersonalized living dead totally rule all corporate academe.
Much of the debate was summed up in the 1970s by Goodhart. When we set out to measure something intangible, we resort to a proxy. Goodhart reminded us that when the proxy becomes the target, it ceases to be a good proxy. When we try to measure something as complex and nuanced as impact with a single index which is essentially retrospective and then use that measure to direct future research, it is bound to end in tears.