REF panels must blend citation and publication data

A weighted approach is the best bet for assessment, say Andrew Oswald and Daniel Sgroi

September 19, 2013

Source: Patrick Welham

Citations, whatever their faults, are observable in a way that is not true of the nodded approval of quietly self-selecting scientific communities

A little-remarked feature of the forthcoming research excellence framework is that subject review panels are forbidden from using information on journal impact factors or other journal rankings.

When one of us argued in these pages a few years ago that such a rule was illogical and foolish, it sparked debate. Yet everyone knows that different journals have different quality standards for publication, so what is the point of pretending otherwise?

Although peer review by journals also necessarily involves a level of subjectivity, specialist journals employ specialist referees, and REF panel members can learn a lot from their considered decisions. Editors and referees are simply far more knowledgeable about the average published paper on Martian termites than are members of the REF panel on exobiology.

In addition, it is being untruthful to ourselves and the world to suggest that REF panel members have enough time to assess properly the submitted papers in areas miles from their own specialisms. If asked whether this is true, the panellists are forced to dissemble.

So how might the panels sensibly use information on journal impact factors? In a new paper in The Economic Journal, “How Should Peer-Review Panels Behave?”, we argue that evaluation panels should blend citations and publications data, and then throw in a dash of oversight. The way to do this is to turn to Thomas Bayes, the clergyman who 300 years ago developed what are today known as Bayesian methods.

The idea is straightforward. Panellists would take a weighted average of the journal’s impact factor and the accrued citations to the article, with the weighting adjusted over time according to Bayesian principles. More and more weight would be put on citations in proportion to the length of time since the paper was published. In the early years, almost all the weight would be put on the journal impact factor, since citations would not have had time to accrue. The panellists could shade up or down the resulting quality assessment based on their own judgement.
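As a rough illustration of the weighting described above, here is a minimal sketch in Python. It is not the model from our paper. It assumes that both inputs have already been normalised to a common 0-to-1 quality scale, and it borrows the standard shrinkage weight t / (t + k) from a normal-normal Bayesian update, in which each year of citation data counts as one observation and the journal-based prior is worth `prior_years` observations. The function name, the parameter names and the default of four years are all hypothetical choices made for the example.

```python
def blended_score(journal_score, citation_score, years_since_publication,
                  prior_years=4.0):
    """Blend a journal-based prior with article-level citation evidence.

    Assumptions (illustrative only, not taken from the article):
    - both scores are already normalised to the same 0-to-1 scale;
    - the weight on citations is t / (t + prior_years), the shrinkage
      form that arises in a normal-normal Bayesian update.
    """
    if years_since_publication < 0:
        raise ValueError("years_since_publication must be non-negative")
    if prior_years <= 0:
        raise ValueError("prior_years must be positive")

    # Weight on citations starts at 0 and rises towards 1 as time passes.
    citation_weight = years_since_publication / (
        years_since_publication + prior_years
    )
    journal_weight = 1.0 - citation_weight
    return journal_weight * journal_score + citation_weight * citation_score


if __name__ == "__main__":
    # Hypothetical article: modest journal (0.3), strong citations (0.9).
    for years in (0, 1, 4, 20):
        print(years, round(blended_score(0.3, 0.9, years), 3))
```

At publication the score equals the journal-based figure; as the years pass it drifts towards the citation-based one. A panel would then shade the result up or down using its own judgement.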

We are familiar with the weaknesses of journal impact factors, such as the fact that they do not necessarily reflect the citation rate and/or quality of every paper they contain. But it is not rational to conclude that no weight should therefore be put on them, particularly right at the start of an article’s life. In the first few months post publication, the impact factor is the best guess we have about the likely importance of an article. Several decades later, a paper’s citation count plays that role. In between, weightings on each have to be chosen – and we believe that our paper merely formalises a way of doing so that is already carried out by many experienced researchers.

To give a practical example, consider an article published in the not-so-fancy journal NITS (Notional Insights on Termite Science). Imagine that a REF panel discovers that, after a small number of years, a specific article published in NITS happens to have a significantly better citation record than one in fancy journal HDILPOP (Huge Discoveries by Ivy League Professors Only Please). How should the panel react?

In the language of our paper, the citations record of the particular NITS article constitutes a series of good Bayesian signals, whereas the citations record of the particular HDILPOP article does not. A reasonable question for the panel is: how long should we persist in downgrading the NITS article on the basis of the journal’s impact factor if the high relative citations to it continue? In one illustrative calculation, we find that Bayes’ rule would suggest that roughly four years of conflicting citation data are needed before the original opinion should be reversed.
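The logic of such a calculation can be sketched with a deliberately simple two-state version of Bayes' rule. The sketch below is not the calculation from our paper and does not reproduce its four-year figure. It assumes that an article is either "high quality" or not, that the journal's standing supplies the prior probability, and that each year of citation data delivers one "good" signal whose likelihood differs between the two states. Every number and name in it is hypothetical, and the answer depends entirely on those assumed inputs.

```python
def bayes_update(prior_high, p_signal_if_high, p_signal_if_low):
    """Posterior probability of 'high quality' after one good signal."""
    numerator = prior_high * p_signal_if_high
    evidence = numerator + (1.0 - prior_high) * p_signal_if_low
    return numerator / evidence


def years_until_reversal(prior_high, p_good_if_high, p_good_if_low,
                         threshold=0.5, max_years=50):
    """Count consecutive good citation years needed before the belief
    that the article is high quality exceeds `threshold`.

    Returns None if the belief never crosses the threshold within
    `max_years` (for example, when the signals are uninformative).
    """
    belief = prior_high
    for year in range(1, max_years + 1):
        belief = bayes_update(belief, p_good_if_high, p_good_if_low)
        if belief > threshold:
            return year
    return None


if __name__ == "__main__":
    # Hypothetical inputs: a sceptical prior of 0.2 for the NITS article,
    # and a good citation year that is more likely if the article is
    # high quality (0.7) than if it is not (0.3).
    print(years_until_reversal(0.2, 0.7, 0.3))
```

The more sceptical the prior, or the noisier the yearly citation signal, the longer the panel should wait before reversing its original opinion.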

Journal articles are the main raw material of modern science – and arguably have the advantage, whatever their faults, of having been through a form of refereeing. Citations to them are the main marker of those articles’ influence – and arguably have the advantage, whatever their faults, of being observable in a way that is not true of the nodded approval of quietly self-selecting scientific communities. The REF panels should use that information.

This certainly does not mean that mature overview by experienced human beings ought to have no role. Purely mechanical procedures should never be used in REF-like evaluations of universities, scholars, departments or disciplines. Nevertheless, a weighted average of impact factor and article citations is the natural starting point for a sensible REF panel. And that goes for the social sciences and humanities, as well as for the field of Martian termite studies.

Reader's comments (7)

What about truly innovative work that does not get cited for a variety of reasons, for example if very few have contemporary expertise to use or comment on the findings or if very few are working on the subject area of the publication? Controversial findings may sometimes be marginal but receive attention in an effort to settle a detail (to provide an opposite example where high citation may not reflect quality). Science should not be reduced to fashion competition. I see no evidence in the proposal that metrics should primarily drive research assessments - as opposed to expert peer review. I also feel that setting researchers under conditions of constant assessment makes their lives rather unpleasant. How this affects the future of the profession remains to be seen.
The problem with this approach is that it assumes a direct (and apparently linear?) relationship between the quality of a paper and the number of citations it accrues. For some counter-arguments, see
A small number of UK scientists, including myself, actually work on termites, though terrestrial ones and not (yet) the Martian kind. Let me use my own case to inform this debate. In a career of 40 years, my two most cited articles concerning termites (there were also co-authors) appeared in a) Nature (current impact factor 38.6) and b) European Journal of Soil Biology (current impact factor 2.01). As it happens the second of these articles has the highest number of citations in the history of the journal, although it has taken 25 years to achieve this status, but before publication it was rejected by another journal with a higher profile (current impact factor 6.91) on the grounds of being insubstantial. Were I facing performance management in my old department at a Russell Group university (as many former colleagues are now), only the Nature paper would count to my credit, and then only on condition it had been published in the last five years. In restructuring the department the management of this institution takes the intensely short-term view that impact factor outweighs accumulated citation, and its hiring, firing and promotion policies are calibrated accordingly. My own view is that each of my papers has had roughly the same influence, though the second is more commonly cited within the field. But I am lucky to have kept my job long enough to reach this conclusion. Incidentally, I mildly resent the way in which termites are treated in the media with amusement, bordering on ridicule. The number of termites on Earth exceeds the number of humans by approximately eight orders of magnitude. Should a Martian biologist land on our planet (I assume this hasn't already happened) he (or she or it) would surely first ask to be taken to meet the Chief Termite. And with good reason.
This article by two economists nicely illustrates what is wrong with economics today: a willingness to sacrifice real-world relevance to simplistic models. Economics is a field in crisis because of this tendency. Its leading journals are full of articles that put forward simplistic (yet mathematically complicated) models of some analytical elegance but very limited applicability - whose practical use requires substantial intellectual corner-cutting by eliminating anything that does not fit their assumptions. Government policies based on them are often ineffective or downright wrong, even dangerous - as the failure of economic policy based on them - that was a major contributory cause of the financial crash of 2008 and continuing recession - shows. The real question that will face the Economics REF subpanel is whether they ignore the failure of this kind of currently fashionable economic thinking (for example ideas sometimes called Zombie Economics - efficient markets hypothesis, rational expectations theory, general equilibrium theory and so on) - that have been shown to have failed yet are still filling the pages of the leading journals, by applying a simple common-sense test of impact. The other trouble with the blindly unworldly Sgroi-Oswald methodology is that work criticising such failed thinking counts as a positive measure of its quality. Their own paper is published in the Economic Journal, a high ranking journal, so it starts with a very good prior score. Any comments pointing out its fairly obvious limitations are counted as citations which enhance its metric-based score!
Re “everyone knows that different journals have different quality standards for publication, so what is the point of pretending otherwise?” Where’s the evidence that high impact journals have higher quality reviewing? It’s not my experience, and I think there are two reasons for this. First, the high impact journals go for newsworthiness, which typically trumps all other considerations. Often the decision of whether to send a paper out to review is made by an editor who has no expertise in the area, but just judges whether the research is ‘exciting’. See e.g. It has been pointed out that while the impact factor of the journal is a poor predictor of the citations for individual papers in the journal, it is a better predictor of retractions. See Brembs, B., Button, K., & Munafò, M. (2013). Deep Impact: Unintended consequences of journal rank. Frontiers in Human Neuroscience, 7. doi: 10.3389/fnhum.2013.00291 Second, high impact journals are typically less specialised than other journals - e.g. Proceedings of the National Academy of Sciences covers all of science. The editor is less likely to know which reviewers are experts in an area. Not surprisingly, then, methodologically weak papers in high impact journals occur all the time. It’s got to the point where if I see a psychology or neuroscience paper published in PNAS I anticipate that it will have major methodological flaws. In my experience, if you want to get really thorough reviewing, you go to a journal that publishes lots of papers in the area of your study, with an editor who is well-regarded in your field. Typically, this will not be a high-impact journal.
The authors acknowledge that "journal articles are the main raw material of modern science" but then go on to outline plans based on a metric that fails time and again at the article level. Statistical illiteracy of the highest order, I'm afraid:
VALIDATING AND WEIGHTING METRICS AGAINST REF PEER RANKINGS Metric predictors of research performance need to be validated by showing that they have a high correlation with the external criterion they are trying to predict. The UK Research Excellence Framework (REF) -- together with the growing movement toward making the full-texts of research articles freely available on the web -- offer a unique opportunity to test and validate a wealth of old and new scientometric predictors, through multiple regression analysis: Publications, journal impact factors, citations, co-citations, citation chronometrics (age, growth, latency to peak, decay rate), hub/authority scores, h-index, prior funding, student counts, co-authorship scores, endogamy/exogamy, textual proximity, download/co-downloads and their chronometrics, etc. can all be tested and validated jointly, discipline by discipline, against their REF panel rankings in the forthcoming parallel panel-based and metric REF in 2014. The weights of each predictor can be calibrated to maximize the joint correlation with the rankings. Open Access Scientometrics will provide powerful new means of navigating, evaluating, predicting and analyzing the growing Open Access database, as well as powerful incentives for making it grow faster.