Measures without meaning

Peer review lets us reward excellence when we see it; spurious 'absolute' standards do not, says Ron Johnston

May 8, 2008

The Government has changed its mind about the proposed replacement for the research assessment exercise - the research excellence framework. Initially science, technology, engineering and medicine (STEM) disciplines were to be evaluated using metrics (largely citation impact data) whereas non-STEM subjects (humanities and social sciences) were to have a "light-touch form of peer review (sic) informed by quantitative indicators".

The Department for Innovation, Universities and Skills now proposes a "new overarching framework ... For all subjects, assessment will include metrics-based indicators, including bibliometric indicators wherever appropriate, as well as input from expert panels" (Times Higher Education, 24 April).

In the original proposal, STEM subjects were grouped into just six categories (as broad as biological and physical sciences) with citation rates per paper in each normalised by "field", type and year of publication. "Field" was defined as ISI's (Thomson's Web of Science) subject categories.

Are bibliometric indicators appropriate for non-STEM subjects, which, like most sciences, operate as myriad relatively small communities? Few scholars transcend these with work widely read across their discipline, and such communities are much smaller than ISI's subject fields. It would be impossible to calibrate indicators to accommodate their size and variability, and individuals working in "minority" areas - the great majority - would be disadvantaged unless the indicators were interpreted by the expert panels, which would vitiate their use in the first place. ISI data cannot be readily downloaded and "normalised" to produce reliable measures of quality for non-STEM subjects.

Many other problems of using quantitative indicators of research excellence (both citation impact and research income) have been identified, all pointing to their undesirability (Environment and Planning A, vol 40, March). Evaluation can be done only by peer review; experts use knowledge and experience to make subjective judgments as in previous RAEs. That practice should continue - if the exercise has to.

For RAE 2008, the overseeing body has tried to "objectivise" the subjective judgment-making. Its three key criteria are originality, significance and rigour, with each output placed on a five-grade scale: those getting the highest rating will be "world-leading", followed by "internationally excellent", "recognised internationally", "recognised nationally" and "falls below nationally recognised". These are "absolute standards". Previous RAE panellists have judged quality against similar phrases but were not required to place each output into one of five categories.

To ensure "transparency", each RAE 2008 sub-panel was asked for its own gloss on the definitions. Their alternative wordings, analysed in Higher Education Quarterly, vol 62 (1/2), provide no greater clarity, illustrating that creating absolute, measurable standards of excellence is an impossible task.

The sub-panels have simply rewritten the definitions in alternative wordings that are no more objective. For example, several refer to "outstanding", "highly significant", "significant" and "a" contribution as meriting grades of 4*, 3*, 2* and 1* respectively, whereas another uses "highly significant", "significant", "recognised contribution" and "limited contribution". One defines 4* outputs as "setting the research agenda ... likely to have a lasting impact", and another as "comparable to the best work in the field". Others use phrases such as "is, or ought to be, an essential point of reference ... and makes a contribution of which every serious researcher in the field ought to be aware", whereas outputs in lower grades are "likely to inform subsequent work" or "merits some attention". And one sub-panel opines that 2* work offers "an incremental advance within existing paradigms, traditions of inquiry, or domains of policy and practice". None of these clarifies to which grade outputs should be allocated.

The US Supreme Court Justice Potter Stewart famously said he could not define hardcore pornography. "But I know it when I see it." That claim applies equally well to past RAEs, and it should continue to do so. If we must have an REF as a partial basis for university funding, at least don't try to dignify it with quantitative indicators and spurious definitions of grades of excellence. Appoint (or elect) the experts and let them exercise their judgment without any meaningless constraints.