It is 350 years since the Royal Society published the first issue of Philosophical Transactions, marking the start of scholarly scientific communication. To celebrate this event, and to anticipate future developments, the Royal Society recently held four days of discussion, including one on reproducibility and fraud in science.
At the meeting there was general agreement that lack of reproducibility was a serious problem in many areas. Quite simply, science depends on the assumption that if you run an experiment, and document clearly the conditions under which you ran it, then another person should be able to get a comparable result. Yet difficulties have been surfacing in many fields, even attracting the attention of The Economist; a 2013 article, “Unreliable research: trouble at the lab”, discussed a range of reports of low reproducibility, including one in which a drug company managed to replicate the results of just six of 53 landmark cancer studies.
So what has gone wrong? A distinction is usually drawn between fraud (which is generally thought to be fairly rare) and questionable research practices, such as inappropriate use of statistics, selective reporting of data and hyping of findings. However, the boundary between these is not always easy to draw.
Statistics are crucial in many fields where we try to detect an effect in a context where there is uncontrolled variation. We know, for instance, that there are differences in height between males and females, but if I just measured a sample of five men and five women, I might not find much difference: there are many factors other than gender that affect height and we need a larger sample to detect the male-female difference among the background noise. In cases like this, we rely heavily on statistics to determine when an observed difference is meaningful and when it is not.
It is all too easy to feed data into a statistical package that spews out numbers, but it is crucial to understand what they mean. Anyone who is using statistics should first be shown how to generate random data, so that they can see how often pure noise produces an apparently meaningful result. Suppose you do a study in which you compare two groups of 20 people on 10 measures. Most researchers get excited if they find a group difference that is “significant” – in other words, that has an associated probability (p-value) of less than .05. However, if you have 10 independent comparisons, then you will find at least one “significant” difference in your random dataset on about 40 per cent of occasions (1 − 0.95^10 ≈ 0.40). There’s a huge difference between using statistics to test a specific prediction and looking for any “significant” finding in a large dataset.
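The 40 per cent figure can be checked by generating random data, as suggested above. What follows is only a minimal sketch under stated assumptions: the scores are drawn from a standard normal distribution, the 10 measures are independent of one another, each comparison is an ordinary pooled-variance t-test, and the critical value of 2.024 (two-tailed, 5 per cent level, 38 degrees of freedom) is hard-coded. The function names are illustrative and not taken from any statistical package.

```python
import math
import random

# Two-tailed 5% critical value of Student's t with 38 degrees of freedom
# (two groups of 20). Hard-coded assumption: only valid for 20 per group.
T_CRIT_DF38 = 2.024


def t_statistic(a, b):
    """Pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def study_has_false_positive(rng, n_measures=10, n_per_group=20):
    """Run one study on pure noise; True if any measure looks 'significant'."""
    for _ in range(n_measures):
        # Both groups come from the same distribution: there is no real effect.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if abs(t_statistic(group_a, group_b)) > T_CRIT_DF38:
            return True
    return False


def simulate(n_studies=2000, seed=42):
    """Proportion of noise-only studies with at least one 'significant' result."""
    rng = random.Random(seed)
    hits = sum(study_has_false_positive(rng) for _ in range(n_studies))
    return hits / n_studies


expected = 1 - 0.95 ** 10  # analytic rate for 10 independent tests
observed = simulate()
print(f"analytic: {expected:.3f}, simulated: {observed:.3f}")
```

The simulated proportion should land close to the analytic value, although it will vary with the seed and the number of simulated studies. In real datasets the measures are usually correlated, in which case the rate is somewhat lower than the independent-test figure but still well above 5 per cent.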
There are methods for adjusting the statistical test to take into account the numerous comparisons that are being considered, but these are often ignored, leading to unreliable, irreproducible findings. It is common practice for researchers simply to report the “significant” effects in a dataset without mentioning other variables that were tested and found to be uninteresting.
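Two widely used adjustments of this kind are the Bonferroni correction and Holm's step-down procedure. The sketch below shows how each changes the verdict on a set of 10 p-values; the p-values are made up purely for illustration and do not come from any real study, and the function names are my own.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only those tests whose p-value is at most alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm's step-down procedure: test p-values from smallest to largest,
    relaxing the threshold at each step, and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every larger p-value is also retained
    return reject


# Hypothetical p-values for 10 comparisons (illustrative only).
p_values = [0.004, 0.0055, 0.03, 0.2, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]

uncorrected = [p < 0.05 for p in p_values]
print("uncorrected:", sum(uncorrected), "significant")
print("Bonferroni: ", sum(bonferroni(p_values)), "significant")
print("Holm:       ", sum(holm(p_values)), "significant")
```

Both procedures keep the chance of at least one false positive across the whole set of comparisons at or below 5 per cent. Holm's method is never less powerful than Bonferroni's, which is why it rejects at least as many hypotheses here.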
Is such “p-hacking” fraudulent? In some cases, the people who do this are simply ignorant, and may indeed have been trained in such practices by their supervisor. However, I suspect that many who engage in it are well aware that it is a “questionable” research practice but don’t regard it as unethical because it has become so normative.
A common response to this problem is to argue for more and better training in statistics. This is undoubtedly needed, but it ignores the deeper problem, which is the incentive structure created by journals, funders and higher education institutions. All of these emphasise the value of novel, groundbreaking research. If you have just spent three years running a study and you find a series of null results, you know that the likelihood of its being publishable is low. It becomes very tempting, therefore, to drop most of the variables in the study and to focus on just those with a “significant” p-value.
To many researchers, this doesn’t feel like fraud because they aren’t making anything up: everything they report is true. But it is fraud because the interpretation of the statistics is completely different if the full picture is presented.
The incentives provided by funders pile further pressure on researchers to distort their results. The term “incremental” is the death knell for a research proposal. In the US, the National Institutes of Health moved explicitly away from a focus on methodology to a focus on innovation. But is this wise? Yes, we need creative individuals who will advance their topic in new directions, but there are some areas of research that require years of sustained work to come to fruition – with many failures and wrong turnings along the way.
In her talk at the Royal Society meeting, Ottoline Leyser, director of the University of Cambridge’s Sainsbury Laboratory, noted that today’s early career researchers are made to think that the only way they can get a secure job is to come up with a totally original idea that not only turns out to be valid and leads to a publication in Nature or Science, but which then also leads to a spin-off company that turns it into a real-world application that will dramatically alter people’s lives. If you love science but are told that this is the route to success, then misery and conflict ensue: either you cut corners to achieve exciting results, or you leave the field. Doing careful studies with uncertain outcomes over a period of years does not seem to be an option.
For things to change, we need to alter the incentive structure, as well as focusing on better training in methods. Researchers have to feel that they will be rewarded for doing reproducible research. Yes, we want new and exciting findings, but if they are not reproducible, then they are worse than useless. Trying to build on a hyped, unreliable result is like trying to build a house on a foundation of quicksand: expensive, demoralising and wasteful.
Dorothy Bishop is professor of developmental neuropsychology at the University of Oxford.