Why do scientists struggle to reproduce results?

Researchers face pressure to hype and report selectively, says Dorothy Bishop

June 4, 2015

It is 350 years since the Royal Society published the first issue of Philosophical Transactions, marking the start of scholarly scientific communication. To celebrate this event, and to anticipate future developments, the Royal Society recently held four days of discussion, including one on reproducibility and fraud in science.

At the meeting there was general agreement that lack of reproducibility was a serious problem in many areas. Quite simply, science depends on the assumption that if you run an experiment, and document clearly the conditions under which you ran it, then another person should be able to get a comparable result. Yet difficulties have been surfacing in many fields, even attracting the attention of The Economist; a 2013 article, “Unreliable research: trouble at the lab”, discussed a range of reports of low reproducibility, including one in which a drug company managed to replicate the results of just six of 53 landmark cancer studies.

So what has gone wrong? A distinction is usually drawn between fraud (which is generally thought to be fairly rare) and questionable research practices, such as inappropriate use of statistics, selective reporting of data and hyping of findings. However, the boundary between these is not always easy to draw.

Statistics are crucial in many fields where we try to detect an effect in a context where there is uncontrolled variation. We know, for instance, that there are differences in height between males and females, but if I just measured a sample of five men and five women, I might not find much difference: there are many factors other than gender that affect height and we need a larger sample to detect the male-female difference among the background noise. In cases like this, we rely heavily on statistics to determine when an observed difference is meaningful and when it is not.

It is all too easy to feed data into a statistical package that spews out numbers, but it is crucial to understand what they mean. Anyone who is using statistics should first be shown how to generate random data. Suppose you do a study in which you compare two groups of 20 people on 10 measures. Most researchers get excited if they find a group difference that is “significant” – in other words, that has an associated probability (p-value) of less than .05. However, if you have 10 comparisons, then you will find at least one “significant” difference in your random dataset on about 40 per cent of occasions. There’s a huge difference between using statistics to test a specific prediction and looking for any “significant” finding in a large dataset.
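The 40 per cent figure can be checked by doing exactly what the paragraph recommends: generating random data. The sketch below is an illustration, not part of the original article. It assumes the 10 measures are independent, draws every score from the same normal distribution so that no real group difference exists, and uses an equal-variance two-sample t-test. The function names and the hard-coded critical value (about 2.024 for 38 degrees of freedom, two-tailed, at the .05 level) are choices made for this example.

```python
import random

# Assumption: two-tailed 5% critical value of Student's t with 38 degrees
# of freedom (two groups of 20), hard-coded to avoid extra dependencies.
T_CRIT = 2.024

# If the 10 tests are independent, the chance of at least one false positive.
ANALYTIC_RATE = 1 - 0.95 ** 10  # roughly 0.40


def t_statistic(a, b):
    """Equal-variance two-sample t statistic for two lists of scores."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (mean_a - mean_b) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


def study_has_false_positive(rng, n_per_group=20, n_measures=10):
    """Simulate one study on pure noise; True if any measure is 'significant'."""
    for _ in range(n_measures):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if abs(t_statistic(group_a, group_b)) > T_CRIT:
            return True
    return False


def false_positive_rate(n_studies=2000, seed=1):
    """Share of simulated null studies with at least one 'significant' result."""
    rng = random.Random(seed)
    hits = sum(study_has_false_positive(rng) for _ in range(n_studies))
    return hits / n_studies


if __name__ == "__main__":
    print(f"simulated: {false_positive_rate():.3f}")
    print(f"analytic:  {ANALYTIC_RATE:.3f}")
```

With a few thousand simulated studies, the simulated proportion should land close to the analytic value of 1 − 0.95^10. If the measures were correlated, as they often are in real studies, the exact figure would differ, but the inflation of false positives would remain.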

There are methods for adjusting the statistical test to take into account the numerous comparisons that are being considered, but these are often ignored, leading to unreliable, irreproducible findings. It is common practice for researchers simply to report the “significant” effects in a dataset without mentioning other variables that were tested and found to be uninteresting.
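Two widely used adjustments of this kind are the Bonferroni correction and Holm's step-down procedure. The article does not name any particular method, so the sketch below is only an illustration of what such an adjustment looks like; the function names and the list of p-values are hypothetical and not taken from any real study.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag a p-value only if it is at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm's step-down procedure: test the smallest p-value against alpha/m,
    the next against alpha/(m-1), and so on, stopping at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break
        significant[i] = True
    return significant


if __name__ == "__main__":
    # Hypothetical p-values for 10 measures from a single study.
    p_values = [0.003, 0.04, 0.20, 0.51, 0.012, 0.73, 0.09, 0.33, 0.62, 0.0055]
    print("uncorrected:", sum(p <= 0.05 for p in p_values))
    print("bonferroni: ", sum(bonferroni(p_values)))
    print("holm:       ", sum(holm(p_values)))
```

For these made-up values, four measures pass the uncorrected .05 threshold, but only one survives Bonferroni and two survive Holm. Reporting just the four "significant" results, without mentioning that 10 comparisons were made, would overstate the evidence.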

Is such “p-hacking” fraudulent? In some cases, the people who do this are simply ignorant, and may indeed have been trained in such practices by their supervisor. However, I suspect that many who engage in it are well aware that it is a “questionable” research practice but don’t regard it as unethical because it has become so normative.

A common response to this problem is to argue for more and better training in statistics. This is undoubtedly needed, but it ignores the deeper problem, which is the incentive structure created by journals, funders and higher education institutions. All of these emphasise the value of novel, groundbreaking research. If you have just spent three years running a study and you find a series of null results, you know that the likelihood of its being publishable is low. It becomes very tempting, therefore, to drop most of the variables in the study and to focus on just those with a “significant” p-value.

To many researchers, this doesn’t feel like fraud because they aren’t making anything up: everything they report is true. But it is fraud because the interpretation of the statistics is completely different if the full picture is presented.

The incentives provided by funders pile further pressure on researchers to distort their results. The term “incremental” is the death knell for a research proposal. In the US, the National Institutes of Health moved explicitly away from a focus on methodology to a focus on innovation. But is this wise? Yes, we need creative individuals who will advance their topic in new directions, but there are some areas of research that require years of sustained work to come to fruition – with many failures and wrong turnings along the way.

In her talk at the Royal Society meeting, Ottoline Leyser, director of the University of Cambridge’s Sainsbury Laboratory, noted that today’s early career researchers are made to think that the only way they can get a secure job is to come up with a totally original idea that not only turns out to be valid and leads to a publication in Nature or Science, but also leads to a spin-off company that turns it into a real-world application that will dramatically alter people’s lives. If you love science but are told that this is the route to success, then misery and conflict ensue: either you cut corners to achieve exciting results, or you leave the field. Doing careful studies with uncertain outcomes over a period of years does not seem to be an option.

For things to change, we need to alter the incentive structure, as well as focusing on better training in methods. Researchers have to feel that they will be rewarded for doing reproducible research. Yes, we want new and exciting findings, but if they are not reproducible, then they are worse than useless. Trying to build on a hyped, unreliable result is like trying to build a house on a foundation of quicksand: expensive, demoralising and wasteful.

Dorothy Bishop is professor of developmental neuropsychology at the University of Oxford.


POSTSCRIPT:

Article originally published as: Houses built on sand (4 June 2015)
