Academic conclusions differ wildly even on same data, study finds

New study raises questions over validity of single studies and suggests detailed tracking of researchers’ decisions during analysis

Published on July 26, 2021

Last updated August 2, 2021

David Matthews

Twitter: @DavidMJourno

Academics come to vastly different research conclusions even when given the same questions and dataset, underlining the need for scholars to meticulously document the decisions and judgements they make during their work, a new study has found.

Twenty-nine teams of analysts tested two hypotheses on a common dataset of online academic discussions.

The first hypothesis was that “a woman’s tendency to participate actively in a conversation correlates positively with the number of females in the discussion”.

The second postulated that “higher-status participants are more verbose than are lower-status participants”.

By tracking the decisions made by researchers using a new tool called DataExplained, the study discovered just how open to interpretation these questions were.

Some analysts, for example, defined “high status” as an academic’s job rank, whereas others used citations. “Verbose” could mean the number of words in an academic’s comment or the number of comments they made over the course of a year. Different teams also used different statistical techniques and sample sizes.
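To make the point concrete, here is a minimal sketch of how the choice of definition alone can flip a result. The five participants and all their numbers are invented for illustration and are not taken from the study's dataset; the `pearson` helper and the variable names are likewise assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: one entry per participant (NOT from the study).
job_rank = [1, 2, 3, 4, 5]               # one way to define "status"
citations = [120, 40, 300, 90, 200]      # another way to define "status"
words_per_comment = [30, 45, 50, 70, 90]  # one way to define "verbose"
comments_per_year = [25, 18, 20, 9, 6]    # another way to define "verbose"

status_defs = {"job rank": job_rank, "citations": citations}
verbosity_defs = {
    "words per comment": words_per_comment,
    "comments per year": comments_per_year,
}

# Same participants, same question, four defensible analyses.
results = {
    (s_name, v_name): pearson(s, v)
    for s_name, s in status_defs.items()
    for v_name, v in verbosity_defs.items()
}

for (s_name, v_name), r in results.items():
    print(f"status = {s_name:9s} | verbosity = {v_name:17s} | r = {r:+.2f}")
```

In this toy data, senior academics write longer comments but post less often, so ranking by job title supports the hypothesis under one definition of verbosity and contradicts it under the other.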

“Where you make judgements, there is noise, and more than we think,” said co-author Martin Schweinsberg, assistant professor of organisational behaviour at the business school ESMT Berlin.

The result was that “researchers reported radically different analyses and dispersed empirical outcomes”, according to the paper, which was published in Organizational Behavior and Human Decision Processes.

For the second hypothesis, testing a link between status and verbosity, 29 per cent of analysts found evidence in support, but 21 per cent concluded the exact reverse.

As for the idea that women speak more when other women are present, there was more consensus, with nearly two-thirds finding support for this hypothesis. Still, more than a fifth found an effect in the opposite direction.

The findings “very vividly show” just how many ways there are of tackling a seemingly simple question, said Dr Schweinsberg.

The work is the latest in a series of crowdsourced experiments in which multiple research teams independently tackle the same question with the same data. One 2018 experiment explored racial bias, looking at whether football referees gave more red cards to dark-skinned players.

A majority found evidence of racial bias, but the spectrum of findings was huge, with the “disturbing implication [being] that if only one team had obtained the dataset and presented their preferred analysis, the scientific conclusion drawn could have been anything from major racial disparities in red cards to equal outcomes”, according to Dr Schweinsberg’s paper.

This latest experiment is different in that it closely tracked participants’ decisions through DataExplained. “We provide a step-by-step chain to see how this happened,” Dr Schweinsberg said. “Every few lines of code, we ask them a set of standard questions about paths taken.”

The platform, made public earlier this year, “records all executed source code and prompts analysts to comment on their code and analytical thinking steps”, the study explains.
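The general idea of pairing each executed step with the analyst's stated reasoning can be sketched in a few lines. This is not DataExplained's actual interface, which the article does not describe; the `AnalysisLog` and `Step` names, the `run` method and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    code: str       # the source code that was executed
    rationale: str  # the analyst's explanation for this step

@dataclass
class AnalysisLog:
    """Hypothetical log that runs code and records why each step was taken."""
    steps: list = field(default_factory=list)
    namespace: dict = field(default_factory=dict)

    def run(self, code: str, rationale: str) -> None:
        # Execute the snippet in a shared namespace, then record it
        # together with the analyst's stated reasoning.
        exec(code, self.namespace)
        self.steps.append(Step(code, rationale))

log = AnalysisLog()
log.run(
    "words = [30, 45, 50, 70, 90]",  # invented values, not study data
    "Measure verbosity as words per comment.",
)
log.run(
    "mean_words = sum(words) / len(words)",
    "Summarise verbosity with the mean.",
)

for number, step in enumerate(log.steps, start=1):
    print(f"{number}. {step.code}  # why: {step.rationale}")
```

A record like this lets a reader trace where two teams' analyses diverged, rather than seeing only their final numbers.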

Whether this kind of systematic monitoring makes sense depends on the question asked, said Dr Schweinsberg. “If someone’s dead or alive, there’s not much ambiguity,” he said, although his paper points to contradictory findings even in the medical field.

“If the question is big enough and has implications that are important enough, it might be sensible to [do] something like this,” he suggested.

david.matthews@timeshighereducation.com