The risks of not sharing data are greater than the costs

Large investments are needed to make research data open and accessible but tackling global problems depends on it, says Paul Ayris

Published on

February 8, 2020

Last updated

February 8, 2020

Paul Ayris

Twitter: @ucylpay

Research data represent a new currency. Traditionally, publishers did not publish or curate research data; they were interested only in publications. Now, research data and publications are at least equally important.

Data aren’t just good for science. They are also good for driving the kind of innovation needed to solve some of the biggest issues facing the world today. The outbreak of the coronavirus is a perfect example of how open data can help tackle a major global issue.

The spread of the virus sparked discussion in Paris last month where nine global university networks signed the Sorbonne Declaration vowing to develop appropriate reward systems for scholars who made their data open and accessible. It was a major milestone in advancing the cause of research data management in an open science/scholarship world.

But, without investment and culture change for managing and sharing data, the risk could be bigger than the return.

A recent European Commission report found that sharing and better managing research data would save €10.2 billion per year in Europe, with a potential further €16 billion of added value from the innovation generated.

Research data that are FAIR (findable, accessible, interoperable and reusable) and open for sharing and reuse mean that researchers have easy access to experimentation by others, thus avoiding the need for costly duplication.

They also enable research groups to look at research methodology, to test the results produced by others, and to detect mistaken, even fraudulent, use of data. This is an important part of research integrity which, in the UK, is being promoted by the UK Reproducibility Network.

Not all research data can be open, because much – such as patient data – contains personal information that would be inappropriate to share. In this sense, open data are different from open access publications, where current debates centre on making the whole of a publisher’s output open, rather than a part of it.

These are the kinds of challenges for universities in embracing FAIR, open data.

Universities have to create institutional research data repositories in which data can be curated and stored for the long term, as “open as possible, as closed as necessary”. In turn, there need to be data specialists to run and manage the service, and these salary costs need to be met.

So new is the concept of research data management that the first report on the European Open Science Cloud stressed that Europe needed half a million research data stewards within a decade. It also said that well-budgeted data stewardship plans should be made mandatory, with the expectation that, on average, about 5 per cent of research expenditure should be spent on properly managing and stewarding data.

These are not trivial costs. What proportion of these costs should be funded by the university, and how much should be written into grants from research funders? That is an ongoing debate. And promotion and reward schemes need to value data, alongside publications, to encourage researchers to adopt open practices.

For researchers in many disciplines, the concept of sharing data and making them open is completely new. In other areas, such as high-energy physics or astronomy, it is virtually business as usual. The great challenges that face society – such as poverty, global warming, ill health and injustice – are best tackled collaboratively. Collaborative research is quicker and, where it is open and transparent, available for close scrutiny.

How open is the research community now? At the launch of the Sorbonne Declaration, Dr Simon Hodson, executive director of CODATA, the committee on data of the International Science Council, showed that the response to the Ebola outbreak involved many organisations and that the resulting data were widely scattered geographically. Of the data collected, 65 per cent were not shared.

Most data cannot be accessed directly at the record level (eg, they are summarised in studies and not shared). Meanwhile, most clinical records from the outbreak are simply PDF scans (and thus not universally accessible).

There is a lack of both metadata to describe the data and a common data dictionary (a set of definitions that allows the variables in the data to be understood). It is also technically difficult to integrate all the different types of data that have been collected.

There are lessons to be learned here – data capture for the current coronavirus must not make the same mistakes.

The benefits of open data can be seen in the Human Genome Project, which is regarded as a model for making research data openly available over the internet. Before the project, scientists shared their research findings in scientific journals. By the end of the project, scientists were willing to release their data to the world before publication.

Former US president Bill Clinton called it “one of the most important, most wondrous maps ever produced by humankind” and former UK prime minister Tony Blair called it “a revolution in medical science whose implications far surpass even the discovery of antibiotics, the first great technological triumph of the 21st century”.

The Sorbonne Declaration builds on the growing importance of research data in the research landscape. There is a mountain to climb, but the declaration marks the commitment of global university networks to expedite the journey.

Paul Ayris is pro vice-provost of UCL library services and co-chair of the League of European Research Universities Info Community.