In 2003, the last of the 3 billion base pairs that make up the human genome were finally catalogued. Researchers pushed hard to use that blueprint to decipher how our individual genetic signatures put us at risk of developing diseases ranging from cancer to Alzheimer’s. Prior to that landmark achievement, biomedical researchers seeking to understand and cure disease were hamstrung by the ebb and flow of disconnected pools of data and the lack of tools to easily analyse all that information.
Now data are ubiquitous and readily accessible. Anyone with a smartphone can make a sophisticated query of the complete human genome. Data and discoveries are transmitted around the globe through online data repositories, journals and even social media.
This meteoric explosion of new data isn’t limited to biomedical research. The Integrated Public Use Microdata Series, the world’s largest population database, enables anyone to mine US census records dating back to before the Civil War. Imagine being able to examine how unemployment during the Great Depression affected individual households across various demographics. Now think about the possibilities of using those data to shape contemporary economic and social policy.
The importance of leveraging data to improve the human condition has entered the national conversation in the US, and rightly so. The White House hired the nation’s first chief data scientist in 2015 and the government is making huge amounts of data available online. Through a single web portal, open.whitehouse.gov, anyone can obtain a Medicare analysis about what hospitals charge for common inpatient procedures; locate alternative fuel stations across the country; track, in real time, earthquakes around the world; or download a college planning tool to help students and families select the right institution for them.
The digital universe is expected to grow to around 44 zettabytes (44 × 10²¹ bytes) by 2020. By comparison, in 2009, the entire World Wide Web was estimated to fill just half a zettabyte. A zettabyte contains roughly 1,000 exabytes; Cisco Systems reports that a single exabyte can stream the entire Netflix catalogue more than 3,000 times. But consider this: only 0.5 per cent of all data is ever analysed and used at present. This is deeply worrying. At best, important knowledge is being left on the table. At worst, incorrect or incomplete analyses are leading to flawed conclusions and outcomes.
The problem is that the ability to manage and interpret large datasets, via techniques such as machine learning, largely remains the purview of high-level programmes in informatics, engineering and biomedical research. The majority of students and scholars don’t know how to manipulate, visualise and analyse data.
To chart a new course, we first must move beyond the debate around the merits of liberal arts versus technical disciplines, which divides educational institutions, government agencies and families. Big data is relevant to every field and is already reshaping some of the humanities.
Case in point: at Tufts University, students in digital humanities study the creation, transmission, preservation and transformation of knowledge across time and cultures, from antiquity to the early modern period. Tufts’ interactive Perseus Digital Library gives students, teachers and scholars around the world access to a broad collection of documents spanning this wide period, in languages including Latin, Greek and Arabic. It receives nearly 100 million hits a year and has transformed the study of Classics.
We must remain committed to the value of a liberal arts education in promoting critical thinking, complex problem solving, sophisticated written and oral communication and a deep understanding of the human condition. But we must integrate it with the quantitative mindset, and with a core competence in data science, computational science and analytics. That way, our students will be prepared to solve society’s most urgent problems.
Runaway big data is not just an academic issue, though. In healthcare, data from electronic patient records are informing treatment protocols. In national security, big data and analytics are the first line of defence against cyber attacks. And in business, the applications for big data are truly limitless. We need managers and analysts who know how to use big data to assist in making effective decisions, but a McKinsey report projects that in just two years, there could be 1.5 million such roles unfilled in the US alone.
Ideally, every undergraduate would be compelled to take a basic data science course, much as most institutions now require proficiency in writing and a foreign language in order to obtain a bachelor’s degree. Such a course would prepare students to analyse and interpret big data, while also making them conversant in the critical ethical, legal and social issues raised by doing so. Faculty support and expertise are essential. More data-adept talent is needed in our classrooms and laboratories. Even many bench scientists currently struggle to manage and harness the data deluge.
History tells us that following the advent of printing, scholars sought new ways to manage unprecedented torrents of information. Colleges and universities have an obligation to help their scholars and students master the tools of data analysis, not for the good of the academy but for the good of us all.
Anthony Monaco, a neuroscientist and geneticist, is president of Tufts University.