Data science should learn to speak domain languages


HSE University’s Laboratory of Methods for Big Data Analysis utilises machine learning to benefit academia and industry

The digital revolution is accelerating, and data form the backbone of innovation and industrial development. As a result, governments and businesses are developing machine-learning algorithms and other data-driven techniques that offer new services and products to citizens.

Science is also benefitting from an explosion in data, as researchers in different fields generate information in unprecedented quantities. But off-the-shelf data analysis solutions often fall short of the rigour that scientists require, which is where the Laboratory of Methods for Big Data Analysis (LAMBDA) comes in.

Part of Russia’s HSE University, the laboratory was established in 2015 with support from Yandex, a leading Russian IT company, to fulfil this need, explains Fedor Ratnikov (pictured), a senior scientist at LAMBDA. “For industry, when we develop a data-science method, usually the problem is not complicated but the solution should be extremely reliable,” he says, citing the example of customer billing for a cellular company. In contrast, “for science, the problem is usually pretty complicated”, but the lab can experiment with different ideas and try a variety of methods.

A fundamental aspect of the laboratory’s work is to collaborate with other scientists to see how data science could help them address problems specific to their fields. “Basically, we translate the problem out of the language of their science into the language of data science,” says Denis Derkach, another of the lab’s senior scientists. The lab’s researchers then collaborate with the scientists to ensure that the solution they are proposing is relevant and useful.

“There are quite a few people, from astrophysicists to neuroscientists, who collaborate with us on challenges that machine learning can solve in their fields,” says Professor Derkach.

As an associate member of CERN’s Large Hadron Collider beauty experiment (LHCb), LAMBDA has experience in using machine-learning techniques to demystify large quantities of complex data. The collider is a giant particle accelerator, located about 100m underground, which straddles the border between France and Switzerland. Superconducting magnets in the accelerator guide two beams of particles – protons or lead ions – around the 27km ring and smash them into each other in a bid to understand the particles that constitute matter.

The LHCb investigates the differences between matter and antimatter through studying the behaviour of an elementary particle called a “beauty quark” or “b quark”. The experiment is currently being upgraded, and when the Large Hadron Collider fires up again in 2021, data will flow from the LHCb at a remarkable four terabytes a second.

It would not be possible for a person – or even a team of people working day and night – to derive meaning and patterns from such a large quantity of data. This is why high-energy physicists need data scientists and their machine-learning techniques.

“We see great potential for developing machine learning in high-energy physics,” said laboratory head Andrey Ustyuzhanin at the announcement of the Cern-LAMBDA collaboration in 2018. “At the same time, the problems that arise in physics serve as the impetus for new approaches in machine learning that can be applied to other areas as well. Accurate and fast tuning of physics simulation software is a good example of such a problem.”

Through this collaboration, researchers at LAMBDA have full access to the experiment’s data, says Professor Ratnikov. “Cern is a major driver of our efforts, and at least half of the laboratory is involved.”

However, the laboratory is also keeping its eye on future experiments, such as the Search for Hidden Particles experiment (SHiP) at Cern, he says. SHiP, a collaboration of 54 institutes from 18 countries, will be a general-purpose experiment.

“We have to consider our future after the Large Hadron Collider stops operation,” says Professor Ratnikov. “It is important to be involved in future experiments.”

The laboratory is also involved in Nica, also known as the Nuclotron-based Ion Collider fAcility. The new accelerator complex, designed at the Joint Institute for Nuclear Research in Dubna, Russia, is to be commissioned next year.

Nica will house a multi-purpose detector on its collider, and “we are trying to introduce cutting-edge machine-learning technologies there”, says Professor Ratnikov. “We’re trying to implement generative models to speed up the simulation of the detector. This is the most computer-consuming part of particle physics.”

To do this, Professor Derkach says, they want to use current techniques rather than ones developed five years ago, which are already out of date in the fast-moving world of data science.

While the laboratory’s main goal is finding solutions for scientists, it also performs consultancy work for industry and government – and these solutions can find their way back into scientific laboratories. Professor Derkach gives the example of an early-warning system for data centres. “We have constructed a system which predicts that there could be a failure somewhere in a data centre’s system,” he explains. “It gives early warning to the operator of the centre.”

Those techniques could also help them detect anomalies in the data obtained by the LHCb experiment. Anomalies occur when the detector appears to behave differently from usual.

But while the laboratory is working to solve real-world industrial and scientific problems, it is also part of the HSE’s computer science faculty. “Our laboratory is deeply involved in education,” says Professor Derkach. LAMBDA’s staff, which represent five nationalities, also train HSE students in data science theory and its practical applications.

Teamwork underpins this teaching. Current large scientific programmes – such as high-energy physics experiments on the LHC – involve hundreds of scientists working together to untangle complex scientific mysteries. To do that, they need to collaborate.

Data science is “not an exact science, like mathematics,” Professor Ratnikov says. Experience is crucial in being able to develop the algorithms and solutions. “And the exchange of experience speeds up developments. Communication, teamwork, discussing ideas, they are all very important in data science. We teach it to our students through our own experience and with the help of advanced collaboration techniques.”
