Finding the right data preparation recipe

This article was written by Ansgar Schuffenhauer from Novartis.

MELLODDY is a unique project that brings together the structure-activity data of ten pharma companies in a partnership to build machine learning models that accelerate R&D without compromising intellectual property. The idea is based on the underlying concept of “big data”: if we can train models on larger volumes of data, the resulting models can become more complex and achieve higher predictivity. To protect each company’s intellectual property, the approach is “federated”, meaning the data is not collected into a central repository but remains under the control of the partner owning it. The platform is also set up to preserve the privacy of each partner, so that no partner knows which data the other companies are sharing or exactly how much data each partner contributes (see here for more details).

Since the project’s kick-off in June 2019, the ten partners have brought together an unprecedented volume of data and have executed a first federated machine-learning run. We knew from the beginning that in machine learning, not only the quantity of data matters but also its quality. This can be compared to preparing a special dish from the best recipe: in each kitchen, the ingredients need to be cut, chopped, or diced in the right way for the dish to succeed, and the quality of the ingredients, as well as careful adherence to the preparation steps, makes all the difference.


Likewise, for machine learning, correct data preparation is a key step for success. Therefore, in addition to building the secure platform for machine learning, we have to ensure that all partners follow the same rules when selecting the data to include and deciding how to present it to the machine-learning algorithm. Two opposing principles are at play here. On one hand, the data is confidential: no partner can control how the others execute the data preparation, so the partners must trust each other. On the other hand, we want to ensure common standards for data preparation, so that the data is presented in a consistent way and unintentional mistakes are avoided as far as possible.

So, what can we do to address this? First, we must write down the “recipe” in a “cookbook”: the more clearly and precisely it is written, the more likely we are to get the same quality of results from each pharma partner’s data “kitchen”.


And we can go one step further: we can share the tools used for preparing large sets of data. Open source software toolkits for data processing and cheminformatics, such as RDKit, are nowadays mature enough to let us build a complete data preparation workflow on top of them. This has the advantage that we can co-develop and share the scripts for data pre-processing among the pharma partners without the restrictions that commercial licenses typically entail. At the same time, we can share our data preparation tools with the entire scientific community as open source code: MELLODDY-Tuner.
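To make this concrete, here is a minimal sketch of the kind of structure standardization step such a workflow has to perform, written with RDKit. The specific normalization choices and fingerprint parameters below are illustrative assumptions, not the actual MELLODDY-Tuner settings:

```python
# Illustrative standardization sketch; not the actual MELLODDY-Tuner rules.
from typing import Optional

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize_smiles(smiles: str) -> Optional[str]:
    """Return a canonical SMILES for the largest, neutralized fragment."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # unparseable structures are dropped
    mol = rdMolStandardize.Cleanup(mol)                          # normalize functional groups
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)  # strip salts / solvents
    mol = rdMolStandardize.Uncharger().uncharge(mol)             # neutralize charges
    # Canonical SMILES: the same compound yields the same string everywhere.
    return Chem.MolToSmiles(mol)


def ecfp_features(smiles: str, radius: int = 2, n_bits: int = 2048):
    """Morgan (ECFP-like) bit fingerprint used as machine learning input."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
```

Because every step is deterministic, two partners who happen to hold the same compound will, without knowing it, feed identical descriptors to the model.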

We have now addressed the question of how to ensure that every partner follows the recipe, but there is another, equally important question to discuss: what is the right recipe in the first place? There are established best practices for some “bread and butter” procedures, and if ten pharma companies combine their expertise, these can be identified quickly. However, for some procedures the answer is less obvious, for example:

  • In addition to concentration-response data, there is data from high-throughput screening, which is measured at a single concentration. Can this single-concentration assay data help to boost the performance of predictions on the concentration-response assays?

  • Is it worth pursuing regression models that predict the assay outcome quantitatively, or should we limit ourselves to classification models, which give a binary prediction (active/inactive)? The sketch after this list illustrates the difference in data preparation.
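To illustrate the second question, the snippet below shows one common way to derive binary class labels from a quantitative readout. The pIC50 threshold of 6.5 and the margin around it are assumed example values, not project-agreed cutoffs:

```python
import numpy as np


def binarize_pic50(pic50: np.ndarray, threshold: float = 6.5,
                   margin: float = 0.5) -> np.ndarray:
    """Map quantitative pIC50 values to class labels.

    Returns +1 (active) above threshold + margin, -1 (inactive) below
    threshold - margin, and 0 for borderline values, which are masked
    out so the classifier never trains on ambiguous measurements.
    A regression model would instead consume the pic50 array directly.
    """
    labels = np.zeros_like(pic50, dtype=int)
    labels[pic50 >= threshold + margin] = 1
    labels[pic50 <= threshold - margin] = -1
    return labels


# Example: only clearly active/inactive compounds receive a label.
print(binarize_pic50(np.array([7.5, 6.6, 5.2])))  # -> [ 1  0 -1]
```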

Ideally, we would answer these questions by conducting comparison studies during the federated machine learning runs to identify the optimal procedure. Unfortunately, in practice, this would require too many computational resources. This leaves us with trying out the different “recipes” locally, with each pharma company using its own data. For this reason, it is critical that the “recipes”, i.e. the data science experiments, are translated into computer-executable code. Such code allows each partner to run the same experiment on its own data in exactly the same way as the original experiment devised by another partner, making it possible to compare results across all partners. If an outcome can be reproduced across the datasets of several pharma partners, then we can expect it to also hold for the federated machine learning process.
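As a small illustration of what “the same experiment in exactly the same way” can mean, the sketch below derives cross-validation fold assignments deterministically from the compound structure itself, so a shared script behaves identically at every partner without any data leaving the company. The hashing scheme is a simplified stand-in, not the fold-splitting method actually used in the project:

```python
import hashlib


def assign_fold(canonical_smiles: str, n_folds: int = 5) -> int:
    """Deterministic fold index derived from a canonical SMILES string."""
    digest = hashlib.sha256(canonical_smiles.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


# The same compound lands in the same fold at every partner, so the
# shared experiment protocol is executed identically everywhere.
print(assign_fold("CCO"))  # ethanol always maps to the same fold
```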

Stay tuned for the next course.