Behind the curtains: how multi-task learning is used for drug discovery.

This article has been written by Jaak Simm and Adam Arany from Katholieke Universiteit Leuven


Multi-task learning for drug discovery

One of the pillars of modern drug discovery is data-driven bioactivity modelling, because it allows researchers to select potent and safe drug candidates. Additionally, even for a complicated disease where the exact mechanism is unknown, researchers can use bioactivity models to identify causally related pathways and proteins.

Even though data-driven bioactivity models rely on cutting-edge machine learning and deep learning techniques, the key bottleneck is the lack of high-quality activity data. To tackle this challenge, machine learning research has drawn inspiration from human learning. Humans often learn a new task by transferring existing knowledge from similar problems. For example, imagine learning Italian while already knowing Spanish, which shares both vocabulary and grammatical structures: one can speed up the learning process by adapting one's existing knowledge of Spanish.

In the field of machine learning, techniques that learn several tasks together are called multi-task learning methods. In drug discovery, one is interested in modelling compound bioactivity on different proteins and phenotypes. There is thus great value in jointly modelling multiple proteins and phenotypes with multi-task learning, because many proteins share evolutionary history through gene duplication (i.e., they are paralogs) and therefore tend to bind similar compounds.
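To make the idea concrete, here is a minimal sketch of a multi-task network in plain NumPy (an illustration, not the MELLODDY or SparseChem code): a shared trunk turns a compound fingerprint into a hidden representation, and each protein or phenotype task reads its own prediction from a separate head. All sizes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: fingerprint -> shared hidden layer -> per-task heads.
n_features, n_hidden, n_tasks = 32, 16, 3

W_shared = rng.normal(size=(n_features, n_hidden)) * 0.1  # trunk, shared by all tasks
W_heads = rng.normal(size=(n_tasks, n_hidden)) * 0.1      # one output head per task

def predict(x):
    """One shared hidden representation feeds every task head."""
    h = np.maximum(x @ W_shared, 0.0)     # ReLU trunk
    logits = W_heads @ h                  # one activity logit per protein/phenotype
    return 1.0 / (1.0 + np.exp(-logits))  # per-task activity probabilities

x = rng.normal(size=n_features)  # stand-in for a compound fingerprint
probs = predict(x)
print(probs.shape)  # (3,)
```

Because the trunk is shared, every task's gradient updates the same representation, which is how scarce tasks borrow statistical strength from data-rich ones.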

MELLODDY: multi-task learning across many pharmaceutical companies

The goal of the MELLODDY project is to take this one step further and allow several pharmaceutical partners to perform multi-task learning across each other's data sets. This is depicted in the figure below: each partner shares part of the model while keeping its biological targets private, so the tasks of each company are not shared directly with the other companies.


MELLODDY-ML-model.png

MELLODDY enables joint modelling of tasks from multiple companies without explicitly sharing the underlying raw training data. In other words, the compound structures and bioactivity values are kept private. To this end, we exploit the latest machine learning techniques as well as advanced cryptographic technologies for privacy.
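As a rough, purely illustrative sketch of the idea (the actual MELLODDY protocol relies on secure aggregation through Substra and is considerably more involved): only the shared part of the model is aggregated across partners, while each company's task heads never leave its own site.

```python
import numpy as np

rng = np.random.default_rng(1)
n_partners, n_features, n_hidden = 3, 8, 4

# Each partner holds a local copy of the shared trunk plus private task heads.
local_trunks = [rng.normal(size=(n_features, n_hidden)) for _ in range(n_partners)]
local_heads = [rng.normal(size=(n_hidden,)) for _ in range(n_partners)]  # stay private

# One aggregation round: average the shared trunks only.
global_trunk = sum(local_trunks) / n_partners

# Partners continue training from the averaged trunk; heads are untouched.
local_trunks = [global_trunk.copy() for _ in range(n_partners)]
```

In the real system the aggregation itself is performed under cryptographic protection, so no partner ever sees another partner's individual model update in the clear.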

Open-source collaboration

Our team at KU Leuven strongly believes in the value of open-source collaboration, enabling tight cooperation between academia and industry. We have therefore released SparseChem, the underlying deep learning tool for training bioactivity and toxicity models, under an open-source license. The package allows one to train industry-scale models on millions of compounds with high computational efficiency, even when very high-dimensional compound features are used as input. For instance, it is possible to train models with million-dimensional sparse input features, as is often the case for unfolded chemical fingerprints such as ECFPs.
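To illustrate why sparsity matters (a sketch in plain NumPy, not SparseChem's actual implementation): an unfolded ECFP stores only the indices of its active bits, so a linear layer over a million-dimensional input reduces to a lookup and sum over a handful of positions instead of a full dense product.

```python
import numpy as np

n_features = 1_000_000  # unfolded fingerprint dimension

# Hypothetical compound: only a handful of fingerprint bits are "on".
active_bits = np.array([17, 4_032, 250_101, 999_873])

rng = np.random.default_rng(2)
w = rng.normal(size=n_features)  # dense weight vector of a linear layer

# Sparse dot product: sum the weights only at the active positions.
activation = w[active_bits].sum()

# Equivalent dense computation, for comparison.
x_dense = np.zeros(n_features)
x_dense[active_bits] = 1.0
assert np.isclose(activation, x_dense @ w)
```

The sparse path touches 4 weights instead of a million, which is what makes million-dimensional inputs practical at industrial scale.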
SparseChem is available on GitHub and supports training both classification and regression models. For classification, many well-established metrics are reported, including AUC-PR, AUC-ROC, F1, and Kappa. For regression, we report R-squared, RMSE, and the correlation coefficient. Additionally, the package supports censored regression, a very common setting in pharmaceutical bioactivity data, where many values are reported with "less than" or "greater than" qualifiers.
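A minimal sketch of the censored-regression idea (a toy loss for illustration, not SparseChem's exact implementation): a "less than" observation only contributes loss when the prediction overshoots the bound, and symmetrically for "greater than".

```python
def censored_mse(pred, value, qualifier):
    """Toy censored squared error.

    qualifier: "=" exact measurement, "<" value is an upper bound,
    ">" value is a lower bound. Censored observations are penalised
    only when the prediction falls on the wrong side of the bound.
    """
    if qualifier == "=":
        return (pred - value) ** 2
    if qualifier == "<":  # true activity lies below `value`
        return max(pred - value, 0.0) ** 2
    if qualifier == ">":  # true activity lies above `value`
        return max(value - pred, 0.0) ** 2
    raise ValueError(f"unknown qualifier: {qualifier!r}")

# A prediction below a "<" bound incurs no loss; above it, it does.
assert censored_mse(4.0, 5.0, "<") == 0.0
assert censored_mse(6.0, 5.0, "<") == 1.0
assert censored_mse(6.0, 5.0, ">") == 0.0
```

This lets assay readouts that only bound the activity (e.g. "IC50 > 10 µM") still contribute a useful training signal instead of being discarded.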

We are also excited to integrate SparseChem into the open-source MELLODDY stack, which is built on Substra and Kubernetes.

Upcoming data sources

Going forward, we are looking into exploiting auxiliary data sources to improve the main bioactivity tasks through the multi-task learning effect. The example sources we are currently considering are:

  • High-content single-cell imaging data, which has been shown to contain a strong signal for protein bioactivity.

  • High-throughput screening data containing hundreds of thousands of data points.


Another exciting development is investigating how far the improvements of the learned model reach in chemical space. This reach is known as the applicability domain of a model. On the one hand, we expect multi-pharma joint training to improve classical performance metrics such as AUC-ROC. On the other hand, joint training should also significantly widen the applicability domain.
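One simple and widely used proxy for the applicability domain (a sketch of the general idea, not the method used in MELLODDY) is to measure how similar a query compound is to its nearest training compound, for example via Tanimoto similarity on fingerprint bits, and to flag predictions below a similarity threshold as out of domain. The threshold of 0.5 below is hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of active bits."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

# Hypothetical training compounds and a query compound, as active-bit sets.
train = [{1, 4, 9, 12}, {2, 4, 7}, {5, 9, 11, 13}]
query = {1, 4, 9}

# Proxy: similarity to the nearest training compound.
nearest = max(tanimoto(query, t) for t in train)  # 0.75 here
in_domain = nearest >= 0.5  # hypothetical threshold
```

Under this proxy, joint multi-pharma training widens the applicability domain simply because the pooled training data covers more of chemical space, so more queries find a close neighbour.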