AI-aided drug discovery (AIDD) is gaining increasing popularity due to its promise of making the search for new pharmaceuticals quicker, cheaper and more efficient. Inspite of its extensive use in many fields, such as ADMET prediction, virtual screening, protein folding and generative chemistry, little has been explored in terms of the out-of-distribution (OOD) learning problem with noise, which is inevitable in real world AIDD applications.
In this work, we present DrugOOD, a systematic OOD dataset curator and benchmark for AI-aided drug discovery, which comes with an open-source Python package that fully automates the data curation and OOD benchmarking processes. We focus on one of the most crucial problems in AIDD: drug target binding affinity prediction, which involves both macromolecule (protein target) and small-molecule (drug compound). In contrast to only providing fixed datasets, DrugOOD offers automated dataset curator with user-friendly customization scripts, rich domain annotations aligned with biochemistry knowledge, realistic noise annotations and rigorous benchmarking of state-of-the-art OOD algorithms. Since the molecular data is often modeled as irregular graphs using graph neural network (GNN) backbones, DrugOOD also serves as a valuable testbed for graph OOD learning problems.
DrugOOD provides large-scale, realistic, and diverse datasets for Drug AI OOD research. Specifically, DrugOOD focuses on the problem of domain generalization, in which we train and test the model on disjoint domains, e.g., molecules in a new assay environment. Top Left: Based on the ChEMBL database, we present an automated dataset curator for customizing OOD datasets flexibly. Top Right: DrugOOD releases realized exemplar datasets spanning different domain shifts. In each dataset, each data sample (x, y, d) is associated with a domain annotation d. We use the background colours lightblue and lightgreen to denote the seen data and unseen test data. Bottom: Examples with different noise levels from the DrugOOD dataset. DrugOOD identifies and annotates three noise levels (left to right: core, refined, general) according to several criteria, and as the level increases, data volume increases and more noisy sources are involved.
We construct all the datasets based on ChEMBL, which is a large-scale, open-access drug discovery database that aims to capture medicinal chemistry dataand knowledge across the pharmaceutical research and development process. We use thelatest release in the SQLite format: ChEMBL 29. Moreover, we consider the setting of OOD and different noise levels, which is an inevitable problem when the machine learning model is applied to the drug development process. For example, when predicting SBAP bioactivity in practice, the target protein used in the model inference could be very different from that in the training set and even does not belong to the same protein family. The real-world domain gap will invoke challenges to the accuracy of the model. On the other hand, the data used in the wild often have various kinds of noise, e.g. activities measured through experiments often have different confidence levels and different “cut-off” noise. Therefore, it is necessary to construct data sets with varying levels of noise inorder to better align with the real scenarios.
Overview of the automated dataset curator is shown above. We mainly implement three major steps based on the ChEMBL data source: noise filtering, uncertainty processing, and domain splitting. We have built-in 96 configuration files to generate the realized datasets with the configuration of two tasks, three noise levels, four measurement types, and five domains.
Moreover, DrugOOD provides a user-friendly, customizable data curation process that allows users to regenerate new datasets by simply modifying parameters in the config file. These datasets can take advantage of the huge amount of data available on the inventory website ChEMBL