This repository contains the official code and data to reproduce the findings of the paper "Overrepresentation bias leads to performance overestimation in blood-brain barrier permeability prediction models: characterization and mitigation".
Recent advancements in blood-brain barrier permeability (BBBP) prediction of drug compounds have highlighted the growing role of machine learning, particularly deep learning. While considerable attention has been given to feature engineering and model design, model evaluation often receives insufficient attention despite its fundamental role in model credibility. In this work, we study a phenomenon we term overrepresentation bias, which can arise in drug property databases and is characterized by the presence of near-identical compounds with the same or nearly identical property values. Our findings reveal that overrepresentation bias leads to overly optimistic performance estimates in BBBP prediction models, inflating test evaluation metrics by 13.7% on average for the area under the curve and 16.64% on average for the F1-score. To address this bias, we propose (i) an automatic detection algorithm and (ii) a bias-aware data handling procedure. We recommend adopting this approach to ensure more reliable model evaluations. Given that overrepresentation bias can affect performance estimation more than feature selection, model architecture, or even training data, we urge both the academic and industrial communities to acknowledge its significance and take proactive measures to identify and address this bias in future studies.
The repository is organized to separate the code, data, and configuration files.
```
bbbp-overrepresentation-bias/
│
├── Code/
│   ├── ThresholdCalculation/
│   │   ├── main_thrcal.py        # Main script to run the threshold calculation
│   │   └── functions_thrcal.py   # Core functions for Beta mixture model estimation
│   │
│   └── OverrepresentationAwareDataSplitting/
│       ├── main_ovrsplit.py      # Main script to perform the data splitting
│       └── functions_ovrsplit.py # Helper functions for splitting and filtering
│
├── Data/
│   ├── original_data.csv         # Source data before processing
│   └── prepared_data.csv         # Processed and cleaned data
│
├── LICENSE                       # Apache 2.0 License file
├── README.md                     # This file
└── requirements.txt              # Python dependencies
```
Our approach is divided into two main parts: first, a data-driven method to define a similarity threshold for "approximate collisions," and second, a rigorous data splitting procedure that uses this threshold to create bias-aware evaluation sets.
We developed an automated procedure to define a Jaccard distance threshold, below which two compounds are considered near-duplicates (approximate collisions).
- **Nearest-Neighbor Distances:** For each compound in the training set (after removing exact ECFP collisions), we calculate the Jaccard distance to its closest neighbor.
- **Mixture Modeling:** We model the empirical distribution of these nearest-neighbor distances using four candidate Beta mixture models:
  - A single Beta distribution.
  - A two-component Beta mixture (left-heavy).
  - A two-component Beta mixture (right-heavy), which explicitly models a cluster of near-duplicates.
  - A three-component Beta mixture, which models near-duplicates, standard compounds, and outliers.
- **Penalized Expectation-Maximization (EM):** The model parameters are estimated using a penalized EM algorithm. The objective function includes two soft constraints to ensure the components are well separated and interpretable.
- **Model Selection:** The best-fitting model is selected based on the Bayesian Information Criterion (BIC).
- **Threshold Determination:** If the selected model identifies a distinct cluster of near-duplicates (i.e., the right-heavy two-component or the three-component model), the threshold is set as the Jaccard distance where the densities of the near-duplicate component and the standard component intersect.
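The full penalized EM lives in `functions_thrcal.py`; the sketch below is only a minimal, unpenalized illustration of the same idea for the two-component case. It assumes nearest-neighbor distances are already available as a NumPy array, uses a weighted method-of-moments M-step (a simplification of the exact maximum-likelihood step), and fits synthetic data that is not the paper's.

```python
import numpy as np
from scipy.stats import beta
from scipy.optimize import brentq

def fit_beta_mixture(x, n_components=2, n_iter=300):
    """EM for a Beta mixture on data in (0, 1); method-of-moments M-step."""
    x = np.clip(np.asarray(x, dtype=float), 1e-6, 1 - 1e-6)

    def moments_to_beta(m, v):
        # Convert a (mean, variance) pair to Beta shape parameters.
        c = max(m * (1 - m) / max(v, 1e-9) - 1, 1e-3)
        return max(m * c, 1e-2), max((1 - m) * c, 1e-2)

    # Initialise with a hard quantile split of the data.
    cuts = np.quantile(x, np.linspace(0, 1, n_components + 1))
    labels = np.clip(np.searchsorted(cuts[1:-1], x), 0, n_components - 1)
    weights = np.full(n_components, 1.0 / n_components)
    params = [moments_to_beta(x[labels == k].mean(), x[labels == k].var())
              for k in range(n_components)]

    for _ in range(n_iter):
        dens = np.stack([w * beta.pdf(x, a, b)
                         for w, (a, b) in zip(weights, params)]) + 1e-300
        resp = dens / dens.sum(axis=0)          # E-step: responsibilities
        weights = resp.mean(axis=1)             # M-step: mixing weights
        params = []
        for r in resp:                          # M-step: per-component shapes
            m = (r * x).sum() / r.sum()
            v = (r * (x - m) ** 2).sum() / r.sum()
            params.append(moments_to_beta(m, v))

    loglik = np.log(np.stack([w * beta.pdf(x, a, b)
                              for w, (a, b) in zip(weights, params)]
                             ).sum(axis=0) + 1e-300).sum()
    return weights, params, loglik

def bic(loglik, n_free_params, n_obs):
    # A K-component Beta mixture has 2K shape parameters plus K-1 free weights.
    return n_free_params * np.log(n_obs) - 2 * loglik

def crossing_threshold(weights, params):
    """Distance at which the two weighted component densities intersect."""
    order = np.argsort([a / (a + b) for a, b in params])  # leftmost = near-duplicates
    (w0, (a0, b0)), (w1, (a1, b1)) = [(weights[k], params[k]) for k in order]
    f = lambda t: w0 * beta.pdf(t, a0, b0) - w1 * beta.pdf(t, a1, b1)
    return brentq(f, a0 / (a0 + b0), a1 / (a1 + b1))  # root between component means

# Synthetic bimodal nearest-neighbor distances: a small near-duplicate cluster
# near zero plus a standard cluster (illustrative data only).
rng = np.random.default_rng(0)
x = np.concatenate([rng.beta(2, 50, 300), rng.beta(10, 4, 700)])
weights, params, loglik = fit_beta_mixture(x)
tau = crossing_threshold(x is None or weights, params) if False else crossing_threshold(weights, params)
```

Comparing `bic(...)` across candidate fits mirrors the model-selection step; the crossing point plays the role of the near-duplicate threshold when the left component is well separated.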
To quantify the impact of overrepresentation bias, we designed a multi-stage splitting protocol that generates three parallel train/test dataset pairs, each with a stricter level of filtering.
- **Stage 1: Initial Stratified Split (`InChI` sets)**
  - The dataset is split into training (75%) and testing (25%) sets using stratification to preserve class balance.
- **Stage 2: Filtering Exact Collisions (`exact` sets)**
  - Compounds with identical ECFP fingerprints but different InChI keys are removed.
  - Internal duplicates are removed from the train and test sets.
  - Any compound in the test set that is an exact collision of a compound in the training set is removed.
- **Stage 3: Filtering Approximate Collisions (`exact_approximate` sets)**
  - Using the automatically calculated threshold $\tau$, approximate collisions are removed.
  - Internal near-duplicates are removed from the train and test sets.
  - Any compound in the test set that is an approximate collision of a compound in the training set is removed.
- **Stage 4: Harmonizing Test Sets**
  - To ensure a fair comparison, the `InChI` and `exact` test sets are down-sampled via stratified sampling to match the final size and class distribution of the smallest test set (`exact_approximate`).

This process results in three data versions (`InChI`, `exact`, and `exact_approximate`), each containing a training set, a test set, and cross-validation folds derived from the training data.
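The train-versus-test leakage filter at the heart of Stages 2 and 3 can be sketched as follows. The function name, the toy distance matrix, and the index layout are illustrative assumptions, not the repository's actual `functions_ovrsplit.py` implementation.

```python
import numpy as np

def filter_test_collisions(distances, train_idx, test_idx, tau):
    """Drop test compounds lying within Jaccard distance tau of any
    training compound (tau is the approximate-collision threshold)."""
    sub = distances[np.ix_(test_idx, train_idx)]   # test-vs-train block
    keep = sub.min(axis=1) >= tau                  # closest training neighbor
    return [t for t, k in zip(test_idx, keep) if k]

# Toy 4-compound distance matrix: compound 2 is a near-duplicate of compound 0.
distances = np.array([[0.00, 0.80, 0.03, 0.70],
                      [0.80, 0.00, 0.75, 0.40],
                      [0.03, 0.75, 0.00, 0.65],
                      [0.70, 0.40, 0.65, 0.00]])
kept = filter_test_collisions(distances, train_idx=[0, 1], test_idx=[2, 3], tau=0.062)
# Compound 2 (distance 0.03 to training compound 0) is removed; compound 3 survives.
```

The same routine with a stricter criterion (distance exactly zero) corresponds to the exact-collision filter of Stage 2.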
As stated in the paper, "The source data, along with the processed data presented in later sections of this study, can be found at our GitHub repository."
The `Data/` folder contains two key files:

- `original_data.csv`: The raw source data used for the study.
- `prepared_data.csv`: The data after initial cleaning, feature extraction, and preparation, which serves as the input for the methods described here.
The scripts in this repository may require further processing of these files (e.g., generating distance matrices and saving them as pickled objects) before execution.
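For example, a pairwise Jaccard distance matrix can be built from binary ECFP fingerprints and pickled roughly as follows. The function, the toy fingerprints, and the output filename are illustrative assumptions; the repository does not prescribe this exact preprocessing code.

```python
import numpy as np
import pickle

def jaccard_distance_matrix(fps):
    """Pairwise Jaccard distances between rows of a binary fingerprint array."""
    fps = np.asarray(fps, dtype=bool)
    inter = fps.astype(np.int64) @ fps.T.astype(np.int64)  # |A ∩ B|
    on_bits = fps.sum(axis=1)
    union = on_bits[:, None] + on_bits[None, :] - inter    # |A ∪ B|
    # np.maximum guards against division by zero for all-zero fingerprints.
    return 1.0 - inter / np.maximum(union, 1)

# Toy 4-bit fingerprints: rows 0 and 1 are identical, row 2 shares no bits.
fps = np.array([[1, 1, 0, 0],
                [1, 1, 0, 0],
                [0, 0, 1, 1]])
distances = jaccard_distance_matrix(fps)

# Save the matrix as a pickled object, the format the splitting script consumes.
with open('distances_matrix.pkl', 'wb') as fh:
    pickle.dump(distances, fh)
```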
You will need Python 3.8+. You can install all required libraries from the `requirements.txt` file:

```bash
pip install -r requirements.txt
```

This step fits the Beta mixture models to the nearest-neighbor distance distribution to find the optimal threshold $\tau$.
- **Configure the script:** Open `Code/ThresholdCalculation/main_thrcal.py` and update the `CONFIG` dictionary to point to your specific input file (which you may need to generate from `prepared_data.csv`):

```python
CONFIG = {
    # ...
    'data_directory': 'path/to/your/processed_data/',
    'filename': 'your_nearest_neighbor_distances.pkl',  # <-- UPDATE THIS
    # ...
}
```
- **Run the script:**

```bash
python Code/ThresholdCalculation/main_thrcal.py
```
The script will print the BIC scores for all candidate models, identify the best model, calculate the threshold, and display plots of the fitted densities.
This step uses the threshold from the previous part to execute the multi-stage data splitting protocol.
- **Configure the script:** Open `Code/OverrepresentationAwareDataSplitting/main_ovrsplit.py` and ensure the global constant `THRESHOLD_APPROXIMATE_COLLISIONS` matches the value you calculated:

```python
# ...
INCHI_COLUMN: Final[str] = 'ICI'
THRESHOLD_APPROXIMATE_COLLISIONS: Final[float] = 0.062  # <-- VERIFY OR UPDATE THIS
```
- **Run the script from the command line:** This script takes the paths to your prepared dataset and the corresponding distance matrix as arguments:

```bash
python Code/OverrepresentationAwareDataSplitting/main_ovrsplit.py \
    --data_path path/to/your/prepared_dataset.pkl \
    --distances_path path/to/your/distances_matrix.pkl
```
The script will execute the entire splitting and filtering pipeline and print a confirmation upon completion. The final output is a nested dictionary containing all the generated data splits.
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
This work was conducted at the Biomedical Data Science Lab (BDSLab) at the ITACA Institute, Universitat Politècnica de València (UPV), Spain.
- **Pablo Ferri** (Main Developer)
  - Email: pabferb2@upv.es
- **Juan M. García-Gómez**
  - Email: juanmig@ibime.upv.es

For any questions regarding the code or the paper, please feel free to contact Pablo Ferri.