This repository contains the official code and data to reproduce the findings of the paper "Overrepresentation bias leads to performance overestimation in blood-brain barrier permeability prediction models: characterization and mitigation".
Recent advancements in blood-brain barrier permeability (BBBP) prediction of drug compounds have highlighted the growing role of machine learning, particularly deep learning. While considerable attention has been given to feature engineering and model design, model evaluation often receives insufficient attention despite its fundamental role in model credibility. In this work, we study a phenomenon we term overrepresentation bias, which can arise in drug property databases and is characterized by the presence of near-identical compounds with the same or nearly identical property values. Our findings reveal that overrepresentation bias leads to overly optimistic performance estimates in BBBP prediction models, inflating test evaluation metrics by 13.7% on average for the area under the curve and 16.64% on average for the F1-score. To address this bias, we propose (i) an automatic detection algorithm and (ii) a bias-aware data handling procedure. We recommend adopting this approach to ensure more reliable model evaluations. Given that overrepresentation bias can affect performance estimation more than feature selection, model architecture, or even training data, we urge both the academic and industrial communities to acknowledge its significance and take proactive measures to identify and address this bias in future studies.
The repository is organized to separate the code, data, and configuration files.
```
bbbp-overrepresentation-bias/
│
├── Code/
│   ├── ThresholdCalculation/
│   │   ├── main_thrcal.py        # Main script to run the threshold calculation
│   │   └── functions_thrcal.py   # Core functions for Beta mixture model estimation
│   │
│   └── OverrepresentationAwareDataSplitting/
│       ├── main_ovrsplit.py      # Main script to perform the data splitting
│       └── functions_ovrsplit.py # Helper functions for splitting and filtering
│
├── Data/
│   ├── original_data.csv         # Source data before processing
│   └── prepared_data.csv         # Processed and cleaned data
│
├── LICENSE                       # Apache 2.0 License file
├── README.md                     # This file
└── requirements.txt              # Python dependencies
```
Our approach is divided into two main parts: first, a data-driven method to define a similarity threshold for "approximate collisions," and second, a rigorous data splitting procedure that uses this threshold to create bias-aware evaluation sets.
We developed an automated procedure to define a Jaccard distance threshold, below which two compounds are considered near-duplicates (approximate collisions).
- **Nearest-Neighbor Distances:** For each compound in the training set (after removing exact ECFP collisions), we calculate the Jaccard distance to its closest neighbor.
- **Mixture Modeling:** We model the empirical distribution of these nearest-neighbor distances using four candidate Beta mixture models:
  - A single Beta distribution.
  - A two-component Beta mixture (left-heavy).
  - A two-component Beta mixture (right-heavy), which explicitly models a cluster of near-duplicates.
  - A three-component Beta mixture, which models near-duplicates, standard compounds, and outliers.
- **Penalized Expectation-Maximization (EM):** The model parameters are estimated using a penalized EM algorithm. The objective function includes two soft constraints to ensure the components are well separated and interpretable.
- **Model Selection:** The best-fitting model is selected based on the Bayesian Information Criterion (BIC).
- **Threshold Determination:** If the selected model identifies a distinct cluster of near-duplicates (i.e., the right-heavy two-component or the three-component model), the threshold is set as the Jaccard distance where the densities of the near-duplicate component and the standard component intersect.
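The full penalized EM lives in `functions_thrcal.py`; the sketch below is only a minimal, unpenalized illustration of the same idea for the two-component case. It assumes nearest-neighbor distances are already available as a NumPy array, uses a weighted method-of-moments M-step (a simplification of the exact maximum-likelihood step), and fits synthetic data that is not the paper's.

```python
import numpy as np
from scipy.stats import beta
from scipy.optimize import brentq

def fit_beta_mixture(x, n_components=2, n_iter=300):
    """EM for a Beta mixture on data in (0, 1); method-of-moments M-step."""
    x = np.clip(np.asarray(x, dtype=float), 1e-6, 1 - 1e-6)

    def moments_to_beta(m, v):
        # Convert a (mean, variance) pair to Beta shape parameters.
        c = max(m * (1 - m) / max(v, 1e-9) - 1, 1e-3)
        return max(m * c, 1e-2), max((1 - m) * c, 1e-2)

    # Initialise with a hard quantile split of the data.
    cuts = np.quantile(x, np.linspace(0, 1, n_components + 1))
    labels = np.clip(np.searchsorted(cuts[1:-1], x), 0, n_components - 1)
    weights = np.full(n_components, 1.0 / n_components)
    params = [moments_to_beta(x[labels == k].mean(), x[labels == k].var())
              for k in range(n_components)]

    for _ in range(n_iter):
        dens = np.stack([w * beta.pdf(x, a, b)
                         for w, (a, b) in zip(weights, params)]) + 1e-300
        resp = dens / dens.sum(axis=0)          # E-step: responsibilities
        weights = resp.mean(axis=1)             # M-step: mixing weights
        params = []
        for r in resp:                          # M-step: per-component shapes
            m = (r * x).sum() / r.sum()
            v = (r * (x - m) ** 2).sum() / r.sum()
            params.append(moments_to_beta(m, v))

    loglik = np.log(np.stack([w * beta.pdf(x, a, b)
                              for w, (a, b) in zip(weights, params)]
                             ).sum(axis=0) + 1e-300).sum()
    return weights, params, loglik

def bic(loglik, n_free_params, n_obs):
    # A K-component Beta mixture has 2K shape parameters plus K-1 free weights.
    return n_free_params * np.log(n_obs) - 2 * loglik

def crossing_threshold(weights, params):
    """Distance at which the two weighted component densities intersect."""
    order = np.argsort([a / (a + b) for a, b in params])  # leftmost = near-duplicates
    (w0, (a0, b0)), (w1, (a1, b1)) = [(weights[k], params[k]) for k in order]
    f = lambda t: w0 * beta.pdf(t, a0, b0) - w1 * beta.pdf(t, a1, b1)
    return brentq(f, a0 / (a0 + b0), a1 / (a1 + b1))  # root between component means

# Synthetic bimodal nearest-neighbor distances: a small near-duplicate cluster
# near zero plus a standard cluster (illustrative data only).
rng = np.random.default_rng(0)
x = np.concatenate([rng.beta(2, 50, 300), rng.beta(10, 4, 700)])
weights, params, loglik = fit_beta_mixture(x)
tau = crossing_threshold(x is None or weights, params) if False else crossing_threshold(weights, params)
```

Comparing `bic(...)` across candidate fits mirrors the model-selection step; the crossing point plays the role of the near-duplicate threshold when the left component is well separated.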
To quantify the impact of overrepresentation bias, we designed a multi-stage splitting protocol that generates three parallel train/test dataset pairs, each with a stricter level of filtering.
- **Stage 1: Initial Stratified Split (`InChI` sets)**
  - The dataset is split into training (75%) and testing (25%) sets using stratification to preserve class balance.
- **Stage 2: Filtering Exact Collisions (`exact` sets)**
  - Compounds with identical ECFP fingerprints but different InChI keys are removed.
  - Internal duplicates are removed from the train and test sets.
  - Any compound in the test set that is an exact collision of a compound in the training set is removed.
- **Stage 3: Filtering Approximate Collisions (`exact_approximate` sets)**
  - Using the automatically calculated threshold $\tau$, approximate collisions are removed.
  - Internal near-duplicates are removed from the train and test sets.
  - Any compound in the test set that is an approximate collision of a compound in the training set is removed.
- **Stage 4: Harmonizing Test Sets**
  - To ensure a fair comparison, the `InChI` and `exact` test sets are down-sampled via stratified sampling to match the final size and class distribution of the smallest test set (`exact_approximate`).

This process results in three data versions (`InChI`, `exact`, and `exact_approximate`), each containing a training set, a test set, and cross-validation folds derived from the training data.
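The train-versus-test leakage filter at the heart of Stages 2 and 3 can be sketched as follows. The function name, the toy distance matrix, and the index layout are illustrative assumptions, not the repository's actual `functions_ovrsplit.py` implementation.

```python
import numpy as np

def filter_test_collisions(distances, train_idx, test_idx, tau):
    """Drop test compounds lying within Jaccard distance tau of any
    training compound (tau is the approximate-collision threshold)."""
    sub = distances[np.ix_(test_idx, train_idx)]   # test-vs-train block
    keep = sub.min(axis=1) >= tau                  # closest training neighbor
    return [t for t, k in zip(test_idx, keep) if k]

# Toy 4-compound distance matrix: compound 2 is a near-duplicate of compound 0.
distances = np.array([[0.00, 0.80, 0.03, 0.70],
                      [0.80, 0.00, 0.75, 0.40],
                      [0.03, 0.75, 0.00, 0.65],
                      [0.70, 0.40, 0.65, 0.00]])
kept = filter_test_collisions(distances, train_idx=[0, 1], test_idx=[2, 3], tau=0.062)
# Compound 2 (distance 0.03 to training compound 0) is removed; compound 3 survives.
```

The same routine with a stricter criterion (distance exactly zero) corresponds to the exact-collision filter of Stage 2.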
As stated in the paper, "The source data, along with the processed data presented in later sections of this study, can be found at our GitHub repository."
The `Data/` folder contains two key files:

- `original_data.csv`: The raw source data used for the study.
- `prepared_data.csv`: The data after initial cleaning, feature extraction, and preparation, which serves as the input for the methods described here.
The scripts in this repository may require further processing of these files (e.g., generating distance matrices and saving them as pickled objects) before execution.
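For example, a pairwise Jaccard distance matrix can be built from binary ECFP fingerprints and pickled roughly as follows. The function, the toy fingerprints, and the output filename are illustrative assumptions; the repository does not prescribe this exact preprocessing code.

```python
import numpy as np
import pickle

def jaccard_distance_matrix(fps):
    """Pairwise Jaccard distances between rows of a binary fingerprint array."""
    fps = np.asarray(fps, dtype=bool)
    inter = fps.astype(np.int64) @ fps.T.astype(np.int64)  # |A ∩ B|
    on_bits = fps.sum(axis=1)
    union = on_bits[:, None] + on_bits[None, :] - inter    # |A ∪ B|
    # np.maximum guards against division by zero for all-zero fingerprints.
    return 1.0 - inter / np.maximum(union, 1)

# Toy 4-bit fingerprints: rows 0 and 1 are identical, row 2 shares no bits.
fps = np.array([[1, 1, 0, 0],
                [1, 1, 0, 0],
                [0, 0, 1, 1]])
distances = jaccard_distance_matrix(fps)

# Save the matrix as a pickled object, the format the splitting script consumes.
with open('distances_matrix.pkl', 'wb') as fh:
    pickle.dump(distances, fh)
```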
You will need Python 3.8+. You can install all required libraries from the `requirements.txt` file:

```bash
pip install -r requirements.txt
```

This step fits the Beta mixture models to the nearest-neighbor distance distribution to find the optimal threshold $\tau$.
- **Configure the script:** Open `Code/ThresholdCalculation/main_thrcal.py` and update the `CONFIG` dictionary to point to your specific input file (which you may need to generate from `prepared_data.csv`):

```python
CONFIG = {
    # ...
    'data_directory': 'path/to/your/processed_data/',
    'filename': 'your_nearest_neighbor_distances.pkl',  # <-- UPDATE THIS
    # ...
}
```
- **Run the script:**

```bash
python Code/ThresholdCalculation/main_thrcal.py
```
The script will print the BIC scores for all candidate models, identify the best model, calculate the threshold, and display plots of the fitted densities.
This step uses the threshold from the previous part to execute the multi-stage data splitting protocol.
- **Configure the script:** Open `Code/OverrepresentationAwareDataSplitting/main_ovrsplit.py` and ensure the global constant `THRESHOLD_APPROXIMATE_COLLISIONS` matches the value you calculated:

```python
# ...
INCHI_COLUMN: Final[str] = 'ICI'
THRESHOLD_APPROXIMATE_COLLISIONS: Final[float] = 0.062  # <-- VERIFY OR UPDATE THIS
```
- **Run the script from the command line:** This script takes the paths to your prepared dataset and the corresponding distance matrix as arguments:

```bash
python Code/OverrepresentationAwareDataSplitting/main_ovrsplit.py \
    --data_path path/to/your/prepared_dataset.pkl \
    --distances_path path/to/your/distances_matrix.pkl
```
The script will execute the entire splitting and filtering pipeline and print a confirmation upon completion. The final output is a nested dictionary containing all the generated data splits.
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
This work was conducted at the Biomedical Data Science Lab (BDSLab) at the ITACA Institute, Universitat Politècnica de València (UPV), Spain.
- **Pablo Ferri** (Main Developer)
  - Email: pabferb2@upv.es
- **Juan M. García-Gómez**
  - Email: juanmig@ibime.upv.es

For any questions regarding the code or the paper, please feel free to contact Pablo Ferri.