Graph machine learning can estimate drug concentrations in whole blood from forensic screening results
This repository implements a chemistry-informed Graph Neural Network (GNN) that predicts the LC-HRMS signal-to-concentration ratio for drugs in whole blood, trained on a library of 191 different molecules. The data is included in the notebook and can also be accessed on .
The GNN model is directly inspired by TChemGNN. Molecules are converted from SMILES into graphs where each atom node carries rich structural information (e.g., aromaticity, charge, valence, hybridization, mass-based descriptors), and each node is additionally augmented with global geometry features (molecular volume, length, width, and height) to give the model full-molecule context beyond connectivity. A multi-layer Graph Attention Network (GAT) then learns both local substructure effects and broader molecular shape.
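Below is a minimal sketch of this graph-construction step, assuming RDKit and PyTorch Geometric. The feature list and the `mol_to_graph` helper are illustrative; the exact descriptor set and ordering live in the notebook, and the global geometry values are assumed to be computed elsewhere and passed in.

```python
import torch
from rdkit import Chem
from torch_geometric.data import Data

def mol_to_graph(smiles, global_feats):
    """Build a PyG graph: per-atom descriptors plus broadcast global features.

    global_feats: molecule-level values (e.g. volume, length, width, height)
    appended to every node so each atom carries full-molecule context.
    """
    mol = Chem.MolFromSmiles(smiles)
    node_feats = [
        [
            atom.GetAtomicNum(),
            atom.GetFormalCharge(),
            atom.GetTotalValence(),
            atom.GetTotalNumHs(),
            atom.GetDegree(),
            float(atom.GetIsAromatic()),
            float(atom.IsInRing()),
            float(int(atom.GetHybridization())),  # hybridization enum as a number
            atom.GetMass() / 100.0,               # simple mass-based scaling
            *global_feats,                        # broadcast global geometry features
        ]
        for atom in mol.GetAtoms()
    ]
    edges = []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edges += [[i, j], [j, i]]  # undirected graph: add both directions
    return Data(
        x=torch.tensor(node_feats, dtype=torch.float),
        edge_index=torch.tensor(edges, dtype=torch.long).t().contiguous(),
    )
```

For example, `mol_to_graph("CCO", [53.4, 4.1, 2.5, 2.3])` would yield a three-node graph for ethanol with the four geometry values repeated on every node (the numbers here are placeholders, not measured values).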
The workflow includes graph construction, feature assembly, and a leave-one-out cross-validation (LOOCV) training strategy suited to our small chemical dataset.
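The LOOCV loop itself can be expressed in a few lines. The following is a hypothetical outline in which `graphs`, `targets`, and `build_model` stand in for objects defined in the notebook, and the optimizer, epoch count, and learning rate are illustrative defaults.

```python
import torch
from sklearn.model_selection import LeaveOneOut

def run_loocv(graphs, targets, build_model, epochs=200, lr=1e-3):
    """Train a fresh model per fold; each molecule is held out exactly once."""
    predictions = []
    for train_idx, test_idx in LeaveOneOut().split(graphs):
        model = build_model()  # re-initialize so no fold leaks into another
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        model.train()
        for _ in range(epochs):
            for i in train_idx:
                optimizer.zero_grad()
                loss = torch.nn.functional.mse_loss(model(graphs[i]), targets[i])
                loss.backward()
                optimizer.step()
        model.eval()
        with torch.no_grad():
            predictions.append(model(graphs[test_idx[0]]).item())
    return predictions  # one held-out prediction per molecule
```

Retraining from scratch on each of the 191 folds is affordable at this dataset size and gives an honest estimate of how the model generalizes to an unseen molecule.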
The code reproduces the experiments in the paper "Graph machine learning can estimate drug concentrations in whole blood from forensic screening results" available on ChemRxiv and under review for publication.
The notebook in this repository runs the training of the GNN model to reproduce the results in the publication. It can be run in Google Colab or on a standalone computer, but a GPU is highly recommended for faster training of the GNN.
The work “Efficient Learning of Molecular Properties Using Graph Neural Networks Enhanced with Chemistry Knowledge” by the same authors demonstrates that combining classical chemical insight with graph neural networks can substantially improve molecular property prediction.
In our project (our own LC–MS LOD library of 191 molecules combined with a GNN), we draw strong inspiration from these ideas and integrate them into a specialized LC–MS context:
We encode atom-level descriptors (aromaticity, ring membership, degree, valence, formal charge, atomic number, hybridization, hydrogen count, mass-based scaling, etc.), capturing electronic, steric, and topological aspects. This mirrors and extends the philosophy of combining local and global chemical features.
As in Efficient-ChemGNN, we acknowledge that bonds alone may not capture all relevant chemical context for LC–MS signal behavior; we therefore allow the inclusion of molecular-level descriptors such as volume, width, length, and height (when available). This helps the GNN to “see” beyond connectivity and gain structural context relevant for ionization or fragmentation in LC–MS.
Our use of GAT layers aligns with the attention-based message-passing architecture favored in GNN chemical modeling (see the model sketch after this list). Attention allows the network to weigh different atoms and substructures differently, analogous to how certain functional groups or atom environments contribute more strongly to the LC–MS response.
Our model is built to work with limited data, leveraging chemistry-informed features and architecture choices to maximize predictive power despite the small sample size.
Unlike many generic molecular-property predictors, our target is the signal-to-concentration behavior in LC–MS, measured on a new, original dataset.
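As referenced above, the following is an illustrative PyTorch Geometric sketch of such a GAT-based regressor; the number of layers, hidden width, and attention-head count are assumptions for the sketch, not the published configuration.

```python
import torch
from torch_geometric.nn import GATConv, global_mean_pool

class DrugGAT(torch.nn.Module):
    """Two GAT layers followed by mean pooling and a linear readout."""

    def __init__(self, in_dim, hidden=64, heads=4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, heads=heads)      # multi-head attention
        self.gat2 = GATConv(hidden * heads, hidden, heads=1)  # consolidate heads
        self.readout = torch.nn.Linear(hidden, 1)             # predicted ratio

    def forward(self, data):
        x = torch.relu(self.gat1(data.x, data.edge_index))
        x = torch.relu(self.gat2(x, data.edge_index))
        # Pool atom embeddings into one molecule-level vector, then regress
        batch = getattr(data, "batch", None)
        if batch is None:  # single (unbatched) graph
            batch = torch.zeros(x.size(0), dtype=torch.long, device=x.device)
        return self.readout(global_mean_pool(x, batch)).squeeze(-1)
```

Multi-head attention in the first layer lets different heads attend to different atom environments; the single-head second layer consolidates these views before pooling to a molecule-level embedding.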
This code builds on the foundational ideas demonstrated in TChemGNN, applying them to an LC–MS-oriented molecular library. By combining atom-level chemical descriptors, global molecular features, and a GAT-based graph architecture, we aim to deliver a data-efficient, chemically informed, and practically usable GNN-based prediction framework for LC-HRMS concentration estimation.