Skip to content

Adding ligands to dataset processing#78

Closed
vratins wants to merge 9 commits into
mainfrom
dev_ligands
Closed

Adding ligands to dataset processing#78
vratins wants to merge 9 commits into
mainfrom
dev_ligands

Conversation

@vratins

@vratins vratins commented Mar 30, 2026

Copy link
Copy Markdown
Contributor

PR to include ligands in the static structure when encoding proteins. This is a v1 change so tried to keep it as simple of a change as possible.

  • Ligands are included in protein node types with a is_ligand flag.
  • Tried to minimize major codebase refactors in this PR so no new edge types or node-types.
  • Encoded with 1-hot element type
  • Added tests, as well as a new test pdb file which contains ligands (the PDB file is ~3.2k lines hence the huge diff)

Summary by CodeRabbit

  • New Features
    • Added ligand atom extraction and optional ligand node inclusion in dataset preprocessing (include_ligands).
    • Added ligand-aware cached embeddings via a learned ligand projection when ligand nodes are present.
  • Bug Fixes
    • Improved residue pooling to ignore ligand nodes with sentinel residue indices.
    • Cached embedding encoders now validate embedding_dim at construction, avoiding first-forward inference issues.
  • Dependencies
    • Updated PyMOL Open Source package to pymol-open-source-whl>=3.1.0.4.
  • Tests
    • Expanded dataset, parsing, and encoder tests to cover ligand extraction and new encoder behavior.

Copilot AI review requested due to automatic review settings March 30, 2026 20:14
@coderabbitai

coderabbitai Bot commented Mar 30, 2026

Copy link
Copy Markdown

Review Change Stack

Caution

Review failed

Pull request was closed or merged during review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7a6d177f-82a6-4517-83c2-383b7a604696

📥 Commits

Reviewing files that changed from the base of the PR and between c678fc0 and aca3180.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • pyproject.toml
  • scripts/generate_esm_embeddings.py
  • scripts/generate_slae_embeddings.py
  • src/dataset.py
  • src/encoder_base.py
  • src/gvp_encoder.py
  • tests/test_dataset.py
  • tests/test_encoder.py
💤 Files with no reviewable changes (1)
  • pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/test_dataset.py
  • src/dataset.py

📝 Walkthrough

Walkthrough

The pull request extends a protein dataset pipeline to support ligand atoms. Dataset parsing now extracts ligands as a third return value; preprocessing optionally includes ligand coordinates and features with sentinel markers and boolean masks. The embedding encoder adds a learnable ligand projection layer. Embedding scripts, pooling operations, and tests were updated for the 3-tuple return and new ligand behavior.

Changes

Ligand Support in Dataset Pipeline

Layer / File(s) Summary
Dependency Updates
pyproject.toml
Updated pymol-open-source to pymol-open-source-whl>=3.1.0.4.
Ligand Extraction Function
src/dataset.py
Extended parse_asu_with_biotite() to extract non-amino-acid, non-water heavy atoms via ligand mask and return a 3-tuple (protein_atoms, water_atoms, ligand_atoms).
Dataset Constructor & Cache Configuration
src/dataset.py
Added include_ligands: bool = False parameter to ProteinWaterDataset; documentation describes ligand behavior; cache directory suffix appends "_lig" when enabled; state stored for preprocessing.
Preprocessing: Ligand Integration & Caching
src/dataset.py
Preprocessing unpacks 3-tuple from parse_asu_with_biotite(); when include_ligands is enabled and ligands exist, appends ligand coordinates and one-hot element features after protein nodes; sets ligand residue indices to -1 sentinel; creates and caches is_ligand boolean mask.
Data Loading with Backward Compatibility
src/dataset.py
__getitem__ loads is_ligand from cache and attaches it to data["protein"].is_ligand for graph construction.
Encoder Ligand Handling
src/encoder_base.py
CachedEmbeddingEncoder now requires embedding_dim at construction; added learnable ligand_embed projection; forward() replaces cached embeddings for ligand nodes with learned projections when is_ligand mask is present; from_config requires embedding_dim in config.
GVP Residue Pooling Update
src/gvp_encoder.py
Updated _pool_by_residue() to filter out ligand atoms (residue_index < 0) before scatter-based aggregation.
Embedding Script Updates
scripts/generate_esm_embeddings.py, scripts/generate_slae_embeddings.py
Updated to unpack 3-tuple from parse_asu_with_biotite(); SLAE script clarified as legacy in module docstring.
Ligand Parsing Test Coverage
tests/test_dataset.py
Updated existing tests to expect 3-tuple return; added TestLigandParsing validating ligand extraction semantics; added TestLigandNodeIntegration validating dataset behavior with include_ligands parameter.
Integration Test Updates
tests/test_dataset.py
Updated quality-filter and water-filtering integration tests to unpack 3-tuple from parse_asu_with_biotite().
Encoder Test Updates
tests/test_encoder.py
Updated all cached encoder tests to thread embedding_dim through construction; removed assertions that output_dims was inferred after first forward; validated embedding_dim requirement and added parameter-count assertions for ligand projection.

Sequence Diagrams

sequenceDiagram
    participant User
    participant Dataset as ProteinWaterDataset
    participant Parser as parse_asu_with_biotite()
    participant Preprocess
    participant Encoder as CachedEmbeddingEncoder

    User->>Dataset: init(include_ligands=True)
    Dataset->>Parser: parse_asu_with_biotite(pdb_path)
    Parser-->>Dataset: (protein_atoms, water_atoms, ligand_atoms)
    
    Dataset->>Preprocess: build node tensors with is_ligand
    alt include_ligands=True & ligands exist
        Preprocess->>Preprocess: append ligand coordinates & features
        Preprocess->>Preprocess: set residue_index=-1 for ligands
        Preprocess->>Preprocess: create is_ligand boolean mask
    else include_ligands=False
        Preprocess->>Preprocess: is_ligand = all False
    end
    Preprocess-->>Dataset: cached data with is_ligand tensor
    
    User->>Dataset: __getitem__(idx)
    Dataset-->>User: graph with optional ligand nodes & is_ligand mask
    
    User->>Encoder: forward(data)
    alt is_ligand mask present & any True
        Encoder->>Encoder: ligand_embed(protein.x[ligand_mask])
        Encoder->>Encoder: replace cached embeddings for ligand rows
    else no ligand mask
        Encoder->>Encoder: return cached embeddings unchanged
    end
    
    Encoder-->>User: embeddings with learned ligand projections
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The changes introduce new feature logic (ligand extraction and encoding) across multiple interconnected files. Dataset parsing returns an additional value type; preprocessing conditionally appends ligand data with sentinel markers and boolean masking; the encoder adds learnable projection logic with conditional device-aware execution paths. Comprehensive test coverage adds validation complexity across parsing, integration, and encoder layers. The variety of heterogeneous changes across dataset, encoder, pooling, scripts, and tests warrants substantial review effort.

Poem

A ligand hops in, tail held high,
No longer lost, no longer shy,
With is_ligand marks, bright and true,
The encoder learns what ligands do—
One hop, one prep, one learned projection,
Our protein graph finds new direction! 🐰✨

Suggested reviewers

  • marcuscollins
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Adding ligands to dataset processing' accurately captures the main change: extending dataset processing to include ligand atoms alongside proteins and water molecules.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dev_ligands

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@vratins vratins linked an issue Mar 30, 2026 that may be closed by this pull request

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends dataset preprocessing to include non-water ligands in the static structure representation, folding ligand atoms into the existing protein node type and tagging them via an is_ligand boolean mask.

Changes:

  • Update PDB parsing to return (protein_atoms, water_atoms, ligand_atoms) and (optionally) append ligands to protein nodes via include_ligands.
  • Add protein.is_ligand to cached geometry and HeteroData for downstream masking/conditioning.
  • Add integration tests and a new ligand-containing PDB fixture (4h0b) to validate parsing and dataset behavior.

Reviewed changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/dataset.py Adds ligand parsing/output, include_ligands option, and persists is_ligand in caches and HeteroData.
src/encoder_base.py Adds a learned projection for ligand rows when using cached-embedding encoders (ESM/SLAE).
tests/test_dataset.py Expands parsing tests for 3-way partitioning and adds dataset integration tests for ligand inclusion.
tests/conftest.py Adds a pdb_4h0b fixture for ligand test coverage.
pyproject.toml Adds rdkit and swaps the PyMOL package reference to pymol-open-source-whl.
uv.lock Updates locked dependency set to reflect dependency changes (incl. rdkit/jaxtyping/pymol package rename).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/dataset.py
Comment thread src/dataset.py
Comment thread src/dataset.py Outdated
Comment thread pyproject.toml

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/test_dataset.py (1)

566-697: Add one cached-embedding ligand regression test.

These cases only exercise encoder_type="gvp", but the new ligand-specific behavior lives in CachedEmbeddingEncoder.forward() for esm/slae. A tiny fixture-backed test that asserts ligand rows are replaced would cover the path most likely to regress.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_dataset.py` around lines 566 - 697, Add a small integration test
in tests/test_dataset.py that exercises CachedEmbeddingEncoder.forward for
encoder_type="esm" (or "slae") to prevent regressions: construct a
ProteinWaterDataset with include_ligands=True and preprocess=True using a
fixture-backed cached-embedding file (or reuse the existing pdb fixtures), then
fetch data = ds[0], locate ligand rows via data["protein"].is_ligand, and assert
that those ligand-row embeddings in data["protein"].x match the expected cached
embeddings (i.e., are replaced by the cached values) and differ from the
original one-hot element encoding; reference CachedEmbeddingEncoder.forward and
ProteinWaterDataset to find the encoder flow. Ensure the test is small, uses the
tmp_path fixture for any temp files, and only targets the cached-embedding path
for esm/slae.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/dataset.py`:
- Around line 73-81: The current ligand selection uses ligand_mask =
~protein_mask & ~water_mask which includes any non-protein/non-water ATOM
records; instead restrict ligands to HETATM entries by using the Biotite flag
atoms.hetero: update ligand_mask to also require atoms.hetero (e.g., ligand_mask
= (~protein_mask & ~water_mask) & atoms.hetero) so ligand_atoms contains only
non-protein, non-water HETATM records; keep protein_mask, water_mask,
protein_atoms, water_atoms assignments unchanged and return protein_atoms,
water_atoms, ligand_atoms.
- Around line 1070-1078: _pool_by_residue will fail if residue_index contains -1
for ligands because scatter operations (scatter_mean, scatter_add, scatter_max)
cannot accept negative indices; before any scatter keyed by residue_index (in
_pool_by_residue and any other pooling guarded by pool_residue), filter out
ligand atoms using the is_ligand mask (e.g., mask = ~is_ligand) and pass atom
embeddings and residue_index masked arrays (atom_embed[mask],
residue_index[mask]) to scatter with dim_size=num_residues; apply the same
masking wherever residue_index is used for pooling to avoid negative-index
errors.

In `@src/encoder_base.py`:
- Around line 182-190: The ligand_embed Linear is being created lazily in
forward(), so its parameters are not registered when optimizers/DDP/FSDP capture
model parameters; move the nn.Linear initialization into __init__ by creating
self.ligand_embed there (using the given embedding_dim) and remove the
forward-time instantiation, or if embedding_dim truly isn’t known at
construction, add an explicit initialize_ligand_embed(embedding_dim) method that
constructs and registers self.ligand_embed before the optimizer is created (call
it in setup code), ensuring parameters are registered with model.parameters();
as a last-resort alternative, if you cannot construct before optimizer, call
optimizer.add_param_group(...) with the new parameters immediately after
creation (note this still may not work with DDP/FSDP), and update references to
ligand_embed in forward() to assume it already exists.

---

Nitpick comments:
In `@tests/test_dataset.py`:
- Around line 566-697: Add a small integration test in tests/test_dataset.py
that exercises CachedEmbeddingEncoder.forward for encoder_type="esm" (or "slae")
to prevent regressions: construct a ProteinWaterDataset with
include_ligands=True and preprocess=True using a fixture-backed cached-embedding
file (or reuse the existing pdb fixtures), then fetch data = ds[0], locate
ligand rows via data["protein"].is_ligand, and assert that those ligand-row
embeddings in data["protein"].x match the expected cached embeddings (i.e., are
replaced by the cached values) and differ from the original one-hot element
encoding; reference CachedEmbeddingEncoder.forward and ProteinWaterDataset to
find the encoder flow. Ensure the test is small, uses the tmp_path fixture for
any temp files, and only targets the cached-embedding path for esm/slae.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: abe0f086-5c49-47da-8543-bf16a9391a11

📥 Commits

Reviewing files that changed from the base of the PR and between b9c0ce9 and bebf742.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • pyproject.toml
  • src/dataset.py
  • src/encoder_base.py
  • tests/conftest.py
  • tests/test_dataset.py
  • tests/test_files/4h0b/4h0b_final.pdb

Comment thread src/dataset.py
Comment thread src/dataset.py Outdated
Comment thread src/encoder_base.py Outdated
Copilot AI review requested due to automatic review settings June 24, 2026 01:00

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 11 changed files in this pull request and generated 2 comments.

Comment thread src/dataset.py
Comment thread src/encoder_base.py Outdated
Copilot AI review requested due to automatic review settings June 24, 2026 01:13
@vratins vratins removed the request for review from Copilot June 24, 2026 01:13
@vratins vratins closed this Jun 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ligand Encoding

2 participants