Adding ligands to dataset processing by vratins · Pull Request #78 · diff-use/WaterFlow

vratins · 2026-03-30T20:14:53Z

PR to include ligands in the static structure when encoding proteins. This is a v1 change so tried to keep it as simple of a change as possible.

Ligands are included in protein node types with a is_ligand flag.
Tried to minimize major codebase refactors in this PR so no new edge types or node-types.
Encoded with 1-hot element type
Added tests, as well as a new test pdb file which contains ligands (the PDB file is ~3.2k lines hence the huge diff)

Summary by CodeRabbit

New Features
- Added ligand atom extraction and optional ligand node inclusion in dataset preprocessing (include_ligands).
- Added ligand-aware cached embeddings via a learned ligand projection when ligand nodes are present.
Bug Fixes
- Improved residue pooling to ignore ligand nodes with sentinel residue indices.
- Cached embedding encoders now validate embedding_dim at construction, avoiding first-forward inference issues.
Dependencies
- Updated PyMOL Open Source package to pymol-open-source-whl>=3.1.0.4.
Tests
- Expanded dataset, parsing, and encoder tests to cover ligand extraction and new encoder behavior.

… for the same

coderabbitai · 2026-03-30T20:15:10Z

Caution

Review failed

Pull request was closed or merged during review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7a6d177f-82a6-4517-83c2-383b7a604696

📥 Commits

Reviewing files that changed from the base of the PR and between c678fc0 and aca3180.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (8)

pyproject.toml
scripts/generate_esm_embeddings.py
scripts/generate_slae_embeddings.py
src/dataset.py
src/encoder_base.py
src/gvp_encoder.py
tests/test_dataset.py
tests/test_encoder.py

💤 Files with no reviewable changes (1)

pyproject.toml

🚧 Files skipped from review as they are similar to previous changes (2)

tests/test_dataset.py
src/dataset.py

📝 Walkthrough

Walkthrough

The pull request extends a protein dataset pipeline to support ligand atoms. Dataset parsing now extracts ligands as a third return value; preprocessing optionally includes ligand coordinates and features with sentinel markers and boolean masks. The embedding encoder adds a learnable ligand projection layer. Embedding scripts, pooling operations, and tests were updated for the 3-tuple return and new ligand behavior.

Changes

Ligand Support in Dataset Pipeline

Layer / File(s)	Summary
Dependency Updates `pyproject.toml`	Updated `pymol-open-source` to `pymol-open-source-whl>=3.1.0.4`.
Ligand Extraction Function `src/dataset.py`	Extended `parse_asu_with_biotite()` to extract non-amino-acid, non-water heavy atoms via ligand mask and return a 3-tuple `(protein_atoms, water_atoms, ligand_atoms)`.
Dataset Constructor & Cache Configuration `src/dataset.py`	Added `include_ligands: bool = False` parameter to `ProteinWaterDataset`; documentation describes ligand behavior; cache directory suffix appends `"_lig"` when enabled; state stored for preprocessing.
Preprocessing: Ligand Integration & Caching `src/dataset.py`	Preprocessing unpacks 3-tuple from `parse_asu_with_biotite()`; when `include_ligands` is enabled and ligands exist, appends ligand coordinates and one-hot element features after protein nodes; sets ligand residue indices to `-1` sentinel; creates and caches `is_ligand` boolean mask.
Data Loading with Backward Compatibility `src/dataset.py`	`__getitem__` loads `is_ligand` from cache and attaches it to `data["protein"].is_ligand` for graph construction.
Encoder Ligand Handling `src/encoder_base.py`	`CachedEmbeddingEncoder` now requires `embedding_dim` at construction; added learnable `ligand_embed` projection; `forward()` replaces cached embeddings for ligand nodes with learned projections when `is_ligand` mask is present; `from_config` requires `embedding_dim` in config.
GVP Residue Pooling Update `src/gvp_encoder.py`	Updated `_pool_by_residue()` to filter out ligand atoms (residue_index < 0) before scatter-based aggregation.
Embedding Script Updates `scripts/generate_esm_embeddings.py`, `scripts/generate_slae_embeddings.py`	Updated to unpack 3-tuple from `parse_asu_with_biotite()`; SLAE script clarified as legacy in module docstring.
Ligand Parsing Test Coverage `tests/test_dataset.py`	Updated existing tests to expect 3-tuple return; added `TestLigandParsing` validating ligand extraction semantics; added `TestLigandNodeIntegration` validating dataset behavior with `include_ligands` parameter.
Integration Test Updates `tests/test_dataset.py`	Updated quality-filter and water-filtering integration tests to unpack 3-tuple from `parse_asu_with_biotite()`.
Encoder Test Updates `tests/test_encoder.py`	Updated all cached encoder tests to thread `embedding_dim` through construction; removed assertions that `output_dims` was inferred after first forward; validated `embedding_dim` requirement and added parameter-count assertions for ligand projection.

Sequence Diagrams

sequenceDiagram
    participant User
    participant Dataset as ProteinWaterDataset
    participant Parser as parse_asu_with_biotite()
    participant Preprocess
    participant Encoder as CachedEmbeddingEncoder

    User->>Dataset: init(include_ligands=True)
    Dataset->>Parser: parse_asu_with_biotite(pdb_path)
    Parser-->>Dataset: (protein_atoms, water_atoms, ligand_atoms)
    
    Dataset->>Preprocess: build node tensors with is_ligand
    alt include_ligands=True & ligands exist
        Preprocess->>Preprocess: append ligand coordinates & features
        Preprocess->>Preprocess: set residue_index=-1 for ligands
        Preprocess->>Preprocess: create is_ligand boolean mask
    else include_ligands=False
        Preprocess->>Preprocess: is_ligand = all False
    end
    Preprocess-->>Dataset: cached data with is_ligand tensor
    
    User->>Dataset: __getitem__(idx)
    Dataset-->>User: graph with optional ligand nodes & is_ligand mask
    
    User->>Encoder: forward(data)
    alt is_ligand mask present & any True
        Encoder->>Encoder: ligand_embed(protein.x[ligand_mask])
        Encoder->>Encoder: replace cached embeddings for ligand rows
    else no ligand mask
        Encoder->>Encoder: return cached embeddings unchanged
    end
    
    Encoder-->>User: embeddings with learned ligand projections

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The changes introduce new feature logic (ligand extraction and encoding) across multiple interconnected files. Dataset parsing returns an additional value type; preprocessing conditionally appends ligand data with sentinel markers and boolean masking; the encoder adds learnable projection logic with conditional device-aware execution paths. Comprehensive test coverage adds validation complexity across parsing, integration, and encoder layers. The variety of heterogeneous changes across dataset, encoder, pooling, scripts, and tests warrants substantial review effort.

Poem

A ligand hops in, tail held high,
No longer lost, no longer shy,
With is_ligand marks, bright and true,
The encoder learns what ligands do—
One hop, one prep, one learned projection,
Our protein graph finds new direction! 🐰✨

Suggested reviewers

marcuscollins

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Adding ligands to dataset processing' accurately captures the main change: extending dataset processing to include ligand atoms alongside proteins and water molecules.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch dev_ligands

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

Copilot

Pull request overview

This PR extends dataset preprocessing to include non-water ligands in the static structure representation, folding ligand atoms into the existing protein node type and tagging them via an is_ligand boolean mask.

Changes:

Update PDB parsing to return (protein_atoms, water_atoms, ligand_atoms) and (optionally) append ligands to protein nodes via include_ligands.
Add protein.is_ligand to cached geometry and HeteroData for downstream masking/conditioning.
Add integration tests and a new ligand-containing PDB fixture (4h0b) to validate parsing and dataset behavior.

Reviewed changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`src/dataset.py`	Adds ligand parsing/output, `include_ligands` option, and persists `is_ligand` in caches and `HeteroData`.
`src/encoder_base.py`	Adds a learned projection for ligand rows when using cached-embedding encoders (ESM/SLAE).
`tests/test_dataset.py`	Expands parsing tests for 3-way partitioning and adds dataset integration tests for ligand inclusion.
`tests/conftest.py`	Adds a `pdb_4h0b` fixture for ligand test coverage.
`pyproject.toml`	Adds `rdkit` and swaps the PyMOL package reference to `pymol-open-source-whl`.
`uv.lock`	Updates locked dependency set to reflect dependency changes (incl. rdkit/jaxtyping/pymol package rename).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

tests/test_dataset.py (1)

566-697: Add one cached-embedding ligand regression test.

These cases only exercise encoder_type="gvp", but the new ligand-specific behavior lives in CachedEmbeddingEncoder.forward() for esm/slae. A tiny fixture-backed test that asserts ligand rows are replaced would cover the path most likely to regress.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/test_dataset.py` around lines 566 - 697, Add a small integration test
in tests/test_dataset.py that exercises CachedEmbeddingEncoder.forward for
encoder_type="esm" (or "slae") to prevent regressions: construct a
ProteinWaterDataset with include_ligands=True and preprocess=True using a
fixture-backed cached-embedding file (or reuse the existing pdb fixtures), then
fetch data = ds[0], locate ligand rows via data["protein"].is_ligand, and assert
that those ligand-row embeddings in data["protein"].x match the expected cached
embeddings (i.e., are replaced by the cached values) and differ from the
original one-hot element encoding; reference CachedEmbeddingEncoder.forward and
ProteinWaterDataset to find the encoder flow. Ensure the test is small, uses the
tmp_path fixture for any temp files, and only targets the cached-embedding path
for esm/slae.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/dataset.py`:
- Around line 73-81: The current ligand selection uses ligand_mask =
~protein_mask & ~water_mask which includes any non-protein/non-water ATOM
records; instead restrict ligands to HETATM entries by using the Biotite flag
atoms.hetero: update ligand_mask to also require atoms.hetero (e.g., ligand_mask
= (~protein_mask & ~water_mask) & atoms.hetero) so ligand_atoms contains only
non-protein, non-water HETATM records; keep protein_mask, water_mask,
protein_atoms, water_atoms assignments unchanged and return protein_atoms,
water_atoms, ligand_atoms.
- Around line 1070-1078: _pool_by_residue will fail if residue_index contains -1
for ligands because scatter operations (scatter_mean, scatter_add, scatter_max)
cannot accept negative indices; before any scatter keyed by residue_index (in
_pool_by_residue and any other pooling guarded by pool_residue), filter out
ligand atoms using the is_ligand mask (e.g., mask = ~is_ligand) and pass atom
embeddings and residue_index masked arrays (atom_embed[mask],
residue_index[mask]) to scatter with dim_size=num_residues; apply the same
masking wherever residue_index is used for pooling to avoid negative-index
errors.

In `@src/encoder_base.py`:
- Around line 182-190: The ligand_embed Linear is being created lazily in
forward(), so its parameters are not registered when optimizers/DDP/FSDP capture
model parameters; move the nn.Linear initialization into __init__ by creating
self.ligand_embed there (using the given embedding_dim) and remove the
forward-time instantiation, or if embedding_dim truly isn’t known at
construction, add an explicit initialize_ligand_embed(embedding_dim) method that
constructs and registers self.ligand_embed before the optimizer is created (call
it in setup code), ensuring parameters are registered with model.parameters();
as a last-resort alternative, if you cannot construct before optimizer, call
optimizer.add_param_group(...) with the new parameters immediately after
creation (note this still may not work with DDP/FSDP), and update references to
ligand_embed in forward() to assume it already exists.

---

Nitpick comments:
In `@tests/test_dataset.py`:
- Around line 566-697: Add a small integration test in tests/test_dataset.py
that exercises CachedEmbeddingEncoder.forward for encoder_type="esm" (or "slae")
to prevent regressions: construct a ProteinWaterDataset with
include_ligands=True and preprocess=True using a fixture-backed cached-embedding
file (or reuse the existing pdb fixtures), then fetch data = ds[0], locate
ligand rows via data["protein"].is_ligand, and assert that those ligand-row
embeddings in data["protein"].x match the expected cached embeddings (i.e., are
replaced by the cached values) and differ from the original one-hot element
encoding; reference CachedEmbeddingEncoder.forward and ProteinWaterDataset to
find the encoder flow. Ensure the test is small, uses the tmp_path fixture for
any temp files, and only targets the cached-embedding path for esm/slae.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: abe0f086-5c49-47da-8543-bf16a9391a11

📥 Commits

Reviewing files that changed from the base of the PR and between b9c0ce9 and bebf742.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (6)

pyproject.toml
src/dataset.py
src/encoder_base.py
tests/conftest.py
tests/test_dataset.py
tests/test_files/4h0b/4h0b_final.pdb

Copilot

Pull request overview

Copilot reviewed 9 out of 11 changed files in this pull request and generated 2 comments.

vratins added 4 commits March 30, 2026 12:08

code to parse ligands as a 'non-water' node alongside proteins, tests…

38c2e0b

… for the same

Merge branch 'main' into dev_ligands

b8568a5

rdkit dependencies

7cf6de1

tests

1529943

Copilot AI review requested due to automatic review settings March 30, 2026 20:14

vratins temporarily deployed to testing March 30, 2026 20:14 — with GitHub Actions Inactive

vratins had a problem deploying to testing March 30, 2026 20:14 — with GitHub Actions Failure

Copilot started reviewing on behalf of vratins March 30, 2026 20:15 View session

Auto-commit ruff fixes [skip ci]

bebf742

vratins requested review from DorisMai and marcuscollins March 30, 2026 20:16

vratins linked an issue Mar 30, 2026 that may be closed by this pull request

Ligand Encoding #7

Open

Copilot AI reviewed Mar 30, 2026

View reviewed changes

Comment thread src/dataset.py

Comment thread src/dataset.py

Comment thread src/dataset.py Outdated

Comment thread pyproject.toml

coderabbitai Bot reviewed Mar 30, 2026

View reviewed changes

Comment thread src/dataset.py

Comment thread src/dataset.py Outdated

Comment thread src/encoder_base.py Outdated

mates comments

c678fc0

vratins temporarily deployed to testing June 23, 2026 20:32 — with GitHub Actions Inactive

addressing het atom discrepancy

e78be52

Copilot AI review requested due to automatic review settings June 24, 2026 01:00

vratins temporarily deployed to testing June 24, 2026 01:00 — with GitHub Actions Inactive

Copilot started reviewing on behalf of vratins June 24, 2026 01:00 View session

Copilot AI reviewed Jun 24, 2026

View reviewed changes

Comment thread src/dataset.py

Comment thread src/encoder_base.py Outdated

addressing comments

c824e2f

vratins temporarily deployed to testing June 24, 2026 01:11 — with GitHub Actions Inactive

Auto-commit ruff fixes [skip ci]

aca3180

Copilot AI review requested due to automatic review settings June 24, 2026 01:13

vratins removed the request for review from Copilot June 24, 2026 01:13

vratins closed this Jun 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Adding ligands to dataset processing#78

Adding ligands to dataset processing#78
vratins wants to merge 9 commits into
mainfrom
dev_ligands

vratins commented Mar 30, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Mar 30, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

Sequence Diagrams

Estimated code review effort

Poem

Suggested reviewers

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

vratins commented Mar 30, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Sequence Diagrams

Estimated code review effort

Poem

Suggested reviewers

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vratins commented Mar 30, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Mar 30, 2026 •

edited

Loading