Adding ligands to dataset processing#78
Conversation
|
Caution Review failedPull request was closed or merged during review No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (8)
💤 Files with no reviewable changes (1)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughThe pull request extends a protein dataset pipeline to support ligand atoms. Dataset parsing now extracts ligands as a third return value; preprocessing optionally includes ligand coordinates and features with sentinel markers and boolean masks. The embedding encoder adds a learnable ligand projection layer. Embedding scripts, pooling operations, and tests were updated for the 3-tuple return and new ligand behavior. ChangesLigand Support in Dataset Pipeline
Sequence DiagramssequenceDiagram
participant User
participant Dataset as ProteinWaterDataset
participant Parser as parse_asu_with_biotite()
participant Preprocess
participant Encoder as CachedEmbeddingEncoder
User->>Dataset: init(include_ligands=True)
Dataset->>Parser: parse_asu_with_biotite(pdb_path)
Parser-->>Dataset: (protein_atoms, water_atoms, ligand_atoms)
Dataset->>Preprocess: build node tensors with is_ligand
alt include_ligands=True & ligands exist
Preprocess->>Preprocess: append ligand coordinates & features
Preprocess->>Preprocess: set residue_index=-1 for ligands
Preprocess->>Preprocess: create is_ligand boolean mask
else include_ligands=False
Preprocess->>Preprocess: is_ligand = all False
end
Preprocess-->>Dataset: cached data with is_ligand tensor
User->>Dataset: __getitem__(idx)
Dataset-->>User: graph with optional ligand nodes & is_ligand mask
User->>Encoder: forward(data)
alt is_ligand mask present & any True
Encoder->>Encoder: ligand_embed(protein.x[ligand_mask])
Encoder->>Encoder: replace cached embeddings for ligand rows
else no ligand mask
Encoder->>Encoder: return cached embeddings unchanged
end
Encoder-->>User: embeddings with learned ligand projections
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes The changes introduce new feature logic (ligand extraction and encoding) across multiple interconnected files. Dataset parsing returns an additional value type; preprocessing conditionally appends ligand data with sentinel markers and boolean masking; the encoder adds learnable projection logic with conditional device-aware execution paths. Comprehensive test coverage adds validation complexity across parsing, integration, and encoder layers. The variety of heterogeneous changes across dataset, encoder, pooling, scripts, and tests warrants substantial review effort. Poem
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR extends dataset preprocessing to include non-water ligands in the static structure representation, folding ligand atoms into the existing protein node type and tagging them via an is_ligand boolean mask.
Changes:
- Update PDB parsing to return
(protein_atoms, water_atoms, ligand_atoms)and (optionally) append ligands to protein nodes viainclude_ligands. - Add
protein.is_ligandto cached geometry andHeteroDatafor downstream masking/conditioning. - Add integration tests and a new ligand-containing PDB fixture (
4h0b) to validate parsing and dataset behavior.
Reviewed changes
Copilot reviewed 5 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/dataset.py |
Adds ligand parsing/output, include_ligands option, and persists is_ligand in caches and HeteroData. |
src/encoder_base.py |
Adds a learned projection for ligand rows when using cached-embedding encoders (ESM/SLAE). |
tests/test_dataset.py |
Expands parsing tests for 3-way partitioning and adds dataset integration tests for ligand inclusion. |
tests/conftest.py |
Adds a pdb_4h0b fixture for ligand test coverage. |
pyproject.toml |
Adds rdkit and swaps the PyMOL package reference to pymol-open-source-whl. |
uv.lock |
Updates locked dependency set to reflect dependency changes (incl. rdkit/jaxtyping/pymol package rename). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
tests/test_dataset.py (1)
566-697: Add one cached-embedding ligand regression test.These cases only exercise
encoder_type="gvp", but the new ligand-specific behavior lives inCachedEmbeddingEncoder.forward()foresm/slae. A tiny fixture-backed test that asserts ligand rows are replaced would cover the path most likely to regress.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_dataset.py` around lines 566 - 697, Add a small integration test in tests/test_dataset.py that exercises CachedEmbeddingEncoder.forward for encoder_type="esm" (or "slae") to prevent regressions: construct a ProteinWaterDataset with include_ligands=True and preprocess=True using a fixture-backed cached-embedding file (or reuse the existing pdb fixtures), then fetch data = ds[0], locate ligand rows via data["protein"].is_ligand, and assert that those ligand-row embeddings in data["protein"].x match the expected cached embeddings (i.e., are replaced by the cached values) and differ from the original one-hot element encoding; reference CachedEmbeddingEncoder.forward and ProteinWaterDataset to find the encoder flow. Ensure the test is small, uses the tmp_path fixture for any temp files, and only targets the cached-embedding path for esm/slae.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/dataset.py`:
- Around line 73-81: The current ligand selection uses ligand_mask =
~protein_mask & ~water_mask which includes any non-protein/non-water ATOM
records; instead restrict ligands to HETATM entries by using the Biotite flag
atoms.hetero: update ligand_mask to also require atoms.hetero (e.g., ligand_mask
= (~protein_mask & ~water_mask) & atoms.hetero) so ligand_atoms contains only
non-protein, non-water HETATM records; keep protein_mask, water_mask,
protein_atoms, water_atoms assignments unchanged and return protein_atoms,
water_atoms, ligand_atoms.
- Around line 1070-1078: _pool_by_residue will fail if residue_index contains -1
for ligands because scatter operations (scatter_mean, scatter_add, scatter_max)
cannot accept negative indices; before any scatter keyed by residue_index (in
_pool_by_residue and any other pooling guarded by pool_residue), filter out
ligand atoms using the is_ligand mask (e.g., mask = ~is_ligand) and pass atom
embeddings and residue_index masked arrays (atom_embed[mask],
residue_index[mask]) to scatter with dim_size=num_residues; apply the same
masking wherever residue_index is used for pooling to avoid negative-index
errors.
In `@src/encoder_base.py`:
- Around line 182-190: The ligand_embed Linear is being created lazily in
forward(), so its parameters are not registered when optimizers/DDP/FSDP capture
model parameters; move the nn.Linear initialization into __init__ by creating
self.ligand_embed there (using the given embedding_dim) and remove the
forward-time instantiation, or if embedding_dim truly isn’t known at
construction, add an explicit initialize_ligand_embed(embedding_dim) method that
constructs and registers self.ligand_embed before the optimizer is created (call
it in setup code), ensuring parameters are registered with model.parameters();
as a last-resort alternative, if you cannot construct before optimizer, call
optimizer.add_param_group(...) with the new parameters immediately after
creation (note this still may not work with DDP/FSDP), and update references to
ligand_embed in forward() to assume it already exists.
---
Nitpick comments:
In `@tests/test_dataset.py`:
- Around line 566-697: Add a small integration test in tests/test_dataset.py
that exercises CachedEmbeddingEncoder.forward for encoder_type="esm" (or "slae")
to prevent regressions: construct a ProteinWaterDataset with
include_ligands=True and preprocess=True using a fixture-backed cached-embedding
file (or reuse the existing pdb fixtures), then fetch data = ds[0], locate
ligand rows via data["protein"].is_ligand, and assert that those ligand-row
embeddings in data["protein"].x match the expected cached embeddings (i.e., are
replaced by the cached values) and differ from the original one-hot element
encoding; reference CachedEmbeddingEncoder.forward and ProteinWaterDataset to
find the encoder flow. Ensure the test is small, uses the tmp_path fixture for
any temp files, and only targets the cached-embedding path for esm/slae.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: abe0f086-5c49-47da-8543-bf16a9391a11
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (6)
pyproject.tomlsrc/dataset.pysrc/encoder_base.pytests/conftest.pytests/test_dataset.pytests/test_files/4h0b/4h0b_final.pdb
PR to include ligands in the static structure when encoding proteins. This is a v1 change so tried to keep it as simple of a change as possible.
proteinnode types with ais_ligandflag.Summary by CodeRabbit
include_ligands).embedding_dimat construction, avoiding first-forward inference issues.pymol-open-source-whl>=3.1.0.4.