Skip to content

Adding explicit sampling strategies#84

Open
vratins wants to merge 6 commits into
mainfrom
dev_water_sampling
Open

Adding explicit sampling strategies#84
vratins wants to merge 6 commits into
mainfrom
dev_water_sampling

Conversation

@vratins

@vratins vratins commented Jun 17, 2026

Copy link
Copy Markdown
Contributor
  • sample_waters_uniform_ball — places each water uniformly inside a ball of radius cutoff around a randomly chosen protein anchor atom, with volume correction for sampling uniformly within a ball of radius r.

  • sample_waters_scaled_gaussianN(0, (sigma^2)*I) where sigma is the per-graph std of protein coordinates. Equivalent to old behavior but explicit.

  • Added the above to the FlowMatcher class

  • Tests cover shape/count correctness, the empty-graph edge case, and a real-structure test against 6eey that verifies the cutoff guarantee holds on actual protein geometry with two graphs of different water counts batched together.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added new water-noise sampling strategies (uniform within a ball and scaled Gaussian) for flexible initialization during training and inference.
    • Introduced configurable dynamic edge-building behavior with an automatic resolution option.
    • Extended inference to support specifying water_count (in addition to water_ratio) when sampling initial water nodes.
  • Tests

    • Added unit tests covering water sampling behavior, edge-policy auto-resolution, and validation errors for invalid water counts.

Copilot AI review requested due to automatic review settings June 17, 2026 06:22
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@vratins, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 52 minutes and 56 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f7f27399-66d7-4dca-ba0a-b7fca30f5dd7

📥 Commits

Reviewing files that changed from the base of the PR and between 15dbee8 and 350bba9.

📒 Files selected for processing (1)
  • src/flow.py
📝 Walkthrough

Walkthrough

Adds two water-sampling helpers, threads sampling_strategy and dynamic_edge_policy through FlowMatcher, updates training and validation to use per-graph water counts and sampled initial positions, and adds water_count support to both inference integrators. Tests cover sampler behavior, policy resolution, and water-count validation.

Changes

Water sampling strategies and dynamic edge policies

Layer / File(s) Summary
New module-level water sampling helpers
src/flow.py, tests/test_flow.py
Adds scatter import, implements sample_waters_uniform_ball and sample_waters_scaled_gaussian, and adds tests covering tensor shapes, per-graph counts, empty-water behavior, cutoff constraints, large-spread inputs, a real structure fixture, and scaled-Gaussian edge cases.
FlowMatcher strategy constants and __init__ extension
src/flow.py, tests/test_flow.py
Declares SAMPLING_STRATEGIES and DYNAMIC_EDGE_POLICIES, extends __init__ with validated sampling_strategy and dynamic_edge_policy parameters, adds the effective-policy resolver, and tests the auto-policy mapping.
Training and validation step integration
src/flow.py
Updates training_step and validation_step to set batch.dynamic_edge_policy, infer num_graphs, compute per-graph water counts, derive sigma_per_graph, sample x0 through _sample_waters, and aggregate RMSD with scatter_mean.
Inference integrator water_count branching
src/flow.py
Adds water_count to euler_integrate and rk4_integrate, updates their docstrings, and reworks water-node setup so count, ratio, and existing-node paths all resample positions through _sample_waters.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I hop through balls of watery light,
Gaussian drifts and counts feel right.
The flow now listens, step by step,
With edge and water rules well-kept. 💧

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly reflects the main change: adding explicit sampling strategies for water placement.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dev_water_sampling

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces explicit water-noise sampling strategies in FlowMatcher, adding a uniform-within-cutoff sampler around protein atoms and an explicit “scaled Gaussian” sampler (matching the prior implicit behavior), along with tests covering correctness and edge cases.

Changes:

  • Added sample_waters_uniform_ball() and sample_waters_scaled_gaussian() utilities and integrated them into FlowMatcher via a sampling_strategy option.
  • Added dynamic_edge_policy plumbing and a resolver method (currently not consumed by the model’s edge builder).
  • Expanded tests/test_flow.py with unit tests for both sampling strategies, including a real-structure cutoff guarantee test.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 12 comments.

File Description
src/flow.py Adds sampling strategy implementations + integrates sampling into training/validation/integration paths; introduces sampling_strategy/dynamic_edge_policy config.
tests/test_flow.py Adds tests validating sampler shape/count behavior, empty-water handling, cutoff guarantees, and batching behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/flow.py
Comment thread src/flow.py
Comment thread src/flow.py Outdated
Comment thread src/flow.py
Comment thread src/flow.py
Comment thread src/flow.py
Comment thread src/flow.py
Comment thread tests/test_flow.py
Comment thread src/flow.py
Comment thread src/flow.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/flow.py`:
- Line 724: The return type annotation for the training_step method at line 724
declares dict[str, float | int | None] but the function actually returns a
sigma_per_graph tensor and an optional per_sample_info dict which do not match
this contract. Update the return type annotation to accurately reflect what
training_step actually returns, including the tensor type for sigma_per_graph
and the optional dict type for per_sample_info, or alternatively modify the
returned values to match the declared return type by serializing the sigma
tensor to a scalar or list suitable for metric logging.
- Line 1001: The docstring for `water_true` at line 1001 states it should be
`None` when `water_ratio` or `water_count` is used, but the result builders are
returning the original ground-truth waters instead of `None`. Fix this by
applying conditional logic in the result builders (around lines 1024-1031, 1122,
and 1145-1152) to check if `water_ratio` or `water_count` flags are set, and if
so, set `water_true` to `None` rather than returning the ground-truth water
values. Follow the same flag/result logic pattern already implemented in the
`rk4_integrate()` function to ensure consistency across all result building
locations.
- Around line 676-682: The _effective_dynamic_edge_policy method correctly
resolves the edge policy to either "radius" or "knn_if_isolated" and
training/validation attach this to the batch, but the
ProteinWaterUpdate.build_edges() method unconditionally builds KNN edges and
ignores this attribute entirely. Modify the build_edges() method to read the
dynamic_edge_policy attribute from the batch input and conditionally construct
either KNN edges or radius-based edges based on its value instead of always
building KNN edges. This will ensure the configured policy actually affects the
edge construction at runtime.
- Around line 931-959: Add input validation for the water_count parameter at the
beginning of the _setup_water_nodes_from_count method to reject negative or
invalid values before they propagate into the num_waters tensor and downstream
_sample_waters call. Raise a descriptive ValueError if water_count is less than
or equal to zero. Apply the same validation logic to the other methods mentioned
in lines 1024-1031 and 1145-1152 that also accept similar count parameters as
public API boundaries.
- Around line 84-87: Add a guard check before the anchor selection at the line
with `anchors = protein_pos.to(device)[graph_offsets + local_idx]` to ensure
that any graph being processed has at least one protein atom available.
Specifically, verify that all values in `graph_sizes` (which comes from
`num_p_per_graph[batch_w]`) are greater than zero before proceeding with the
local_idx computation and anchor selection, otherwise raise an informative error
that indicates a graph with waters but no protein atoms was encountered.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fe982420-7025-40de-ad55-6aa3248b0d89

📥 Commits

Reviewing files that changed from the base of the PR and between c3b9db6 and 6fff711.

📒 Files selected for processing (2)
  • src/flow.py
  • tests/test_flow.py

Comment thread src/flow.py
Comment thread src/flow.py
Comment thread src/flow.py Outdated
Comment thread src/flow.py
Comment thread src/flow.py Outdated
Copilot AI review requested due to automatic review settings June 17, 2026 07:13
@vratins vratins removed the request for review from Copilot June 17, 2026 07:13
Copilot AI review requested due to automatic review settings June 24, 2026 23:22

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_flow.py`:
- Around line 572-577: The real-structure sampling test is missing a fixed RNG
seed, so its random sampling is not reproducible. Update
test_real_structure_cutoff_and_batch to set a deterministic seed with
torch.manual_seed before any sampling occurs, keeping the cutoff check stable
and making any rare boundary failures repeatable. Use the test function name to
locate the change and ensure the seed is applied at the start of the test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 59d5fa95-1c43-41e4-a6aa-55f364451518

📥 Commits

Reviewing files that changed from the base of the PR and between 6fff711 and 15dbee8.

📒 Files selected for processing (2)
  • src/flow.py
  • tests/test_flow.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/flow.py

Comment thread tests/test_flow.py
Comment on lines +572 to +577
@pytest.mark.slow
def test_real_structure_cutoff_and_batch(self, device, pdb_6eey):
"""Cutoff guarantee holds on real protein geometry; batch indexing is correct
when two structures with different water counts are packed into one call."""
import biotite.structure as bts
from biotite.structure.io.pdb import get_structure, PDBFile

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Seed RNG in this real-structure test for reproducibility.

Unlike the other sampler tests, this one performs random sampling without torch.manual_seed. Combined with the tight cutoff + 1e-4 tolerance and float32 distances on real (large-magnitude) coordinates, any rare boundary failure would be non-reproducible and hard to diagnose. Add a fixed seed.

🌱 Proposed seeding
     def test_real_structure_cutoff_and_batch(self, device, pdb_6eey):
         """Cutoff guarantee holds on real protein geometry; batch indexing is correct
         when two structures with different water counts are packed into one call."""
+        torch.manual_seed(0)
         import biotite.structure as bts
         from biotite.structure.io.pdb import get_structure, PDBFile
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@pytest.mark.slow
def test_real_structure_cutoff_and_batch(self, device, pdb_6eey):
"""Cutoff guarantee holds on real protein geometry; batch indexing is correct
when two structures with different water counts are packed into one call."""
import biotite.structure as bts
from biotite.structure.io.pdb import get_structure, PDBFile
`@pytest.mark.slow`
def test_real_structure_cutoff_and_batch(self, device, pdb_6eey):
"""Cutoff guarantee holds on real protein geometry; batch indexing is correct
when two structures with different water counts are packed into one call."""
torch.manual_seed(0)
import biotite.structure as bts
from biotite.structure.io.pdb import get_structure, PDBFile
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_flow.py` around lines 572 - 577, The real-structure sampling test
is missing a fixed RNG seed, so its random sampling is not reproducible. Update
test_real_structure_cutoff_and_batch to set a deterministic seed with
torch.manual_seed before any sampling occurs, keeping the cutoff check stable
and making any rare boundary failures repeatable. Use the test function name to
locate the change and ensure the seed is applied at the start of the test.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/flow.py:977

  • Same as _setup_water_nodes_from_ratio(): avoid hard-coding water feature width to 16 here. Derive the feature width/dtype from the existing g['water'].x when present to keep node feature dimensionality consistent.
        # create water features (oxygen one-hot, index 2 for 'O' in ELEMENT_VOCAB)
        water_x = torch.zeros(total_waters, 16, device=device)
        water_x[:, 2] = 1.0  # oxygen is index 2 in ELEMENT_VOCAB

Comment thread src/flow.py
Comment thread src/flow.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants