Adding explicit sampling strategies by vratins · Pull Request #84 · diff-use/WaterFlow

vratins · 2026-06-17T06:22:55Z

sample_waters_uniform_ball — places each water uniformly inside a ball of radius cutoff around a randomly chosen protein anchor atom, with volume correction for sampling uniformly within a ball of radius r.
sample_waters_scaled_gaussian — N(0, (sigma^2)*I) where sigma is the per-graph std of protein coordinates. Equivalent to old behavior but explicit.
Added the above to the FlowMatcher class
Tests cover shape/count correctness, the empty-graph edge case, and a real-structure test against 6eey that verifies the cutoff guarantee holds on actual protein geometry with two graphs of different water counts batched together.

Summary by CodeRabbit

Release Notes

New Features
- Added new water-noise sampling strategies (uniform within a ball and scaled Gaussian) for flexible initialization during training and inference.
- Introduced configurable dynamic edge-building behavior with an automatic resolution option.
- Extended inference to support specifying water_count (in addition to water_ratio) when sampling initial water nodes.
Tests
- Added unit tests covering water sampling behavior, edge-policy auto-resolution, and validation errors for invalid water counts.

coderabbitai · 2026-06-17T06:23:10Z

Warning

Review limit reached

@vratins, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 52 minutes and 56 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f7f27399-66d7-4dca-ba0a-b7fca30f5dd7

📥 Commits

Reviewing files that changed from the base of the PR and between 15dbee8 and 350bba9.

📒 Files selected for processing (1)

src/flow.py

📝 Walkthrough

Walkthrough

Adds two water-sampling helpers, threads sampling_strategy and dynamic_edge_policy through FlowMatcher, updates training and validation to use per-graph water counts and sampled initial positions, and adds water_count support to both inference integrators. Tests cover sampler behavior, policy resolution, and water-count validation.

Changes

Water sampling strategies and dynamic edge policies

Layer / File(s)	Summary
New module-level water sampling helpers `src/flow.py`, `tests/test_flow.py`	Adds `scatter` import, implements `sample_waters_uniform_ball` and `sample_waters_scaled_gaussian`, and adds tests covering tensor shapes, per-graph counts, empty-water behavior, cutoff constraints, large-spread inputs, a real structure fixture, and scaled-Gaussian edge cases.
FlowMatcher strategy constants and `__init__` extension `src/flow.py`, `tests/test_flow.py`	Declares `SAMPLING_STRATEGIES` and `DYNAMIC_EDGE_POLICIES`, extends `__init__` with validated `sampling_strategy` and `dynamic_edge_policy` parameters, adds the effective-policy resolver, and tests the auto-policy mapping.
Training and validation step integration `src/flow.py`	Updates `training_step` and `validation_step` to set `batch.dynamic_edge_policy`, infer `num_graphs`, compute per-graph water counts, derive `sigma_per_graph`, sample `x0` through `_sample_waters`, and aggregate RMSD with `scatter_mean`.
Inference integrator `water_count` branching `src/flow.py`	Adds `water_count` to `euler_integrate` and `rk4_integrate`, updates their docstrings, and reworks water-node setup so count, ratio, and existing-node paths all resample positions through `_sample_waters`.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I hop through balls of watery light,
Gaussian drifts and counts feel right.
The flow now listens, step by step,
With edge and water rules well-kept. 💧

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly reflects the main change: adding explicit sampling strategies for water placement.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch dev_water_sampling

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

Copilot

Pull request overview

This PR introduces explicit water-noise sampling strategies in FlowMatcher, adding a uniform-within-cutoff sampler around protein atoms and an explicit “scaled Gaussian” sampler (matching the prior implicit behavior), along with tests covering correctness and edge cases.

Changes:

Added sample_waters_uniform_ball() and sample_waters_scaled_gaussian() utilities and integrated them into FlowMatcher via a sampling_strategy option.
Added dynamic_edge_policy plumbing and a resolver method (currently not consumed by the model’s edge builder).
Expanded tests/test_flow.py with unit tests for both sampling strategies, including a real-structure cutoff guarantee test.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 12 comments.

File	Description
`src/flow.py`	Adds sampling strategy implementations + integrates sampling into training/validation/integration paths; introduces `sampling_strategy`/`dynamic_edge_policy` config.
`tests/test_flow.py`	Adds tests validating sampler shape/count behavior, empty-water handling, cutoff guarantees, and batching behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/flow.py`:
- Line 724: The return type annotation for the training_step method at line 724
declares dict[str, float | int | None] but the function actually returns a
sigma_per_graph tensor and an optional per_sample_info dict which do not match
this contract. Update the return type annotation to accurately reflect what
training_step actually returns, including the tensor type for sigma_per_graph
and the optional dict type for per_sample_info, or alternatively modify the
returned values to match the declared return type by serializing the sigma
tensor to a scalar or list suitable for metric logging.
- Line 1001: The docstring for `water_true` at line 1001 states it should be
`None` when `water_ratio` or `water_count` is used, but the result builders are
returning the original ground-truth waters instead of `None`. Fix this by
applying conditional logic in the result builders (around lines 1024-1031, 1122,
and 1145-1152) to check if `water_ratio` or `water_count` flags are set, and if
so, set `water_true` to `None` rather than returning the ground-truth water
values. Follow the same flag/result logic pattern already implemented in the
`rk4_integrate()` function to ensure consistency across all result building
locations.
- Around line 676-682: The _effective_dynamic_edge_policy method correctly
resolves the edge policy to either "radius" or "knn_if_isolated" and
training/validation attach this to the batch, but the
ProteinWaterUpdate.build_edges() method unconditionally builds KNN edges and
ignores this attribute entirely. Modify the build_edges() method to read the
dynamic_edge_policy attribute from the batch input and conditionally construct
either KNN edges or radius-based edges based on its value instead of always
building KNN edges. This will ensure the configured policy actually affects the
edge construction at runtime.
- Around line 931-959: Add input validation for the water_count parameter at the
beginning of the _setup_water_nodes_from_count method to reject negative or
invalid values before they propagate into the num_waters tensor and downstream
_sample_waters call. Raise a descriptive ValueError if water_count is less than
or equal to zero. Apply the same validation logic to the other methods mentioned
in lines 1024-1031 and 1145-1152 that also accept similar count parameters as
public API boundaries.
- Around line 84-87: Add a guard check before the anchor selection at the line
with `anchors = protein_pos.to(device)[graph_offsets + local_idx]` to ensure
that any graph being processed has at least one protein atom available.
Specifically, verify that all values in `graph_sizes` (which comes from
`num_p_per_graph[batch_w]`) are greater than zero before proceeding with the
local_idx computation and anchor selection, otherwise raise an informative error
that indicates a graph with waters but no protein atoms was encountered.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fe982420-7025-40de-ad55-6aa3248b0d89

📥 Commits

Reviewing files that changed from the base of the PR and between c3b9db6 and 6fff711.

📒 Files selected for processing (2)

src/flow.py
tests/test_flow.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_flow.py`:
- Around line 572-577: The real-structure sampling test is missing a fixed RNG
seed, so its random sampling is not reproducible. Update
test_real_structure_cutoff_and_batch to set a deterministic seed with
torch.manual_seed before any sampling occurs, keeping the cutoff check stable
and making any rare boundary failures repeatable. Use the test function name to
locate the change and ensure the seed is applied at the start of the test.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 59d5fa95-1c43-41e4-a6aa-55f364451518

📥 Commits

Reviewing files that changed from the base of the PR and between 6fff711 and 15dbee8.

📒 Files selected for processing (2)

src/flow.py
tests/test_flow.py

🚧 Files skipped from review as they are similar to previous changes (1)

src/flow.py

coderabbitai · 2026-06-24T23:25:22Z

+    @pytest.mark.slow
+    def test_real_structure_cutoff_and_batch(self, device, pdb_6eey):
+        """Cutoff guarantee holds on real protein geometry; batch indexing is correct
+        when two structures with different water counts are packed into one call."""
+        import biotite.structure as bts
+        from biotite.structure.io.pdb import get_structure, PDBFile


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Seed RNG in this real-structure test for reproducibility.

Unlike the other sampler tests, this one performs random sampling without torch.manual_seed. Combined with the tight cutoff + 1e-4 tolerance and float32 distances on real (large-magnitude) coordinates, any rare boundary failure would be non-reproducible and hard to diagnose. Add a fixed seed.

🌱 Proposed seeding

def test_real_structure_cutoff_and_batch(self, device, pdb_6eey): """Cutoff guarantee holds on real protein geometry; batch indexing is correct when two structures with different water counts are packed into one call.""" + torch.manual_seed(0) import biotite.structure as bts from biotite.structure.io.pdb import get_structure, PDBFile

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

@pytest.mark.slow

def test_real_structure_cutoff_and_batch(self, device, pdb_6eey):

"""Cutoff guarantee holds on real protein geometry; batch indexing is correct

when two structures with different water counts are packed into one call."""

import biotite.structure as bts

from biotite.structure.io.pdb import get_structure, PDBFile

`@pytest.mark.slow`

def test_real_structure_cutoff_and_batch(self, device, pdb_6eey):

"""Cutoff guarantee holds on real protein geometry; batch indexing is correct

when two structures with different water counts are packed into one call."""

torch.manual_seed(0)

import biotite.structure as bts

from biotite.structure.io.pdb import get_structure, PDBFile

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_flow.py` around lines 572 - 577, The real-structure sampling test is missing a fixed RNG seed, so its random sampling is not reproducible. Update test_real_structure_cutoff_and_batch to set a deterministic seed with torch.manual_seed before any sampling occurs, keeping the cutoff check stable and making any rare boundary failures repeatable. Use the test function name to locate the change and ensure the seed is applied at the start of the test.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/flow.py:977

Same as _setup_water_nodes_from_ratio(): avoid hard-coding water feature width to 16 here. Derive the feature width/dtype from the existing g['water'].x when present to keep node feature dimensionality consistent.

        # create water features (oxygen one-hot, index 2 for 'O' in ELEMENT_VOCAB)
        water_x = torch.zeros(total_waters, 16, device=device)
        water_x[:, 2] = 1.0  # oxygen is index 2 in ELEMENT_VOCAB

vratins added 2 commits June 17, 2026 06:03

water sampling refactor

981b4ad

more tests

6fff711

Copilot AI review requested due to automatic review settings June 17, 2026 06:22

vratins had a problem deploying to testing June 17, 2026 06:23 — with GitHub Actions Failure

vratins temporarily deployed to testing June 17, 2026 06:23 — with GitHub Actions Inactive

Copilot started reviewing on behalf of vratins June 17, 2026 06:23 View session

Copilot AI reviewed Jun 17, 2026

View reviewed changes

Comment thread src/flow.py

Comment thread src/flow.py

Comment thread src/flow.py Outdated

Comment thread src/flow.py

Comment thread src/flow.py

Comment thread src/flow.py

Comment thread src/flow.py

Comment thread tests/test_flow.py

Comment thread src/flow.py

Comment thread src/flow.py

coderabbitai Bot reviewed Jun 17, 2026

View reviewed changes

Comment thread src/flow.py

Comment thread src/flow.py

Comment thread src/flow.py Outdated

Comment thread src/flow.py

Comment thread src/flow.py Outdated

linting fixes

056283c

vratins temporarily deployed to testing June 17, 2026 07:12 — with GitHub Actions Inactive

Auto-commit ruff fixes [skip ci]

6317df5

Copilot AI review requested due to automatic review settings June 17, 2026 07:13

vratins removed the request for review from Copilot June 17, 2026 07:13

addressing comments

15dbee8

Copilot AI review requested due to automatic review settings June 24, 2026 23:22

vratins temporarily deployed to testing June 24, 2026 23:22 — with GitHub Actions Inactive

Copilot started reviewing on behalf of vratins June 24, 2026 23:22 View session

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

Copilot AI reviewed Jun 24, 2026

View reviewed changes

Comment thread src/flow.py

Comment thread src/flow.py

addressing comment for batch index offset

350bba9

vratins temporarily deployed to testing June 24, 2026 23:29 — with GitHub Actions Inactive

vratins deployed to testing June 24, 2026 23:29 — with GitHub Actions Active

Uh oh!

Conversation

vratins commented Jun 17, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vratins commented Jun 17, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 17, 2026 •

edited

Loading