Adding explicit sampling strategies#84
Conversation
|
Warning Review limit reached
More reviews will be available in 52 minutes and 56 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information. 📝 WalkthroughWalkthroughAdds two water-sampling helpers, threads sampling_strategy and dynamic_edge_policy through FlowMatcher, updates training and validation to use per-graph water counts and sampled initial positions, and adds water_count support to both inference integrators. Tests cover sampler behavior, policy resolution, and water-count validation. ChangesWater sampling strategies and dynamic edge policies
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR introduces explicit water-noise sampling strategies in FlowMatcher, adding a uniform-within-cutoff sampler around protein atoms and an explicit “scaled Gaussian” sampler (matching the prior implicit behavior), along with tests covering correctness and edge cases.
Changes:
- Added
sample_waters_uniform_ball()andsample_waters_scaled_gaussian()utilities and integrated them intoFlowMatchervia asampling_strategyoption. - Added
dynamic_edge_policyplumbing and a resolver method (currently not consumed by the model’s edge builder). - Expanded
tests/test_flow.pywith unit tests for both sampling strategies, including a real-structure cutoff guarantee test.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
src/flow.py |
Adds sampling strategy implementations + integrates sampling into training/validation/integration paths; introduces sampling_strategy/dynamic_edge_policy config. |
tests/test_flow.py |
Adds tests validating sampler shape/count behavior, empty-water handling, cutoff guarantees, and batching behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/flow.py`:
- Line 724: The return type annotation for the training_step method at line 724
declares dict[str, float | int | None] but the function actually returns a
sigma_per_graph tensor and an optional per_sample_info dict which do not match
this contract. Update the return type annotation to accurately reflect what
training_step actually returns, including the tensor type for sigma_per_graph
and the optional dict type for per_sample_info, or alternatively modify the
returned values to match the declared return type by serializing the sigma
tensor to a scalar or list suitable for metric logging.
- Line 1001: The docstring for `water_true` at line 1001 states it should be
`None` when `water_ratio` or `water_count` is used, but the result builders are
returning the original ground-truth waters instead of `None`. Fix this by
applying conditional logic in the result builders (around lines 1024-1031, 1122,
and 1145-1152) to check if `water_ratio` or `water_count` flags are set, and if
so, set `water_true` to `None` rather than returning the ground-truth water
values. Follow the same flag/result logic pattern already implemented in the
`rk4_integrate()` function to ensure consistency across all result building
locations.
- Around line 676-682: The _effective_dynamic_edge_policy method correctly
resolves the edge policy to either "radius" or "knn_if_isolated" and
training/validation attach this to the batch, but the
ProteinWaterUpdate.build_edges() method unconditionally builds KNN edges and
ignores this attribute entirely. Modify the build_edges() method to read the
dynamic_edge_policy attribute from the batch input and conditionally construct
either KNN edges or radius-based edges based on its value instead of always
building KNN edges. This will ensure the configured policy actually affects the
edge construction at runtime.
- Around line 931-959: Add input validation for the water_count parameter at the
beginning of the _setup_water_nodes_from_count method to reject negative or
invalid values before they propagate into the num_waters tensor and downstream
_sample_waters call. Raise a descriptive ValueError if water_count is less than
or equal to zero. Apply the same validation logic to the other methods mentioned
in lines 1024-1031 and 1145-1152 that also accept similar count parameters as
public API boundaries.
- Around line 84-87: Add a guard check before the anchor selection at the line
with `anchors = protein_pos.to(device)[graph_offsets + local_idx]` to ensure
that any graph being processed has at least one protein atom available.
Specifically, verify that all values in `graph_sizes` (which comes from
`num_p_per_graph[batch_w]`) are greater than zero before proceeding with the
local_idx computation and anchor selection, otherwise raise an informative error
that indicates a graph with waters but no protein atoms was encountered.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fe982420-7025-40de-ad55-6aa3248b0d89
📒 Files selected for processing (2)
src/flow.pytests/test_flow.py
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/test_flow.py`:
- Around line 572-577: The real-structure sampling test is missing a fixed RNG
seed, so its random sampling is not reproducible. Update
test_real_structure_cutoff_and_batch to set a deterministic seed with
torch.manual_seed before any sampling occurs, keeping the cutoff check stable
and making any rare boundary failures repeatable. Use the test function name to
locate the change and ensure the seed is applied at the start of the test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 59d5fa95-1c43-41e4-a6aa-55f364451518
📒 Files selected for processing (2)
src/flow.pytests/test_flow.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/flow.py
| @pytest.mark.slow | ||
| def test_real_structure_cutoff_and_batch(self, device, pdb_6eey): | ||
| """Cutoff guarantee holds on real protein geometry; batch indexing is correct | ||
| when two structures with different water counts are packed into one call.""" | ||
| import biotite.structure as bts | ||
| from biotite.structure.io.pdb import get_structure, PDBFile |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Seed RNG in this real-structure test for reproducibility.
Unlike the other sampler tests, this one performs random sampling without torch.manual_seed. Combined with the tight cutoff + 1e-4 tolerance and float32 distances on real (large-magnitude) coordinates, any rare boundary failure would be non-reproducible and hard to diagnose. Add a fixed seed.
🌱 Proposed seeding
def test_real_structure_cutoff_and_batch(self, device, pdb_6eey):
"""Cutoff guarantee holds on real protein geometry; batch indexing is correct
when two structures with different water counts are packed into one call."""
+ torch.manual_seed(0)
import biotite.structure as bts
from biotite.structure.io.pdb import get_structure, PDBFile📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| @pytest.mark.slow | |
| def test_real_structure_cutoff_and_batch(self, device, pdb_6eey): | |
| """Cutoff guarantee holds on real protein geometry; batch indexing is correct | |
| when two structures with different water counts are packed into one call.""" | |
| import biotite.structure as bts | |
| from biotite.structure.io.pdb import get_structure, PDBFile | |
| `@pytest.mark.slow` | |
| def test_real_structure_cutoff_and_batch(self, device, pdb_6eey): | |
| """Cutoff guarantee holds on real protein geometry; batch indexing is correct | |
| when two structures with different water counts are packed into one call.""" | |
| torch.manual_seed(0) | |
| import biotite.structure as bts | |
| from biotite.structure.io.pdb import get_structure, PDBFile |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_flow.py` around lines 572 - 577, The real-structure sampling test
is missing a fixed RNG seed, so its random sampling is not reproducible. Update
test_real_structure_cutoff_and_batch to set a deterministic seed with
torch.manual_seed before any sampling occurs, keeping the cutoff check stable
and making any rare boundary failures repeatable. Use the test function name to
locate the change and ensure the seed is applied at the start of the test.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/flow.py:977
- Same as
_setup_water_nodes_from_ratio(): avoid hard-coding water feature width to 16 here. Derive the feature width/dtype from the existingg['water'].xwhen present to keep node feature dimensionality consistent.
# create water features (oxygen one-hot, index 2 for 'O' in ELEMENT_VOCAB)
water_x = torch.zeros(total_waters, 16, device=device)
water_x[:, 2] = 1.0 # oxygen is index 2 in ELEMENT_VOCAB
sample_waters_uniform_ball— places each water uniformly inside a ball of radius cutoff around a randomly chosen protein anchor atom, with volume correction for sampling uniformly within a ball of radiusr.sample_waters_scaled_gaussian—N(0, (sigma^2)*I)wheresigmais the per-graph std of protein coordinates. Equivalent to old behavior but explicit.Added the above to the
FlowMatcherclassTests cover shape/count correctness, the empty-graph edge case, and a real-structure test against 6eey that verifies the cutoff guarantee holds on actual protein geometry with two graphs of different water counts batched together.
Summary by CodeRabbit
Release Notes
New Features
water_count(in addition towater_ratio) when sampling initial water nodes.Tests