feat: RSCC eval updates#224
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughThis PR adds shared altloc and density utilities, introduces new evaluation scripts, refactors RSCC processing around grouped parallel execution, and updates related tests, documentation, and runtime dependencies. ChangesAltloc Evaluation and RSCC Processing
Sequence Diagram(s)sequenceDiagram
participant main
participant Parallel
participant process_group
participant build_density_transformer
participant run_density_transformer
participant rscc
main->>Parallel: submit (protein, occupancy_key) groups
Parallel->>process_group: process_group(trials, protein, config, coords, map, group_index)
process_group->>build_density_transformer: build transformer from base map
process_group->>run_density_transformer: compute full density for a trial
process_group->>rscc: compute per-selection RSCC
process_group-->>Parallel: return rows
Parallel-->>main: flatten results
sequenceDiagram
participant main
participant process_structure
participant build_pairwise_altloc_arrays
participant _find_max_rmsd_window
participant output
main->>process_structure: load row and resolve CIF
process_structure->>build_pairwise_altloc_arrays: build altloc pair arrays
loop each selection
process_structure->>_find_max_rmsd_window: search windows when selection is long enough
_find_max_rmsd_window-->>process_structure: best window, max RMSD, altloc pair
end
process_structure->>output: emit grouped rows
Estimated code review effort🎯 5 (Critical) | ⏱️ ~90 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
scripts/eval/find_max_rmsd_subsegment.py (1)
38-39: Expand these function docstrings to NumPy style and document side effects.
_process_structure()loads CIF data, andmain()reads/writes CSV files; those side effects should be explicit in the function docs.As per coding guidelines, “Always include NumPy-style docstrings for every function and class” and “ALWAYS annotate complex array shapes and note side effects.”
Also applies to: 99-104, 185-185
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/find_max_rmsd_subsegment.py` around lines 38 - 39, Expand the short docstring for _has_compositional_heterogeneity into a full NumPy-style docstring: document Parameters (arr_i: np.ndarray, arr_j: np.ndarray — expected shapes and dtype/contents, mask: np.ndarray[bool] — shape must match arr_i/arr_j), the Returns (bool) and add a Notes/Side effects section stating there are no I/O side effects and the function only inspects arrays; do the same for _process_structure (document parameters, return type, and explicitly state the side effect that it loads CIF data from disk/parses CIF content and any mutation it performs on in-memory structure) and for main (document CLI args, that it reads and writes CSV files, and that it has disk I/O side effects and exit behavior); also update the other function docstrings called out in the review (the functions referenced at the other noted ranges) to NumPy-style including array shapes and explicit side-effect notes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/eval/find_max_rmsd_subsegment.py`:
- Around line 265-267: The CLI currently accepts non-positive --window-size
values which break the sliding-window logic; add a validation after parsing
(before calling main(args)) that checks args.window_size is an integer > 0 and
rejects otherwise (raise SystemExit or parser.error with a clear message).
Reference the parser variable and the main(args) call in
find_max_rmsd_subsegment.py so the guard runs immediately after
parser.parse_args() and prevents running the sliding-window routine with 0 or
negative sizes.
- Around line 81-88: The current check only ensures arr_i has some atoms for the
window, allowing partial residue coverage to be scored; change the logic to
require complete residue coverage for both altloc arrays before computing RMSD:
compute masks for both arr_i and arr_j (e.g., mask_i = (arr_i.chain_id == chain)
& np.isin(arr_i.res_id, window_res) and mask_j = (arr_j.chain_id == chain) &
np.isin(arr_j.res_id, window_res)), then verify that the set/unique res_id
values present in arr_i[mask_i] and arr_j[mask_j] exactly match window_res (or
that the count of unique res_ids equals len(window_res)); if either side is
incomplete, continue and do not call _has_compositional_heterogeneity or
biotite_rmsd on that window.
In `@scripts/eval/rscc_grid_search_script.py`:
- Line 247: The current exception handler "except (FileNotFoundError, OSError,
ValueError, RuntimeError) as e" can let AttributeError or TypeError raised
during parsing/filtering bubble up and abort the whole RSCC run; update that
except tuple to also include AttributeError and TypeError so trial-level
parsing/attribute/type failures are caught and treated as per-selection errors
(emitting the error row and continuing), i.e., change the handler around "except
(FileNotFoundError, OSError, ValueError, RuntimeError) as e" to "except
(FileNotFoundError, OSError, ValueError, RuntimeError, AttributeError,
TypeError) as e" and ensure the existing per-trial error logging/emission logic
is used for these cases.
In `@src/sampleworks/eval/grid_search_eval_utils.py`:
- Around line 27-39: The code builds Path objects from row["structure"] and
row["structure_pattern"] without guarding against pandas NaN (np.nan) which
truth-tests truthy and can cause TypeError; before constructing Path, normalize
and validate these cells: use pandas.isna(row["structure"]) (or isinstance
check) to treat NaN as missing, coerce non-empty values with
str(row["structure"]).strip() and only call Path(...) when that normalized
string is non-empty, and apply the same normalization/validation for
row["structure_pattern"] (variables p, pattern, and usage of cif_root should
remain the same); if both normalized values are missing raise the intended
ValueError with row.to_dict().
In `@src/sampleworks/utils/density_utils.py`:
- Around line 115-120: The code sets use_cuda_kernels based only on host CUDA
availability which can enable CUDA paths even when callers pass
device=torch.device("cpu"); change the call that constructs
DifferentiableTransformer so use_cuda_kernels is true only when the requested
device is CUDA and CUDA is available (e.g., use_cuda_kernels = (device.type ==
"cuda" and torch.cuda.is_available())); update the instantiation at
DifferentiableTransformer(...) and ensure any downstream code (e.g.,
dilate_atom_centric which calls torch.cuda.synchronize) will only run CUDA paths
when that flag reflects the requested device.
---
Nitpick comments:
In `@scripts/eval/find_max_rmsd_subsegment.py`:
- Around line 38-39: Expand the short docstring for
_has_compositional_heterogeneity into a full NumPy-style docstring: document
Parameters (arr_i: np.ndarray, arr_j: np.ndarray — expected shapes and
dtype/contents, mask: np.ndarray[bool] — shape must match arr_i/arr_j), the
Returns (bool) and add a Notes/Side effects section stating there are no I/O
side effects and the function only inspects arrays; do the same for
_process_structure (document parameters, return type, and explicitly state the
side effect that it loads CIF data from disk/parses CIF content and any mutation
it performs on in-memory structure) and for main (document CLI args, that it
reads and writes CSV files, and that it has disk I/O side effects and exit
behavior); also update the other function docstrings called out in the review
(the functions referenced at the other noted ranges) to NumPy-style including
array shapes and explicit side-effect notes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: eb4437dd-4cc2-47be-b5e4-b7b713559434
📒 Files selected for processing (14)
.gitignoreCLAUDE.mdscripts/eval/classify_altloc_regions.pyscripts/eval/find_max_rmsd_subsegment.pyscripts/eval/rscc_grid_search_script.pysrc/sampleworks/eval/grid_search_eval_utils.pysrc/sampleworks/utils/atom_array_utils.pysrc/sampleworks/utils/density_utils.pytests/eval/conftest.pytests/eval/test_rscc_grid_search_script.pytests/resources/1vme/1VME_0.25occA_0.75occB_1.00A.ccp4tests/resources/1vme/1VME_0.5occA_0.5occB_1.00A.ccp4tests/resources/1vme/1VME_single_001_density_input.ciftests/utils/test_density_utils.py
| mask = (arr_i.chain_id == chain) & np.isin(arr_i.res_id, window_res) | ||
| if mask.sum() == 0: | ||
| continue | ||
|
|
||
| if _has_compositional_heterogeneity(arr_i, arr_j, mask): | ||
| continue | ||
|
|
||
| rmsd_val = float(biotite_rmsd(arr_i[mask], arr_j[mask])) |
There was a problem hiding this comment.
Require complete residue coverage before scoring a window.
Line 82 only checks for any atoms, so a window with missing residues in either altloc can still be scored and later emitted as the full best_res[0]-best_res[-1] range. Skip windows unless both masked arrays contain exactly window_res.
🐛 Proposed fix
mask = (arr_i.chain_id == chain) & np.isin(arr_i.res_id, window_res)
if mask.sum() == 0:
continue
+ res_ids_i, _ = get_residues(arr_i[mask])
+ res_ids_j, _ = get_residues(arr_j[mask])
+ if list(map(int, res_ids_i)) != window_res or list(map(int, res_ids_j)) != window_res:
+ continue
+
if _has_compositional_heterogeneity(arr_i, arr_j, mask):
continue
rmsd_val = float(biotite_rmsd(arr_i[mask], arr_j[mask]))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/eval/find_max_rmsd_subsegment.py` around lines 81 - 88, The current
check only ensures arr_i has some atoms for the window, allowing partial residue
coverage to be scored; change the logic to require complete residue coverage for
both altloc arrays before computing RMSD: compute masks for both arr_i and arr_j
(e.g., mask_i = (arr_i.chain_id == chain) & np.isin(arr_i.res_id, window_res)
and mask_j = (arr_j.chain_id == chain) & np.isin(arr_j.res_id, window_res)),
then verify that the set/unique res_id values present in arr_i[mask_i] and
arr_j[mask_j] exactly match window_res (or that the count of unique res_ids
equals len(window_res)); if either side is incomplete, continue and do not call
_has_compositional_heterogeneity or biotite_rmsd on that window.
ab2de06 to
d2e5ca0
Compare
d2e5ca0 to
1b781ea
Compare
4cf1ffc to
a657989
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
scripts/eval/rscc_grid_search_script.py (1)
124-124:⚠️ Potential issue | 🟠 MajorException tuples still let
AttributeError/TypeErrorescape.Both the setup handler (Line 124) and the per-trial handler (Line 209) catch
(FileNotFoundError, OSError, ValueError, RuntimeError), but:
- Line 147 explicitly raises
AttributeError("AtomArray | AtomArrayStack is missing coordinates").atom_array.set_annotation(...)(Line 153) and biotite parsing/attribute access can plausibly raiseAttributeError/TypeErroron malformed inputs.Under the per-trial handler, these uncaught errors currently abort the whole
process_groupcall, which — becausejoblib.Parallelre-raises worker exceptions — will terminate the entire RSCC run instead of emitting per-selectionrscc=nanrows. Apply the same widening to both handlers.🛡️ Proposed fix
- except (FileNotFoundError, OSError, ValueError, RuntimeError) as e: + except ( + FileNotFoundError, + OSError, + ValueError, + RuntimeError, + AttributeError, + TypeError, + ) as e: logger.error(f"ERROR setting up group {protein}/{trials[0].altloc_occupancies}: {e}")- except (FileNotFoundError, OSError, ValueError, RuntimeError) as e: + except ( + FileNotFoundError, + OSError, + ValueError, + RuntimeError, + AttributeError, + TypeError, + ) as e: logger.error(f"ERROR processing trial {trial.trial_dir}: {e}")Also applies to: 209-209
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/rscc_grid_search_script.py` at line 124, The except clauses that currently catch (FileNotFoundError, OSError, ValueError, RuntimeError) should also include AttributeError and TypeError so malformed inputs don't crash the whole run; update the two handlers (the setup handler around the code that may raise AttributeError like the explicit raise and atom_array.set_annotation, and the per-trial handler inside process_group/worker) to catch (FileNotFoundError, OSError, ValueError, RuntimeError, AttributeError, TypeError) as e and preserve the existing fallback behavior (e.g., emit rscc=nan or continue) so errors become per-selection failures rather than terminating the job.
🧹 Nitpick comments (3)
tests/eval/test_rscc_grid_search_script.py (1)
118-118: Unused default argument inflaky_rscc.
_sel=bad_selectionis never referenced inside the function body; the failure is decided purely by thetarget_is_nextflag. Drop it to avoid confusion.♻️ Proposed refactor
- def flaky_rscc(a, b, _sel=bad_selection): + def flaky_rscc(a, b): if flaky_rscc.target_is_next: # type: ignore[attr-defined] flaky_rscc.target_is_next = False # type: ignore[attr-defined] raise RuntimeError("simulated rscc failure") return real_rscc(a, b)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/eval/test_rscc_grid_search_script.py` at line 118, The function flaky_rscc currently declares an unused default argument `_sel=bad_selection`; remove this unused parameter from the `flaky_rscc` signature (drop `_sel` and its `bad_selection` default) and update any call sites or test invocations to no longer pass or expect that argument; ensure only `a`, `b`, and the existing `target_is_next` logic remain in `flaky_rscc`.tests/eval/conftest.py (2)
87-90: Assertion-based validation disappears underpython -O.These
assertcalls are the only validation for fixture inputs; they’re silently skipped when Python is run with-O. Pytest doesn’t use-Oby default, but if anyone invokes it that way, invalid arguments will produce confusing downstream filesystem errors. Prefer explicitraise ValueError(...)/FileNotFoundError(...)for input validation and reserveassertfor invariants.♻️ Proposed refactor
- assert _REAL_CIF.exists(), _REAL_CIF - assert 1 <= n_groups <= len(_GROUP_CCP4) - assert trials_per_group >= 1 - assert len(selections) >= 1 + if not _REAL_CIF.exists(): + raise FileNotFoundError(_REAL_CIF) + if not 1 <= n_groups <= len(_GROUP_CCP4): + raise ValueError(f"n_groups must be in [1, {len(_GROUP_CCP4)}], got {n_groups}") + if trials_per_group < 1: + raise ValueError(f"trials_per_group must be >= 1, got {trials_per_group}") + if len(selections) < 1: + raise ValueError("selections must not be empty")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/eval/conftest.py` around lines 87 - 90, Replace the assertion-based input checks that disappear under python -O with explicit exceptions: check existence of _REAL_CIF and raise FileNotFoundError with a helpful message if missing; validate n_groups against len(_GROUP_CCP4) and raise ValueError when out of range; ensure trials_per_group >= 1 and selections length >= 1 and raise ValueError with clear messages for each invalid condition; update the checks located near the fixture that references variables _REAL_CIF, n_groups, _GROUP_CCP4, trials_per_group, and selections accordingly.
71-78: Symlink-only approach is Linux-centric.
dst.symlink_to(src)requires privileges on Windows and is also silently broken inside some sandboxed tempdirs. Given tests undertests/eval/depend exclusively on symlinks, consider either:
- Falling back to
shutil.copy2ifsymlink_toraisesOSError/NotImplementedError, or- Skipping this module on non-POSIX via a module-level
pytest.importorskip/skipifguard.Not blocking for this PR if Linux-only CI is acceptable, but worth documenting.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/eval/conftest.py` around lines 71 - 78, The _link function currently uses dst.symlink_to(src) which fails on Windows or in sandboxed tempdirs; update _link to catch OSError and NotImplementedError around dst.symlink_to(src) and fall back to copying with shutil.copy2(src, dst) (ensure dst.parent exists and remove any existing dst before the fallback), and alternatively consider adding a module-level guard using pytest.importorskip("posix") or pytest.mark.skipif(not os.name == "posix", reason="symlink-only tests") if you prefer skipping on non-POSIX environments; refer to the _link function name and dst.symlink_to and use shutil.copy2 and pytest.importorskip/skipif in your change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/sampleworks/utils/atom_array_utils.py`:
- Around line 759-776: Update the function-level docstring to a NumPy-style
docstring: add a Parameters section documenting atom_array (type:
AtomArray-like, expected shape/contents) and altloc_ids (iterable of str), a
Returns section describing that the function returns a dict mapping unordered
altloc id pairs (tuple of str) to tuples of AtomArray (array_i, array_j) and
explicitly note that pairs which raise RuntimeError inside
filter_to_common_atoms are omitted from the returned dict (and that a warning is
emitted), and a Raises section documenting any raised exceptions (e.g.,
propagate unexpected exceptions but note that RuntimeError from
filter_to_common_atoms is handled by omission). Reference
select_altloc(return_full_array=True) and filter_to_common_atoms in the
description so callers know how arrays are built and filtered.
---
Duplicate comments:
In `@scripts/eval/rscc_grid_search_script.py`:
- Line 124: The except clauses that currently catch (FileNotFoundError, OSError,
ValueError, RuntimeError) should also include AttributeError and TypeError so
malformed inputs don't crash the whole run; update the two handlers (the setup
handler around the code that may raise AttributeError like the explicit raise
and atom_array.set_annotation, and the per-trial handler inside
process_group/worker) to catch (FileNotFoundError, OSError, ValueError,
RuntimeError, AttributeError, TypeError) as e and preserve the existing fallback
behavior (e.g., emit rscc=nan or continue) so errors become per-selection
failures rather than terminating the job.
---
Nitpick comments:
In `@tests/eval/conftest.py`:
- Around line 87-90: Replace the assertion-based input checks that disappear
under python -O with explicit exceptions: check existence of _REAL_CIF and raise
FileNotFoundError with a helpful message if missing; validate n_groups against
len(_GROUP_CCP4) and raise ValueError when out of range; ensure trials_per_group
>= 1 and selections length >= 1 and raise ValueError with clear messages for
each invalid condition; update the checks located near the fixture that
references variables _REAL_CIF, n_groups, _GROUP_CCP4, trials_per_group, and
selections accordingly.
- Around line 71-78: The _link function currently uses dst.symlink_to(src) which
fails on Windows or in sandboxed tempdirs; update _link to catch OSError and
NotImplementedError around dst.symlink_to(src) and fall back to copying with
shutil.copy2(src, dst) (ensure dst.parent exists and remove any existing dst
before the fallback), and alternatively consider adding a module-level guard
using pytest.importorskip("posix") or pytest.mark.skipif(not os.name == "posix",
reason="symlink-only tests") if you prefer skipping on non-POSIX environments;
refer to the _link function name and dst.symlink_to and use shutil.copy2 and
pytest.importorskip/skipif in your change.
In `@tests/eval/test_rscc_grid_search_script.py`:
- Line 118: The function flaky_rscc currently declares an unused default
argument `_sel=bad_selection`; remove this unused parameter from the
`flaky_rscc` signature (drop `_sel` and its `bad_selection` default) and update
any call sites or test invocations to no longer pass or expect that argument;
ensure only `a`, `b`, and the existing `target_is_next` logic remain in
`flaky_rscc`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 32784a89-eb17-409e-b26c-e9fc7049a8eb
⛔ Files ignored due to path filters (1)
pixi.lockis excluded by!**/*.lock
📒 Files selected for processing (13)
CLAUDE.mdpyproject.tomlscripts/eval/classify_altloc_regions.pyscripts/eval/find_max_rmsd_subsegment.pyscripts/eval/rscc_grid_search_script.pysrc/sampleworks/utils/atom_array_utils.pysrc/sampleworks/utils/density_utils.pytests/eval/conftest.pytests/eval/test_rscc_grid_search_script.pytests/resources/1vme/1VME_0.25occA_0.75occB_1.00A.ccp4tests/resources/1vme/1VME_0.5occA_0.5occB_1.00A.ccp4tests/resources/1vme/1VME_single_001_density_input.ciftests/utils/test_density_utils.py
✅ Files skipped from review due to trivial changes (2)
- pyproject.toml
- CLAUDE.md
🚧 Files skipped from review as they are similar to previous changes (2)
- src/sampleworks/utils/density_utils.py
- scripts/eval/find_max_rmsd_subsegment.py
a657989 to
70a0cbc
Compare
marcuscollins
left a comment
There was a problem hiding this comment.
Some new comments, plus old ones... please address and I'll take another look through.
70a0cbc to
2827b37
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
scripts/eval/find_max_rmsd_subsegment.py (1)
77-82: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winRequire full residue coverage before scoring a window.
Line 78 only rejects empty masks, so partial residue windows can still be scored and later emitted as full
start-endranges. This can mislabel the selected subsegment.🐛 Proposed fix
- mask = (arr_i.chain_id == chain) & np.isin(arr_i.res_id, window_res) - if mask.sum() == 0: + mask_i = (arr_i.chain_id == chain) & np.isin(arr_i.res_id, window_res) + mask_j = (arr_j.chain_id == chain) & np.isin(arr_j.res_id, window_res) + if mask_i.sum() == 0 or mask_j.sum() == 0: + continue + + expected = set(window_res) + if set(map(int, np.unique(arr_i.res_id[mask_i]))) != expected: + continue + if set(map(int, np.unique(arr_j.res_id[mask_j]))) != expected: + continue + if mask_i.sum() != mask_j.sum(): continue - rmsd_val = float(biotite_rmsd(arr_i[mask], arr_j[mask])) + rmsd_val = float(biotite_rmsd(arr_i[mask_i], arr_j[mask_j]))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/eval/find_max_rmsd_subsegment.py` around lines 77 - 82, The mask validation at line 78 only rejects completely empty masks but allows partial residue coverage, meaning windows with missing residues can still be scored and emitted with incorrect start-end ranges. Modify the condition that checks if mask.sum() == 0 to additionally verify that the mask contains all expected residues by ensuring mask.sum() equals the length of window_res, rejecting any windows with incomplete residue coverage before the biotite_rmsd calculation.
🧹 Nitpick comments (6)
AGENTS.md (1)
23-31: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winClarify the Atomworks loader contract.
This section mixes the parsed-structure dict path with the raw-array loader path, but it doesn’t say which callers should use which representation. That can lead agents to pass the wrong object into
structure: dictAPIs. Please verify theload_any()name/return type against Atomworks docs before locking this guidance in.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@AGENTS.md` around lines 23 - 31, Clarify the distinction between the two structure loading paths in the Atomworks documentation section. Specifically, verify the actual function name and return type of the raw-array loader against the current Atomworks documentation (the section currently references `load_any()` but this should be confirmed), then explicitly state which loader function (atomworks.parse() for dict or the raw-array loader) should be used for which use cases and clarify that only the atomworks.parse() dictionary output should be passed to structure: dict parameters.tests/eval/conftest.py (2)
41-47: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueAdd a NumPy-style docstring to
_populate(and the nested_factory).Per the repo guideline, every function should carry a NumPy-style docstring.
_populatebuilds a non-trivial fixture tree and would benefit most;_factoryat Line 168 is also undocumented.As per coding guidelines: "Always include NumPy-style docstrings for every function and class".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/eval/conftest.py` around lines 41 - 47, The functions `_populate` and `_factory` are missing NumPy-style docstrings required by the coding guidelines. Add a NumPy-style docstring to the `_populate` function immediately after its signature that documents the function's purpose, parameters (root, n_groups, trials_per_group, selections, base_map), and return value (argparse.Namespace). Similarly, add a NumPy-style docstring to the nested `_factory` function around line 168 that documents its purpose, parameters, and return type. Follow the standard NumPy docstring format with sections for Summary, Parameters, and Returns.Source: Coding guidelines
118-138: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueGrid sizing keys off
coords.max(...)only. The unit-cell/grid is sized from the max coordinate plus padding with the origin implicitly at 0, so atoms with negative coordinates would fall outside the box. This is harmless here since the base and computed maps share the same grid and the refined CIF is identical to the reference (self-correlation stays ~1.0), but it's worth a brief inline note so a future change doesn't silently clip density.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/eval/conftest.py` around lines 118 - 138, The grid sizing calculation in the line containing np.ceil((coords.max(axis=0) + 5.0) / voxel) only considers the maximum coordinates and implicitly places the origin at 0, which means atoms with negative coordinates would fall outside the grid box. Add an inline comment at this grid_shape assignment explaining that the grid origin is implicitly at (0, 0, 0) and that the sizing only accounts for maximum coordinates, making it clear that negative coordinates are not accounted for. This will document the assumption for future maintainers.tests/eval/test_find_altloc_selections_script.py (1)
20-68: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winStandardize helper docs to NumPy-style and add one for
_make_args.Line 54 introduces
_make_argswithout a docstring, and the helper/fixture docstrings in this block are not in NumPy style.As per coding guidelines,
**/*.py: "Always include NumPy-style docstrings for every function and class".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/eval/test_find_altloc_selections_script.py` around lines 20 - 68, The functions find_altloc_script, _altloc_input_csv, and _make_args need docstring updates to comply with NumPy-style documentation standards. Update the existing docstrings for find_altloc_script and _altloc_input_csv to follow NumPy-style format (including proper sections like Parameters, Returns, Raises where applicable), and add a new NumPy-style docstring to _make_args which currently has no documentation. Ensure all docstrings include appropriate sections that describe the function's purpose, parameters, return values, and any exceptions raised.Source: Coding guidelines
scripts/eval/find_max_rmsd_subsegment.py (1)
92-97: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd NumPy-style docstrings for
_process_structureandmain.
_process_structurehas a short non-NumPy docstring, andmainhas none.As per coding guidelines,
**/*.py: "Always include NumPy-style docstrings for every function and class".Also applies to: 166-167
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/eval/find_max_rmsd_subsegment.py` around lines 92 - 97, The functions `_process_structure` and `main` are missing proper NumPy-style docstrings. For `_process_structure`, replace the current short docstring with a complete NumPy-style docstring that documents the parameters (row, cif_root, window_size) and return type (list of dictionaries). For `main`, add a complete NumPy-style docstring that documents its purpose, parameters, and return type. Follow the NumPy docstring format with Parameters and Returns sections as specified in the coding guidelines.Source: Coding guidelines
src/sampleworks/eval/structure_utils.py (1)
342-351: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse a NumPy-style docstring for
selection_to_residues.The current docstring is clear, but it does not follow the required NumPy docstring format used in this repo.
As per coding guidelines,
**/*.py: "Always include NumPy-style docstrings for every function and class".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/sampleworks/eval/structure_utils.py` around lines 342 - 351, The docstring for the selection_to_residues function needs to be converted to NumPy-style format to match the repository's coding guidelines. Keep the existing summary and description content, then restructure it to include standard NumPy sections: add a Parameters section documenting the atom_array and selection arguments with their types and descriptions, add a Returns section describing the set[tuple[str, int]] return value, and move the TODO comment to a Notes section. Ensure proper indentation and formatting consistent with NumPy docstring conventions used elsewhere in the codebase.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/eval/find_max_rmsd_subsegment.py`:
- Around line 135-140: The CSV validation currently only requires `protein` and
`selection` fields, but the script depends on `structure_pattern`,
`map_pattern`, `base_map_dir`, and `resolution` which are accessed with fallback
empty strings in the dict construction, masking bad input. Move the field
validation to the CSV validation point (around line 168) to verify all required
fields are present, raising a clear error message if any are missing, then
remove the fallback empty strings in the dict construction (lines 137-140) by
accessing the row fields directly without .get() defaults, since they will be
guaranteed to exist after validation.
---
Duplicate comments:
In `@scripts/eval/find_max_rmsd_subsegment.py`:
- Around line 77-82: The mask validation at line 78 only rejects completely
empty masks but allows partial residue coverage, meaning windows with missing
residues can still be scored and emitted with incorrect start-end ranges. Modify
the condition that checks if mask.sum() == 0 to additionally verify that the
mask contains all expected residues by ensuring mask.sum() equals the length of
window_res, rejecting any windows with incomplete residue coverage before the
biotite_rmsd calculation.
---
Nitpick comments:
In `@AGENTS.md`:
- Around line 23-31: Clarify the distinction between the two structure loading
paths in the Atomworks documentation section. Specifically, verify the actual
function name and return type of the raw-array loader against the current
Atomworks documentation (the section currently references `load_any()` but this
should be confirmed), then explicitly state which loader function
(atomworks.parse() for dict or the raw-array loader) should be used for which
use cases and clarify that only the atomworks.parse() dictionary output should
be passed to structure: dict parameters.
In `@scripts/eval/find_max_rmsd_subsegment.py`:
- Around line 92-97: The functions `_process_structure` and `main` are missing
proper NumPy-style docstrings. For `_process_structure`, replace the current
short docstring with a complete NumPy-style docstring that documents the
parameters (row, cif_root, window_size) and return type (list of dictionaries).
For `main`, add a complete NumPy-style docstring that documents its purpose,
parameters, and return type. Follow the NumPy docstring format with Parameters
and Returns sections as specified in the coding guidelines.
In `@src/sampleworks/eval/structure_utils.py`:
- Around line 342-351: The docstring for the selection_to_residues function
needs to be converted to NumPy-style format to match the repository's coding
guidelines. Keep the existing summary and description content, then restructure
it to include standard NumPy sections: add a Parameters section documenting the
atom_array and selection arguments with their types and descriptions, add a
Returns section describing the set[tuple[str, int]] return value, and move the
TODO comment to a Notes section. Ensure proper indentation and formatting
consistent with NumPy docstring conventions used elsewhere in the codebase.
In `@tests/eval/conftest.py`:
- Around line 41-47: The functions `_populate` and `_factory` are missing
NumPy-style docstrings required by the coding guidelines. Add a NumPy-style
docstring to the `_populate` function immediately after its signature that
documents the function's purpose, parameters (root, n_groups, trials_per_group,
selections, base_map), and return value (argparse.Namespace). Similarly, add a
NumPy-style docstring to the nested `_factory` function around line 168 that
documents its purpose, parameters, and return type. Follow the standard NumPy
docstring format with sections for Summary, Parameters, and Returns.
- Around line 118-138: The grid sizing calculation in the line containing
np.ceil((coords.max(axis=0) + 5.0) / voxel) only considers the maximum
coordinates and implicitly places the origin at 0, which means atoms with
negative coordinates would fall outside the grid box. Add an inline comment at
this grid_shape assignment explaining that the grid origin is implicitly at (0,
0, 0) and that the sizing only accounts for maximum coordinates, making it clear
that negative coordinates are not accounted for. This will document the
assumption for future maintainers.
In `@tests/eval/test_find_altloc_selections_script.py`:
- Around line 20-68: The functions find_altloc_script, _altloc_input_csv, and
_make_args need docstring updates to comply with NumPy-style documentation
standards. Update the existing docstrings for find_altloc_script and
_altloc_input_csv to follow NumPy-style format (including proper sections like
Parameters, Returns, Raises where applicable), and add a new NumPy-style
docstring to _make_args which currently has no documentation. Ensure all
docstrings include appropriate sections that describe the function's purpose,
parameters, return values, and any exceptions raised.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 04fcd479-6b92-48a6-adbf-6abaef293a2c
⛔ Files ignored due to path filters (1)
pixi.lockis excluded by!**/*.lock
📒 Files selected for processing (14)
AGENTS.mdCLAUDE.mdpyproject.tomlscripts/eval/classify_altloc_regions.pyscripts/eval/find_max_rmsd_subsegment.pyscripts/eval/rscc_grid_search_script.pysrc/sampleworks/eval/structure_utils.pysrc/sampleworks/utils/atom_array_utils.pysrc/sampleworks/utils/density_utils.pytests/eval/conftest.pytests/eval/test_find_altloc_selections_script.pytests/eval/test_rscc_grid_search_script.pytests/resources/1vme/1VME_single_001_density_input.ciftests/utils/test_density_utils.py
✅ Files skipped from review due to trivial changes (2)
- CLAUDE.md
- pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (5)
- src/sampleworks/utils/atom_array_utils.py
- tests/utils/test_density_utils.py
- src/sampleworks/utils/density_utils.py
- scripts/eval/rscc_grid_search_script.py
- scripts/eval/classify_altloc_regions.py
| out: dict = { | ||
| "protein": protein, | ||
| "structure_pattern": row.get("structure_pattern", ""), | ||
| "map_pattern": row.get("map_pattern", ""), | ||
| "base_map_dir": row.get("base_map_dir", ""), | ||
| "resolution": row.get("resolution", ""), |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Validate the full input CSV contract up front.
Line 168 currently requires only protein and selection, but this script and downstream RSCC flow depend on structure_pattern, map_pattern, base_map_dir, and resolution. Falling back to empty strings at Line 137-140 masks bad input and pushes failure downstream.
🛡️ Proposed fix
- required = {"protein", "selection"}
+ required = {
+ "protein",
+ "selection",
+ "structure_pattern",
+ "map_pattern",
+ "base_map_dir",
+ "resolution",
+ }
missing = required - set(input_df.columns)
if missing:
raise ValueError(f"Input CSV missing required columns: {missing}")
@@
out: dict = {
"protein": protein,
- "structure_pattern": row.get("structure_pattern", ""),
- "map_pattern": row.get("map_pattern", ""),
- "base_map_dir": row.get("base_map_dir", ""),
- "resolution": row.get("resolution", ""),
+ "structure_pattern": row["structure_pattern"],
+ "map_pattern": row["map_pattern"],
+ "base_map_dir": row["base_map_dir"],
+ "resolution": row["resolution"],
"original_selection": sel_str,
}As per coding guidelines, **/*.py: "Fail fast with clear error messages when preconditions are violated".
Also applies to: 168-171
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/eval/find_max_rmsd_subsegment.py` around lines 135 - 140, The CSV
validation currently only requires `protein` and `selection` fields, but the
script depends on `structure_pattern`, `map_pattern`, `base_map_dir`, and
`resolution` which are accessed with fallback empty strings in the dict
construction, masking bad input. Move the field validation to the CSV validation
point (around line 168) to verify all required fields are present, raising a
clear error message if any are missing, then remove the fallback empty strings
in the dict construction (lines 137-140) by accessing the row fields directly
without .get() defaults, since they will be guaranteed to exist after
validation.
Source: Coding guidelines
marcuscollins
left a comment
There was a problem hiding this comment.
Thanks for the changes.
Script for finding the max rmsd subsegment in altlocs greater than a certain length, and speedups for the RSCC grid search script.
This depends on #223 and should only be merged afterwards
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests
Chores
joblibdependency to support parallel processing.