feat(eval): canonicalize mixed-altloc modified residues#263
Conversation
|
Warning Review limit reached
More reviews will be available in 59 minutes and 39 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdds two helpers to ChangesMixed altloc canonicalization
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
Adds an evaluation-time structure preprocessing utility to handle compositional heterogeneity where modified residues (e.g., CSO) appear as altlocs alongside their canonical forms (e.g., CYS), enabling reference structures like 6NI6/6NI5 to be processed successfully for evaluation.
Changes:
- Introduces
_closest_canonical_amino_acid()andcanonicalize_mixed_altloc_residues()ineval/structure_utils.pyto canonicalize mixed-altloc modified residues. - Wires the canonicalization step into
get_reference_atomarraystack()prior to stacking altlocs. - Adds unit tests covering mapping behavior and non-mutating semantics.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/sampleworks/eval/structure_utils.py |
Adds canonicalization utilities and applies them during reference structure loading to prevent altloc stacking failures. |
tests/eval/test_structure_utils.py |
Adds tests validating the canonical mapping and mixed-position renaming behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/eval/test_structure_utils.py (1)
11-13: ⚡ Quick winAvoid coupling tests to private helper internals.
Line 11 and Line 370 test
_closest_canonical_amino_aciddirectly. Prefer validating behavior throughcanonicalize_mixed_altloc_residues(...)inputs/outputs so tests stay black-box and resilient to internal refactors. As per coding guidelines, "Write black-box tests that verify behavior, not implementation."Also applies to: 366-371
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/eval/test_structure_utils.py` around lines 11 - 13, Remove the import of the private function _closest_canonical_amino_acid from the imports section, and refactor any direct tests of _closest_canonical_amino_acid to instead validate its behavior indirectly through the public API by testing inputs and outputs of canonicalize_mixed_altloc_residues. This ensures tests remain black-box and resilient to internal refactoring of implementation details.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/eval/test_structure_utils.py`:
- Around line 353-355: Add NumPy-style docstrings to the test methods in the
specified ranges. For the _mixed_array() method and all other new test methods
in the affected ranges, add docstrings that follow NumPy style, which should
include a brief summary of what the method does, a Returns section describing
what is returned, and any other relevant sections as needed. Each docstring
should be placed immediately after the method signature and before any
implementation code or comments.
---
Nitpick comments:
In `@tests/eval/test_structure_utils.py`:
- Around line 11-13: Remove the import of the private function
_closest_canonical_amino_acid from the imports section, and refactor any direct
tests of _closest_canonical_amino_acid to instead validate its behavior
indirectly through the public API by testing inputs and outputs of
canonicalize_mixed_altloc_residues. This ensures tests remain black-box and
resilient to internal refactoring of implementation details.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ca247f13-768a-4038-8b42-68911a7e58f6
📒 Files selected for processing (2)
src/sampleworks/eval/structure_utils.pytests/eval/test_structure_utils.py
| @staticmethod | ||
| def _mixed_array() -> AtomArray: | ||
| # (A,10): canonical CYS + modified CSO compositional heterogeneity |
There was a problem hiding this comment.
Add NumPy-style docstrings to the new test methods.
The new methods in this block are missing docstrings; this repo requires NumPy-style docstrings for every function and class. As per coding guidelines, "Always include NumPy-style docstrings for every function and class."
Also applies to: 370-384
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/eval/test_structure_utils.py` around lines 353 - 355, Add NumPy-style
docstrings to the test methods in the specified ranges. For the _mixed_array()
method and all other new test methods in the affected ranges, add docstrings
that follow NumPy style, which should include a brief summary of what the method
does, a Returns section describing what is returned, and any other relevant
sections as needed. Each docstring should be placed immediately after the method
signature and before any implementation code or comments.
Source: Coding guidelines
marcuscollins
left a comment
There was a problem hiding this comment.
I like this and I think it should ultimately replace the more involved method I previously wrote to handle compositional heterogeneity. But I think there are a couple of problems, mainly that I think this will leave extra atoms (most of which are duplicates of each other and probably have an altloc id that is different, but some will be atoms that are incompatible with the new residue type).
| return None if parent == UNKNOWN_AA else parent | ||
|
|
||
|
|
||
| def canonicalize_mixed_altloc_residues( |
There was a problem hiding this comment.
Does this solve the same problem as resolve_mixed_hetatm_atom_altlocs? If so, IIRC AtomWorks inserts an extra residue, and it looks to me like this might not catch that.
Also, if this does solve that same problem, I'd rather we have one solution, not two which could diverge. I'm not saying my solution was better (it is a bit of a hack, tbh, to have to write a temporary CIF file). But before we put this in, we should unify the two (again, if they are the same)
| continue # single consistent res_name means it doesn't have comp het | ||
| parent = parent_cache.setdefault(res_name, _closest_canonical_amino_acid(res_name)) | ||
| if parent is not None and parent != res_name: | ||
| out.res_name[i] = parent |
There was a problem hiding this comment.
What happens if the two residues don't have the same atoms? I see a couple issues here. One is that AtomWorks' parse will insert extra atoms when there's compositional heterogeneity. So, for example in 6NI6, you would end up with two CA, C, O, etc... as well as extra atoms that aren't compatible with the residue type anymore. I think probably we need to handle that and canonicalize the atoms, not just the residue names.
| if len(res_names_by_position[position]) == 1: | ||
| continue # single consistent res_name means it doesn't have comp het | ||
| parent = parent_cache.setdefault(res_name, _closest_canonical_amino_acid(res_name)) | ||
| if parent is not None and parent != res_name: |
There was a problem hiding this comment.
Are you trying to canonicalize all residues, or just make the sequences the same? What if a structure has two different non-canonicals at the same position, and no canonicals?
| # form (e.g. CYS) makes map_altlocs_to_stack fail: the conformers disagree on res_name/hetero | ||
| # so biotite.stack() rejects them. Canonicalize these residues to get map_altlocs_to_stack to | ||
| # work | ||
| ref_struct = canonicalize_mixed_altloc_residues(load_structure_with_altlocs(ref_path)) |
There was a problem hiding this comment.
I think looking at your code, you should try to replace my method with yours although I'm pretty sure you need to get rid of extra atoms still.
There was a problem hiding this comment.
If you don't have time, make an issue and tag with engineering.
5c4a8a3 to
4b8cc2a
Compare
Adds a processing utility that is necessary for getting certain proteins (6NI6/6NI5) to work properly in the evaluation scripts, since those contain compositional heterogeneity in the form of a modified amino acid.
Summary by CodeRabbit
New Features
Tests