feat(eval): canonicalize mixed-altloc modified residues by k-chrispens · Pull Request #263 · diff-use/sampleworks

k-chrispens · 2026-06-16T22:36:57Z

Adds a processing utility needed for certain structures (6NI6/6NI5) to work properly in the evaluation scripts, since they contain compositional heterogeneity in the form of a modified amino acid.

Summary by CodeRabbit

New Features
- Improved protein structure processing with enhanced alternate location (altloc) residue handling
- Modified amino acids are now automatically mapped to their canonical parent forms, ensuring proper processing of mixed altloc structures

Tests
- Added comprehensive test coverage for canonicalization functionality

coderabbitai · 2026-06-16T22:37:18Z

📥 Commits

Reviewing files that changed from the base of the PR and between 5c4a8a3 and 4b8cc2a.

📒 Files selected for processing (2)

src/sampleworks/eval/structure_utils.py
tests/eval/test_structure_utils.py

📝 Walkthrough

Adds two helpers to structure_utils.py: _closest_canonical_amino_acid maps modified residue names (e.g., CSO→CYS) to canonical parents, and canonicalize_mixed_altloc_residues normalizes mixed-altloc positions by renaming modified residues and clearing hetero flags. The canonicalization is applied during reference-structure loading. Tests validate all mapping and mutation-safety behaviors.

Changes

Mixed altloc canonicalization

| Layer / File(s) | Summary |
| --- | --- |
| Canonicalization helpers and supporting imports `src/sampleworks/eval/structure_utils.py` | Adds `defaultdict`, atomworks residue-name conversion imports, `_closest_canonical_amino_acid` (modified→canonical three-letter name, `None` for non-amino-acids), and `canonicalize_mixed_altloc_residues` (copies input, groups atoms by `(chain_id, res_id, ins_code)`, detects `res_name` disagreement across altlocs, renames modified residues to canonical parent, clears `hetero` for renamed records). |
| Reference-structure integration and tests `src/sampleworks/eval/structure_utils.py`, `tests/eval/test_structure_utils.py` | Pipes loaded reference structure through `canonicalize_mixed_altloc_residues` before downstream processing. `TestCanonicalizeMixedAltlocResidues` covers `_closest_canonical_amino_acid` parameter cases (CSO→CYS, MSE→MET, ALA unchanged, HOH→None), `canonicalize_mixed_altloc_residues` renaming and hetero-clearing at mixed positions, MSE non-mutation, and input immutability. |
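
The behavior summarized in this table can be sketched without biotite or atomworks. The following is a minimal illustration under stated assumptions, not the PR's implementation: the `Atom` dataclass, the `CANONICAL` set, and the `MODIFIED_TO_PARENT` table are hypothetical stand-ins for an `AtomArray` and for the atomworks residue-name conversion.

```python
from collections import defaultdict
from dataclasses import dataclass, replace

# Hypothetical lookup tables standing in for the atomworks conversion.
CANONICAL = {"ALA", "CYS", "MET"}  # deliberately truncated for the example
MODIFIED_TO_PARENT = {"CSO": "CYS", "MSE": "MET"}


@dataclass(frozen=True)
class Atom:
    chain_id: str
    res_id: int
    ins_code: str
    altloc: str
    res_name: str
    atom_name: str
    hetero: bool


def closest_canonical(res_name):
    """Return the canonical parent name, or None for non-amino-acids."""
    if res_name in CANONICAL:
        return res_name
    return MODIFIED_TO_PARENT.get(res_name)


def canonicalize_mixed(atoms):
    """Return a new list, renaming modified residues only at mixed positions."""
    names_by_position = defaultdict(set)
    for atom in atoms:
        key = (atom.chain_id, atom.res_id, atom.ins_code)
        names_by_position[key].add(atom.res_name)

    out = []
    for atom in atoms:
        key = (atom.chain_id, atom.res_id, atom.ins_code)
        mixed = len(names_by_position[key]) > 1
        parent = closest_canonical(atom.res_name)
        if mixed and parent is not None and parent != atom.res_name:
            # Rename to the parent and clear the hetero flag on the copy.
            atom = replace(atom, res_name=parent, hetero=False)
        out.append(atom)
    return out


atoms = [
    Atom("A", 10, "", "A", "CYS", "CA", False),
    Atom("A", 10, "", "B", "CSO", "CA", True),  # mixed with CYS above
    Atom("A", 20, "", "", "MSE", "CA", True),   # consistent position
]
result = canonicalize_mixed(atoms)
```

In this sketch the MSE at position 20 keeps its name and hetero flag because no other residue name appears at that position, and the input list is left untouched because each renamed record is a fresh copy.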

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 A residue mixed up and lost its way,
CSO wanted CYS's role to play.
With a copy, a group, a lookup so keen,
The hetero flag cleared, the altloc serene.
Hoppity-hop, canonical and bright —
Every position stacked just right! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and specifically describes the main change: adding canonicalization for mixed-altloc modified residues, which aligns with the PR's core functionality and objectives. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Copilot

Pull request overview

Adds an evaluation-time structure preprocessing utility to handle compositional heterogeneity where modified residues (e.g., CSO) appear as altlocs alongside their canonical forms (e.g., CYS), enabling reference structures like 6NI6/6NI5 to be processed successfully for evaluation.

Changes:

Introduces _closest_canonical_amino_acid() and canonicalize_mixed_altloc_residues() in eval/structure_utils.py to canonicalize mixed-altloc modified residues.
Wires the canonicalization step into get_reference_atomarraystack() prior to stacking altlocs.
Adds unit tests covering mapping behavior and non-mutating semantics.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `src/sampleworks/eval/structure_utils.py` | Adds canonicalization utilities and applies them during reference structure loading to prevent altloc stacking failures. |
| `tests/eval/test_structure_utils.py` | Adds tests validating the canonical mapping and mixed-position renaming behavior. |

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

tests/eval/test_structure_utils.py (1)
11-13: ⚡ Quick win

Avoid coupling tests to private helper internals.

Line 11 and Line 370 test _closest_canonical_amino_acid directly. Prefer validating behavior through canonicalize_mixed_altloc_residues(...) inputs/outputs so tests stay black-box and resilient to internal refactors. As per coding guidelines, "Write black-box tests that verify behavior, not implementation."

Also applies to: 366-371
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/eval/test_structure_utils.py` around lines 11 - 13, Remove the import
of the private function _closest_canonical_amino_acid from the imports section,
and refactor any direct tests of _closest_canonical_amino_acid to instead
validate its behavior indirectly through the public API by testing inputs and
outputs of canonicalize_mixed_altloc_residues. This ensures tests remain
black-box and resilient to internal refactoring of implementation details.
Source: Coding guidelines

📥 Commits

Reviewing files that changed from the base of the PR and between b87cc5f and 5c4a8a3.

📒 Files selected for processing (2)

src/sampleworks/eval/structure_utils.py
tests/eval/test_structure_utils.py

coderabbitai · 2026-06-16T22:41:06Z

+    @staticmethod
+    def _mixed_array() -> AtomArray:
+        # (A,10): canonical CYS + modified CSO compositional heterogeneity


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add NumPy-style docstrings to the new test methods.

The new methods in this block are missing docstrings; this repo requires NumPy-style docstrings for every function and class. As per coding guidelines, "Always include NumPy-style docstrings for every function and class."

Also applies to: 370-384

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/eval/test_structure_utils.py` around lines 353 - 355, Add NumPy-style docstrings to the test methods in the specified ranges. For the _mixed_array() method and all other new test methods in the affected ranges, add docstrings that follow NumPy style, which should include a brief summary of what the method does, a Returns section describing what is returned, and any other relevant sections as needed. Each docstring should be placed immediately after the method signature and before any implementation code or comments.

Source: Coding guidelines
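
For reference, a NumPy-style docstring on a small test fixture might look like the sketch below. The function name and the tuple layout are hypothetical and chosen only to keep the example self-contained; the PR's `_mixed_array()` returns an `AtomArray` instead.

```python
def mixed_position_records():
    """Build records for one position with CYS/CSO heterogeneity.

    Returns
    -------
    list of tuple
        ``(chain_id, res_id, altloc, res_name)`` records for position
        ``(A, 10)``, with canonical CYS in altloc A and modified CSO
        in altloc B.
    """
    return [("A", 10, "A", "CYS"), ("A", 10, "B", "CSO")]


records = mixed_position_records()
```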

marcuscollins

I like this, and I think it should ultimately replace the more involved method I previously wrote to handle compositional heterogeneity. But I see a couple of problems. The main one is that this will leave extra atoms: most will be duplicates of each other, probably with a different altloc ID, but some will be atoms that are incompatible with the new residue type.

marcuscollins · 2026-06-19T22:12:24Z

+    return None if parent == UNKNOWN_AA else parent
+
+
+def canonicalize_mixed_altloc_residues(


Does this solve the same problem as resolve_mixed_hetatm_atom_altlocs? If so, IIRC AtomWorks inserts an extra residue, and it looks to me like this might not catch that.

Also, if this does solve that same problem, I'd rather we have one solution, not two that could diverge. I'm not saying my solution was better (it is a bit of a hack, tbh, to have to write a temporary CIF file). But before we put this in, we should unify the two (again, if they are the same).

marcuscollins · 2026-06-19T22:16:58Z

+            continue  # single consistent res_name means it doesn't have comp het
+        parent = parent_cache.setdefault(res_name, _closest_canonical_amino_acid(res_name))
+        if parent is not None and parent != res_name:
+            out.res_name[i] = parent


What happens if the two residues don't have the same atoms? I see a couple of issues here. One is that AtomWorks' parse will insert extra atoms when there's compositional heterogeneity. For example, in 6NI6 you would end up with two each of CA, C, O, etc., as well as extra atoms that are no longer compatible with the residue type. We probably need to handle that and canonicalize the atoms, not just the residue names.
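
One way to picture the leftover-atom concern is the sketch below. It assumes the records at a mixed position have already been renamed CSO→CYS, and it uses a hand-written heavy-atom template for CYS; the helper names and the tuple layout are hypothetical, and a real fix would look atom names up in a chemical component dictionary rather than hard-code them.

```python
from collections import Counter

# Hand-written heavy-atom template, for illustration only.
HEAVY_ATOMS = {"CYS": {"N", "CA", "C", "O", "CB", "SG"}}

# (altloc, res_name, atom_name) records at one position after renaming.
renamed = [
    ("A", "CYS", "N"), ("A", "CYS", "CA"), ("A", "CYS", "SG"),
    ("B", "CYS", "N"), ("B", "CYS", "CA"), ("B", "CYS", "SG"),
    ("B", "CYS", "OD"),  # hydroxyl oxygen carried over from CSO
]


def split_compatible(records, templates):
    """Separate records whose atom name fits the residue template."""
    kept, dropped = [], []
    for record in records:
        _, res_name, atom_name = record
        allowed = templates.get(res_name)
        # Residues without a template are passed through unchanged.
        if allowed is None or atom_name in allowed:
            kept.append(record)
        else:
            dropped.append(record)
    return kept, dropped


def duplicated_names(records):
    """List (altloc, atom_name) pairs that occur more than once."""
    counts = Counter((altloc, name) for altloc, _, name in records)
    return sorted(key for key, n in counts.items() if n > 1)


kept, dropped = split_compatible(renamed, HEAVY_ATOMS)
duplicates = duplicated_names(kept)
```

Here the OD atom is the only record dropped, and no `(altloc, atom_name)` pair repeats. If the parser emitted the same atom name twice within one altloc, as the comment suggests it might, `duplicated_names` would report it.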

marcuscollins · 2026-06-19T22:19:41Z

+        if len(res_names_by_position[position]) == 1:
+            continue  # single consistent res_name means it doesn't have comp het
+        parent = parent_cache.setdefault(res_name, _closest_canonical_amino_acid(res_name))
+        if parent is not None and parent != res_name:


Are you trying to canonicalize all residues, or just make the sequences the same? What if a structure has two different non-canonicals at the same position, and no canonicals?
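
The edge case can be explored with a small sketch of the renaming rule in the hunk above: a position with a single residue name is skipped, and otherwise each name is replaced by its parent when one exists. The `PARENT` table is a hypothetical stand-in for the atomworks lookup, and the CSO/MSE pairing is contrived purely to show parents that disagree.

```python
# Hypothetical parent table; canonical residues map to themselves.
PARENT = {
    "CYS": "CYS", "MET": "MET",
    "CSO": "CYS", "CSD": "CYS", "MSE": "MET",
}


def names_after_canonicalization(res_names):
    """Return the set of residue names left at one position."""
    if len(res_names) == 1:
        return set(res_names)  # consistent position: nothing is renamed
    out = set()
    for name in res_names:
        parent = PARENT.get(name)
        out.add(parent if parent is not None else name)
    return out


same_parent = names_after_canonicalization({"CSO", "CSD"})
different_parents = names_after_canonicalization({"CSO", "MSE"})
no_parent = names_after_canonicalization({"CSO", "HOH"})
```

Under these assumptions, two non-canonicals that share a parent collapse to one name even with no canonical residue present. Positions whose parents differ, or where one name has no parent, stay mixed and would presumably still fail stacking.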

marcuscollins · 2026-06-19T22:45:02Z

+    # form (e.g. CYS) makes map_altlocs_to_stack fail: the conformers disagree on res_name/hetero
+    # so biotite.stack() rejects them. Canonicalize these residues to get map_altlocs_to_stack to
+    # work
+    ref_struct = canonicalize_mixed_altloc_residues(load_structure_with_altlocs(ref_path))


Looking at your code, I think you should try to replace my method with yours, although I'm pretty sure you still need to get rid of the extra atoms.

If you don't have time, make an issue and tag with engineering.

Copilot AI review requested due to automatic review settings June 16, 2026 22:36

k-chrispens had a problem deploying to gpu-testing June 16, 2026 22:37 — with GitHub Actions Error

Copilot started reviewing on behalf of k-chrispens June 16, 2026 22:37

Copilot AI reviewed Jun 16, 2026

coderabbitai Bot reviewed Jun 16, 2026

k-chrispens requested a review from marcuscollins June 16, 2026 23:55

marcuscollins requested changes Jun 19, 2026

feat(eval): canonicalize mixed-altloc modified residues

4b8cc2a

k-chrispens force-pushed the kmc/canonicalize-altloc-residues branch from 5c4a8a3 to 4b8cc2a June 23, 2026 02:56

k-chrispens requested a deployment to gpu-testing June 23, 2026 02:56 — with GitHub Actions Waiting

3 participants

k-chrispens commented Jun 16, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading