Adding CIF file support#83
Conversation
|
Warning Review limit reached
More reviews will be available in 34 minutes and 54 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe PR resolves structure files to CIF or PDB paths and threads the resolved path through dataset parsing, preprocessing, embedding generation, and tests. It also updates normalized B-factor computation to reuse already-parsed atoms. ChangesCIF Structure Format Support
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
Adds mmCIF (.cif) structure support alongside PDB across dataset loading and related utilities, plus tests and fixtures to validate CIF parsing and path selection behavior.
Changes:
- Introduces
_read_structure()insrc/dataset.pyto dispatch PDB vs CIF parsing via biotite, and reuses already-parsed water atoms for normalized B-factor computation. - Updates dataset list parsing to prefer an existing
*_final.cifover*_final.pdbwithout scanning directories. - Adds
utils.resolve_structure_path()and adjustsparse_split_file()to emit preferred CIF paths, with new unit/integration tests and a CIF test fixture file.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
src/dataset.py |
Adds CIF/PDB dispatch reader, refactors normalized B-factor computation, and prefers CIF in _parse_pdb_list(). |
src/utils.py |
Adds resolve_structure_path() and changes parse_split_file() to emit preferred CIF paths. |
tests/test_dataset.py |
Adds CIF parsing integration tests and dataset list parsing tests for CIF preference and filesystem probing behavior. |
tests/test_utils.py |
Adds unit tests for resolve_structure_path() and verifies parse_split_file() doesn’t touch the filesystem. |
tests/conftest.py |
Adds a cif_6eey fixture resolver. |
tests/test_files/6eey/6eey_final.cif |
Adds a real mmCIF test structure file for integration coverage. |
Comments suppressed due to low confidence (1)
src/utils.py:156
parse_split_file()'s docstring claims it raisesValueErrorwhen the file contains only malformed lines, but the function never raises and will just return an empty list. This mismatch can mislead callers and tests; either implement the exception or remove theRaises:section.
Raises:
ValueError: If split_file contains only malformed lines
"""
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/dataset.py`:
- Around line 50-60: The _read_structure function uses path.endswith(".cif")
which will fail when path is a pathlib.Path object instead of a string. Convert
the path parameter to a string at the beginning of the function using str(path)
before performing the suffix check with endswith(), or alternatively use
pathlib.Path(path).suffix to extract the file extension in a path-agnostic way.
This ensures the function handles both string and Path-like inputs correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9c02ff7f-09e1-4727-9242-456916027cae
📒 Files selected for processing (8)
scripts/generate_esm_embeddings.pyscripts/generate_slae_embeddings.pysrc/dataset.pysrc/utils.pytests/conftest.pytests/test_dataset.pytests/test_files/6eey/6eey_final.ciftests/test_utils.py
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
src/utils.py:156
- The docstring claims
parse_split_file()raisesValueErrorwhen the split contains only malformed lines, but the implementation never raises (it just returns an empty list). Either implement the raise or remove thisRaises:section to keep docs accurate.
Raises:
ValueError: If split_file contains only malformed lines
"""
marcuscollins
left a comment
There was a problem hiding this comment.
There are several things that could be improved, particularly in the tests. Please have a look and see if you can make some changes.
| return ins | ||
|
|
||
|
|
||
| def resolve_structure_path(path: str | Path) -> Path | None: |
There was a problem hiding this comment.
This function seems a little superfluous. I don't think you need to look for a different file extension if the specified one isn't there. Just raise an error.
There was a problem hiding this comment.
Done, created a single stuc_path that should be a cif or pdb, and errors out in the parse function if the file does not exist. Made this change wherever a "pdb_path" is invoked.
There was a problem hiding this comment.
You can actually iterate over the file directly by using for line in open(split_file, "r"):, it will manage the file closing for you.
There was a problem hiding this comment.
That said, you should just create a csv or tsv file with the pieces you need. E.g., lines like:
1VME, /path/to/1VME.cif, <<cache_key>>
and then read them in with pandas or something similar.
There was a problem hiding this comment.
Made the change for the loop. As for the second comment, do you mean my input split files should be formatted that way? Since the path and cache keys can change depending on the directory we specify, hence I chose to go with a simple text file, and we can specify where to find the structure files (path dir) and where to create the cache.
There was a problem hiding this comment.
Simplify this: parts = line.strip().split("_")
|
|
||
| def test_cif_hydrogen_removed(self, cif_6eey): | ||
| """Hydrogens should be removed from CIF-parsed arrays.""" | ||
| protein_atoms, water_atoms = parse_asu_with_biotite(cif_6eey) |
There was a problem hiding this comment.
These could be a test fixtures since they are re-used many times. It would speed up your tests.
| base_dir = tmp_path / "pdbs" | ||
| self._write_structure(base_dir, "abcd", [".cif", ".pdb"]) | ||
|
|
||
| list_file = tmp_path / "list.txt" |
There was a problem hiding this comment.
can you be sure that the tmp directory is cleaned up?
There was a problem hiding this comment.
Yes, pytest tmp_path are cleaned up by default
| base_dir / "keep2" / "keep2_final.cif" | ||
| ) | ||
|
|
||
| def test_parse_does_not_probe_filesystem(self, tmp_path, monkeypatch): |
There was a problem hiding this comment.
This is a weird test. Is there some reason we care about this? Remove it if there isn't some very specific concern.
| ) | ||
|
|
||
| assert [entry["pdb_id"] for entry in dataset.entries] == ["keep1", "keep2"] | ||
| assert dataset.entries[0]["pdb_path"] == ( |
There was a problem hiding this comment.
This is testing a couple different things:
- which ids are retained and which are discarded
- a specific set of paths and file name conventions.
In general a unit test should really test one thing. To achieve that here, you could create a method that generates the file paths from the pdb id, and then test that method separately.
| """Tests for lazy CIF/PDB path resolution helpers.""" | ||
|
|
||
| @staticmethod | ||
| def _write_structure(base_dir: Path, pdb_id: str, suffixes: list[str]) -> None: |
There was a problem hiding this comment.
A lot of these tests are redundant with some in test_dataset.py. That just adds time to your test runs. For instance, a lot of the tests above are implicitly testing resolve_structure_path. Please go through these and minimize them. You can use a test coverage report to determine whether you've decreased your coverage when you remove any tests.
There was a problem hiding this comment.
Addressing both this and the above comment, I've tried to remove some redundant tests, and consolidate as much as I can while minimizing coverage.
|
|
||
| parts = line.split("_") | ||
| if len(parts) < 2: | ||
| for line in open(split_file, "r"): |
There was a problem hiding this comment.
@marcuscollins I'll defer a decision on this to you -- is this needed?
dataset.pyandutils.pynow support reading.ciffiles alongside.pdb:_read_structure()dispatches to biotite's CIF or PDB reader based on file extension, used byparse_asu_with_biotiteand compute_normalized_bfactors.compute_normalized_bfactorsno longer re-reads the file when bfactors are already available fromparse_asu_with_biotite; the shared logic moves to_compute_normalized_bfactors_from_atoms()._parse_pdb_listnow prefers a PDB ID's.ciffile over.pdbwhen both exist (checked via a singleos.path.isfile, no directory scan).utils.resolve_structure_path()resolves a preferred CIF/PDB path to whichever file actually exists on disk;parse_split_file()now returns preferred.cifpaths for callers to resolve lazily.Summary by CodeRabbit