Fix non-C-contiguous matrix reads in Python bindings#2
Open
colganwi wants to merge 1 commit into
Open
Conversation
The numpy->C++ converters (numpy_to_character_matrix, numpy_to_observation_matrix, numpy_to_mutation_priors) indexed the raw buffer assuming a C-contiguous (row-major) layout. Non-contiguous inputs -- a transposed view, a strided slice, or a pandas DataFrame's `.values` (frequently Fortran-ordered) -- were therefore read with the wrong offsets, silently scrambling the matrix and corrupting the likelihood. This made the true tree score lower than incorrect trees and caused topology_search to degrade good trees instead of improving them. Read the buffers using their actual strides so any memory layout is handled correctly, and add regression tests asserting layout independence, joint == sum of per-character likelihoods, and column-order invariance. Fixes raphael-group#1 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The Python bindings silently misread non-C-contiguous NumPy arrays, corrupting the likelihood.
numpy_to_character_matrix,numpy_to_observation_matrix, andnumpy_to_mutation_priorsinpython/src/numpy_conversion.cppindexed the raw buffer assuming a C-contiguous (row-major) layout:result[i][j] = ptr[i * num_chars + j]; // hardcoded row-major stridesIf the input is not C-contiguous — a transposed view, a strided slice, or a pandas
DataFrame.values(frequently Fortran-ordered) — the wrong elements are read and the character matrix is effectively transposed/scrambled, with no error raised.Symptoms
topology_searchdegrades a good starting tree instead of improving it (e.g. a perfect NJ tree was driven to ~0.74 normalized RF);Feeding a matrix straight from pandas (
df.values) reliably triggers it.Fixes #1.
Fix
Read the buffers using their actual
buf.stridesso any memory layout is handled correctly (no copy). Applied to all three converters.Tests
Added
TestMemoryLayoutIndependenceintests/python/test_likelihood.py:test_c_and_fortran_order_agree— C- vs F-contiguous copies give identical likelihoodtest_noncontiguous_view_matches_copy— strided (non-contiguous) view matches a contiguous copytest_joint_equals_sum_over_characters— joint == sum of per-character likelihoods with heterogeneous alphabet sizestest_column_order_invariance— likelihood is invariant to column orderingAll four fail on
mainand pass with this change.Verification
tests/python/test_likelihood.py,tests/python/test_em.py): 28 passed.examples/n250_character_matrixis unchanged (it was already C-contiguous): true tree correctly scores highest.topology_searchnow improves or preserves the NJ tree (0.000 / 0.037→0.000 / 0.000 normalized RF over 3 seeds) instead of degrading it.🤖 Generated with Claude Code