Skip to content

Fix non-C-contiguous matrix reads in Python bindings#2

Open
colganwi wants to merge 1 commit into
raphael-group:mainfrom
colganwi:fix/noncontiguous-matrix-bindings
Open

Fix non-C-contiguous matrix reads in Python bindings#2
colganwi wants to merge 1 commit into
raphael-group:mainfrom
colganwi:fix/noncontiguous-matrix-bindings

Conversation

@colganwi

Copy link
Copy Markdown

Problem

The Python bindings silently misread non-C-contiguous NumPy arrays, corrupting the likelihood. numpy_to_character_matrix, numpy_to_observation_matrix, and numpy_to_mutation_priors in python/src/numpy_conversion.cpp indexed the raw buffer assuming a C-contiguous (row-major) layout:

result[i][j] = ptr[i * num_chars + j];   // hardcoded row-major strides

If the input is not C-contiguous — a transposed view, a strided slice, or a pandas DataFrame.values (frequently Fortran-ordered) — the wrong elements are read and the character matrix is effectively transposed/scrambled, with no error raised.

Symptoms

  • the true (generating) tree scores lower than incorrect trees;
  • topology_search degrades a good starting tree instead of improving it (e.g. a perfect NJ tree was driven to ~0.74 normalized RF);
  • permuting the cell→leaf assignment can increase the likelihood;
  • joint log-likelihood ≠ sum of per-character log-likelihoods, and depends on column order.

Feeding a matrix straight from pandas (df.values) reliably triggers it.

Fixes #1.

Fix

Read the buffers using their actual buf.strides so any memory layout is handled correctly (no copy). Applied to all three converters.

Tests

Added TestMemoryLayoutIndependence in tests/python/test_likelihood.py:

  • test_c_and_fortran_order_agree — C- vs F-contiguous copies give identical likelihood
  • test_noncontiguous_view_matches_copy — strided (non-contiguous) view matches a contiguous copy
  • test_joint_equals_sum_over_characters — joint == sum of per-character likelihoods with heterogeneous alphabet sizes
  • test_column_order_invariance — likelihood is invariant to column ordering

All four fail on main and pass with this change.

Verification

  • Full Python test suite passes (tests/python/test_likelihood.py, tests/python/test_em.py): 28 passed.
  • Regression check on examples/n250_character_matrix is unchanged (it was already C-contiguous): true tree correctly scores highest.
  • End-to-end on simulated lineage-tracing data (pandas-derived, F-contiguous matrix): with the fix, correctly-aligned data beats a row-shuffled assignment by ~485 log-likelihood units, and topology_search now improves or preserves the NJ tree (0.000 / 0.037→0.000 / 0.000 normalized RF over 3 seeds) instead of degrading it.

🤖 Generated with Claude Code

The numpy->C++ converters (numpy_to_character_matrix,
numpy_to_observation_matrix, numpy_to_mutation_priors) indexed the raw
buffer assuming a C-contiguous (row-major) layout. Non-contiguous inputs
-- a transposed view, a strided slice, or a pandas DataFrame's `.values`
(frequently Fortran-ordered) -- were therefore read with the wrong
offsets, silently scrambling the matrix and corrupting the likelihood.

This made the true tree score lower than incorrect trees and caused
topology_search to degrade good trees instead of improving them.

Read the buffers using their actual strides so any memory layout is
handled correctly, and add regression tests asserting layout
independence, joint == sum of per-character likelihoods, and column-order
invariance.

Fixes raphael-group#1

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Python bindings silently misread non-C-contiguous matrices (e.g. pandas .values), corrupting the likelihood

1 participant