Fix non-C-contiguous matrix reads in Python bindings by colganwi · Pull Request #2 · raphael-group/LAML-Pro

colganwi · 2026-06-22T18:28:03Z

Problem

The Python bindings silently misread non-C-contiguous NumPy arrays, corrupting the likelihood. numpy_to_character_matrix, numpy_to_observation_matrix, and numpy_to_mutation_priors in python/src/numpy_conversion.cpp indexed the raw buffer assuming a C-contiguous (row-major) layout:

result[i][j] = ptr[i * num_chars + j];   // hardcoded row-major strides

If the input is not C-contiguous — a transposed view, a strided slice, or a pandas DataFrame.values (frequently Fortran-ordered) — the wrong elements are read and the character matrix is effectively transposed/scrambled, with no error raised.

Symptoms

the true (generating) tree scores lower than incorrect trees;
topology_search degrades a good starting tree instead of improving it (e.g. a perfect NJ tree was driven to ~0.74 normalized RF);
permuting the cell→leaf assignment can increase the likelihood;
joint log-likelihood ≠ sum of per-character log-likelihoods, and depends on column order.

Feeding a matrix straight from pandas (df.values) reliably triggers it.

Fixes #1.

Fix

Read the buffers using their actual buf.strides so any memory layout is handled correctly (no copy). Applied to all three converters.

Tests

Added TestMemoryLayoutIndependence in tests/python/test_likelihood.py:

test_c_and_fortran_order_agree — C- vs F-contiguous copies give identical likelihood
test_noncontiguous_view_matches_copy — strided (non-contiguous) view matches a contiguous copy
test_joint_equals_sum_over_characters — joint == sum of per-character likelihoods with heterogeneous alphabet sizes
test_column_order_invariance — likelihood is invariant to column ordering

All four fail on main and pass with this change.

Verification

Full Python test suite passes (tests/python/test_likelihood.py, tests/python/test_em.py): 28 passed.
Regression check on examples/n250_character_matrix is unchanged (it was already C-contiguous): true tree correctly scores highest.
End-to-end on simulated lineage-tracing data (pandas-derived, F-contiguous matrix): with the fix, correctly-aligned data beats a row-shuffled assignment by ~485 log-likelihood units, and topology_search now improves or preserves the NJ tree (0.000 / 0.037→0.000 / 0.000 normalized RF over 3 seeds) instead of degrading it.

🤖 Generated with Claude Code

The numpy->C++ converters (numpy_to_character_matrix, numpy_to_observation_matrix, numpy_to_mutation_priors) indexed the raw buffer assuming a C-contiguous (row-major) layout. Non-contiguous inputs -- a transposed view, a strided slice, or a pandas DataFrame's `.values` (frequently Fortran-ordered) -- were therefore read with the wrong offsets, silently scrambling the matrix and corrupting the likelihood. This made the true tree score lower than incorrect trees and caused topology_search to degrade good trees instead of improving them. Read the buffers using their actual strides so any memory layout is handled correctly, and add regression tests asserting layout independence, joint == sum of per-character likelihoods, and column-order invariance. Fixes raphael-group#1 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix non-C-contiguous matrix reads in Python bindings#2

Fix non-C-contiguous matrix reads in Python bindings#2
colganwi wants to merge 1 commit into
raphael-group:mainfrom
colganwi:fix/noncontiguous-matrix-bindings

colganwi commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

colganwi commented Jun 22, 2026

Problem

Symptoms

Fix

Tests

Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant