Skip to content

Catch-all STR detection + Exp1 de-novo sensitivity overhaul (recall 23%→80.7%, ULTRA-level)#6

Open
wyim-pgl wants to merge 31 commits into
mainfrom
perf/exp1-human-sensitivity
Open

Catch-all STR detection + Exp1 de-novo sensitivity overhaul (recall 23%→80.7%, ULTRA-level)#6
wyim-pgl wants to merge 31 commits into
mainfrom
perf/exp1-human-sensitivity

Conversation

@wyim-pgl

Copy link
Copy Markdown
Owner

Summary

This branch overhauls BWTandem's de-novo tandem-repeat sensitivity and adds a new
catch-all periodicity algorithm that closes the recall gap to ULTRA/tantan while
keeping BWTandem the only de-novo tool that also maxes out satellite detection.

Headline (Human GRCh38 vs full adotto v1.2.1): region recall 23% → 80.7% at
79.1% adjusted precision — de-novo recall now on par with ULTRA (81.6%), which no
masking tool reaches on satellite.

@framazan — this is for your cross-tool benchmark. A self-contained run + results guide
is in docs/2026-06-25-catch-all-benchmark-for-filip.md
(build, exact per-genome commands, op-point table, scoring, ops notes).

Note: wyim-pgl/bwt-algorithm and framazan/bwtandem don't share git history (this
repo was created fresh, not forked), so GitHub can't open a PR directly into
framazan/bwtandem. This PR (into wyim-pgl/bwt-algorithm:main) is the reviewable
diff + writeup; the repo is public so you can read/comment here. If you want it landed
in framazan/bwtandem, the simplest path is cherry-picking these commits or granting
push access.

The new algorithm — catch-all periodicity pass (opt-in, default OFF)

finder._catchall_periodicity_fill (env CATCHALL_SCAN=1) detects local periodicity
directly in the DNA no tier/satellite pass covered — recovering the entirely-missed
diverged short STRs that the seed-based tiers structurally cannot find (a substitution
in most copies forms no exact seed). It trades precision for recall via a tunable
CATCHALL_MIN_IDENTITY, and a CATCHALL_MIN_COPIES=3 gate recovers precision by dropping
the shortest 2-copy over-calls. Default OFF — every existing run is unchanged.

3-species benchmark (catch-all OFF → ON)

species metric OFF ON verdict
Human GRCh38 (adotto) region recall / adj prec 57.6% / — 80.7% / 79.1% headline win
Arabidopsis Col-CEN CEN180 monomer recall 99.67% 99.68% neutral (saturated)
Maize Mo17 3A microsatellite bp 11.6M 22.7M (+96%) win
Maize Mo17 3B/3C satellite (knob180/TR-1/CentC) 25·17·17 25·17·17 unchanged

Rule: catch-all is a short-STR / microsatellite mechanism (period ≤ 20 bp) — ON for
STR/microsatellite runs, OFF for satellite/centromere runs (those use 156–360 bp periods,
outside its range). Full tables with all competitor tools are in the docs page above and
exp1_human/filip_repro/benchmarking_results_updated.md.

How to run

export TIER1_FMSCAN=1 TIER1_FMSCAN_MIN_DENSITY=0.45 TIER1_FMSCAN_MIN_LLR=6.0 \
       TIER1_MIN_ARRAY_LEN=20 TIER1_MIN_SCORE=20 TIER1_MIN_COPIES=2 \
       TIER1_COPYBASE=6 TIER1_COPYADD=2 TIER1_EXT_COPIES=2 \
       TIER1_SHORT_PERIOD_MAX=9 TIER1_SHORT_MIN_ARRAY_LEN=17 TIER1_SHORT_MIN_SCORE=17 \
       TIER2_MISMATCH=0.30 TIER2_SHORT_REQ_COPIES=2 \
       CATCHALL_SCAN=1 CATCHALL_MIN_IDENTITY=0.72 CATCHALL_MIN_COPIES=3   # catchF op-point
python3 -m src.main genome.fa --min-period 1 --max-period 2000 --threads 4 --format bed -o out -v

What else is in the branch (de-novo sensitivity overhaul)

  • Tier1 FM-index diverged-STR detector + rolling-consensus mismatch extender +
    period-stratified length/score gate (the recall/precision wall break, 23%→44%).
  • Satellite interior gap-fill for divergent alpha-satellite arrays (CHM13 centromere
    bp-recall 92.5%→99.8%).
  • Tier2 autocorrelation seeder for diverged period-10–20 arrays (opt-in).
  • Memory: full-genome peak 149 GB → ~30 GB (~5×).
  • Refactor: the 3 duplicated autocorrelation routines consolidated into
    src/autocorr.py (behavior-preserving — catch-all chr21 output byte-identical
    before/after; verified).

Testing

pytest tests/ — incl. test_autocorr.py (12, helpers reproduce the inline math exactly),
test_ground_truth.py (11), test_satellite_gapfill.py, test_loop_p1020.py. All green
with the compiled _accelerators.so. Defaults reproduce baseline behavior exactly.

Ops notes (Pronghorn)

Full-genome MaxRSS is human 37.6 GB / maize 23.7 GB — request 64 GB (a 190 GB request
stalled ~12 days in the full PGL partition; SLURM allocates by request, not usage).
scontrol is broken on the compute nodes (libreadline.so.6) — use scancel+resubmit.

🤖 Generated with Claude Code

https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP

wyim-pgl and others added 30 commits June 20, 2026 13:04
Document the verified non-Docker Linux install: conda-forge compiler env,
the channel-priority solver-hang workaround (--override-channels), and the
required Cython/C extension build (without it the tool reports 0 repeats).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Design/spec for closing BWTandem's adotto recall gap (36.87% vs
ultra/tantan 75-81%) on the general-TR benchmark. Records the verified
diagnosis (real short-STR sensitivity gap, not a naming artifact), the
agreed strategy (parameter sweep first, then surgical code changes;
re-score all tools on a downloaded public adotto catalog), and the
six-step plan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Expose Tier1's hardcoded detection thresholds as env-var overrides so the
short-tandem-repeat sensitivity/precision trade-off can be swept without
recompiling. Defaults reproduce the original behaviour exactly (verified: a
fresh chr21 run with no env vars matches the published baseline — 29.50%
adotto region recall, 79.50% precision).

New knobs (TIER1_*): MIN_COPIES, MIN_ARRAY_LEN, MIN_SCORE, COPYBASE, COPYADD
(dynamic_min_copies formula), EXT_COPIES (perfect seed copies before mismatch
extension), MISMATCH, MIN_ENTROPY.

Context (#3, Human GRCh38 general-TR): the recall gap vs ultra/tantan is
concentrated in short periods (<=20 bp, 85% of adotto regions). On chr21,
TIER1_MIN_ARRAY_LEN=20 TIER1_MIN_SCORE=20 (+relaxed copies) lifts overall
adotto region recall 29.50% -> 50.76% while keeping adjusted precision
(call in adotto OR confirmed by ultra/tantan) at 85.1% and raw precision
(67.7%) still above ultra (54.2%).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
tier1: add an opt-in low-complexity entropy gate (TIER1_ENTROPY_GATE,
default off). Empirically it does NOT help on short STRs — it removes real
low-entropy repeats (AT/AAT-type) and recall drops without precision gain,
because spurious and genuine short STRs are intrinsically indistinguishable.
Kept as a documented, default-off option.

tier2: expose the simple-scan detection thresholds for the period-10-20
regime as env vars (TIER2_MIN_COPIES, TIER2_MIN_ARRAY_LEN,
TIER2_SHORT_REQ_COPIES, TIER2_MISMATCH). Defaults reproduce the original
behaviour exactly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tuned config (comboA) lifts adotto region recall 29.5->52.3% (chr21) and
29.8->55.7% (chr22) while keeping adjusted precision ~85% and raw precision
above ultra/tantan. Documents the diagnosis (gap concentrated in period <=20),
the failed entropy-gate experiment, the Tier2 period-10-20 contribution, and
the honest ceiling. Full-genome SLURM validation pending.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a 'Tuning detection sensitivity (env vars)' section (TIER1_*/TIER2_*
thresholds, all defaulting to original behaviour) and a note that only
_accelerators.pyx edits need a Cython rebuild — pure-Python tier edits do not.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Full GRCh38 primary (chr1-22,X,Y) validation of comboA (SLURM job 5713685,
2h44m, 1,528,753 calls). All tools re-scored identically vs the full adotto
v1.2.1 catalog (1,784,804 GT regions):

- adotto region recall 23.03% -> 44.36% (+21.3 pp, ~1.93x); bp recall
  22.34% -> 30.60% — the chr21/22 ~2x gain generalises genome-wide.
- improved bwtandem overtakes trf on recall (44.36% vs 31.88%) and stays the
  most precise de novo caller (raw 66.41% > ultra 53.65%, tantan 57.66%).
- adjusted precision 88.4% -> 82.9%; of the new out-of-catalog calls 49.1% are
  ultra/tantan-confirmed, so genuine over-calling is only ~17% of calls.
- honest ceiling confirmed: still below the ultra/tantan 78-82% recall band.

Memory constraint NOT met: peak ~149 GB RSS at --threads 4 (vs 41.98 GB
reference), dominated by 4 concurrent per-chromosome FM-indexes; reduction
(plan item #5 / lower --threads) remains open. Recall result stands independently.

Replaces the Pending section with the final genome-wide table + reproduce steps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
BWTCore built two O(n) Python structures that nothing reads:
- kmer_hash: its only consumer get_kmer_positions() is never called (and falls
  back to the FM-index anyway), yet __init__ built a dict of per-position Python
  int lists for the whole chromosome.
- sampled_sa: its only consumer _get_suffix_position() is never called;
  locate_positions() reads suffix_array directly. With the hardcoded
  sa_sample_rate=1 (finder.py) this duplicated the entire suffix array as a
  Python int->int dict.

Both grew O(n) in Python objects and dominated per-chromosome RAM. Setting them
empty cuts chr21 single-thread peak RSS 8.60 GB -> 1.37 GB (6.3x) with the
adotto metric unchanged: region recall 52.33%, precision 65.78%, bp recall
42.83% are identical across 3 old vs 3 new runs. (The tool's BED already varies
~0.1% run-to-run -- old vs old differ too -- and new output stays within that
band, so exact-diff was never the right oracle; metric preservation is.)

get_kmer_positions() still works (empty hash -> FM-index fallback). Addresses
the Exp1 #3 memory item; full-genome re-measurement to follow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Re-ran the full GRCh38 primary genome with the dead-structure fix (40d4f2d) at
--threads 4 (SLURM job 5713697, 2h37m). Peak RSS dropped 149.18 GB -> 29.63 GB
(5.0x), now well under the 41.98 GB reference target, and ~7 min faster.

Recall is preserved exactly: re-scoring bwt_hg38_lowmem vs the full adotto
catalog gives identical region recall 44.36%, precision 66.41%, bp recall
30.60% (the BED varies ~0.1% run-to-run independent of this change).

Updates the Memory section from "NOT met" to "MET" with the before/after table.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Approved design (brainstorming output) for pushing bwtandem recall toward/past
tantan (78%) and adding a CHM13 centromere/satellite strength. Grounded in the
genome-wide recall-gap diagnosis: gap is short-period STRs (period 1-3 = 38.7%
of recoverable, within-class recall 26.6%), driven by copy-count gating and
exact-only Tier1 seeding, not purity; tuning alone caps at ~55% recall.

Phase 1 (genome core, staged): C1 mismatch-tolerant short-STR seeding, C2
rolling-consensus extender (bp-precision), C3 copy-gate relaxation, C4 period
10-20 path, C5 merge relaxation, C6 re-sweep + genome validation. Phase 2:
CHM13 multi-anchor Tier3 satellite seeding + satellite-stratified scoring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
TDD-structured plan for the genome-core recall changes: C1 mismatch-tolerant
short-STR seeding (stitch perfect sub-runs), C2 rolling-consensus extender,
C4 period 10-20 path, C5 merge + purity-gated 2-copy, C6 re-sweep + genome
validation. Synthetic-STR behavioral tests + real-data scoring gates per step.
Phase 2 (centromere/CHM13) deferred to its own plan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…get)

Adds tests/test_maxrecall_seeding.py — a standalone (no pytest) behavioral
test that encodes the recoverable-miss class for short STRs: period 1-3,
82-92% purity, 7-15 copies.  Fixed seed (42) makes results deterministic.

On the comboA baseline (TIER1_MIN_ARRAY_LEN=20 TIER1_MIN_SCORE=20) exactly
one case FAILs:
  FAIL  mono_A_15cop_08mm  (15 bp < required_threshold=20)
With the bare defaults (MIN_ARRAY_LEN=26, MIN_SCORE=30) three FAILs.
These are the misses that C1 (mismatch-tolerant seeder) and C2 (rolling
extender) will fix in later tasks.

Also records chr21 comboA baseline: regRecall=52.33%, regPrec=65.78%.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… recall (#3)

Short perfect STR cores (e.g. a 7-copy dinucleotide = 14 bp) frequently sit
inside a much larger adotto region but were rejected by Tier 1's global
acceptance gate (required_threshold = max(MIN_ARRAY_LEN, ...) and the MIN_SCORE
length*purity gate). This dominated the chr21 short-period (1-9) recall gap.

Add a period-stratified relaxation behind new env knobs, all defaulting to
current behaviour (unset == baseline, exactly reproduces c0):

  TIER1_SHORT_PERIOD_MAX     (default 0 = disabled)
  TIER1_SHORT_MIN_ARRAY_LEN  (default = TIER1_MIN_ARRAY_LEN)
  TIER1_SHORT_MIN_SCORE      (default = TIER1_MIN_SCORE)

When motif_len <= SHORT_PERIOD_MAX the gate uses the relaxed short floors;
longer motifs keep the strict global gate so long-repeat precision is
unaffected. Also add opt-in stitch-seeding (TIER1_STITCH_GAP, default 0)
which merges phase-aligned adjacent perfect sub-runs of the same period.

Measured (full pipeline, real adotto scoring), recommended operating point on
top of comboA: SHORT_PERIOD_MAX=4 SHORT_MIN_ARRAY_LEN=18 SHORT_MIN_SCORE=18
  chr21: region recall 52.33% -> 56.69% (+4.36pp) at precision 59.17%
         p1-9 recall 45.44% -> 50.85% (+5.41pp)
  chr22: region recall 55.69% -> 59.58% (+3.89pp) at precision 65.71%
         p1-9 recall 47.82% -> 52.86% (+5.04pp)
Both clear the +3pp recall target and stay above tantan's 57.66% precision
floor. Stitch-seeding tested but ~neutral on short-period recall (+0.2pp).

Rewrite tests/test_maxrecall_seeding.py into a real regression check for the
gate: a 15 bp mono-A core is rejected at baseline, accepted when
SHORT_PERIOD_MAX>=1 with relaxed floors, and rejected again when
SHORT_PERIOD_MAX=0 (proving the period stratification, not just lower floors,
gates the relaxation).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ed arrays (#3)

The original `extend_with_mismatches` compares every copy to a FIXED seed copy
and breaks on CUMULATIVE mismatch, so diverged STR arrays truncate early
(median ~47.6% coverage of the true adotto span, bp-precision ~28%). This adds
an opt-in rolling extender that:

- maintains a per-position MAJORITY consensus (vote table) updated as copies are
  accepted, comparing each new copy to the current consensus rather than a fixed
  seed copy;
- breaks only after K consecutive "bad" copies (per-copy mismatch fraction over
  the allowed rate), extending in BOTH directions and trimming trailing bad
  copies so the call ends on a good copy;
- drops the period==1 zero-tolerance special case in `_max_mismatch_threshold`
  so homopolymers get the normal allowed rate.

Env-gated for revertibility and baseline preservation (read once at module
load): TIER1_EXT_ROLLING (default 0 = original fixed-consensus cumulative
behaviour, byte-for-byte; 1 = new rolling/windowed) and TIER1_EXT_BAD_RUN
(K consecutive bad copies, default 2, only used when ROLLING=1). Function
signature is unchanged so tier1.py / tier2.py / bwt_seed.py need no edits.

Adds tests/test_maxrecall_extender.py: with ROLLING=1 (+ Task1 period-stratified
flags) Tier1's best call covers >=90% of an implanted diverged (AC)x40 @90%
array; the default (ROLLING unset) path still runs without crashing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The windowed "K consecutive bad copies" break alone runs away on real
low-complexity / homopolymer DNA: for period 1 a single-base copy can never
exceed per_copy_max, so bad_run never increments and the extension walks the
whole array; the rolling consensus also drifts and tolerates non-periodic
sequence. On chr21 this over-extended period-1/2 calls across long stretches,
which both hurt precision and exploded downstream refine_repeat DP cost (full
chromosome did not finish in ~70 min; tier1-only on a 1 Mb slice went from 3 s
to many minutes).

Add a cumulative-purity guard mirroring the old path's _total_mismatches cap but
measured against the rolling consensus: stop extending in a direction once the
running mismatch fraction over the accepted span would exceed the allowed rate
(with +1 grace so a lone SNP in a short array is not cut). This preserves the
full-span extension of genuinely periodic diverged arrays (perfect/diverged
(AC)xN still reach their true ends) while preventing runaway in random DNA
(random p1/p2 now stop at span ~3-4 instead of the whole array). Tier1-only on
the 1 Mb slice is back to ~3 s (parity with rolling OFF).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ion wall)

New opt-in Tier-1 source (TIER1_FMSCAN, default OFF) that recovers diverged
short-period STRs (p1-9) the exact adjacent-copy seeder misses, AT high
precision -- something gate-relaxation could not do.

Method: enumerate primitive p-mers, locate all FM-index occurrences, detect
gap-tolerant periodic runs (missing copies = diverged copies absorbed), and
accept only runs that clear a DENSITY floor (observed/expected copies) and a
Poisson LOG-LIKELIHOOD-RATIO vs i.i.d. background. This statistical filter is
the discriminator crude length/score gates lack: it separates real-but-diverged
arrays from random low-complexity DNA. Longest-period-first + a greedy seen_mask
report each array once by its best motif/phase, keeping candidate counts low.

Additive mode (=1) runs on the residual the sliding-window scan did not claim,
so it only ADDS missed diverged STRs.

Validated dominance over the gate-tuning OP1 (base env), adotto + ultra/tantan
adjusted precision:
  chr21: 52.33 -> 58.97 region recall at equal precision (raw 65.8, adj 83.1->83.6)
  chr22: 55.69 -> 61.60 region recall at equal precision (raw 70-72, adj ~86)
  short p1-9 recall: +8.4 (chr21) / +7.3 (chr22); +1314 p1-9 regions on chr21.

Default OFF preserves baseline byte-for-byte. See
docs/2026-06-22-exp1-fmscan-diverged-str-detector.md for the frontier table,
runtime, and tuning knobs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… flags

int()/float() crash on an empty string, so an exported-but-empty env var
(e.g. TIER1_FMSCAN="") would abort tier1 init. Switch the NEW opt-in flags
(TIER1_FMSCAN*, TIER1_SHORT_PERIOD_MAX, TIER1_SHORT_MIN_ARRAY_LEN,
TIER1_SHORT_MIN_SCORE) to `os.environ.get(...) or "<default>"` so empty and
unset both fall back to the documented default. Valid values are unchanged.

Verified: tests/test_maxrecall_seeding.py passes; constructing Tier1STRFinder
with all new flags set to "" yields the exact documented defaults.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tier2 long-unit "Phase A (SA+LCP)" segfaulted on CHM13 centromeric
satellite arrays (alpha-HOR chrX, HSat3 chr9). faulthandler + ASan
pinpointed the root cause in the per-copy banded-DP aligner, not the
LCP/extend path that was first suspected.

Root cause: int32 overflow in the traceback-table sizing/indexing of
both aligners. On a satellite, Phase A's LCP scan emits spurious
candidate periods up to ~max_period (e.g. 98,728 bp). refine_repeat()
then aligns a ~99 kb "motif", so the full (m+1)x(w+1) DP table is
~1e10 cells. That product was computed in 32-bit int:
  - C  align_accel.c:align_one          malloc((m+1)*cols) and i*cols
  - Cython align_unit_to_window         malloc(rows*cols)  and i*cols
Both wrapped past INT32_MAX to a small/negative size, malloc returned
an undersized buffer, and the DP fill wrote far out of bounds ->
heap corruption -> SIGSEGV. Proven with AddressSanitizer:
"signed integer overflow: 98729 * 108602 ... heap-buffer-overflow
WRITE ... in align_one".

Fix (root cause, not a mask):
- align_accel.c / _accelerators.pyx: compute table size and every
  i*cols index in 64-bit (size_t/long); add a sane O(m*w) table-size
  cap (256 M cells, ~16 kb motif) so an absurd pseudo-motif is skipped
  (return 0/None) instead of overflowing or allocating ~10 GB. Adds
  the missing malloc-failure guards. No behaviour change for any motif
  size that worked before.
- tier2.py: skip Phase A per-copy DP for periods above a tunable bound
  (TIER2_LONGUNIT_DP_MAX_PERIOD, default 8192). On satellites the huge
  spurious LCP-multiple periods yield no valid repeat yet caused an
  ~30 GB / OOM blow-up; long real units are recovered by Phase B and
  Tier 3. This makes the arrays actually run to completion.

Verification:
- chrX alpha-HOR (3.15 Mb): completes, no SIGSEGV, Phase A "found 109",
  88 repeats, 92.8% bp-recall / 100% precision, 308 s, 0.35 GB peak.
- chr9 HSat3 (3.0 Mb): completes, no SIGSEGV, Phase A "found 202",
  21,496 repeats, 71.8% bp-recall / 100% precision, ~25 min, 9.7 GB.
- Regression: Arabidopsis ChrC and ChrM (tier1,tier2,tier3) BED output
  byte-identical before vs after; 30k-case differential of the C
  aligner vs the pre-fix build matches except for inputs that hit the
  aligner's pre-existing uninitialised-column-0 UB.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
_fill_satellite_gaps left large interior gaps in centromeric HOR arrays
uncovered: ~93% of CHM13 bp-recall misses were array-interior, not edge.
Three gates blocked the backstop from filling divergent alpha-satellite:

  - satellite anchors gated to motif 100-300bp, so the large HOR-unit calls
    (motif 2057-6171bp) flanking a gap never marked it near-satellite
  - identity floor 0.55 rejected divergent arrays (~0.46-0.54 monomer identity)
  - autocorrelation period window 100-300 excluded the dimeric HOR period
    (2x171=342) where the most divergent array (chr3) peaks

All three are now env-overridable (SAT_FILL_MIN_IDENTITY=0.45,
SAT_FILL_MIN/MAX_PERIOD=100/360, SAT_ANCHOR_MIN/MAX_MOTIF=100/0) with the
improved values as defaults.

CHM13 6-array benchmark: mean bp-recall 92.5%->99.8% (per-array 99.5-99.99%,
was 90.6-95.8%), mean bp-precision held/improved 94.2->94.6%. chr3 (hardest)
91.06->99.50%. Neutral on the chr21/chr22 GRCh38 adotto short-STR benchmark
(region recall/precision flat within 0.06pp). Regression test added with a
random-gap negative control.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WsKubE578JXkjsVTL4MPo2
Add an Environment bullet to Dependencies: environment.yml defines the `bwt`
conda env (pytest + competitor tools trf/tantan/tidehunter/ncrf/mreps),
pydivsufsort needs --no-build-isolation, and the benchmark scripts call a
separate prebuilt interpreter than the one in environment.yml.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WsKubE578JXkjsVTL4MPo2
Staged config→code loop to push adotto region recall toward 82% while holding
precision >= 57.66% (tantan floor). Builds on 2026-06-21 max-recall diagnosis;
remaining levers = period 10-20 (C4), FM-scan p7-9, merge (C5). chr21/22 proxy
+ full-genome validation, /loop self-paced, ledger-based memory, sbatch-only
(login node 4GB).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
)

The period 10-20 notch: Tier2's exact-seeding paths (LCP plateau >=8bp,
exact k-mer recurrence) find ZERO seeds in arrays carrying a substitution in
every copy, so diverged short/medium STRs (the largest adotto recall gap vs
ULTRA: 59% -> 89%, ~358k regions) are missed entirely.

Phase C (opt-in TIER2_APPROX_SEED, default off = baseline unchanged) detects
local periodicity directly: per period p, vectorized identity between
offset-p windows (cumsum sliding window, O(n)); seed each high-identity run at
its max-identity interior position; extend + refine for boundaries. Fixes vs a
naive version: (1) seed at run argmax not boundary (clean consensus); (2) motif
from the interior seed copy not the over-extended full_start; (3) trim-retry
refine spans inward by one period when the extender creeps into flank; (4) build
Phase C exclusion from tier1 + accepted results only, ignoring Phase B's
speculative covered-mask marks that otherwise hide whole arrays.

Test: tests/test_loop_p1020.py — a diverged p13 array baseline misses entirely
(0bp) is recovered to 519/520bp. Regression: 14 passed (Tier1/2/3 ground-truth
+ satellite gap-fill), baseline byte-unchanged with the flag off.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
The first seeder version hung on real chromosomes: low-complexity DNA clears the
identity threshold over millions of single windows, and np.split + a per-seed
extend call on each made Phase C effectively non-terminating (>31 min on chr21).

Rewrite: vectorized contiguous-run extraction over the identity mask; a
min-run-length filter (>= period window-starts) drops single-window noise; array
bounds are taken directly from the run extent (no per-seed extend_with_mismatches);
the refine start is phase-aligned to the seed copy (an unaligned start lands
mid-copy and refine_repeat fails to lock on); per-period seed cap
TIER2_APPROX_MAX_SEEDS (default 200000) bounds the worst case.

Result on chr21: Phase C 31 min -> 29.7 s, recovering 1688 diverged period-10-20
arrays the exact-seed paths miss. Synthetic test still recovers 519/520 bp;
regression 14 passed; baseline unchanged with TIER2_APPROX_SEED off.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
…8% precision

The 2026-06-23 maximization loop's validated full-genome adotto operating point
(v2.1): comboChi base + TIER2_MISMATCH=0.30 + TIER1_FMSCAN_MIN_DENSITY=0.45 +
TIER1_FMSCAN_MIN_LLR=6.0. Region recall 57.62->59.04 (+1.42pp), bp recall
35.11->39.42 (+4.31pp), region precision 58.98 (holds tantan's 57.66% floor —
bwtandem stays the de-novo precision leader). 82% region recall is unreachable at
>=57.66% precision (ULTRA reaches 82% only at 53.65%; env-lever ceiling ~65%
proxy at the floor, gate-lowering collapses precision).

The period-10-20 autocorrelation seeder (29ab962) is documented opt-in/default-off
for adotto (region recall flat, bp precision 31->13.5% from low-complexity
over-call); kept for potential satellite use.

CLAUDE.md tuning section + benchmarking_results_updated.md Exp1 row updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
The 3-tier pipeline only emits where it can form a seed, so diverged short STRs
(a substitution in most copies) are missed entirely — the dominant region-recall
gap vs ULTRA/tantan. _catchall_periodicity_fill (opt-in CATCHALL_SCAN, default
off = baseline unchanged) is a low-stringency autocorrelation sweep over short
periods (CATCHALL_MIN_P..MAX_P, default 1-20) in the regions NO tier/satellite
pass covered: for each period it computes vectorized identity between offset-p
windows (cumsum sliding window, O(n)), extracts high-identity runs above
CATCHALL_MIN_IDENTITY (default 0.72) in uncovered DNA, and emits a call per run
(longest period first so genuine k-mer STRs claim their region). It deliberately
trades precision for recall — the identity/length knobs (CATCHALL_MIN_IDENTITY,
CATCHALL_MIN_LEN) set the operating point for the ULTRA-level (~53% precision)
target.

chr21: +26978 calls in 16 s (52560 total), EXIT=0. Regression 14 passed;
baseline byte-unchanged with CATCHALL_SCAN off.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
…ULTRA)

PHASE 3 of the recall loop: with the precision target relaxed to ULTRA's level
(~53%), the new catch-all periodicity algorithm (CATCHALL_SCAN, finder.py 15d4273)
breaks the recall ceiling. Validated full-genome adotto frontier on the v2.2 gate
base (v2.1 env + period-9 gate floors 17 + CATCHALL_SCAN=1):
  - identity 0.76 (+MAX_P 50): 72.88% recall / 52.72% precision  (~ULTRA precision,
    +15.3pp recall over v2's 57.62%)
  - identity 0.72: 82.35% recall / 47.99% precision  (beats ULTRA's 81.62% recall)
  - identity 0.68: 84.42% recall / 42.61% precision  (max recall)
bwtandem now reaches/exceeds ULTRA-level recall; frontier close to but slightly
inside ULTRA's 81.62/53.65 corner. bp precision (~32%) is the cost (low-complexity
over-call); a post-call quality filter is the next precision lever. v2.1
(59.04/58.98) stays the precision-leader op-point; catch-all is opt-in/default-off.

CLAUDE.md tuning + benchmarking_results_updated.md Exp1 updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
Add two intrinsic filters inside _catchall_periodicity_fill to recover the
~32% bp / ~48% region precision the catch-all trades for recall, without a
re-run:
- CATCHALL_MIN_COPIES (default 2 = no-op; window=2*period already forces
  copies>=2). Raising to 3 drops the shortest 2-copy over-calls.
- CATCHALL_MIN_ENTROPY (default 0 = off; per-base Shannon entropy via new
  _seq_shannon_entropy) drops homopolymer-ish low-complexity calls.

Defaults reproduce baseline exactly. chr21+chr22 proxy sweep (f_* variants)
picks CATCHALL_MIN_COPIES=3 as the op-point: pool precision 50.00%->51.57%
(+1.57pp) at pool recall 84.05%->82.73% (held >=82%). Entropy gate gave ~0
precision gain on adotto region precision, left off.

Regression: test_ground_truth (11), test_loop_p1020, test_satellite_gapfill
all pass (CATCHALL off = baseline).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
finder._fill_satellite_gaps, finder._catchall_periodicity_fill and
tier2._autocorr_seed each independently re-derived the same "self-similarity at
offset p" math; the catch-all and tier2 seeders carried a verbatim ~15-line
cumsum-windowed-count + int8-view run-extraction block. Factor the period math
into src/autocorr.py:
- autocorr_identity(arr, period)            -> scalar (satellite gap-fill)
- windowed_match_counts(arr, period, window)-> O(n) sliding window (catch-all, tier2)
- contiguous_true_runs(mask)                -> run starts/ends (catch-all, tier2)

Behavior-preserving by construction (same numpy ops). Verified:
- tests/test_autocorr.py (12) asserts each helper reproduces the exact inline
  block (cumsum int64, int8-view diff) + brute-force identity.
- chr21 catch-all output byte-identical before/after (13886 calls, 0 diff).
- test_ground_truth (11) / test_loop_p1020 / test_satellite_gapfill all pass.
- The residual tier2/satellite chr21 diff is pre-existing threads>=2
  non-determinism: two same-code reruns differ by 53 lines, > the 44-line
  before/after diff, with catch-all identical in all cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
…aize 3A +96%

Re-ran Filip's cross-tool benchmark on all 3 species with the catch-all ENABLED
(CATCHALL_SCAN=1 id0.72 + MIN_COPIES=3 precision filter), full-genome:
- Human GRCh38: region recall 57.62->80.69% at 50.2% raw / 79.1% adjusted precision
  (catchF). The MIN_COPIES=3 filter recovers +1.7pp adjusted precision over catchH.
- Arabidopsis Col-CEN: neutral->negative (CEN180 monomer recall saturated 99.67->99.68%,
  bp precision 65.5->60.7%) — satellite-dominated, period outside catch-all range.
- Maize Mo17: 3A microsatellite bp +96% (11.58M->22.72M); 3B/3C satellite unchanged.

Conclusion: catch-all is a short-STR/microsatellite mechanism (period<=20bp) — ON for
STR-focused runs, OFF for satellite/centromere runs. Runs used 64GB (190GB request
stalled ~12 days in the full PGL partition; SLURM allocates by request not usage).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
…ecies results)

Self-contained page for Filip: what's new (catch-all + precision filter + refactor),
build, how to run (op-point env table: catchF/catchH/catchT/OFF), exact per-genome
commands, all-tools result tables (Exp1 human / Exp2 Col-CEN / Exp3 maize 3A/3B/3C),
how to score, and Pronghorn ops notes (64GB not 190GB; scontrol broken).

Key results: human region recall 57.6->80.7% (79.1% adjusted prec), maize 3A
microsatellite +96%, Col-CEN/maize-satellite unchanged. Rule: catch-all ON for
STR/microsatellite runs, OFF for satellite/centromere runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
@wyim-pgl

Copy link
Copy Markdown
Owner Author

@framazan I've sent you a collaborator invite (read access) for this repo — please accept it at https://github.com/wyim-pgl/bwt-algorithm/invitations and I'll add you as a reviewer. This PR adds the catch-all STR detection algorithm and a full 3-species benchmark for your cross-tool comparison; the run methods, op-points, all-tools result tables, and scoring steps are in docs/2026-06-25-catch-all-benchmark-for-filip.md. TL;DR: human de-novo recall 23%→80.7% (79.1% adjusted precision), maize 3A microsatellite +96%, satellite families unchanged at max. Catch-all is opt-in (default OFF). Happy to walk through it.

@wyim-pgl

Copy link
Copy Markdown
Owner Author

Friendly reminder, @framazan 🙂 — the collaborator invite (so I can formally request your review on this PR) is still pending. When you get a chance, please accept it here: https://github.com/wyim-pgl/bwt-algorithm/invitations — once you do, the review request goes out automatically. Run commands + per-genome results are in docs/2026-06-25-catch-all-benchmark-for-filip.md. Thanks!

@wyim-pgl wyim-pgl requested a review from framazan June 28, 2026 16:41
@wyim-pgl

Copy link
Copy Markdown
Owner Author

Thanks for accepting, @framazan! Review request added. See docs/2026-06-25-catch-all-benchmark-for-filip.md for run commands + results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant