Catch-all STR detection + Exp1 de-novo sensitivity overhaul (recall 23%→80.7%, ULTRA-level)#6
Catch-all STR detection + Exp1 de-novo sensitivity overhaul (recall 23%→80.7%, ULTRA-level)#6wyim-pgl wants to merge 31 commits into
Conversation
Document the verified non-Docker Linux install: conda-forge compiler env, the channel-priority solver-hang workaround (--override-channels), and the required Cython/C extension build (without it the tool reports 0 repeats). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Design/spec for closing BWTandem's adotto recall gap (36.87% vs ultra/tantan 75-81%) on the general-TR benchmark. Records the verified diagnosis (real short-STR sensitivity gap, not a naming artifact), the agreed strategy (parameter sweep first, then surgical code changes; re-score all tools on a downloaded public adotto catalog), and the six-step plan. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Expose Tier1's hardcoded detection thresholds as env-var overrides so the short-tandem-repeat sensitivity/precision trade-off can be swept without recompiling. Defaults reproduce the original behaviour exactly (verified: a fresh chr21 run with no env vars matches the published baseline — 29.50% adotto region recall, 79.50% precision). New knobs (TIER1_*): MIN_COPIES, MIN_ARRAY_LEN, MIN_SCORE, COPYBASE, COPYADD (dynamic_min_copies formula), EXT_COPIES (perfect seed copies before mismatch extension), MISMATCH, MIN_ENTROPY. Context (#3, Human GRCh38 general-TR): the recall gap vs ultra/tantan is concentrated in short periods (<=20 bp, 85% of adotto regions). On chr21, TIER1_MIN_ARRAY_LEN=20 TIER1_MIN_SCORE=20 (+relaxed copies) lifts overall adotto region recall 29.50% -> 50.76% while keeping adjusted precision (call in adotto OR confirmed by ultra/tantan) at 85.1% and raw precision (67.7%) still above ultra (54.2%). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
tier1: add an opt-in low-complexity entropy gate (TIER1_ENTROPY_GATE, default off). Empirically it does NOT help on short STRs — it removes real low-entropy repeats (AT/AAT-type) and recall drops without precision gain, because spurious and genuine short STRs are intrinsically indistinguishable. Kept as a documented, default-off option. tier2: expose the simple-scan detection thresholds for the period-10-20 regime as env vars (TIER2_MIN_COPIES, TIER2_MIN_ARRAY_LEN, TIER2_SHORT_REQ_COPIES, TIER2_MISMATCH). Defaults reproduce the original behaviour exactly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tuned config (comboA) lifts adotto region recall 29.5->52.3% (chr21) and 29.8->55.7% (chr22) while keeping adjusted precision ~85% and raw precision above ultra/tantan. Documents the diagnosis (gap concentrated in period <=20), the failed entropy-gate experiment, the Tier2 period-10-20 contribution, and the honest ceiling. Full-genome SLURM validation pending. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a 'Tuning detection sensitivity (env vars)' section (TIER1_*/TIER2_* thresholds, all defaulting to original behaviour) and a note that only _accelerators.pyx edits need a Cython rebuild — pure-Python tier edits do not. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Full GRCh38 primary (chr1-22,X,Y) validation of comboA (SLURM job 5713685, 2h44m, 1,528,753 calls). All tools re-scored identically vs the full adotto v1.2.1 catalog (1,784,804 GT regions): - adotto region recall 23.03% -> 44.36% (+21.3 pp, ~1.93x); bp recall 22.34% -> 30.60% — the chr21/22 ~2x gain generalises genome-wide. - improved bwtandem overtakes trf on recall (44.36% vs 31.88%) and stays the most precise de novo caller (raw 66.41% > ultra 53.65%, tantan 57.66%). - adjusted precision 88.4% -> 82.9%; of the new out-of-catalog calls 49.1% are ultra/tantan-confirmed, so genuine over-calling is only ~17% of calls. - honest ceiling confirmed: still below the ultra/tantan 78-82% recall band. Memory constraint NOT met: peak ~149 GB RSS at --threads 4 (vs 41.98 GB reference), dominated by 4 concurrent per-chromosome FM-indexes; reduction (plan item #5 / lower --threads) remains open. Recall result stands independently. Replaces the Pending section with the final genome-wide table + reproduce steps. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
BWTCore built two O(n) Python structures that nothing reads: - kmer_hash: its only consumer get_kmer_positions() is never called (and falls back to the FM-index anyway), yet __init__ built a dict of per-position Python int lists for the whole chromosome. - sampled_sa: its only consumer _get_suffix_position() is never called; locate_positions() reads suffix_array directly. With the hardcoded sa_sample_rate=1 (finder.py) this duplicated the entire suffix array as a Python int->int dict. Both grew O(n) in Python objects and dominated per-chromosome RAM. Setting them empty cuts chr21 single-thread peak RSS 8.60 GB -> 1.37 GB (6.3x) with the adotto metric unchanged: region recall 52.33%, precision 65.78%, bp recall 42.83% are identical across 3 old vs 3 new runs. (The tool's BED already varies ~0.1% run-to-run -- old vs old differ too -- and new output stays within that band, so exact-diff was never the right oracle; metric preservation is.) get_kmer_positions() still works (empty hash -> FM-index fallback). Addresses the Exp1 #3 memory item; full-genome re-measurement to follow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Re-ran the full GRCh38 primary genome with the dead-structure fix (40d4f2d) at --threads 4 (SLURM job 5713697, 2h37m). Peak RSS dropped 149.18 GB -> 29.63 GB (5.0x), now well under the 41.98 GB reference target, and ~7 min faster. Recall is preserved exactly: re-scoring bwt_hg38_lowmem vs the full adotto catalog gives identical region recall 44.36%, precision 66.41%, bp recall 30.60% (the BED varies ~0.1% run-to-run independent of this change). Updates the Memory section from "NOT met" to "MET" with the before/after table. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Approved design (brainstorming output) for pushing bwtandem recall toward/past tantan (78%) and adding a CHM13 centromere/satellite strength. Grounded in the genome-wide recall-gap diagnosis: gap is short-period STRs (period 1-3 = 38.7% of recoverable, within-class recall 26.6%), driven by copy-count gating and exact-only Tier1 seeding, not purity; tuning alone caps at ~55% recall. Phase 1 (genome core, staged): C1 mismatch-tolerant short-STR seeding, C2 rolling-consensus extender (bp-precision), C3 copy-gate relaxation, C4 period 10-20 path, C5 merge relaxation, C6 re-sweep + genome validation. Phase 2: CHM13 multi-anchor Tier3 satellite seeding + satellite-stratified scoring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
TDD-structured plan for the genome-core recall changes: C1 mismatch-tolerant short-STR seeding (stitch perfect sub-runs), C2 rolling-consensus extender, C4 period 10-20 path, C5 merge + purity-gated 2-copy, C6 re-sweep + genome validation. Synthetic-STR behavioral tests + real-data scoring gates per step. Phase 2 (centromere/CHM13) deferred to its own plan. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…get) Adds tests/test_maxrecall_seeding.py — a standalone (no pytest) behavioral test that encodes the recoverable-miss class for short STRs: period 1-3, 82-92% purity, 7-15 copies. Fixed seed (42) makes results deterministic. On the comboA baseline (TIER1_MIN_ARRAY_LEN=20 TIER1_MIN_SCORE=20) exactly one case FAILs: FAIL mono_A_15cop_08mm (15 bp < required_threshold=20) With the bare defaults (MIN_ARRAY_LEN=26, MIN_SCORE=30) three FAILs. These are the misses that C1 (mismatch-tolerant seeder) and C2 (rolling extender) will fix in later tasks. Also records chr21 comboA baseline: regRecall=52.33%, regPrec=65.78%. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… recall (#3) Short perfect STR cores (e.g. a 7-copy dinucleotide = 14 bp) frequently sit inside a much larger adotto region but were rejected by Tier 1's global acceptance gate (required_threshold = max(MIN_ARRAY_LEN, ...) and the MIN_SCORE length*purity gate). This dominated the chr21 short-period (1-9) recall gap. Add a period-stratified relaxation behind new env knobs, all defaulting to current behaviour (unset == baseline, exactly reproduces c0): TIER1_SHORT_PERIOD_MAX (default 0 = disabled) TIER1_SHORT_MIN_ARRAY_LEN (default = TIER1_MIN_ARRAY_LEN) TIER1_SHORT_MIN_SCORE (default = TIER1_MIN_SCORE) When motif_len <= SHORT_PERIOD_MAX the gate uses the relaxed short floors; longer motifs keep the strict global gate so long-repeat precision is unaffected. Also add opt-in stitch-seeding (TIER1_STITCH_GAP, default 0) which merges phase-aligned adjacent perfect sub-runs of the same period. Measured (full pipeline, real adotto scoring), recommended operating point on top of comboA: SHORT_PERIOD_MAX=4 SHORT_MIN_ARRAY_LEN=18 SHORT_MIN_SCORE=18 chr21: region recall 52.33% -> 56.69% (+4.36pp) at precision 59.17% p1-9 recall 45.44% -> 50.85% (+5.41pp) chr22: region recall 55.69% -> 59.58% (+3.89pp) at precision 65.71% p1-9 recall 47.82% -> 52.86% (+5.04pp) Both clear the +3pp recall target and stay above tantan's 57.66% precision floor. Stitch-seeding tested but ~neutral on short-period recall (+0.2pp). Rewrite tests/test_maxrecall_seeding.py into a real regression check for the gate: a 15 bp mono-A core is rejected at baseline, accepted when SHORT_PERIOD_MAX>=1 with relaxed floors, and rejected again when SHORT_PERIOD_MAX=0 (proving the period stratification, not just lower floors, gates the relaxation). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ed arrays (#3) The original `extend_with_mismatches` compares every copy to a FIXED seed copy and breaks on CUMULATIVE mismatch, so diverged STR arrays truncate early (median ~47.6% coverage of the true adotto span, bp-precision ~28%). This adds an opt-in rolling extender that: - maintains a per-position MAJORITY consensus (vote table) updated as copies are accepted, comparing each new copy to the current consensus rather than a fixed seed copy; - breaks only after K consecutive "bad" copies (per-copy mismatch fraction over the allowed rate), extending in BOTH directions and trimming trailing bad copies so the call ends on a good copy; - drops the period==1 zero-tolerance special case in `_max_mismatch_threshold` so homopolymers get the normal allowed rate. Env-gated for revertibility and baseline preservation (read once at module load): TIER1_EXT_ROLLING (default 0 = original fixed-consensus cumulative behaviour, byte-for-byte; 1 = new rolling/windowed) and TIER1_EXT_BAD_RUN (K consecutive bad copies, default 2, only used when ROLLING=1). Function signature is unchanged so tier1.py / tier2.py / bwt_seed.py need no edits. Adds tests/test_maxrecall_extender.py: with ROLLING=1 (+ Task1 period-stratified flags) Tier1's best call covers >=90% of an implanted diverged (AC)x40 @90% array; the default (ROLLING unset) path still runs without crashing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The windowed "K consecutive bad copies" break alone runs away on real low-complexity / homopolymer DNA: for period 1 a single-base copy can never exceed per_copy_max, so bad_run never increments and the extension walks the whole array; the rolling consensus also drifts and tolerates non-periodic sequence. On chr21 this over-extended period-1/2 calls across long stretches, which both hurt precision and exploded downstream refine_repeat DP cost (full chromosome did not finish in ~70 min; tier1-only on a 1 Mb slice went from 3 s to many minutes). Add a cumulative-purity guard mirroring the old path's _total_mismatches cap but measured against the rolling consensus: stop extending in a direction once the running mismatch fraction over the accepted span would exceed the allowed rate (with +1 grace so a lone SNP in a short array is not cut). This preserves the full-span extension of genuinely periodic diverged arrays (perfect/diverged (AC)xN still reach their true ends) while preventing runaway in random DNA (random p1/p2 now stop at span ~3-4 instead of the whole array). Tier1-only on the 1 Mb slice is back to ~3 s (parity with rolling OFF). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ion wall) New opt-in Tier-1 source (TIER1_FMSCAN, default OFF) that recovers diverged short-period STRs (p1-9) the exact adjacent-copy seeder misses, AT high precision -- something gate-relaxation could not do. Method: enumerate primitive p-mers, locate all FM-index occurrences, detect gap-tolerant periodic runs (missing copies = diverged copies absorbed), and accept only runs that clear a DENSITY floor (observed/expected copies) and a Poisson LOG-LIKELIHOOD-RATIO vs i.i.d. background. This statistical filter is the discriminator crude length/score gates lack: it separates real-but-diverged arrays from random low-complexity DNA. Longest-period-first + a greedy seen_mask report each array once by its best motif/phase, keeping candidate counts low. Additive mode (=1) runs on the residual the sliding-window scan did not claim, so it only ADDS missed diverged STRs. Validated dominance over the gate-tuning OP1 (base env), adotto + ultra/tantan adjusted precision: chr21: 52.33 -> 58.97 region recall at equal precision (raw 65.8, adj 83.1->83.6) chr22: 55.69 -> 61.60 region recall at equal precision (raw 70-72, adj ~86) short p1-9 recall: +8.4 (chr21) / +7.3 (chr22); +1314 p1-9 regions on chr21. Default OFF preserves baseline byte-for-byte. See docs/2026-06-22-exp1-fmscan-diverged-str-detector.md for the frontier table, runtime, and tuning knobs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… flags int()/float() crash on an empty string, so an exported-but-empty env var (e.g. TIER1_FMSCAN="") would abort tier1 init. Switch the NEW opt-in flags (TIER1_FMSCAN*, TIER1_SHORT_PERIOD_MAX, TIER1_SHORT_MIN_ARRAY_LEN, TIER1_SHORT_MIN_SCORE) to `os.environ.get(...) or "<default>"` so empty and unset both fall back to the documented default. Valid values are unchanged. Verified: tests/test_maxrecall_seeding.py passes; constructing Tier1STRFinder with all new flags set to "" yields the exact documented defaults. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tier2 long-unit "Phase A (SA+LCP)" segfaulted on CHM13 centromeric satellite arrays (alpha-HOR chrX, HSat3 chr9). faulthandler + ASan pinpointed the root cause in the per-copy banded-DP aligner, not the LCP/extend path that was first suspected. Root cause: int32 overflow in the traceback-table sizing/indexing of both aligners. On a satellite, Phase A's LCP scan emits spurious candidate periods up to ~max_period (e.g. 98,728 bp). refine_repeat() then aligns a ~99 kb "motif", so the full (m+1)x(w+1) DP table is ~1e10 cells. That product was computed in 32-bit int: - C align_accel.c:align_one malloc((m+1)*cols) and i*cols - Cython align_unit_to_window malloc(rows*cols) and i*cols Both wrapped past INT32_MAX to a small/negative size, malloc returned an undersized buffer, and the DP fill wrote far out of bounds -> heap corruption -> SIGSEGV. Proven with AddressSanitizer: "signed integer overflow: 98729 * 108602 ... heap-buffer-overflow WRITE ... in align_one". Fix (root cause, not a mask): - align_accel.c / _accelerators.pyx: compute table size and every i*cols index in 64-bit (size_t/long); add a sane O(m*w) table-size cap (256 M cells, ~16 kb motif) so an absurd pseudo-motif is skipped (return 0/None) instead of overflowing or allocating ~10 GB. Adds the missing malloc-failure guards. No behaviour change for any motif size that worked before. - tier2.py: skip Phase A per-copy DP for periods above a tunable bound (TIER2_LONGUNIT_DP_MAX_PERIOD, default 8192). On satellites the huge spurious LCP-multiple periods yield no valid repeat yet caused an ~30 GB / OOM blow-up; long real units are recovered by Phase B and Tier 3. This makes the arrays actually run to completion. Verification: - chrX alpha-HOR (3.15 Mb): completes, no SIGSEGV, Phase A "found 109", 88 repeats, 92.8% bp-recall / 100% precision, 308 s, 0.35 GB peak. - chr9 HSat3 (3.0 Mb): completes, no SIGSEGV, Phase A "found 202", 21,496 repeats, 71.8% bp-recall / 100% precision, ~25 min, 9.7 GB. - Regression: Arabidopsis ChrC and ChrM (tier1,tier2,tier3) BED output byte-identical before vs after; 30k-case differential of the C aligner vs the pre-fix build matches except for inputs that hit the aligner's pre-existing uninitialised-column-0 UB. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
_fill_satellite_gaps left large interior gaps in centromeric HOR arrays
uncovered: ~93% of CHM13 bp-recall misses were array-interior, not edge.
Three gates blocked the backstop from filling divergent alpha-satellite:
- satellite anchors gated to motif 100-300bp, so the large HOR-unit calls
(motif 2057-6171bp) flanking a gap never marked it near-satellite
- identity floor 0.55 rejected divergent arrays (~0.46-0.54 monomer identity)
- autocorrelation period window 100-300 excluded the dimeric HOR period
(2x171=342) where the most divergent array (chr3) peaks
All three are now env-overridable (SAT_FILL_MIN_IDENTITY=0.45,
SAT_FILL_MIN/MAX_PERIOD=100/360, SAT_ANCHOR_MIN/MAX_MOTIF=100/0) with the
improved values as defaults.
CHM13 6-array benchmark: mean bp-recall 92.5%->99.8% (per-array 99.5-99.99%,
was 90.6-95.8%), mean bp-precision held/improved 94.2->94.6%. chr3 (hardest)
91.06->99.50%. Neutral on the chr21/chr22 GRCh38 adotto short-STR benchmark
(region recall/precision flat within 0.06pp). Regression test added with a
random-gap negative control.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WsKubE578JXkjsVTL4MPo2
Add an Environment bullet to Dependencies: environment.yml defines the `bwt` conda env (pytest + competitor tools trf/tantan/tidehunter/ncrf/mreps), pydivsufsort needs --no-build-isolation, and the benchmark scripts call a separate prebuilt interpreter than the one in environment.yml. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01WsKubE578JXkjsVTL4MPo2
Staged config→code loop to push adotto region recall toward 82% while holding precision >= 57.66% (tantan floor). Builds on 2026-06-21 max-recall diagnosis; remaining levers = period 10-20 (C4), FM-scan p7-9, merge (C5). chr21/22 proxy + full-genome validation, /loop self-paced, ledger-based memory, sbatch-only (login node 4GB). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
) The period 10-20 notch: Tier2's exact-seeding paths (LCP plateau >=8bp, exact k-mer recurrence) find ZERO seeds in arrays carrying a substitution in every copy, so diverged short/medium STRs (the largest adotto recall gap vs ULTRA: 59% -> 89%, ~358k regions) are missed entirely. Phase C (opt-in TIER2_APPROX_SEED, default off = baseline unchanged) detects local periodicity directly: per period p, vectorized identity between offset-p windows (cumsum sliding window, O(n)); seed each high-identity run at its max-identity interior position; extend + refine for boundaries. Fixes vs a naive version: (1) seed at run argmax not boundary (clean consensus); (2) motif from the interior seed copy not the over-extended full_start; (3) trim-retry refine spans inward by one period when the extender creeps into flank; (4) build Phase C exclusion from tier1 + accepted results only, ignoring Phase B's speculative covered-mask marks that otherwise hide whole arrays. Test: tests/test_loop_p1020.py — a diverged p13 array baseline misses entirely (0bp) is recovered to 519/520bp. Regression: 14 passed (Tier1/2/3 ground-truth + satellite gap-fill), baseline byte-unchanged with the flag off. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
The first seeder version hung on real chromosomes: low-complexity DNA clears the identity threshold over millions of single windows, and np.split + a per-seed extend call on each made Phase C effectively non-terminating (>31 min on chr21). Rewrite: vectorized contiguous-run extraction over the identity mask; a min-run-length filter (>= period window-starts) drops single-window noise; array bounds are taken directly from the run extent (no per-seed extend_with_mismatches); the refine start is phase-aligned to the seed copy (an unaligned start lands mid-copy and refine_repeat fails to lock on); per-period seed cap TIER2_APPROX_MAX_SEEDS (default 200000) bounds the worst case. Result on chr21: Phase C 31 min -> 29.7 s, recovering 1688 diverged period-10-20 arrays the exact-seed paths miss. Synthetic test still recovers 519/520 bp; regression 14 passed; baseline unchanged with TIER2_APPROX_SEED off. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
…8% precision The 2026-06-23 maximization loop's validated full-genome adotto operating point (v2.1): comboChi base + TIER2_MISMATCH=0.30 + TIER1_FMSCAN_MIN_DENSITY=0.45 + TIER1_FMSCAN_MIN_LLR=6.0. Region recall 57.62->59.04 (+1.42pp), bp recall 35.11->39.42 (+4.31pp), region precision 58.98 (holds tantan's 57.66% floor — bwtandem stays the de-novo precision leader). 82% region recall is unreachable at >=57.66% precision (ULTRA reaches 82% only at 53.65%; env-lever ceiling ~65% proxy at the floor, gate-lowering collapses precision). The period-10-20 autocorrelation seeder (29ab962) is documented opt-in/default-off for adotto (region recall flat, bp precision 31->13.5% from low-complexity over-call); kept for potential satellite use. CLAUDE.md tuning section + benchmarking_results_updated.md Exp1 row updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
The 3-tier pipeline only emits where it can form a seed, so diverged short STRs (a substitution in most copies) are missed entirely — the dominant region-recall gap vs ULTRA/tantan. _catchall_periodicity_fill (opt-in CATCHALL_SCAN, default off = baseline unchanged) is a low-stringency autocorrelation sweep over short periods (CATCHALL_MIN_P..MAX_P, default 1-20) in the regions NO tier/satellite pass covered: for each period it computes vectorized identity between offset-p windows (cumsum sliding window, O(n)), extracts high-identity runs above CATCHALL_MIN_IDENTITY (default 0.72) in uncovered DNA, and emits a call per run (longest period first so genuine k-mer STRs claim their region). It deliberately trades precision for recall — the identity/length knobs (CATCHALL_MIN_IDENTITY, CATCHALL_MIN_LEN) set the operating point for the ULTRA-level (~53% precision) target. chr21: +26978 calls in 16 s (52560 total), EXIT=0. Regression 14 passed; baseline byte-unchanged with CATCHALL_SCAN off. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
…ULTRA) PHASE 3 of the recall loop: with the precision target relaxed to ULTRA's level (~53%), the new catch-all periodicity algorithm (CATCHALL_SCAN, finder.py 15d4273) breaks the recall ceiling. Validated full-genome adotto frontier on the v2.2 gate base (v2.1 env + period-9 gate floors 17 + CATCHALL_SCAN=1): - identity 0.76 (+MAX_P 50): 72.88% recall / 52.72% precision (~ULTRA precision, +15.3pp recall over v2's 57.62%) - identity 0.72: 82.35% recall / 47.99% precision (beats ULTRA's 81.62% recall) - identity 0.68: 84.42% recall / 42.61% precision (max recall) bwtandem now reaches/exceeds ULTRA-level recall; frontier close to but slightly inside ULTRA's 81.62/53.65 corner. bp precision (~32%) is the cost (low-complexity over-call); a post-call quality filter is the next precision lever. v2.1 (59.04/58.98) stays the precision-leader op-point; catch-all is opt-in/default-off. CLAUDE.md tuning + benchmarking_results_updated.md Exp1 updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
Add two intrinsic filters inside _catchall_periodicity_fill to recover the ~32% bp / ~48% region precision the catch-all trades for recall, without a re-run: - CATCHALL_MIN_COPIES (default 2 = no-op; window=2*period already forces copies>=2). Raising to 3 drops the shortest 2-copy over-calls. - CATCHALL_MIN_ENTROPY (default 0 = off; per-base Shannon entropy via new _seq_shannon_entropy) drops homopolymer-ish low-complexity calls. Defaults reproduce baseline exactly. chr21+chr22 proxy sweep (f_* variants) picks CATCHALL_MIN_COPIES=3 as the op-point: pool precision 50.00%->51.57% (+1.57pp) at pool recall 84.05%->82.73% (held >=82%). Entropy gate gave ~0 precision gain on adotto region precision, left off. Regression: test_ground_truth (11), test_loop_p1020, test_satellite_gapfill all pass (CATCHALL off = baseline). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
finder._fill_satellite_gaps, finder._catchall_periodicity_fill and tier2._autocorr_seed each independently re-derived the same "self-similarity at offset p" math; the catch-all and tier2 seeders carried a verbatim ~15-line cumsum-windowed-count + int8-view run-extraction block. Factor the period math into src/autocorr.py: - autocorr_identity(arr, period) -> scalar (satellite gap-fill) - windowed_match_counts(arr, period, window)-> O(n) sliding window (catch-all, tier2) - contiguous_true_runs(mask) -> run starts/ends (catch-all, tier2) Behavior-preserving by construction (same numpy ops). Verified: - tests/test_autocorr.py (12) asserts each helper reproduces the exact inline block (cumsum int64, int8-view diff) + brute-force identity. - chr21 catch-all output byte-identical before/after (13886 calls, 0 diff). - test_ground_truth (11) / test_loop_p1020 / test_satellite_gapfill all pass. - The residual tier2/satellite chr21 diff is pre-existing threads>=2 non-determinism: two same-code reruns differ by 53 lines, > the 44-line before/after diff, with catch-all identical in all cases. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
…aize 3A +96% Re-ran Filip's cross-tool benchmark on all 3 species with the catch-all ENABLED (CATCHALL_SCAN=1 id0.72 + MIN_COPIES=3 precision filter), full-genome: - Human GRCh38: region recall 57.62->80.69% at 50.2% raw / 79.1% adjusted precision (catchF). The MIN_COPIES=3 filter recovers +1.7pp adjusted precision over catchH. - Arabidopsis Col-CEN: neutral->negative (CEN180 monomer recall saturated 99.67->99.68%, bp precision 65.5->60.7%) — satellite-dominated, period outside catch-all range. - Maize Mo17: 3A microsatellite bp +96% (11.58M->22.72M); 3B/3C satellite unchanged. Conclusion: catch-all is a short-STR/microsatellite mechanism (period<=20bp) — ON for STR-focused runs, OFF for satellite/centromere runs. Runs used 64GB (190GB request stalled ~12 days in the full PGL partition; SLURM allocates by request not usage). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
…ecies results) Self-contained page for Filip: what's new (catch-all + precision filter + refactor), build, how to run (op-point env table: catchF/catchH/catchT/OFF), exact per-genome commands, all-tools result tables (Exp1 human / Exp2 Col-CEN / Exp3 maize 3A/3B/3C), how to score, and Pronghorn ops notes (64GB not 190GB; scontrol broken). Key results: human region recall 57.6->80.7% (79.1% adjusted prec), maize 3A microsatellite +96%, Col-CEN/maize-satellite unchanged. Rule: catch-all ON for STR/microsatellite runs, OFF for satellite/centromere runs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP
|
@framazan I've sent you a collaborator invite (read access) for this repo — please accept it at https://github.com/wyim-pgl/bwt-algorithm/invitations and I'll add you as a reviewer. This PR adds the catch-all STR detection algorithm and a full 3-species benchmark for your cross-tool comparison; the run methods, op-points, all-tools result tables, and scoring steps are in |
|
Friendly reminder, @framazan 🙂 — the collaborator invite (so I can formally request your review on this PR) is still pending. When you get a chance, please accept it here: https://github.com/wyim-pgl/bwt-algorithm/invitations — once you do, the review request goes out automatically. Run commands + per-genome results are in |
|
Thanks for accepting, @framazan! Review request added. See |
Summary
This branch overhauls BWTandem's de-novo tandem-repeat sensitivity and adds a new
catch-all periodicity algorithm that closes the recall gap to ULTRA/tantan while
keeping BWTandem the only de-novo tool that also maxes out satellite detection.
Headline (Human GRCh38 vs full adotto v1.2.1): region recall 23% → 80.7% at
79.1% adjusted precision — de-novo recall now on par with ULTRA (81.6%), which no
masking tool reaches on satellite.
@framazan — this is for your cross-tool benchmark. A self-contained run + results guide
is in
docs/2026-06-25-catch-all-benchmark-for-filip.md(build, exact per-genome commands, op-point table, scoring, ops notes).
The new algorithm — catch-all periodicity pass (opt-in, default OFF)
finder._catchall_periodicity_fill(envCATCHALL_SCAN=1) detects local periodicitydirectly in the DNA no tier/satellite pass covered — recovering the entirely-missed
diverged short STRs that the seed-based tiers structurally cannot find (a substitution
in most copies forms no exact seed). It trades precision for recall via a tunable
CATCHALL_MIN_IDENTITY, and aCATCHALL_MIN_COPIES=3gate recovers precision by droppingthe shortest 2-copy over-calls. Default OFF — every existing run is unchanged.
3-species benchmark (catch-all OFF → ON)
Rule: catch-all is a short-STR / microsatellite mechanism (period ≤ 20 bp) — ON for
STR/microsatellite runs, OFF for satellite/centromere runs (those use 156–360 bp periods,
outside its range). Full tables with all competitor tools are in the docs page above and
exp1_human/filip_repro/benchmarking_results_updated.md.How to run
What else is in the branch (de-novo sensitivity overhaul)
period-stratified length/score gate (the recall/precision wall break, 23%→44%).
bp-recall 92.5%→99.8%).
src/autocorr.py(behavior-preserving — catch-all chr21 output byte-identicalbefore/after; verified).
Testing
pytest tests/— incl.test_autocorr.py(12, helpers reproduce the inline math exactly),test_ground_truth.py(11),test_satellite_gapfill.py,test_loop_p1020.py. All greenwith the compiled
_accelerators.so. Defaults reproduce baseline behavior exactly.Ops notes (Pronghorn)
Full-genome MaxRSS is human 37.6 GB / maize 23.7 GB — request 64 GB (a 190 GB request
stalled ~12 days in the full PGL partition; SLURM allocates by request, not usage).
scontrolis broken on the compute nodes (libreadline.so.6) — use scancel+resubmit.🤖 Generated with Claude Code
https://claude.ai/code/session_01Lip8SH3mXGherETw9ZdEVP