fix(#1070): DDR fence before decode FFTS cross-core signals (onboard a2a3 golden mismatch) by ChaoZheng109 · Pull Request #1076 · hw-native-sys/simpler

ChaoZheng109 · 2026-06-17T06:51:08Z

Fixes the onboard a2a3 half of #1070: the flaky 'out'
golden mismatch on spmd_paged_attention_highperf b1_h32_kv8_s128_bs128_fp16.
(The a2a3sim TIMEOUT half was already fixed in #1063.)

Root cause (confirmed by the kernel owner)

The highperf decode pipeline hands GM tiles between AIC and AIV cores using
ffts_cross_core_sync (FFTS mode2) as the only producer→consumer barrier.
mode2 orders the signal against the producing pipe but does NOT imply a DDR
fence — so a consumer core can observe the *_READY_DECODER flag before the
producer's GM tile is globally visible. Latent since the kernel was written;
exposed once #1056 enlarged the dispatch payload and tightened AICPU→AICore
dispatch timing enough to overlap the producer write with the consumer read.

Signature matched the diagnosis exactly: all 32 heads over-tolerance at once
with run-to-run-varying magnitude (max_diff 0.39–3.86), in ~33% of onboard
runs, plus the 507018 device-poison cascade. All-heads + varying ⇒ a true
cross-core race on a shared resource, not a deterministic stale read.

@MirkoDeVita98 (kernel owner) confirmed both points in #1070:

reduce_flag_id=3 was a bug, not intentional. mode2 is not a DDR fence;
READY_DECODER producers need an explicit DDR barrier before FFTS signaling.

Fix

- DDR fence before every decode cross-core signal. New
  FlushGmBeforeCrossCoreSignal helper applied to all six decode handoffs
  (QK / SOFTMAX / UPDATE _READY_DECODER and their _STAGE2 ping-pong
  twins): pipe_barrier drains the producing pipe, dcci(CACHELINE_OUT)
  writes back the data-cache (MTE3) stores, and dsb(DSB_DDR) is the barrier
  that also orders the FIXPIPE GM stores. Mirrors PTO-ISA's own
  SyncAll / TSyncCVID flush-before-signal idiom; the __CPU_SIM dsb(0)
  spelling matches the existing repo idiom (pto_async_kernel_api.h,
  deferred_notify_demo).
- Dedicated FFTS flag for the split-KV reduce (was aliasing
  QK_READY_DECODER = 3). A real latent bug for kv_split_core_num > 1;
  confirmed not intentional.
- Re-enable b1_h32_kv8_s128_bs128_fp16 on a2a3 (was sim-only after #1063,
  "Fix: scheduler timeout per platform variant (sim 5s, onboard 2s)").
- Keeps the per-mismatch localizing diagnostic (_diagnose_mismatch in
  scene_test.py) from the investigation: a useful general golden-mismatch
  localizer that can be split out if reviewers prefer.
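
To illustrate why the signal alone is not enough, here is a minimal toy model in Python. It is a sketch under stated assumptions, not the kernel code: `ToyProducer`, `handoff`, and the dictionary standing in for DDR are hypothetical, and `flush()` only stands in for the pipe_barrier + dcci(CACHELINE_OUT) + dsb(DSB_DDR) sequence described above. The model is deterministic and always shows the worst case, whereas on hardware the stale read is a timing-dependent race.

```python
class ToyProducer:
    """Producer core whose stores sit in a local buffer until flushed."""

    def __init__(self, ddr):
        self.ddr = ddr      # shared "DDR": dict of address -> value
        self.dirty = {}     # stores not yet visible in DDR

    def store(self, addr, value):
        # Stores land in the producer's local buffer first.
        self.dirty[addr] = value

    def flush(self):
        # Stand-in for pipe_barrier + dcci(CACHELINE_OUT) + dsb(DSB_DDR).
        self.ddr.update(self.dirty)
        self.dirty.clear()


def handoff(flush_before_signal):
    """Return what the consumer reads after the READY flag is raised."""
    ddr = {"tile": "old"}
    flags = {"QK_READY_DECODER": 0}
    producer = ToyProducer(ddr)

    producer.store("tile", "new")
    if flush_before_signal:
        producer.flush()
    flags["QK_READY_DECODER"] += 1   # stand-in for ffts_cross_core_sync

    # Consumer: the wait is satisfied, so it reads the tile from DDR.
    assert flags["QK_READY_DECODER"] > 0
    return ddr["tile"]


print(handoff(flush_before_signal=False))  # old
print(handoff(flush_before_signal=True))   # new
```

In this model the flag is raised in both cases; only the flush decides whether the tile the consumer reads is the one the producer wrote.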

Verification

- a2a3sim: test_run passes (compiles clean, with dcci/dsb/DSB_DDR resolving
  under both ccec and the __CPU_SIM g++-15 path, and golden correctness
  preserved). The race does not reproduce in sim.
- onboard a2a3: validated by multiple st-onboard-a2a3 CI rounds below; the
  race only surfaces under CI multi-core concurrency (does not reproduce at
  ≤2-card local).

Closes #1070 once the onboard CI rounds are green.

…head mismatch diag

Debug PR to reproduce the flaky spmd_paged_attention_highperf onboard 'out' golden mismatch under CI multi-core concurrency (it does not reproduce at <=2-card local concurrency). CI st-onboard-a2a3 is EXPECTED to go red: this PR is for diagnosis, not merge.

- Re-enable b1_h32_kv8_s128_bs128_fp16 on a2a3 (was sim-only after hw-native-sys#1063).
- _compare_outputs: on mismatch, print the worst element's multi-dim index and a per-row (flattened leading dims, e.g. per-head) over-tolerance breakdown. Random bad rows across reruns => concurrency race; a fixed set => deterministic per-row bug. Best-effort, never raises, so it cannot mask the real AssertionError.
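
A dependency-free sketch of such a per-row localizer might look like the following. It is not the repository's `_compare_outputs` or `_diagnose_mismatch`: the names `unravel` and `diagnose_mismatch`, the flat row-major input layout, and the `atol + rtol * |golden|` tolerance rule are assumptions made for illustration.

```python
def unravel(flat_index, shape):
    """Convert a flat row-major index into a multi-dim index."""
    index = []
    for dim in reversed(shape):
        flat_index, pos = divmod(flat_index, dim)
        index.append(pos)
    return tuple(reversed(index))


def diagnose_mismatch(actual, golden, shape, rtol=0.005, atol=0.02):
    """Summarise a mismatch between two flat row-major buffers.

    A "row" is one slice along the last dim, i.e. the flattened leading dims.
    Best-effort: returns None instead of raising on malformed input, so the
    caller's own AssertionError is never masked.
    """
    try:
        row_len = shape[-1]
        diffs = [abs(a - g) for a, g in zip(actual, golden)]
        worst = max(range(len(diffs)), key=diffs.__getitem__)
        bad_per_row = {}
        for i, (d, g) in enumerate(zip(diffs, golden)):
            if d > atol + rtol * abs(g):        # assumed tolerance rule
                row = i // row_len
                bad_per_row[row] = bad_per_row.get(row, 0) + 1
        return {
            "worst_index": unravel(worst, shape),
            "max_diff": diffs[worst],
            "bad_per_row": bad_per_row,
        }
    except Exception:
        return None


golden = [1.0] * 12                 # shape (2, 2, 3): 4 rows of 3 elements
actual = list(golden)
actual[7], actual[8] = 1.5, 1.1     # both land in flattened row 2
report = diagnose_mismatch(actual, golden, shape=(2, 2, 3))
print(report["worst_index"], report["bad_per_row"])  # (1, 0, 1) {2: 2}
```

Comparing `bad_per_row` across reruns is what separates the two cases named above: a changing set of rows points at a race, a fixed set at a deterministic per-row bug.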

… aliasing flag 3)

The highperf paged-attention decode kernel reused FFTS flag id 3 for two distinct cross-core barriers: QK_READY_DECODER (constexpr 3) in the decode pipeline, and the hardcoded reduce_flag_id=3 in the split-KV CombineScale reduce. FFTS semaphores are saturating counters with no identity, so the two uses alias onto one hardware flag: the reduce's wait_flag_dev can be released early by the decode path's set, before the per-core partials' GM writes are globally visible -> flaky all-heads 'out' golden mismatch with run-to-run varying magnitude.

Latent since the kernel was written; exposed onboard once simpler#1056 enlarged the dispatch payload and tightened AICPU->AICore dispatch timing enough to overlap the two flag-3 uses. Two flush experiments (host device sync, AICore exit pipe_barrier) did NOT fix it, precisely because the data is wrongly computed (early release), not unflushed.

Move the reduce to a free flag (9; 0-8 used by the pipeline, 11-15 reserved by PTO-ISA) and wait on the same id rather than a hardcoded 3.
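
The aliasing described in this commit can be reproduced with a small counter model. This is only a sketch: `ToyFlags`, `reduce_released_early`, and `first_free_flag` are hypothetical names, and the saturation limit and the total of 16 flag ids are assumptions inferred from the id ranges quoted above rather than from hardware documentation.

```python
class ToyFlags:
    """FFTS-like flags: plain counters keyed by id, saturating at `limit`."""

    def __init__(self, limit=15):
        self.limit = limit          # assumed saturation value
        self.count = {}

    def set(self, flag_id):
        self.count[flag_id] = min(self.count.get(flag_id, 0) + 1, self.limit)

    def try_wait(self, flag_id):
        """Consume one signal if present; the counter cannot tell who set it."""
        if self.count.get(flag_id, 0) > 0:
            self.count[flag_id] -= 1
            return True
        return False


QK_READY_DECODER = 3


def reduce_released_early(reduce_flag_id):
    """True if the reduce wait passes before any partial was published."""
    flags = ToyFlags()
    flags.set(QK_READY_DECODER)     # decode pipeline signals its own handoff
    return flags.try_wait(reduce_flag_id)


def first_free_flag(used, reserved, total=16):
    """Lowest flag id that is neither used nor reserved (total is assumed)."""
    return next(i for i in range(total) if i not in used and i not in reserved)


print(reduce_released_early(reduce_flag_id=3))   # True: aliased with QK_READY_DECODER
print(reduce_released_early(reduce_flag_id=9))   # False: dedicated flag
print(first_free_flag(used=range(0, 9), reserved=range(11, 16)))  # 9
```

Because the counter carries no identity, a dedicated id is the only thing that keeps the reduce from consuming a decode-path signal in this model.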

ChaoZheng109 · 2026-06-17T11:21:06Z

Investigation done — onboard golden mismatch root-caused to a latent cross-core race in the highperf decode kernel exposed by #1056's dispatch-timing change; full analysis + 4 ruled-out hypotheses posted on #1070. Closing this draft (no fix; not a merge candidate). Branch investigate-1070-highperf-acc kept for repro + the flag-id latent-bug fix.

The highperf paged-attention decode pipeline hands GM tiles between AIC and AIV cores using ffts_cross_core_sync (FFTS mode2) as the only producer->consumer barrier. mode2 orders the signal against the producing pipe but does NOT imply a DDR fence (confirmed by the kernel owner), so a consumer core could observe the READY flag before the producer's GM tile was globally visible. Once hw-native-sys#1056 enlarged the dispatch payload and tightened AICPU->AICore dispatch timing, the producer write and consumer read overlapped -> flaky all-heads 'out' golden mismatch onboard a2a3 with run-to-run-varying magnitude (~33%), plus the 507018 device-poison cascade.

Flush each producer's GM tile to DDR before signaling, via a new FlushGmBeforeCrossCoreSignal helper applied to all six decode handoffs (QK / SOFTMAX / UPDATE _READY_DECODER and their _STAGE2 ping-pong twins): pipe_barrier drains the producing pipe, dcci(CACHELINE_OUT) writes back the data-cache (MTE3) stores, and dsb(DSB_DDR) is the barrier that also orders the FIXPIPE GM stores. Mirrors PTO-ISA's own SyncAll / TSyncCVID flush-before-signal idiom; the __CPU_SIM dsb(0) spelling matches the existing repo idiom (pto_async_kernel_api.h, deferred_notify_demo).

Re-enable b1_h32_kv8_s128_bs128_fp16 on a2a3 (was sim-only after hw-native-sys#1063). Passes on a2a3sim; onboard a2a3 validated by multiple st-onboard-a2a3 CI rounds (the race only surfaces under CI multi-core concurrency).

ChaoZheng109 · 2026-06-18T14:46:33Z

⚠️ Onboard CI: producer-side DDR flush is necessary but NOT sufficient — race still reproduces

Ran 5 st-onboard-a2a3 rounds on this branch (run 27763273991, attempts 1–5):

| round | st-onboard-a2a3 | highperf b1 `out` |
|---|---|---|
| 1 | ✅ pass | PASSED |
| 2 | ✅ pass | PASSED |
| 3 | ❌ fail | FAILED — `Golden mismatch on 'out': max_diff=0.39404296875` |
| 4 | ✅ pass | PASSED |
| 5 | ✅ pass | PASSED |

Attempt-3 failing line (via gh api repos/.../actions/jobs/82147513366/logs):

[gw2] [92%] FAILED .../spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py::...::test_run
  AssertionError: Golden mismatch on 'out': max_diff=0.39404296875, rtol=0.005, atol=0.02
--- L2 tensormap_and_ringbuffer: FAIL rc=1 83.0s ---

max_diff=0.394 is the exact original #1070 signature, so the flush-before-signal on the 6 decode FFTS handoffs did not eliminate the cross-core race. The observed failure rate (1/5 = 20%, vs. the originally reported ~33%) is within noise of "no change" at this sample size, so these five rounds do not show that the flush reduced it either.
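
A quick binomial check makes the "within noise" point concrete. The sketch below assumes CI rounds are independent with a fixed per-round failure probability, which is a simplification; `prob_at_most` and `rounds_needed` are hypothetical helper names.

```python
from math import comb


def prob_at_most(k, n, p):
    """P(at most k failures in n independent rounds, failure probability p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def rounds_needed(p, alpha=0.05):
    """Smallest n for which n straight passes have probability < alpha
    if the true failure rate were still p."""
    n = 1
    while (1 - p) ** n >= alpha:
        n += 1
    return n


# If nothing had changed (p = 1/3), how likely is "at most 1 failure in 5"?
print(round(prob_at_most(1, 5, 1 / 3), 3))   # 0.461
# How many consecutive green rounds before p = 1/3 becomes implausible?
print(rounds_needed(1 / 3))                  # 8
```

Under these assumptions an unchanged 33% failure rate would produce one or fewer failures in five rounds about 46% of the time, and roughly eight consecutive passes would be needed before that rate could be rejected at the 5% level.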

Most likely remaining gap

PTO-ISA's own SyncAll fences both sides — producer dcci(CACHELINE_OUT)+dsb and consumer dcci(SINGLE_CACHE_LINE)+dsb (invalidate) before the cross-core read. This PR only added the producer half. The consumers (SoftmaxStage1 reads s_gm; ProcessPV reads p_gm; SoftmaxStage2 reads o_tmp_gm) do no cache-invalidate before reading, so a consumer core can still read a stale cached copy even after the producer flushed to DDR. Likely next experiment: add a consumer-side dcci-invalidate + dsb right after each wait_flag_dev(*_READY_DECODER), before the GM read.

Not ready to merge. Keeping open; the case should stay off the onboard run until the race is actually closed.

…S handoffs

The producer-side DDR flush alone (prior commit) left the all-heads 'out' golden mismatch reproducing ~1/5 onboard a2a3 (run 27763273991 attempt 3: max_diff=0.394, the original hw-native-sys#1070 signature). Producer flush writes the GM tile back to DDR, but the consumer core can still read a stale copy from its own data cache.

Add the consumer half, mirroring PTO-ISA SyncAll's two-sided dcci/dsb: after each wait_flag_dev(*_READY_DECODER) and before the GM read, InvalidateGmAfterCrossCoreWait drops the local data cache for the producer's tile (dcci invalidate + dsb(DSB_DDR) + pipe_barrier). Applied to all five decode consumer waits: AIV reads s_gm (QK_READY_DECODER / _STAGE2), cube reads p_gm (SOFTMAX_READY via softmax_ready_flag), AIV reads o_tmp_gm (UPDATE_READY_DECODER / _STAGE2).

Passes on a2a3sim; onboard a2a3 to be validated by multiple st-onboard-a2a3 rounds.
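
The consumer-side hypothesis can be sketched with a toy read cache. This models the suspected failure mode only; it does not establish that the hardware behaves this way. `ToyConsumer` and `second_iteration_read` are hypothetical, and `invalidate()` merely stands in for the dcci invalidate + dsb(DSB_DDR) + pipe_barrier sequence.

```python
class ToyConsumer:
    """Consumer core with a private read cache in front of shared DDR."""

    def __init__(self, ddr):
        self.ddr = ddr
        self.cache = {}

    def load(self, addr):
        # A cached line is served without consulting DDR again.
        if addr not in self.cache:
            self.cache[addr] = self.ddr[addr]
        return self.cache[addr]

    def invalidate(self, addr):
        # Stand-in for dcci invalidate + dsb(DSB_DDR) + pipe_barrier.
        self.cache.pop(addr, None)


def second_iteration_read(invalidate_after_wait):
    """Return what the consumer reads on the iteration after a cached read."""
    ddr = {"s_gm": "tile0"}
    consumer = ToyConsumer(ddr)
    consumer.load("s_gm")            # iteration 0 leaves the line cached

    ddr["s_gm"] = "tile1"            # producer has flushed iteration 1 to DDR
    # ... the wait on QK_READY_DECODER would return here ...
    if invalidate_after_wait:
        consumer.invalidate("s_gm")
    return consumer.load("s_gm")


print(second_iteration_read(invalidate_after_wait=False))  # tile0 (stale)
print(second_iteration_read(invalidate_after_wait=True))   # tile1
```

In this model the producer's data is already in DDR in both cases, which is why a producer-only flush cannot help: the stale value comes from the consumer's own cached copy.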

MirkoDeVita98 · 2026-06-19T07:55:49Z

@ChaoZheng109 Fix for long sequences paged attention highperf: #1091

ChaoZheng109 force-pushed the investigate-1070-highperf-acc branch 2 times, most recently from 8cc2a74 to 68c938b on June 17, 2026 09:34

ChaoZheng109 force-pushed the investigate-1070-highperf-acc branch from 1654840 to 68c938b on June 17, 2026 11:20

ChaoZheng109 closed this Jun 17, 2026

ChaoZheng109 reopened this Jun 18, 2026

ChaoZheng109 changed the title ~~[investigate][draft] #1070 highperf onboard golden mismatch — repro + per-head diag~~ fix(#1070): DDR fence before decode FFTS cross-core signals (onboard a2a3 golden mismatch) Jun 18, 2026

ChaoZheng109 wants to merge 4 commits into hw-native-sys:main from ChaoZheng109:investigate-1070-highperf-acc
