
fix(#1070): DDR fence before decode FFTS cross-core signals (onboard a2a3 golden mismatch)#1076

Draft
ChaoZheng109 wants to merge 4 commits into
hw-native-sys:mainfrom
ChaoZheng109:investigate-1070-highperf-acc


@ChaoZheng109

@ChaoZheng109 ChaoZheng109 commented Jun 17, 2026

Collaborator

Fixes the onboard a2a3 half of #1070: the flaky 'out' golden mismatch on
spmd_paged_attention_highperf b1_h32_kv8_s128_bs128_fp16.
(The a2a3sim TIMEOUT half was already fixed in #1063.)

Root cause (confirmed by the kernel owner)

The highperf decode pipeline hands GM tiles between AIC and AIV cores using
ffts_cross_core_sync (FFTS mode2) as the only producer→consumer barrier.
mode2 orders the signal against the producing pipe but does NOT imply a DDR
fence, so a consumer core can observe the *_READY_DECODER flag before the
producer's GM tile is globally visible. Latent since the kernel was written;
exposed once #1056 enlarged the dispatch payload and tightened AICPU→AICore
dispatch timing enough to overlap the producer write with the consumer read.

Signature matched the diagnosis exactly: all 32 heads over-tolerance at once
with run-to-run-varying magnitude (max_diff 0.39–3.86), ~33% onboard, plus
the 507018 device-poison cascade. All-heads + varying ⇒ a true cross-core race
on a shared resource, not a deterministic stale read.

@MirkoDeVita98 (kernel owner) confirmed both points in #1070:

reduce_flag_id=3 was a bug, not intentional. mode2 is not a DDR fence;
READY_DECODER producers need an explicit DDR barrier before FFTS signaling.
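
The visibility gap described above can be sketched with a toy model. This is not kernel code: the `Core` class, the dictionary standing in for DDR, and the `handoff` function are all hypothetical names, and the model assumes only that stores sit in a core-local write-back buffer until flushed while the flag update is visible immediately.

```python
class Core:
    """Toy core with a private write-back cache in front of shared DDR."""

    def __init__(self, ddr):
        self.ddr = ddr
        self.cache = {}  # dirty lines not yet visible to other cores

    def store(self, addr, value):
        self.cache[addr] = value  # lands in the local cache only

    def flush(self):
        self.ddr.update(self.cache)  # write dirty lines back to DDR
        self.cache.clear()


def handoff(flush_before_signal):
    ddr = {"tile": "old"}
    flags = {"ready": 0}
    producer = Core(ddr)

    producer.store("tile", "new")
    if flush_before_signal:
        producer.flush()
    flags["ready"] += 1  # in this model the signal bypasses the data cache

    # Consumer side: the flag is already set, so it reads the tile from DDR.
    assert flags["ready"] == 1
    return ddr["tile"]


print(handoff(flush_before_signal=False))  # old
print(handoff(flush_before_signal=True))   # new
```

Without the flush the consumer is released by the flag and still reads the previous tile contents, which is the failure mode the fix targets.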

Fix

  1. DDR fence before every decode cross-core signal. New
    FlushGmBeforeCrossCoreSignal helper applied to all six decode handoffs
    (QK / SOFTMAX / UPDATE _READY_DECODER and their _STAGE2 ping-pong
    twins): pipe_barrier drains the producing pipe, dcci(CACHELINE_OUT)
    writes back the data-cache (MTE3) stores, and dsb(DSB_DDR) is the barrier
    that also orders the FIXPIPE GM stores. Mirrors PTO-ISA's own
    SyncAll / TSyncCVID flush-before-signal idiom; the __CPU_SIM dsb(0)
    spelling matches the existing repo idiom (pto_async_kernel_api.h,
    deferred_notify_demo).

  2. Dedicated FFTS flag for the split-KV reduce (was aliasing
    QK_READY_DECODER = 3). A real latent bug for kv_split_core_num > 1;
    confirmed not intentional.

  3. Re-enable b1_h32_kv8_s128_bs128_fp16 on a2a3 (was sim-only after #1063, "Fix: scheduler timeout per platform variant (sim 5s, onboard 2s)").

  4. Keeps the per-mismatch localizing diagnostic (_diagnose_mismatch in
    scene_test.py) from the investigation — useful general golden-mismatch
    localizer, can be split out if reviewers prefer.

Verification

  • a2a3sim: test_run passes (compiles clean — dcci/dsb/DSB_DDR resolve
    under both ccec and the __CPU_SIM g++-15 path — and golden-correctness
    preserved). The race does not reproduce in sim.
  • onboard a2a3: validated by multiple st-onboard-a2a3 CI rounds below — the
    race only surfaces under CI multi-core concurrency (does not reproduce at
    ≤2-card local).

Closes #1070 once the onboard CI rounds are green.

…head mismatch diag

Debug PR to reproduce the flaky spmd_paged_attention_highperf onboard
'out' golden mismatch under CI multi-core concurrency (it does not
reproduce at <=2-card local concurrency). CI st-onboard-a2a3 is EXPECTED
to go red — this PR is for diagnosis, not merge.

- Re-enable b1_h32_kv8_s128_bs128_fp16 on a2a3 (was sim-only after hw-native-sys#1063).
- _compare_outputs: on mismatch, print the worst element's multi-dim
  index and a per-row (flattened leading dims, e.g. per-head)
  over-tolerance breakdown. Random bad rows across reruns => concurrency
  race; a fixed set => deterministic per-row bug. Best-effort, never
  raises, so it cannot mask the real AssertionError.
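
A minimal sketch of the kind of breakdown this diagnostic prints, in pure Python on nested lists. The function name `diagnose`, the two-dimensional input shape, and the tolerance rule `|a - g| <= atol + rtol * |g|` are assumptions for illustration; the real _compare_outputs may differ on all three.

```python
def diagnose(actual, golden, rtol=0.005, atol=0.02):
    """Return the worst element and a per-row count of over-tolerance elements.

    `actual` and `golden` are lists of rows (for example, one row per head).
    """
    worst = (0.0, None)  # (abs diff, (row, col))
    per_row = {}         # row -> number of over-tolerance elements
    for r, (row_a, row_g) in enumerate(zip(actual, golden)):
        for c, (a, g) in enumerate(zip(row_a, row_g)):
            diff = abs(a - g)
            if diff > atol + rtol * abs(g):
                per_row[r] = per_row.get(r, 0) + 1
            if diff > worst[0]:
                worst = (diff, (r, c))
    return worst, per_row


golden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
actual = [[1.0, 2.0], [3.5, 4.0], [5.0, 6.01]]
worst, per_row = diagnose(actual, golden)
print(worst)    # (0.5, (1, 0))
print(per_row)  # {1: 1}
```

Comparing the `per_row` keys across reruns is what separates the two cases named above: a changing set of rows points at a race, a fixed set at a per-row bug.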
@gemini-code-assist


Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai

coderabbitai Bot commented Jun 17, 2026


Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@ChaoZheng109 ChaoZheng109 force-pushed the investigate-1070-highperf-acc branch 2 times, most recently from 8cc2a74 to 68c938b, June 17, 2026 09:34
… aliasing flag 3)

The highperf paged-attention decode kernel reused FFTS flag id 3 for two
distinct cross-core barriers: QK_READY_DECODER (constexpr 3) in the decode
pipeline, and the hardcoded reduce_flag_id=3 in the split-KV CombineScale
reduce. FFTS semaphores are saturating counters with no identity, so the
two uses alias onto one hardware flag: the reduce's wait_flag_dev can be
released early by the decode path's set, before the per-core partials' GM
writes are globally visible -> flaky all-heads 'out' golden mismatch with
run-to-run varying magnitude.

Latent since the kernel was written; exposed onboard once simpler#1056
enlarged the dispatch payload and tightened AICPU->AICore dispatch timing
enough to overlap the two flag-3 uses. Two flush experiments (host device
sync, AICore exit pipe_barrier) did NOT fix it, precisely because the data
is wrongly computed (early release), not unflushed.

Move the reduce to a free flag (9; 0-8 used by the pipeline, 11-15
reserved by PTO-ISA) and wait on the same id rather than a hardcoded 3.
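
The aliasing can be illustrated with a toy counter model. `FftsFlags`, `try_wait`, the 16-flag count, and the saturation limit are assumptions made for the sketch, not the hardware interface; only QK_READY_DECODER = 3, the new id 9, and the 0-8 / 11-15 ranges come from the text above.

```python
class FftsFlags:
    """Toy model: each flag id is a bare saturating counter with no owner."""

    def __init__(self, num_flags=16, max_count=15):  # both sizes are assumed
        self.count = [0] * num_flags
        self.max_count = max_count

    def set(self, flag_id):
        self.count[flag_id] = min(self.count[flag_id] + 1, self.max_count)

    def try_wait(self, flag_id):
        """Consume one count if available; True means the waiter is released."""
        if self.count[flag_id] == 0:
            return False
        self.count[flag_id] -= 1
        return True


QK_READY_DECODER = 3


def reduce_released_early(reduce_flag_id):
    flags = FftsFlags()
    flags.set(QK_READY_DECODER)  # decode pipeline signals its own handoff
    # No reduce producer has signalled yet, so the reduce waiter should block.
    return flags.try_wait(reduce_flag_id)


print(reduce_released_early(3))  # True: aliased, released by the wrong set
print(reduce_released_early(9))  # False: dedicated flag, still blocked

# Free ids, assuming the id space is 0..15.
used, reserved = set(range(0, 9)), set(range(11, 16))
print(sorted(set(range(16)) - used - reserved))  # [9, 10]
```

Because a counter carries no record of who incremented it, the waiter cannot tell a decode-path set from a reduce-path set; a distinct id is the only separation available in this model.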
@ChaoZheng109 ChaoZheng109 force-pushed the investigate-1070-highperf-acc branch from 1654840 to 68c938b, June 17, 2026 11:20
@ChaoZheng109

Collaborator Author

Investigation done — onboard golden mismatch root-caused to a latent cross-core race in the highperf decode kernel exposed by #1056's dispatch-timing change; full analysis + 4 ruled-out hypotheses posted on #1070. Closing this draft (no fix; not a merge candidate). Branch investigate-1070-highperf-acc kept for repro + the flag-id latent-bug fix.

The highperf paged-attention decode pipeline hands GM tiles between AIC
and AIV cores using ffts_cross_core_sync (FFTS mode2) as the only
producer->consumer barrier. mode2 orders the signal against the producing
pipe but does NOT imply a DDR fence (confirmed by the kernel owner), so a
consumer core could observe the READY flag before the producer's GM tile
was globally visible. Once hw-native-sys#1056 enlarged the
dispatch payload and tightened AICPU->AICore dispatch timing, the
producer write and consumer read overlapped -> flaky all-heads 'out'
golden mismatch onboard a2a3 with run-to-run-varying magnitude (~33%),
plus the 507018 device-poison cascade.

Flush each producer's GM tile to DDR before signaling, via a new
FlushGmBeforeCrossCoreSignal helper applied to all six decode handoffs
(QK / SOFTMAX / UPDATE _READY_DECODER and their _STAGE2 ping-pong twins):
pipe_barrier drains the producing pipe, dcci(CACHELINE_OUT) writes back
the data-cache (MTE3) stores, and dsb(DSB_DDR) is the barrier that also
orders the FIXPIPE GM stores. Mirrors PTO-ISA's own SyncAll / TSyncCVID
flush-before-signal idiom; the __CPU_SIM dsb(0) spelling matches the
existing repo idiom (pto_async_kernel_api.h, deferred_notify_demo).

Re-enable b1_h32_kv8_s128_bs128_fp16 on a2a3 (was sim-only after hw-native-sys#1063).
Passes on a2a3sim; onboard a2a3 validated by multiple st-onboard-a2a3 CI
rounds (the race only surfaces under CI multi-core concurrency).
@ChaoZheng109 ChaoZheng109 reopened this Jun 18, 2026
@ChaoZheng109 ChaoZheng109 changed the title [investigate][draft] #1070 highperf onboard golden mismatch — repro + per-head diag fix(#1070): DDR fence before decode FFTS cross-core signals (onboard a2a3 golden mismatch) Jun 18, 2026
@ChaoZheng109

Collaborator Author

⚠️ Onboard CI: producer-side DDR flush is necessary but NOT sufficient — race still reproduces

Ran 5 st-onboard-a2a3 rounds on this branch (run 27763273991, attempts 1–5):

| round | st-onboard-a2a3 | highperf b1 out |
|-------|-----------------|-----------------|
| 1 | ✅ pass | PASSED |
| 2 | ✅ pass | PASSED |
| 3 | fail | FAILED: Golden mismatch on 'out': max_diff=0.39404296875 |
| 4 | ✅ pass | PASSED |
| 5 | ✅ pass | PASSED |

Attempt-3 failing line (via gh api repos/.../actions/jobs/82147513366/logs):

[gw2] [92%] FAILED .../spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py::...::test_run
  AssertionError: Golden mismatch on 'out': max_diff=0.39404296875, rtol=0.005, atol=0.02
--- L2 tensormap_and_ringbuffer: FAIL rc=1 83.0s ---

max_diff=0.394 is the exact original #1070 signature, so the flush-before-signal on the 6 decode FFTS handoffs did not eliminate the cross-core race. The observed rate is 1/5 = 20% vs. the originally reported ~33%; with only five rounds that is within noise of "no change".
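
A rough check of the "within noise" reading, assuming rounds are independent and the unfixed failure rate is exactly 1/3 (both are simplifying assumptions; the ~33% figure is itself an estimate):

```python
from fractions import Fraction
from math import comb


def prob_at_most(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


p_fail = Fraction(1, 3)  # assumed baseline, from the "~33%" report
p = prob_at_most(1, 5, p_fail)
print(p, float(p))  # 112/243, about 0.46

# Consecutive green rounds needed before "no failure" is unlikely (<5%) by chance.
n = 1
while (1 - p_fail) ** n >= Fraction(1, 20):
    n += 1
print(n)  # 8
```

Under these assumptions, one or fewer failures in five rounds would occur almost half the time even if the flush changed nothing, and roughly eight consecutive green rounds would be needed before a clean result becomes meaningful evidence.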

Most likely remaining gap

PTO-ISA's own SyncAll fences both sides — producer dcci(CACHELINE_OUT)+dsb and consumer dcci(SINGLE_CACHE_LINE)+dsb (invalidate) before the cross-core read. This PR only added the producer half. The consumers (SoftmaxStage1 reads s_gm; ProcessPV reads p_gm; SoftmaxStage2 reads o_tmp_gm) do no cache-invalidate before reading, so a consumer core can still read a stale cached copy even after the producer flushed to DDR. Likely next experiment: add a consumer-side dcci-invalidate + dsb right after each wait_flag_dev(*_READY_DECODER), before the GM read.

Not ready to merge. Keeping open; the case should stay off the onboard run until the race is actually closed.

…S handoffs

The producer-side DDR flush alone (prior commit) left the all-heads 'out'
golden mismatch reproducing ~1/5 onboard a2a3 (run 27763273991 attempt 3:
max_diff=0.394, the original hw-native-sys#1070 signature). Producer flush writes the
GM tile back to DDR, but the consumer core can still read a stale copy
from its own data cache.

Add the consumer half, mirroring PTO-ISA SyncAll's two-sided dcci/dsb:
after each wait_flag_dev(*_READY_DECODER) and before the GM read,
InvalidateGmAfterCrossCoreWait drops the local data cache for the
producer's tile (dcci invalidate + dsb(DSB_DDR) + pipe_barrier). Applied
to all five decode consumer waits: AIV reads s_gm (QK_READY_DECODER /
_STAGE2), cube reads p_gm (SOFTMAX_READY via softmax_ready_flag), AIV
reads o_tmp_gm (UPDATE_READY_DECODER / _STAGE2).

Passes on a2a3sim; onboard a2a3 to be validated by multiple
st-onboard-a2a3 rounds.
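
The consumer half can be sketched with a toy model as well. `ConsumerCore` and `second_iteration_read` are hypothetical names, and the model assumes the consumer's loads are served from a private cache whenever the line is present, which is the hypothesis this commit tests rather than a confirmed hardware fact.

```python
class ConsumerCore:
    """Toy core whose loads hit a private cache when the line is present."""

    def __init__(self, ddr):
        self.ddr = ddr
        self.cache = {}

    def load(self, addr):
        if addr not in self.cache:  # miss: fetch the line from DDR
            self.cache[addr] = self.ddr[addr]
        return self.cache[addr]

    def invalidate(self, addr):
        self.cache.pop(addr, None)  # drop the line so the next load refetches


def second_iteration_read(invalidate_after_wait):
    ddr = {"s_gm": "iter0"}
    consumer = ConsumerCore(ddr)
    consumer.load("s_gm")  # iteration 0 leaves the line cached

    ddr["s_gm"] = "iter1"  # producer wrote AND flushed iteration 1
    # ... the consumer's wait on the READY flag returns here ...
    if invalidate_after_wait:
        consumer.invalidate("s_gm")
    return consumer.load("s_gm")


print(second_iteration_read(False))  # iter0 (stale)
print(second_iteration_read(True))   # iter1
```

In this model the producer flush is already perfect, yet the read is stale until the consumer drops its own copy, which is why the invalidate sits between the wait and the GM read.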
@MirkoDeVita98

Contributor

@ChaoZheng109 Fix for long sequences paged attention highperf: #1091



Development

Successfully merging this pull request may close these issues.

[Bug] spmd_paged_attention_highperf b1_h32_kv8_s128_bs128_fp16 regressed: sim scheduler stall (-100) + onboard golden mismatch

2 participants