Fix highperf paged attention long sequences by MirkoDeVita98 · Pull Request #1091 · hw-native-sys/simpler

MirkoDeVita98 · 2026-06-19T07:54:14Z

No description provided.

coderabbitai · 2026-06-19T07:54:28Z

📝 Walkthrough

Walkthrough

Adds a DdrBarrierBeforeFfts() helper and REDUCE_READY_DECODER constant to the paged attention kernel, then inserts these barriers at all FFT cross-core synchronization points in the pipeline. Two masking guards in SoftmaxStage1 are changed from pointer-null checks to max_context_len != 0. Two tiling thresholds are adjusted, and new test cases cover large kv_seq values up to 16384.

Changes

SPMD Paged Attention High-Perf: Kernel, Tiling, and Tests

Layer / File(s)	Summary
DdrBarrierBeforeFfts helper and REDUCE_READY_DECODER constant `tests/st/a2a3/.../kernels/kernel/pa_kernel.cce`	Defines `DdrBarrierBeforeFfts()` (dsb + `pipe_barrier(PIPE_ALL)`) and the `REDUCE_READY_DECODER = 13` FFT-readiness flag constant.
DDR barrier insertions at FFT pipeline transition points `tests/st/a2a3/.../kernels/kernel/pa_kernel.cce`	Inserts `pipe_barrier(PIPE_FIX)` + `DdrBarrierBeforeFfts()` before the four decoder/stage2 `FftsCrossCoreSync` calls, adds `pipe_barrier(PIPE_MTE3)` + `DdrBarrierBeforeFfts()` before two vector inner-loop FFT syncs, and replaces a hardcoded flag id with `REDUCE_READY_DECODER` in the C220_VEC reduction path.
SoftmaxStage1 masking guard change `tests/st/a2a3/.../kernels/kernel/pa_kernel.cce`	Changes both masking conditional guards in `SoftmaxStage1` from `mask_gm != nullptr` to `max_context_len != 0`.
Tiling threshold adjustments `tests/st/a2a3/.../kernels/pa_tiling.py`	In `_split_core_bns_nd`, changes the `core_per_kv` recomputation trigger to compare against `block_dim` (removing `SPLITHEAD_RATIO`). In `make_pa_nd_decode_tiling`, fixes `is_long_seq` to `max_kv >= KV_SEQLEN_SLICE_512 * 8`.
New large-kv_seq test cases `tests/st/a2a3/.../test_spmd_paged_attention_highperf.py`	Extends the `CASES` array with a2a3/a2a3sim cases covering kv_seq 4096, 6144, 8192, and 16384 for batch sizes 1 and 2.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

hw-native-sys/simpler#899: Introduced the high-performance paged attention scaffold including pa_tiling.py functions _split_core_bns_nd and make_pa_nd_decode_tiling that are directly modified in this PR.

Poem

🐇 Hop hop, the barriers are set,
Before each FFT, no sync to forget!
The masking now checks context_len,
And tiling thresholds shift again.
With kv_seq stretched to sixteen-k wide,
This rabbit's pipeline runs with pride! 🎉

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	❓ Inconclusive	No pull request description was provided by the author, making it impossible to assess relevance to the changeset.	Add a description explaining the specific issues being fixed, the changes made, and their impact on long sequence handling in paged attention.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Fix highperf paged attention long sequences' directly addresses the main objective of the pull request, which focuses on fixing long sequence handling in the high-performance paged attention implementation across kernel, tiling, and test files.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces several updates to the paged attention high-performance kernel, including the addition of a DDR barrier synchronization function (DdrBarrierBeforeFfts), replacing hardcoded flag IDs with constants, changing mask application conditions, updating tiling logic, and adding new test cases. Feedback on these changes highlights a critical synchronization issue in DdrBarrierBeforeFfts where the pipeline barrier and data synchronization barrier (dsb) are inverted, potentially causing race conditions or hangs. Additionally, replacing the mask_gm != nullptr check with max_context_len != 0 introduces a risk of null pointer dereferences, so it is recommended to check both conditions.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-19T07:56:26Z

+inline __aicore__ void DdrBarrierBeforeFfts()
+{
+#if defined(__CCE_KT_TEST__) || defined(__CCE_AICORE__) || defined(__DAV_C220__)
+#if defined(__CPU_SIM)
+    dsb(0);
+#else
+    dsb(DSB_DDR);
+#endif
+    pipe_barrier(PIPE_ALL);
+#endif
+}


The synchronization sequence in DdrBarrierBeforeFfts is inverted. To ensure that all memory writes from the AI Core pipelines are fully committed and visible in DDR before cross-core synchronization (FFTS) begins, you must first call pipe_barrier(PIPE_ALL) to empty the pipelines, and then call dsb(DSB_DDR) to flush the memory subsystem. Calling dsb before pipe_barrier can lead to race conditions, data corruption, or intermittent hangs on hardware.

inline __aicore__ void DdrBarrierBeforeFfts() { #if defined(__CCE_KT_TEST__) || defined(__CCE_AICORE__) || defined(__DAV_C220__) pipe_barrier(PIPE_ALL); #if defined(__CPU_SIM) dsb(0); #else dsb(DSB_DDR); #endif #endif }

gemini-code-assist · 2026-06-19T07:56:26Z

        pipe_barrier(PIPE_V);

-        if (mask_gm != nullptr) {
+        if (max_context_len != 0) {


To prevent potential null pointer dereferences, it is safer to check both max_context_len != 0 and mask_gm != nullptr before executing the AddMask block.

if (max_context_len != 0 && mask_gm != nullptr) {

gemini-code-assist · 2026-06-19T07:56:26Z

            0                         // dstGap
        );
-        if (mask_gm != nullptr){
+        if (max_context_len != 0){


To prevent potential null pointer dereferences, it is safer to check both max_context_len != 0 and mask_gm != nullptr before setting the flag.

if (max_context_len != 0 && mask_gm != nullptr){

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py (1)
204-299: 💤 Low value

Consider adding test coverage for batch=2 with kv_seq=6144 and kv_seq=16384.

The current test matrix includes batch=2 variants only for kv_seq=4096 and kv_seq=8192. For completeness, consider adding batch=2 test cases for the 6144 and 16384 sequence lengths to ensure multi-batch handling works correctly at all tested sizes.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py`
around lines 204 - 299, Add two new test case dictionaries to complete the
batch=2 test coverage. Create test cases for b2_h32_kv8_s6144_bs128_fp16 with
batch=2 and kv_seq=6144, and b2_h32_kv8_s16384_bs128_fp16 with batch=2 and
kv_seq=16384. Follow the same structure as the existing test dictionaries (with
manual=True, platforms, config, and params sections), using the kv_seq values
from the corresponding batch=1 variants while updating the name and batch
parameter to 2.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/kernel/pa_kernel.cce`:
- Around line 2319-2322: The mask pointer stored in mask_gm_tensor is being
accessed in the AddMask call without a null check, relying only on the
max_context_len condition. Since SetArgs stores mask_in_gm without guaranteeing
it is non-null, you need to gate both the mask_gm_tensor access in AddMask and
the corresponding MTE3 wait_flag call on an additional has_mask predicate
condition alongside the max_context_len check. Apply this same fix to both the
code block around lines 2319-2322 and the similar block around lines 2419-2421
to ensure the mask operations only execute when the mask is actually available.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py`:
- Around line 225-226: Locate the SPLITHEAD_RATIO constant defined at line 70
and determine its intended purpose. Either remove the SPLITHEAD_RATIO constant
entirely if it is dead code, or integrate it into the tiling threshold logic by
modifying the condition in the block containing the comparison against block_dim
to use block_dim * SPLITHEAD_RATIO instead of just block_dim. Ensure the code
reflects the intended design for the tiling threshold calculation.

---

Nitpick comments:
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py`:
- Around line 204-299: Add two new test case dictionaries to complete the
batch=2 test coverage. Create test cases for b2_h32_kv8_s6144_bs128_fp16 with
batch=2 and kv_seq=6144, and b2_h32_kv8_s16384_bs128_fp16 with batch=2 and
kv_seq=16384. Follow the same structure as the existing test dictionaries (with
manual=True, platforms, config, and params sections), using the kv_seq values
from the corresponding batch=1 variants while updating the name and batch
parameter to 2.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6d220bbd-f32e-4733-b67c-e5791a9382b9

📥 Commits

Reviewing files that changed from the base of the PR and between cdbea27 and b3324e8.

📒 Files selected for processing (3)

tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/kernel/pa_kernel.cce
tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py

coderabbitai · 2026-06-19T07:59:15Z

+        if (max_context_len != 0) {
            wait_flag(PIPE_V, PIPE_MTE2, EVENT_ID0);
            wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
            AddMask(mask_gm_tensor, mask_ubuf_tensor, mask32_ubuf_tensor, sub_m, qk_n, qk_round_n, mask_offset);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the mask pointer in the mask gate.

SetArgs stores mask_in_gm without a non-null guarantee, but this guard now lets AddMask read mask_gm_tensor whenever max_context_len is non-zero. Gate both the mask copy and its matching MTE3 flag on the same has_mask predicate.

Proposed fix

- if (max_context_len != 0) { + const bool has_mask = (mask_gm_tensor != nullptr) && (max_context_len != 0); + if (has_mask) { wait_flag(PIPE_V, PIPE_MTE2, EVENT_ID0); wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0); AddMask(mask_gm_tensor, mask_ubuf_tensor, mask32_ubuf_tensor, sub_m, qk_n, qk_round_n, mask_offset); pipe_barrier(PIPE_V); set_flag(PIPE_V, PIPE_MTE2, EVENT_ID0); } @@ - if (max_context_len != 0){ + if (has_mask) { set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0); }

Also applies to: 2419-2421

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/kernel/pa_kernel.cce` around lines 2319 - 2322, The mask pointer stored in mask_gm_tensor is being accessed in the AddMask call without a null check, relying only on the max_context_len condition. Since SetArgs stores mask_in_gm without guaranteeing it is non-null, you need to gate both the mask_gm_tensor access in AddMask and the corresponding MTE3 wait_flag call on an additional has_mask predicate condition alongside the max_context_len check. Apply this same fix to both the code block around lines 2319-2322 and the similar block around lines 2419-2421 to ensure the mask operations only execute when the mask is actually available.

coderabbitai · 2026-06-19T07:59:15Z

+    if decoder_batch * kv_split_core_num < block_dim:
        core_per_kv = _ceil_div(block_dim, decoder_batch * kv_split_core_num)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Description: Verify SPLITHEAD_RATIO constant definition and usage # Check if SPLITHEAD_RATIO is defined in this file or imported rg -n 'SPLITHEAD_RATIO' --type=py

Repository: hw-native-sys/simpler

Length of output: 180

🏁 Script executed:

# Read the file around the SPLITHEAD_RATIO definition and the condition in question head -n 80 tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py | tail -n 15 echo "---" # Read around lines 225-226 sed -n '215,235p' tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py echo "---" # Check all occurrences of SPLITHEAD_RATIO in the file rg -n 'SPLITHEAD_RATIO' tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py -A 2 -B 2

Repository: hw-native-sys/simpler

Length of output: 1371

🏁 Script executed:

# Get full file size and structure to understand context better wc -l tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py echo "---" # Check what function contains lines 225-226 sed -n '200,240p' tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py | head -n 40 echo "---" # Search for any remaining usage of SPLITHEAD_RATIO in entire file (double-check) rg 'SPLITHEAD_RATIO' tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py

Repository: hw-native-sys/simpler

Length of output: 1487

Remove the orphaned SPLITHEAD_RATIO constant or integrate it into the condition at line 225.

SPLITHEAD_RATIO is defined at line 70 but never used anywhere in the file. The condition at lines 225-226 currently compares against block_dim directly rather than block_dim * SPLITHEAD_RATIO. Either this constant is dead code that should be removed, or it should be intentionally integrated into the tiling threshold logic. Clarify the intent and clean up accordingly.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py` around lines 225 - 226, Locate the SPLITHEAD_RATIO constant defined at line 70 and determine its intended purpose. Either remove the SPLITHEAD_RATIO constant entirely if it is dead code, or integrate it into the tiling threshold logic by modifying the condition in the block containing the comparison against block_dim to use block_dim * SPLITHEAD_RATIO instead of just block_dim. Ensure the code reflects the intended design for the tiling threshold calculation.

Fix highperf paged attention long sequences

b3324e8

MirkoDeVita98 mentioned this pull request Jun 19, 2026

fix(#1070): DDR fence before decode FFTS cross-core signals (onboard a2a3 golden mismatch) #1076

Draft

gemini-code-assist Bot reviewed Jun 19, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 19, 2026

View reviewed changes

MirkoDeVita98 mentioned this pull request Jun 19, 2026

[Bug] SPMD Paged Attention Highperf Fails For >= 8192 sequences #1022

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix highperf paged attention long sequences#1091

Fix highperf paged attention long sequences#1091
MirkoDeVita98 wants to merge 1 commit into
hw-native-sys:mainfrom
MirkoDeVita98:spmd-paged-attention-highperf-only

MirkoDeVita98 commented Jun 19, 2026

Uh oh!

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 19, 2026

Uh oh!

gemini-code-assist Bot Jun 19, 2026

Uh oh!

gemini-code-assist Bot Jun 19, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 19, 2026

Uh oh!

coderabbitai Bot Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		if decoder_batch * kv_split_core_num < block_dim:
		core_per_kv = _ceil_div(block_dim, decoder_batch * kv_split_core_num)

Conversation

MirkoDeVita98 commented Jun 19, 2026

Uh oh!

coderabbitai Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading