Skip to content

Fix highperf paged attention long sequences#1091

Open
MirkoDeVita98 wants to merge 1 commit into
hw-native-sys:mainfrom
MirkoDeVita98:spmd-paged-attention-highperf-only
Open

Fix highperf paged attention long sequences#1091
MirkoDeVita98 wants to merge 1 commit into
hw-native-sys:mainfrom
MirkoDeVita98:spmd-paged-attention-highperf-only

Conversation

@MirkoDeVita98

Copy link
Copy Markdown
Contributor

No description provided.

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a DdrBarrierBeforeFfts() helper and REDUCE_READY_DECODER constant to the paged attention kernel, then inserts these barriers at all FFT cross-core synchronization points in the pipeline. Two masking guards in SoftmaxStage1 are changed from pointer-null checks to max_context_len != 0. Two tiling thresholds are adjusted, and new test cases cover large kv_seq values up to 16384.

Changes

SPMD Paged Attention High-Perf: Kernel, Tiling, and Tests

Layer / File(s) Summary
DdrBarrierBeforeFfts helper and REDUCE_READY_DECODER constant
tests/st/a2a3/.../kernels/kernel/pa_kernel.cce
Defines DdrBarrierBeforeFfts() (dsb + pipe_barrier(PIPE_ALL)) and the REDUCE_READY_DECODER = 13 FFT-readiness flag constant.
DDR barrier insertions at FFT pipeline transition points
tests/st/a2a3/.../kernels/kernel/pa_kernel.cce
Inserts pipe_barrier(PIPE_FIX) + DdrBarrierBeforeFfts() before the four decoder/stage2 FftsCrossCoreSync calls, adds pipe_barrier(PIPE_MTE3) + DdrBarrierBeforeFfts() before two vector inner-loop FFT syncs, and replaces a hardcoded flag id with REDUCE_READY_DECODER in the C220_VEC reduction path.
SoftmaxStage1 masking guard change
tests/st/a2a3/.../kernels/kernel/pa_kernel.cce
Changes both masking conditional guards in SoftmaxStage1 from mask_gm != nullptr to max_context_len != 0.
Tiling threshold adjustments
tests/st/a2a3/.../kernels/pa_tiling.py
In _split_core_bns_nd, changes the core_per_kv recomputation trigger to compare against block_dim (removing SPLITHEAD_RATIO). In make_pa_nd_decode_tiling, fixes is_long_seq to max_kv >= KV_SEQLEN_SLICE_512 * 8.
New large-kv_seq test cases
tests/st/a2a3/.../test_spmd_paged_attention_highperf.py
Extends the CASES array with a2a3/a2a3sim cases covering kv_seq 4096, 6144, 8192, and 16384 for batch sizes 1 and 2.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • hw-native-sys/simpler#899: Introduced the high-performance paged attention scaffold including pa_tiling.py functions _split_core_bns_nd and make_pa_nd_decode_tiling that are directly modified in this PR.

Poem

🐇 Hop hop, the barriers are set,
Before each FFT, no sync to forget!
The masking now checks context_len,
And tiling thresholds shift again.
With kv_seq stretched to sixteen-k wide,
This rabbit's pipeline runs with pride! 🎉

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive No pull request description was provided by the author, making it impossible to assess relevance to the changeset. Add a description explaining the specific issues being fixed, the changes made, and their impact on long sequence handling in paged attention.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Fix highperf paged attention long sequences' directly addresses the main objective of the pull request, which focuses on fixing long sequence handling in the high-performance paged attention implementation across kernel, tiling, and test files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several updates to the paged attention high-performance kernel, including the addition of a DDR barrier synchronization function (DdrBarrierBeforeFfts), replacing hardcoded flag IDs with constants, changing mask application conditions, updating tiling logic, and adding new test cases. Feedback on these changes highlights a critical synchronization issue in DdrBarrierBeforeFfts where the pipeline barrier and data synchronization barrier (dsb) are inverted, potentially causing race conditions or hangs. Additionally, replacing the mask_gm != nullptr check with max_context_len != 0 introduces a risk of null pointer dereferences, so it is recommended to check both conditions.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +87 to +97
inline __aicore__ void DdrBarrierBeforeFfts()
{
#if defined(__CCE_KT_TEST__) || defined(__CCE_AICORE__) || defined(__DAV_C220__)
#if defined(__CPU_SIM)
dsb(0);
#else
dsb(DSB_DDR);
#endif
pipe_barrier(PIPE_ALL);
#endif
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The synchronization sequence in DdrBarrierBeforeFfts is inverted. To ensure that all memory writes from the AI Core pipelines are fully committed and visible in DDR before cross-core synchronization (FFTS) begins, you must first call pipe_barrier(PIPE_ALL) to empty the pipelines, and then call dsb(DSB_DDR) to flush the memory subsystem. Calling dsb before pipe_barrier can lead to race conditions, data corruption, or intermittent hangs on hardware.

inline __aicore__ void DdrBarrierBeforeFfts()
{
#if defined(__CCE_KT_TEST__) || defined(__CCE_AICORE__) || defined(__DAV_C220__)
    pipe_barrier(PIPE_ALL);
#if defined(__CPU_SIM)
    dsb(0);
#else
    dsb(DSB_DDR);
#endif
#endif
}

pipe_barrier(PIPE_V);

if (mask_gm != nullptr) {
if (max_context_len != 0) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent potential null pointer dereferences, it is safer to check both max_context_len != 0 and mask_gm != nullptr before executing the AddMask block.

        if (max_context_len != 0 && mask_gm != nullptr) {

0 // dstGap
);
if (mask_gm != nullptr){
if (max_context_len != 0){

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent potential null pointer dereferences, it is safer to check both max_context_len != 0 and mask_gm != nullptr before setting the flag.

        if (max_context_len != 0 && mask_gm != nullptr){

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py (1)

204-299: 💤 Low value

Consider adding test coverage for batch=2 with kv_seq=6144 and kv_seq=16384.

The current test matrix includes batch=2 variants only for kv_seq=4096 and kv_seq=8192. For completeness, consider adding batch=2 test cases for the 6144 and 16384 sequence lengths to ensure multi-batch handling works correctly at all tested sizes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py`
around lines 204 - 299, Add two new test case dictionaries to complete the
batch=2 test coverage. Create test cases for b2_h32_kv8_s6144_bs128_fp16 with
batch=2 and kv_seq=6144, and b2_h32_kv8_s16384_bs128_fp16 with batch=2 and
kv_seq=16384. Follow the same structure as the existing test dictionaries (with
manual=True, platforms, config, and params sections), using the kv_seq values
from the corresponding batch=1 variants while updating the name and batch
parameter to 2.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/kernel/pa_kernel.cce`:
- Around line 2319-2322: The mask pointer stored in mask_gm_tensor is being
accessed in the AddMask call without a null check, relying only on the
max_context_len condition. Since SetArgs stores mask_in_gm without guaranteeing
it is non-null, you need to gate both the mask_gm_tensor access in AddMask and
the corresponding MTE3 wait_flag call on an additional has_mask predicate
condition alongside the max_context_len check. Apply this same fix to both the
code block around lines 2319-2322 and the similar block around lines 2419-2421
to ensure the mask operations only execute when the mask is actually available.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py`:
- Around line 225-226: Locate the SPLITHEAD_RATIO constant defined at line 70
and determine its intended purpose. Either remove the SPLITHEAD_RATIO constant
entirely if it is dead code, or integrate it into the tiling threshold logic by
modifying the condition in the block containing the comparison against block_dim
to use block_dim * SPLITHEAD_RATIO instead of just block_dim. Ensure the code
reflects the intended design for the tiling threshold calculation.

---

Nitpick comments:
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py`:
- Around line 204-299: Add two new test case dictionaries to complete the
batch=2 test coverage. Create test cases for b2_h32_kv8_s6144_bs128_fp16 with
batch=2 and kv_seq=6144, and b2_h32_kv8_s16384_bs128_fp16 with batch=2 and
kv_seq=16384. Follow the same structure as the existing test dictionaries (with
manual=True, platforms, config, and params sections), using the kv_seq values
from the corresponding batch=1 variants while updating the name and batch
parameter to 2.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6d220bbd-f32e-4733-b67c-e5791a9382b9

📥 Commits

Reviewing files that changed from the base of the PR and between cdbea27 and b3324e8.

📒 Files selected for processing (3)
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/kernel/pa_kernel.cce
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py

Comment on lines +2319 to 2322
if (max_context_len != 0) {
wait_flag(PIPE_V, PIPE_MTE2, EVENT_ID0);
wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
AddMask(mask_gm_tensor, mask_ubuf_tensor, mask32_ubuf_tensor, sub_m, qk_n, qk_round_n, mask_offset);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the mask pointer in the mask gate.

SetArgs stores mask_in_gm without a non-null guarantee, but this guard now lets AddMask read mask_gm_tensor whenever max_context_len is non-zero. Gate both the mask copy and its matching MTE3 flag on the same has_mask predicate.

Proposed fix
-        if (max_context_len != 0) {
+        const bool has_mask = (mask_gm_tensor != nullptr) && (max_context_len != 0);
+        if (has_mask) {
             wait_flag(PIPE_V, PIPE_MTE2, EVENT_ID0);
             wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
             AddMask(mask_gm_tensor, mask_ubuf_tensor, mask32_ubuf_tensor, sub_m, qk_n, qk_round_n, mask_offset);
             pipe_barrier(PIPE_V);
             set_flag(PIPE_V, PIPE_MTE2, EVENT_ID0);
         }
@@
-        if (max_context_len != 0){
+        if (has_mask) {
             set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
         }

Also applies to: 2419-2421

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/kernel/pa_kernel.cce`
around lines 2319 - 2322, The mask pointer stored in mask_gm_tensor is being
accessed in the AddMask call without a null check, relying only on the
max_context_len condition. Since SetArgs stores mask_in_gm without guaranteeing
it is non-null, you need to gate both the mask_gm_tensor access in AddMask and
the corresponding MTE3 wait_flag call on an additional has_mask predicate
condition alongside the max_context_len check. Apply this same fix to both the
code block around lines 2319-2322 and the similar block around lines 2419-2421
to ensure the mask operations only execute when the mask is actually available.

Comment on lines +225 to 226
if decoder_batch * kv_split_core_num < block_dim:
core_per_kv = _ceil_div(block_dim, decoder_batch * kv_split_core_num)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify SPLITHEAD_RATIO constant definition and usage

# Check if SPLITHEAD_RATIO is defined in this file or imported
rg -n 'SPLITHEAD_RATIO' --type=py

Repository: hw-native-sys/simpler

Length of output: 180


🏁 Script executed:

# Read the file around the SPLITHEAD_RATIO definition and the condition in question
head -n 80 tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py | tail -n 15

echo "---"

# Read around lines 225-226
sed -n '215,235p' tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py

echo "---"

# Check all occurrences of SPLITHEAD_RATIO in the file
rg -n 'SPLITHEAD_RATIO' tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py -A 2 -B 2

Repository: hw-native-sys/simpler

Length of output: 1371


🏁 Script executed:

# Get full file size and structure to understand context better
wc -l tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py

echo "---"

# Check what function contains lines 225-226
sed -n '200,240p' tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py | head -n 40

echo "---"

# Search for any remaining usage of SPLITHEAD_RATIO in entire file (double-check)
rg 'SPLITHEAD_RATIO' tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py

Repository: hw-native-sys/simpler

Length of output: 1487


Remove the orphaned SPLITHEAD_RATIO constant or integrate it into the condition at line 225.

SPLITHEAD_RATIO is defined at line 70 but never used anywhere in the file. The condition at lines 225-226 currently compares against block_dim directly rather than block_dim * SPLITHEAD_RATIO. Either this constant is dead code that should be removed, or it should be intentionally integrated into the tiling threshold logic. Clarify the intent and clean up accordingly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py`
around lines 225 - 226, Locate the SPLITHEAD_RATIO constant defined at line 70
and determine its intended purpose. Either remove the SPLITHEAD_RATIO constant
entirely if it is dead code, or integrate it into the tiling threshold logic by
modifying the condition in the block containing the comparison against block_dim
to use block_dim * SPLITHEAD_RATIO instead of just block_dim. Ensure the code
reflects the intended design for the tiling threshold calculation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant