Fix highperf paged attention long sequences#1091
Conversation
📝 WalkthroughWalkthroughAdds a ChangesSPMD Paged Attention High-Perf: Kernel, Tiling, and Tests
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces several updates to the paged attention high-performance kernel, including the addition of a DDR barrier synchronization function (DdrBarrierBeforeFfts), replacing hardcoded flag IDs with constants, changing mask application conditions, updating tiling logic, and adding new test cases. Feedback on these changes highlights a critical synchronization issue in DdrBarrierBeforeFfts where the pipeline barrier and data synchronization barrier (dsb) are inverted, potentially causing race conditions or hangs. Additionally, replacing the mask_gm != nullptr check with max_context_len != 0 introduces a risk of null pointer dereferences, so it is recommended to check both conditions.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| inline __aicore__ void DdrBarrierBeforeFfts() | ||
| { | ||
| #if defined(__CCE_KT_TEST__) || defined(__CCE_AICORE__) || defined(__DAV_C220__) | ||
| #if defined(__CPU_SIM) | ||
| dsb(0); | ||
| #else | ||
| dsb(DSB_DDR); | ||
| #endif | ||
| pipe_barrier(PIPE_ALL); | ||
| #endif | ||
| } |
There was a problem hiding this comment.
The synchronization sequence in DdrBarrierBeforeFfts is inverted. To ensure that all memory writes from the AI Core pipelines are fully committed and visible in DDR before cross-core synchronization (FFTS) begins, you must first call pipe_barrier(PIPE_ALL) to empty the pipelines, and then call dsb(DSB_DDR) to flush the memory subsystem. Calling dsb before pipe_barrier can lead to race conditions, data corruption, or intermittent hangs on hardware.
inline __aicore__ void DdrBarrierBeforeFfts()
{
#if defined(__CCE_KT_TEST__) || defined(__CCE_AICORE__) || defined(__DAV_C220__)
pipe_barrier(PIPE_ALL);
#if defined(__CPU_SIM)
dsb(0);
#else
dsb(DSB_DDR);
#endif
#endif
}
| pipe_barrier(PIPE_V); | ||
|
|
||
| if (mask_gm != nullptr) { | ||
| if (max_context_len != 0) { |
| 0 // dstGap | ||
| ); | ||
| if (mask_gm != nullptr){ | ||
| if (max_context_len != 0){ |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py (1)
204-299: 💤 Low valueConsider adding test coverage for batch=2 with kv_seq=6144 and kv_seq=16384.
The current test matrix includes batch=2 variants only for kv_seq=4096 and kv_seq=8192. For completeness, consider adding batch=2 test cases for the 6144 and 16384 sequence lengths to ensure multi-batch handling works correctly at all tested sizes.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py` around lines 204 - 299, Add two new test case dictionaries to complete the batch=2 test coverage. Create test cases for b2_h32_kv8_s6144_bs128_fp16 with batch=2 and kv_seq=6144, and b2_h32_kv8_s16384_bs128_fp16 with batch=2 and kv_seq=16384. Follow the same structure as the existing test dictionaries (with manual=True, platforms, config, and params sections), using the kv_seq values from the corresponding batch=1 variants while updating the name and batch parameter to 2.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/kernel/pa_kernel.cce`:
- Around line 2319-2322: The mask pointer stored in mask_gm_tensor is being
accessed in the AddMask call without a null check, relying only on the
max_context_len condition. Since SetArgs stores mask_in_gm without guaranteeing
it is non-null, you need to gate both the mask_gm_tensor access in AddMask and
the corresponding MTE3 wait_flag call on an additional has_mask predicate
condition alongside the max_context_len check. Apply this same fix to both the
code block around lines 2319-2322 and the similar block around lines 2419-2421
to ensure the mask operations only execute when the mask is actually available.
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py`:
- Around line 225-226: Locate the SPLITHEAD_RATIO constant defined at line 70
and determine its intended purpose. Either remove the SPLITHEAD_RATIO constant
entirely if it is dead code, or integrate it into the tiling threshold logic by
modifying the condition in the block containing the comparison against block_dim
to use block_dim * SPLITHEAD_RATIO instead of just block_dim. Ensure the code
reflects the intended design for the tiling threshold calculation.
---
Nitpick comments:
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py`:
- Around line 204-299: Add two new test case dictionaries to complete the
batch=2 test coverage. Create test cases for b2_h32_kv8_s6144_bs128_fp16 with
batch=2 and kv_seq=6144, and b2_h32_kv8_s16384_bs128_fp16 with batch=2 and
kv_seq=16384. Follow the same structure as the existing test dictionaries (with
manual=True, platforms, config, and params sections), using the kv_seq values
from the corresponding batch=1 variants while updating the name and batch
parameter to 2.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6d220bbd-f32e-4733-b67c-e5791a9382b9
📒 Files selected for processing (3)
tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/kernel/pa_kernel.ccetests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.pytests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/test_spmd_paged_attention_highperf.py
| if (max_context_len != 0) { | ||
| wait_flag(PIPE_V, PIPE_MTE2, EVENT_ID0); | ||
| wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0); | ||
| AddMask(mask_gm_tensor, mask_ubuf_tensor, mask32_ubuf_tensor, sub_m, qk_n, qk_round_n, mask_offset); |
There was a problem hiding this comment.
Keep the mask pointer in the mask gate.
SetArgs stores mask_in_gm without a non-null guarantee, but this guard now lets AddMask read mask_gm_tensor whenever max_context_len is non-zero. Gate both the mask copy and its matching MTE3 flag on the same has_mask predicate.
Proposed fix
- if (max_context_len != 0) {
+ const bool has_mask = (mask_gm_tensor != nullptr) && (max_context_len != 0);
+ if (has_mask) {
wait_flag(PIPE_V, PIPE_MTE2, EVENT_ID0);
wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
AddMask(mask_gm_tensor, mask_ubuf_tensor, mask32_ubuf_tensor, sub_m, qk_n, qk_round_n, mask_offset);
pipe_barrier(PIPE_V);
set_flag(PIPE_V, PIPE_MTE2, EVENT_ID0);
}
@@
- if (max_context_len != 0){
+ if (has_mask) {
set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
}Also applies to: 2419-2421
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/kernel/pa_kernel.cce`
around lines 2319 - 2322, The mask pointer stored in mask_gm_tensor is being
accessed in the AddMask call without a null check, relying only on the
max_context_len condition. Since SetArgs stores mask_in_gm without guaranteeing
it is non-null, you need to gate both the mask_gm_tensor access in AddMask and
the corresponding MTE3 wait_flag call on an additional has_mask predicate
condition alongside the max_context_len check. Apply this same fix to both the
code block around lines 2319-2322 and the similar block around lines 2419-2421
to ensure the mask operations only execute when the mask is actually available.
| if decoder_batch * kv_split_core_num < block_dim: | ||
| core_per_kv = _ceil_div(block_dim, decoder_batch * kv_split_core_num) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Verify SPLITHEAD_RATIO constant definition and usage
# Check if SPLITHEAD_RATIO is defined in this file or imported
rg -n 'SPLITHEAD_RATIO' --type=pyRepository: hw-native-sys/simpler
Length of output: 180
🏁 Script executed:
# Read the file around the SPLITHEAD_RATIO definition and the condition in question
head -n 80 tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py | tail -n 15
echo "---"
# Read around lines 225-226
sed -n '215,235p' tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py
echo "---"
# Check all occurrences of SPLITHEAD_RATIO in the file
rg -n 'SPLITHEAD_RATIO' tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py -A 2 -B 2Repository: hw-native-sys/simpler
Length of output: 1371
🏁 Script executed:
# Get full file size and structure to understand context better
wc -l tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py
echo "---"
# Check what function contains lines 225-226
sed -n '200,240p' tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py | head -n 40
echo "---"
# Search for any remaining usage of SPLITHEAD_RATIO in entire file (double-check)
rg 'SPLITHEAD_RATIO' tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.pyRepository: hw-native-sys/simpler
Length of output: 1487
Remove the orphaned SPLITHEAD_RATIO constant or integrate it into the condition at line 225.
SPLITHEAD_RATIO is defined at line 70 but never used anywhere in the file. The condition at lines 225-226 currently compares against block_dim directly rather than block_dim * SPLITHEAD_RATIO. Either this constant is dead code that should be removed, or it should be intentionally integrated into the tiling threshold logic. Clarify the intent and clean up accordingly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf/kernels/pa_tiling.py`
around lines 225 - 226, Locate the SPLITHEAD_RATIO constant defined at line 70
and determine its intended purpose. Either remove the SPLITHEAD_RATIO constant
entirely if it is dead code, or integrate it into the tiling threshold logic by
modifying the condition in the block containing the comparison against block_dim
to use block_dim * SPLITHEAD_RATIO instead of just block_dim. Ensure the code
reflects the intended design for the tiling threshold calculation.
No description provided.