Align tmp contracts for tile ops#824
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates several PTO IR operations to relax constraints on the tmp (temporary workspace) operand, particularly on the A5 architecture where it is treated as a placeholder that does not need to match the shape or layout of the source/destination tiles. This prevents false vector overflow diagnostics and updates memory effects. The review feedback points out a mismatch in the expected error message in a new test file and recommends completely removing the unused verifyTRowReductionNoTmpCommon helper function in PTO.cpp to avoid dead code.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| } | ||
| } | ||
|
|
||
| // CHECK: error: 'pto.trowmax' op expects tmp to use the row_major blayout |
There was a problem hiding this comment.
The expected error message in the CHECK directive does not match the actual error message emitted by the verifier in lib/PTO/IR/PTO.cpp. The verifier emits "expects tmp to use row-major layout", but the test expects "expects tmp to use the row_major blayout". This mismatch will cause the lit test to fail.
// CHECK: error: 'pto.trowmax' op expects tmp to use row-major layout
| [[maybe_unused]] static LogicalResult | ||
| verifyTRowReductionNoTmpCommon(Operation *op, Type srcTy, Type dstTy, | ||
| StringRef elemTypeError) { |
There was a problem hiding this comment.
The helper function verifyTRowReductionNoTmpCommon is now marked [[maybe_unused]] because all of its callers (TRowMaxOp and TRowSumOp) have been migrated to use verifyTRowReductionWithTmpCommon. Since this is a static helper function in this translation unit and is no longer used anywhere, it should be completely removed from the codebase to avoid dead code.
Codex Review该评论由 review 机器人自动更新。
SummaryPR #824 introduces a backend contract mismatch for A5 Findings
This example still says it reconstructs |
b63bf80 to
e9b88fa
Compare
Signed-off-by: FangRui <fangrui_95@163.com>
|
/run a3 |
|
已接收
页面会自动刷新,可以直接看当前阶段、排队情况和最近结果。 |
A3 板测失败
日志尾部 |
|
/run a3 |
zhangstevenunity
left a comment
There was a problem hiding this comment.
Review: Align tmp contracts for tile ops
The A5 "tmp is an unused placeholder" modeling (skip the scratch WRITE in getEffects to dodge false vec-overflow) is sound for the ops whose A5 kernel genuinely ignores tmp: tsel/tsels/txor/txors TileLang templates never write tmp, and the new A5 txors verifier checks are correctly A2/A3-gated. But two changes look like correctness regressions, and one needs confirmation.
Requesting changes
1 (blocking). A5 trowexpand* + tmp is now verifier-legal but cannot be lowered on EITHER A5 backend.
This PR deletes the hasTmp && A5 -> "expects A5 form to omit tmp" guard in verifyTRowExpandReduceLikeOp. That guard's own comment said it exists so a 0x0 dst cannot "let an A5 tmp form slip through and lower to the A2/A3 4-operand TROWEXPAND* call." Removing it re-enables exactly that:
--pto-backend=vpto: the A5 TileLang templates for this family are 3-param(src0, src1, dst)(lib/TileOps/trowexpandmax_template.py). ExpandTileOp builds one operand-spec per operand (buildSpecKey over all operands) and expand_helper rejects any descriptor wherelen(parameters) != len(operand_specs)(expand_helper.py:335), so a 4-operand op finds no kernel and ExpandTileOp hard-fails: "failed to instantiate tilelang template for pto.trowexpandmax" (ExpandTileOp.cpp:1204). This is the Codex bot's P2.- default EmitC: PTORowExpand*ToEmitC forwards tmp as a 4th arg unconditionally, emitting
TROWEXPANDMAX(dst, src0, src1, tmp). The A5 board kernels are 3-operand (test/.../a5/.../trowexpandmax/main.cpp declaresLaunchTROWEXPANDMAX_*(src0, src1, dst, stream); the A5 ST .pto usesins(%src0, %src1)), matching the pre-PR "A5 supports the 3-operand form only" note this PR removed. So the 4-arg call has no A5 ISA overload.
The new tests do not catch this: issue708_* and tmp_contract_a5_non_same_shape use --emit-pto-ir (stops before lowering), and a5_unused_tmp_vec_overflow only FileChecks emitted text, never compiling for the board. Please either keep rejecting tmp on A5 for trowexpand*, or give the vpto templates a tmp param (like trowmax) AND confirm a 4-operand A5 EmitC/ISA path exists, backed by a real board test.
2 (blocking). The WithTmp relaxation silently loosens A2/A3 trowmin/trowprod, where tmp is real scratch.
verifyTRowReductionWithTmpCommon drops verifyTileBufSameElemType + verifyTileBufSameValidShape (and downgrades verifyVecTileCommon -> verifyVecTileStorage) for ALL arches. On master, trowmin/trowprod already used this helper (only trowmax/trowsum are newly migrated from the NoTmp helper). On A2/A3 these forward tmp to TROWMIN/TROWPROD(dst, src, tmp) as genuine scratch, and the deleted test trowprod_tmp_mismatch_invalid.py asserted that a mismatched tmp must fail. Now an undersized / wrong-dtype tmp passes verification on A2/A3 and reaches the intrinsic -> possible scratch overflow / type reinterpret. The PR's intent is A5-placeholder semantics, so please A5-gate the relaxation (keep the strict checks on A2/A3), or justify why A2/A3 reductions tolerate a non-matching tmp.
Please confirm
3. Row-reduction getEffects drop the A5 tmp WRITE with no justification comment. For trowmax/trowmin/trowsum/trowprod, EmitC still forwards tmp to TROWMAX(dst, src, tmp) on A5. Dropping the WRITE is correct only if the A5 ISA truly does not write tmp for these (the vpto template ignoring tmp is encouraging but does not cover the EmitC path). If any A5 reduction writes tmp as scratch, InsertSync now misses a WAR/WAW on it. tsel/txor got an explanatory comment; please add the same here and confirm against the A5 ISA.
Nits
- ptodsl/examples/softmax_lowlevel.py still says it reconstructs softmax/softmax.pto, but flips tile_col to row_major while that kernel keeps blayout=col_major for the 8x1 reduction tiles (Codex P3).
- PTO.cpp ~10313: the comment "(A5 tmp-form invariant is checked earlier ...)" is now stale -- the invariant it refers to was deleted in this PR.
- verifyTRowReductionNoTmpCommon now has zero callers; it is marked [[maybe_unused]] rather than removed (gemini flagged this) -- please delete it.
| let summary = "TROWEXPANDMAX: Row-wise broadcast max with per-row scalar vector."; | ||
| let description = [{ | ||
| pto-isa has overloads with/without tmp on A2/A3; A5 supports the 3-operand form only. | ||
| pto-isa has overloads with/without tmp; on A5 the tmp operand is accepted as a placeholder and does not add extra shape constraints. |
There was a problem hiding this comment.
A5 trowexpand* with tmp is now verifier-legal but un-lowerable on both A5 backends.
vpto: the A5 TileLang template is 3-param (src0, src1, dst) (lib/TileOps/trowexpandmax_template.py). ExpandTileOp emits one operand-spec per operand and expand_helper drops any kernel where len(parameters) != len(operand_specs) (expand_helper.py:335), so a 4-operand op hard-fails ExpandTileOp: "failed to instantiate tilelang template for pto.trowexpandmax" (ExpandTileOp.cpp:1204).
EmitC (default): PTORowExpandMaxToEmitC forwards tmp as a 4th arg -> TROWEXPANDMAX(dst, src0, src1, tmp), but the A5 board kernel is 3-operand (a5/.../trowexpandmax/main.cpp: LaunchTROWEXPANDMAX_*(src0, src1, dst, stream)).
This is the bug the deleted hasTmp && A5 guard in verifyTRowExpandReduceLikeOp prevented -- its comment said the form would otherwise "lower to the A2/A3 4-operand TROWEXPAND* call." The new tests only use --emit-pto-ir or FileCheck text, so they never exercise the lowering. Suggest keeping the A5 reject, or adding a tmp-taking vpto template plus a real 4-operand A5 path with a board test.
| StringRef elemTypeError) { | ||
| if (failed(verifyRowReductionSrcLayout(op, srcTy, "src")) || | ||
| failed(verifyVecTileCommon(op, tmpTy, "tmp")) || | ||
| failed(verifyVecTileStorage(op, tmpTy, "tmp")) || |
There was a problem hiding this comment.
Dropping verifyTileBufSameElemType + verifyTileBufSameValidShape here loosens tmp checks for ALL arches, but trowmin/trowprod already used this helper on master and forward tmp to TROWMIN/TROWPROD(dst, src, tmp) as real scratch on A2/A3. The deleted trowprod_tmp_mismatch_invalid.py asserted that a mismatched tmp must fail. Now an undersized / wrong-dtype tmp passes verification on A2/A3 and reaches the intrinsic (scratch overflow / type reinterpret). Please A5-gate this relaxation so A2/A3 keep the same-elem / same-valid-shape checks, or justify why A2/A3 reductions tolerate a non-matching tmp.
| SmallVectorImpl<SideEffects::EffectInstance<MemoryEffects::Effect>> &effects) { | ||
| PTO_ADD_READ(getSrcMutable()); | ||
| PTO_ADD_WRITE(getTmpMutable()); | ||
| if (getTargetArch(getOperation()) != PTOArch::A5) |
There was a problem hiding this comment.
This drops the A5 tmp WRITE for trowmax (same change for trowmin/trowsum/trowprod), but EmitC still forwards tmp to TROWMAX(dst, src, tmp) on A5. This is safe only if the A5 ISA does not write tmp for these reductions. If it does, InsertSync now misses a WAR/WAW on tmp. tsel/txor got an explanatory comment for the identical change; please add the same here and confirm against the A5 ISA (the vpto template ignoring tmp is supportive but does not cover the EmitC path).
| _col_cfg = pto.TileBufConfigAttr.get( | ||
| pto.BLayoutAttr.get(pto.BLayout.ColMajor), | ||
| _reduce_cfg = pto.TileBufConfigAttr.get( | ||
| pto.BLayoutAttr.get(pto.BLayout.RowMajor), |
There was a problem hiding this comment.
This flips tile_col to row_major, but softmax_lowlevel.py (line 13) still says it reconstructs test/tilelang_st/npu/a5/src/st/testcase/softmax/softmax.pto, where the 8x1 reduction tiles keep blayout=col_major. The example no longer matches the referenced kernel (Codex P3). Either revert this layout or update the reference/claim.
|
已接收
页面会自动刷新,可以直接看当前阶段、排队情况和最近结果。 |
zhangstevenunity
left a comment
There was a problem hiding this comment.
Correction after reading the latest pto-isa (npu/a5 + npu/a2a3)
I checked the actual A5/A2A3 intrinsic implementations and need to retract two of my three blocking points -- the A5 placeholder design in this PR is more correct than my first pass implied. Apologies for the noise.
RETRACT my point 1, EmitC half (no board break). pto-isa include/pto/npu/a5/TRowExpand{Max,Add,Mul,Sub,Div,Min,Expdif}.hpp each provide a 4-arg TROWEXPAND*_IMPL(dst, src0, src1, tmp) overload, explicitly commented "4-arg overload for cross-architecture portability with A2/A3 ... the tmp tile is accepted and ignored." So the default EmitC path compiles and runs correctly on A5 -- there is no missing 4-operand ISA overload. My EmitC board-break claim was wrong.
RETRACT my point 3 (row-reduction A5 sync is fine). npu/a5/TRowReduce.hpp (TROWMAX/TROWSUM/TROWMIN_IMPL) and npu/a5/TRowProd.hpp take tmp and never touch it -- the reduction is pure register accumulation (vregdst). So dropping the A5 tmp WRITE in getEffects is correct and the false-vec-overflow fix is sound; no missing WAR/WAW.
DOWNGRADE my point 1, vpto half (now minor / non-blocking). Only the opt-in --pto-backend=vpto path is affected: the TileLang templates lib/TileOps/trowexpand*_template.py are still 3-param (src0, src1, dst), so ExpandTileOp hard-fails (loudly) on a 4-operand op (expand_helper.py:335 -> ExpandTileOp.cpp:1204). No current frontend emits trowexpand*+tmp, so this is latent. Suggest giving those templates a tmp param like trowmax_template.py already has (mirroring the ISA's 4-arg portability overload) -- a consistency follow-up, not a blocker.
STILL STANDS: my point 2 (A2/A3 trowmin/trowprod), now confirmed against the ISA. npu/a2a3/TRowMax.hpp and TRowProd.hpp use tmp as real read/write scratch: vector_dup(tmp, ...) then vmax/vmul(tmp, tmp, src + block*elemsPerBlock), with elemsPerBlock = BLOCK_BYTE_SIZE / sizeof(T) (8 for i32, 16 for i16), and the tile pointer is cast to T*. So A2/A3 needs tmp >= one block AND tmp dtype == src dtype. The relaxed verifyTRowReductionWithTmpCommon drops BOTH the same-elem-type check and any size check for all arches, so a sub-block / wrong-dtype tmp now passes verification on A2/A3 -> silent out-of-bounds scratch write. This is reachable in exactly the cross-arch "placeholder" scenario this PR promotes: a [1,8] tmp is fine for i32 (block=8) but overflows an i16 reduction (block=16). The old same-valid-shape check was over-strict (it demanded a full-src tmp, more than the one-block minimum) but it was safe.
Recommend A5-gating the relaxation -- keep same-elem-type + a >= one-block size check on A2/A3 -- instead of dropping the checks for all arches. If trowmin/trowprod callers are guaranteed to size tmp >= one block with matching dtype, point 2 is non-blocking too.
Net: the A5 placeholder modeling is sound and ISA-backed. The only residual correctness concern is the A2/A3 scratch checks in point 2.
A3 板测失败
失败用例
|
A3 板测失败详情:PR #824orchestration_example_kernel_add
vector_example_dag_kernel_add_scalar
paged_attention_example_kernel_pv_matmul
paged_attention_example_kernel_init_inplace
vector_example_dag_kernel_add
paged_attention_example_kernel_online_update
paged_attention_example_kernel_softmax_prepare
orchestration_example_kernel_add_scalar
paged_attention_example_kernel_qk_matmul
orchestration_example_kernel_mul
vector_example_dag_kernel_mul
rowexpanddiv
prelu
plan_memory_bind_tile_alias_liveness
plan_memory_peak_exact_capacity
plan_memory_loop_no_reuse_outer_live
plan_memory_if_yield
plan_memory_loop_in_if
plan_memory_peak_8_overlapping
plan_memory_if_in_loop
plan_memory_fragmentation_hole_fit
plan_memory_for_iter_args_yield
plan_memory_no_reuse_overlap
plan_memory_reuse_sequential
plan_memory_nested_loops
|
A3 板测失败详情:PR #824plan_memory_fragmentation_two_holes
rems
xor
partition_view_verify_rank_mismatch_valid
partition_view_verify_valid
rowexpandmul
quant_asym
quant
partition5d_dynamic
partition5d
scatter
sparse_attn_test_incore_7
decode_hca_test_incore_54
attention_swa_test_incore_40
|
A3 板测失败详情:PR #824decode_swa_test_incore_40
decode_csa_test_incore_81
attention_hca_test_incore_54
attention_csa_test_refresh_incore_81
tensor_view_layout_dn
rowexpandsub
rope_kv_cache
qwen3_decode_incore_4
post_rmsnorm
qwen3_decode_incore_1
qwen3_decode_incore_10
qwen3_decode_incore_11
rmsnorm
qwen3_decode_incore_6
qwen3_decode_incore_2
qwen3_decode_incore_7
qwen3_decode_incore_5
qwen3_decode_incore_12
sels
tprefetch_async_binding
test_barrier_sync
matmul
add_double_dynamic
nested_loop_confliect
rar_optimization_test
test_dynamic_valid_shape
test_auto_sync_tail_hint
|
A3 板测失败详情:PR #824compensation_test
rem
partmin
|
No description provided.