feat(ptodsl): add flash-attention demos and mixed-backend subkernel pipeline support#816
feat(ptodsl): add flash-attention demos and mixed-backend subkernel pipeline support#816Zhendong404 wants to merge 13 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces backend-partitioned container support, a new pass to normalize uncovered tile sections, and helper inlining for PTODSL subkernel calls. It also updates the EmitC lowering pipeline to handle pointer-like tile buffer addresses and adds several new examples and tests. The reviewer feedback focuses on improving robustness and correctness: refactoring the uncovered segment collection to prevent over-aggressive splitting, adding defensive checks for memory space attributes in EmitC lowering to avoid potential crashes, ensuring deterministic symbol resolution during peer lookups, and using a more robust check for static dimensions.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| static void collectUncoveredTopLevelSegments( | ||
| func::FuncOp funcOp, SmallVectorImpl<UncoveredTopLevelSegment> &segments) { | ||
| if (!funcOp || funcOp.isDeclaration() || !funcOp.getBody().hasOneBlock()) | ||
| return; | ||
|
|
||
| Block &entryBlock = funcOp.getBody().front(); | ||
| UncoveredTopLevelSegment current; | ||
|
|
||
| auto flushCurrent = [&]() { | ||
| if (!current.firstOp) | ||
| return; | ||
| segments.push_back(current); | ||
| current = {}; | ||
| }; | ||
|
|
||
| for (Operation &op : entryBlock.getOperations()) { | ||
| if (isa<func::ReturnOp>(op)) { | ||
| flushCurrent(); | ||
| continue; | ||
| } | ||
|
|
||
| if (isExplicitSection(&op)) { | ||
| flushCurrent(); | ||
| continue; | ||
| } | ||
|
|
||
| UncoveredTopLevelSegment opSummary = summarizeTopLevelOperation(&op); | ||
| if (!opSummary.containsTileOp) { | ||
| flushCurrent(); | ||
| continue; | ||
| } | ||
|
|
||
| if (!current.firstOp) { | ||
| current = std::move(opSummary); | ||
| continue; | ||
| } | ||
|
|
||
| std::optional<InferredSectionKind> currentKind = inferSegmentKind(current); | ||
| std::optional<InferredSectionKind> opKind = inferSegmentKind(opSummary); | ||
| bool mustSplit = current.containsNestedExplicitSection || | ||
| opSummary.containsNestedExplicitSection || !currentKind || | ||
| !opKind || *currentKind != *opKind; | ||
| if (mustSplit) { | ||
| flushCurrent(); | ||
| current = std::move(opSummary); | ||
| continue; | ||
| } | ||
|
|
||
| mergeSegmentSummary(current, opSummary); | ||
| } | ||
|
|
||
| flushCurrent(); | ||
| } |
There was a problem hiding this comment.
The current implementation of collectUncoveredTopLevelSegments flushes and splits the active segment immediately upon encountering any top-level operation that does not contain a tile op (e.g., a scalar calculation or a subkernel call). This results in unnecessary splitting of adjacent tile operations of the same kind, creating multiple small sections and potentially duplicating neutral top-level operations (like subkernel calls) into both Cube and Vector child modules during splitting.\n\nInstead, we should use a cohesive partitioning algorithm that accumulates neutral operations into the current segment, and only flushes/splits when there is an actual conflict between Vector and Cube kinds, or when hitting an explicit section/return.
static void collectUncoveredTopLevelSegments(
func::FuncOp funcOp, SmallVectorImpl<UncoveredTopLevelSegment> &segments) {
if (!funcOp || funcOp.isDeclaration() || !funcOp.getBody().hasOneBlock())
return;
Block &entryBlock = funcOp.getBody().front();
UncoveredTopLevelSegment current;
auto flushCurrent = [&]() {
if (!current.firstOp)
return;
if (current.containsTileOp)
segments.push_back(current);
current = {};
};
for (Operation &op : entryBlock.getOperations()) {
if (isa<func::ReturnOp>(op) || isExplicitSection(&op)) {
flushCurrent();
continue;
}
UncoveredTopLevelSegment opSummary = summarizeTopLevelOperation(&op);
if (opSummary.containsNestedExplicitSection) {
flushCurrent();
segments.push_back(opSummary);
continue;
}
if (!current.firstOp) {
current = std::move(opSummary);
continue;
}
std::optional<InferredSectionKind> currentKind = inferSegmentKind(current);
std::optional<InferredSectionKind> opKind = inferSegmentKind(opSummary);
if (currentKind && opKind && *currentKind != *opKind) {
flushCurrent();
current = std::move(opSummary);
} else {
mergeSegmentSummary(current, opSummary);
}
}
flushCurrent();
}| std::string qualifier = | ||
| addrSpaceQualifier(type.getMemorySpace().getAddressSpace()); |
There was a problem hiding this comment.
Calling type.getMemorySpace().getAddressSpace() directly without checking if the memory space attribute is present and is of type AddressSpaceAttr can lead to a crash or undefined behavior. It is safer to use a defensive check similar to other parts of this file.
| std::string qualifier = | |
| addrSpaceQualifier(type.getMemorySpace().getAddressSpace()); | |
| std::string qualifier = "__gm__"; | |
| if (auto ms = type.getMemorySpace()) { | |
| if (auto ptoAttr = dyn_cast<pto::AddressSpaceAttr>(ms)) | |
| qualifier = addrSpaceQualifier(ptoAttr.getAddressSpace()); | |
| } |
| if (auto ptrTy = dyn_cast<pto::PtrType>(originalCalleeArgTy)) { | ||
| elemTy = ptrTy.getElementType(); | ||
| as = ptrTy.getMemorySpace().getAddressSpace(); |
There was a problem hiding this comment.
Defensively check ptrTy.getMemorySpace() using dyn_cast_or_null<pto::AddressSpaceAttr> before calling getAddressSpace() to prevent potential crashes if the memory space attribute is null or of an unexpected type.
if (auto ptrTy = dyn_cast<pto::PtrType>(originalCalleeArgTy)) {
elemTy = ptrTy.getElementType();
if (auto asAttr = dyn_cast_or_null<pto::AddressSpaceAttr>(ptrTy.getMemorySpace()))
as = asAttr.getAddressSpace();
}| SmallVector<func::FuncOp> fallbackMatches; | ||
| outerModule.walk([&](func::FuncOp funcOp) { | ||
| auto visibility = funcOp->getAttrOfType<StringAttr>("sym_visibility"); | ||
| if (visibility && visibility.getValue() == "private") | ||
| return WalkResult::advance(); | ||
|
|
||
| StringRef symbolName = funcOp.getSymName(); | ||
| if (symbolName == target) { | ||
| fallbackMatches.clear(); | ||
| fallbackMatches.push_back(funcOp); | ||
| return WalkResult::interrupt(); | ||
| } | ||
| if (funcOp->hasAttr(kPTODSLLogicalNameAttrName) && | ||
| getPTODSLLogicalNameOrSymbolName(funcOp) == target) | ||
| fallbackMatches.push_back(funcOp); | ||
| return WalkResult::advance(); | ||
| }); |
There was a problem hiding this comment.
Interrupting the walk immediately upon finding an exact symbol name match can lead to non-deterministic or incorrect symbol resolution if multiple sibling modules contain public functions with the same name. It is safer to collect all matches (both exact and logical) without interrupting, and then verify that exactly one unique match exists.
SmallVector<func::FuncOp> fallbackMatches;
outerModule.walk([&](func::FuncOp funcOp) {
auto visibility = funcOp->getAttrOfType<StringAttr>("sym_visibility");
if (visibility && visibility.getValue() == "private")
return WalkResult::advance();
StringRef symbolName = funcOp.getSymName();
if (symbolName == target ||
(funcOp->hasAttr(kPTODSLLogicalNameAttrName) &&
getPTODSLLogicalNameOrSymbolName(funcOp) == target)) {
fallbackMatches.push_back(funcOp);
}
return WalkResult::advance();
});| bool isStatic = llvm::all_of(shape, [](int64_t dim) { | ||
| return dim != ShapedType::kDynamic; | ||
| }); |
There was a problem hiding this comment.
Using dim != ShapedType::kDynamic to check for static dimensions is less robust than checking dim >= 0, as any negative dimension size in MLIR represents a dynamic or invalid dimension. Checking dim >= 0 prevents potential issues with other negative dimension representations.
| bool isStatic = llvm::all_of(shape, [](int64_t dim) { | |
| return dim != ShapedType::kDynamic; | |
| }); | |
| bool isStatic = llvm::all_of(shape, [](int64_t dim) { | |
| return dim >= 0; | |
| }); |
Codex Review该评论由 review 机器人自动更新。
SummaryReview failed at stage Findings未生成结构化 findings,因为 review 过程提前失败。 Log Tail |
4ab1ce5 to
912e855
Compare
|
/run a3 |
|
/run a5 |
|
已接收
页面会自动刷新,可以直接看当前阶段、排队情况和最近结果。 |
|
已接收
页面会自动刷新,可以直接看当前阶段、排队情况和最近结果。 |
A3 板测失败
日志尾部 |
A5 板测失败
失败用例
|
A5 板测失败详情:PR #816test_tmov_row_major_1x16_control_a5
test_tmov_col_major_16x1_align_a5
test_dynamic_valid_shape
test_barrier_sync
test_auto_sync_tail_hint
rmsnorm_incore_0
rar_optimization_test
nested_loop_confliect
matmul
decode_projection_incore_0
compensation_test
add_double_dynamic
rems
rem
rope_kv_cache
rmsnorm
qwen3_decode_incore_7
qwen3_decode_incore_6
qwen3_decode_incore_5
qwen3_decode_incore_4
qwen3_decode_incore_2
qwen3_decode_incore_1
qwen3_decode_incore_12
qwen3_decode_incore_11
qwen3_decode_incore_10
post_rmsnorm
vector_example_dag_kernel_mul
vector_example_dag_kernel_add_scalar
vector_example_dag_kernel_add
paged_attention_example_kernel_softmax_prepare
paged_attention_example_kernel_qk_matmul
paged_attention_example_kernel_pv_matmul
paged_attention_example_kernel_online_update
paged_attention_example_kernel_init_inplace
orchestration_example_kernel_mul
orchestration_example_kernel_add_scalar
orchestration_example_kernel_add
prelu
plan_memory_reuse_sequential
plan_memory_peak_exact_capacity
plan_memory_peak_8_overlapping
plan_memory_no_reuse_overlap
plan_memory_nested_loops
plan_memory_loop_no_reuse_outer_live
plan_memory_loop_in_if
plan_memory_if_yield
plan_memory_if_in_loop
plan_memory_fragmentation_two_holes
plan_memory_fragmentation_hole_fit
plan_memory_for_iter_args_yield
plan_memory_bind_tile_alias_liveness
partition_view_verify_valid
partition_view_verify_rank_mismatch_valid
partition5d_dynamic_a5
partition5d_a5
tensor_view_layout_dn
sparse_attn_test_incore_7
decode_swa_test_incore_40
decode_hca_test_incore_54
decode_csa_test_incore_81
attention_swa_test_incore_40
attention_hca_test_incore_54
attention_csa_test_refresh_incore_81
tbroadcast_root_binding
cmps
cmp
|
26c8e46 to
356a943
Compare
|
/run a3 |
|
已接收
页面会自动刷新,可以直接看当前阶段、排队情况和最近结果。 |
|
/run a5 |
|
已接收
页面会自动刷新,可以直接看当前阶段、排队情况和最近结果。 |
A5 板测失败
日志尾部 |
A3 板测失败
日志尾部 |
… output LLVM IR format
356a943 to
68069a0
Compare
There was a problem hiding this comment.
Manual deep review (Codex bot failed on this PR)
Heads-up: the automated codex-review on this PR failed to run (exit 1 -- the upstream review API key group was disabled, HTTP 403), so there are currently no machine-generated findings. I did a manual correctness pass over the genuinely new/changed C++ and the mixed-backend flow, diffed against the true PR base (e37d41e1).
Net: the change is largely additive and the existing single-backend path looks preserved -- I did not find a release-blocking miscompile in the default path. Three concrete items below (1x P2, 2x P3), posted inline.
Inline findings
- P2
PTONormalizeUncoveredTileSections.cpp--inferSegmentKindignoresambiguousTileOps, so an unclassifiable tile op (e.g.pto.tpush/pto.tpop) co-resident with a classifiable op in an uncovered segment is silently wrapped into an inferred section instead of routing toemitSegmentInferenceError(asymmetric withinferWholeFunctionKind). - P3
driver.cpp--isBackendPartitionedContaineris a vacuousreturn true; in mixed mode stray top-level ops would be silently dropped. - P3
PTO.cpp--lookupPeerFuncAcrossContainer''s empty-fallback resolves private siblings, contradicting the visibility filter a few lines above and the driver''s public-only contract.
Non-blocking observations (not posted inline)
- Explicit entry detection. Entries now require
pto.entry/pto.kernel/hacc.entry/pto.aicore. This is the documented intent and in-repo consumers are migrated, but a module that reaches codegen with function definitions and zero recognized entries emits device code with no host stub and no diagnostic. A guarded warning would surface silent non-launch -- but note zero-entry child modules are legitimately valid in mixed-backend mode, so any such check must be scoped to callee-only children to avoid false positives. - Subkernel-call autosync pipe assumption.
ptodsl_subkernel_call_autosyncmodels the call with a single representative pipe (PIPE_V for simd/simt, PIPE_M for cube), andpto-inline-backend-helpersruns after InsertSync without re-running it. That boundary sync is correct only if the helper is pipe-homogeneous w.r.t. each tile argument (first/last effect per operand is on the representative pipe). If multi-pipe subkernel helpers are possible, an assert or a per-operand pipe derivation would close the gap; the test only exercises a PIPE_V-only helper. - Minor: the
debugIROutputRequestedlist inresolveSingleBackendomitsemitVPTOLLVMDialect(present in the other two gates). Harmless today (buildBackendInfobackstops it), but inconsistent.
Spot-checked and looks correct
EmitC PTOAS__TILE_DATA sink-after-TASSIGN (only ever moves a pure .data() read later; per-use re-materialization is dominance-safe), public-helper extern "C" linkage (matches the pre-existing declarations), generalized subview/reinterpret_cast pointer typing (no offset truncation, int64 offsets), vmulscvt lowering (llvm.hivm.vmulscvt.v128f16, operand order matches the masked+part sibling pattern), attachHIVMKernelAnnotations non-entry exclusion, and the GraphSyncSolver pto.section.* transparency change (strictly additive -- section bodies were previously not translated at all; set/wait flags still land inside the section).
| } | ||
|
|
||
| static std::optional<InferredSectionKind> | ||
| inferSegmentKind(const UncoveredTopLevelSegment &segment) { |
There was a problem hiding this comment.
P2 -- inferSegmentKind ignores ambiguousTileOps; unclassifiable tile ops are silently absorbed into an inferred section.
This decides a segment''s kind purely from vectorTileOpCount/cubeTileOpCount and never inspects segment.ambiguousTileOps. That is asymmetric with the whole-function analogue inferWholeFunctionKind, which bails when summary.ambiguousOps is non-empty. The collection side is symmetric -- inspectSegmentOperation pushes unclassifiable tile-like ops into segment.ambiguousTileOps, and emitSegmentInferenceError is even written to report them -- but because inferSegmentKind never returns nullopt on their account, that error branch is unreachable whenever a classifiable op coexists in the same segment.
Consequence: a top-level container op (e.g. scf.for) whose body holds a classifiable tile op of one kind plus an unclassifiable tile op is inferred as that one kind, and normalizeFunction -> wrapUncoveredTopLevelSegment moves the unclassifiable op into that pto.section.* with no diagnostic. After wrapping it counts as "covered" (the residual verifier does not descend into explicit sections), so the misplacement is silent.
pto.tpush/pto.tpop are concrete triggers: they are isTileLikeOp (OpPipeInterface + pto.t prefix) but classifyTileOp returns nullopt -- their $tile is a PTOPipeEntryType, so getPipe() yields PIPE_UNASSIGNED and there are no buffer-typed operands to classify by address space. A loop body doing VEC compute plus a pto.tpush, left uncovered, would be wrapped as section.vector with the push silently absorbed; if its correct ownership were the cube side, that is a wrong-core placement.
Fix (one line, mirroring the function-level guard):
static std::optional<InferredSectionKind>
inferSegmentKind(const UncoveredTopLevelSegment &segment) {
if (!segment.ambiguousTileOps.empty())
return std::nullopt; // route to emitSegmentInferenceError
if (segment.vectorTileOpCount && segment.cubeTileOpCount)
return std::nullopt;
...No lit test currently covers an unclassifiable tile op co-resident with a classifiable one inside an uncovered segment.
| ModuleOp child) { | ||
| for (NamedAttribute attr : outer->getAttrs()) { | ||
| static bool isBackendPartitionedContainer(ModuleOp module) { | ||
| return llvm::all_of(module.getOps<ModuleOp>(), |
There was a problem hiding this comment.
P3 -- isBackendPartitionedContainer validates nothing (always true).
static bool isBackendPartitionedContainer(ModuleOp module) {
return llvm::all_of(module.getOps<ModuleOp>(),
[](ModuleOp) { return true; });
}getOps<ModuleOp>() only yields the child modules and the predicate returns true for each, so this is vacuously true for any module (including one with zero children). It never checks that the outer module contains only child modules. It gates mixed-backend mode in resolveSingleBackend (children.size() > 1 && isBackendPartitionedContainer(module)), and collectChildJobs builds jobs solely from module.getOps<ModuleOp>() -- so any stray top-level op in the outer container (e.g. a func.func directly under it) is silently dropped from the output rather than rejected. Canonical PTODSL/doc IR only nests child modules, so this is latent, but the predicate should enforce the invariant it is named for, e.g.:
Block *body = module.getBody();
return !body->empty() &&
llvm::all_of(body->getOperations(),
[](Operation &op) { return isa<ModuleOp>(op); });and the driver should emit a diagnostic (not silently drop) when an outer container in mixed mode holds non-module top-level ops.
|
|
||
| if (fallbackMatches.size() == 1) | ||
| return fallbackMatches.front(); | ||
| if (fallbackMatches.empty()) { |
There was a problem hiding this comment.
P3 -- cross-child peer fallback resolves private siblings, contradicting the visibility filter a few lines above.
The outerModule.walk just above (the fallbackMatches collection) deliberately skips functions whose sym_visibility is "private", mirroring the driver''s findSiblingSourceFunction, which only accepts public sibling defs. But this empty-fallbackMatches fallback uses SymbolTable::lookupSymbolIn(childModule, target), which returns a symbol regardless of visibility. So when no public peer exists anywhere but a sibling child module has a matching private func.func, ImportReservedBufferOp::verify accepts it, whereas the driver''s mixed-backend assembly later rejects private peers ("unresolved cross-child peer_func reference"). Net effect: the op verifies clean, then fails later in the driver with a less localized message (no wrong code is emitted). The fallback also matches only by raw symbol name, not by pto.ptodsl.logical_name, unlike the walk above. Suggest skipping sym_visibility == "private" funcs here (matching the walk and the driver), or dropping the fallback since a public match would already have been found by the recursive walk.
| } | ||
|
|
||
| static FailureOr<func::FuncOp> | ||
| findSiblingSourceFunction(ModuleOp outer, ModuleOp targetChild, |
There was a problem hiding this comment.
为啥需要 peer func 来着,我有点忘记了,麻烦评论或者注释补充解释下
Summary
This PR brings in a PTODSL/ptoas update centered on flash-attention demos, mixed-kernel or mixed-backend compilation, and subkernel/backend helper lowering fixes.
Compared with
hw-native-sys/PTOAS:main, this branch is ahead by 12 commits and includes both frontend PTODSL improvements and the supporting PTOAS lowering/runtime/test updates.Main Changes
ptodsl/examples/flash_attention/flash_attention_cv_split.pyptodsl/examples/flash_attention/flash_attention_vf_fusion.pyptodsl/examples/flash_attention/ptodsl/examples/fa_dn_ptodsl.pygm_tensorslot support for PTODSL pipe flowPTOAS__TILE_DATAreads afterTASSIGNemitc.verbatimtrailing semicolons in emitted C++ as a workaroundpto.section.*regionsvmulscvtemission in the CANN900 LLVM emitterptodsl/docs/user_guide/01-introduction.mdptodsl/docs/user_guide/03-kernel-entry-and-subkernels.mdtest/lit/lit.cfg.pyScope
Diff summary:
Validation
fa-demo-restackis already pushed to forkZhendong404/PTOAShw-native-sys/PTOAS:mainZhendong404:fa-demo-restack