Skip to content

[FEATURE]add cube tile ops(TLOAD>MAT, TSTORE>ACC, TSTORE.MAT, TSTORE_FP)#814

Open
pbbb205 wants to merge 12 commits into
hw-native-sys:mainfrom
pbbb205:main
Open

[FEATURE]add cube tile ops(TLOAD>MAT, TSTORE>ACC, TSTORE.MAT, TSTORE_FP)#814
pbbb205 wants to merge 12 commits into
hw-native-sys:mainfrom
pbbb205:main

Conversation

@pbbb205

@pbbb205 pbbb205 commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

add cube tile ops(TLOAD>MAT, TSTORE>ACC, TSTORE.MAT, TSTORE_FP)

@reedhecre

reedhecre commented Jun 13, 2026

Copy link
Copy Markdown

Codex Review

该评论由 review 机器人自动更新。

Summary

发现 4 个实质问题,主要是 TSTORE.ACC/TLOAD.MAT 的模板选择错误,以及 TSTORE_FP/TLOAD.MAT 的合同不一致。

Findings

  1. P1 TSTORE.ACC 目的视图永远按 NZ2ND 选模板,DN/NZ 会被静默错降 lib/TileOps/tstore_template.py:303

PartitionTensorView 的 operand spec 只有 shape/strides,没有 layout config,所以这里的 dst.config 对真实的 pto.tstore 目的视图始终是 None。结果是 _constraint_tstore_acc_nz2nd() 走了第 304-305 行的默认 true,而 _constraint_tstore_acc_nz2dn()/_constraint_tstore_acc_nz2nz() 都会失败。我用 expand_helper 复现了一个 DN 视图(shape=[32,16], strides=[16,1])的选择,实际命中的是 tstore_acc_to_gm_nz2nd,生成的 VPTO 也是 pto.mte_l0c_gm ..., nz2nd。这会把 DN/NZ 输出静默写成 ND;同时本 PR 新增的 DN/NZ lit 只检查了 copy_matrix_cc_to_gm,CI 也抓不住这个错降。

  1. P1 TLOAD.MAT 在方阵上会同时匹配 ND2NZ 和 DN2NZ,直接触发模板 tie lib/TileOps/tload_template.py:359

这两个约束唯一的区分条件只是比较 src.shape[4]dst.valid_shape[1]/dst.valid_shape[0]。一旦目标 tile 是方阵(最常见的 16x16 就是),两个条件都会成立,两个 descriptor 又同优先级,因此 select_kernel() 会直接报 multiple highest-priority kernels。我用 expand_helper 复现了 16x16 的 GM->MAT load,结果同时命中 tload_gm_to_mat_nd2nztload_gm_to_mat_dn2nz 并失败。也就是说当前实现会把常见合法输入编译挂掉。

  1. P2 TSTORE_FP 宣称支持 UB scaling tile,但语义层仍然硬性拒绝 UB 指针 tilelang-dsl/python/tilelang_dsl/semantic.py:4653

PR 在模板约束和注释里都把 fp 放宽到了 SCALING/UB,但这里对 f32_f16/f32_bf16 payload 仍然调用 _require_fixpipe_vector_payload(),而它内部只接受 MemorySpace.SCALING。我把 tstore_fp_acc_to_gm 用一个 UB fp tile 具体 materialize 了一次,mlir_text() 直接抛出 TypeError: ... pre_quant payload requires MemorySpace.SCALING pointers。所以当前代码里宣称可用的 UB 形式实际上完全不可用。

  1. P2 TLOAD.MAT 的整数 dtype 合同不一致:约束接受,但没有任何可选模板 lib/TileOps/tload_template.py:401

_constraint_tload_mat_base() 明确把 i8/i16/i32 列进了支持集合,但两个实际注册的 cube template 只声明了 (f16,f16)(bf16,bf16)(f32,f32)。结果是 A5 上合法的整数 MAT load 能通过 IR 侧校验,却在模板选择阶段报 no registered kernel after constraint evaluation;我用 (i32, i32) 复现过这个问题。这里要么需要把约束/合同收窄到只支持浮点,要么把整数签名和模板补齐。

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request translates comments to English and introduces several new Cube Matrix templates for TLOAD and TSTORE operations (such as GM to MAT, ACC to GM, MAT to GM, and ACC+FP to GM) along with constraint checks and comprehensive LIT tests. The review feedback highlights a critical correctness bug in template_tstore_mat_to_gm where the intermediate UB copy step is omitted, a potential AttributeError in template_tstore_fp_acc_to_gm from using .element_type instead of .dtype on a PartitionTensorView, incomplete layout constraint checks for NZ formats, and several unused variables across the templates.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread lib/TileOps/tstore_template.py Outdated
Comment on lines +540 to +547
mat_ptr = src.as_ptr()
gm_ptr = dst.as_ptr()

# NOTE: There is no direct MAT -> GM DMA path. Data is written to UB via
# mte_l1_ub, then forwarded to GM via copy_ubuf_to_gm.
len_burst = k * elem_bytes

pto.mte_l1_ub(mat_ptr, gm_ptr, len_burst, nburst=(m, 0, 0))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There is a critical correctness bug in template_tstore_mat_to_gm. The comment states that data is written to UB via mte_l1_ub and then forwarded to GM via copy_ubuf_to_gm. However, the code calls mte_l1_ub with gm_ptr as the destination (which is a GM pointer, not a UB pointer), and completely omits the copy_ubuf_to_gm step. This will lead to compilation failures or hardware faults. Please allocate a temporary UB tile, copy from MAT to UB, and then copy from UB to GM.

Comment thread lib/TileOps/tstore_template.py Outdated
Comment on lines +624 to +629
dst_dtype = dst.element_type

if pto.constexpr(dst_dtype == pto.bf16):
quant_mode = "f32_bf16"
else:
quant_mode = "f32_f16"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In template_tstore_fp_acc_to_gm, dst is a PartitionTensorView. Calling dst.element_type is highly likely to raise an AttributeError because PartitionTensorView objects use .dtype to represent their data type (as seen in the constraints), whereas Tile objects use .element_type. Please use dst.dtype instead.

Suggested change
dst_dtype = dst.element_type
if pto.constexpr(dst_dtype == pto.bf16):
quant_mode = "f32_bf16"
else:
quant_mode = "f32_f16"
dst_dtype = dst.dtype
if pto.constexpr(dst_dtype == pto.bf16):
quant_mode = 'f32_bf16'
else:
quant_mode = 'f32_f16'

Comment on lines +311 to +319
def _constraint_tload_mat_base(src, dst) -> bool:
"""TLOAD.MAT base constraint check"""
# dst must be MemorySpace.MAT
dst_space = dst.memory_space
if dst_space is None:
return False
dst_space_value = dst_space.value if hasattr(dst_space, "value") else dst_space
if dst_space_value not in {"mat", "MAT"}:
return False

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In _constraint_tload_mat_base, there is no check to ensure that the source src is in the GM memory space. Since TLOAD.MAT is specifically for loading from GM to L1 MAT, we should verify that src is indeed in GM to prevent incorrect matching of other memory spaces.

def _constraint_tload_mat_base(src, dst) -> bool:
    \"\"\"TLOAD.MAT base constraint check\"\"\"
    # src must be GM
    src_space = src.memory_space
    src_space_value = 'gm' if src_space is None else (src_space.value if hasattr(src_space, 'value') else src_space)
    if src_space_value not in {'gm', 'GM'}:
        return False
    # dst must be MemorySpace.MAT
    dst_space = dst.memory_space
    if dst_space is None:
        return False
    dst_space_value = dst_space.value if hasattr(dst_space, 'value') else dst_space
    if dst_space_value not in {'mat', 'MAT'}:
        return False

Comment on lines +334 to +349
def _constraint_tload_mat_nd2nz(src, dst) -> bool:
"""TLOAD.MAT ND2NZ fractal load constraint"""
if not _constraint_tload_mat_base(src, dst):
return False
# dst layout must be col_major (NZ format)
config = dst.config
if config is None:
return False
b_layout = config.b_layout
if b_layout is None:
return False
b_layout_value = b_layout.value if hasattr(b_layout, "value") else b_layout
# COL_MAJOR corresponds to NZ format
if b_layout_value not in {"col_major", "COL_MAJOR"}:
return False
return True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In _constraint_tload_mat_nd2nz, only b_layout is checked to be col_major. However, NZ format is defined by both b_layout=col_major and s_layout=row_major. We should check both to ensure strict layout matching.

def _constraint_tload_mat_nd2nz(src, dst) -> bool:
    \"\"\"TLOAD.MAT ND2NZ fractal load constraint\"\"\"
    if not _constraint_tload_mat_base(src, dst):
        return False
    # dst layout must be col_major (NZ format)
    config = dst.config
    if config is None:
        return False
    b_layout = config.b_layout
    s_layout = config.s_layout
    if b_layout is None or s_layout is None:
        return False
    b_layout_value = b_layout.value if hasattr(b_layout, 'value') else b_layout
    s_layout_value = s_layout.value if hasattr(s_layout, 'value') else s_layout
    # COL_MAJOR + ROW_MAJOR corresponds to NZ format
    if b_layout_value not in {'col_major', 'COL_MAJOR'} or s_layout_value not in {'row_major', 'ROW_MAJOR'}:
        return False
    return True

Comment on lines +352 to +365
def _constraint_tload_mat_dn2nz(src, dst) -> bool:
"""TLOAD.MAT DN2NZ fractal load constraint"""
if not _constraint_tload_mat_base(src, dst):
return False
config = dst.config
if config is None:
return False
b_layout = config.b_layout
if b_layout is None:
return False
b_layout_value = b_layout.value if hasattr(b_layout, "value") else b_layout
if b_layout_value not in {"col_major", "COL_MAJOR"}:
return False
return True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In _constraint_tload_mat_dn2nz, only b_layout is checked to be col_major. However, NZ format is defined by both b_layout=col_major and s_layout=row_major. We should check both to ensure strict layout matching.

def _constraint_tload_mat_dn2nz(src, dst) -> bool:
    \"\"\"TLOAD.MAT DN2NZ fractal load constraint\"\"\"
    if not _constraint_tload_mat_base(src, dst):
        return False
    config = dst.config
    if config is None:
        return False
    b_layout = config.b_layout
    s_layout = config.s_layout
    if b_layout is None or s_layout is None:
        return False
    b_layout_value = b_layout.value if hasattr(b_layout, 'value') else b_layout
    s_layout_value = s_layout.value if hasattr(s_layout, 'value') else s_layout
    if b_layout_value not in {'col_major', 'COL_MAJOR'} or s_layout_value not in {'row_major', 'ROW_MAJOR'}:
        return False
    return True

Comment thread lib/TileOps/tload_template.py Outdated
Comment on lines +391 to +393
m, k = dst.valid_shape
dtype = dst.element_type
elem_bytes = pto.bytewidth(dtype)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The variables dtype and elem_bytes are defined but never used in template_tload_gm_to_mat_nd2nz. They should be removed to keep the code clean.

Suggested change
m, k = dst.valid_shape
dtype = dst.element_type
elem_bytes = pto.bytewidth(dtype)
m, k = dst.valid_shape

Comment thread lib/TileOps/tload_template.py Outdated
Comment on lines +447 to +448
m, k = dst.valid_shape
dtype = dst.element_type

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The variable dtype is defined but never used in template_tload_gm_to_mat_dn2nz. It should be removed to keep the code clean.

Suggested change
m, k = dst.valid_shape
dtype = dst.element_type
m, k = dst.valid_shape

Comment on lines +332 to +347
def _constraint_tstore_acc_nz2nz(src, dst) -> bool:
"""TSTORE.ACC NZ2NZ constraint"""
if not _constraint_tstore_acc_base(src, dst):
return False
# dst must be NZ layout (fractal)
config = dst.config
if config is None:
return False
# Check for fractal or special NZ layout marking
s_layout = config.s_layout
if s_layout is None:
return False
s_layout_value = s_layout.value if hasattr(s_layout, "value") else s_layout
if s_layout_value not in {"row_major", "ROW_MAJOR"}:
return False
return True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In _constraint_tstore_acc_nz2nz, only s_layout is checked to be row_major. However, NZ format is defined by both b_layout=col_major and s_layout=row_major. We should check both to ensure strict layout matching.

def _constraint_tstore_acc_nz2nz(src, dst) -> bool:
    \"\"\"TSTORE.ACC NZ2NZ constraint\"\"\"
    if not _constraint_tstore_acc_base(src, dst):
        return False
    # dst must be NZ layout (fractal)
    config = dst.config
    if config is None:
        return False
    b_layout = config.b_layout
    s_layout = config.s_layout
    if b_layout is None or s_layout is None:
        return False
    b_layout_value = b_layout.value if hasattr(b_layout, 'value') else b_layout
    s_layout_value = s_layout.value if hasattr(s_layout, 'value') else s_layout
    if b_layout_value not in {'col_major', 'COL_MAJOR'} or s_layout_value not in {'row_major', 'ROW_MAJOR'}:
        return False
    return True

Comment thread lib/TileOps/tstore_template.py Outdated
Comment on lines +375 to +376
m, n = src.valid_shape
dtype = src.element_type

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The variable dtype is defined but never used in template_tstore_acc_to_gm_nz2nd. It should be removed to keep the code clean.

Suggested change
m, n = src.valid_shape
dtype = src.element_type
m, n = src.valid_shape

Comment thread test/lit/pto/tload_mat_dn2nz_emitc.pto Outdated
}

// CHECK-LABEL: tload_mat_dn2nz_f16
// CHECK: TLOAD No newline at end of file

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

要测的不是pto -> emitc这条路径,是pto -> expand tileop -> vpto这条路径,需要设置--pto-arch=a5 --pto-backend=vpto
CHECK的目标是输出中包含对应的MTE指令

// INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
// See LICENSE in the root of the software repository for the full text of the License.

// RUN: not ptoas --pto-arch=a3 %s 2>&1 | FileCheck %s

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--pto-arch=a3不是目标架构

# See LICENSE in the root of the software repository for the full text of the License.

"""`pto.tload` 的 TileLang DSL 模板"""
"""TileLang DSL template for `pto.tload`"""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

缺少端到端测试的st用例,可以构造一个GM ---tload---> MAT ---tstore---> GM的st用例进行验证,否则无法证明功能正确性

pbbb205 and others added 11 commits June 15, 2026 10:18
Reverts: 1df3eba, c31ea7c, fbd6071
These commits broke CI — TStoreFpOpVerify.cpp was written as a
string literal (not valid C++), tload/tstore templates had
semantic regressions, and test files were misstructured.

Co-Authored-By: Claude <noreply@anthropic.com>
Root cause: tstore_template.py had `raise NotImplementedError` on line
543 which crashed the entire module import in TileLang DSL v1 (arbitrary
external calls are not supported). This made ALL templates in the file
unavailable, cascading to tstore, tload, tstore_fp, and tmrgsort failures.

Changes:
- tstore_template.py: replace `raise NotImplementedError` with `pass`
  (template is unreachable anyway due to dtypes=[])
- tstore_template.py: replace `dst.dtype` with `fp.element_type` in
  tstore_fp template (PartitionTensorView .dtype not supported in DSL v1)
- PTO.cpp: add ACC element type (f32/i32) check to A5 TStoreFPOp verifier
  (tstore_fp_invalid_dtype test expects verifier to reject f16 ACC)
- PTOOps.td: add OpPipeInterface + getPipe() to TStoreFPOp
  (returns PIPE_FIX, same pipe used by other ACC-store variants)

Co-Authored-By: Claude <noreply@anthropic.com>
…ule import

The _freeze_dtypes function requires at least one signature tuple.
dtypes=[] causes: "dtypes must contain at least one signature tuple",
which crashes the entire tstore_template.py import, losing ALL templates
in the file (including valid tstore_acc, tstore_fp templates).

Replace the disabled template + constraint + decorator with a comment
explaining why it's omitted entirely.

Co-Authored-By: Claude <noreply@anthropic.com>
Need to see stderr output from ptoas to diagnose why all 7 cube ckernel
template tests fail with 'stdin is empty'. Will restore 2>/dev/null after
fixing the actual issue.

Co-Authored-By: Claude <noreply@anthropic.com>
mode_value = self._require_string_expr(mode, f"{context} mode")
if mode_value in {"f32_f16", "f32_bf16"}:
self._require_fixpipe_scalar_payload(payload, f"{context} payload")
# f32_f16 / f32_bf16 pre_quant can be either a scalar (for inline

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不要改这里,底层vpto指令本身就不接受pointer输入

// Test TSTORE.ACC NZ2DN - tile-op expand to vpto mte_l0c_gm with nz2dn mode
// Pipeline: ExpandTileOp -> InlineLibCall -> FoldTileBufIntrinsics

// RUN: ptoas --pto-arch=a5 --pto-backend=vpto --emit-vpto --enable-tile-op-expand %s -o - | FileCheck %s

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lit用例名字可以改一下,现在和emitc没关系了

@Zhendong404 Zhendong404 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

需要把PTO-ISA仓中与本PR相关的tload/tstore st用例迁移过来

# always matches the dst dtype in the tstore_fp signature.
fp_dtype = fp.element_type

if pto.constexpr(fp_dtype == pto.bf16):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tstore_fp不可能选择f32_bf16/f32_f16这两个mode

template <typename TileData, typename GlobalData, typename FpTileData, AtomicType atomicType = AtomicType::AtomicNone,
          ReluPreMode reluPreMode = ReluPreMode::NoRelu, STPhase Phase = STPhase::Unspecified>
PTO_INTERNAL void TSTORE_IMPL(GlobalData &dst, TileData &src, FpTileData &fp)
{
    static_assert(TileData::Loc == pto::TileType::Acc, "Source TileType only suport Acc!");
    using DstT = typename GlobalData::RawDType;
    using L0cT = typename TileData::DType;
    CheckStaticAcc<TileData, GlobalData, true>();
    if constexpr (AtomicType::AtomicAdd == atomicType) {
        SetAtomicAdd<DstT>();
    }
    constexpr QuantMode_t quantPre = GetVectorPreQuantModeGm<L0cT, DstT>();
    TStoreAccFp<GlobalData, TileData, FpTileData, quantPre, reluPreMode>(
        dst.data(), src.data(), fp.data(), dst.GetShape(pto::GlobalTensorDim::DIM_0),
        dst.GetShape(pto::GlobalTensorDim::DIM_1), dst.GetShape(pto::GlobalTensorDim::DIM_2),
        dst.GetShape(pto::GlobalTensorDim::DIM_3), dst.GetShape(pto::GlobalTensorDim::DIM_4),
        dst.GetStride(pto::GlobalTensorDim::DIM_0), dst.GetStride(pto::GlobalTensorDim::DIM_1),
        dst.GetStride(pto::GlobalTensorDim::DIM_2), dst.GetStride(pto::GlobalTensorDim::DIM_3),
        dst.GetStride(pto::GlobalTensorDim::DIM_4), src.GetValidRow(), src.GetValidCol());
    if constexpr (atomicType == AtomicType::AtomicAdd) {
        set_atomic_none();
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants