
CANN/cannbot-skills a5 Device Constraints Guide

news 2026/7/3 23:57:01

a5 Device Constraints

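cannbot-skills: CANNBot is a family of agents for CANN development, intended to improve development efficiency; this repository provides reusable Skills modules for it. Project address: https://gitcode.com/cann/cannbot-skills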

Read this file when writing a kernel targeting a5 (easyasc.a5, device950) and the kernel has any vec-side stage. Do not use it as a substitute for the general kernel-authoring playbook.

Goal

Capture the stable a5 vec-side authoring surface so that:

a2-style direct vec-body patterns are not copied into a5 kernels
vec-side work starts on the supported authoring surfaces
easyasc.a5 import breadth is not mistaken for the stable kernel-writing contract

1. Stable a5 vec-side authoring rule

For a5, vec-side work should be authored through:

@vf() helpers for ordinary vec preprocess / postprocess
micro ops inside @vf() when register-level control is required
sort-family ops such as sort32, mergesort4, and mergesort_2seq when the kernel genuinely needs sort behavior
ub_to_ub for UB-local copies or layout-preserving handoff steps

Do not write generic a2-style vec UB ops directly in the a5 kernel body. If the step is elementwise, row-wise, reduction, normalization, or cast-oriented on a5, move it into @vf() first and only drop to micro when @vf() alone is not enough.

Important note:

the raw easyasc.a5 export surface is wider than this stable authoring rule
treat the authoring rule above as the repository contract for new a5 kernels
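
The authoring rule above can be sketched as a small check over a description of a kernel body. This is plain Python, not the easyasc.a5 API: the `Step` record, its `kind` labels, and `a5_rule_violations` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Hypothetical description of one vec-side step (not an easyasc type)."""
    name: str        # free-form label for the step
    kind: str        # "vec_math", "micro", "sort", "ub_copy", or "other"
    inside_vf: bool  # True if the step lives inside an @vf() helper


def a5_rule_violations(steps):
    """Return the names of steps that break the a5 vec-side authoring rule.

    Ordinary vec math and micro ops must sit inside @vf().
    Sort-family ops and ub_to_ub copies are separate surfaces in the rule,
    so this sketch does not flag them.
    """
    bad = []
    for step in steps:
        if step.kind in ("vec_math", "micro") and not step.inside_vf:
            bad.append(step.name)
    return bad


body = [
    Step("copy tile", "ub_copy", False),
    Step("row normalize", "vec_math", False),  # a2-style: vec math in the body
    Step("fp8 cast", "micro", True),
    Step("top-k sort", "sort", False),
]
print(a5_rule_violations(body))  # ['row normalize']
```

The only flagged step is the vec math written directly in the kernel body, which is the a2-style pattern the rule asks you to move into @vf().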

2. Contrast with a2

a2 does not support @vf()
a2 does not support micro
a2 vec work is written directly in the kernel body on UB tensors
do not mirror an a2 pure-vec kernel body into a5, or an a5 @vf() flow into a2

3. Implications for common topologies

Stable a5 forms:

cube -> vec: GM -> L1 -> L0 -> L0C -> UB -> @vf() -> GM
vec -> cube: GM -> UB -> @vf() -> UB -> L1 -> L0 -> L0C -> GM
vec-only transform: GM -> UB -> @vf() or GM -> UB -> @vf() + micro -> GM
UB-local republish / copy: ub_to_ub may stay in the kernel body if it is truly just the copy step

Practical rule:

if you are about to call ordinary vec math on an a5 UB tensor from the kernel body, stop and move that logic into @vf()

3a. Cube-side matmul dependency reuse rule

When a later cube matmul depends on the result of an earlier cube matmul, check first whether the dependency can stay on the cube-side path:

producer: mmad -> L0C
republish: l0c_to_l1
consumer: later l1_to_l0 -> mmad

Prefer that direct L0C -> L1 route when:

the intermediate value is only needed by a later cube-side matmul
no vec-side transform is required on the intermediate value before reuse

Avoid the detour:

L0C -> UB -> L1

unless the UB hop is semantically required for a real vec-side stage such as:

cast / normalization / elementwise transform in @vf()
a cube -> vec handoff that genuinely changes ownership to the vec lane

Reason:

l0c_to_l1 already gives you the FIX-side republish path for this dependency
the UB detour adds traffic, adds synchronization surface, and makes the kernel easier to overcomplicate without adding capability

Practical debugging hint:

if you find yourself moving a pure matmul dependency through UB only so a later matmul can read it back, stop and re-check whether l0c_to_l1 already expresses the intended dependence
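
The reuse rule in this section reduces to a two-input decision. A minimal sketch in plain Python, assuming the two conditions are known up front; `intermediate_route` and its parameter names are hypothetical and not part of easyasc.a5:

```python
def intermediate_route(needs_vec_transform, changes_owner_to_vec):
    """Pick the route for a matmul result that a later matmul consumes.

    needs_vec_transform:  a cast / normalization / elementwise step in @vf()
                          must run on the intermediate before reuse.
    changes_owner_to_vec: the handoff genuinely moves ownership to the vec lane.
    """
    if needs_vec_transform or changes_owner_to_vec:
        # The UB hop is semantically required for a real vec-side stage.
        return "L0C -> UB -> L1"
    # Pure matmul-to-matmul dependency: stay on the cube side via l0c_to_l1.
    return "L0C -> L1"


print(intermediate_route(False, False))  # L0C -> L1
print(intermediate_route(True, False))   # L0C -> UB -> L1
```

If both inputs are false and the kernel still routes through UB, that is the detour the debugging hint above asks you to re-check.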

4. When to use `micro`

Use micro on a5 when the vec stage needs register-level behavior such as:

explicit fp8 cast control
pack4() / sparse-lane squeeze patterns
explicit mask or cast-config handling
custom register reductions or packing not expressible cleanly as plain Tensor <<= Reg/RegList

Prefer plain @vf() first when it already matches the contract. For example, a Reg/RegList loaded in @vf() and written back with dst[...] <<= regs is usually simpler than dropping to explicit micro cast + pack4.
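
The "plain @vf() first" preference can be written as a small planning helper. This is an illustrative sketch only: the trigger labels below paraphrase the list above, and `MICRO_TRIGGERS` and `vec_surface` are hypothetical names, not easyasc.a5 identifiers.

```python
# Register-level needs that justify dropping to micro (labels are paraphrases).
MICRO_TRIGGERS = frozenset({
    "explicit_fp8_cast",
    "pack4_or_sparse_lane_squeeze",
    "explicit_mask_or_cast_config",
    "custom_register_reduction_or_packing",
})


def vec_surface(needs):
    """Return the authoring surface for a vec stage given its stated needs."""
    if MICRO_TRIGGERS & set(needs):
        return "@vf() + micro"
    return "@vf()"


print(vec_surface(["elementwise_add"]))                       # @vf()
print(vec_surface(["elementwise_add", "explicit_fp8_cast"]))  # @vf() + micro
```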

Another stable case that should stay in @vf():

row-recursive vec kernels where each output row depends on the previous output row
example shape: load one GM chunk as [chunk_size, H], then compute
- y[0, :] = x[0, :]
- y[i, :] = x[i, :] + y[i - 1, :]
on a5, keep that recurrence in @vf() with Reg/RegList slices over the row width
do not reach for cpadd or custom micro just because the math is cumulative; cpadd is pair-wise add, not row-prefix recurrence
only drop to micro if the recurrence itself needs per-lane scan behavior inside one row rather than previous-row carry
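
To make the recurrence concrete, here is a host-side reference in plain Python, useful as a mental model or as a checker for kernel output. It is not easyasc code. `pairwise_add` encodes an assumed reading of "pair-wise add" (summing adjacent element pairs within one row) and is included only to show why it is a different operation from the row-prefix recurrence.

```python
def row_prefix_sum(x):
    """Reference for y[0, :] = x[0, :] and y[i, :] = x[i, :] + y[i - 1, :]."""
    y = []
    for i, row in enumerate(x):
        if i == 0:
            y.append(list(row))
        else:
            prev = y[i - 1]  # carry comes from the previous OUTPUT row
            y.append([a + b for a, b in zip(row, prev)])
    return y


def pairwise_add(row):
    """Assumed meaning of pair-wise add: sum adjacent pairs inside one row."""
    return [row[j] + row[j + 1] for j in range(0, len(row) - 1, 2)]


x = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
]
print(row_prefix_sum(x))
# [[1, 2, 3, 4], [11, 22, 33, 44], [111, 222, 333, 444]]
print(pairwise_add(x[0]))
# [3, 7]
```

The recurrence carries state across rows and keeps the row width H, while the pair-wise form combines neighbors within a row and halves its length, so one cannot stand in for the other.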