# a2 Device Constraints
Read this file when writing a kernel targeting a2 (`easyasc.a2`, device `b3`). Do not read it for a5 kernels; the two architectures differ significantly.
## Goal
Capture all a2-specific differences from a5 so that:
- a5 patterns are not blindly reused on a2
- the correct data path, buffer, and vec model is chosen from the start
## 1. Hardware budgets and missing features
For exact per-device capacities (L0A, L0B, L0C, UB, L1, cube core count, vec sub-blocks per core), see `agent/references/facts-device-runtime.md`. For the a5 features missing on a2 (`@vf`, `Reg`/`RegList`/`MaskReg`, `l0c_to_ub`, `ub_to_l1_nd2nz`, `ub_to_l1_nz`, `micro`, `l0c_to_l1(float)`), see `agent/references/facts-authoring.md`.
Key consequence for tile strategy: a5 tile strategies that fit L0C at 256 KB overflow on a2. Always verify `TILE_M * TILE_N * 4 * 2 <= 128 KB` for a float L0C DBuff, as in the sketch below.
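A minimal host-side sanity check for this budget; the tile sizes here are hypothetical placeholders:

```python
# Hypothetical tile sizes; the budget arithmetic is the point.
TILE_M, TILE_N = 128, 128
L0C_BYTES = 128 * 1024  # a2 float L0C capacity

# 4 bytes per float element, x2 for the double-buffered (DBuff) tile.
assert TILE_M * TILE_N * 4 * 2 <= L0C_BYTES, "tile overflows a2 L0C"
```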
## 2. Cube → vec data path on a2
Since `l0c_to_ub` is absent and `l0c_to_l1(float)` is blocked:

Mandatory path: `L0C → GM workspace → UB`

- Cube: `l0c_to_gm_nz2nd` writes float L0C to a GM workspace buffer (FIX pipe)
- Vec: `gm_to_ub_pad` reads from GM workspace into UB (MTE2 pipe)
CvMutex configuration:

```python
CvMutex(0, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)
```

This differs from the a5 standard (`dst_end_pipe=Pipe.V`) because the vec side's first consumer operation is `gm_to_ub_pad` on MTE2, not a V-pipe compute.
GM workspace design:

- Use `split_workspace(DT.float, [GetCubeNum(), 2, TILE_M, TILE_N])` for pingpong
- Dimension `2` provides double-buffering slots
- Index as `ws[GetCubeIdx(), slot, row_slice, col_slice]`
- Cube writes full `TILE_M` rows; each vec sub-block reads `TILE_M // 2` rows (sketched below)
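A condensed sketch of the whole bridge. Tensor names (`score_l0c`, `ub_score`, `slot`) and the `<<=` datamove lowering are illustrative assumptions, not verbatim from the shipped examples:

```python
# Hedged sketch of the a2 cube -> vec bridge; the CvMutex handshake
# itself is elided. slot is the current pingpong index (0 or 1).
ws = split_workspace(DT.float, [GetCubeNum(), 2, TILE_M, TILE_N])
CvMutex(0, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)

# Cube side: publish the float L0C tile into this core's pingpong slot
# (lowers to l0c_to_gm_nz2nd on the FIX pipe).
ws[GetCubeIdx(), slot, 0:TILE_M, 0:TILE_N] <<= score_l0c

# Vec side: each sub-block reads only its own half of the rows
# (lowers to gm_to_ub_pad on the MTE2 pipe).
sb_row = Var(GetSubBlockIdx() * (TILE_M // 2))
ub_score <<= ws[GetCubeIdx(), slot, sb_row:sb_row + TILE_M // 2, 0:TILE_N]
```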
## 3. Vec → cube data path on a2
Since `ub_to_l1_nd2nz` and `ub_to_l1_nz` are a5-only, a2 cannot publish vec output directly from `UB` to `L1`.

Mandatory path for delayed vec -> cube reuse: `UB -> GM workspace -> L1`

- Vec: `ub_to_gm_pad` writes the vec result into a GM workspace buffer (MTE3 pipe)
- Cube: `gm_to_l1_nd2nz` reloads that workspace tile into `L1` (MTE2 pipe)
- Cube then continues with the normal `l1_to_l0 -> mmad` path
This is the stable bridge for patterns such as:

- stage 1 cube computes score
- vec computes `p_j`
- stage 2 cube consumes delayed `p_j @ v_j`
Recommended synchronization:

```python
VcMutex(1, src_end_pipe=Pipe.MTE3, dst_end_pipe=Pipe.FIX)
```

Why `dst_end_pipe=Pipe.FIX`:

- vec producer truly ends on `ub_to_gm_pad` (MTE3)
- for delayed-consumer kernels, conservative release is simpler and safer
- the cube side may reload from GM on MTE2, then continue through `L1 -> L0 -> MMAD -> FIX`
- freeing only after the cube stage finishes avoids premature workspace reuse
Workspace design mirrors the cube -> vec bridge:

- use `split_workspace(dtype, [GetCubeNum(), 2, TILE_M, TILE_N])`
- vec sub-block 0 writes rows `[0:HALF_M]`
- vec sub-block 1 writes rows `[HALF_M:TILE_M]`
- cube waits on the `VcMutex`, then reloads the full tile from the same slot
Important synchronization fact from the simulator/runtime model:

- cube-side `wait_vec()` succeeds only after **both** vec lanes have produced their tokens
- this makes a full-tile cube reload safe after the two half-row vec writes complete (see the sketch below)
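A condensed sketch of the reverse bridge, with the same caveats as the cube → vec sketch above (`ub_p`, `l1_p`, `slot`, and the `<<=` lowering are illustrative assumptions):

```python
# Hedged sketch of the a2 vec -> cube bridge (UB -> GM workspace -> L1).
ws_p = split_workspace(dtype, [GetCubeNum(), 2, TILE_M, TILE_N])
VcMutex(1, src_end_pipe=Pipe.MTE3, dst_end_pipe=Pipe.FIX)

# Vec side: each sub-block publishes its half of the rows
# (lowers to ub_to_gm_pad on the MTE3 pipe).
sb_row = Var(GetSubBlockIdx() * HALF_M)
ws_p[GetCubeIdx(), slot, sb_row:sb_row + HALF_M, 0:TILE_N] <<= ub_p

# Cube side: wait_vec() returns only after BOTH sub-blocks have written;
# then the full tile reloads into L1 (gm_to_l1_nd2nz, MTE2 pipe) and the
# kernel continues with the normal l1_to_l0 -> mmad path.
l1_p <<= ws_p[GetCubeIdx(), slot, 0:TILE_M, 0:TILE_N]
```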
### 3a. Cube-side matmul dependency reuse rule on a2
The same high-level rule as a5 still applies on a2:
- if one cube matmul only feeds a later cube matmul, prefer keeping that dependency on the cube-side path instead of inventing a UB detour
On a2 the practical boundary is stricter:

- `l0c_to_l1` exists
- but `l0c_to_l1(float)` is blocked on `b*` devices
- and `l0c_to_ub` is absent
So the stable rule is:

- if the intermediate dependency can be republished to `L1` in a supported dtype, prefer `L0C -> L1 -> L0 -> mmad`
- do **not** route a pure cube-side dependency through `UB`
- only fall back to GM workspace when the dependency truly needs a vec-side stage or when the required `L1` destination dtype is unsupported
Practical implication:

- a2 still benefits from the same "avoid unnecessary `L0C -> UB -> L1` thinking" rule
- the difference from a5 is not the existence of `l0c_to_l1`, but the narrower dtype surface; the helper below encodes the routing decision
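A tiny helper, purely illustrative, that encodes the routing rule above:

```python
# Hedged helper encoding the section-3a routing rule; names are illustrative.
def route_cube_dependency(l1_dtype_supported: bool, needs_vec_stage: bool) -> str:
    """Pick a data path for an intermediate that feeds a later cube matmul."""
    if not needs_vec_stage and l1_dtype_supported:
        return "L0C -> L1 -> L0 -> mmad"  # stay on the cube-side path
    return "GM workspace bridge"          # fall back to the section 2/3 patterns
```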
## 4. Sub-block execution model
Each cube core has 2 vec sub-blocks. On a2:

- Each sub-block has its own independent 192 KB UB
- Vec instructions in the kernel body execute on both sub-blocks simultaneously
- Use `GetSubBlockIdx()` to compute different GM offsets for each sub-block
- Each sub-block reads its own `TILE_M // 2` rows from the workspace
Typical pattern:

```python
sb = GetSubBlockIdx()
sb_row = Var(sb * HALF_M)
ub_data <<= workspace[cube_idx, slot, sb_row:sb_row + HALF_M, 0:TILE_N]
# ... vec processing ...
out_row = Var(q_row + sb_row)
output[out_row:out_row + HALF_M, col:col + TILE_N] <<= ub_out
```

## 5. Vec operations available on a2
All vec operations work on UB tensors directly (no Reg intermediate):
| Category | Operations |
|---|---|
| Unary | `exp`, `ln`, `abs`, `rec`, `sqrt`, `rsqrt`, `relu` |
| Binary | `add`, `sub`, `mul`, `div`, `vmax`, `vmin` |
| Scalar | `adds`, `muls`, `vmaxs`, `vmins`, `axpy` |
| Reduction | `cmax`, `cgmax`, `cmin`, `cgmin`, `cadd`, `cgadd`, `cpadd` |
| Broadcast | `dup`, `brcb` |
| Datamove | `gm_to_ub_pad`, `ub_to_gm_pad`, `ub_to_ub` |
| Cast | `cast` |
| Compare | `compare`, `compare_scalar`, `set_cmpmask` |
| Select | `select` |
| Mask | `set_mask`, `reset_mask` |
### 5.1 Bitpacked `uint8` masks and scalar-fill `select`
For a2 vec-side masking that is naturally bitpacked by column prefix, a practical shape is `Tensor(DT.uint8, [HALF_M, TILE_N // 8], Position.UB)`.
Stable rules:

- `compare(...)`/`compare_scalar(...)` produce packed-bit `uint8` masks for this path, and `select(...)` consumes the same packed form
- do **not** treat `select` control as an expanded `[HALF_M, TILE_N]` byte-per-element `0/1` tensor when the hardware contract is bitpacked
- for rowwise prefix or suffix masks, prefer synthesizing mask bytes with `compare_scalar(...)` against a reusable integer column-index tensor instead of populating every mask byte with `SetValueTo(...)`
- `dup` does **not** support `uint8`, but whole-mask initialization can often be done efficiently through a width-compatible integer reinterpret such as `mask.reinterpret(DT.int)`
- `SetValueTo` still writes only the first element of the destination view, so per-byte loops are correct as a fallback but expensive in simulator control flow
- `select(..., SelectMode.TENSOR_SCALAR)` only uses the first scalar value from `src2`
- for score-domain invalidation, a small scalar-fill source such as `Tensor(DT.float, [1, 64], Position.UB)` initialized with `dup(..., neg_large)` is sufficient; `src2` does not need to match the full score tile shape
Use this when:

- the control comes from an explicit packed `uint8` mask tensor
- the fallback branch is a single scalar fill value like `neg_large` or `0.0`
- you want to avoid allocating a full-size fallback tensor just for `select`
- you want to amortize mask construction by building it once and reusing it across many vec operations (sketched below)
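A minimal sketch of the scalar-fill pattern. The argument order of `select(...)`, and whether the packed mask is passed directly or staged via `set_cmpmask`, are assumptions of this sketch; `ub_score` is an illustrative name:

```python
# Hedged sketch: packed uint8 mask + small scalar-fill src2 for select.
mask = Tensor(DT.uint8, [HALF_M, TILE_N // 8], Position.UB)
neg_fill = Tensor(DT.float, [1, 64], Position.UB)
dup(neg_fill, neg_large)  # TENSOR_SCALAR mode reads only the first value

# Build `mask` once (e.g. via the column-index template trick below),
# then reuse it across many vec iterations.

# Argument order assumed here: (dst, src1, src2, mask, mode).
select(ub_score, ub_score, neg_fill, mask, SelectMode.TENSOR_SCALAR)
```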
Diagonal causal reuse rule:

- for repeated diagonal masking on `[HALF_M, TILE_N]` tiles, it is valid to build one packed mask `Tensor(DT.uint8, [HALF_M, TILE_N // 8], Position.UB)` and apply one full-tile `select(...)`
- if the diagonal geometry is stable across iterations, prebuild that packed mask once and reuse it
- for a2 dense backward causal tails, full `128x128` q-tiles have stable subblock geometry and can reuse one static packed diagonal mask per `GetSubBlockIdx()`
- tail `M` tiles cannot blindly reuse that full-tile mask when the kernel splits rows with `half_rows = CeilDiv(valid_m, 2)`, because the second subblock `row_begin` shifts away from the fixed `64`
- in that case, keep the same packed-mask + one-`select(...)` pattern, but rebuild the packed mask dynamically only for the tail tile
Column-index template trick:

- when the packed mask compares against contiguous column ids, build a reusable integer template once, e.g. `Tensor(DT.int, [1, TILE_N], Position.UB)`
- a compact a2 pattern is to reinterpret that tensor as `int64` and write two `int32` column ids per store while initializing `[0, 1, 2, ..., TILE_N - 1]`
- later `compare_scalar(mask[row:row + 1, 0:TILE_N // 8], col_idx, valid_cols, CompareMode.LT)` synthesizes the whole row mask without scalar-per-byte loops (see the sketch below)
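A sketch of the template construction. It assumes `reinterpret(DT.int64)` mirrors the documented `reinterpret(DT.int)` API and that the two packed `int32` ids are laid out little-endian; both are assumptions:

```python
# Hedged sketch: build the column-index template once, reuse it for every
# rowwise prefix mask.
col_idx = Tensor(DT.int, [1, TILE_N], Position.UB)
col64 = col_idx.reinterpret(DT.int64)  # two int32 column ids per int64 slot

for i in range(TILE_N // 2):
    # SetValueTo writes only the first element of the destination view,
    # so one store fills one int64 slot with ids (2*i, 2*i + 1).
    SetValueTo(col64[0, i:i + 1], ((2 * i + 1) << 32) | (2 * i))

# One compare synthesizes the whole packed row mask: bit j = (j < valid_cols).
compare_scalar(mask[row:row + 1, 0:TILE_N // 8], col_idx, valid_cols, CompareMode.LT)
```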
## 6. UB initialization with `dup`
On a2, UB contents are undefined at kernel entry. There is no zero-initialization guarantee. Operations like `muls(ub, ub, 0.0)` are unreliable on uninitialized buffers because `0.0 × NaN = NaN`.

Use `dup(tensor, scalar_value)` to fill a UB tensor with a known value.

For the current validated a2 softmax kernels, a sufficiently negative finite sentinel such as `neg_large = -1.0e30` is the default way to materialize score-domain invalidation and running-max identity state. Treat literal `float("-inf")` as simulator-convenient but hardware-fragile.

`dup` signature: `dup(dst: Tensor, value: Union[int, float, Var], ...)`

The `dup` operation uses the same stride inference as other vec operations. It fills blocks and repeats according to `infer_repeat(dst)` and the auto-inferred strides.
Coverage analysis for common shapes (float, C0=8):
| Shape | repeat | blk_stride | Elements covered | Buffer size | Complete? |
|---|---|---|---|---|---|
| `[64, 1]` | 1 | 1 | 1×8×8 = 64 | 64 | ✓ |
| `[64, 8]` | 8 | 0 | 8×1×8 = 64 | 512 | ✗ (only 1/8) |
| `[64, 64]` | 64 | 1 | 64×8×8 = 4096 | 4096 | ✓ |
| `[64, 128]` | 128 | 1 | 128×8×8 = 8192 | 8192 | ✓ |
Warning: `dup` on `[M, 8]` (broadcast-format) only fills 64 out of 512 elements. This is acceptable if the tensor is only consumed via `blk_stride=0` operations (like `sub`), which also only read those 64 positions. But it is incorrect if you later attempt a full element-wise operation over the entire 512-element buffer.
Practical rule: initialize in the natural computation format (`[M, 1]` for scalars, `[M, N]` for data), not in the broadcast format, as in the sketch below.
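A minimal sketch of that rule for running-max state; `ub_rmax_s` and `HALF_M` are illustrative names:

```python
# Hedged sketch: initialize running-max state in its natural [M, 1]
# format with a finite sentinel, not in the [M, 8] broadcast format.
neg_large = -1.0e30
ub_rmax_s = Tensor(DT.float, [HALF_M, 1], Position.UB)
dup(ub_rmax_s, neg_large)  # [64, 1] is a fully covered dup shape (see table)
```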
### Placement within `auto_sync`
A `dup` placed inside the outer loop but before the inner loop (e.g. for per-M-tile reinitialization) is safe. It generates an extra V→MTE3 autosync event pair because auto_sync sees the V-pipe `dup` and the inner loop's MTE3 `ub_to_gm_pad` as a producer-consumer pair within the same scope.

This extra event is harmless: `dup` completes before the inner loop starts, so the MTE3 event is already satisfied when the first store executes. The inner loop's own V→MTE3 events for the actual `exp → cast → store` flow are managed separately with different sync-key groups.
```python
with auto_sync():
    for gmt in range(mt_begin, mt_end):
        # ... variable declarations ...
        dup(ub_rmax_s, neg_large)  # V-pipe, safe here
        for nt in range(0, tiles_n):
            # ... cube + vec tile processing ...
```

## 7. Scalar computation
- `scalar_sqrt(Var)` requires `Var.dtype == Datatype.float`; integer Vars will raise a TypeError
- For `1/sqrt(D)`, prefer passing the precomputed float value as a kernel parameter (see the sketch below)
- `muls(ub, ub, scale_var)` accepts `Var(float)` as the scalar argument
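A minimal sketch of the precomputed-scale pattern; `D` and `ub_score` are illustrative names:

```python
import math

# Host side (hedged sketch): precompute the softmax scale once instead of
# calling scalar_sqrt on an integer Var inside the kernel.
D = 128                     # illustrative head dimension
scale = 1.0 / math.sqrt(D)  # pass this float in as a kernel parameter

# Kernel side: muls accepts a Var(float) as its scalar argument.
scale_var = Var(scale)
muls(ub_score, ub_score, scale_var)
```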
## 8. Copying scalar-format UB state
For row-scalar buffers on a2, the natural vec format is `[M, 1]`. This format is safe for vec ops like `vmax`, `add`, `sub`, and `exp`, but it is **not** a safe format for `ub_to_ub` snapshots.
Why:

- `ub_to_ub` infers burst length in units of `C0` blocks
- for float `[64, 1]`, that implies copying one full 8-element block per row
- this silently corrupts or repeats scalar row state instead of copying one scalar per row
Practical rule:

- if you need to snapshot `[M, 1]` running state such as `prev_row_max`, copy it with a vec binary op and an explicit zero buffer:

```python
dup(ub_zero_s, 0.0)
add(tmp_scalar_buf, ub_rmax_s, ub_zero_s)
```

- then update or transform `tmp_scalar_buf` with more vec ops
- do not use `ub_to_ub` as a generic copy for `[M, 1]` scalar state
## Files to study
- `agent/example/kernels/a2/qk_matmul_batched.py` — cube-only a2 baseline
- `agent/example/kernels/a2/flash_attn_score.py` — cube → vec with GM workspace bridge
- `agent/example/kernels/a2/flash_attn_score_iter.py` — running max with `dup` initialization and `[M, 1]` `vmax`
- `agent/example/kernels/a2/flash_attn_unnorm.py` — delayed `exp` diff + final numerator accumulation on a2
- `agent/example/kernels/a2/flash_attn_full.py` — delayed numerator accumulation plus running sum/final divide on a2
- `agent/example/kernels/a2/flash_attn_full_pj_hif8_causal.py` — rowwise diagonal-tile causal masking plus `active_tiles_n` skip on a2, now with shared vec-side `DBuff` scratch for stage-1/2 overlap
- `agent/example/kernels/a2/flash_attn_full_pj_half_block32_causal.py` — block-32 causal masking plus `active_tiles_n` skip on a2, also using the shared vec-side `DBuff` scratch lineage
- `easyasc/stub_functions/vec/dupbrcb.py` — `dup` stub (validates dst is UB, infers repeat)
- `easyasc/simulator_v2/ops/vec/v.py` and `easyasc/simulator_v2/ops/vec/_legacy_vpipe.py` — current vec runtime path for `dup`-style execution
