
CANN/cannbot-skills Cube-Vec Pattern

news 2026/5/9 20:15:57

Cube-to-Vec-to-Cube-to-Vec Pattern

cannbot-skills: CANNBot is a series of agents for improving development efficiency in CANN development; this repository provides reusable Skills modules for it. Project address: https://gitcode.com/cann/cannbot-skills

Generic baseline only. For a2 (b3) kernels, prefer agent/references/patterns/a2-cube-vec-cube-vec.md (and the softmax variant a2-cube-vec-cube-vec-softmax.md), which add delayed-consumer and running-statistic rules specific to a2.

Read this file when one cube stage feeds vec logic, then another cube stage, then a final vec stage. This is the highest-complexity staged pattern currently worth documenting as a dedicated route.

Use this pattern when

there are at least two cube-heavy stages with vec-side logic between them
one tile may be produced in one iteration and consumed in a later iteration
delayed state such as softmax stats or rescale factors must follow the consumer lifetime

Minimal flow

cube stage 1 -> vec stage 1 -> cube stage 2 -> vec stage 2

In practice this often becomes a one-tile lookahead schedule with warmup and drain.

What usually matters most

keeping producer and delayed consumer lifetimes separate
giving delayed stages their own counters
deciding whether the bridge should stay on chip or go through GM workspace
keeping scalar state aligned with the delayed consumer, not with the original producer
validating each stage before trusting the fused version

Stable repository lessons

if stage 2 reuses a stage 1 operand one iteration later, keep that operand on chip when the lifetime fits
if the reuse does not fit cleanly, materialize an explicit GM workspace instead of forcing a fake on-chip story
do not normalize too early when the numerator and denominator streams must both finish first
when the live query side is truly one row, flatten (B, H) into one BH axis and keep rows=1 instead of forcing a wider row tile
for half-input BASES=256 attention on a5, keep the outer 256 tile in L1, use splitk=64 for q @ k.t(), and splitn=64 for p @ v
for fp8 decode attention with external scales, mask invalid tail columns to -inf before rowmax, scale the probability tile only after the float row_sum update, and compensate with a final scale_v / P_SCALE
if the delayed cube consumer wants packed-NZ input, pack the vec-produced tile in UB first, then publish that NZ view into L1
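The deferred-normalization and tail-masking lessons above can be illustrated with a minimal host-side sketch in plain Python. It is not taken from the repository: the function names, the single-row layout, and the padded input are assumptions made for illustration, and fp8 quantization and P_SCALE are not modeled. It assumes at least one valid column in the first tile.

```python
import math

NEG_INF = float("-inf")

def streaming_row_attention(scores, values, tile, valid_len):
    """Single-row softmax(scores) . values, computed tile by tile.

    scores/values are padded to a multiple of `tile`; only the first
    `valid_len` columns are real (hypothetical layout for this sketch).
    """
    row_max = NEG_INF
    row_sum = 0.0   # denominator stream
    acc = 0.0       # numerator stream
    for start in range(0, len(scores), tile):
        # Mask invalid tail columns to -inf BEFORE taking the row max,
        # so padding can never win the max or contribute to the sum.
        tile_scores = [scores[j] if j < valid_len else NEG_INF
                       for j in range(start, start + tile)]
        new_max = max(row_max, max(tile_scores))
        scale = math.exp(row_max - new_max)  # 0.0 on the first tile
        p = [math.exp(x - new_max) for x in tile_scores]
        row_sum = row_sum * scale + sum(p)
        acc = acc * scale + sum(pj * values[start + k] for k, pj in enumerate(p))
        row_max = new_max
    # Normalize once, only after both streams have finished.
    return acc / row_sum

def reference_row_attention(scores, values, valid_len):
    """Direct (non-streaming) computation over the valid columns only."""
    m = max(scores[:valid_len])
    w = [math.exp(x - m) for x in scores[:valid_len]]
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)

# Padding columns deliberately hold large garbage scores.
scores = [0.5, 2.0, -1.0, 3.0, 1.5, 99.0, 99.0, 99.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 0.0, 0.0]
streamed = streaming_row_attention(scores, values, tile=4, valid_len=5)
expected = reference_row_attention(scores, values, valid_len=5)
```

Dividing by row_sum inside the loop would bake in a denominator that later tiles still change, which is why the division sits after the loop.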

One-tile lookahead scheduling detail

The retained MLA kernel (agent/example/kernels/a5/test_mla_entire.py) uses a four-stage on-chip flow:

cube: produce score tile i
vec: update streaming softmax state and cast score tile i to probability tile i
cube: consume delayed probability tile i-1 with the matching value/key tile
vec: rescale and accumulate the delayed output tile i-1

Stable control pattern: for s in range(0, S + TILE, TILE) with:

if s < S: producer side (warmup + steady state)
if s > 0: delayed consumer side (steady state + drain)
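The control pattern above can be traced on the host with a small sketch that only records which stage touches which tile offset. The stage labels (cube1_score, vec1_softmax, cube2_pv, vec2_rescale) and the function name are hypothetical, chosen for this illustration rather than taken from the kernel.

```python
def lookahead_schedule(S, TILE):
    """Return (stage, tile_offset) events in issue order for a
    one-tile lookahead loop. Stage names are illustrative only."""
    events = []
    # One extra iteration (S + TILE) gives the delayed consumer its drain step.
    for s in range(0, S + TILE, TILE):
        if s < S:
            # Producer side: warmup + steady state, works on tile at offset s.
            events.append(("cube1_score", s))
            events.append(("vec1_softmax", s))
        if s > 0:
            # Delayed consumer side: steady state + drain,
            # works on the tile produced one iteration earlier.
            events.append(("cube2_pv", s - TILE))
            events.append(("vec2_rescale", s - TILE))
    return events

events = lookahead_schedule(S=6, TILE=2)
for stage, offset in events:
    print(stage, offset)
```

With S=6 and TILE=2 the first iteration only produces offset 0 (warmup), the middle iterations run both sides, and the last iteration only consumes offset 4 (drain). The same loop bounds also cover a ragged tail, since the last producer offset is always the largest multiple of TILE below S.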

On-chip operand reuse:

if stage 2 must reuse a stage 1 operand one iteration later, keep that operand resident on chip instead of round-tripping to GM
in the MLA kernel, k_nope stays in l1kn and the vec-produced p tile is published directly into l1p
in agent/example/kernels/a5/mha_ifa_nz.py, the vec-produced p tile is first packed with reg_to_ub(...) and then published to l1p as .nz()

Delayed scalar state:

delayed scalar state must follow the consumer lifetime, not the producer lifetime
cache per-tile row_exp_diff / rescale factors in a slot indexed by the delayed consumer counter
keep running row_max, row_sum, and output_acc under a single vec owner to avoid duplicate updates
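To see why the rescale factor has to be read from the consumer's slot, here is a plain-Python simulation of the four-stage flow for a single query row. It is a sketch under assumptions, not the kernel's code: the two-slot buffers, the counter names, and the index_by_consumer flag are hypothetical, and lists stand in for on-chip tiles.

```python
import math

def lookahead_attention_row(scores, values, tile, index_by_consumer=True):
    """One query row; len(scores) is assumed to be a multiple of `tile`."""
    S = len(scores)
    NUM_SLOTS = 2
    p_slots = [None] * NUM_SLOTS     # stands in for the on-chip p buffer
    diff_slots = [0.0] * NUM_SLOTS   # cached row_exp_diff per tile
    row_max, row_sum, output_acc = float("-inf"), 0.0, 0.0
    latest_diff = 0.0                # producer-lifetime value (the trap)
    prod_cnt = cons_cnt = 0
    for s in range(0, S + tile, tile):
        if s < S:
            # Stage 1 (cube) + stage 2 (vec): score tile i -> probability tile i.
            tile_scores = scores[s:s + tile]
            new_max = max(row_max, max(tile_scores))
            latest_diff = math.exp(row_max - new_max)
            p = [math.exp(x - new_max) for x in tile_scores]
            row_sum = row_sum * latest_diff + sum(p)
            row_max = new_max
            slot = prod_cnt % NUM_SLOTS
            p_slots[slot], diff_slots[slot] = p, latest_diff
            prod_cnt += 1
        if s > 0:
            # Stage 3 (cube): delayed p tile i-1 times its value tile.
            slot = cons_cnt % NUM_SLOTS
            pv = sum(pj * vj for pj, vj in zip(p_slots[slot], values[s - tile:s]))
            # Stage 4 (vec): rescale with the factor cached for THIS tile.
            factor = diff_slots[slot] if index_by_consumer else latest_diff
            output_acc = output_acc * factor + pv
            cons_cnt += 1
    return output_acc / row_sum

def reference(scores, values):
    m = max(scores)
    w = [math.exp(x - m) for x in scores]
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)

scores = [0.0, 1.0, 4.0, 2.0, 5.0, 3.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
good = lookahead_attention_row(scores, values, tile=2)
bad = lookahead_attention_row(scores, values, tile=2, index_by_consumer=False)
expected = reference(scores, values)
```

Here good agrees with the direct computation, while bad does not: by the time tile i-1 is consumed, the producer has already overwritten its live factor with the one for tile i. The sketch also keeps row_max, row_sum, and output_acc in one place, mirroring the single-owner rule.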