
a2 Device Constraints


Read this file when writing a kernel targeting a2 (`easyasc.a2`, device `b3`). Do not read it for a5 kernels — the two architectures differ significantly.

Goal

Capture all a2-specific differences from a5 so that:

  • a5 patterns are not blindly reused on a2
  • the correct data path, buffer, and vec model is chosen from the start

1. Hardware budgets and missing features

For exact per-device capacities (L0A, L0B, L0C, UB, L1, cube core count, vec sub-blocks per core) see `agent/references/facts-device-runtime.md`. For the a5 features missing on a2 (`@vf`, `Reg`/`RegList`/`MaskReg`, `l0c_to_ub`, `ub_to_l1_nd2nz`, `ub_to_l1_nz`, `micro`, `l0c_to_l1(float)`), see `agent/references/facts-authoring.md`.

Key consequence for tile strategy: a5 tile strategies that fit L0C at 256 KB overflow on a2. Always verify `TILE_M * TILE_N * 4 * 2 <= 128 KB` for the float L0C DBuff.
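As a quick host-side sanity check, a minimal sketch (the helper name is hypothetical; the capacities and the `* 4 * 2` factor follow the budget above):

```python
# Hypothetical host-side helper: verify a tile strategy fits a2's 128 KB
# L0C before committing to it. TILE_M/TILE_N are the float matmul tile dims.
L0C_BYTES_A2 = 128 * 1024  # a2 L0C capacity (half of a5's 256 KB)

def l0c_dbuff_fits(tile_m: int, tile_n: int, elem_bytes: int = 4) -> bool:
    # double-buffered float L0C footprint: TILE_M * TILE_N * 4 * 2
    return tile_m * tile_n * elem_bytes * 2 <= L0C_BYTES_A2

assert l0c_dbuff_fits(128, 128)      # 128*128*4*2 = 128 KB, exactly fits
assert not l0c_dbuff_fits(128, 256)  # an a5-sized tile overflows a2
```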

2. Cube → vec data path on a2

Since `l0c_to_ub` is absent and `l0c_to_l1(float)` is blocked:

Mandatory path: L0C → GM workspace → UB

  • Cube: `l0c_to_gm_nz2nd` writes float L0C to a GM workspace buffer (FIX pipe)
  • Vec: `gm_to_ub_pad` reads from GM workspace into UB (MTE2 pipe)

CvMutex configuration:

```python
CvMutex(0, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)
```

This differs from the a5 standard (`dst_end_pipe=Pipe.V`) because the vec side's first consumer operation is `gm_to_ub_pad` on MTE2, not a V-pipe compute.

GM workspace design:

  • Use `split_workspace(DT.float, [GetCubeNum(), 2, TILE_M, TILE_N])` for pingpong
  • The size-2 dimension provides the double-buffering slots
  • Index as `ws[GetCubeIdx(), slot, row_slice, col_slice]`
  • Cube writes the full `TILE_M` rows; each vec sub-block reads `TILE_M // 2` rows
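Putting the pieces together, a minimal sketch of the bridge. `split_workspace`, the `CvMutex` config, and the `<<=` read form follow this file; the `l0c_to_gm_nz2nd` argument order and the placeholder names (`nt`, `l0c_tile`, `ub_data`) are assumptions:

```python
# Sketch: L0C -> GM workspace -> UB bridge with pingpong slots (a2).
ws = split_workspace(DT.float, [GetCubeNum(), 2, TILE_M, TILE_N])
cv = CvMutex(0, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)

slot = Var(nt % 2)  # pingpong slot selected by the inner-loop index
# Cube side (FIX pipe): publish the float L0C tile to the GM workspace.
l0c_to_gm_nz2nd(ws[GetCubeIdx(), slot, 0:TILE_M, 0:TILE_N], l0c_tile)
# ... CvMutex handshake: cube signals FIX done, vec waits before MTE2 ...
# Vec side (MTE2 pipe): each sub-block pulls its TILE_M // 2 rows into UB.
sb_row = Var(GetSubBlockIdx() * (TILE_M // 2))
ub_data <<= ws[GetCubeIdx(), slot, sb_row:sb_row + TILE_M // 2, 0:TILE_N]
```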

3. Vec → cube data path on a2

Since `ub_to_l1_nd2nz` and `ub_to_l1_nz` are a5-only, a2 cannot publish vec output directly from UB to L1.

Mandatory path for delayed vec → cube reuse: UB → GM workspace → L1

  • Vec: `ub_to_gm_pad` writes the vec result into a GM workspace buffer (MTE3 pipe)
  • Cube: `gm_to_l1_nd2nz` reloads that workspace tile into L1 (MTE2 pipe)
  • Cube then continues with the normal `l1_to_l0` → `mmad` path

This is the stable bridge for patterns such as:

  • stage-1 cube computes the score
  • vec computes `p_j`
  • stage-2 cube consumes the delayed `p_j @ v_j`

Recommended synchronization:

```python
VcMutex(1, src_end_pipe=Pipe.MTE3, dst_end_pipe=Pipe.FIX)
```

Why `dst_end_pipe=Pipe.FIX`:

  • the vec producer truly ends on `ub_to_gm_pad` (MTE3)
  • for delayed-consumer kernels, conservative release is simpler and safer
  • the cube side may reload from GM on MTE2, then continue through L1 → L0 → MMAD → FIX
  • freeing only after the cube stage finishes avoids premature workspace reuse

Workspace design mirrors the cube -> vec bridge:

  • use `split_workspace(dtype, [GetCubeNum(), 2, TILE_M, TILE_N])`
  • vec sub-block 0 writes rows `[0:HALF_M]`
  • vec sub-block 1 writes rows `[HALF_M:TILE_M]`
  • cube waits on the `VcMutex`, then reloads the full tile from the same slot

Important synchronization fact from the simulator/runtime model:

  • cube-side `wait_vec()` succeeds only after both vec lanes have produced their tokens
  • this makes a full-tile cube reload safe after the two half-row vec writes complete
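A minimal sketch of the full vec → cube bridge under the same caveats: `split_workspace` and the `VcMutex` config follow this file, while the `ub_to_gm_pad` / `gm_to_l1_nd2nz` argument orders and the placeholder names (`slot`, `ub_pj`, `l1_pj`) are assumptions:

```python
# Sketch: UB -> GM workspace -> L1 bridge for delayed p_j reuse (a2).
ws = split_workspace(DT.float, [GetCubeNum(), 2, TILE_M, TILE_N])
vc = VcMutex(1, src_end_pipe=Pipe.MTE3, dst_end_pipe=Pipe.FIX)

# Vec side (MTE3 pipe): each sub-block publishes its half of the rows.
sb_row = Var(GetSubBlockIdx() * HALF_M)
ub_to_gm_pad(ws[GetCubeIdx(), slot, sb_row:sb_row + HALF_M, 0:TILE_N], ub_pj)
# ... VcMutex handshake: cube's wait_vec() passes only after BOTH vec
#     lanes have signalled, so the full-tile reload below is safe ...
# Cube side (MTE2 pipe): reload the whole tile into L1, then L1 -> L0 -> mmad.
gm_to_l1_nd2nz(l1_pj, ws[GetCubeIdx(), slot, 0:TILE_M, 0:TILE_N])
```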

3a. Cube-side matmul dependency reuse rule on a2

The same high-level rule as a5 still applies on a2:

  • if one cube matmul only feeds a later cube matmul, prefer keeping that dependency on the cube-side path instead of inventing a UB detour

On a2 the practical boundary is stricter:

  • `l0c_to_l1` exists
  • but `l0c_to_l1(float)` is blocked on `b*` devices
  • and `l0c_to_ub` is absent

So the stable rule is:

  • if the intermediate dependency can be republished to L1 in a supported dtype, prefer L0C → L1 → L0 → `mmad`
  • do not route a pure cube-side dependency through UB
  • only fall back to the GM workspace when the dependency truly needs a vec-side stage or when the required L1 destination dtype is unsupported

Practical implication:

  • a2 still benefits from the same "avoid unnecessary L0C → UB → L1 thinking" rule
  • the difference from a5 is not the existence of `l0c_to_l1`, but the narrower dtype surface
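As a compact decision sketch (the condition names are hypothetical; only the routing rule itself comes from the bullets above):

```python
# Routing a matmul -> matmul dependency on a2 (decision sketch only).
if needs_vec_stage or l1_dst_dtype_unsupported:  # e.g. float L1 dst on b*
    route = "L0C -> GM workspace -> L1"          # fall back to the bridge
else:
    route = "L0C -> L1 -> L0 -> mmad"            # keep it cube-side
# Never route a pure cube-side dependency through UB; l0c_to_ub does not
# exist on a2 anyway.
```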

4. Sub-block execution model

Each cube core has 2 vec sub-blocks. On a2:

  • Each sub-block has its own independent 192 KB UB
  • Vec instructions in the kernel body execute on both sub-blocks simultaneously
  • Use `GetSubBlockIdx()` to compute different GM offsets for each sub-block
  • Each sub-block reads its own `TILE_M // 2` rows from the workspace

Typical pattern:

```python
sb = GetSubBlockIdx()
sb_row = Var(sb * HALF_M)
ub_data <<= workspace[cube_idx, slot, sb_row:sb_row + HALF_M, 0:TILE_N]
# ... vec processing ...
out_row = Var(q_row + sb_row)
output[out_row:out_row + HALF_M, col:col + TILE_N] <<= ub_out
```

5. Vec operations available on a2

All vec operations work on UB tensors directly (no Reg intermediate):

| Category | Operations |
| --- | --- |
| Unary | `exp`, `ln`, `abs`, `rec`, `sqrt`, `rsqrt`, `relu` |
| Binary | `add`, `sub`, `mul`, `div`, `vmax`, `vmin` |
| Scalar | `adds`, `muls`, `vmaxs`, `vmins`, `axpy` |
| Reduction | `cmax`, `cgmax`, `cmin`, `cgmin`, `cadd`, `cgadd`, `cpadd` |
| Broadcast | `dup`, `brcb` |
| Datamove | `gm_to_ub_pad`, `ub_to_gm_pad`, `ub_to_ub` |
| Cast | `cast` |
| Compare | `compare`, `compare_scalar`, `set_cmpmask` |
| Select | `select` |
| Mask | `set_mask`, `reset_mask` |

5.1 Bitpacked `uint8` masks and scalar-fill `select`

For a2 vec-side masking that is naturally bitpacked by column prefix, a practical shape is `Tensor(DT.uint8, [HALF_M, TILE_N // 8], Position.UB)`.

Stable rules:

  • `compare(...)` / `compare_scalar(...)` produce packed-bit `uint8` masks for this path, and `select(...)` consumes the same packed form
  • do not treat the `select` control as an expanded `[HALF_M, TILE_N]` byte-per-element 0/1 tensor when the hardware contract is bitpacked
  • for rowwise prefix or suffix masks, prefer synthesizing mask bytes with `compare_scalar(...)` against a reusable integer column-index tensor instead of populating every mask byte with `SetValueTo(...)`
  • `dup` does not support `uint8`, but whole-mask initialization can often be done efficiently through a width-compatible integer reinterpret such as `mask.reinterpret(DT.int)`
  • `SetValueTo` still writes only the first element of the destination view, so per-byte loops are correct as a fallback but expensive in simulator control flow
  • `select(..., SelectMode.TENSOR_SCALAR)` only uses the first scalar value from `src2`
  • for score-domain invalidation, a small scalar-fill source such as `Tensor(DT.float, [1, 64], Position.UB)` initialized with `dup(..., neg_large)` is sufficient; `src2` does not need to match the full score tile shape (see the sketch after the next list)

Use this when:

  • the control comes from an explicit packed `uint8` mask tensor
  • the fallback branch is a single scalar fill value like `neg_large` or `0.0`
  • you want to avoid allocating a full-size fallback tensor just for `select`
  • you want to amortize mask construction by building it once and reusing it across many vec operations
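A minimal sketch of the scalar-fill `select`. The shapes, `SelectMode.TENSOR_SCALAR`, and the `dup(..., neg_large)` fill come from this file; the `select(dst, src1, src2, mask, mode)` argument order and the placeholder names (`ub_score`, `HALF_M`, `TILE_N`) are assumptions:

```python
# Packed-bit mask: one uint8 byte controls 8 columns.
mask = Tensor(DT.uint8, [HALF_M, TILE_N // 8], Position.UB)
# Tiny scalar-fill source: TENSOR_SCALAR reads only its first element,
# so [1, 64] is enough; no full [HALF_M, TILE_N] fallback tensor needed.
fill = Tensor(DT.float, [1, 64], Position.UB)
neg_large = -1.0e30
dup(fill, neg_large)
# Apply the mask: invalid score positions collapse to neg_large.
select(ub_score, ub_score, fill, mask, SelectMode.TENSOR_SCALAR)
```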

Diagonal causal reuse rule:

  • for repeated diagonal masking on `[HALF_M, TILE_N]` tiles, it is valid to build one packed mask `Tensor(DT.uint8, [HALF_M, TILE_N // 8], Position.UB)` and apply one full-tile `select(...)`
  • if the diagonal geometry is stable across iterations, prebuild that packed mask once and reuse it
  • for a2 dense backward causal tails, full 128x128 q-tiles have stable sub-block geometry and can reuse one static packed diagonal mask per `GetSubBlockIdx()`
  • tail `M` tiles cannot blindly reuse that full-tile mask when the kernel splits rows with `half_rows = CeilDiv(valid_m, 2)`, because the second sub-block's `row_begin` shifts away from the fixed 64
  • in that case, keep the same packed-mask + one-`select(...)` pattern, but rebuild the packed mask dynamically only for the tail tile

Column-index template trick (sketched below):

  • when the packed mask compares against contiguous column ids, build a reusable integer template once, e.g. `Tensor(DT.int, [1, TILE_N], Position.UB)`
  • a compact a2 pattern is to reinterpret that tensor as `int64` and write two `int32` column ids per store while initializing `[0, 1, 2, ..., TILE_N - 1]`
  • later `compare_scalar(mask[row:row + 1, 0:TILE_N // 8], col_idx, valid_cols, CompareMode.LT)` synthesizes the whole row mask without scalar-per-byte loops
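A sketch of that template. `reinterpret`, `SetValueTo`, and the `compare_scalar` call follow the bullets above; the little-endian `int64` packing and the loop bounds are assumptions:

```python
# Build [0, 1, ..., TILE_N - 1] once as an int32 column-id template.
col_idx = Tensor(DT.int, [1, TILE_N], Position.UB)
col64 = col_idx.reinterpret(DT.int64)  # one store writes two int32 ids
for i in range(TILE_N // 2):
    # SetValueTo writes the first element of the view: pack ids 2i and
    # 2i + 1 into one int64 (low word first, assuming little-endian).
    SetValueTo(col64[0:1, i:i + 1], ((2 * i + 1) << 32) | (2 * i))

# One compare_scalar then synthesizes a whole packed row mask:
# bit j = (col_idx[j] < valid_cols), i.e. a prefix mask of valid columns.
compare_scalar(mask[row:row + 1, 0:TILE_N // 8], col_idx, valid_cols,
               CompareMode.LT)
```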

6. UB initialization with `dup`

On a2, UB contents are undefined at kernel entry. There is no zero-initialization guarantee. Operations like `muls(ub, ub, 0.0)` are unreliable on uninitialized buffers because `0.0 × NaN = NaN`.

Use `dup(tensor, scalar_value)` to fill a UB tensor with a known value.

For the current validated a2 softmax kernels, a sufficiently negative finite sentinel such as `neg_large = -1.0e30` is the default way to materialize score-domain invalidation and running-max identity state. Treat a literal `float("-inf")` as simulator-convenient but hardware-fragile.

`dup` signature: `dup(dst: Tensor, value: Union[int, float, Var], ...)`

The `dup` operation uses the same stride inference as other vec operations. It fills blocks and repeats according to `infer_repeat(dst)` and the auto-inferred strides.

Coverage analysis for common shapes (float, C0=8):

| Shape | repeat | blk_stride | Elements covered | Buffer size | Complete? |
| --- | --- | --- | --- | --- | --- |
| `[64, 1]` | 1 | 1 | 1×8×8 = 64 | 64 | ✓ |
| `[64, 8]` | 8 | 0 | 8×1×8 = 64 | 512 | ✗ (only 1/8) |
| `[64, 64]` | 64 | 1 | 64×8×8 = 4096 | 4096 | ✓ |
| `[64, 128]` | 128 | 1 | 128×8×8 = 8192 | 8192 | ✓ |

Warning: `dup` on `[M, 8]` (broadcast format) fills only 64 out of 512 elements. This is acceptable if the tensor is only consumed via `blk_stride=0` operations (like `sub`), which also read only those 64 positions. But it is incorrect if you later attempt a full element-wise operation over the entire 512-element buffer.

Practical rule: initialize in the natural computation format (`[M, 1]` for scalars, `[M, N]` for data), not in the broadcast format.
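For example, a sketch (shapes follow the coverage table above; float with C0=8):

```python
neg_large = -1.0e30

# Natural [M, 1] scalar format: dup covers all 64 elements (repeat=1).
ub_rmax_s = Tensor(DT.float, [64, 1], Position.UB)
dup(ub_rmax_s, neg_large)

# Broadcast [M, 8] format: dup fills only 64 of 512 elements (see the
# table above); safe only if every consumer also reads with blk_stride=0.
ub_rmax_b = Tensor(DT.float, [64, 8], Position.UB)
dup(ub_rmax_b, neg_large)
```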

Placement within `auto_sync`

A `dup` placed inside the outer loop but before the inner loop (e.g. for per-M-tile reinitialization) is safe. It generates an extra V→MTE3 autosync event pair, because `auto_sync` sees the V-pipe `dup` and the inner loop's MTE3 `ub_to_gm_pad` as a producer-consumer pair within the same scope.

This extra event is harmless: `dup` completes before the inner loop starts, so the MTE3 event is already satisfied when the first store executes. The inner loop's own V→MTE3 events for the actual `exp → cast → store` flow are managed separately with different sync-key groups.

```python
with auto_sync():
    for gmt in range(mt_begin, mt_end):
        ...  # variable declarations
        dup(ub_rmax_s, neg_large)  # V-pipe, safe here
        for nt in range(0, tiles_n):
            ...  # cube + vec tile processing
```

7. Scalar computation

  • `scalar_sqrt(Var)` requires `Var.dtype == Datatype.float`; integer Vars raise a TypeError
  • For `1/sqrt(D)`, prefer passing the precomputed float value as a kernel parameter (sketched below)
  • `muls(ub, ub, scale_var)` accepts a `Var(float)` as the scalar argument
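A sketch of that parameter-passing style; `Var` and `muls` usage follows this file, while the helper name and head dimension are illustrative:

```python
import math

# Host side: precompute the float scale instead of calling scalar_sqrt
# on a (possibly integer) Var inside the kernel.
D = 128                     # head dimension, illustrative
scale = 1.0 / math.sqrt(D)  # plain Python float, passed as a parameter

def apply_score_scale(ub_score, scale_param):
    scale_var = Var(scale_param)         # float Var: a valid muls scalar
    muls(ub_score, ub_score, scale_var)  # scale the score tile (V pipe)
```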

8. Copying scalar-format UB state

For row-scalar buffers on a2, the natural vec format is `[M, 1]`. This format is safe for vec binary ops like `vmax`, `add`, `sub`, and `exp`, but it is not a safe format for `ub_to_ub` snapshots.

Why:

  • `ub_to_ub` infers burst length in units of C0 blocks
  • for float `[64, 1]`, that implies copying one full 8-element block per row
  • this silently corrupts or repeats scalar row state instead of copying one scalar per row

Practical rule:

  • if you need to snapshot `[M, 1]` running state such as `prev_row_max`, copy it with a vec binary op and an explicit zero buffer:

```python
dup(ub_zero_s, 0.0)                        # known-zero operand
add(tmp_scalar_buf, ub_rmax_s, ub_zero_s)  # copy = x + 0, stays [M, 1]
```

  • then update or transform `tmp_scalar_buf` with more vec ops
  • do not use `ub_to_ub` as a generic copy for `[M, 1]` scalar state

Files to study

  • `agent/example/kernels/a2/qk_matmul_batched.py` — cube-only a2 baseline
  • `agent/example/kernels/a2/flash_attn_score.py` — cube → vec with the GM workspace bridge
  • `agent/example/kernels/a2/flash_attn_score_iter.py` — running max with `dup` initialization and `[M, 1]` `vmax`
  • `agent/example/kernels/a2/flash_attn_unnorm.py` — delayed `expdiff` + final numerator accumulation on a2
  • `agent/example/kernels/a2/flash_attn_full.py` — delayed numerator accumulation plus running sum/final divide on a2
  • `agent/example/kernels/a2/flash_attn_full_pj_hif8_causal.py` — rowwise diagonal-tile causal masking plus `active_tiles_n` skip on a2, now with shared vec-side `DBuff` scratch for stage-1/2 overlap
  • `agent/example/kernels/a2/flash_attn_full_pj_half_block32_causal.py` — block-32 causal masking plus `active_tiles_n` skip on a2, also using the shared vec-side `DBuff` scratch lineage
  • `easyasc/stub_functions/vec/dupbrcb.py` — `dup` stub (validates that dst is UB, infers repeat)
  • `easyasc/simulator_v2/ops/vec/v.py` and `easyasc/simulator_v2/ops/vec/_legacy_vpipe.py` — current vec runtime path for `dup`-style execution


