
CANN A2 Pure Vector Kernel Authoring

news 2026/5/9 21:55:50

A2 Vec-Only Kernel Authoring

cannbot-skills: CANNBot is a series of agents for improving the efficiency of CANN development; this repository provides reusable Skills modules for them. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when writing or debugging a pure vec kernel on a2 (easyasc.a2) with no cube stage. Typical targets are elementwise transforms, bit-level float analysis, scalar-threshold gating, and quantization-style postprocess.

Do not use this file as the main guide for mixed cube/vec kernels. If cube is involved, start from `agent/references/constraints/a2-device.md` and the matching pattern file instead.

Goal

Capture the stable authoring rules for a2 vec-only kernels so that:

the kernel body starts from the right minimal structure
UB buffers are chosen intentionally
`compare_scalar`, `select`, `reinterpret`, and `cast` are used with the repository's real semantics
exact numeric contracts are not delegated to simulator rounding by accident

1. Use this layer when

This file is the right first read when:

the whole kernel is `GM -> UB vec ops -> GM`
there is no `@vf`, no `Reg`, and no cube handoff
the logic is mostly elementwise, flag-driven, or bit-driven
the output contract depends on thresholding, saturation, or explicit rounding

Read another file first when:

you need row-wise reductions or narrow-broadcast arithmetic
- then also read `agent/references/constraints/vec-reduction-a2.md`
- and `agent/references/constraints/vec-stride.md`
you need explicit vec mask behavior
- then read `agent/references/constraints/mask.md`
you need cube -> vec or vec -> cube ownership
- then read `agent/references/constraints/a2-device.md`
- and the matching file under `agent/references/patterns/`

2. Minimal kernel skeleton

Stable pure-vec structure on a2:

    @kernel()
    def vec_kernel(x: GMTensor, y: GMTensor, total: Var):
        data = Tensor(DT.float, [1, TILE], Position.UB)
        work = Tensor(DT.float, [1, TILE], Position.UB)
        flag = Tensor(DT.uint8, [1, TILE], Position.UB)

        with vec_scope():
            n_tiles = CeilDiv(total, TILE)
            tile_per_core = CeilDiv(n_tiles, GetVecNum())
            tile_start = Var(tile_per_core * GetVecIdx())
            tile_end = Min(tile_start + tile_per_core, n_tiles)

            dup(...)

            with auto_sync():
                for t in range(tile_start, tile_end):
                    n1 = Var(t * TILE)
                    n_valid = Min(total - n1, TILE)
                    data <<= x[n1:n1 + n_valid]
                    # vec compute on UB
                    y[n1:n1 + n_valid] <<= work

What this skeleton gets right:

`vec_scope()` decides tile ownership across vec lanes before the loop
constants are initialized once with `dup(...)`
the inner loop keeps all work in UB
tail handling stays local through `n_valid`
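
The ownership and tail arithmetic in the skeleton can be checked on the host with a plain-Python model. This is only a sketch: `ceil_div`, `tile_ranges`, and the `vec_num` argument are hypothetical stand-ins for `CeilDiv`, the per-core `tile_start` / `tile_end` computation, and `GetVecNum()`. It models the index math only, not the easyasc runtime.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, playing the role of CeilDiv in the skeleton."""
    return -(-a // b)


def tile_ranges(total: int, tile: int, vec_num: int):
    """Return, per vec index, the (n1, n_valid) slices that index would process."""
    n_tiles = ceil_div(total, tile)
    tile_per_core = ceil_div(n_tiles, vec_num)
    plan = []
    for vec_idx in range(vec_num):
        tile_start = tile_per_core * vec_idx
        tile_end = min(tile_start + tile_per_core, n_tiles)
        slices = []
        for t in range(tile_start, tile_end):  # empty when tile_start >= tile_end
            n1 = t * tile
            n_valid = min(total - n1, tile)    # the last tile may be shorter
            slices.append((n1, n_valid))
        plan.append(slices)
    return plan


plan = tile_ranges(total=1300, tile=512, vec_num=4)
for vec_idx, slices in enumerate(plan):
    print(vec_idx, slices)
```

With 1300 elements, a tile of 512, and four vec indices, the third index gets the 276-element tail and the fourth gets no work, which is the case the `Min(...)` clamp on `tile_end` exists for.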

3. UB buffer selection rules

For pure vec kernels, prefer plain `Tensor(..., Position.UB)` by default. Do not start from `DBuff` unless you truly need staged overlap or lookahead.

Useful buffer categories:

data tiles: `Tensor(DT.float, [1, TILE], Position.UB)`
temporary compute buffers: same dtype and shape as the data tile
compare/select flags: `Tensor(DT.uint8, [1, TILE], Position.UB)`
bit masks for reinterpret paths: `Tensor(DT.uint32, [1, TILE], Position.UB)` or another width-matched integer view
final integer staging for exact rounding: `Tensor(DT.int, [1, TILE], Position.UB)`

Practical rule:

if the whole tile is consumed and produced once per loop iteration, `Tensor` is usually enough
if a buffer lifetime crosses iterations or producer/consumer stages, reconsider the topology before adding double buffering

4. Stable vec control idioms

4.1 `compare_scalar` + `select`

Use `compare_scalar` to build `uint8` flag tensors, then use `select` to route values.

Important repository behavior:

`compare_scalar(...)` ignores the current vec mask
`select(...)` also ignores the current vec mask
selection is controlled only by the explicit `uint8` flag tensor
on current a2 hardware/runtime, do not rely on `uint8 -> float` casts for compare flags; keep mask-controlled float paths in `compare_scalar(...) + select(...)`

This makes them the stable control-flow building blocks for pure vec kernels.

Typical uses:

finite vs non-finite split
underflow / overflow gating
sign-dependent bias selection
replacing invalid values before a bit reinterpret path
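
To make the flag-then-route semantics concrete, here is a host-side reference model in plain Python. It is a sketch under assumptions: `compare_scalar_ref` and `select_ref` are hypothetical helpers that return new lists instead of writing a destination tensor, the `"LE"` / `"GE"` strings stand in for `CompareMode` values, and the 448.0 limit is an arbitrary illustration. The operand order of `select_ref` (flag, value when set, value when clear) follows the snippets in this file.

```python
import operator

# Hypothetical stand-ins for the CompareMode values used in this file.
_COMPARE = {"LE": operator.le, "GE": operator.ge}


def compare_scalar_ref(src, scalar, mode):
    """Build a uint8-style flag list: 1 where `src[i] <mode> scalar` holds, else 0."""
    op = _COMPARE[mode]
    return [1 if op(v, scalar) else 0 for v in src]


def select_ref(flag, if_set, if_clear):
    """Per element, take `if_set` where the flag is 1 and `if_clear` where it is 0.

    Either source may be a list or a scalar, matching the tensor/scalar mix
    seen in the snippets in this file.
    """
    def at(src, i):
        return src[i] if isinstance(src, list) else src

    return [at(if_set, i) if f else at(if_clear, i) for i, f in enumerate(flag)]


# Overflow gating: saturate values above an illustrative limit.
x = [1.5, 500.0, -3.0, 448.0]
in_range = compare_scalar_ref(x, 448.0, "LE")
saturated = select_ref(in_range, x, 448.0)
print(in_range)   # [1, 0, 1, 1]
print(saturated)  # [1.5, 448.0, -3.0, 448.0]
```

The flag list is never converted to float here; it only steers the selection, which mirrors the advice above about avoiding `uint8 -> float` casts.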

4.2 Non-finite guarding

If the later path assumes finite floats, sanitize first:

    absub <<= x.abs()
    compare_scalar(finiteflag, absub, FLOAT32_FINITE_MAX, CompareMode.LE)
    select(workub, finiteflag, x, 0.0)

Then restore original non-finite values at the end:

    select(outub, finiteflag, outub, x)

This avoids pushing `NaN` / `Inf` through exponent extraction or scale math while keeping the control constant finite.
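
The sanitize-then-restore pattern can be exercised on the host as follows. This is a minimal sketch, not the repository's implementation: `guarded_trunc` is a hypothetical helper, truncation stands in for any step that only tolerates finite inputs, and `FLOAT32_FINITE_MAX` is computed here on the assumption that the constant in the snippet above is the largest finite float32 value.

```python
import math

# Largest finite float32 value, (2 - 2**-23) * 2**127 (assumed meaning of the constant).
FLOAT32_FINITE_MAX = (2.0 - 2.0 ** -23) * 2.0 ** 127


def guarded_trunc(x):
    """Apply a finite-only step (truncation) using the sanitize/restore pattern."""
    # abs + compare: NaN compares false and Inf exceeds the limit, so both get flag 0.
    finite_flag = [1 if abs(v) <= FLOAT32_FINITE_MAX else 0 for v in x]
    # Sanitize: non-finite positions see 0.0, so int() below cannot raise.
    work = [v if f else 0.0 for v, f in zip(x, finite_flag)]
    out = [float(int(v)) for v in work]
    # Restore: put the original NaN/Inf back where the flag is clear.
    return [o if f else v for o, f, v in zip(out, finite_flag, x)]


result = guarded_trunc([2.75, math.inf, -1.5, math.nan])
print(result)  # [2.0, inf, -1.0, nan]
```

Without the guard, `int(math.inf)` and `int(math.nan)` raise in Python; on device the analogous failure is a garbage bit pattern rather than an exception, which is harder to notice.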

5. Bit-level float analysis with `reinterpret`

For exponent/mantissa logic, use `reinterpret(...)` instead of float arithmetic guesses.

Stable pattern from `agent/example/kernels/a2/to_hif8_torch.py`:

    x_u16 = workub.reinterpret(DT.uint16)
    exp_u16 = expub.reinterpret(DT.uint16)
    mask_u16 = expmask.reinterpret(DT.uint16)
    vand(exp_u16, x_u16, mask_u16)

Useful rules:

`reinterpret` is a view change, not a numeric cast
it rescales the second dimension by dtype-width ratio
it is legal on UB here
it does not support `L0C`

When extracting absolute exponent-style metadata:

use `vnot` + `vand` on the reinterpreted integer view
then reinterpret to `DT.int` or another arithmetic dtype only after the bit pattern is where you want it
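
The difference between a view change and a numeric cast is easy to verify on the host with the standard library. The sketch below assumes IEEE 754 binary32 inputs; `f32_bits`, `f32_biased_exponent`, and `f32_abs_bits` are hypothetical helper names, and reading `vnot` + `vand` as "and with the inverted sign mask" is one plausible interpretation of the rule above, not a statement about the repository's kernels.

```python
import struct
from array import array

SIGN_MASK = 0x80000000
EXP_MASK = 0x7F800000


def f32_bits(value: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of `value` as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def f32_biased_exponent(value: float) -> int:
    """Isolate the 8 exponent bits with an and-mask, then shift them down."""
    return (f32_bits(value) & EXP_MASK) >> 23


def f32_abs_bits(value: float) -> int:
    """Clear the sign bit: and with the inverted sign mask (not, then and)."""
    return f32_bits(value) & (~SIGN_MASK & 0xFFFFFFFF)


# A view change keeps the bytes and rescales the element count by the width ratio.
tile = array("f", [1.0, 0.15625, -8.0, 0.0])    # 4 x float32
as_u16 = memoryview(tile).cast("B").cast("H")   # same bytes seen as uint16
print(len(tile), len(as_u16))                   # 4 8

print(hex(f32_bits(1.0)))                       # 0x3f800000
print([f32_biased_exponent(v) for v in tile])   # [127, 124, 130, 0]
print(hex(f32_abs_bits(-8.0)))                  # 0x41000000
```

Note that a numeric cast of `1.0` to an integer would give `1`, while the reinterpreted pattern is `0x3f800000`; only the latter exposes the exponent field to `vand`-style masking.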

6. Exact rounding: do not over-trust default vec `cast`

Default `cast(...)` is convenient for ordinary dtype conversion, but it should not be treated as a proof of a higher-level numeric contract.

When the formula explicitly requires a rounding rule such as:

sign(x) * floor(abs(x) + 0.5)
round-half-away-from-zero
quantization followed by scale restore

prefer an explicit sequence.

Stable sequence:

    outub <<= x / scale
    compare_scalar(nonnegflag, outub, 0.0, CompareMode.GE)
    select(biasub, nonnegflag, plus_halfub, -0.5)
    outub <<= outub + biasub
    cast(intub, outub, round_mode=RoundMode.TRUNC)
    cast(outub, intub)
    outub <<= outub * scale

Why this is safer:

the sign-dependent `+0.5 / -0.5` encodes the formula directly
`RoundMode.TRUNC` is only used for the final integer drop
the result no longer depends on the simulator's interpretation of a more implicit rounding mode

Practical rule:

use direct `cast(dst, src)` when the formula only needs a normal dtype conversion
use an explicit bias + `TRUNC` path when the rounding rule itself is part of the contract
if the decision came from a `uint8` compare flag, materialize the float branch with `select(...)`; do not plan a follow-up `uint8 -> float` cast
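
A scalar reference for the bias-then-truncate sequence is useful as a test oracle. The sketch below is an assumption-labeled model, not the kernel: `round_half_away` is a hypothetical helper, it works in Python double precision rather than on-device float32, and Python's built-in `round` is shown only as an example of a different rule (round-half-to-even) that a default conversion might follow.

```python
import math


def round_half_away(x: float, scale: float) -> float:
    """Quantize x to a multiple of `scale`, rounding halves away from zero."""
    q = x / scale
    bias = 0.5 if q >= 0.0 else -0.5   # compare against 0.0, then select the bias
    q_int = math.trunc(q + bias)       # truncation is the only rounding step
    return float(q_int) * scale        # int -> float, then scale restore


for x in (2.5, -2.5, 0.5, 1.25):
    print(x, round_half_away(x, 1.0), round(x))
```

For the tie cases `2.5`, `-2.5`, and `0.5` the explicit path yields `3.0`, `-3.0`, and `1.0`, whereas round-half-to-even yields `2`, `-2`, and `0`. That gap is exactly what goes wrong when the contract is delegated to an implicit rounding mode.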

7. Tile-size and tail heuristics

For float vec-only kernels, `TILE = 512` is a good default starting point:

simple to reason about
comfortably small for a2 UB
large enough to amortize fixed per-tile work

For tail handling:

keep `n_valid = Min(total - n1, TILE)`
load/store through GM slices using that tail width
avoid adding a separate tail kernel unless the contract truly needs special handling

Do not optimize tile size first. Get the contract right with one simple tile size, then revisit only if UB pressure or runtime suggests it.
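
When UB pressure does become a question, a quick footprint estimate is usually enough to decide. The sketch below is illustrative only: `ub_bytes` is a hypothetical helper, the byte widths assume `DT.float` and `DT.int` are 32-bit, and the buffer list is just the set of categories named in section 3. Compare the totals against the UB capacity documented for your device.

```python
# Assumed element widths in bytes for the dtypes named in this file.
DTYPE_BYTES = {"float": 4, "uint8": 1, "uint32": 4, "int": 4}


def ub_bytes(buffers, tile: int) -> int:
    """Sum the footprint of [1, tile] buffers given as (name, dtype) pairs."""
    return sum(DTYPE_BYTES[dtype] * tile for _, dtype in buffers)


# One buffer per category from the selection rules: data, temp, flag, mask, int staging.
buffers = [
    ("data", "float"),
    ("work", "float"),
    ("flag", "uint8"),
    ("expmask", "uint32"),
    ("intub", "int"),
]

for tile in (256, 512, 1024):
    print(tile, ub_bytes(buffers, tile))
```

At `TILE = 512` this five-buffer set occupies 8704 bytes, so doubling the tile size is rarely the first constraint to worry about.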

8. When a vec-only kernel stops being "simple"

Escalate to another focused file when you hit one of these signs:

wide `[M, 128]` buffers interacting with narrow `[M, 8]` buffers
- read `agent/references/constraints/vec-stride.md`
row max / row sum / online normalization
- read `agent/references/constraints/vec-reduction-a2.md`
temporary partial masks or masked writeback behavior
- read `agent/references/constraints/mask.md`
cross-stage workspace reuse or delayed consumer logic
- read `agent/references/constraints/a2-device.md`
- and the matching pattern under `agent/references/patterns/`

9. Concrete examples

Study first:

agent/example/kernels/a2/to_hif8_torch.py

Study carefully but do not copy blindly:

agent/example/demo/a2/a2_hif8.py

Why the demo is not enough for exact-contract work:

it is useful for exponent extraction and threshold structure
but it relies on a simpler cast/store path
and it is not the best source when the exact PyTorch rounding contract must be preserved
