
CANN a2 Vector Reduction Constraints

Vec Reduction on a2 (cmax + brcb Pattern)

[Free download] cannbot-skills: CANNBot is a family of agents for improving CANN development efficiency; this repository provides its reusable Skills modules. Project: https://gitcode.com/cann/cannbot-skills

Read this file when implementing per-row reductions (max, sum) on a2 using the vec pipeline. On a2 there is no Reg/RegList, so reductions use UB-to-UB `cmax`/`cadd` + `brcb`.

Goal

Get per-row max (or sum) correct on a2, including the broadcast step that is easy to forget.

1. The cmax output format

`cmax(dst, src)` reduces one repeat (64 float elements = 8 blocks of 8) to a single scalar. The scalar is stored at `dst[rep * dst_rep_stride]`, one float element per repeat.

With the default `dst_rep_stride=1`, the scalars are packed densely:

```
dst[0]  = max of row 0
dst[1]  = max of row 1
...
dst[63] = max of row 63
```

This is not a C0 block layout. The 8-element block structure that `sub`/`vmax` expect is not satisfied.
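The dense layout can be checked with a plain-Python sketch. This simulates the documented semantics only; `cmax_sim` is a hypothetical name, not the real easyasc API.

```python
def cmax_sim(src, rows, cols=64, dst_rep_stride=1):
    # One repeat == one 64-element row; each repeat reduces to a single
    # scalar stored at dst[rep * dst_rep_stride].
    dst = [0.0] * ((rows - 1) * dst_rep_stride + 1)
    for rep in range(rows):
        dst[rep * dst_rep_stride] = max(src[rep * cols:(rep + 1) * cols])
    return dst

data = [float(r * 100 + c) for r in range(64) for c in range(64)]
scalars = cmax_sim(data, rows=64)
# scalars[0] == 63.0 (max of row 0), scalars[63] == 6363.0 (max of row 63):
# 64 scalars packed densely, with no 8-element C0 block structure.
```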

2. The bug: using cmax output directly in sub

If you pass the cmax output to `sub` with `blk_stride=0`:

  • `sub` reads a C0 block (8 elements) and broadcasts it across all 8 blocks of each repeat
  • But the 8 elements in that block are maxes of 8 different rows, not 8 copies of one row's max
  • Result: each row gets subtracted by the wrong max → `exp` produces huge or wrong values

Symptom: output values > 1.0 from `exp(score - max)` where max should be the row max.
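The symptom can be reproduced numerically in plain Python. This is a sketch of the mispairing only, not the device stride machinery: within one reused block, element c gets paired with the max of row c instead of row r.

```python
import math

rows = [[float(r * 10 + c) for c in range(8)] for r in range(8)]
row_max = [max(row) for row in rows]   # dense cmax-style output: one max per row

# Broken broadcast: element c of every row is paired with row_max[c]
bad = [math.exp(rows[r][c] - row_max[c]) for r in range(8) for c in range(8)]
# Correct pairing: every element of row r uses row_max[r]
good = [math.exp(rows[r][c] - row_max[r]) for r in range(8) for c in range(8)]

any(v > 1.0 for v in bad)    # True: the telltale > 1.0 symptom
all(v <= 1.0 for v in good)  # True: exp(score - row max) never exceeds 1
```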

3. The fix: brcb broadcast between cmax and sub

After cmax, use `brcb` to expand each scalar to fill a full C0 block:

```python
ub_max_s = Tensor(DT.float, [HALF_M, 1], Position.UB)  # cmax scalars
ub_max   = Tensor(DT.float, [HALF_M, 8], Position.UB)  # broadcast result
cmax(ub_max_s, ub_tmp)
brcb(ub_max, ub_max_s, dst_blk_stride=1, dst_rep_stride=8)
```

How brcb works:

  • `repeat = infer_repeat_brcb(src) = HALF_M * 1 // 8 = 8`
  • For each repeat: reads 8 scalars from `src[rep*8 : rep*8+8]`
  • For each of 8 blocks: fills `dst[block_begin : block_begin + C0]` with one scalar
  • With `dst_blk_stride=1, dst_rep_stride=8`: blocks are contiguous, repeats advance by 8 blocks

Result: `ub_max[n*8 : n*8+8]` all contain `max_of_row_n` for n in 0..63.
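The addressing above can be mirrored in a short simulation. `brcb_sim` and the C0 size of 8 floats are assumptions for illustration, not the real API:

```python
def brcb_sim(src, repeat, dst_blk_stride=1, dst_rep_stride=8, c0=8):
    # dst is addressed in units of C0 blocks (8 floats each)
    dst = [0.0] * (repeat * dst_rep_stride * c0)
    for rep in range(repeat):
        for blk in range(8):                       # 8 blocks per repeat
            scalar = src[rep * 8 + blk]            # one scalar per block
            begin = (rep * dst_rep_stride + blk * dst_blk_stride) * c0
            dst[begin:begin + c0] = [scalar] * c0  # fill the whole block
    return dst

scalars = [float(n) for n in range(64)]            # cmax output, one max per row
out = brcb_sim(scalars, repeat=64 // 8)
# out[n*8 : n*8+8] holds eight copies of row n's max, for n in 0..63
```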

3a. Dense row `[1, 64]` -> broadcast `[64, 8]` also needs explicit `brcb` params

When the scalar statistics arrive as one dense row such as:

  • qkmaxbuf = Tensor(DT.float, [1, 64], Position.UB)
  • qksumbuf = Tensor(DT.float, [1, 64], Position.UB)

and the destination is the usual broadcast format:

  • qkmaxbrcb = Tensor(DT.float, [64, 8], Position.UB)

do not rely on default `brcb(...)` parameter inference.

Validated pattern:

```python
qkmaxbuf <<= qkmax[bh:bh + 1, row0:row0 + 64]
brcb(qkmaxbrcb, qkmaxbuf, repeat=64 // 8, dst_blk_stride=1, dst_rep_stride=8)
```

Why this matters:

  • the source load into `[1, 64]` is fine
  • the failure comes from the broadcast configuration, not from the GM -> UB read itself
  • with the validated explicit parameters, row r is expanded to `qkmaxbrcb[r, 0:8]`

Concrete reproducer:

  • tmp/validate_row64_brcb.py

Practical rule:

  • for row-stat broadcasts on a2, treat `brcb(..., dst_blk_stride=1, dst_rep_stride=8)` as mandatory
  • when the source is `[1, 64]`, also pin `repeat=64 // 8` explicitly in validated kernels instead of trusting defaults

4. Complete row-max pattern for [HALF_M, 128] float data

```python
HALF_M = 64
HALF_N = 64
ub_data  = Tensor(DT.float, [HALF_M, 128], Position.UB)
ub_tmp   = Tensor(DT.float, [HALF_M, HALF_N], Position.UB)
ub_max_s = Tensor(DT.float, [HALF_M, 1], Position.UB)
ub_max   = Tensor(DT.float, [HALF_M, 8], Position.UB)

# Step 1: element-wise max of two 64-col halves → 64 values per row
vmax(ub_tmp, ub_data[0:HALF_M, 0:HALF_N], ub_data[0:HALF_M, HALF_N:128])

# Step 2: reduce 64 → 1 scalar per row
cmax(ub_max_s, ub_tmp)

# Step 3: broadcast each scalar to fill a C0 block (8 identical elements)
brcb(ub_max, ub_max_s, dst_blk_stride=1, dst_rep_stride=8)

# Step 4: subtract (sliced to align repeat with narrow max buf)
sub(ub_data[0:HALF_M, 0:HALF_N], ub_data[0:HALF_M, 0:HALF_N], ub_max)
sub(ub_data[0:HALF_M, HALF_N:128], ub_data[0:HALF_M, HALF_N:128], ub_max)
```

Why each step is needed:

  • vmax: 128 columns exceed one repeat (64 elements). Must merge to 64 first.
  • cmax: reduces 64 → 1 scalar per row. Output is dense, not block-aligned.
  • brcb: fills C0 blocks so that `sub` with `blk_stride=0` broadcasts correctly.
  • sub with slicing: see agent/references/constraints/vec-stride.md for why.

5. Why the `[M, 8]` broadcast format fails for binary ops between two narrow buffers

After `brcb`, the result tensor has shape `[HALF_M, 8]` with `span[1] = 8 = C0`. Stride inference for `[64, 8]` float gives `blk_stride=0, rep_stride=1, repeat=8`.

With `blk_stride=0`, all 8 blocks within one repeat address the same 8 elements. So each repeat touches 8 unique elements, and 8 repeats touch 8×8 = 64 elements. But the buffer contains 64×8 = 512 elements. The remaining 448 are never reached.

This means `vmax(buf_a[64,8], buf_a[64,8], buf_b[64,8])` only computes the max for the first 8 rows. Rows 8–63 are left unchanged.

Root cause: `blk_stride=0` is the broadcast stride designed for `sub(wide, wide, narrow)`, where the wide destination's repeat cadence drives iteration and the narrow source stays per-row. It was never intended for element-wise operations between two identically shaped narrow buffers.

Diagnostic method: before choosing a tensor format for any vec binary operation, manually trace:

  1. `infer_repeat(dst) = span[0] * span[1] / (256 // dtype.size)`
  2. `infer_strides(tensor)`: check whether `blk_stride` is 0 or 1
  3. total unique elements = `repeat × (8 if blk_stride==1 else 1) × elements_per_block`
  4. compare against the actual element count (`shape[0] * shape[1]`)

If the totals disagree, the operation will silently skip elements.
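The four-step trace can be written out directly. These helpers only mirror the inference rules described above; the names are illustrative, not the actual vecutils signatures:

```python
def infer_repeat(shape, dtype_size=4):
    # elements per repeat = 256 bytes / dtype size (64 for float)
    return shape[0] * shape[1] // (256 // dtype_size)

def unique_elements(repeat, blk_stride, elems_per_block=8):
    # blk_stride=0 re-reads the same block for all 8 blocks of a repeat
    blocks = 8 if blk_stride == 1 else 1
    return repeat * blocks * elems_per_block

shape = (64, 8)                                  # brcb output format
repeat = infer_repeat(shape)                     # 64*8 // 64 = 8
touched = unique_elements(repeat, blk_stride=0)  # 8 * 1 * 8 = 64
total = shape[0] * shape[1]                      # 512
# touched != total: a binary op between two [64, 8] buffers silently
# skips 512 - 64 = 448 elements (rows 8..63)
```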

Reference implementation: easyasc/stub_functions/vec/vecutils.py (`infer_strides`, `infer_repeat`).

6. Using the `[M, 1]` scalar format for binary ops between reduction outputs

The `cmax` output `[HALF_M, 1]` has `span[1]=1`. Stride inference for `[64, 1]` float: `span[1]=1` matches neither 64 nor 8, so defaults apply: `blk_stride=1, rep_stride=8, repeat=1`.

With `blk_stride=1` and 8 blocks per repeat:

  • Block 0: elements [0:8]
  • Block 1: elements [8:16]
  • Block 7: elements [56:64]
  • Total: 1 repeat × 8 blocks × 8 elements = 64 elements = all rows

So `vmax(dst[64,1], src1[64,1], src2[64,1])` correctly computes per-row element-wise max over all 64 dense scalars from the `cmax` output. No rows are skipped.

Key insight: operate on the dense scalar `[M, 1]` format BEFORE the `brcb` broadcast. Only `brcb` to `[M, 8]` after the scalar-level operation is complete.

Validated pattern for running max across tiles:

```python
ub_max_s  = Tensor(DT.float, [HALF_M, 1], Position.UB)  # per-tile cmax output
ub_rmax_s = Tensor(DT.float, [HALF_M, 1], Position.UB)  # running max (persistent)
ub_max    = Tensor(DT.float, [HALF_M, 8], Position.UB)  # broadcast for sub

# before inner loop: initialize running max
dup(ub_rmax_s, neg_large)

# inside each tile:
cmax(ub_max_s, ub_tmp)                # per-tile row max
vmax(ub_rmax_s, ub_rmax_s, ub_max_s)  # update in [M,1] format
brcb(ub_max, ub_rmax_s, dst_blk_stride=1, dst_rep_stride=8)  # broadcast AFTER update
sub(ub_data[0:M, 0:64], ub_data[0:M, 0:64], ub_max)
sub(ub_data[0:M, 64:128], ub_data[0:M, 64:128], ub_max)
```

Here `neg_large` is a sufficiently large finite negative sentinel, not literal `float("-inf")`.

UB overhead for running max: one extra `[64, 1]` float tensor = 0.25 KB.
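The update order can be checked with a plain-Python model of the `[M,1]` state. This is a sketch of the semantics, with an illustrative finite sentinel standing in for `neg_large`:

```python
NEG_LARGE = -3.0e38  # finite sentinel, standing in for neg_large

def running_row_max(tiles):
    # tiles: list of tiles, each a list of M rows of floats
    m = len(tiles[0])
    rmax = [NEG_LARGE] * m                      # dup(ub_rmax_s, neg_large)
    for tile in tiles:
        tile_max = [max(row) for row in tile]   # cmax: per-tile row max
        rmax = [max(a, b) for a, b in zip(rmax, tile_max)]  # vmax in [M,1]
    return rmax  # brcb broadcast happens only after this update

tiles = [[[1.0, 5.0], [2.0, 0.0]],
         [[4.0, 3.0], [9.0, -1.0]]]
running_row_max(tiles)  # [5.0, 9.0]
```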

6a. Copying `[M,1]` scalar state across iterations

The validated running-max pattern often needs a snapshot of the previous scalar state before updating it, for example to compute `exp(prev_m - curr_m)` in streamed attention.

Do not snapshot `[M,1]` buffers with `ub_to_ub`.

Why this fails:

  • `ub_to_ub` works in C0-sized blocks
  • for float `[64,1]`, that means an 8-element block copy per row
  • the operation does not mean "copy one scalar per row"

Stable fix:

  • allocate a zero buffer in the same `[M,1]` format
  • use a vec binary op such as `add(dst, src, zero)` to make the copy

Example:

```python
ub_prev_s = DBuff(DT.float, [HALF_M, 1], Position.UB)
ub_rmax_s = Tensor(DT.float, [HALF_M, 1], Position.UB)
ub_zero_s = Tensor(DT.float, [HALF_M, 1], Position.UB)

dup(ub_zero_s, 0.0)
add(ub_prev_s[slot], ub_rmax_s, ub_zero_s)  # safe scalar-format copy
vmax(ub_rmax_s, ub_rmax_s, ub_max_s)
sub(ub_prev_s[slot], ub_prev_s[slot], ub_rmax_s)
exp(ub_prev_s[slot], ub_prev_s[slot])
```

Study:

  • agent/example/kernels/a2/flash_attn_unnorm.py
  • agent/references/patterns/a2-cube-vec-cube-vec.md

7. Adapting for row sum (cadd)

Same pattern: replace `vmax` with `add` and `cmax` with `cadd`:

```python
add(ub_tmp, ub_data[0:M, 0:64], ub_data[0:M, 64:128])
cadd(ub_sum_s, ub_tmp)
brcb(ub_sum, ub_sum_s, dst_blk_stride=1, dst_rep_stride=8)
div(ub_data[0:M, 0:64], ub_data[0:M, 0:64], ub_sum)
div(ub_data[0:M, 64:128], ub_data[0:M, 64:128], ub_sum)
```

For streamed normalized attention on a2, the stable update order is:

  1. compute `expdiff = exp(prev_max - curr_max)` in `[M,1]`
  2. compute the float probability tile `p = exp(score - curr_max)`
  3. reduce `sum_j` from that float tile with `add` + `cadd`
  4. update `row_sum = row_sum * expdiff + sum_j` in `[M,1]`
  5. cast `p` to half only after the sum update if the downstream cube stage needs `p.half().float()`
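Steps 1–4 are the standard online-softmax denominator update. A plain-Python sketch of one row streamed across tiles; `stream_update` is an illustrative helper, not part of the kernel API:

```python
import math

NEG_LARGE = -3.0e38  # finite sentinel for the initial running max

def stream_update(prev_max, prev_sum, scores):
    curr_max = max(prev_max, max(scores))         # running row max for this tile
    expdiff = math.exp(prev_max - curr_max)       # step 1: rescale factor
    p = [math.exp(s - curr_max) for s in scores]  # step 2: probability tile
    sum_j = sum(p)                                # step 3: per-tile row sum
    row_sum = prev_sum * expdiff + sum_j          # step 4: running sum update
    return curr_max, row_sum

m, s = NEG_LARGE, 0.0
for tile in [[1.0, 3.0], [5.0, 2.0]]:
    m, s = stream_update(m, s, tile)
# s == sum(exp(x - m)) over every score seen, as if computed in one pass
```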

8. UB cost

| Buffer | Shape | Bytes (float) |
| --- | --- | --- |
| ub_tmp | [64, 64] | 16 KB |
| ub_max_s | [64, 1] | 0.25 KB |
| ub_max | [64, 8] | 2 KB |
| Total reduction overhead | | ~18.25 KB |

Files to study

  • agent/example/kernels/a2/flash_attn_score.py: per-tile independent row max
  • agent/example/kernels/a2/flash_attn_score_iter.py: running max across tiles using `[M,1]` scalar `vmax`
  • agent/example/kernels/a2/flash_attn_unnorm.py: delayed `expdiff` computed from copied `[M,1]` running state
  • agent/example/kernels/a2/flash_attn_full.py: running sum + final sliced `div` on top of the delayed numerator pipeline
  • easyasc/simulator_v2/ops/vec/v.py and easyasc/simulator_v2/ops/vec/_legacy_vpipe.py: current vec runtime path for `cmax`, `brcb`, and `dup`
  • easyasc/stub_functions/vec/group.py: cmax stub with dst_rep_stride default
  • easyasc/stub_functions/vec/dupbrcb.py: dup and brcb stubs
  • easyasc/stub_functions/vec/vecutils.py: `infer_strides` and `infer_repeat` logic

