
CANN EasyAsc DSL a2 Cube-Vec-Cube-Vec Pattern

a2 Cube-to-Vec-to-Cube-to-Vec Pattern (Triple Bridge, Normalized Online Softmax)

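[Free download link] cannbot-skills: CANNBot is a family of agents for CANN development aimed at improving development efficiency; this repository provides reusable Skills modules for it. Project address: https://gitcode.com/cann/cannbot-skills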

Read this file when writing an a2 (easyasc.a2, device b3) kernel with:

  • one cube stage that produces a score tile
  • vec logic that updates running row max and running row sum
  • a later cube stage that consumes the delayed probability tile
  • a final vec stage that accumulates the delayed cube output
  • one final vec-only divide by the accumulated row sum

Typical target formula:

  • score_j = q.float() @ k_j.float().t() * scale
  • curr_m = maximum(prev_m, rowmax(score_j))
  • expdiff_j = exp(prev_m - curr_m)
  • p_j = exp(score_j - curr_m)
  • row_sum = row_sum * expdiff_j + p_j.sum(-1)
  • pv_j = p_j.half().float() @ v_j.float()
  • out = out * expdiff_j + pv_j
  • out = out / row_sum
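As a host-side cross-check of the formula above, here is a minimal pure-Python sketch for a single query row. It is not easyasc code: the helper names (`to_half`, `streamed_attention_row`, `direct_attention_row`) are hypothetical, and the half cast is modeled with the standard-library `struct` format `e`.

```python
import math
import struct


def to_half(x: float) -> float:
    """Round to IEEE half precision and back (models p_j.half().float())."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def streamed_attention_row(q, k_tiles, v_tiles, scale):
    """Online-softmax attention for one query row, one K/V tile at a time."""
    d_out = len(v_tiles[0][0])
    m = -math.inf        # running row max
    row_sum = 0.0        # running row sum, kept in float
    out = [0.0] * d_out  # unnormalized numerator
    for k_j, v_j in zip(k_tiles, v_tiles):
        score = [scale * sum(a * b for a, b in zip(q, k)) for k in k_j]
        curr_m = max(m, max(score))
        expdiff = math.exp(m - curr_m)        # exp(-inf) == 0.0 on the first tile
        p = [math.exp(s - curr_m) for s in score]
        row_sum = row_sum * expdiff + sum(p)  # from the float tile, before the cast
        p_half = [to_half(x) for x in p]
        pv = [sum(p_half[n] * v_j[n][c] for n in range(len(v_j)))
              for c in range(d_out)]
        out = [o * expdiff + x for o, x in zip(out, pv)]
        m = curr_m
    return [o / row_sum for o in out]         # single divide at the very end


def direct_attention_row(q, k_all, v_all, scale):
    """Non-streamed float softmax reference, for comparison only."""
    score = [scale * sum(a * b for a, b in zip(q, k)) for k in k_all]
    mx = max(score)
    p = [math.exp(s - mx) for s in score]
    total = sum(p)
    return [sum(p[n] * v_all[n][c] for n in range(len(v_all))) / total
            for c in range(len(v_all[0]))]
```

The two functions should agree only up to half-precision rounding of `p_j`, which is the behavior the contract above asks for.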

This is the normalized counterpart to a2-cube-vec-cube-vec.md. Use that older pattern only when the kernel stops at the unnormalized numerator.

One-page route for the common case

If this file matches your contract, do not preload all of:

  • agent/references/constraints/reduction.md
  • agent/references/constraints/vec-reduction-a2.md
  • agent/references/constraints/vec-stride.md
  • agent/references/constraints/online-softmax-tail.md

This page now owns the common normalized-online-softmax authoring rules. Open the smaller constraint pages only when a specific failure mode still remains unclear after this file.

Why this needs its own a2 pattern

The a2 hardware constraints are the same as the unnormalized case:

  • cube -> vec cannot use l0c_to_ub
  • vec -> cube cannot use ub_to_l1_*
  • delayed cube output must come back to vec for final accumulation

But normalized online softmax adds two stability-sensitive requirements:

  • running row_sum must be updated from the float exp(...) tile before any cast to half
  • the final divide must happen only once, after all delayed numerator tiles have been accumulated

So the stable a2 flow is:

GM(q,k,v) -> L1 -> L0 -> L0C(score) -> GM(score_ws) -> UB(score)
  -> vec(max, expdiff, exp, row_sum, cast p) -> GM(p_ws) -> L1 -> L0 -> L0C(pv)
  -> GM(pv_ws) -> UB(pv) -> UB(accum) -> final UB divide by row_sum -> GM(out)

Workspaces and ownership edges

Use the same three GM workspaces as the unnormalized pattern:

  1. score_ws
    • dtype: float
    • shape: [GetCubeNum(), 2, TILE_M, TILE_N]
    • purpose: L0C(score) -> UB(score)
  2. p_ws
    • dtype: half
    • shape: [GetCubeNum(), 2, TILE_M, TILE_N]
    • purpose: UB(p_j.half()) -> L1(p_j)
  3. pv_ws
    • dtype: float
    • shape: [GetCubeNum(), 2, TILE_M, D]
    • purpose: L0C(pv_j) -> UB(pv_j)
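To size the three workspaces on the host, the shapes above can be turned into byte counts. This is a sketch under stated assumptions: `workspace_plan` is a hypothetical helper, `cube_num` stands in for whatever `GetCubeNum()` returns on the target, and the second dimension of 2 is read here as the two delayed slots.

```python
from math import prod

DTYPE_BYTES = {"float": 4, "half": 2}


def workspace_plan(cube_num, tile_m, tile_n, d):
    """Return dtype, shape and byte size for score_ws, p_ws and pv_ws."""
    specs = {
        "score_ws": ("float", (cube_num, 2, tile_m, tile_n)),
        "p_ws": ("half", (cube_num, 2, tile_m, tile_n)),
        "pv_ws": ("float", (cube_num, 2, tile_m, d)),
    }
    return {
        name: {"dtype": dtype, "shape": shape,
               "bytes": prod(shape) * DTYPE_BYTES[dtype]}
        for name, (dtype, shape) in specs.items()
    }


if __name__ == "__main__":
    # cube_num=20 and tile_m=128 are illustrative values, not taken from the source.
    for name, info in workspace_plan(20, 128, 128, 128).items():
        print(name, info["dtype"], info["shape"], info["bytes"])
```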

Ownership edges:

  • stage 1 cube -> vec: CvMutex(0, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)
  • stage 1 vec -> stage 2 cube: VcMutex(1, src_end_pipe=Pipe.MTE3, dst_end_pipe=Pipe.FIX)
  • stage 2 cube -> stage 3 vec: CvMutex(2, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)

Stable schedule

Use the same one-tile lookahead loop as the unnormalized pattern:

    for ni in range(0, tiles_n + 1):
        if ni < tiles_n:
            # stage 1: produce tile j = ni
            ...
        if ni > 0:
            # stage 2 + stage 3: consume tile j = ni - 1
            ...

That gives:

  • warmup: first iteration only produces
  • steady state: produce j while consuming j - 1
  • drain: final iteration only consumes the last delayed tile
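The warmup, steady-state and drain phases can be checked with a tiny trace of the loop. The function name `lookahead_schedule` is hypothetical; it only records which tile each iteration would produce or consume.

```python
def lookahead_schedule(tiles_n):
    """List, per loop iteration, the (action, tile index) pairs that would run."""
    steps = []
    for ni in range(tiles_n + 1):
        step = []
        if ni < tiles_n:
            step.append(("produce", ni))      # stage 1 for tile j = ni
        if ni > 0:
            step.append(("consume", ni - 1))  # stage 2 + stage 3 for tile j = ni - 1
        steps.append(step)
    return steps


if __name__ == "__main__":
    for ni, step in enumerate(lookahead_schedule(3)):
        print(ni, step)
```

For `tiles_n = 3` the trace has four iterations: the first only produces tile 0, the middle two produce one tile and consume the previous one, and the last only consumes tile 2.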

Shared L0C rule

Reuse one physical L0C family across the two cube stages.

This is the same capacity-driven choice as the unnormalized pattern:

  • stage 1 needs float [TILE_M, TILE_N]
  • stage 2 needs float [TILE_M, D] with validated D == 128
  • a2 still has only 128 KB of L0C

Keep one shared l0c_cnt, but do not merge unrelated counters just because L0C is shared.
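The capacity argument can be made concrete with a little arithmetic. The sketch below assumes TILE_M = 128 and two ping-pong slots per family; neither value is stated in this file, so treat both as illustrative. `l0c_budget` is a hypothetical helper.

```python
L0C_BYTES = 128 * 1024  # a2 L0C capacity quoted above
FLOAT_BYTES = 4


def tile_bytes(rows, cols):
    """Size of one float tile in bytes."""
    return rows * cols * FLOAT_BYTES


def l0c_budget(tile_m, tile_n, d, slots=2):
    """Compare separate per-stage L0C families against one shared family.

    `slots` is an assumed ping-pong depth, not a value taken from the source.
    """
    score = tile_bytes(tile_m, tile_n)
    pv = tile_bytes(tile_m, d)
    separate = slots * (score + pv)
    shared = slots * max(score, pv)
    return {
        "score_tile": score,
        "pv_tile": pv,
        "separate": separate,
        "shared": shared,
        "separate_fits": separate <= L0C_BYTES,
        "shared_fits": shared <= L0C_BYTES,
    }
```

Under those assumptions each tile is 64 KB, so two separate double-buffered families would need 256 KB, while one shared family needs exactly the 128 KB that is available.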

Counter layout

Keep these lifetimes separate:

  • l1qk_cnt: stage-1 q/k loads
  • l1pv_cnt: stage-2 p/v loads
  • l0c_cnt: shared physical L0C family across the two cube stages
  • stage1_cnt: delayed slot rhythm for score_ws, p_ws, and expdiff
  • stage2_cnt: delayed slot rhythm for p_ws consumption and pv_ws

Running row_sum does not need its own delayed counter. It stays vec-resident for the whole inner loop and updates immediately in stage 1.

Vec-resident persistent state

Keep these values in per-subblock UB across the whole inner loop:

  • running row max: [HALF_M, 1]
  • running row sum: [HALF_M, 1]
  • delayed expdiff slots: DBuff(DT.float, [HALF_M, 1], Position.UB)
  • final numerator accumulation: [HALF_M, D]

Use GetSubBlockIdx() so each vec lane owns only its own HALF_M rows.

Stable stage-1 update order

The normalized online update order matters:

  1. compute rowmax(score_j) in [HALF_M, 1]
  2. snapshot prev_m into the delayed expdiff slot with add(..., zero)
  3. update running_max = maximum(running_max, tile_max)
  4. turn the delayed slot into exp(prev_m - curr_m)
  5. broadcast running_max and subtract from the score tile
  6. compute the float probability tile p_j = exp(score_j - curr_m)
  7. reduce sum_j from that float tile with add + cadd
  8. update running_sum = running_sum * expdiff_j + sum_j in [HALF_M, 1]
  9. cast p_j to half only now, because stage 2 wants the exact p_j.half().float() contract

Do not move the row-sum update after the cast. That would silently change the reference contract.
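The nine steps map onto a per-row update like the one below. This is a plain-Python model of the arithmetic only, with hypothetical names (`stage1_update`, `to_half`); it does not use or imitate the easyasc vec calls.

```python
import math
import struct


def to_half(x: float) -> float:
    """Round to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def stage1_update(score_row, running_max, running_sum):
    """Apply steps 1-9 to one row; returns the new state, expdiff and the half tile."""
    tile_max = max(score_row)                          # 1. row max of this tile
    prev_m = running_max                               # 2. snapshot previous max
    running_max = max(running_max, tile_max)           # 3. update running max
    expdiff = math.exp(prev_m - running_max)           # 4. exp(prev_m - curr_m)
    shifted = [s - running_max for s in score_row]     # 5. subtract broadcast max
    p = [math.exp(x) for x in shifted]                 # 6. float probability tile
    tile_sum = sum(p)                                  # 7. reduce from the float tile
    running_sum = running_sum * expdiff + tile_sum     # 8. update running sum
    p_half = [to_half(x) for x in p]                   # 9. cast only now
    return running_max, running_sum, expdiff, p_half


if __name__ == "__main__":
    m, s, e, p_half = stage1_update([0.0, -1.0, -2.5], -math.inf, 0.0)
    # Summing after the cast gives a slightly different value than the float sum.
    print(s, sum(p_half))
```

Running the example prints two sums that differ in the fifth decimal place, which is the contract drift the warning above refers to.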

Vec rules you usually need without extra docs

For the common TILE_N = 128, D = 128 path, the usual extra questions are already answered here:

  1. keep running_max, running_sum, and delayed expdiff in scalar format [HALF_M, 1]
  2. snapshot scalar state with add(dst, src, zero), not ub_to_ub
  3. cmax/cadd output dense scalars, so broadcast them with:
    • brcb(dst, src, dst_blk_stride=1, dst_rep_stride=8)
  4. when a wide [HALF_M, 128] buffer is paired with a narrow [HALF_M, 8] broadcast row, operate on the two halves rather than on the full 128-column view in one vec call:
    • buf[:, 0:64]
    • buf[:, 64:128]
  5. update running_sum from the float p_j tile before any cast to half or hif8
  6. for non-aligned S2, invalidate score columns before cmax with a sufficiently negative finite sentinel; valid_n on the GM load alone is not enough

These six rules cover the usual reasons people would otherwise open the separate reduction, vec-reduction, vec-stride, and tail files.

Critical scalar-state rule on a2

Do not copy [HALF_M, 1] scalar-format state with ub_to_ub.

That applies to both:

  • prev_m
  • any temporary scalar snapshot you might be tempted to use for row_sum

Use add(dst, src, zero) for scalar-format copies, and keep both running_max and running_sum in [M,1] format until you explicitly need a broadcast.

Final vec accumulation and divide

Stage 3 still matches the unnormalized pattern:

  1. load delayed pv_j back into UB
  2. brcb the delayed expdiff slot to [HALF_M, 8]
  3. scale the two 64-column halves of accum
  4. add(accum, accum, pv_j)

After the inner loop finishes:

  1. brcb the final running_sum to [HALF_M, 8]
  2. div(accum[:, 0:64], accum[:, 0:64], row_sum_broadcast)
  3. div(accum[:, 64:128], accum[:, 64:128], row_sum_broadcast)
  4. write the normalized result to GM

Why the divide happens at the end:

  • accum must finish all delayed pv_j contributions first
  • row_sum is the denominator for the whole streamed softmax, not one tile
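The stage-3 accumulation and the final divide can be modeled on nested Python lists as follows. All function names are hypothetical, and the 64-column chunking only mirrors the slicing described above; it says nothing about how the real vec instructions are issued.

```python
HALF_COLS = 64  # width of each column slice used for the wide/narrow vec ops


def column_halves(d):
    """Return the (lo, hi) column ranges, e.g. [(0, 64), (64, 128)] for d = 128."""
    return [(lo, min(lo + HALF_COLS, d)) for lo in range(0, d, HALF_COLS)]


def stage3_accumulate(accum, expdiff, pv):
    """In place: accum = accum * expdiff (per row) + pv."""
    d = len(accum[0])
    for r, scale in enumerate(expdiff):
        for lo, hi in column_halves(d):   # scale each 64-column half separately
            for c in range(lo, hi):
                accum[r][c] *= scale
        for c in range(d):                # then add the delayed pv tile
            accum[r][c] += pv[r][c]


def final_divide(accum, row_sum):
    """In place: divide every row of accum by its final row_sum, once."""
    d = len(accum[0])
    for r, denom in enumerate(row_sum):
        for lo, hi in column_halves(d):
            for c in range(lo, hi):
                accum[r][c] /= denom
```

`final_divide` is meant to be called exactly once, after the last `stage3_accumulate` of the inner loop.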

Extending the pattern to non-aligned S2

The initial validated contract for this pattern kept S2 % 128 == 0 so the first implementation could ignore score-tail masking.

When S2 is not aligned, do not stop at GM-boundary valid_n slicing. For normalized online softmax, padded score columns can still corrupt:

  • rowmax(score_j)
  • curr_m
  • delayed expdiff
  • row_sum

Stable rule:

  • load k/v through valid_n
  • keep local score buffers full-sized
  • before cmax, force invalid score columns to behave like -inf
  • when materializing that mask, use a sufficiently large finite negative fill value instead of literal -inf
  • after exp, those same columns naturally behave like 0

For the current TILE_N = 128 layout, the simplest a2 implementation is:

  • split the score tile into two[HALF_M, 64]halves
  • use vec mask + finite-negative dup(...) on the affected half
  • recompute prev_valid_n for the delayed v load in stage 2
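A small host-side illustration of why padded columns matter: if padding leaves 0.0 in the tail and every valid score is negative, the row max and row sum are both wrong until the tail is overwritten. The names `mask_tail` and `row_stats` and the fill value `NEG_FILL` are hypothetical choices for this sketch.

```python
import math

NEG_FILL = -1.0e30  # assumed finite sentinel; the real value is a kernel design choice


def mask_tail(score_row, valid_n, fill=NEG_FILL):
    """Keep the first valid_n scores and overwrite the padded tail with `fill`."""
    return score_row[:valid_n] + [fill] * (len(score_row) - valid_n)


def row_stats(score_row):
    """Return (row max, sum of exp(score - row max)) for one row."""
    m = max(score_row)
    return m, sum(math.exp(s - m) for s in score_row)


if __name__ == "__main__":
    padded = [-3.0, -1.0, 0.0, 0.0]  # two valid scores, two zero-padded columns
    print(row_stats(padded))                 # max is 0.0: taken from padding
    print(row_stats(mask_tail(padded, 2)))   # max is -1.0: taken from valid data
```

With the finite fill, `exp(fill - max)` underflows to 0.0, so masked columns add nothing to the row sum. One likely motivation for avoiding literal `-inf` is that `-inf - (-inf)` evaluates to NaN if a row were ever fully masked, although this file does not state the reason.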

Read next for the exact rule and mask-construction trick:

  • agent/references/constraints/online-softmax-tail.md

Validation target

Keep the first validated contract narrow:

  • D == 128
  • S1 % 128 == 0
  • S2 % 128 == 0
  • input q/k/v are float16
  • output is float32

Suggested cases:

  1. (1, 3, 256, 256, 128) for the smallest two-tile online update
  2. (1, 1, 256, 512, 128)
  3. (1, 3, 256, 512, 128)
  4. (1, 3, 2048, 4096, 128)

For non-aligned S2 extensions, add at least:

  1. one aligned baseline: S2 % 128 == 0
  2. one left-half tail: S2 % 128 == 10
  3. one cross-boundary case: S2 % 128 == 65
  4. one mid-right-half case: S2 % 128 == 96
  5. one last-column case: S2 % 128 == 127
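The five remainders above each exercise a different part of the split-half mask. The following sketch, with the hypothetical helper `tail_layout`, computes how many columns of the last tile are valid in each 64-column half for a given S2.

```python
TILE_N = 128
HALF_N = 64


def tail_layout(s2):
    """Describe the last score tile for sequence length s2."""
    tiles_n = (s2 + TILE_N - 1) // TILE_N
    rem = s2 % TILE_N
    valid_n = rem if rem else TILE_N  # an aligned S2 has a full last tile
    return {
        "tiles_n": tiles_n,
        "valid_n": valid_n,
        "left_valid": min(valid_n, HALF_N),       # valid columns in [0, 64)
        "right_valid": max(valid_n - HALF_N, 0),  # valid columns in [64, 128)
    }


if __name__ == "__main__":
    for rem in (0, 10, 65, 96, 127):
        print(rem, tail_layout(512 + rem))
```

For example, a remainder of 10 leaves the right half fully invalid and the left half partially valid, while a remainder of 65 leaves exactly one valid column in the right half.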

Files to study / deeper fallbacks

  • agent/example/kernels/a2/flash_attn_full.py
  • agent/example/kernels/a2/flash_attn_unnorm.py
  • agent/example/kernels/a2/flash_attn_score_pv.py
  • agent/references/patterns/a2-cube-vec-cube-vec.md
  • agent/references/constraints/reduction.md: fallback only when the online update order is still unclear
  • agent/references/constraints/vec-reduction-a2.md: fallback only when the cmax/cadd -> brcb detail is still unclear
  • agent/references/constraints/vec-stride.md: fallback only when a sliced wide/narrow vec op is still unclear
  • agent/references/constraints/online-softmax-tail.md: fallback only when the non-aligned S2 mask construction itself is the question


Authoring note: parts of this article were generated with AI assistance (AIGC) and are for reference only.
