
CANN/cannbot-skills Cube-Vec Pattern

Cube-to-Vec-to-Cube-to-Vec Pattern

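[Free download link] cannbot-skills: CANNBot is a series of agents that improve development efficiency for CANN development; this repository provides the reusable Skills modules for it. Project address: https://gitcode.com/cann/cannbot-skills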

Generic baseline only. For a2 (b3) kernels, prefer agent/references/patterns/a2-cube-vec-cube-vec.md (and the softmax variant a2-cube-vec-cube-vec-softmax.md), which add delayed-consumer and running-statistic rules specific to a2.

Read this file when one cube stage feeds vec logic, then another cube stage, then a final vec stage. This is the highest-complexity staged pattern currently worth documenting as a dedicated route.

Use this pattern when

  • there are at least two cube-heavy stages with vec-side logic between them
  • one tile may be produced in one iteration and consumed in a later iteration
  • delayed state such as softmax stats or rescale factors must follow the consumer lifetime

Minimal flow

cube stage 1 -> vec stage 1 -> cube stage 2 -> vec stage 2

In practice this often becomes a one-tile lookahead schedule with warmup and drain.

What usually matters most

  • keeping producer and delayed consumer lifetimes separate
  • giving delayed stages their own counters
  • deciding whether the bridge should stay on chip or go through GM workspace
  • keeping scalar state aligned with the delayed consumer, not with the original producer
  • validating each stage before trusting the fused version

Stable repository lessons

  • if stage 2 reuses a stage 1 operand one iteration later, keep that operand on chip when the lifetime fits
  • if the reuse does not fit cleanly, materialize an explicit GM workspace instead of forcing a fake on-chip story
  • do not normalize too early when the numerator and denominator streams must both finish first
  • when the live query side is truly one row, flatten (B, H) into one BH axis and keep rows=1 instead of forcing a wider row tile
  • for half-input BASES=256 attention on a5, keep the outer 256 tile in L1, use splitk=64 for q @ k.t(), and splitn=64 for p @ v
  • for fp8 decode attention with external scales, mask invalid tail columns to -inf before rowmax, scale the probability tile only after the float row_sum update, and compensate with a final scale_v / P_SCALE
  • if the delayed cube consumer wants packed-NZ input, pack the vec-produced tile in UB first, then publish that NZ view into L1
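The fp8 lesson above fixes an order of operations that is easy to get wrong, so a small single-tile sketch may help. It is plain Python on one query row, not kernel code: the function name, the value of P_SCALE, and the argument layout are assumptions made for illustration, and the fp8 cast itself is only marked by a comment rather than simulated.

```python
import math

P_SCALE = 256.0  # hypothetical probability scale; the real value is kernel-specific


def decode_tile_fp8_style(score, valid_cols, v_tile, scale_v):
    """One query row, one tile: mask, rowmax, float row_sum, then scale p."""
    # 1. Mask invalid tail columns to -inf BEFORE taking the row max,
    #    so padding can never win the max or contribute to the sum.
    masked = [x if j < valid_cols else float("-inf") for j, x in enumerate(score)]
    row_max = max(masked)

    # 2. Probabilities and the row sum stay in float; masked columns become 0.0.
    p = [math.exp(x - row_max) for x in masked]
    row_sum = sum(p)

    # 3. Only now scale the probability tile (a real kernel would cast to fp8 here).
    p_scaled = [x * P_SCALE for x in p]

    # 4. p @ v, then compensate with scale_v / P_SCALE and normalize last.
    width = len(v_tile[0])
    pv = [sum(p_scaled[j] * v_tile[j][c] for j in range(len(p_scaled)))
          for c in range(width)]
    return [x * (scale_v / P_SCALE) / row_sum for x in pv]
```

Because row_sum is taken before the P_SCALE multiply, the scale cancels exactly in the final scale_v / P_SCALE factor and the denominator is unaffected by the low-precision representation of p.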

One-tile lookahead scheduling detail

The retained MLA kernel (agent/example/kernels/a5/test_mla_entire.py) uses a four-stage on-chip flow:

  1. cube: produce score tile i
  2. vec: update streaming softmax state and cast score tile i to probability tile i
  3. cube: consume delayed probability tile i-1 with the matching value/key tile
  4. vec: rescale and accumulate the delayed output tile i-1

Stable control pattern: for s in range(0, S + TILE, TILE) with:

  • if s < S: producer side (warmup + steady state)
  • if s > 0: delayed consumer side (steady state + drain)
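The control pattern above can be traced without any hardware. The sketch below only records which stage touches which tile; the function name and the stage labels cube1/vec1/cube2/vec2 are illustrative and do not come from the repository.

```python
def lookahead_schedule(S, TILE):
    """List (stage, tile_index) events for a one-tile lookahead loop."""
    events = []
    # One extra trip past S gives the delayed consumer its drain iteration.
    for s in range(0, S + TILE, TILE):
        i = s // TILE
        if s < S:
            # Producer side: warmup (i == 0) and steady state.
            events.append(("cube1", i))
            events.append(("vec1", i))
        if s > 0:
            # Delayed consumer side: steady state and drain, always one tile behind.
            events.append(("cube2", i - 1))
            events.append(("vec2", i - 1))
    return events


if __name__ == "__main__":
    for event in lookahead_schedule(S=4, TILE=2):
        print(event)
```

With S=4 and TILE=2 the first trip runs only the producer (warmup), the middle trip runs both sides, and the last trip runs only the consumer (drain), so every tile is produced once and consumed once.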

On-chip operand reuse:

  • if stage 2 must reuse a stage 1 operand one iteration later, keep that operand resident on chip instead of round-tripping to GM
  • in the MLA kernel, k_nope stays in l1kn and the vec-produced p tile is published directly into l1p
  • in agent/example/kernels/a5/mha_ifa_nz.py, the vec-produced p tile is first packed with reg_to_ub(...) and then published to l1p as .nz()

Delayed scalar state:

  • delayed scalar state must follow the consumer lifetime, not the producer lifetime
  • cache per-tile row_exp_diff / rescale factors in a slot indexed by the delayed consumer counter
  • keep running row_max, row_sum, and output_acc under a single vec owner to avoid duplicate updates
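To make the delayed-state rules concrete, here is a minimal host-side sketch of a one-row streaming softmax attention with one-tile lookahead. It is plain Python, not the MLA kernel: the two-slot ring, the helper names, and the separate producer and consumer counters are assumptions chosen to mirror the rules above, and it assumes every tile contains at least one finite score.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_row_reference(q, keys, values):
    """Plain softmax(q @ k.T) @ v for one query row, used as a check."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(w[j] * values[j][c] for j in range(len(keys))) / z
            for c in range(len(values[0]))]


def attention_row_lookahead(q, keys, values, TILE):
    S, width = len(keys), len(values[0])
    row_max, row_sum = float("-inf"), 0.0
    output_acc = [0.0] * width
    p_slots = [None, None]         # probability tiles waiting for cube stage 2
    exp_diff_slots = [1.0, 1.0]    # per-tile rescale factors, same slot layout
    prod_count = 0                 # producer counter (stages 1 and 2)
    cons_count = 0                 # delayed consumer counter (stages 3 and 4)
    for s in range(0, S + TILE, TILE):
        if s < S:
            score = [dot(q, k) for k in keys[s:s + TILE]]          # cube stage 1
            new_max = max(row_max, max(score))                     # vec stage 1
            diff = math.exp(row_max - new_max)
            p = [math.exp(x - new_max) for x in score]
            row_sum = row_sum * diff + sum(p)
            row_max = new_max
            p_slots[prod_count % 2] = p
            exp_diff_slots[prod_count % 2] = diff
            prod_count += 1
        if s > 0:
            slot = cons_count % 2          # indexed by the consumer counter
            start = cons_count * TILE
            p, v_tile = p_slots[slot], values[start:start + TILE]
            pv = [sum(p[j] * v_tile[j][c] for j in range(len(p)))  # cube stage 2
                  for c in range(width)]
            diff = exp_diff_slots[slot]                            # vec stage 2
            output_acc = [diff * o + x for o, x in zip(output_acc, pv)]
            cons_count += 1
    # Normalize only after both the numerator and denominator streams finish.
    return [o / row_sum for o in output_acc]
```

The consumer reads the rescale factor that was cached with tile i-1, not the one the producer has just computed for tile i; using the producer's current value would rescale output_acc against the wrong running maximum. Two slots are enough here because the producer of tile i only overwrites the slot of tile i-2, which has already been consumed.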

Typical files to study

  • agent/example/kernels/a5/test_mla_entire.py
  • agent/example/kernels/a5/mha_ifa.py
  • agent/example/kernels/a5/mha_ifa_256.py
  • agent/example/kernels/a5/mha_ifa_fp8_scale_256.py
  • agent/example/kernels/a5/mha_ifa_nz.py


Author's statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.

