
CANN Data Movement Constraints

Datamove Constraints

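[Free download link] cannbot-skills: CANNBot is a series of agents for CANN development that aim to improve development efficiency; this repository provides reusable Skills modules for them. Project address: https://gitcode.com/cann/cannbot-skills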

Read this file when a kernel needs to move data between GM, UB, L1, or L0 using non-trivial transfer patterns.

Goal

Choose the right datamove recipe so that:

  • the publish path matches the downstream consumer's expected layout
  • unaligned widths are handled by padding rather than by shrinking local tensors
  • strided gathers avoid unnecessary host-side permute or expand
  • internal workspace bridges stay explicit when on-chip reuse does not fit

1. ND publish (ub_to_l1_nd2nz)

Best for straightforward vec preprocess + cube consume.

  • write subblock rows into UB, then publish with explicit m_dst/n_dst/m_src/n_src
  • keep row mapping consistent with GetSubBlockIdx()
  • in general vec preprocess, split into two half ranges for the two vector sides:
    • half_rows = CeilDiv(total_rows, 2)
    • vector side 0 handles [0:half_rows]
    • vector side 1 handles [half_rows:total_rows]
    • publish each half independently to the matching L1 row slice
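The two-way row split above can be sketched in plain Python. This is only a host-side model of the index arithmetic: ceil_div and subblock_row_range are hypothetical helper names, and sub_block_idx stands in for whatever GetSubBlockIdx() returns in the kernel.

```python
def ceil_div(a, b):
    """Integer ceiling division, mirroring the CeilDiv used in the text."""
    return (a + b - 1) // b


def subblock_row_range(total_rows, sub_block_idx):
    """Return the [start, stop) row range owned by vector side 0 or 1.

    Hypothetical helper: sub_block_idx models the value that
    GetSubBlockIdx() would report on each vector side.
    """
    if sub_block_idx not in (0, 1):
        raise ValueError("sub_block_idx must be 0 or 1")
    half_rows = ceil_div(total_rows, 2)
    if sub_block_idx == 0:
        return 0, half_rows
    return half_rows, total_rows


for total_rows in (8, 7, 1):
    side0 = subblock_row_range(total_rows, 0)
    side1 = subblock_row_range(total_rows, 1)
    print(total_rows, side0, side1)
```

With an odd total_rows, side 0 takes the extra row; with total_rows = 1, side 1 gets an empty range, so the publish for that side should presumably be skipped or sized to zero rows.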

Files to study:

  • agent/example/kernels/a5/vec_cube_abs_sqrt_matmul.py

2. NZ publish (ub.nz())

Use when input is already packed for NZ path. Common flow:

  • do vec compute in ND register form
  • pack to NZ-friendly UB layout (deinterleave, reg_to_ub)
  • publish with l1 <<= ub.nz()
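As a rough illustration of what an NZ-style layout means, the sketch below reorders a row-major matrix into blocks that are walked column-group first, with each block stored row-major. It assumes the commonly described [N1][M1][M0][N0] fractal ordering; the function name nd_to_nz and the toy 2x2 block size are hypothetical, and the sketch says nothing about what deinterleave, reg_to_ub, or ub.nz() do internally.

```python
def nd_to_nz(nd, m0, n0):
    """Reorder a row-major [M][N] matrix into nested blocks [N1][M1][m0][n0].

    Assumption: blocks are ordered by column group first (n1), then by
    row group (m1); elements inside a block stay row-major.
    """
    m, n = len(nd), len(nd[0])
    if m % m0 or n % n0:
        raise ValueError("matrix must be padded to whole blocks first")
    return [
        [
            [[nd[m1 * m0 + i][n1 * n0 + j] for j in range(n0)] for i in range(m0)]
            for m1 in range(m // m0)
        ]
        for n1 in range(n // n0)
    ]


def flatten(nested):
    """Flatten arbitrarily nested lists into one list, in storage order."""
    if not isinstance(nested, list):
        return [nested]
    out = []
    for item in nested:
        out.extend(flatten(item))
    return out


# Toy 4x4 matrix with 2x2 blocks; real block sizes are much larger.
nd = [[r * 4 + c for c in range(4)] for r in range(4)]
nz_flat = flatten(nd_to_nz(nd, m0=2, n0=2))
print(nz_flat)
```

The point of the toy example is only that the flattened order no longer matches the ND row order, which is why the UB tile has to be packed deliberately before it is published.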

Files to study:

  • agent/example/kernels/a5/vec_cube_abs_sqrt_matmul_nz.py

3. Unaligned width handling

For unaligned GM widths, allocate the UB tensor's second dimension at the aligned width and rely on the padded-transfer behavior to fill the gap. Do not shrink the UB tensor shape to the logical width.

For narrow a5 vec-only row kernels, a useful specialization is:

  • keep the logical host contract as [rows, H]
  • when H < 64, still stage the chunk in UB as [rows, 64]
  • use gm_to_ub_pad(..., burst_len_element=H, dst_stride=(64 - H) / C0) to zero-pad each row on load
  • run the same @vf() row logic against row_stride = 64
  • write back with ub_to_gm_pad(..., burst_len_element=H, src_stride=(64 - H) / C0) so only the logical columns return to GM
  • this is a good fit when the vec math is row-recursive and you want one shared @vf() body for both wide rows and narrow H < 64 rows

Practical limit:

  • for float32, this padding shortcut is cleanest when the row-width gap is expressible in C0=8 units
  • it does not solve the wider-column tail case by itself when H >= 64 but H % 64 != 0
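The narrow-row recipe and its C0 limit can be modelled on the host in plain Python. This is a sketch under stated assumptions, not the kernel API: pad_stride_blocks, load_padded, row_cumsum, and store_logical are hypothetical helpers, and the cumulative sum is only a stand-in for whatever row-recursive logic the shared body runs.

```python
C0 = 8             # float32 elements per block, as stated in the text
STAGED_WIDTH = 64  # UB row width used for narrow rows


def pad_stride_blocks(h, staged_width=STAGED_WIDTH, c0=C0):
    """Return the per-row gap (staged_width - h) counted in C0 blocks."""
    if not 0 < h <= staged_width:
        raise ValueError("h must be in (0, staged_width]")
    gap = staged_width - h
    if gap % c0 != 0:
        raise ValueError("row-width gap is not a whole number of C0 blocks")
    return gap // c0


def load_padded(gm_rows, h, staged_width=STAGED_WIDTH):
    """Model the padded load: copy h logical columns, zero-fill the rest."""
    return [list(row[:h]) + [0.0] * (staged_width - h) for row in gm_rows]


def row_cumsum(ub_rows):
    """Stand-in for the shared row logic; always walks the full staged row."""
    out = []
    for row in ub_rows:
        acc, res = 0.0, []
        for v in row:
            acc += v
            res.append(acc)
        out.append(res)
    return out


def store_logical(ub_rows, h):
    """Model the padded write-back: only the h logical columns return."""
    return [row[:h] for row in ub_rows]


H = 48
gm = [[float(c + 1) for c in range(H)] for _ in range(2)]
stride = pad_stride_blocks(H)
ub = load_padded(gm, H)
result = store_logical(row_cumsum(ub), H)
print(stride, len(ub[0]), result[0][:4])
```

Because the padding is zero and sits to the right of the logical columns, a left-to-right recurrence such as this cumulative sum produces the same first H values as it would on the unpadded row. A width such as H = 50 leaves a gap of 14 elements, which the helper rejects because it is not a multiple of C0.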

Files to study:

  • agent/example/kernels/a5/vec_unaligned_gm_to_ub_pad.py
  • agent/example/kernels/a5/chunk_row_cumsum.py

4. Strided GM gather without host permute

When logical rows are separated by a fixed stride in flattened GM, use gm_to_ub_pad directly:

  • set n_burst to the number of logical rows
  • set burst_len_element to the contiguous row width
  • set src_stride_element to full_row_step - burst_len_element
  • keep dst_stride=0 when the UB row shape already matches the aligned burst footprint

This is the main way to preserve a reshape-only host contract for attention-style layouts such as key: [B,S,H,D] and prob: [BH,S].
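The burst parameters can be checked with a small host-side model. The sketch below is plain Python, not the kernel call: gather_rows and key_index are hypothetical helpers, and the gap between bursts is modelled as the number of skipped elements, matching src_stride_element = full_row_step - burst_len_element.

```python
def gather_rows(gm_flat, base, n_burst, burst_len_element, src_stride_element):
    """Model a strided gather over flattened GM.

    Reads n_burst bursts of burst_len_element elements each, skipping
    src_stride_element elements between the end of one burst and the
    start of the next.
    """
    rows = []
    offset = base
    for _ in range(n_burst):
        rows.append(gm_flat[offset:offset + burst_len_element])
        offset += burst_len_element + src_stride_element
    return rows


# Toy key tensor with layout [B, S, H, D], stored flat in row-major order.
B, S, H, D = 2, 3, 2, 4
key_flat = list(range(B * S * H * D))


def key_index(b, s, h, d):
    """Flat offset of key[b, s, h, d] in the row-major layout."""
    return ((b * S + s) * H + h) * D + d


# Gather all S rows for one (b, h) pair without permuting to [B, H, S, D].
b, h = 1, 1
full_row_step = H * D  # distance between consecutive s rows for a fixed head
rows = gather_rows(
    key_flat,
    base=key_index(b, 0, h, 0),
    n_burst=S,
    burst_len_element=D,
    src_stride_element=full_row_step - D,
)
expected = [[key_index(b, s, h, d) for d in range(D)] for s in range(S)]
print(rows == expected)
```

The host only needs to pass the flat tensor and the base offset for each (b, h) pair; the per-head row selection happens entirely in the transfer parameters.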

5. Internal workspace bridge for single-kernel fusion

If one kernel stage produces data on MTE3 and a later stage must reread it through MTE2, materialize that intermediate in GM workspace instead of trying to keep it purely local.

Stable attention pattern:

  • keep qk_tmp: [BH,S] as float workspace for the three-pass softmax
  • store p.half() into prob_tmp: [BH,S] workspace
  • add an explicit stage boundary before reloading prob_tmp
  • perform the final value scaling from that half workspace so the p.half().float() contract stays exact
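A minimal numeric sketch of why the final stage should read from the half workspace follows. It is plain Python under several assumptions: Python floats are float64 rather than the kernel's float32, to_half and three_pass_softmax are hypothetical helpers, and the weighted sum is only a stand-in for the real value-scaling stage.

```python
import math
import struct


def to_half(x):
    """Round a Python float to the nearest binary16 value (as a float).

    Models p.half().float() using the struct module's 'e' format.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def three_pass_softmax(qk_row):
    """Softmax over one row in three passes over the stored scores."""
    row_max = max(qk_row)                                     # pass 1: max
    denom = sum(math.exp(v - row_max) for v in qk_row)        # pass 2: sum
    return [math.exp(v - row_max) / denom for v in qk_row]    # pass 3: scale


qk_tmp = [0.5, 1.5, -2.0, 3.0]             # one row of the float workspace
prob = three_pass_softmax(qk_tmp)
prob_tmp = [to_half(p) for p in prob]      # what the half workspace holds

value = [2.0, -1.0, 4.0, 0.5]              # stand-in per-position values
out_from_workspace = sum(p * v for p, v in zip(prob_tmp, value))
out_from_float = sum(p * v for p, v in zip(prob, value))
print(out_from_workspace, out_from_float)
```

Reloading prob_tmp and rounding it to half again changes nothing, so a consumer that reads the workspace sees exactly the p.half().float() values. A consumer that reused the pre-rounding probabilities would generally compute a slightly different result, which is the mismatch the workspace bridge is meant to rule out.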

For the final vec-only prob_tmp -> value -> out stage:

  • keep the whole nested reload/compute/writeback chain inside one outer auto_sync()
  • make DBuff slot ownership explicit through the ready/valid handshake rule
  • verify both simulator execution and generated C++ declarations before removing manual barriers

If the delayed reuse fits in one tile of on-chip lifetime, prefer an on-chip lookahead bridge:

  • keep the stage-1 operand needed again by stage-2 resident in L1/TBuff
  • publish the vec-produced fp8 probability tile directly into an L1 slot for the second cube matmul
  • buffer per-tile rescale state in the same delayed slot family as the later consumer

Do not republish a freshly packed fp8 UB tile straight to L1 when exact downstream reuse matters; the packed UB layout can differ from the ND view expected by the later cube path.

Files to study:

  • agent/example/kernels/a5/test_mla_entire.py


Creation statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
