
CANN Vector Stride and Slicing Constraints

Vec Stride and Slicing Constraints

cannbot-skills: CANNBot is a family of agents for CANN development that aims to improve development efficiency; this repository provides reusable Skills modules for it. Project URL: https://gitcode.com/cann/cannbot-skills

Read this file when a vec operation needs to access part of a wider buffer, or when a "narrow" source (e.g. a row-max buffer) must align with a "wide" destination row by row.

Goal

Decide correctly when a vec operation can run continuously over a full buffer versus when it requires sliced views or explicit stride configuration.

1. The alignment problem

Vec operations infer `repeat` from the destination tensor and infer strides from each tensor's `span`/`shape`. When a wide buffer (e.g. `[M, 128]`) is paired with a narrow buffer (e.g. `[M, 8]`), the repeat counts may not align row by row.

For float (C0=8):

  • `[M, 128]`: `span[1] = 128` matches neither `8*C0 = 64` nor `C0 = 8` → default strides (`blk=1, rep=8`)
  • Each row takes 2 repeats (128 / 64 = 2)
  • `[M, 8]`: `span[1] = 8 == C0` → `blk=0, rep=1`
  • Each row takes 1 repeat from the narrow buffer

If `sub(wide[M,128], wide[M,128], narrow[M,8])` is called directly:

  • `repeat = M * 128 / 64 = 2M` (from dst)
  • narrow advances 1 block (one 8-element row) per repeat → after repeat 0 (row 0, first half), narrow moves to row 1
  • row 0's second half gets row 1's value → misaligned!
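The misalignment can be reproduced with a small pure-Python model of the addressing. This is a sketch under stated assumptions, not easyasc code: it assumes a 32-byte block, 8 blocks per repeat, and strides counted in blocks, and it only traces which row each repeat starts in. `block_starts` and `trace_unsliced_sub` are hypothetical helpers.

```python
# Hypothetical model of vec addressing (assumptions, not easyasc code):
# a block is 32 bytes, a repeat covers 8 blocks, and both strides count blocks.
BLOCK_BYTES = 32
BLOCKS_PER_REPEAT = 8


def block_starts(repeat, blk_stride, rep_stride, c0, base=0):
    """Element offset at which each of the 8 blocks of `repeat` starts."""
    return [base + (repeat * rep_stride + b * blk_stride) * c0
            for b in range(BLOCKS_PER_REPEAT)]


def trace_unsliced_sub(m, wide_cols=128, narrow_cols=8, dtype_size=4):
    """(dst row, narrow row) for every repeat of sub(wide, wide, narrow)."""
    c0 = BLOCK_BYTES // dtype_size          # 8 elements per block for float
    per_repeat = BLOCKS_PER_REPEAT * c0     # 64 elements per repeat
    repeats = m * wide_cols // per_repeat   # inferred from dst: 2M
    pairs = []
    for r in range(repeats):
        # wide dst: span[1]=128 matches nothing -> default blk=1, rep=8
        dst_row = block_starts(r, 1, 8, c0)[0] // wide_cols
        # narrow src: span[1]=8 == C0 -> blk=0, rep=1 (one row per repeat)
        narrow_row = block_starts(r, 0, 1, c0)[0] // narrow_cols
        pairs.append((dst_row, narrow_row))
    return pairs


for r, (dst_row, narrow_row) in enumerate(trace_unsliced_sub(2)):
    print(f"repeat {r}: dst row {dst_row}, narrow row {narrow_row}")
```

With `M = 2`, repeat 1 (row 0, second half) already reads narrow row 1. In this model, repeats 2 and 3 also index narrow rows 2 and 3, which lie past the end of a 2-row buffer.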

2. Fix: slice the wide buffer to 64-column views

Slicing to `[M, 64]` creates a view where `span[1] = 64 == 8*C0`:

  • `blk=1`, `rep=shape[1]//C0` (e.g. `128//8 = 16` for a 128-wide parent)
  • Each row takes 1 repeat → aligns with the narrow buffer's `rep=1`

```python
# Correct: sliced views ensure 1 repeat per row
sub(ub[0:M, 0:64], ub[0:M, 0:64], max_buf)      # first half
sub(ub[0:M, 64:128], ub[0:M, 64:128], max_buf)  # second half
```

The slice syntax creates a Tensor view with updated `span` and `offset` while keeping the original `shape`. Stride auto-inference uses `span` to select the strides and `shape` to calculate `rep_stride`, so each repeat correctly skips the full row width.
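The sliced calls can be traced with the same hypothetical addressing model: 32-byte blocks, 8 blocks per repeat, and strides counted in blocks. Because the view keeps the parent width, `rep_stride = 128 // 8 = 16` blocks, which is exactly one parent row per repeat. `trace_sliced_sub` is an illustrative helper, not part of easyasc.

```python
# Same assumed addressing model as before (not easyasc code).
BLOCK_BYTES = 32
BLOCKS_PER_REPEAT = 8


def block_starts(repeat, blk_stride, rep_stride, c0, base=0):
    """Element offset at which each of the 8 blocks of `repeat` starts."""
    return [base + (repeat * rep_stride + b * blk_stride) * c0
            for b in range(BLOCKS_PER_REPEAT)]


def trace_sliced_sub(m, col_start, parent_cols=128, narrow_cols=8, dtype_size=4):
    """(dst row, first col, end col, narrow row) per repeat of the 64-column view."""
    c0 = BLOCK_BYTES // dtype_size          # 8 for float
    rep_stride = parent_cols // c0          # 16 blocks = one full parent row
    repeats = m                             # span[0] * span[1] // 64 = m * 64 // 64
    rows = []
    for r in range(repeats):
        starts = block_starts(r, 1, rep_stride, c0, base=col_start)  # blk=1
        dst_row = starts[0] // parent_cols
        first_col = starts[0] % parent_cols
        end_col = starts[-1] % parent_cols + c0   # exclusive end of last block
        narrow_row = block_starts(r, 0, 1, c0)[0] // narrow_cols     # blk=0, rep=1
        rows.append((dst_row, first_col, end_col, narrow_row))
    return rows


for col_start in (0, 64):
    print(trace_sliced_sub(2, col_start))
```

Each repeat now covers one 64-column half of a single row, and the narrow row index advances in step with the destination row.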

3. When slicing is NOT needed

Purely element-wise operations (no narrow source) can run continuously over the full buffer:

| Operation | Needs slicing? | Reason |
|---|---|---|
| `muls(wide, wide, scalar)` | No | Scalar broadcasts uniformly |
| `exp(wide, wide)` | No | Same-shape in-place, no alignment issue |
| `cast(half_out, float_in)` | No | Same-shape element-wise conversion |
| `sub(wide, wide, narrow)` | Yes | Narrow source advances 1 row/repeat |
| `vmax(dst64, wide_half1, wide_half2)` | Yes | Needs column views of a wider buffer |
| `brcb(wide, narrow)` | Explicit strides | See brcb section |

Rule: if all source and destination tensors have the same `span` and are operated on element-wise, no slicing is needed. If any operand has a different (narrower) width, slice the wider operands to match the narrow operand's per-row repeat cadence.
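The rule reduces to a width comparison plus a split into one-repeat column ranges. This is a minimal sketch that assumes the block sizes used in this document (C0 = 8 for float, 16 for half; one repeat = 8 × C0 elements) and a wide width that is a whole number of repeats. `plan_column_slices` is a hypothetical helper, not an easyasc function.

```python
def plan_column_slices(wide_cols, narrow_cols, dtype_size):
    """Column ranges for the wide operands under the slicing rule.

    Returns [(0, wide_cols)] when both widths match (run continuously);
    otherwise one 8*C0-wide range per repeat, so each row takes 1 repeat.
    """
    if narrow_cols == wide_cols:
        return [(0, wide_cols)]
    c0 = 32 // dtype_size          # elements per 32-byte block (8 float, 16 half)
    per_repeat = 8 * c0            # 64 for float, 128 for half
    if wide_cols % per_repeat:
        # Assumption of this sketch: only whole-repeat column slices are planned.
        raise ValueError(f"wide width {wide_cols} is not a multiple of {per_repeat}")
    return [(c, c + per_repeat) for c in range(0, wide_cols, per_repeat)]


# Emit the sliced DSL calls for a float [M, 128] buffer and an [M, 8] row-max buffer.
for lo, hi in plan_column_slices(128, 8, dtype_size=4):
    print(f"sub(ub[0:M, {lo}:{hi}], ub[0:M, {lo}:{hi}], max_buf)")
```

For float this prints the same two `sub` calls as the corrected example in section 2; for half, a 256-wide buffer would be split into two 128-column views.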

4. Stride auto-inference rules

From `vecutils.infer_strides(tensor)` for float (C0=8):

| `span[1]` | Matches? | `blk_stride` | `rep_stride` |
|---|---|---|---|
| 64 (= 8×C0) | Yes | 1 | `shape[1] // C0` |
| 8 (= C0) | Yes | 0 | `shape[1] // C0` |
| other | No | 1 (default) | 8 (default) |

For half (C0=16):

| `span[1]` | Matches? | `blk_stride` | `rep_stride` |
|---|---|---|---|
| 128 (= 8×C0) | Yes | 1 | `shape[1] // C0` |
| 16 (= C0) | Yes | 0 | `shape[1] // C0` |
| other | No | 1 (default) | 8 (default) |

When `span[0] == 1` and a match occurred, `rep_stride` is overridden to `0`.

`infer_repeat(tensor)` always uses `span[0] * span[1] / (256 // dtype.size)`.
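The two tables and the override rule can be restated as code. This is a hypothetical re-implementation written from the tables above, not the contents of `vecutils.py`. The `View` dataclass and its fields (`shape`, `span`, `dtype_size`) are assumptions of this sketch, and it uses integer `//` where the formula above writes `/`.

```python
from dataclasses import dataclass


@dataclass
class View:
    shape: tuple      # parent allocation, (rows, cols)
    span: tuple       # region operated on, (rows, cols)
    dtype_size: int   # bytes per element: 4 = float, 2 = half


def infer_strides(t):
    """(blk_stride, rep_stride) following the float/half tables above."""
    c0 = 32 // t.dtype_size               # 8 for float, 16 for half
    if t.span[1] == 8 * c0:
        blk, rep, matched = 1, t.shape[1] // c0, True
    elif t.span[1] == c0:
        blk, rep, matched = 0, t.shape[1] // c0, True
    else:
        blk, rep, matched = 1, 8, False   # defaults
    if matched and t.span[0] == 1:
        rep = 0                           # single-row override
    return blk, rep


def infer_repeat(t):
    """Number of 256-byte repeats covering the span."""
    return t.span[0] * t.span[1] // (256 // t.dtype_size)


wide = View(shape=(64, 128), span=(64, 128), dtype_size=4)
narrow = View(shape=(64, 8), span=(64, 8), dtype_size=4)
col_view = View(shape=(64, 128), span=(64, 64), dtype_size=4)
print(infer_strides(wide), infer_repeat(wide))          # (1, 8) 128
print(infer_strides(narrow))                            # (0, 1)
print(infer_strides(col_view), infer_repeat(col_view))  # (1, 16) 64
```

The full-width buffer falls through to the defaults and needs 128 repeats for 64 rows, while the 64-column view gets `rep_stride = 16` and exactly 64 repeats, one per row.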

5. Column slicing via Tensor views

DSL tensor slicing (`tensor[row_start:row_end, col_start:col_end]`) creates a view with:

  • `offset` adjusted to the slice start
  • `span` set to the slice extent
  • `shape` inherited from the parent (full allocation width)

This means `rep_stride = shape[1] // C0` correctly accounts for the full row width, while `repeat = span[0] * span[1] // (256 // dtype_size)` only covers the sliced region.

Example for `ub_data[0:64, 64:128]` where `ub_data` is `Tensor(float, [64, 128])`:

  • `span = [64, 64]`, `shape = [64, 128]`, `offset = [0, 64]`
  • `blk=1`, `rep=128//8=16` (skips one full 128-wide row)
  • `repeat = 64*64/64 = 64` (one repeat per row)
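The view bookkeeping in this example can be mimicked in a few lines of Python. `TensorView` is a hypothetical stand-in, not the class in `easyasc/utils/Tensor.py`. It assumes unit-step slices and row-major storage, and `repeat_starts` only models the 64-column match case (`rep_stride = shape[1] // C0`). It lists the flat element offset at which each repeat begins, which makes the whole-row skip visible.

```python
class TensorView:
    """Hypothetical 2-D view: parent shape plus the offset/span of a slice."""

    def __init__(self, shape, dtype_size, offset=(0, 0), span=None):
        self.shape = tuple(shape)              # full allocation, kept on slicing
        self.dtype_size = dtype_size           # 4 = float, 2 = half
        self.offset = tuple(offset)            # slice start (row, col)
        self.span = tuple(span) if span is not None else self.shape  # slice extent

    def __getitem__(self, key):
        rows, cols = key
        row0, row1, _ = rows.indices(self.span[0])  # unit step assumed
        col0, col1, _ = cols.indices(self.span[1])
        return TensorView(self.shape, self.dtype_size,
                          offset=(self.offset[0] + row0, self.offset[1] + col0),
                          span=(row1 - row0, col1 - col0))

    def repeat_starts(self):
        """Flat element offset where each repeat begins (64-column match case)."""
        c0 = 32 // self.dtype_size
        rep_stride = self.shape[1] // c0        # in blocks: one full parent row
        repeats = self.span[0] * self.span[1] // (256 // self.dtype_size)
        base = self.offset[0] * self.shape[1] + self.offset[1]
        return [base + r * rep_stride * c0 for r in range(repeats)]


ub_data = TensorView((64, 128), dtype_size=4)
view = ub_data[0:64, 64:128]
print(view.span, view.shape, view.offset)   # (64, 64) (64, 128) (0, 64)
print(view.repeat_starts()[:3])             # [64, 192, 320]
```

Consecutive repeats start 128 elements apart, i.e. at column 64 of each successive row, which is the "skips one full 128-wide row" behaviour described above.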

Files to study

  • `easyasc/stub_functions/vec/vecutils.py`: stride inference logic
  • `easyasc/utils/Tensor.py`: slice/view creation
  • `agent/example/kernels/a2/flash_attn_score.py`: practical use of sliced `sub` + continuous `exp`/`cast`


Authorship note: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.
