
Huawei CANN PyPTO Experimental UB Gather Operation

news 2026/7/10 10:38:38

pypto.experimental.gather_in_ub

[Free download link] pypto. PyPTO (pronounced: pai p-t-o): the Parallel Tensor/Tile Operation programming paradigm. Project address: https://gitcode.com/cann/pypto

Product Support

Product	Supported
Ascend 950PR/Ascend 950DT	√
Atlas A3 training series products/Atlas A3 inference series products	√
Atlas A2 training series products/Atlas A2 inference series products	√

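Function Description

This is a custom interface with many constraints. Its stability is not guaranteed.

This operator supports sparse attention mechanisms: it loads the KV cache of the selected tokens from GM (Global Memory) into UB (Unified Buffer). Page Attention is supported.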

Function Prototype

gather_in_ub(param: Tensor, indices: Tensor, block_table: Tensor, block_size: int, axis: int) -> Tensor

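Parameters

Parameter	Input/Output	Description
param	Input	Source operand. Supported data types: DT_FP32, DT_FP16. Empty tensors are not supported; two-dimensional tensors are supported. In practice this is the KV cache, with shape [token_size, hidden_dim].
indices	Input	Source operand. Supported data type: DT_INT32. Empty tensors are not supported; two-dimensional tensors are supported. In practice this is the topk output, with shape [1, k].
block_table	Input	Source operand. Supported data type: DT_INT32. Empty tensors are not supported; two-dimensional tensors are supported. In practice this is the page table used by Page Attention, with shape [1, block_table_size], where block_table_size is the length of the page table.
block_size	Input	Source operand. Type int. The number of tokens that one block can hold in Page Attention.
axis	Input	Source operand. Type int. Only axis -2 is supported.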
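The constraints in the table can be checked on the host before the call. The following is a minimal sketch that uses NumPy arrays as stand-ins for PyPTO tensors; the helper name `check_gather_in_ub_args` is hypothetical and is not part of PyPTO, and the requirement that block_size be positive is an assumption that the table does not state.

```python
import numpy as np


def check_gather_in_ub_args(param, indices, block_table, block_size, axis):
    """Validate NumPy stand-ins against the documented constraints.

    Hypothetical helper for illustration only; it is not a PyPTO API.
    """
    allowed_param_dtypes = (np.dtype(np.float32), np.dtype(np.float16))
    if param.ndim != 2 or param.size == 0:
        raise ValueError("param must be a non-empty 2-D tensor [token_size, hidden_dim]")
    if param.dtype not in allowed_param_dtypes:
        raise ValueError("param must be FP32 or FP16")

    for name, tensor in (("indices", indices), ("block_table", block_table)):
        if tensor.ndim != 2 or tensor.size == 0:
            raise ValueError(f"{name} must be a non-empty 2-D tensor")
        if tensor.shape[0] != 1:
            raise ValueError(f"{name} must have shape [1, n]")
        if tensor.dtype != np.dtype(np.int32):
            raise ValueError(f"{name} must be INT32")

    # Assumption: a block holds at least one token.
    if not isinstance(block_size, int) or block_size <= 0:
        raise ValueError("block_size must be a positive int")
    if axis != -2:
        raise ValueError("only axis -2 is supported")
    return True
```

Calling the helper with `axis=0`, or with an `int64` indices array, raises `ValueError` before any device work is attempted.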

Return Value

Returns the output Tensor. Its data type is the same as that of param, and its shape is [k, hidden_dim], that is, the KV cache of the selected tokens.

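Usage Examples

TileShape Configuration Example

Before calling this operation, set the TileShape with set_vec_tile_shapes.

The TileShape must have the same number of dimensions as the output tensor. It controls the size of each output tile.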
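To make the role of the TileShape concrete, the sketch below enumerates the tiles that a given TileShape would cut out of an output of a given shape. It is a host-side illustration only: the function name `iter_output_tiles` is hypothetical, and clipping partial tiles at the edges is an assumption about tiling in general, not a documented PyPTO behavior.

```python
from itertools import product


def iter_output_tiles(out_shape, tile_shape):
    """Yield one tuple of (start, end) ranges per tile, one range per dimension.

    Hypothetical helper; edge tiles are clipped to the output bounds (assumption).
    """
    if len(out_shape) != len(tile_shape):
        raise ValueError("TileShape must have the same number of dimensions as the output")
    # Tile start offsets along each dimension.
    starts_per_dim = [range(0, dim, tile) for dim, tile in zip(out_shape, tile_shape)]
    for starts in product(*starts_per_dim):
        yield tuple(
            (start, min(start + tile, dim))
            for start, tile, dim in zip(starts, tile_shape, out_shape)
        )


# Output [k, hidden_dim] = [3, 4] with TileShape [2, 2].
tiles = list(iter_output_tiles((3, 4), (2, 2)))
```

With these values the output is covered by four tiles, two of which hold only the last row.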

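Take input $param[token\_size, hidden\_dim]$, indices $indices[1, k]$, axis $\text{axis} = -2$, and output $output[k, hidden\_dim]$ as an example:

Let the TileShape be $[k_1, hidden\_dim_1]$. This setting applies directly to each dimension of the output and also maps onto the input and the indices: $k_1$ splits the k dimension of indices, and $hidden\_dim_1$ splits the feature dimension $hidden\_dim$ of param. The tile memory footprint must satisfy $k_1 \cdot hidden\_dim_1 \cdot \text{sizeof}(\mathbf{output}) < \text{UB\_Size}$, where $\text{sizeof}(\mathbf{output})$ is the size in bytes of one output element.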
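The footprint constraint can be evaluated before choosing a TileShape. The sketch below computes the footprint of a two-dimensional tile and compares it with a UB capacity. The function names are hypothetical, and the value of `UB_SIZE_BYTES` is a placeholder assumption, not a figure from this document; use the actual capacity of the target device.

```python
import numpy as np

# Placeholder assumption: replace with the real UB capacity of the target device.
UB_SIZE_BYTES = 192 * 1024


def tile_bytes(k1, hidden_dim1, dtype):
    """Bytes occupied by one [k1, hidden_dim1] output tile of the given dtype."""
    return k1 * hidden_dim1 * np.dtype(dtype).itemsize


def tile_fits_in_ub(k1, hidden_dim1, dtype, ub_size_bytes=UB_SIZE_BYTES):
    """Return True when the tile footprint is strictly below the UB capacity."""
    return tile_bytes(k1, hidden_dim1, dtype) < ub_size_bytes


small_tile = tile_bytes(64, 128, np.float16)        # 64 * 128 * 2 bytes
small_ok = tile_fits_in_ub(64, 128, np.float16)
large_ok = tile_fits_in_ub(512, 128, np.float32)    # 512 * 128 * 4 bytes
```

Under the placeholder capacity, the FP16 tile of 64 by 128 elements fits, while the FP32 tile of 512 by 128 elements does not; other buffers that share UB are not modeled here.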

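Interface Call Example

In this scenario, indices is the topk result, block_table is the Page Attention page table, param is the KV cache, and block_size is 2. The end result is that the KV cache of the selected tokens is gathered together.

Taking token id 4 as an example (marked in red in the figure), the actual offset is computed from blockSize as follows: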

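blockIdx = 4 / 2; // integer division: the logical block, here logical block 2

tail = 4 % 2; // offset within the block, here 0

slcBlockIdx = blockTable[0, blockIdx]; // page-table lookup: the physical block, here physical block 1

offsets = slcBlockIdx * blockSize + tail; // the actual row offset, here 2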
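The four steps above can be written as a small host-side Python function. This is an illustrative sketch, not PyPTO code: the name `logical_to_physical_row` is hypothetical, and the page table values [0, 2, 1] are the ones used in the result example of this document.

```python
def logical_to_physical_row(token_id, block_table, block_size):
    """Translate a logical token id into a row offset in the paged KV cache."""
    block_idx = token_id // block_size          # logical block that holds the token
    tail = token_id % block_size                # position of the token inside that block
    physical_block = block_table[0][block_idx]  # page-table lookup
    return physical_block * block_size + tail


block_table = [[0, 2, 1]]  # shape [1, block_table_size]
offset_for_token_4 = logical_to_physical_row(4, block_table, block_size=2)
```

For token id 4 this gives logical block 2, tail 0, physical block 1, and a row offset of 2, matching the walkthrough.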

Then move the data:

import pypto

# Declare the KV cache, the topk indices, and the page table.
param = pypto.tensor([6, 4], pypto.DT_FP32)
indices = pypto.tensor([1, 3], pypto.DT_INT32)
blockTable = pypto.tensor([1, 3], pypto.DT_INT32)
blockSize = 2
axis = -2
result = pypto.experimental.gather_in_ub(param, indices, blockTable, blockSize, axis)

An example result is shown below:

Input param:
[
  # token 0
  [ 0,  1,  2,  3],
  # token 1
  [10, 11, 12, 13],
  # token 2
  [20, 21, 22, 23],
  # token 3
  [30, 31, 32, 33],
  # token 4
  [40, 41, 42, 43],
  # token 5
  [50, 51, 52, 53],
]
Input indices: [0, 4, 3]
Input blockTable: [0, 2, 1]
Output out:
[
  [ 0,  1,  2,  3],
  [20, 21, 22, 23],
  [50, 51, 52, 53],
]
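The example result can be reproduced on the host with a NumPy reference. This is a sketch of the gather semantics as described in this document, not the PyPTO implementation; the function name `gather_in_ub_reference` is hypothetical, and tiling and UB placement are not modeled.

```python
import numpy as np


def gather_in_ub_reference(param, indices, block_table, block_size):
    """NumPy reference: gather rows of a paged KV cache for the selected tokens."""
    token_ids = indices[0]                       # shape [k]
    block_idx = token_ids // block_size          # logical block of each token
    tail = token_ids % block_size                # offset inside the block
    rows = block_table[0, block_idx] * block_size + tail
    return param[rows]                           # shape [k, hidden_dim]


# Token t holds the values 10*t + 0 .. 10*t + 3, as in the example data.
param = np.array([[10 * t + c for c in range(4)] for t in range(6)], dtype=np.float32)
indices = np.array([[0, 4, 3]], dtype=np.int32)
block_table = np.array([[0, 2, 1]], dtype=np.int32)

out = gather_in_ub_reference(param, indices, block_table, block_size=2)
```

Here the selected token ids 0, 4, and 3 resolve to rows 0, 2, and 5 of param, so `out` holds the three rows listed in the example output, with the same data type as param.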


Authoring statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
