
CANN/PTO-ISA Custom Operator Example

Custom PyTorch Operator (KERNEL_LAUNCH) Example

pto-isa: Parallel Tile Operation (PTO) is a virtual instruction set architecture designed by Ascend CANN, focusing on tile-level operations. This repository offers high-performance, cross-platform tile operations across Ascend platforms. Project address: https://gitcode.com/cann/pto-isa

This example shows how to implement a custom PTO-based kernel and expose it as a PyTorch operator via torch_npu.

Directory Layout

    demos/baseline/add/
    ├── op_extension/     # Python package entry (module loader)
    ├── csrc/
    │   ├── kernel/       # PTO kernel implementation
    │   └── host/         # Host-side PyTorch operator registration
    ├── test/             # Minimal Python test
    ├── CMakeLists.txt    # Build configuration
    ├── setup.py          # Wheel build script
    └── README.md         # This document

1. Implement the kernel

Add a kernel source file under demos/baseline/add/csrc/kernel/ and include it in the build. For example, to build add_custom.cpp, add it to demos/baseline/add/CMakeLists.txt:

    ascendc_library(no_workspace_kernel STATIC
        csrc/kernel/add_custom.cpp
    )

For build options and details, refer to the Ascend community documentation: https://www.hiascend.com/ascend-c

2. Integrate with PyTorch (torch_npu)

The host-side implementation lives under demos/baseline/add/csrc/host/.

2.1 Define the operator schema (ATen IR)

PyTorch uses TORCH_LIBRARY / TORCH_LIBRARY_FRAGMENT to declare operator schemas that can be called from Python via torch.ops.<namespace>.<op_name>.

Example: register a custom my_add operator in the npu namespace:

    TORCH_LIBRARY_FRAGMENT(npu, m)
    {
        m.def("my_add(Tensor x, Tensor y) -> Tensor");
    }

After this, Python can call torch.ops.npu.my_add.

2.2 Implement the operator

  1. Include the generated kernel launch header aclrtlaunch_<kernel_name>.h (generated by the build system).
  2. Allocate output tensors/workspace as needed.
  3. Enqueue the kernel via ACLRT_LAUNCH_KERNEL (wrapped by EXEC_KERNEL_CMD in this example).

    #include "utils.h"
    #include "aclrtlaunch_add_custom.h"

    at::Tensor run_add_custom(const at::Tensor &x, const at::Tensor &y)
    {
        // Output tensor with the same shape and dtype as x.
        at::Tensor z = at::empty_like(x);
        // Number of blocks to launch the kernel on.
        uint32_t blockDim = 20;
        // Total element count: the product of all dimensions of x.
        uint32_t totalLength = 1;
        for (uint32_t size : x.sizes()) {
            totalLength *= size;
        }
        // Enqueue the add_custom kernel launch.
        EXEC_KERNEL_CMD(add_custom, blockDim, x, y, z, totalLength);
        return z;
    }
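The host function above reduces the input shape to a single 32-bit totalLength and launches the kernel on 20 blocks. The sketch below mirrors that arithmetic in plain Python so the numbers are easy to check by hand. The kernel's real tiling is not shown in this document, so `split_across_blocks` is a hypothetical even split with the remainder spread over the first blocks, and both function names are made up for illustration.

```python
from math import prod

UINT32_MAX = 2**32 - 1  # totalLength is a uint32_t in the host code


def total_length(shape):
    """Product of all dimensions, as the C++ loop computes it.

    prod(()) is 1, which matches the loop's initial value for a 0-d tensor.
    Raises instead of wrapping when the count does not fit in 32 bits.
    """
    n = prod(shape)
    if n > UINT32_MAX:
        raise OverflowError(f"{n} elements do not fit in uint32_t")
    return n


def split_across_blocks(total, block_dim):
    """Hypothetical split: (offset, length) per block, remainder goes first."""
    base, rem = divmod(total, block_dim)
    spans = []
    offset = 0
    for i in range(block_dim):
        length = base + (1 if i < rem else 0)
        spans.append((offset, length))
        offset += length
    return spans


if __name__ == "__main__":
    n = total_length((4, 5, 8))
    print(n, split_across_blocks(n, 20)[:3])
```

One thing this makes visible: the C++ loop multiplies into a uint32_t, so a shape such as (65536, 65536) would wrap silently there, whereas the sketch raises. Whether the real kernel requires totalLength to be divisible by blockDim is not stated here and should be checked against the kernel source.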

2.3 Register the implementation

Register the implementation with TORCH_LIBRARY_IMPL. For NPU execution, torch_npu uses the PrivateUse1 dispatch key. For a detailed introduction to PrivateUse1, see the official PyTorch tutorial: https://docs.pytorch.org/tutorials/advanced/privateuseone.html

    TORCH_LIBRARY_IMPL(npu, PrivateUse1, m)
    {
        m.impl("my_add", TORCH_FN(run_add_custom));
    }
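The schema definition in 2.1 and the registration in 2.3 are two separate steps: one declares that an operator exists, the other binds a function to it for a given dispatch key. The toy model below illustrates only that split. It is not how PyTorch's dispatcher is implemented, and `ToyLibrary` and its methods are invented for this sketch.

```python
class ToyLibrary:
    """Toy stand-in for the def/impl split; not PyTorch's real dispatcher."""

    def __init__(self):
        self.schemas = {}  # op name -> schema string
        self.kernels = {}  # (op name, dispatch key) -> callable

    def define(self, schema):
        # Plays the role of m.def(...): record that the operator exists.
        name = schema.split("(", 1)[0]
        self.schemas[name] = schema

    def impl(self, name, key, fn):
        # Plays the role of m.impl(...): bind a function for one dispatch key.
        if name not in self.schemas:
            raise KeyError(f"no schema defined for {name!r}")
        self.kernels[(name, key)] = fn

    def call(self, name, key, *args):
        # Look up the function registered for this operator and key.
        try:
            fn = self.kernels[(name, key)]
        except KeyError:
            raise NotImplementedError(
                f"{name!r} has no kernel for {key!r}"
            ) from None
        return fn(*args)


lib = ToyLibrary()
lib.define("my_add(Tensor x, Tensor y) -> Tensor")
# Lists stand in for tensors; the lambda stands in for run_add_custom.
lib.impl("my_add", "PrivateUse1", lambda x, y: [a + b for a, b in zip(x, y)])
```

In this model, `lib.call("my_add", "PrivateUse1", [1, 2], [3, 4])` reaches the registered function, while the same call with a key that has no registration fails. That is the practical consequence of registering only under PrivateUse1: the operator is meant to be called with NPU tensors.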

3. Build and run

This example requires PTO Tile Lib, PyTorch, torch_npu, and CANN. Follow the official torch_npu installation guide:

https://gitcode.com/ascend/pytorch#%E5%AE%89%E8%A3%85

or

python3 -m pip install -r requirements.txt

3.1 Set the target SoC

Edit demos/baseline/add/CMakeLists.txt and set SOC_VERSION to your target (example: A2A3 uses Ascend910B1):

set(SOC_VERSION "Ascendxxxyy" CACHE STRING "system on chip type")

You can query the chip name on the target machine via npu-smi info and use Ascend<Chip Name> as the value.

3.2 Build the wheel

Set the PTO Tile Lib path and build a wheel:

    export ASCEND_HOME_PATH=/usr/local/Ascend/
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    export PTO_LIB_PATH=[YOUR_PATH]/pto-isa
    rm -rf build op_extension.egg-info
    python3 setup.py bdist_wheel

3.3 Install the wheel

    cd dist
    pip uninstall *.whl
    pip install *.whl

3.4 Run the test

    cd test
    python3 test.py
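The contents of test/test.py are not reproduced in this document. As a rough idea of what such a check involves, the sketch below calls torch.ops.npu.my_add and compares the result with an addition computed on the CPU. Several points are assumptions: that importing op_extension loads the compiled library and registers the operator, that the kernel accepts float16 input, and that the chosen length suits the kernel's tiling. The helper names `missing_modules` and `run_my_add_check` are hypothetical.

```python
import importlib.util

# Packages the check needs; op_extension is the wheel built in step 3.2.
REQUIRED = ("torch", "torch_npu", "op_extension")


def missing_modules(names=REQUIRED):
    """Return the names from `names` that cannot be imported here."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def run_my_add_check(length=8192):
    """Call the custom operator on the NPU and compare with a CPU reference."""
    import torch
    import torch_npu  # noqa: F401  (provides the NPU backend)
    import op_extension  # noqa: F401  (assumed to register npu::my_add)

    # dtype and length are guesses; adjust to what the kernel supports.
    x_cpu = torch.rand(length).to(torch.float16)
    y_cpu = torch.rand(length).to(torch.float16)

    z = torch.ops.npu.my_add(x_cpu.npu(), y_cpu.npu())

    # Raises AssertionError if the NPU result differs from the CPU sum.
    torch.testing.assert_close(z.cpu(), x_cpu + y_cpu)
    return z


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("skipping, missing modules:", missing)
    else:
        run_my_add_check()
        print("my_add matches x + y")
```

The import guard is there so the script fails softly on a machine without the NPU stack. On a correctly set up target it runs the comparison directly.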


Disclosure: parts of this article were generated with AI assistance (AIGC) and are for reference only.
