
CANN/pto-isa GEMM Example

Basic GEMM Operator Example

pto-isa: Parallel Tile Operation (PTO) is a virtual instruction set architecture designed by Ascend CANN, focusing on tile-level operations. This repository offers high-performance, cross-platform tile operations across Ascend platforms. Project address: https://gitcode.com/cann/pto-isa

Overview

This example demonstrates how to implement a basic GEMM operator using PTO and expose it as a PyTorch operator via torch_npu.

Supported AI Processors

  • A2/A3

Directory Layout

    demos/baseline/gemm_basic/
    ├── op_extension/      # Python package entry (module loader)
    ├── csrc/
    │   ├── kernel/        # PTO kernel implementation
    │   └── host/          # Host-side PyTorch operator registration
    ├── test/              # Minimal Python test
    ├── CMakeLists.txt     # Build configuration
    ├── setup.py           # Wheel build script
    └── README.md          # This document

Operator Description

Function

This example implements GEMM with fixed dimensions [m, k, n] = [512, 2048, 1536]:

$$ C = A \times B $$

Where:

  • A shape: [512, 2048] (m × k)
  • B shape: [2048, 1536] (k × n)
  • C shape: [512, 1536] (m × n)
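As a point of reference, the computation above can be sketched in plain Python. This is not the PTO kernel: the helper names `to_float16` and `gemm_reference` are hypothetical, the accumulation uses Python's double-precision floats as a stand-in for the float output, and the triple loop is only practical on small matrices.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def gemm_reference(a, b):
    """Naive C = A x B on nested lists; a is m x k, b is k x n (hypothetical helper)."""
    m, k, n = len(a), len(a[0]), len(b[0])
    if len(b) != k:
        raise ValueError("inner dimensions must match")
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0  # wider accumulator than the float16 inputs
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


if __name__ == "__main__":
    # Tiny stand-in shapes; the real operator uses [m, k, n] = [512, 2048, 1536].
    a = [[to_float16(0.5 * (i + p)) for p in range(3)] for i in range(2)]
    b = [[to_float16(0.25 * (p - j)) for j in range(4)] for p in range(3)]
    c = gemm_reference(a, b)
    print(len(c), "x", len(c[0]))
```

A reference of this kind is typically what a device result is compared against, within a tolerance suited to float16 inputs.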

Specification

| Item | Value |
| --- | --- |
| OpType | gemm |
| Inputs | a: m×k, float16, ND; b: k×n, float16, DN |
| Output | c: m×n, float, ND |
| Kernel name | gemm_basic_custom |

Tiling Parameters

The validation platform has 24 cores. The workload is split across cores (prioritizing splitting m and n) using a 4 × 6 grouping: m is split into 4 parts and n into 6 parts, so that all 24 cores are used.

Per-core shape:

  • singleCoreM = 128, singleCoreK = 2048, singleCoreN = 256
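The per-core shape follows directly from the 4 × 6 split. The sketch below reproduces that arithmetic and maps a core index to the C sub-block it would own. The row-major core numbering and the function name `core_tile` are assumptions for illustration; the kernel may number cores differently.

```python
M, K, N = 512, 2048, 1536
M_SPLITS, N_SPLITS = 4, 6            # 4 x 6 = 24 cores

SINGLE_CORE_M = M // M_SPLITS        # 128
SINGLE_CORE_N = N // N_SPLITS        # 256
SINGLE_CORE_K = K                    # k is not split across cores


def core_tile(core_id: int):
    """Return (row_start, col_start) of the C sub-block owned by core_id.

    Assumption: cores are numbered row-major over the 4 x 6 grid.
    """
    if not 0 <= core_id < M_SPLITS * N_SPLITS:
        raise ValueError("core_id out of range")
    m_idx, n_idx = divmod(core_id, N_SPLITS)
    return m_idx * SINGLE_CORE_M, n_idx * SINGLE_CORE_N


if __name__ == "__main__":
    for cid in (0, 5, 6, 23):
        print(cid, core_tile(cid))
```

Under this numbering, core 0 starts at (0, 0) and core 23 at (384, 1280), so the 24 sub-blocks cover C exactly once.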

Because the per-core tile still exceeds L0 capacity, k is further tiled into base blocks of size 64. The base block is:

  • baseM = 128, baseK = 64, baseN = 256

| Parameter | Value |
| --- | --- |
| m | 512 |
| k | 2048 |
| n | 1536 |
| singleCoreM | 128 |
| singleCoreK | 2048 |
| singleCoreN | 256 |
| baseM | 128 |
| baseK | 64 |
| baseN | 256 |
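To see why a base block of this size is plausible, the byte footprint of each tile can be estimated. The L0 capacities below (64 KiB for each input buffer, 128 KiB for the accumulator) are assumptions made for illustration and are not stated in this example, as is keeping a single copy of the accumulator; check the documentation for your target chip.

```python
BASE_M, BASE_K, BASE_N = 128, 64, 256
SINGLE_CORE_K = 2048
FP16_BYTES, FP32_BYTES = 2, 4
KIB = 1024

# Assumed capacities, for illustration only (not taken from the example).
L0A_BYTES = 64 * KIB
L0B_BYTES = 64 * KIB
L0C_BYTES = 128 * KIB


def base_block_footprint(buffers: int = 2):
    """Bytes needed for the A, B and C base tiles.

    A and B are counted `buffers` times (double buffering by default);
    C is assumed to be a single float accumulator tile.
    """
    a = BASE_M * BASE_K * FP16_BYTES * buffers
    b = BASE_K * BASE_N * FP16_BYTES * buffers
    c = BASE_M * BASE_N * FP32_BYTES
    return a, b, c


k_steps = SINGLE_CORE_K // BASE_K    # base blocks along k per core

if __name__ == "__main__":
    a, b, c = base_block_footprint()
    print("A:", a // KIB, "KiB, B:", b // KIB, "KiB, C:", c // KIB, "KiB")
    print("fits:", a <= L0A_BYTES and b <= L0B_BYTES and c <= L0C_BYTES)
    print("k steps per core:", k_steps)
```

With these assumed capacities the double-buffered A and B tiles take 32 KiB and 64 KiB, and the C tile takes 128 KiB, so each core walks 32 base blocks along k.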

Implementation Notes

Type definitions

The implementation defines matrix representations for GM, L1, and L0, then assigns backing storage for tiles. Example (simplified):

    using NDValidShapeA = TileShape2D<U, baseM, baseK>;
    using NDsingleCoreShapeA = BaseShape2D<U, M, K>;
    using GlobalDataSrcA = GlobalTensor<U, NDValidShapeA, NDsingleCoreShapeA>; // A in GM (ND)

    using NDValidShapeB = TileShape2D<U, baseK, baseN, Layout::DN>;
    using NDsingleCoreShapeB = BaseShape2D<U, K, N, Layout::DN>;
    using GlobalDataSrcB = GlobalTensor<U, NDValidShapeB, NDsingleCoreShapeB, Layout::DN>; // B in GM (DN)

    using NDValidShapeC = TileShape2D<T, baseM, baseN>;
    using NDWholeShapeC = BaseShape2D<T, M, N>;
    using GlobalDataOut = GlobalTensor<T, NDValidShapeC, NDWholeShapeC>; // C in GM

Pipeline scheduling

This example overlaps data movement and compute using double buffering in L1 and L0 to improve utilization. Synchronization points ensure correct dependencies, including:

  • Forward sync: MTE2 -> MTE1, MTE1 -> MMAD, MMAD -> FIXPIPE
  • Reverse sync: MTE1 -> MTE2, MMAD -> MTE1
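The effect of double buffering can be illustrated with a deliberately simplified timing model. The sketch below collapses the pipeline into two stages (one load, one compute) with fixed, made-up unit costs. It is not the kernel's scheduler, and `pipelined_finish_time` is a hypothetical name. A forward sync makes compute wait for its data; a reverse sync makes a load wait until the buffer it targets has been consumed.

```python
def pipelined_finish_time(steps: int, load_cost: int, compute_cost: int,
                          buffers: int = 2) -> int:
    """Finish time of a two-stage load/compute pipeline with `buffers` slots."""
    load_end = []
    compute_end = []
    for i in range(steps):
        # The single load engine handles one transfer at a time.
        prev_load = load_end[i - 1] if i >= 1 else 0
        # Reverse sync: slot i % buffers is free once the compute that
        # last used it (step i - buffers) has finished.
        slot_free = compute_end[i - buffers] if i >= buffers else 0
        load_end.append(max(prev_load, slot_free) + load_cost)

        # Forward sync: compute needs its data and the compute unit.
        prev_compute = compute_end[i - 1] if i >= 1 else 0
        compute_end.append(max(load_end[i], prev_compute) + compute_cost)
    return compute_end[-1] if steps else 0


if __name__ == "__main__":
    steps = 2048 // 64  # base blocks along k per core
    print("single buffer:", pipelined_finish_time(steps, 1, 1, buffers=1))
    print("double buffer:", pipelined_finish_time(steps, 1, 1, buffers=2))
```

In this toy model with equal load and compute costs, a single buffer serializes the two stages, while two buffers let each load overlap the previous compute, roughly halving the total time.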

[Figure: pipeline overview]

Build and Run

1. Prepare the Python environment

Create a virtual environment and install the required Python packages.

    python3 -m venv virEnv
    source virEnv/bin/activate
    python3 -m pip install -r requirements.txt

2. Configure environment and build the wheel

    # Adjust ASCEND_HOME_PATH to your CANN installation path
    export ASCEND_HOME_PATH=/usr/local/Ascend/
    source ${ASCEND_HOME_PATH}/bin/setenv.bash
    export PTO_LIB_PATH=[YOUR_PATH]/pto-isa
    rm -rf build op_extension.egg-info
    python3 setup.py bdist_wheel

3. Install the wheel

    pip install dist/*.whl

4. Run the example

    cd test
    python3 test.py


Disclosure: parts of this article were generated with AI assistance (AIGC) and are for reference only.
