
CANN/catlass Optimized Matrix Multiplication Example

OptimizedMatmul Example Readme

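catlass is the operator template library of CANN. It provides template samples for high-performance matrix multiplication and related fused operators on NPUs. Project address: https://gitcode.com/cann/catlass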

Code Organization

    ├── 06_optimized_matmul
    │   ├── CMakeLists.txt # CMake build file
    │   ├── README.md
    │   └── optimized_matmul.cpp # Main file

Function

This example demonstrates optimized matrix multiplication. Compared to the 00_basic_matmul example, this implementation replaces the dispatch policy with MmadAtlasA2Preload and introduces padding preprocessing for the input matrices to improve data transfer performance.

Example

  • After obtaining the code, compile the operator executable file. For details, see Template Library Quick Start.
  • Execute the operator.
    # Compile a specified test case.
    bash scripts/build.sh 06_optimized_matmul
    cd output/bin
    # Executable file name | Matrix M-axis | N-axis | K-axis | Device ID
    # The device ID is optional. The default value is 0.
    ./06_optimized_matmul 256 512 1024 0

If the following result is displayed, precision verification is successful.

Compare success.
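The argument order used in the command above (M, N, K, then an optional device ID that defaults to 0) can be sketched as a small parser. This is only an illustration under stated assumptions: the `Options` struct and `ParseOptions` function are hypothetical names, not taken from optimized_matmul.cpp, and the real example may validate its input differently.

```cpp
#include <cstdint>
#include <exception>
#include <optional>
#include <string>

// Hypothetical container for the positional arguments of the example.
struct Options {
    uint32_t m = 0;
    uint32_t n = 0;
    uint32_t k = 0;
    int32_t deviceId = 0;  // optional on the command line, defaults to 0
};

// Parses "<exe> M N K [deviceId]".
// Returns std::nullopt when the argument count is wrong or a value is not numeric.
std::optional<Options> ParseOptions(int argc, const char* const* argv)
{
    if (argc < 4 || argc > 5) {
        return std::nullopt;
    }
    Options options;
    try {
        options.m = static_cast<uint32_t>(std::stoul(argv[1]));
        options.n = static_cast<uint32_t>(std::stoul(argv[2]));
        options.k = static_cast<uint32_t>(std::stoul(argv[3]));
        if (argc == 5) {
            options.deviceId = static_cast<int32_t>(std::stoi(argv[4]));
        }
    } catch (const std::exception&) {
        // std::stoul / std::stoi throw on non-numeric or out-of-range input.
        return std::nullopt;
    }
    return options;
}
```

With this sketch, `./06_optimized_matmul 256 512 1024` and `./06_optimized_matmul 256 512 1024 0` would yield the same options.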

Remarks

In this example, the default padding action uses PADDING_NZ. You can switch this to PADDING_BLOCK_ND to evaluate alternative performance profiles.
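The padding policy is chosen at compile time from the layout type: inputs already in an NZ-format layout are left unpadded, and every other layout falls back to the configured policy. A minimal, self-contained sketch of that selection pattern follows. The layout structs and the `PaddingTag` enum here are stand-ins that only mirror the names used in this README, and `SelectPaddingTag` is a hypothetical helper, not part of catlass.

```cpp
#include <type_traits>

// Stand-in layout tags (assumption): empty placeholders mirroring the README's names.
namespace layout {
struct RowMajor {};
struct ColumnMajor {};
struct zN {};
struct nZ {};
}  // namespace layout

// Stand-in enum (assumption): lists only the tags named in this README.
enum class PaddingTag { NO_PADDING, PADDING_BLOCK_ND, PADDING_NZ };

// Hypothetical helper: NZ-format layouts need no padding,
// any other layout uses the Fallback policy.
template <class Layout, PaddingTag Fallback = PaddingTag::PADDING_NZ>
constexpr PaddingTag SelectPaddingTag()
{
    return (std::is_same_v<Layout, layout::zN> || std::is_same_v<Layout, layout::nZ>)
        ? PaddingTag::NO_PADDING
        : Fallback;
}

// Default policy: non-NZ inputs are padded with PADDING_NZ.
static_assert(SelectPaddingTag<layout::RowMajor>() == PaddingTag::PADDING_NZ);
// NZ-format inputs are never padded.
static_assert(SelectPaddingTag<layout::zN>() == PaddingTag::NO_PADDING);
// Switching the fallback changes the policy for non-NZ inputs only.
static_assert(SelectPaddingTag<layout::ColumnMajor, PaddingTag::PADDING_BLOCK_ND>()
              == PaddingTag::PADDING_BLOCK_ND);
```

Keeping the fallback as a single template parameter means switching policies touches one value, which matches the one-line change per matrix described in the list below.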

  • PADDING_NZ: The code configuration is as follows:

    constexpr PaddingTag paddingTagA = (std::is_same_v<LayoutA, layout::zN> || std::is_same_v<LayoutA, layout::nZ>)
        ? PaddingTag::NO_PADDING
        : PaddingTag::PADDING_NZ;
    constexpr PaddingTag paddingTagB = (std::is_same_v<LayoutB, layout::zN> || std::is_same_v<LayoutB, layout::nZ>)
        ? PaddingTag::NO_PADDING
        : PaddingTag::PADDING_NZ;

Under the PADDING_NZ policy, COMPUTE_LENGTH is the number of elements that fit in 48 KB of UB:

    static const uint32_t COMPUTE_LENGTH_A = 48 * 1024 / sizeof(ElementA);
    static const uint32_t COMPUTE_LENGTH_B = 48 * 1024 / sizeof(ElementB);
  • PADDING_BLOCK_ND: The modifications required to enable PADDING_BLOCK_ND are shown below. When the input matrix is not in NZ format, this policy aligns and pads the matrix according to L1TileShape:

      constexpr PaddingTag paddingTagA = (std::is_same_v<LayoutA, layout::zN> || std::is_same_v<LayoutA, layout::nZ>)
          ? PaddingTag::NO_PADDING
    -     : PaddingTag::PADDING_NZ;
    +     : PaddingTag::PADDING_BLOCK_ND;
      constexpr PaddingTag paddingTagB = (std::is_same_v<LayoutB, layout::zN> || std::is_same_v<LayoutB, layout::nZ>)
          ? PaddingTag::NO_PADDING
    -     : PaddingTag::PADDING_NZ;
    +     : PaddingTag::PADDING_BLOCK_ND;

Under the PADDING_BLOCK_ND policy, COMPUTE_LENGTH doubles to the number of elements that fit in 96 KB of UB:

    -static const uint32_t COMPUTE_LENGTH_A = 48 * 1024 / sizeof(ElementA);
    -static const uint32_t COMPUTE_LENGTH_B = 48 * 1024 / sizeof(ElementB);
    +static const uint32_t COMPUTE_LENGTH_A = 96 * 1024 / sizeof(ElementA);
    +static const uint32_t COMPUTE_LENGTH_B = 96 * 1024 / sizeof(ElementB);
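To make the numbers above concrete, the following sketch works through the arithmetic behind both settings: how many elements a 48 KB or 96 KB UB budget holds, and what aligning a matrix extent to a tile size means. All names are hypothetical. The 2-byte `Half` type stands in for a half-precision element, and the tile extents 128 and 256 are illustrative values, not the L1TileShape actually used by the example. Whether the policy pads rows, columns, or both is not stated here, so the sketch shows only the round-up step.

```cpp
#include <cstddef>
#include <cstdint>

// Rounds value up to the next multiple of align (align must be non-zero).
constexpr uint32_t RoundUp(uint32_t value, uint32_t align)
{
    return (value + align - 1) / align * align;
}

// Number of elements of type Element that fit in `kib` KiB of buffer space.
template <class Element>
constexpr uint32_t ComputeLength(uint32_t kib)
{
    return static_cast<uint32_t>(kib * 1024 / sizeof(Element));
}

using Half = uint16_t;  // assumption: 2-byte stand-in for a half-precision element

// 48 KB budget (PADDING_NZ) versus 96 KB budget (PADDING_BLOCK_ND).
static_assert(ComputeLength<Half>(48) == 24576);
static_assert(ComputeLength<Half>(96) == 49152);
static_assert(ComputeLength<float>(48) == 12288);

// Hypothetical tile extents, chosen only for illustration.
constexpr uint32_t kTileM = 128;
constexpr uint32_t kTileK = 256;

// An extent of 1000 is padded up to 1024 for either tile size;
// an extent that is already a multiple of the tile size is unchanged.
static_assert(RoundUp(1000, kTileM) == 1024);
static_assert(RoundUp(1000, kTileK) == 1024);
static_assert(RoundUp(256, kTileM) == 256);
```

For 2-byte elements, the switch from 48 KB to 96 KB therefore raises the per-matrix element count from 24576 to 49152.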


