
CANN/catlass Template Library Optimization Guide

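[Free download link] catlass: This project is CANN's operator template library. It provides template samples for high-performance matrix multiplication and related fused operators on NPUs. Project address: https://gitcode.com/cann/catlass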

CATLASS Sample Positioning

The CATLASS operator template library is a library of template samples for GEMM operators, which sets it apart from a typical operator library. A typical operator library applies generalized optimizations across the different input cases of a given problem type, so that it performs well out of the box in most scenarios. The template library instead aims to provide template samples for different inputs, so that high-performance operators can be custom-developed quickly; in principle, it does not try to match an operator library's generalized performance. Take matmul as an example: the matmul operator (or the API it is called through) in CANN targets good performance across general scenarios through a direct call. The template library provides several matmul implementations as samples, such as basic-matmul, optimized-matmul, splitk-matmul, and padding-splitk-matmul, to show how to customize an operator for a given input and reach optimal performance. Each matmul sample in the repository has its own applicable scope and tuning method, so you can customize on demand.

For tuning methods, there are two categories: basic and custom. This document focuses on the first category, basic tuning, which achieves performance gains quickly through tiling parameter adjustment and kernel combination.

Matmul Fundamentals

Basic Block Tiling of Matrix C Across Cores

First, consider how work is partitioned across cores. Samples in the repository such as 00_basic_matmul and 06_optimized_matmul all tile matrix C into basic blocks before assigning them to cores. Matrix C is tiled along the M and N axes based on L1TileShape::M and L1TileShape::N, resulting in CeilDiv(M, L1TileShape::M) * CeilDiv(N, L1TileShape::N) basic blocks. These basic blocks are then assigned to cube cores according to the swizzle policy.
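As a minimal sketch of the block-count formula above, the following C++ evaluates it at compile time. `CeilDiv` here is a local stand-in written for illustration rather than the library's own helper, and `BasicBlockCount` is a hypothetical name. The sample problem size (M = 1024, N = 576) is the one used in the TileShape cases later in this guide.

```cpp
#include <cstdint>

// Local stand-in for a ceiling-division helper (illustrative, not the library's).
constexpr uint32_t CeilDiv(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

// Number of basic blocks that matrix C (M x N) is tiled into.
// BasicBlockCount is a hypothetical name used only in this sketch.
constexpr uint32_t BasicBlockCount(uint32_t m, uint32_t n,
                                   uint32_t l1TileM, uint32_t l1TileN) {
    return CeilDiv(m, l1TileM) * CeilDiv(n, l1TileN);
}

// M = 1024, N = 576 with L1TileShape::M = 128, L1TileShape::N = 256: 8 x 3 blocks.
static_assert(BasicBlockCount(1024, 576, 128, 256) == 24, "8 x 3 basic blocks");
```

Because N = 576 is not a multiple of 256, the last block column is partial, which is why ceiling division is needed rather than plain division.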

Matmul Hardware Visualization

See Basic Architecture. The following figure shows the hardware architecture involved in basic Matmul tiling, data movement, and computation. Because double buffering is enabled, two tiles of data are stored in L1/L0A/L0B.

TileShape Constraints

From the above, TileShape must be set such that it does not exceed the L1/L0A/L0B/L0C memory space. Additionally, TileShape values must be multiples of 16.

  • Scenario 1: FP16 input and output, L1TileShape<128,256,256>, L0TileShape<128,256,64>

For FP16 input and output, to maintain calculation precision, the cube core's mmad output to L0C is of FP32 data type. The fixpipe casts it to FP16 when writing back to global memory.

L1 size: 512 KB
Actual L1 usage = L1::M * L1::K * 2 (bytes) * 2 (double buffer) + L1::K * L1::N * 2 (bytes) * 2 (double buffer) = 128 * 256 * 2 * 2 + 256 * 256 * 2 * 2 = 393216 B = 384 KB = 3/4 L1_SIZE

L0A size: 64 KB
Actual L0A usage = L0::M * L0::K * 2 (bytes) * 2 (double buffer) = 128 * 64 * 2 * 2 = 32768 B = 32 KB = 1/2 L0A_SIZE

L0B size: 64 KB
Actual L0B usage = L0::K * L0::N * 2 (bytes) * 2 (double buffer) = 64 * 256 * 2 * 2 = 65536 B = 64 KB = 1 L0B_SIZE

L0C size: 128 KB
Actual L0C usage = L0::M * L0::N * 4 (bytes) = 128 * 256 * 4 = 131072 B = 128 KB = 1 L0C_SIZE

  • Scenario 2: FP32 input and output, L1TileShape<128,128,256>, L0TileShape<128,128,64>

L1 size: 512 KB
Actual L1 usage = L1::M * L1::K * 4 (bytes) * 2 (double buffer) + L1::K * L1::N * 4 (bytes) * 2 (double buffer) = 128 * 256 * 4 * 2 + 128 * 256 * 4 * 2 = 524288 B = 512 KB = 1 L1_SIZE

L0A size: 64 KB
Actual L0A usage = L0::M * L0::K * 4 (bytes) * 2 (double buffer) = 128 * 64 * 4 * 2 = 65536 B = 64 KB = 1 L0A_SIZE

L0B size: 64 KB
Actual L0B usage = L0::K * L0::N * 4 (bytes) * 2 (double buffer) = 64 * 128 * 4 * 2 = 65536 B = 64 KB = 1 L0B_SIZE

L0C size: 128 KB
Actual L0C usage = L0::M * L0::N * 4 (bytes) = 128 * 128 * 4 = 65536 B = 64 KB = 1/2 L0C_SIZE
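The two scenarios above follow the same arithmetic, so a candidate TileShape can be checked mechanically before building a kernel. The sketch below is one way to do that. It assumes the buffer capacities quoted in the scenarios (512 KB L1, 64 KB L0A, 64 KB L0B, 128 KB L0C), double buffering in L1/L0A/L0B, and an FP32 accumulator in L0C. The `Tile` struct and all function names are hypothetical and do not come from the library.

```cpp
#include <cstdint>

// Buffer capacities in bytes, as quoted in the scenarios above.
constexpr uint32_t L1_SIZE  = 512 * 1024;
constexpr uint32_t L0A_SIZE = 64 * 1024;
constexpr uint32_t L0B_SIZE = 64 * 1024;
constexpr uint32_t L0C_SIZE = 128 * 1024;

// Hypothetical plain struct standing in for a <M, N, K> tile shape.
struct Tile { uint32_t m, n, k; };

// L1 holds an A tile (M x K) and a B tile (K x N), each double-buffered.
constexpr uint32_t L1Bytes(Tile l1, uint32_t elemBytes) {
    return (l1.m * l1.k + l1.k * l1.n) * elemBytes * 2;
}
// L0A holds an M x K tile, L0B a K x N tile, both double-buffered.
constexpr uint32_t L0ABytes(Tile l0, uint32_t elemBytes) { return l0.m * l0.k * elemBytes * 2; }
constexpr uint32_t L0BBytes(Tile l0, uint32_t elemBytes) { return l0.k * l0.n * elemBytes * 2; }
// L0C holds the M x N accumulator in FP32 (4 bytes), single-buffered.
constexpr uint32_t L0CBytes(Tile l0) { return l0.m * l0.n * 4; }

constexpr bool IsMultipleOf16(Tile t) {
    return t.m % 16 == 0 && t.n % 16 == 0 && t.k % 16 == 0;
}

// True when both tiles meet the multiple-of-16 rule and fit in every buffer.
constexpr bool Fits(Tile l1, Tile l0, uint32_t elemBytes) {
    return IsMultipleOf16(l1) && IsMultipleOf16(l0) &&
           L1Bytes(l1, elemBytes) <= L1_SIZE &&
           L0ABytes(l0, elemBytes) <= L0A_SIZE &&
           L0BBytes(l0, elemBytes) <= L0B_SIZE &&
           L0CBytes(l0) <= L0C_SIZE;
}

// Scenario 1 (FP16, 2 bytes) and Scenario 2 (FP32, 4 bytes) both fit.
static_assert(L1Bytes(Tile{128, 256, 256}, 2) == 393216, "384 KB");
static_assert(Fits(Tile{128, 256, 256}, Tile{128, 256, 64}, 2), "scenario 1");
static_assert(Fits(Tile{128, 128, 256}, Tile{128, 128, 64}, 4), "scenario 2");
// Reusing the FP16 shape for FP32 would need 768 KB of L1 and does not fit.
static_assert(!Fits(Tile{128, 256, 256}, Tile{128, 256, 64}, 4), "too large for FP32");
```

The last assertion shows why the FP32 scenario halves N: doubling the element size doubles every term in the L1, L0A, and L0B formulas.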

Tuning Policy

Sample Coverage and Selection

The basic Matmul operators currently in the repository are listed below. For more details, see the matrix multiplication template summary:

  • 00_basic_matmul: Uses the MmadAtlasA2Pingpong dispatch policy, which enables ping-pong (double) buffering.
template <bool ENABLE_UNIT_FLAG_ = false>
struct MmadAtlasA2Pingpong : public MmadAtlasA2 {
    static constexpr uint32_t STAGES = 2;
    static constexpr bool ENABLE_UNIT_FLAG = ENABLE_UNIT_FLAG_;
};
  • 04_padding_matmul: Adds padding for the input matrices on top of 00_basic_matmul. Experiments have shown that data movement is more efficient when shape[1] of a RowMajor matrix is aligned to 512 bytes, so adding padding improves performance in non-aligned scenarios.
  • 06_optimized_matmul: Uses the MmadAtlasA2Preload dispatch policy to enable preloading and ShuffleK. Preloading reduces interruptions in the data-movement pipeline. ShuffleK shuffles the order in which different cores move their L1 tiles, which reduces bank conflicts by staggering the addresses that cores working on the same row or column of basic blocks access in matrices A and B. It also adds padding to align the input matrices to L1TileShape. Compared to 00_basic_matmul, it applies more optimizations but also incurs overhead from enabling vector cores and from some scalar computation.
template <bool ENABLE_UNIT_FLAG_ = false, bool ENABLE_SHUFFLE_K_ = false>
struct MmadAtlasA2Preload : public MmadAtlasA2 {
    static constexpr uint32_t STAGES = 2;
    static constexpr bool ENABLE_UNIT_FLAG = ENABLE_UNIT_FLAG_;
    static constexpr bool ENABLE_SHUFFLE_K = ENABLE_SHUFFLE_K_;
};
  • 09_splitk_matmul: Adds K-axis partitioning for per-core block assignment based on 00_basic_matmul. When the M/N axes are small and there are few basic blocks to tile, tiling the K-axis can improve cube core utilization. However, vector cores are required for accumulation after tiling the K-axis, so gains only occur when the K-axis has sufficient length.
  • 22_padding_splitk_matmul: Integrates the features of 04_padding_matmul and 09_splitk_matmul. It yields performance gains in non-aligned scenarios where the M/N axes are small and the K-axis has sufficient length.
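To judge whether a padding sample is worth trying, it helps to know whether a row is already 512-byte aligned. The sketch below shows only that arithmetic. It does not reproduce how 04_padding_matmul performs its padding, and the function names are hypothetical. It assumes the element size divides 512, which holds for 1-, 2-, and 4-byte types.

```cpp
#include <cstdint>

constexpr uint32_t ALIGN_BYTES = 512;

// True when one row of `cols` elements occupies a multiple of 512 bytes.
constexpr bool IsRowAligned(uint32_t cols, uint32_t elemBytes) {
    return (cols * elemBytes) % ALIGN_BYTES == 0;
}

// Smallest column count >= cols whose row length is a multiple of 512 bytes.
// Assumes elemBytes divides 512.
constexpr uint32_t PaddedCols(uint32_t cols, uint32_t elemBytes) {
    return (cols * elemBytes + ALIGN_BYTES - 1) / ALIGN_BYTES * ALIGN_BYTES / elemBytes;
}

// FP16 rows of 6144 elements are aligned (12288 B = 24 * 512 B).
static_assert(IsRowAligned(6144, 2), "aligned");
// FP16 rows of 576 elements are not (1152 B); the next aligned width is 768.
static_assert(!IsRowAligned(576, 2), "not aligned");
static_assert(PaddedCols(576, 2) == 768, "padded to 1536 B");
```

A row that is already aligned is returned unchanged by `PaddedCols`, so in that case padding would add cost without the alignment benefit described above.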

TileShape Adjustment

Subject to two constraints, (1) every dimension is a multiple of 16 and (2) the tiles do not exceed hardware limits, adjust the TileShape to achieve load balancing. To keep the tiling policy simple while still reaching good performance, the current solution in the repository restricts L0TileShape::M == L1TileShape::M and L0TileShape::N == L1TileShape::N, which reduces tuning complexity. It is advised to set L0TileShape::K to 1/4 of L1TileShape::K.
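With these restrictions, only the L1 tile needs to be chosen and the L0 tile follows from it. A minimal sketch of that derivation, using a hypothetical `TileShape` struct and `DeriveL0` helper that are not part of the library:

```cpp
#include <cstdint>

// Hypothetical plain struct standing in for a <M, N, K> tile shape.
struct TileShape { uint32_t m, n, k; };

// L0 keeps M and N from L1 and takes one quarter of K, per the advice above.
constexpr TileShape DeriveL0(TileShape l1) {
    return TileShape{l1.m, l1.n, l1.k / 4};
}

// L1TileShape<128,256,256> gives L0TileShape<128,256,64>.
static_assert(DeriveL0(TileShape{128, 256, 256}).m == 128, "M unchanged");
static_assert(DeriveL0(TileShape{128, 256, 256}).n == 256, "N unchanged");
static_assert(DeriveL0(TileShape{128, 256, 256}).k == 64, "K / 4");
```

Since L0::K must itself be a multiple of 16, this rule implies choosing L1::K as a multiple of 64.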

  • Case 1

Context: Matrix A RowMajor, matrix B ColumnMajor, M: 1024, N: 576, K: 6144, FP16 input and output, 20 AIC cores

Using 06_optimized_matmul with the default L1TileShape<128,256,256> and L0TileShape<128,256,64>, the execution time is 72.5 µs (performance may vary across chip platforms, CANN packages, and drivers; this figure is for reference only).

Analysis: The number of tiled basic blocks is CeilDiv(1024, 128) × CeilDiv(576, 256) = 8 × 3 = 24. Therefore, 4 of the 20 AIC cores need to compute two basic blocks each, while the remaining 16 process one basic block each, resulting in load imbalance.

Adjust to L1TileShape<256,128,256> and L0TileShape<256,128,64>. The number of tiled basic blocks becomes CeilDiv(1024, 256) × CeilDiv(576, 128) = 4 × 5 = 20. Thus, all 20 AIC cores process exactly one basic block each, achieving load balance. The execution time is 48.6 µs.

  • Case 2

Context: Matrix A RowMajor, matrix B zN, M: 20, N: 6144, K: 16384, FP16 input and output, 20 AIC cores

Here, matrix B is in zN format (same as NZ format). Using 21_basic_matmul_preload_zN with the default L1TileShape<128,256,256> and L0TileShape<128,256,64>, the execution time is 181.4 µs.

Analysis: The number of tiled basic blocks is CeilDiv(20, 128) × CeilDiv(6144, 256) = 1 × 24 = 24. Therefore, 4 of the 20 AIC cores need to compute two basic blocks each, while the remaining 16 process one basic block each, resulting in load imbalance.

Adjust to L1TileShape<32,320,128> and L0TileShape<32,320,32>. The number of tiled basic blocks becomes CeilDiv(20, 32) × CeilDiv(6144, 320) = 1 × 20 = 20. Thus, all 20 AIC cores process exactly one basic block each, achieving load balance. The execution time is 139.6 µs.

  • Case 3

Context: Matrix A RowMajor, matrix B ColumnMajor, M: 1, N: 768, K: 5120, FP32 input and output, 24 AIC cores

Here, both matrices A and B are laid out along the K-axis, and the K-axis is 512-byte aligned. Using 00_basic_matmul directly with L1TileShape<128,128,256> and L0TileShape<128,128,64>, which are common choices for the FP32 data type, the execution time is 36.3 µs.

Analysis: The number of tiled basic blocks is CeilDiv(1, 128) × CeilDiv(768, 128) = 1 × 6 = 6. Therefore, only 6 of the 24 AIC cores are engaged in computation and the other 18 are idle, resulting in load imbalance.

Because matrix B is in ColumnMajor format, finer-grained tiling along the N-axis is possible. Adjust to L1TileShape<16,32,1024> and L0TileShape<16,32,256>. The number of tiled basic blocks becomes CeilDiv(1, 16) × CeilDiv(768, 32) = 1 × 24 = 24. Thus, all 24 AIC cores work and process exactly one basic block each, achieving load balance. The execution time is 15.5 µs.

  • ⚠️ Note

Experiments have shown that for RowMajor/ColumnMajor layouts, movement performance is higher when L1::M/L1::N/L1::K are multiples of 256, but there is a trade-off with load balancing. For the zN format, the impact of being a multiple of 256 is smaller. For example, in Case 2, if matrix B were in RowMajor instead of zN layout, the execution time with the default L1TileShape<128,256,256> and L0TileShape<128,256,64> would be 238.8 µs. After adjusting to L1TileShape<32,320,128> and L0TileShape<32,320,32>, the execution time becomes 252.4 µs. Load balancing did not yield gains in this scenario.
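The block-count reasoning in the three cases above can be sketched as a small compile-time helper. `Balance` and `Analyze` are hypothetical names for this illustration, and the sketch needs C++14 or later. It only counts blocks per core, so it does not capture the data-movement effects described in the note, which can outweigh a better block distribution.

```cpp
#include <cstdint>

constexpr uint32_t CeilDiv(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

// Hypothetical summary of how basic blocks spread over the AIC cores.
struct Balance {
    uint32_t blocks;       // basic blocks of matrix C
    uint32_t activeCores;  // cores that receive at least one block
    uint32_t maxPerCore;   // blocks on the most heavily loaded core
};

constexpr Balance Analyze(uint32_t m, uint32_t n,
                          uint32_t tileM, uint32_t tileN, uint32_t cores) {
    const uint32_t blocks = CeilDiv(m, tileM) * CeilDiv(n, tileN);
    return Balance{blocks, blocks < cores ? blocks : cores, CeilDiv(blocks, cores)};
}

// Case 1: 24 blocks on 20 cores means some cores run two blocks; 20 blocks means one each.
static_assert(Analyze(1024, 576, 128, 256, 20).maxPerCore == 2, "default tiles");
static_assert(Analyze(1024, 576, 256, 128, 20).maxPerCore == 1, "adjusted tiles");
// Case 3: the default tiles occupy 6 of 24 cores; the adjusted tiles occupy all 24.
static_assert(Analyze(1, 768, 128, 128, 24).activeCores == 6, "default tiles");
static_assert(Analyze(1, 768, 16, 32, 24).activeCores == 24, "adjusted tiles");
```

A TileShape is a good candidate when `activeCores` equals the core count and `blocks` is a multiple of it; whether it actually runs faster still has to be measured.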

Swizzle Adjustment

In CATLASS, Swizzle describes the order in which the matrices are read and written. It is specified as Gemm::Block::GemmIdentityBlockSwizzle<a, b> and abbreviated here as <a, b>, where a is the SwizzleOffset and b is the SwizzleDirection. When both matrices A and B are in RowMajor layout, <3, 0> is typically chosen when m > n, and <3, 1> when m < n. In general, to adjust Swizzle, first determine the SwizzleDirection (0 or 1), then adjust the SwizzleOffset. In some scenarios, this achieves better load balancing.

  • Case 1

Context: Matrix A RowMajor, matrix B zN, M: 160, N: 6144, K: 2048, FP16 input and output, 20 AIC cores

Using 21_basic_matmul_preload_zN with the default L1TileShape<128,256,256> and L0TileShape<128,256,64>, and swizzle set to <3, 1>, the execution time is 40.6 µs. Setting swizzle to <4, 1> results in an execution time of 35.3 µs.

Basic block analysis: The M-axis is tiled into two blocks of lengths 128 and 32, and the N-axis is tiled into 24 blocks of length 256, resulting in a total of 48 basic blocks. The figure below shows the assignment of basic blocks to AIC cores for swizzle <3, 1> and swizzle <4, 1>. With swizzle <3, 1>, cores 1, 2, 5, and 6 have a maximum task size along the M-axis of (128 + 128 + 32). With swizzle <4, 1>, cores 12, 13, 14, and 15 have a maximum task size along the M-axis of (128 + 128), achieving better load balance.
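The sketch below is one way to reproduce the per-core M-axis loads quoted above. It assumes a direction-1 ordering that walks the basic blocks in groups of SwizzleOffset columns, reverses the row order in every other group, and deals tasks to cores round-robin (core c takes tasks c, c + 20, c + 40, ...). These are assumptions made for illustration and all function names are hypothetical; check the GemmIdentityBlockSwizzle source in the repository for the actual mapping. It needs C++14 or later.

```cpp
#include <cstdint>

constexpr uint32_t CeilDiv(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

// Row (M-axis) index of the basic block visited as task `t`, for direction 1.
// Blocks are walked in groups of `offset` columns; odd groups reverse the rows.
constexpr uint32_t BlockRow(uint32_t t, uint32_t rows, uint32_t cols, uint32_t offset) {
    const uint32_t groups = CeilDiv(cols, offset);
    const uint32_t group = t / (offset * rows);
    const uint32_t inGroup = t % (offset * rows);
    // The last group may be narrower than `offset` columns.
    const uint32_t width = (group == groups - 1) ? cols - offset * group : offset;
    const uint32_t row = inGroup / width;
    return (group % 2 == 1) ? rows - 1 - row : row;
}

// Summed M extent of all blocks handled by `core` (tasks dealt round-robin).
constexpr uint32_t CoreLoad(uint32_t core, uint32_t m, uint32_t n, uint32_t tileM,
                            uint32_t tileN, uint32_t offset, uint32_t cores) {
    const uint32_t rows = CeilDiv(m, tileM);
    const uint32_t cols = CeilDiv(n, tileN);
    uint32_t load = 0;
    for (uint32_t t = core; t < rows * cols; t += cores) {
        const uint32_t row = BlockRow(t, rows, cols, offset);
        // The last row of blocks may be shorter than tileM (here 160 - 128 = 32).
        load += (row == rows - 1) ? m - row * tileM : tileM;
    }
    return load;
}

// Largest per-core load across all cores.
constexpr uint32_t MaxLoad(uint32_t m, uint32_t n, uint32_t tileM, uint32_t tileN,
                           uint32_t offset, uint32_t cores) {
    uint32_t worst = 0;
    for (uint32_t core = 0; core < cores; ++core) {
        const uint32_t load = CoreLoad(core, m, n, tileM, tileN, offset, cores);
        if (load > worst) worst = load;
    }
    return worst;
}

// M = 160, N = 6144, tiles 128 x 256, 20 cores.
static_assert(MaxLoad(160, 6144, 128, 256, 3, 20) == 288, "<3, 1>: 128 + 128 + 32");
static_assert(MaxLoad(160, 6144, 128, 256, 4, 20) == 256, "<4, 1>: 128 + 128");
```

Under these assumptions, with cores numbered from 0, the heaviest cores are 1, 2, 5, and 6 for offset 3 and 12 through 15 for offset 4, which matches the analysis above. Sweeping `offset` through a few values with `MaxLoad` is a cheap way to shortlist swizzle settings before measuring them on hardware.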

Brief Overview of Custom Tuning

  • The Matmul samples currently in the repository each have their own characteristics and advantages, and you can do custom development by recombining their code. For example, 21_basic_matmul_preload_zN combines the MmadAtlasA2Preload dispatch policy with 00_basic_matmul, while 22_padding_splitk_matmul combines the features of 04_padding_matmul and 09_splitk_matmul. Once you are familiar with the different samples in the repository, you can develop further based on your needs to achieve better performance. The template library will also continue to add Matmul samples that use new algorithms and cover more use cases.
  • In addition to basic matmul custom tuning, some derived samples in the repository (such as 03_matmul_add, 20_matmul_bias, etc.) are often based on samples like 00_basic_matmul with new features added. These samples can also be customized to use different basic Matmul samples and undergo tiling parameter tuning to achieve better performance.

Author's statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
