
CANN/hixl: A3 Chip Performance Data

Measured HIXL performance data for selected scenarios on the Ascend A3 chip.

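hixl: HIXL (Huawei Xfer Library) is a flexible, efficient one-sided communication library for Ascend. It provides simple, reliable, and efficient point-to-point data transfer for cluster scenarios. Project address: https://gitcode.com/cann/hixl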

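HIXL is subject to the following constraint on the Ascend A3 chip:

  • Atlas A3 training/inference series products: when the HCCS transport protocol is used, Host memory cannot serve as the remote Cache (this restriction does not apply when the relay memory pool is enabled).

For this reason, no measured data is given for the D2H and H2H scenarios over HCCS when the relay memory pool is not enabled.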
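
The constraint can be restated as a small predicate. This is only an illustrative sketch: `hccs_result_available` and its parameters are hypothetical names, not part of the HIXL API, and the direction strings simply mirror the table headers (D2D, H2D, D2H, H2H).

```python
# Hypothetical helper, not part of HIXL: it encodes the A3 constraint stated above.
# Direction strings follow the table headers; D2H and H2H are the two scenarios
# for which the document reports no HCCS data without the relay memory pool.
ALL_DIRECTIONS = {"D2D", "H2D", "D2H", "H2H"}
HOST_REMOTE_DIRECTIONS = {"D2H", "H2H"}


def hccs_result_available(direction: str, buffer_pool: bool) -> bool:
    """Return True if an HCCS measurement is expected for this scenario."""
    if direction not in ALL_DIRECTIONS:
        raise ValueError(f"unknown direction: {direction}")
    # With the relay memory pool enabled, the restriction does not apply.
    return buffer_pool or direction not in HOST_REMOTE_DIRECTIONS


if __name__ == "__main__":
    for d in sorted(ALL_DIRECTIONS):
        print(d, hccs_result_available(d, False), hccs_result_available(d, True))
```

Only the HCCS columns without BufferPool are affected; the RDMA columns are reported for every direction.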

Single-node scenario (CANN 9.0.0)

  • WRITE:
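
| Transfer block size | HCCS D2D (GB/s) | HCCS D2D BufferPool (GB/s) | RDMA D2D (GB/s) | RDMA D2D BufferPool (GB/s) |
|---|---|---|---|---|
| 16K | 6.916 | 6.469 | 7.114 | 7.036 |
| 32K | 16.050 | 12.596 | 5.700 | 5.525 |
| 64K | 28.882 | 22.781 | 10.126 | 9.795 |
| 128K | 45.637 | 38.261 | 16.393 | 15.966 |
| 256K | 66.667 | 58.221 | 21.548 | 21.051 |
| 512K | 81.967 | 76.876 | 21.842 | 21.596 |
| 1M | 84.175 | 81.699 | 21.397 | 21.237 |
| 2M | 93.075 | 91.308 | 21.533 | 21.474 |
| 4M | 110.132 | 110.424 | 22.202 | 22.187 |
| 8M | 113.020 | 114.784 | 22.353 | 22.202 |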
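
| Transfer block size | HCCS H2D (GB/s) | HCCS H2D BufferPool (GB/s) | RDMA H2D (GB/s) | RDMA H2D BufferPool (GB/s) |
|---|---|---|---|---|
| 16K | 5.277 | 2.727 | 7.037 | 2.659 |
| 32K | 11.309 | 5.347 | 5.683 | 5.308 |
| 64K | 20.109 | 10.654 | 10.075 | 8.904 |
| 128K | 24.601 | 18.968 | 16.411 | 11.379 |
| 256K | 27.606 | 24.108 | 21.360 | 13.823 |
| 512K | 29.649 | 33.665 | 21.865 | 14.281 |
| 1M | 29.869 | 33.656 | 21.353 | 15.038 |
| 2M | 30.720 | 33.693 | 21.411 | 15.218 |
| 4M | 32.192 | 33.414 | 22.210 | 15.031 |
| 8M | 32.316 | 34.051 | 22.266 | 14.471 |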
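
| Transfer block size | HCCS D2H (GB/s) | HCCS D2H BufferPool (GB/s) | RDMA D2H (GB/s) | RDMA D2H BufferPool (GB/s) |
|---|---|---|---|---|
| 16K | N/A | 3.292 | 6.714 | 3.164 |
| 32K | N/A | 7.938 | 5.669 | 7.608 |
| 64K | N/A | 12.303 | 10.094 | 12.951 |
| 128K | N/A | 23.247 | 16.359 | 14.516 |
| 256K | N/A | 24.752 | 21.511 | 15.214 |
| 512K | N/A | 26.508 | 21.819 | 15.274 |
| 1M | N/A | 26.658 | 21.230 | 15.426 |
| 2M | N/A | 26.618 | 21.331 | 15.576 |
| 4M | N/A | 27.257 | 22.171 | 15.438 |
| 8M | N/A | 26.371 | 22.278 | 15.371 |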
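
| Transfer block size | HCCS H2H (GB/s) | HCCS H2H BufferPool (GB/s) | RDMA H2H (GB/s) | RDMA H2H BufferPool (GB/s) |
|---|---|---|---|---|
| 16K | N/A | 2.526 | 6.128 | 2.367 |
| 32K | N/A | 5.300 | 5.702 | 5.062 |
| 64K | N/A | 10.319 | 10.118 | 9.386 |
| 128K | N/A | 17.478 | 16.467 | 14.760 |
| 256K | N/A | 24.728 | 21.515 | 15.218 |
| 512K | N/A | 25.232 | 21.864 | 15.383 |
| 1M | N/A | 25.821 | 21.371 | 15.379 |
| 2M | N/A | 26.085 | 21.541 | 15.315 |
| 4M | N/A | 26.145 | 22.222 | 15.375 |
| 8M | N/A | 25.636 | 22.317 | 15.419 |
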
  • READ:
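
| Transfer block size | HCCS D2D (GB/s) | HCCS D2D BufferPool (GB/s) | RDMA D2D (GB/s) | RDMA D2D BufferPool (GB/s) |
|---|---|---|---|---|
| 16K | 7.075 | 6.650 | 6.665 | 5.768 |
| 32K | 17.332 | 13.200 | 5.609 | 5.456 |
| 64K | 30.555 | 24.262 | 9.971 | 9.694 |
| 128K | 49.554 | 40.903 | 15.206 | 15.932 |
| 256K | 74.272 | 63.808 | 21.338 | 20.987 |
| 512K | 94.985 | 85.092 | 21.701 | 21.664 |
| 1M | 95.129 | 90.975 | 21.327 | 21.251 |
| 2M | 107.388 | 103.477 | 21.364 | 21.389 |
| 4M | 131.441 | 125.881 | 22.136 | 22.151 |
| 8M | 134.553 | 134.989 | 22.230 | 22.159 |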
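
| Transfer block size | HCCS H2D (GB/s) | HCCS H2D BufferPool (GB/s) | RDMA H2D (GB/s) | RDMA H2D BufferPool (GB/s) |
|---|---|---|---|---|
| 16K | 5.234 | 2.248 | 6.181 | 2.339 |
| 32K | 11.652 | 4.463 | 5.620 | 4.265 |
| 64K | 19.968 | 8.499 | 9.949 | 7.312 |
| 128K | 24.314 | 17.275 | 16.287 | 10.318 |
| 256K | 27.759 | 21.596 | 21.183 | 11.774 |
| 512K | 30.274 | 25.293 | 21.769 | 14.345 |
| 1M | 30.718 | 24.811 | 21.309 | 15.400 |
| 2M | 31.646 | 26.963 | 21.400 | 15.623 |
| 4M | 33.405 | 30.765 | 22.116 | 15.711 |
| 8M | 33.693 | 27.316 | 22.155 | 14.760 |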
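
| Transfer block size | HCCS D2H (GB/s) | HCCS D2H BufferPool (GB/s) | RDMA D2H (GB/s) | RDMA D2H BufferPool (GB/s) |
|---|---|---|---|---|
| 16K | N/A | 2.974 | 6.616 | 3.467 |
| 32K | N/A | 5.754 | 5.612 | 7.962 |
| 64K | N/A | 11.578 | 9.924 | 14.689 |
| 128K | N/A | 23.178 | 16.202 | 15.565 |
| 256K | N/A | 35.411 | 21.266 | 16.079 |
| 512K | N/A | 37.742 | 21.720 | 16.232 |
| 1M | N/A | 39.160 | 21.324 | 16.359 |
| 2M | N/A | 39.358 | 21.485 | 16.398 |
| 4M | N/A | 39.670 | 22.124 | 16.361 |
| 8M | N/A | 37.982 | 22.120 | 16.308 |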
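
| Transfer block size | HCCS H2H (GB/s) | HCCS H2H BufferPool (GB/s) | RDMA H2H (GB/s) | RDMA H2H BufferPool (GB/s) |
|---|---|---|---|---|
| 16K | N/A | 2.933 | 6.564 | 2.162 |
| 32K | N/A | 6.721 | 5.596 | 4.670 |
| 64K | N/A | 14.026 | 9.972 | 9.658 |
| 128K | N/A | 19.480 | 16.213 | 15.194 |
| 256K | N/A | 21.849 | 21.389 | 15.704 |
| 512K | N/A | 24.357 | 21.720 | 15.901 |
| 1M | N/A | 22.789 | 21.306 | 16.017 |
| 2M | N/A | 23.016 | 21.926 | 16.100 |
| 4M | N/A | 23.264 | 22.136 | 16.119 |
| 8M | N/A | 22.678 | 22.202 | 16.069 |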
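
To relate these bandwidth figures to per-transfer time, one can divide the block size by the measured bandwidth. The sketch below assumes 1K = 1024 bytes and 1 GB/s = 10^9 bytes/s; the source does not state which convention the benchmark uses, so the absolute numbers are approximate. `block_size_bytes` and `transfer_time_us` are hypothetical helpers, and the two sample values are taken from the single-node WRITE HCCS D2D column.

```python
# Assumptions (not stated in the source): 1K = 1024 bytes, 1 GB/s = 1e9 bytes/s.
UNIT_BYTES = {"K": 1024, "M": 1024 * 1024}


def block_size_bytes(label: str) -> int:
    """Convert a block-size label such as '16K' or '8M' to bytes."""
    return int(label[:-1]) * UNIT_BYTES[label[-1]]


def transfer_time_us(label: str, bandwidth_gb_s: float) -> float:
    """Approximate time to move one block, in microseconds."""
    return block_size_bytes(label) / (bandwidth_gb_s * 1e9) * 1e6


if __name__ == "__main__":
    # Single-node WRITE, HCCS D2D values from the tables in this document.
    for label, bw in (("16K", 6.916), ("8M", 113.020)):
        print(f"{label}: {transfer_time_us(label, bw):.2f} us at {bw} GB/s")
```

Under these assumptions, the 16K case works out to roughly 2.4 microseconds per block and the 8M case to roughly 74 microseconds.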

Two-node scenario (CANN 9.0.0)

  • WRITE:
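
| Transfer block size | HCCS D2D (GB/s) | HCCS D2D BufferPool (GB/s) | RDMA D2D (GB/s) | RDMA D2D BufferPool (GB/s) |
|---|---|---|---|---|
| 16K | 4.033 | 3.849 | 5.753 | 7.181 |
| 32K | 8.822 | 8.482 | 5.689 | 5.551 |
| 64K | 16.607 | 15.753 | 10.093 | 9.850 |
| 128K | 28.848 | 27.927 | 16.381 | 16.156 |
| 256K | 46.992 | 45.224 | 21.507 | 21.226 |
| 512K | 64.969 | 64.868 | 21.785 | 21.728 |
| 1M | 74.850 | 73.790 | 21.327 | 21.277 |
| 2M | 86.806 | 85.499 | 21.397 | 21.459 |
| 4M | 106.112 | 106.112 | 22.147 | 22.199 |
| 8M | 113.020 | 110.424 | 22.286 | 22.305 |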
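
| Transfer block size | HCCS H2D (GB/s) | HCCS H2D BufferPool (GB/s) | RDMA H2D (GB/s) | RDMA H2D BufferPool (GB/s) |
|---|---|---|---|---|
| 16K | 3.406 | 2.690 | 6.159 | 2.648 |
| 32K | 7.146 | 5.256 | 5.685 | 5.019 |
| 64K | 13.329 | 10.409 | 10.067 | 7.738 |
| 128K | 15.060 | 18.826 | 16.402 | 10.194 |
| 256K | 16.230 | 23.413 | 21.400 | 11.391 |
| 512K | 16.896 | 32.647 | 21.792 | 12.860 |
| 1M | 16.980 | 33.086 | 21.357 | 13.319 |
| 2M | 17.325 | 33.369 | 21.533 | 14.086 |
| 4M | 17.986 | 33.875 | 22.202 | 13.569 |
| 8M | 18.095 | 32.869 | 22.242 | 12.974 |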
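
| Transfer block size | HCCS D2H (GB/s) | HCCS D2H BufferPool (GB/s) | RDMA D2H (GB/s) | RDMA D2H BufferPool (GB/s) |
|---|---|---|---|---|
| 16K | N/A | 3.600 | 6.018 | 3.452 |
| 32K | N/A | 8.718 | 5.691 | 7.630 |
| 64K | N/A | 15.863 | 10.096 | 13.230 |
| 128K | N/A | 23.914 | 16.463 | 14.384 |
| 256K | N/A | 25.757 | 21.422 | 14.795 |
| 512K | N/A | 26.338 | 21.888 | 14.832 |
| 1M | N/A | 27.576 | 21.404 | 15.095 |
| 2M | N/A | 27.606 | 21.552 | 15.026 |
| 4M | N/A | 27.821 | 22.210 | 15.194 |
| 8M | N/A | 27.346 | 22.317 | 15.075 |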
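
| Transfer block size | HCCS H2H (GB/s) | HCCS H2H BufferPool (GB/s) | RDMA H2H (GB/s) | RDMA H2H BufferPool (GB/s) |
|---|---|---|---|---|
| 16K | N/A | 2.412 | 6.467 | 2.586 |
| 32K | N/A | 4.793 | 5.703 | 5.084 |
| 64K | N/A | 9.363 | 10.112 | 9.694 |
| 128K | N/A | 16.491 | 16.372 | 14.715 |
| 256K | N/A | 25.547 | 21.470 | 15.332 |
| 512K | N/A | 26.020 | 21.853 | 15.478 |
| 1M | N/A | 26.327 | 21.353 | 15.476 |
| 2M | N/A | 26.562 | 21.478 | 15.551 |
| 4M | N/A | 26.500 | 22.222 | 15.584 |
| 8M | N/A | 26.355 | 22.298 | 15.543 |
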
  • READ:
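
| Transfer block size | HCCS D2D (GB/s) | HCCS D2D BufferPool (GB/s) | RDMA D2D (GB/s) | RDMA D2D BufferPool (GB/s) |
|---|---|---|---|---|
| 16K | 4.142 | 3.939 | 6.834 | 6.064 |
| 32K | 9.176 | 8.789 | 5.607 | 5.548 |
| 64K | 17.133 | 16.175 | 9.947 | 9.508 |
| 128K | 29.840 | 28.875 | 16.137 | 15.650 |
| 256K | 49.741 | 47.911 | 21.212 | 20.750 |
| 512K | 71.266 | 70.701 | 21.563 | 21.478 |
| 1M | 83.333 | 82.399 | 21.158 | 21.111 |
| 2M | 97.504 | 97.504 | 21.349 | 21.298 |
| 4M | 124.008 | 123.396 | 21.941 | 22.081 |
| 8M | 133.120 | 130.480 | 22.069 | 22.187 |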
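
| Transfer block size | HCCS H2D (GB/s) | HCCS H2D BufferPool (GB/s) | RDMA H2D (GB/s) | RDMA H2D BufferPool (GB/s) |
|---|---|---|---|---|
| 16K | 3.429 | 2.863 | 6.432 | 2.499 |
| 32K | 7.306 | 6.401 | 5.605 | 5.237 |
| 64K | 13.267 | 11.189 | 9.903 | 8.161 |
| 128K | 15.238 | 18.851 | 16.167 | 10.909 |
| 256K | 16.682 | 21.819 | 21.133 | 11.925 |
| 512K | 17.487 | 21.302 | 21.668 | 14.473 |
| 1M | 17.544 | 23.773 | 21.126 | 15.305 |
| 2M | 17.847 | 24.787 | 21.255 | 15.645 |
| 4M | 18.385 | 24.650 | 22.073 | 15.676 |
| 8M | 18.458 | 21.701 | 22.151 | 14.498 |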
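
| Transfer block size | HCCS D2H (GB/s) | HCCS D2H BufferPool (GB/s) | RDMA D2H (GB/s) | RDMA D2H BufferPool (GB/s) |
|---|---|---|---|---|
| 16K | N/A | 2.838 | 6.271 | 2.724 |
| 32K | N/A | 5.844 | 5.607 | 5.285 |
| 64K | N/A | 10.351 | 9.944 | 10.485 |
| 128K | N/A | 16.653 | 16.217 | 15.235 |
| 256K | N/A | 30.332 | 21.266 | 16.133 |
| 512K | N/A | 37.504 | 21.720 | 16.504 |
| 1M | N/A | 37.994 | 21.143 | 16.473 |
| 2M | N/A | 38.121 | 21.393 | 16.640 |
| 4M | N/A | 38.379 | 22.038 | 16.510 |
| 8M | N/A | 37.810 | 22.073 | 16.554 |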
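
| Transfer block size | HCCS H2H (GB/s) | HCCS H2H BufferPool (GB/s) | RDMA H2H (GB/s) | RDMA H2H BufferPool (GB/s) |
|---|---|---|---|---|
| 16K | N/A | 2.914 | 6.234 | 1.976 |
| 32K | N/A | 6.872 | 5.597 | 4.138 |
| 64K | N/A | 12.333 | 9.950 | 8.247 |
| 128K | N/A | 20.505 | 16.183 | 15.332 |
| 256K | N/A | 20.661 | 21.190 | 16.028 |
| 512K | N/A | 26.261 | 21.664 | 16.255 |
| 1M | N/A | 21.999 | 21.204 | 16.346 |
| 2M | N/A | 26.830 | 21.378 | 16.482 |
| 4M | N/A | 22.559 | 22.030 | 16.456 |
| 8M | N/A | 21.747 | 22.191 | 16.342 |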
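
One way to read the single-node and two-node results together is to take the ratio of two-node to single-node bandwidth at the same block size. The sketch below does this for three HCCS D2D WRITE rows copied from this document; the dictionary layout and the name `two_node_ratio` are illustrative choices, not anything defined by HIXL.

```python
# HCCS D2D WRITE bandwidth in GB/s, copied from the single-node and two-node
# WRITE results in this document (three sample block sizes only).
SINGLE_NODE = {"16K": 6.916, "256K": 66.667, "8M": 113.020}
TWO_NODE = {"16K": 4.033, "256K": 46.992, "8M": 113.020}


def two_node_ratio(label: str) -> float:
    """Two-node bandwidth as a fraction of single-node bandwidth."""
    return TWO_NODE[label] / SINGLE_NODE[label]


if __name__ == "__main__":
    for label in SINGLE_NODE:
        print(f"{label}: {two_node_ratio(label):.2f}")
```

For these three rows the ratio rises from about 0.58 at 16K to 1.00 at 8M, so in this sample the two-node gap is concentrated at small block sizes.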
