
CANN/cann-recipes-embodied-intelligence ACT Training Sample

news 2026/7/10 14:13:20

ACT Training Sample on Ascend Atlas A2

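cann-recipes-embodied-intelligence: this project provides CANN-based optimization samples for typical models and acceleration algorithms used in embodied-intelligence workloads. Project URL: https://gitcode.com/cann/cann-recipes-embodied-intelligence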

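This directory provides an ACT training sample. It trains a model on the ALOHA transfer_cube task and is structured so that it can be extended to other tasks.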

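The sample follows these principles:

The cann-recipes repository stores only the training sample directory, configs, scripts, documentation, and patches;
lerobot is an external dependency and is cloned separately;
setup.sh pins the lerobot commit id and runs git apply for the verified, general-purpose Ascend training patch.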

1. Applicable Scenarios

Hardware: Ascend Atlas A2
CANN: 8.3.0 or later
Task: AlohaTransferCube-v0
Dataset: lerobot/aloha_sim_transfer_cube_human
External training framework: huggingface/lerobot

2. External Dependency and Pinned Version

This sample does not vendor the lerobot source code. By default it uses the following commit:

58f70b6bd370864139a3795ac3497a9eae8c42d5
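A checkout can be compared against the pinned commit with a few lines of Python. This is a minimal sketch, not part of setup.sh: the helper names `is_pinned` and `read_head` are hypothetical, and the only external command it relies on is the standard `git rev-parse HEAD`.

```python
import subprocess

# Commit id pinned by this sample (copied from the section above).
PINNED_COMMIT = "58f70b6bd370864139a3795ac3497a9eae8c42d5"


def is_pinned(head: str, pinned: str = PINNED_COMMIT) -> bool:
    """Return True when a HEAD hash equals the pinned commit.

    Surrounding whitespace and letter case are ignored.
    """
    return head.strip().lower() == pinned.lower()


def read_head(repo_dir: str) -> str:
    """Ask git for the commit hash currently checked out in repo_dir."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

For example, `is_pinned(read_head("../lerobot"))` would report whether a lerobot checkout next to this repository is at the expected commit; the `../lerobot` location is an assumption about the workspace.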

3. Directory Layout

manipulation/act/train/
├── README.md
├── doc/
│   └── README.md
└── src/
    ├── configs/
    │   ├── act_aloha.yaml
    │   └── act_aloha_smoke.yaml
    ├── patches/
    │   └── lerobot_ascend_train_common.patch
    └── scripts/
        ├── run_eval.sh
        ├── run_train.sh
        └── setup.sh

4. Environment Setup

4.1 Clone the code

git clone https://gitcode.com/cann/cann-recipes-embodied-intelligence.git
cd cann-recipes-embodied-intelligence

4.2 Prepare `lerobot`

Run:

chmod +x manipulation/act/train/src/scripts/setup.sh
./manipulation/act/train/src/scripts/setup.sh

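The script will:

prepare the lerobot repository in the same parent directory as cann-recipes;
check out the pinned commit 58f70b6bd370864139a3795ac3497a9eae8c42d5;
apply the currently verified Ascend training patch (including the video-decoding tolerance fix that ACT needs when using torchcodec);
install the general LeRobot Python dependencies required by ACT, plus gym-aloha;
reuse the torch/torch_npu from the currently activated environment by default;
optionally, for a fresh environment, create a conda environment via command-line arguments and install the platform-specific torch/torchvision/torch_npu from local wheels.

Common usage: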

# Show the script help
./manipulation/act/train/src/scripts/setup.sh --help

# Use the Ascend environment that is already prepared
./manipulation/act/train/src/scripts/setup.sh

# Create a new conda environment and install the platform stack from local wheels
./manipulation/act/train/src/scripts/setup.sh \
  --create-conda \
  --env-name lerobot-act \
  --python-version 3.10 \
  --torch-wheel /path/to/torch.whl \
  --torchvision-wheel /path/to/torchvision.whl \
  --torch-npu-wheel /path/to/torch_npu.whl

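Notes:

The script does not hard-code a torch_npu download link because the valid wheel combination depends on the host architecture, the CANN version, and the Ascend software stack;
these platform dependencies are best reused from an existing training environment or supplied by the user as local wheels.
If the platform stack has already been confirmed to work, you can append --skip-torch-check to skip the final import check.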
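Before choosing wheels, it can help to record the host architecture and see which parts of the platform stack are already present. The sketch below is an illustration under assumptions, not the check that setup.sh performs: `missing_modules` and `platform_summary` are hypothetical helpers, and they only test whether a module can be located, which is weaker than actually importing it.

```python
import importlib.util
import platform


def missing_modules(names):
    """Return the top-level module names that cannot be located.

    find_spec only searches for the module; it does not import it, so a
    package that is installed but broken would still count as present.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


def platform_summary(required=("torch", "torchvision", "torch_npu")):
    """Collect the host architecture and the missing platform packages."""
    return {
        "machine": platform.machine(),  # e.g. "aarch64" or "x86_64"
        "missing": missing_modules(required),
    }
```

Printing `platform_summary()` inside the target environment gives a quick picture of what still has to come from local wheels.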

4.3 Dataset path

The current configs use a workspace-relative path by default:

../dataset/lerobot/aloha_sim_transfer_cube_human

To change it, edit:

src/configs/act_aloha.yaml
src/configs/act_aloha_smoke.yaml

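These relative paths are resolved against the lerobot root directory by default. The recommended workspace layout is:

<workspace>/
├── cann-recipes-embodied-intelligence/
├── lerobot/
├── dataset/
│   └── lerobot/
│       └── aloha_sim_transfer_cube_human/
└── ckpt/

Requirement: root must point directly at the dataset root directory that contains data/ and meta/.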
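The requirement on root can be checked before launching a run. This is a minimal sketch; the helper name `missing_dataset_dirs` is hypothetical, and it only verifies that the two subdirectories named above exist, not that their contents are valid.

```python
from pathlib import Path

# Subdirectories that the dataset root must contain directly.
REQUIRED_SUBDIRS = ("data", "meta")


def missing_dataset_dirs(root):
    """Return the required subdirectories that are absent under root."""
    root = Path(root)
    return [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]
```

An empty list means the path looks like a dataset root. A result of `["data", "meta"]` usually means the path points one level too high or too low.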

4.4 ResNet18 weight cache

ACT uses the following by default:

pretrained_backbone_weights: ResNet18_Weights.IMAGENET1K_V1

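On the first training or evaluation run, PyTorch may try to download resnet18-f37072fd.pth. In an environment without internet access, place the file in the current user's PyTorch weight cache directory ahead of time, for example: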

~/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth

On a machine with internet access, download it from the official PyTorch URL:

wget -O resnet18-f37072fd.pth https://download.pytorch.org/models/resnet18-f37072fd.pth

Alternatively:

curl -L https://download.pytorch.org/models/resnet18-f37072fd.pth -o resnet18-f37072fd.pth

After downloading, copy the file into the PyTorch weight cache directory on the target machine:

mkdir -p ~/.cache/torch/hub/checkpoints
cp resnet18-f37072fd.pth ~/.cache/torch/hub/checkpoints/

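If TORCH_HOME is set, the actual cache directory is $TORCH_HOME/hub/checkpoints/. The following command prints the hub directory of the current environment; the weights live in its checkpoints/ subdirectory: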

python -c "import torch; print(torch.hub.get_dir())"

If the server has no internet access and the weights were not cached in advance, ACT fails during model construction.
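The cache lookup described above can be approximated without importing torch, which is handy for a pre-flight check on an offline machine. The sketch below is an assumption-laden illustration: `checkpoint_dir` and `weight_is_cached` are hypothetical helpers, the XDG_CACHE_HOME fallback reflects PyTorch's documented default rather than anything stated in this sample, and a directory changed at runtime with torch.hub.set_dir is not accounted for.

```python
import os
from pathlib import Path

WEIGHT_FILE = "resnet18-f37072fd.pth"


def checkpoint_dir(env=None):
    """Resolve the expected PyTorch checkpoint cache directory.

    TORCH_HOME wins when set; otherwise fall back to
    $XDG_CACHE_HOME/torch, and finally to ~/.cache/torch.
    """
    env = os.environ if env is None else env
    torch_home = env.get("TORCH_HOME")
    if torch_home is None:
        cache_home = env.get("XDG_CACHE_HOME", "~/.cache")
        torch_home = os.path.join(cache_home, "torch")
    return Path(torch_home).expanduser() / "hub" / "checkpoints"


def weight_is_cached(env=None):
    """Return True when the ResNet18 weight file is already in the cache."""
    return (checkpoint_dir(env) / WEIGHT_FILE).is_file()
```

Running `weight_is_cached()` before training gives an early warning instead of a failure during model construction.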

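5. Training Configs

5.1 Smoke config

Config file: src/configs/act_aloha_smoke.yaml
Purpose: quickly verify the environment, data, dependencies, and the multi-card training pipeline
Key parameters:
- steps: 20
- wandb.enable: false

Launch: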

./manipulation/act/train/src/scripts/run_train.sh act_aloha_smoke --port 29510

5.2 Long-training config

Config file: src/configs/act_aloha.yaml
Key parameters:
- dataset.video_backend: torchcodec
- steps: 100000
- batch_size: 8
- num_workers: 4
- wandb.enable: true

Launch:

./manipulation/act/train/src/scripts/run_train.sh act_aloha --port 29510
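Both launch commands pass --port 29510. Assuming this is the TCP port used for multi-card rendezvous (the original does not say so explicitly), a quick check that the port is unused can save a failed launch. The helper name `port_is_free` is hypothetical.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use".
            return False
    return True
```

If `port_is_free(29510)` returns False, pick another value for --port. The check is only a snapshot, so another process could still take the port before training starts.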

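6. Evaluation

run_eval.sh is only a thin wrapper around lerobot-eval; arguments are passed through unchanged.

For online evaluation, run MuJoCo simulation and rendering on the CPU and keep policy inference on the NPU;
see doc/README.md for the reasons.

Example: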

export MUJOCO_GL=osmesa
./manipulation/act/train/src/scripts/run_eval.sh \
  --policy.path=/path/to/pretrained_model \
  --policy.device=npu \
  --env.type=aloha \
  --env.task=AlohaTransferCube-v0 \
  --eval.n_episodes=100 \
  --eval.batch_size=20 \
  --output_dir=/path/to/eval_out

Notes:

MUJOCO_GL=osmesa makes MuJoCo use CPU software rendering;
--policy.device=npu keeps the model's forward inference on the NPU;
this setup corresponds to "simulation on CPU, inference on NPU".
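When evaluation is driven from a Python launcher rather than a shell, the same split can be expressed by building the environment and the argument list explicitly. This is a sketch under assumptions: `build_eval_invocation` is a hypothetical helper, and it only assembles the flags shown in the example above without running anything.

```python
import os

RUN_EVAL = "./manipulation/act/train/src/scripts/run_eval.sh"


def build_eval_invocation(policy_path, output_dir, n_episodes=100,
                          batch_size=20, base_env=None):
    """Build (env, argv) for a CPU-rendered, NPU-inferred evaluation run."""
    env = dict(os.environ if base_env is None else base_env)
    env["MUJOCO_GL"] = "osmesa"  # MuJoCo renders in software on the CPU
    argv = [
        RUN_EVAL,
        f"--policy.path={policy_path}",
        "--policy.device=npu",  # policy forward pass stays on the NPU
        "--env.type=aloha",
        "--env.task=AlohaTransferCube-v0",
        f"--eval.n_episodes={n_episodes}",
        f"--eval.batch_size={batch_size}",
        f"--output_dir={output_dir}",
    ]
    return env, argv
```

The result can be handed to `subprocess.run(argv, env=env, check=True)` from the repository root. Keeping the environment in a copy avoids changing MUJOCO_GL for the calling process.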

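7. Summary of Verified Results

The sample now decodes video with torchcodec by default. One verified set of reference results:

Training task: ACT on lerobot/aloha_sim_transfer_cube_human
Task environment: AlohaTransferCube-v0
Data size: 50 episodes, 20000 frames
Training hardware: Ascend Atlas A2, 8 cards
Training steps: 100000
Training batch config: batch_size: 8, global batch size 64
Statistics window: W&B train/steps = 5000 ~ 20000
Evaluation protocol: 5 x 100 episodes
Overall evaluation success rate: 68.0%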
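The batch figures above are related by simple arithmetic, sketched below. Two assumptions are made that the original does not state outright: batch_size is the per-card value, and each dataset frame counts as one training sample. The function names are hypothetical.

```python
def global_batch_size(num_cards, per_card_batch):
    """Samples consumed per optimizer step across all cards."""
    return num_cards * per_card_batch


def passes_over_dataset(steps, global_batch, num_frames):
    """Approximate number of passes over the dataset during training."""
    return steps * global_batch / num_frames


gbs = global_batch_size(num_cards=8, per_card_batch=8)
print(gbs)  # 64
print(passes_over_dataset(steps=100000, global_batch=gbs, num_frames=20000))  # 320.0
```

Under these assumptions, the 100000-step run sees the 20000-frame dataset roughly 320 times.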

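A quick 100-step throughput check has also been completed. It can serve as the reference best result under the current configuration:

Scenario	Statistics window	mean_updt_s	mean_data_s	end-to-end samples/s
`8 cards x bs64 x torchcodec`	`step 10~100`	`0.3191`	`0.3544`	`760.24`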
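The throughput figure can be roughly reproduced from the two timing columns. The relation below is an inference from the reported numbers, not something the original states: it assumes that bs64 means 64 samples per card and that data loading time and update time add up per step without overlap. The function name is hypothetical.

```python
def end_to_end_samples_per_s(global_batch, mean_updt_s, mean_data_s):
    """Samples per second if each step costs data time plus update time."""
    return global_batch / (mean_updt_s + mean_data_s)


# 8 cards x 64 samples per card (assumed reading of "bs64").
estimate = end_to_end_samples_per_s(8 * 64, 0.3191, 0.3544)
print(round(estimate, 1))  # 760.2
```

The estimate lands close to the reported 760.24; the small gap is consistent with the means having been rounded to four decimals.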

For more detail on the environment, logs, checkpoint paths, and evaluation, see:

doc/README.md

Note:

video_backend is explicitly set to torchcodec in the default config.

8. W&B Record Placeholder

9. Common Commands

View training logs

cd ../lerobot
tail -f ../ckpt/logs/train_act_aloha_*.log
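When several runs have written logs, the glob above matches all of them. A small helper can pick the most recent one to follow. This is a sketch; `newest_log` is a hypothetical name, and the default pattern simply mirrors the one in the tail command.

```python
from pathlib import Path


def newest_log(log_dir, pattern="train_act_aloha_*.log"):
    """Return the most recently modified matching log, or None if there is none."""
    candidates = list(Path(log_dir).glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)
```

For example, `newest_log("../ckpt/logs")` returns the path to pass to `tail -f`.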

Resume training

./manipulation/act/train/src/scripts/run_train.sh act_aloha --resume --port 29510

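10. Related Notes

This sample directory does not contain the lerobot source code;
to extend to other ALOHA datasets later, add a new YAML config.
This sample is modeled on https://gitcode.com/cann/cann-recipes-embodied-intelligence/blob/master/manipulation/pi05/train/README.md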

Disclosure: parts of this article were generated with AI assistance (AIGC) and are for reference only.
