Qwen3-30B-A3B Medical SFT Training Example

cann-recipes-train provides CANN-based optimization samples for typical models and acceleration algorithms used in LLM and multimodal model training. Project address: https://gitcode.com/cann/cann-recipes-train

This example uses the torchtitan-npu framework to fine-tune Qwen3-30B-A3B on a medical domain SFT task. Training effectiveness is measured via Keyword Recall on medical Q&A samples.

The Medical R1 dataset (question/think/answer three-field format) is used for training. The MoE parallelism config (EP=8) enables full-parameter fine-tuning on a single node with 16 cards. Evaluation uses vLLM + vLLM-Ascend to compare the base model, CPT checkpoint, and SFT model under the same conditions.

Supported Products

| Item | Spec |
|---|---|
| Product | Atlas A3 series |
| Recommended cards | 16 (EP=8) |
| CANN version | 9.0.0 |
| Python | 3.11 |
| Training framework | torchtitan-npu |
| Inference framework | vLLM + vLLM-Ascend |

Files

| File | Description |
|---|---|
| README_EN.md | This document |
| README.md | Chinese documentation |
| config_registry_medical.py | torchtitan-npu Qwen3-30B-A3B medical SFT config |
| run_medical_sft.sh | Training launch script (copy to torchtitan-npu dir before running) |
| prepare_medical_r1_dataset.py | Medical R1 dataset split tool |
| figures/training_loss.png | Training loss curve (Epoch 1-5, optimal at step 156) |

Environment Setup

1. Docker Container

Use an Ascend training image with CANN 9.0.0 and Python 3.11 pre-installed. Example for a single-node 16-card setup:

    docker run -itd \
        --device=/dev/davinci0 --device=/dev/davinci1 \
        --device=/dev/davinci2 --device=/dev/davinci3 \
        --device=/dev/davinci4 --device=/dev/davinci5 \
        --device=/dev/davinci6 --device=/dev/davinci7 \
        --device=/dev/davinci_manager --device=/dev/devmm_svm \
        --device=/dev/hisi_hdc \
        -v /usr/local/dcmi:/usr/local/dcmi \
        -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
        -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
        -v /home:/home \
        -v /data:/data \
        --net=host \
        --shm-size=128g \
        --privileged \
        --name qwen3_30b_medical_sft \
        cann:9.0.0-a3-openeuler24.03-py3.11 \
        /bin/bash

Initialize CANN after entering the container. CANN paths vary by deployment method, so adjust them to match your environment:

    # Docker image default path
    source /usr/local/Ascend/ascend-toolkit/set_env.sh

    # Conda installation path (example for CANN 9.0.0)
    source /home/developer/Ascend/cann-9.0.0/set_env.sh
    source /home/developer/Ascend/nnal/atb/set_env.sh

    # If the system libstdc++ is too old
    export LD_PRELOAD=/path/to/conda/envs/torchtitan/lib/libstdc++.so.6

2. Install torchtitan-npu

    git clone https://link.gitcode.com/i/a16fe6012169aa86df6ff4c2d4faa8cd.git
    cd torchtitan-npu
    pip install -r requirements.txt
    pip install -e .

Dataset

Download r1_data_example.jsonl from ModelScope and place it in the assets directory of torchtitan-npu:

    cd /path/to/torchtitan-npu
    mkdir -p assets
    # Manually download from
    #   https://modelscope.cn/datasets/krisfu/delicate_medical_r1_data/files
    # to assets/
    ls assets/r1_data_example.jsonl

Then split using the provided script:

    python /path/to/recipe/prepare_medical_r1_dataset.py \
        --input ./assets/r1_data_example.jsonl \
        --output ./assets/medical_r1

Split result:

| Dataset | Samples | Use |
|---|---|---|
| train.jsonl | ~2,166 | SFT training |
| test.jsonl | ~241 | Keyword Recall evaluation |
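The internals of prepare_medical_r1_dataset.py are not shown here. The sample counts above (2,166 + 241 = 2,407) are consistent with a roughly 90/10 split, so the following is only a minimal sketch of how such a split could be produced. The function names, the 0.1 test ratio, the seed, and the shuffling are all assumptions for illustration, not the script's actual behavior.

```python
import json
import random


def split_records(records, test_ratio=0.1, seed=42):
    """Shuffle records deterministically and split them into (train, test).

    test_ratio and seed are illustrative assumptions, not values taken
    from prepare_medical_r1_dataset.py.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def write_jsonl(path, records):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


# Toy records in the question/think/answer three-field format.
records = [
    {"question": f"q{i}", "think": f"t{i}", "answer": f"a{i}"}
    for i in range(20)
]
train, test = split_records(records)
print(len(train), len(test))
```

Applied to 2,407 records with the same ratio, this sketch yields 2,166 training and 241 test samples, which matches the table above.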

Model Weights

Download Qwen3-30B-A3B weights (~60 GB) from ModelScope and create a symlink in the torchtitan-npu source directory:

    pip install modelscope
    mkdir -p /data/models/Qwen3-30B-A3B
    modelscope download \
        --model Qwen/Qwen3-30B-A3B \
        --local_dir /data/models/Qwen3-30B-A3B

    cd /path/to/torchtitan-npu
    mkdir -p assets/hf
    ln -sf /data/models/Qwen3-30B-A3B assets/hf/Qwen3-30B-A3B

Training Configuration

Config Registration

Copy config_registry_medical.py to the torchtitan-npu source:

    cp /path/to/recipe/config_registry_medical.py \
        /path/to/torchtitan-npu/torchtitan_npu/models/qwen3/config_registry_medical.py

Then append the following to torchtitan_npu/models/qwen3/config_registry.py:

    from torchtitan_npu.models.qwen3.config_registry_medical import (
        sft_qwen3_30ba3b_medical,
        sft_qwen3_30ba3b_medical_tnd,
    )

Parallelism Strategy

Single-node 16-card MoE parallelism (CP=2, EP=8, TP=2):

| Parameter | Value | Description |
|---|---|---|
| NGPU | 16 | Total cards |
| context_parallel_degree | 2 | Context parallelism |
| tensor_parallel_degree | 2 | Tensor parallelism |
| expert_parallel_degree | 8 | 128 experts sharded along EP |
| pipeline_parallel_degree | 1 | PP disabled |
| data_parallel_shard_degree | -1 | FSDP full shard (mesh size = 4) |
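The FSDP mesh size of 4 follows from the other degrees. The sketch below reproduces that arithmetic under the assumption that a value of -1 means "use whatever ranks remain after CP, TP, and PP are fixed". Since CP x TP x EP = 32 exceeds the 16 available cards, it also assumes that EP reuses ranks from the other dimensions rather than adding a factor of its own. The helper name resolve_dp_shard is hypothetical.

```python
def resolve_dp_shard(world_size, cp, tp, pp=1, dp_replicate=1):
    """Infer the FSDP shard degree when data_parallel_shard_degree is -1.

    Assumption: -1 means "fill the remaining ranks"; EP is not a factor here.
    """
    fixed = cp * tp * pp * dp_replicate
    if world_size % fixed != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by cp*tp*pp*dp_replicate={fixed}"
        )
    return world_size // fixed


NGPU = 16
dp_shard = resolve_dp_shard(NGPU, cp=2, tp=2, pp=1)

# 128 experts sharded across expert_parallel_degree=8
experts_per_ep_rank = 128 // 8

print(dp_shard, experts_per_ep_rank)
```

With the values from the table, the inferred shard degree is 4 and each EP rank holds 16 of the 128 experts.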

Hyperparameters

| Config | Recommended Value | Description |
|---|---|---|
| steps | 156 | Training steps (5 epochs, ~31 steps/epoch) |
| lr | 2e-5 | Learning rate |
| warmup_steps | 5 | Warmup steps |
| local_batch_size | 1 | Per-device batch size |
| seq_len | 4096 | Sequence length |
| activation_checkpoint | selective | Selective recomputation |
| TRAIN_DATA | Split training set | Training data path, set via the TRAIN_DATA env var |
| MODEL_DIR | assets/hf/Qwen3-30B-A3B | HF weights path |

Sample Format

This example uses the R1 think template format, wrapping the dataset's think field in <think> tags:

    def _process_sample(sample):
        output = f"<think>\n{sample['think']}\n</think>\n\n{sample['answer']}"
        return [
            {"role": "user", "content": sample["question"]},
            {"role": "assistant", "content": output},
        ]
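To make the resulting message layout concrete, the snippet below applies the same function to a single record. The record contents are invented for illustration and do not come from the Medical R1 dataset.

```python
def _process_sample(sample):
    # Wrap the reasoning in <think> tags, then append the final answer.
    output = f"<think>\n{sample['think']}\n</think>\n\n{sample['answer']}"
    return [
        {"role": "user", "content": sample["question"]},
        {"role": "assistant", "content": output},
    ]


# Hypothetical record in the question/think/answer format.
sample = {
    "question": "What are the two components of consciousness?",
    "think": "Recall the standard neurological description.",
    "answer": "Arousal and content.",
}

messages = _process_sample(sample)
for message in messages:
    print(message["role"], "->", repr(message["content"]))
```

The assistant turn carries both the reasoning and the answer, so the model is trained to emit a single well-formed <think> block before answering.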

Attention Variants

| Config function | Attention type | Description |
|---|---|---|
| sft_qwen3_30ba3b_medical | BSND (SDPA) | Reference only |
| sft_qwen3_30ba3b_medical_tnd | TND (NPUVarlenAttention) | Recommended, validated |

Use CONFIG=sft_qwen3_30ba3b_medical_tnd for the TND variant.

Training

Launch

Copy the launch script to the torchtitan-npu directory and execute:

    cp /path/to/recipe/run_medical_sft.sh /path/to/torchtitan-npu/
    cd /path/to/torchtitan-npu
    bash run_medical_sft.sh

The script uses the environment variables NGPU=16 and CONFIG=sft_qwen3_30ba3b_medical_tnd (the TND variant). Example log output (EP=8):

    step: 1 loss: 1.45426 memory: 37.73GiB(61.58%) tps: 59 69.018s (compilation)
    step: 2 loss: 1.39178 memory: 52.27GiB(85.31%) tps: 798 5.135s
    step: 3 loss: 1.26931 memory: 52.31GiB(85.37%) tps: 1215 3.370s
    step: 10 loss: 1.02183 memory: 52.44GiB(85.59%) tps: 993 4.126s
    step: 20 loss: 0.95751 memory: 52.44GiB(85.59%) tps: 1199 3.416s
    step: 31 loss: 0.70617 memory: 52.44GiB(85.59%) tps: 1345 3.046s ← epoch 1 end
    step: 32 loss: 0.67716 memory: 52.44GiB(85.59%) tps: 701 5.842s
    step: 50 loss: 0.58786 memory: 52.50GiB(85.69%) tps: 1010 4.056s
    step: 62 loss: 0.34057 memory: 52.56GiB(85.79%) tps: 1177 3.479s ← epoch 2 end
    step: 63 loss: 0.33076 memory: 52.56GiB(85.79%) tps: 803 5.102s
    step: 90 loss: 0.19230 memory: 52.56GiB(85.79%) tps: 733 5.590s
    step: 93 loss: 0.16940 memory: 52.56GiB(85.79%) tps: 1014 4.040s ← epoch 3 end
    step: 94 loss: 0.16507 memory: 52.56GiB(85.79%) tps: 1286 3.185s
    step: 120 loss: 0.08754 memory: 52.62GiB(85.88%) tps: 942 4.349s
    step: 124 loss: 0.08219 memory: 52.62GiB(85.88%) tps: 1257 3.260s ← epoch 4 end
    step: 125 loss: 0.08480 memory: 52.62GiB(85.88%) tps: 1274 3.215s
    step: 150 loss: 0.04411 memory: 52.62GiB(85.88%) tps: 918 4.462s
    step: 155 loss: 0.04376 memory: 52.62GiB(85.88%) tps: 1199 3.416s
    step: 156 loss: 0.04450 memory: 52.62GiB(85.88%) tps: 1244 3.292s ← end (epoch 5)
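One way to sanity-check these log lines is to multiply the reported tps by the step time. The sketch below parses a few of the lines above with a regular expression written for the format as printed in this document (the real logger output may carry extra prefixes). For the sampled steps the product comes out close to seq_len x local_batch_size = 4,096, which suggests, though it does not prove, that tps is reported per card.

```python
import re

# A few lines copied from the example log above.
LOG = """\
step: 2 loss: 1.39178 memory: 52.27GiB(85.31%) tps: 798 5.135s
step: 3 loss: 1.26931 memory: 52.31GiB(85.37%) tps: 1215 3.370s
step: 156 loss: 0.04450 memory: 52.62GiB(85.88%) tps: 1244 3.292s
"""

# Pattern matches the line layout shown in this document only.
LINE_RE = re.compile(
    r"step:\s*(?P<step>\d+)\s+loss:\s*(?P<loss>[\d.]+)\s+"
    r"memory:\s*(?P<mem>[\d.]+)GiB\((?P<pct>[\d.]+)%\)\s+"
    r"tps:\s*(?P<tps>\d+)\s+(?P<secs>[\d.]+)s"
)


def parse_log(text):
    """Return one dict per step found in the log text."""
    rows = []
    for m in LINE_RE.finditer(text):
        rows.append({
            "step": int(m["step"]),
            "loss": float(m["loss"]),
            "mem_gib": float(m["mem"]),
            "tps": int(m["tps"]),
            "secs": float(m["secs"]),
        })
    return rows


SEQ_LEN, LOCAL_BATCH_SIZE = 4096, 1
for row in parse_log(LOG):
    tokens_per_step = row["tps"] * row["secs"]
    print(row["step"], round(tokens_per_step), SEQ_LEN * LOCAL_BATCH_SIZE)
```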

Training Loss Curve

Based on the loss curve, epoch 5 (step 156) is the optimal stopping point: the loss decline flattens after step 150, and training beyond 186 steps (epoch 6+) enters the overfitting regime with no meaningful loss improvement.

Model Export

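With last_save_in_hf=True, the final checkpoint is exported in HuggingFace format. Copy the config and tokenizer files from the base model next to the exported weights to assemble a complete model directory: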

    mkdir -p /data/models/Qwen3-30B-A3B-SFT
    cp /data/models/Qwen3-30B-A3B/*.json /data/models/Qwen3-30B-A3B-SFT/
    cp /data/models/Qwen3-30B-A3B/tokenizer* /data/models/Qwen3-30B-A3B-SFT/
    cp checkpoint_medical/step-156/*.safetensors* /data/models/Qwen3-30B-A3B-SFT/

Evaluation Results

Evaluation Method

This experiment uses a jieba-based keyword extraction method with POS tagging (n, v, a, i, j, l) to extract keywords from both reference answers and model outputs, then computes:

  • Recall = matched reference keywords / total reference keywords
  • Precision = matched reference keywords / total model keywords
  • F1 = harmonic mean of Recall and Precision
  • Think Rate = proportion of outputs containing <think> reasoning
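The four metrics above can be sketched in a few lines. The evaluation script itself is not part of this document, so the function names, the use of unique keyword sets, and the exact-match POS filter are assumptions. The extraction helper relies on jieba.posseg, is imported lazily, and is not exercised in the example.

```python
POS_TAGS = {"n", "v", "a", "i", "j", "l"}


def extract_keywords(text, pos_tags=POS_TAGS):
    """Extract keywords by POS tag with jieba (requires `pip install jieba`).

    Assumption: tags are matched exactly; the real script may match prefixes.
    """
    import jieba.posseg as pseg  # lazy import so the metrics run without jieba

    return {word for word, flag in pseg.cut(text) if flag in pos_tags}


def keyword_scores(ref_keywords, out_keywords):
    """Return (recall, precision, f1) over unique keywords."""
    ref, out = set(ref_keywords), set(out_keywords)
    matched = ref & out
    recall = len(matched) / len(ref) if ref else 0.0
    precision = len(matched) / len(out) if out else 0.0
    total = recall + precision
    f1 = 2 * recall * precision / total if total else 0.0
    return recall, precision, f1


def think_rate(outputs):
    """Proportion of outputs that contain a <think> block."""
    if not outputs:
        return 0.0
    return sum("<think>" in text for text in outputs) / len(outputs)


# Hypothetical pre-extracted keyword sets for one sample.
ref = {"意识", "觉醒", "内容", "系统"}
out = {"意识", "内容", "系统", "开关", "组成"}
recall, precision, f1 = keyword_scores(ref, out)
print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
print(think_rate(["<think>\n...\n</think>\n\nanswer", "answer only"]))
```

In this toy case three of the four reference keywords are matched, giving a recall of 0.75 against a precision of 0.60.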

Evaluation data: 241 medical Q&A samples. Base model, CPT intermediate checkpoint, and SFT model are compared under the same conditions.

Keyword Recall Comparison

| Model | Recall | Precision | F1 |
|---|---|---|---|
| Base (Qwen3-30B-A3B) | 53.83% | 25.16% | 33.30% |
| CPT Checkpoint (step 156) | 62.45% | 28.06% | 37.82% |
| Improvement | +8.62pp | +2.90pp | +4.52pp |

Output Format Comparison

| Metric | Base | CPT |
|---|---|---|
| Avg output length | 1,061 chars | 831 chars (-21.7%) |
| Format errors (repeated </think>) | 199/241 | 9/241 |
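A format error here means an output that contains the closing </think> tag more than once. The check below is a minimal sketch of that criterion, using a hypothetical helper name and invented outputs. It also reproduces the relative length change from the table (1,061 to 831 characters).

```python
def has_repeated_think_close(text):
    """True when the closing </think> tag appears more than once."""
    return text.count("</think>") > 1


# Invented outputs for illustration only.
outputs = [
    "<think>\nreasoning\n</think>\n\nanswer",
    "</think> part one </think> part two </think>",
    "plain answer without reasoning",
]
n_errors = sum(has_repeated_think_close(o) for o in outputs)
print(f"format errors: {n_errors}/{len(outputs)}")

# Relative change in average output length, from the table above.
base_len, cpt_len = 1061, 831
change_pct = round((cpt_len - base_len) / base_len * 100, 1)
print(f"length change: {change_pct}%")
```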

Sample: "What are the two components of consciousness?"

| Item | Base Model | CPT Checkpoint |
|---|---|---|
| Answer | </think> The components... arousal... content... (Markdown list + 3x </think>) | Consciousness consists of two parts: the content and the switch system... (conversational) |
| Recall | 52.4% | 95.2% |
| Length | 392 chars | 287 chars |

Training Metrics

| Metric | Value |
|---|---|
| Stable step time | ~3.2-3.5s |
| Stable memory | ~52.6 GiB/card (85.9%) |
| Loss start (step 1) | 1.45 |
| Loss end (step 156) | 0.045 |
| Total time (156 steps) | ~8-9 minutes |

Note: With CP=2, TP=2 the memory usage per card is ~52.6 GiB (85.9%).
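The total-time figure can be cross-checked against the stable step time. This is a rough steady-state estimate only: it assumes every step takes 3.2 to 3.5 seconds, so it ignores the compilation overhead of the first step and the occasional slower steps visible in the log.

```python
STEPS = 156
STEP_TIME_RANGE_S = (3.2, 3.5)  # stable step time from the table above

low_min, high_min = (STEPS * t / 60 for t in STEP_TIME_RANGE_S)
print(f"estimated steady-state training time: {low_min:.1f}-{high_min:.1f} minutes")
```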

FAQ

1. Loss starts abnormally high

If initial loss is significantly higher than expected (e.g., ~12), check whether HF pretrained weights were loaded correctly. Delete the checkpoint directory before re-running:

rm -rf checkpoint_medical
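The value of ~12 is not arbitrary. A model that predicts a uniform distribution over its vocabulary has a cross-entropy loss of ln(V), so an initial loss near that value indicates the model is behaving as if it were randomly initialized. The check below assumes a vocabulary size of 151,936 for Qwen3-30B-A3B; confirm it against vocab_size in the model's config.json.

```python
import math

# Assumption: vocab_size as listed in the Qwen3 HF config.json; verify locally.
VOCAB_SIZE = 151_936

# Cross-entropy of a uniform prediction over the vocabulary.
random_init_loss = math.log(VOCAB_SIZE)
print(f"expected loss for an untrained model: {random_init_loss:.2f}")

# Step-1 loss from the example log, with pretrained weights loaded correctly.
healthy_initial_loss = 1.45
print(f"gap vs. healthy start: {random_init_loss - healthy_initial_loss:.2f}")
```

A step-1 loss near 1.45, as in the example log, is far below this bound and indicates that the pretrained weights were picked up.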

2. NPU out of memory

Check for residual processes occupying NPU memory and ensure PYTORCH_NPU_ALLOC_CONF="expandable_segments:True" is set. If necessary, add torch.npu.set_per_process_memory_fraction(1.0) in the entry.py entry point.

3. HCCL communication timeout

Multi-card training may trigger HCCL watchdog timeout. If intermittent, restarting training usually resolves it. If frequent, check HCCL network configuration and inter-node communication.
