
Deploying Qwen3-Coder-Next on Ascend with vLLM Ascend: a step-by-step guide

news 2026/5/12 18:38:16

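On February 4, Qwen3-Coder-Next was officially open-sourced. It is an open-weight language model designed for coding agents and local development. Ascend support is available for developers who want to try it early. The adapted model and its weights are also available for download on the Modelers community (modelers.cn).

🔗 Weights: https://modelers.cn/models/Qwen-AI/Qwen3-Coder-Next
🔗 Ascend inference guide: https://modelers.cn/models/vLLM_Ascend/Qwen3-Coder-Next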

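01 Model highlights

Qwen3-Coder-Next is a highly sparse Mixture-of-Experts (MoE) model. It is built on Qwen3-Next-80B-A3B-Base and uses a new architecture that combines hybrid attention with MoE. Its agentic training relies on large-scale executable task synthesis, environment interaction, and reinforcement learning, which gives it strong coding and agent capabilities while significantly reducing inference cost.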
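As a rough illustration of what "highly sparse" means here, the sketch below computes the share of parameters that are active per token. It assumes the usual reading of the base model name, about 80B total parameters with about 3B activated (A3B). The exact parameter counts are not stated in this article, and the function name is only illustrative.

```python
# Assumption: "80B-A3B" is read as ~80B total and ~3B activated parameters.
TOTAL_PARAMS_B = 80.0
ACTIVE_PARAMS_B = 3.0


def active_fraction(active_b: float, total_b: float) -> float:
    """Return the fraction of parameters used for each token."""
    if total_b <= 0:
        raise ValueError("total_b must be positive")
    return active_b / total_b


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"Active per token: {fraction:.2%}")
```

Under these assumed numbers, only a few percent of the weights take part in each forward pass, which is where the reduced inference cost comes from.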

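Rather than relying on parameter scaling alone, Qwen3-Coder-Next focuses on scaling agentic training signals. It is trained on large-scale verifiable coding tasks in executable environments, so the model learns directly from environment feedback. Despite its small number of activated parameters, it matches or exceeds several larger open-source models on multiple agent benchmarks.

This lightweight, efficient code model can be integrated into a range of downstream applications and scenarios, such as OpenClaw, Qwen Code, Claude Code, web development, browser use, and Cline.

The rest of this guide walks through deploying the model on Ascend with vLLM Ascend.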

02 Getting the weights

The model weights can be downloaded from the Modelers community:

https://modelers.cn/models/Qwen-AI/Qwen3-Coder-Next

Qwen3-Coder-Next is supported in the vllm-ascend:v0.14.0rc1 image.

03 Deploying the model

Start the Docker container

# Update the vllm-ascend image.
# Replace |vllm_ascend_version| with the image tag you are using
# (v0.14.0rc1 is the release noted above as supporting this model).
# For Atlas A2 machines:
# export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
# For Atlas A3 machines:
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
docker run --rm \
    --shm-size=1g \
    --name qwen3-coder-next \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -p 8000:8000 \
    -it $IMAGE bash
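The command above exposes four NPUs (davinci0 to davinci3), which matches the tensor parallel size of 4 used later in this guide. If you need a different number of cards, the device flags can be generated rather than typed by hand. The following is a minimal sketch, assuming the NPUs are numbered contiguously from 0 on your host, which may not always be the case. The helper name `npu_device_flags` is hypothetical.

```python
import shlex


def npu_device_flags(num_npus: int) -> list[str]:
    """Build docker --device flags for the first `num_npus` NPUs.

    Assumption: NPUs appear as /dev/davinci0, /dev/davinci1, ... on the host.
    The three shared management devices are the ones used in the command above.
    """
    if num_npus < 1:
        raise ValueError("num_npus must be at least 1")
    flags: list[str] = []
    for index in range(num_npus):
        flags += ["--device", f"/dev/davinci{index}"]
    for shared in ("/dev/davinci_manager", "/dev/devmm_svm", "/dev/hisi_hdc"):
        flags += ["--device", shared]
    return flags


# Print the flags for four NPUs as a single shell-safe string.
print(shlex.join(npu_device_flags(4)))
```

The printed string can be pasted into the `docker run` command in place of the hand-written `--device` lines.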

Triton Ascend must be installed in your environment to run this model (https://gitee.com/ascend/triton-ascend).

pip install triton-ascend==3.2.0
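To confirm which version ended up installed inside the container, a small check with the standard library can help. This is a sketch only. The helper name `installed_version` is hypothetical, and it simply reports what the package metadata says.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("triton-ascend")
if version is None:
    print("triton-ascend is not installed")
else:
    print(f"triton-ascend {version} is installed")
```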

Inference

Offline inference

Run the following offline script, which feeds the model four prompts:

import os

# Set the environment variables before importing vllm.
os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
    # Create an LLM.
    llm = LLM(
        model="/path/to/model/Qwen3-Coder-Next/",
        tensor_parallel_size=4,
        trust_remote_code=True,
        max_model_len=10000,
        gpu_memory_utilization=0.8,
        max_num_seqs=4,
        max_num_batched_tokens=4096,
        compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"},
    )
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()

Online inference

Run the following command to start an online service:

vllm serve /path/to/model/Qwen3-Coder-Next/ \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.8 \
    --max-num-batched-tokens 4096 \
    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
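Loading the weights can take a while, so requests sent right after starting the service may fail. The sketch below polls the server until it answers with HTTP 200. It assumes the service listens on the default port 8000 and exposes the `/health` endpoint of the vLLM OpenAI-compatible server. The function names, retry count, and delay are illustrative choices, not part of vLLM.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return 0


def wait_until_ready(
    url: str,
    fetch: Callable[[str], int] = http_status,
    attempts: int = 60,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it returns 200 or the attempts run out."""
    for _ in range(attempts):
        if fetch(url) == 200:
            return True
        sleep(delay)
    return False
```

Calling `wait_until_ready("http://localhost:8000/health")` from a second shell in the container returns `True` once the service is up and `False` if it does not respond within the retry budget.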

Then run the following command to send a request to the model:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "The future of AI is",
        "model": "/path/to/model/Qwen3-Coder-Next/",
        "max_tokens": 100,
        "temperature": 0
    }'
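The same request can be sent from Python using only the standard library. This is a minimal sketch, not an official client. It names the served model through the `model` field of the OpenAI-compatible completions API, and the helper names `build_completion_request` and `send` are hypothetical.

```python
import json
import urllib.request


def build_completion_request(
    base_url: str,
    model: str,
    prompt: str,
    max_tokens: int = 100,
    temperature: float = 0.0,
) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, `send(build_completion_request("http://localhost:8000", "/path/to/model/Qwen3-Coder-Next/", "The future of AI is"))` returns the decoded JSON response.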

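When the run finishes, the model's answer looks like the following, shown here in the print format of the offline script: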

Prompt: 'The future of AI is', Generated text: ' not just about building smarter machines, but about creating systems that can collaborate with humans in meaningful, ethical, and sustainable ways. As AI continues to evolve, it will increasingly shape how we live, work, and interact — and the decisions we make today will determine whether this future is one of shared prosperity or deepening inequality.\n\nThe rise of generative AI, for example, has already begun to transform creative industries, education, and scientific research. Tools like ChatGPT, Midjourney, and'
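The online service returns its answer as JSON rather than as a printed line. In the OpenAI-compatible completions format, the generated text sits under `choices[0].text`. The sketch below extracts it. The `sample` body is hand-written for illustration and is not real server output, and `completion_text` is a hypothetical helper name.

```python
import json


def completion_text(response_body: str) -> str:
    """Return the text of the first choice in a completions response."""
    data = json.loads(response_body)
    choices = data.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["text"]


# Hand-written sample in the completions response shape (not real output).
sample = json.dumps(
    {
        "object": "text_completion",
        "choices": [
            {
                "index": 0,
                "text": " not just about building smarter machines",
                "finish_reason": "length",
            }
        ],
    }
)
print(repr(completion_text(sample)))
```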

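This is currently an early-access preview, and performance optimization is ongoing. If you run into any problems during deployment (including but not limited to functional or compliance issues), please file an issue in the model repository. The developers will review and respond promptly.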
🔗 https://modelers.cn/models/vLLM_Ascend/Qwen3-Coder-Next
