
Deploying the Qwen3-VL-2B model on EC2

The test environment is as follows:

g5.4xlarge
EBS: 200GB
AMI:ami-0a83c884ad208dfbc ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-20250419

Installing the NVIDIA driver and CUDA toolkit

List the PCIe devices (GPU performance specs: https://www.nvidia.cn/data-center/products/a10-gpu/):
$ lspci|grep NVIDIA
00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

Install the NVIDIA driver: choose the Data Center category, then locate and download the driver matching CUDA 12.8: https://www.nvidia.cn/drivers/details/242978/

sudo apt install linux-headers-$(uname -r) gcc make -y
sudo bash NVIDIA-Linux-x86_64-570.133.20.run

Error and cause: this message means the NVIDIA installer cannot find the source tree for the currently running kernel. The driver needs the kernel headers and sources to compile a module compatible with the running kernel.

ERROR: Unable to find the kernel source tree for the currently running kernel.  Please make sure you have installed the kernel source files for your kernel and that they are properly configured; on Red Hat Linux systems, for example, be sure you have the 'kernel-source' or 'kernel-devel' RPM installed.  If you know the correct kernel source files are installed, you may specify the kernel source path with the '--kernel-source-path' command line option.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Verify the installation succeeded: a single card in power state P8 with 23 GB of memory.

[Figure: nvidia-smi output showing the GPU]

Install the CUDA toolkit, using the same version: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=24.04&target_type=runfile_local

wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda_12.8.1_570.124.06_linux.run
sudo sh cuda_12.8.1_570.124.06_linux.run

The installer automatically compiles with make -j across all CPU cores; check the log for errors or the completion message:

$ cat /var/log/cuda-installer.log
...
[INFO]: Finished with code: 36096
[ERROR]: Install of driver component failed. Consult the driver log at /var/log/nvidia-installer.log for more details.

The bundled driver component repeatedly failed to compile, which looks like a kernel-version incompatibility:

/tmp/selfgz35754/NVIDIA-Linux-x86_64-570.124.06/kernel-open/nvidia/nv-mmap.c:321:5: warning: conflicting types for 'nv_encode_caching' due to enum/integer mismatch; have 'int(pgprot_t *, NvU32,  nv_memory_type_t)' {aka 'int(struct pgprot *, unsigned int,  nv_memory_type_t)'} [-Wenum-int-mismatch]

Compiling the CUDA toolkit kept failing, so I fell back to installing it via apt, as the docs suggest. This also shows an advantage of Docker containers: the CUDA toolkit is already baked into the image.

$ sudo apt install nvidia-cuda-toolkit

Add the CUDA paths to PATH. This is important: without it, later steps fail with "cannot find *.so" errors. (With the apt install this manual step is unnecessary.)

export PATH=/usr/local/cuda-12.x/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.x/lib64:$LD_LIBRARY_PATH
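These exports only affect the current shell. A sketch for persisting them across sessions, assuming the toolkit landed in /usr/local/cuda-12.8 (substitute your actual version directory):

```shell
#!/usr/bin/env bash
# Append the CUDA paths to ~/.bashrc so new shells pick them up.
# The cuda-12.8 directory is an assumption; adjust to the installed version.
cat >> "$HOME/.bashrc" <<'EOF'
export PATH=/usr/local/cuda-12.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
EOF
```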

Verify the installation:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

Both of the steps above involve compilation and take some time; prefer a relatively recent CUDA version.

Since the model will later run in Docker, add container runtime support; just follow the official docs:

sudo apt-get install -y nvidia-container-toolkit
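After installing the toolkit, Docker still needs to be pointed at the NVIDIA runtime. The steps below follow the NVIDIA Container Toolkit docs, with a smoke test at the end (the CUDA image tag is illustrative; pick one matching your driver):

```shell
# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Smoke test: the container should print the same GPU table as the host
docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu24.04 nvidia-smi
```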

Choosing and configuring the image

Download the Qwen3-VL-2B-Instruct model from ModelScope and load it into S3 at s3://bucketname/Qwen/Qwen3-VL-2B-Instruct/
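One way to get the weights into S3, sketched below; the bucket name is the placeholder used throughout, and this assumes the modelscope CLI and AWS credentials are already configured:

```shell
# Pull the weights from ModelScope, then sync the directory to S3
pip install modelscope
modelscope download --model Qwen/Qwen3-VL-2B-Instruct --local_dir ./Qwen3-VL-2B-Instruct
aws s3 sync ./Qwen3-VL-2B-Instruct s3://bucketname/Qwen/Qwen3-VL-2B-Instruct/
```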

According to the official Qwen3-VL documentation (https://github.com/QwenLM/Qwen3-VL), transformers >= 4.57.0 is required and vLLM >= 0.11 is recommended:

The Qwen3-VL model requires transformers >= 4.57.0

We recommend using vLLM for fast Qwen3-VL deployment and inference. You need to install vllm>=0.11.0 to enable Qwen3-VL support.

For environment isolation, deploy the model from a container image.

  • Official Qwen image: qwenllm/qwenvl:qwen3vl-cu128
  • Public vLLM 0.11 image: public.ecr.aws/deep-learning-containers/vllm:0.11-gpu-py312

When actually pulling them, you will find the two images share most of their layers.

The official Qwen image defines no entrypoint, so it can be started with a shell for debugging. The public ECR image vllm:0.11-gpu-py312 uses vLLM's official SageMaker entrypoint script, shown below; it mainly adapts the server to the SageMaker service's configuration conventions.

# /usr/local/bin/sagemaker_entrypoint.sh
bash /usr/local/bin/bash_telemetry.sh >/dev/null 2>&1 || true

PREFIX="SM_VLLM_"
ARG_PREFIX="--"
ARGS=(--port 8080)

while IFS='=' read -r key value; do
    arg_name=$(echo "${key#"${PREFIX}"}" | tr '[:upper:]' '[:lower:]' | tr '_' '-')
    ARGS+=("${ARG_PREFIX}${arg_name}")
    if [ -n "$value" ]; then
        ARGS+=("$value")
    fi
done < <(env | grep "^${PREFIX}")

exec python3 -m vllm.entrypoints.openai.api_server "${ARGS[@]}"
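To see how the script turns environment variables into CLI flags, its core loop can be exercised on its own (SM_VLLM_MAX_MODEL_LEN is just a sample variable):

```shell
#!/usr/bin/env bash
# Replicate the entrypoint's env-to-flag mapping with one sample variable.
export SM_VLLM_MAX_MODEL_LEN=8192

PREFIX="SM_VLLM_"
ARGS=(--port 8080)
while IFS='=' read -r key value; do
    # SM_VLLM_MAX_MODEL_LEN=8192 becomes --max-model-len 8192
    arg_name=$(echo "${key#"${PREFIX}"}" | tr '[:upper:]' '[:lower:]' | tr '_' '-')
    ARGS+=("--${arg_name}")
    if [ -n "$value" ]; then
        ARGS+=("$value")
    fi
done < <(env | grep "^${PREFIX}")

echo "${ARGS[*]}"   # --port 8080 --max-model-len 8192
```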

Start the container with the following command, mounting the model directory into it:

docker run --gpus all --ipc=host --network=host --rm --name qwen3vl \
-v /root/model/:/model -it qwenllm/qwenvl:qwen3vl-cu128 bash

The official Qwen image also ships the following dependencies, so no extra installation is needed. To support streaming model loading, change the vllm install to vllm[runai]:

# install requirements
pip install accelerate
pip install qwen-vl-utils==0.0.14
# pip install -U vllm
pip install -U "vllm[runai]"

Start the vLLM engine. The default --max-model-len overflows GPU memory, so set a smaller value; this model has no MoE layers, so drop --enable-expert-parallel; and since this is a single GPU, set --tensor-parallel-size to 1:

vllm serve /model/Qwen3-VL-2B-Instruct \
    --load-format runai_streamer \
    --tensor-parallel-size 1 \
    --mm-encoder-tp-mode data \
    --async-scheduling \
    --media-io-kwargs '{"video": {"num_frames": -1}}' \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8945

List the available models:

$ curl 127.0.0.1:8000/v1/models
{"object":"list","data":[{"id":"/model/Qwen3-VL-2B-Instruct","object":"model","created":1762944040,"owned_by":"vllm","root":"/model/Qwen3-VL-2B-Instruct","parent":null,"max_model_len":8945,"permission":[{"id":"modelperm-3310ed85128d425793b2c15bb5cb3d79","object":"model_permission","created":1762944040,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

Try a call:

curl http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer EMPTY" \
    -d '{
        "model": "/model/Qwen3-VL-2B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://n.sinaimg.cn/sinacn19/0/w2000h2000/20180618/d876-heauxvz1345994.jpg"}},
                {"type": "text", "text": "帮我解读一下"}
            ]
        }],
        "max_tokens": 1024
    }'

View the result:

[Figure: model response to the image request]

Video inference fails with the error below, which means the prompt exceeded --max-model-len; either adjust that parameter or deploy on a larger instance type:

{"error":{"message":"The decoder prompt (length 13642) is longer than the maximum model length of 8945. Make sure that `max_model_len` is no smaller than the number of text tokens plus multimodal tokens. For image inputs, the number of image tokens depends on the number of images, and possibly their aspect ratios as well.","type":"BadRequestError","param":null,"code":400}}
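Two mitigations, mirroring the flags of the serve command used above (the specific numbers are illustrative and constrained by the card's 24 GB): cap the sampled video frames so fewer multimodal tokens are produced, or raise --max-model-len.

```shell
# Sketch: cap video sampling at 32 frames and double the context budget
vllm serve /model/Qwen3-VL-2B-Instruct \
    --load-format runai_streamer \
    --tensor-parallel-size 1 \
    --host 0.0.0.0 \
    --port 8000 \
    --media-io-kwargs '{"video": {"num_frames": 32}}' \
    --max-model-len 16384
```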

The public image public.ecr.aws/deep-learning-containers/vllm:0.11-gpu-py312 can be deployed on an EKS cluster as follows. A serviceAccount is specified so the pod can stream the model files directly from S3, and the vLLM engine arguments are passed via SM_VLLM_-prefixed environment variables. Deploying via docker-compose would also work.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-openai-qwen3-vl
  namespace: aitao
  labels:
    app: vllm-openai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-openai
  template:
    metadata:
      labels:
        app: vllm-openai
    spec:
      serviceAccount: sa-service-account-api
      nodeSelector:
        eks.amazonaws.com/nodegroup: llm-ng
      containers:
        - name: vllm-openai-container
          image: public.ecr.aws/deep-learning-containers/vllm:0.11-gpu-py312
          env:
            - name: REGION
              value: cn-northwest-1
            - name: SM_VLLM_MODEL
              value: s3://bucketname/Qwen/Qwen3-VL-2B-Instruct
            - name: SM_VLLM_MAX_MODEL_LEN
              value: "24896"
            - name: SM_VLLM_GPU_MEMORY_UTILIZATION
              value: "0.9"
            - name: SM_VLLM_PORT
              value: "8000"
            - name: SM_VLLM_TENSOR_PARALLEL_SIZE
              value: "2"
            - name: SM_VLLM_LOAD_FORMAT
              value: runai_streamer
            - name: SM_VLLM_SERVED_MODEL_NAME
              value: Qwen3-VL
            - name: SM_VLLM_MAX_NUM_SEQS
              value: "1024"
            - name: AWS_DEFAULT_REGION
              value: cn-northwest-1
          ports:
            - containerPort: 8000
              name: http-api
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "16Gi"
              cpu: "4"
            requests:
              nvidia.com/gpu: 1
              memory: "16Gi"
              cpu: "4"
      restartPolicy: Always

It fails with:

(APIServer pid=1) OSError: Can't load the configuration of 's3://bucketname/Qwen/Qwen3-VL-2B-Instruct/'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 's3://bucketname/Qwen/Qwen3-VL-2B-Instruct/' is the correct path to a directory containing a config.json file

Presumably the public image lacks the runai plugin and its dependencies. Build a new image with the following Dockerfile and push it to an ECR repository:

FROM public.ecr.aws/deep-learning-containers/vllm:0.11-gpu-py312

# configure the PyPI mirror
RUN pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/

# install the required dependencies
RUN pip install accelerate \
    && pip install qwen-vl-utils==0.0.14 \
    && uv pip install -U "vllm[runai]"
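Building and pushing, sketched below; the registry ID is the placeholder from the deployment snippet, and this assumes ECR push permissions in the cn-north-1 region:

```shell
# Authenticate to ECR, then build and push the custom image
aws ecr get-login-password --region cn-north-1 \
    | docker login --username AWS --password-stdin xxxxxxxxx.dkr.ecr.cn-north-1.amazonaws.com.cn
docker build -t xxxxxxxxx.dkr.ecr.cn-north-1.amazonaws.com.cn/zhaojiew/vllm-sagemaker:latest .
docker push xxxxxxxxx.dkr.ecr.cn-north-1.amazonaws.com.cn/zhaojiew/vllm-sagemaker:latest
```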

Deploy again:

      containers:
        - name: vllm-openai-container
          image: xxxxxxxxx.dkr.ecr.cn-north-1.amazonaws.com.cn/zhaojiew/vllm-sagemaker:latest

The printed startup arguments are:

python3 -m vllm.entrypoints.openai.api_server --port 8000 --max-model-len 8896 --port 8000 --load-format runai_streamer --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --model s3://bucketname/Qwen/Qwen3-VL-2B-Instruct --served-model-name Qwen3-V

The startup arguments look correct, yet the error persists. Several similar issues suggest this is a bug in vLLM 0.11 itself (issue link):

When trying to use Run.AI model streamer on vLLM 0.11 it breaks.
It seems to break with every new minor release of vLLM, same thing happened previously from 0.9.x to 0.10.x.

vllm serve s3://<path> --load-format runai_streamer
Repo id must be in the form 'repo_name' or 'namespace/repo_name': 's3://XXXXXXXXX'. Use `repo_type` argument if needed

For now the only workaround is to mount the model into the pod via EBS or EFS. The following command was tested without problems:

python3 -m vllm.entrypoints.openai.api_server \
    --max-model-len 8896 \
    --port 8000 \
    --max-num-seqs 1024 \
    --load-format runai_streamer \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --model /model/Qwen3-VL-2B-Instruct \
    --served-model-name Qwen3-VL

The deployment log:

[Figure: vLLM startup log]

