
Deploying the Qwen3-32B model with vLLM: an OpenAI-API-compatible inference service that OpenClaw can call

news 2026/5/11 22:16:26

1. Host configuration

GPU: HG DCU K100_AI -E4x16 64GB
OS: Kylin Linux Advanced Server V10 (Halberd)
Goal: set up OpenClaw on top of a Qwen3-32B model served by vLLM

2. Start the vLLM Docker container on the host

# Start the Hygon build of the vLLM Docker image
docker run -it \
  --name qwen3-32b \
  --device=/dev/kfd \
  --privileged \
  --network=host \
  --device=/dev/dri \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -v /opt/hyhal:/opt/hyhal:ro \
  -v /public/model/qwen/Qwen3-32B/:/workspace/Qwen3-32B:ro \
  --group-add video \
  --shm-size 64G \
  -w /workspace \
  image.sourcefind.cn:5000/dcu/admin/base/vllm:0.9.2-ubuntu22.04-dtk25.04.1-rc5-rocblas104381-0915-das1.6-py3.10-20250916-rc2 \
  bash

# Enter the container:
docker exec -it qwen3-32b bash

3. Set environment variables inside the container and start the service

# Set the environment variables:
model_path=/workspace/Qwen3-32B
model=qwen3-32B
tp=4
time=$(date "+%m%d-%H%M")
mode="cudagraph"
data_type="fp16"
port=8085

# Start the service:
# The log directory must exist before the output redirect below can succeed.
mkdir -p /workspace/logs
VLLM_LOGGING_LEVEL=DEBUG nohup vllm serve $model_path \
  --served-model-name $model \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --host 0.0.0.0 \
  --port $port \
  --dtype float16 \
  --tensor-parallel-size $tp \
  --max-num-seqs 1024 \
  --trust-remote-code \
  --distributed-executor-backend=mp \
  --no-enable-prefix-caching \
  --max-model-len 40960 \
  --max-seq-len-to-capture 40960 \
  > /workspace/logs/vllm_$(date +%Y%m%d_%H%M%S).log 2>&1 &
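Because the server is started in the background with nohup, it is not obvious when the model has finished loading. The following is a minimal sketch, not part of the original setup, that polls the `/v1/models` endpoint on port 8085 until it answers with HTTP 200. The function names `http_status` and `wait_until_ready`, the 30-minute budget, and the 10-second interval are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout_s=5.0):
    """Return the HTTP status of a GET request, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except OSError:
        return 0  # connection refused, timeout, DNS failure, ...


def wait_until_ready(base_url, max_wait_s=1800.0, interval_s=10.0,
                     get_status=http_status, sleep=time.sleep):
    """Poll GET {base_url}/v1/models until it returns 200 or the budget runs out.

    get_status and sleep are injectable so the loop can be exercised
    without a running server (hypothetical helper, not a vLLM API).
    """
    url = base_url.rstrip("/") + "/v1/models"
    waited = 0.0
    while True:
        if get_status(url) == 200:
            return True
        if waited >= max_wait_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

Inside the container, `wait_until_ready("http://127.0.0.1:8085")` returns `True` once the model list is being served, and `False` if the time budget is exhausted first; in the latter case the log file under /workspace/logs is the place to look.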

4. List the models and send a test request

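# List the served models:
curl http://10.9.90.91:8085/v1/models

# Send a test chat request. The "model" field must match the value passed to
# --served-model-name (qwen3-32B), including letter case.
curl http://10.9.90.91:8085/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, please introduce yourself"}
    ],
    "temperature": 0.7,
    "max_tokens": 1024
  }'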
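The same test request can also be sent from Python, which is convenient when scripting checks. The sketch below uses only the standard library and mirrors the curl payload; the helper names `build_chat_request` and `chat` are assumptions for illustration, and the model name is set to `qwen3-32B` on the assumption that it has to match the `--served-model-name` value from step 3.

```python
import json
import urllib.request

BASE_URL = "http://10.9.90.91:8085"  # address used in the curl commands
MODEL = "qwen3-32B"                  # assumed to match --served-model-name


def build_chat_request(user_text, model=MODEL, temperature=0.7, max_tokens=1024):
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def chat(user_text, base_url=BASE_URL):
    """Send one chat request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(user_text)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        reply = json.load(resp)
    # OpenAI-style responses carry the text in choices[0].message.content.
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello, please introduce yourself"))` should print the model's answer. Pointing `base_url` at port 8086 instead routes the request through the proxy from step 6.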

5. Download the model

apt update
apt install -y git-lfs
git lfs install
git clone https://www.modelscope.cn/Qwen/Qwen3-32B.git
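If the repository is cloned before `git lfs install` has taken effect, the large weight files can end up as small Git LFS pointer stubs rather than real data, and the model will then fail to load. The following sketch, which is an addition and not part of the original steps, scans a model directory for such stubs. It relies on the Git LFS pointer format, where a pointer file is under 1024 bytes and begins with the line `version https://git-lfs.github.com/spec/v1`; the function name `find_lfs_pointers` is hypothetical.

```python
from pathlib import Path

# First bytes of every Git LFS pointer file, per the Git LFS pointer spec.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """Return relative paths of files that are still Git LFS pointer stubs."""
    stubs = []
    for path in sorted(Path(model_dir).rglob("*")):
        rel = path.relative_to(model_dir)
        # Skip Git's own metadata and anything that is not a regular file.
        if ".git" in rel.parts or not path.is_file():
            continue
        # Pointer files are tiny; real weight shards are far larger.
        if path.stat().st_size >= 1024:
            continue
        with path.open("rb") as fh:
            if fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                stubs.append(str(rel))
    return stubs
```

An empty list from `find_lfs_pointers("Qwen3-32B")` suggests the weights were fetched; any listed files can usually be fetched by running `git lfs pull` inside the clone.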

6. A simple Python proxy for inspecting client request bodies

# proxy/proxy.py
import json

import requests
from flask import Flask, request, Response

app = Flask(__name__)

# Real address of the vLLM service
VLLM_URL = "http://127.0.0.1:8085"


@app.route('/<path:path>', methods=['GET', 'POST', 'PUT', 'DELETE'])
def proxy(path):
    url = f"{VLLM_URL}/{path}"

    # 1. Print the request line and the request body
    print(f"\n{'='*50}")
    print(f"Received request: {request.method} {path}")
    if request.is_json:
        # This is the JSON content we most want to inspect
        print("Request body (JSON):")
        print(json.dumps(request.json, indent=2, ensure_ascii=False))
    else:
        print(f"Request body (Raw): {request.get_data()}")
    print(f"{'='*50}\n")

    # 2. Forward the request to the real vLLM service
    headers = {k: v for k, v in request.headers if k.lower() != 'host'}
    try:
        resp = requests.request(
            method=request.method,
            url=url,
            headers=headers,
            data=request.get_data(),
            cookies=request.cookies,
            allow_redirects=False,
            timeout=300
        )

        # 3. Build and return the response.
        # Note: resp.content buffers the whole upstream response, so streamed
        # replies reach the client only after generation has finished.
        excluded_headers = ['content-encoding', 'content-length',
                            'transfer-encoding', 'connection']
        resp_headers = [
            (name, value) for name, value in resp.raw.headers.items()
            if name.lower() not in excluded_headers
        ]
        return Response(resp.content, resp.status_code, resp_headers)
    except Exception as e:
        return f"Proxy Error: {str(e)}", 500


if __name__ == '__main__':
    # Listen on port 8086
    app.run(host='0.0.0.0', port=8086)
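Full JSON dumps of agent requests can be long, so a compact summary next to the raw body can make the log easier to scan. The following is a sketch of one possible helper, not something from the original proxy: `summarize_chat_request` is a hypothetical name, and it assumes the client sends OpenAI-style chat bodies with `messages`, an optional `tools` list of `{"type": "function", "function": {...}}` entries, and an optional `stream` flag.

```python
def summarize_chat_request(body):
    """Condense an OpenAI-style chat request body into a few key fields."""
    messages = body.get("messages") or []
    tools = body.get("tools") or []
    return {
        "model": body.get("model"),
        "stream": bool(body.get("stream", False)),
        # Order of roles shows how the client structures the conversation.
        "roles": [m.get("role") for m in messages],
        # Tool names show which functions the client exposes to the model.
        "tool_names": [t.get("function", {}).get("name") for t in tools],
        "max_tokens": body.get("max_tokens"),
    }
```

One way to use it is to call `print(summarize_chat_request(request.json))` inside the `request.is_json` branch of `proxy()`, then point the client at port 8086 instead of 8085.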
