Step-by-Step Tutorial: Quickly Launching a Qwen3-Reranker-0.6B Service with vLLM

news 2026/5/12 22:53:11

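1. Environment Setup and Quick Deployment

Before you begin, make sure your system meets the following requirements:

Operating system: Ubuntu 20.04/22.04 or CentOS 7/8 recommended
Hardware:
- CPU: at least 4 cores
- Memory: 16 GB or more recommended
- GPU (optional): an NVIDIA card (RTX 3090 or better recommended) significantly improves performance
Software dependencies:
- Python 3.8+ (confirm the minimum version supported by the vLLM release you install)
- pip 20.0+
- CUDA 11.7+ (if you want GPU acceleration)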
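The software requirements above can be checked before installing anything heavy. The following is a minimal sketch, not part of the original guide: `check_environment` is a hypothetical helper, the minimum Python version is the one listed above, and the package names are the ones installed in the deployment step of this guide.

```python
import importlib.util
import sys

# Minimum Python version taken from the requirements list above.
MIN_PYTHON = (3, 8)

# Import names of the packages this guide installs with pip.
PACKAGES = ("torch", "transformers", "vllm", "gradio")


def check_environment(min_python=MIN_PYTHON, packages=PACKAGES):
    """Return a dict describing whether the basic requirements are met."""
    return {
        "python_ok": sys.version_info[:2] >= tuple(min_python),
        # find_spec only checks that a package is installed; it does not import it.
        "packages": {
            name: importlib.util.find_spec(name) is not None for name in packages
        },
    }


if __name__ == "__main__":
    result = check_environment()
    print("Python version OK:", result["python_ok"])
    for name, installed in result["packages"].items():
        print(f"{name}: {'installed' if installed else 'missing'}")
```

Running the script inside the virtual environment shows at a glance which dependencies still need to be installed. It does not verify CUDA or GPU availability.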

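1.1 One-Step Deployment Commands

Use the following commands to set up the Qwen3-Reranker-0.6B service:

    # Create and activate a virtual environment
    python -m venv qwen_env
    source qwen_env/bin/activate

    # Install dependencies
    pip install torch transformers vllm gradio

    # Download the model (optional; the image ships with it preinstalled)
    # wget https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/resolve/main/model.safetensors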

2. Starting and Verifying the Service

2.1 Starting the Service with vLLM

Run the following command to start the Qwen3-Reranker-0.6B service:

    python -m vllm.entrypoints.api_server \
        --model Qwen/Qwen3-Reranker-0.6B \
        --tensor-parallel-size 1 \
        --port 8000 \
        --trust-remote-code

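Parameters:

--tensor-parallel-size: number of GPUs used for tensor parallelism (set to 1 for a single card)
--port: port the service listens on
--trust-remote-code: allows execution of custom code shipped with the model repository (needed for Qwen models that ship custom modeling code)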
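When the server is started from a script rather than by hand, it can help to assemble the same command from named parameters. The sketch below only builds the argument list for the command shown above. `build_launch_command` is a hypothetical helper, and its defaults mirror the values used in this guide.

```python
import shlex
import sys


def build_launch_command(model="Qwen/Qwen3-Reranker-0.6B", tensor_parallel_size=1,
                         port=8000, trust_remote_code=True):
    """Build the argv list for the api_server command used in this guide."""
    cmd = [
        sys.executable, "-m", "vllm.entrypoints.api_server",
        "--model", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--port", str(port),
    ]
    if trust_remote_code:
        cmd.append("--trust-remote-code")
    return cmd


if __name__ == "__main__":
    # Print the command for inspection; pass the list to subprocess.Popen to run it.
    print(shlex.join(build_launch_command(port=8001)))
```

Using `sys.executable` keeps the server in the same virtual environment as the calling script, and passing a list avoids shell quoting issues.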

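2.2 Checking Service Status

Check the service log to confirm that startup succeeded (the path below assumes the server's output is being written to this file):

    tail -f /root/workspace/vllm.log

After a normal startup, you should see output similar to the following:

    INFO 07-10 15:30:12 llm_engine.py:72] Initializing an LLM engine with config:...
    INFO 07-10 15:30:15 model_runner.py:54] Loading model weights...
    INFO 07-10 15:30:18 api_server.py:120] Serving on http://0.0.0.0:8000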
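Besides reading the log, a script can wait until the port actually accepts connections before sending requests. The following is a minimal sketch using only the standard library. `wait_for_port` is a hypothetical helper, and it only confirms that something is listening on the port, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host="localhost", port=8000, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, calling `wait_for_port("localhost", 8000, timeout=300)` right after launching the server blocks until the port opens or five minutes pass, which is usually enough to cover the model loading time.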

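3. Calling the Service from a Gradio WebUI

3.1 Launching the Web Interface

Create a Python script named webui.py with the following content:

    import gradio as gr
    import requests

    API_URL = "http://localhost:8000/generate"


    def rerank(query, documents):
        """Send the query and documents to the vLLM server and return its text output."""
        payload = {
            "prompt": (
                "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
                f"<|im_start|>user\nRerank these documents for query: {query}\n"
                f"Documents:\n{documents}<|im_end|>"
            ),
            "max_tokens": 512,
        }
        response = requests.post(API_URL, json=payload, timeout=120)
        response.raise_for_status()
        # The /generate endpoint returns {"text": [...]}: one string per completion,
        # each consisting of the prompt followed by the generated text.
        return response.json()["text"][0]


    iface = gr.Interface(
        fn=rerank,
        inputs=[
            gr.Textbox(label="Query", placeholder="Enter your search query..."),
            gr.Textbox(
                label="Documents",
                placeholder="Paste documents to rerank (one per line)...",
                lines=10,
            ),
        ],
        outputs=gr.Textbox(label="Reranked Results"),
        title="Qwen3-Reranker-0.6B Demo",
    )

    # Bind to all interfaces so the UI is reachable at http://<server IP>:7860.
    iface.launch(server_name="0.0.0.0", server_port=7860)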

Launch the web interface:

python webui.py

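3.2 Using the Interface

Open http://<server IP>:7860 in a browser
Enter your search query in the "Query" box
Enter the documents to be ranked in the "Documents" area (one document per line)
Click the "Submit" button to get the reranked results

Example input:

    Query: What is machine learning?
    Documents:
    Machine learning is a branch of artificial intelligence
    Deep learning requires large amounts of labeled data
    Supervised learning uses labeled datasets
    Reinforcement learning learns through a reward mechanism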
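Reranker models of this kind are generally used to judge one query-document pair at a time rather than to rewrite a whole list, so the one-document-per-line input maps naturally onto one prompt per document. The sketch below splits the "Documents" text and builds such prompts. The template strings are assumptions modeled on the prompt format published with Qwen3-Reranker and should be verified against the model card, and `split_documents` and `build_prompts` are hypothetical helpers.

```python
# Assumed prompt pieces, modeled on the format published with Qwen3-Reranker.
# Verify them against the model card before relying on the results.
PREFIX = (
    "<|im_start|>system\n"
    "Judge whether the Document meets the requirements based on the Query and "
    'the Instruct provided. Note that the answer can only be "yes" or "no".'
    "<|im_end|>\n<|im_start|>user\n"
)
SUFFIX = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
DEFAULT_INSTRUCTION = (
    "Given a web search query, retrieve relevant passages that answer the query"
)


def split_documents(text):
    """Split the 'Documents' textbox content into one document per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_prompts(query, documents_text, instruction=DEFAULT_INSTRUCTION):
    """Return one prompt per document, each asking for a yes/no relevance judgment."""
    prompts = []
    for doc in split_documents(documents_text):
        body = f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}"
        prompts.append(PREFIX + body + SUFFIX)
    return prompts


if __name__ == "__main__":
    docs = (
        "Machine learning is a branch of artificial intelligence\n"
        "\n"
        "Deep learning requires large amounts of labeled data"
    )
    for prompt in build_prompts("What is machine learning?", docs):
        print(prompt)
        print("-" * 40)
```

Each prompt can then be sent to the server separately, with the relative likelihood of "yes" versus "no" serving as that document's relevance signal.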

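4. Advanced Usage Tips

4.1 Batch Processing Optimization

For large batches of documents, the following approach is recommended:

    from vllm import LLM, SamplingParams

    # Initialize the model
    llm = LLM(model="Qwen/Qwen3-Reranker-0.6B")

    # Prepare the batch of inputs
    prompts = [
        "Query: Principles of neural networks\nDoc1: Neural networks mimic the structure of the human brain\nDoc2: Backpropagation is key to training",
        "Query: Features of Python\nDoc1: Python is an interpreted language\nDoc2: Dynamic type system",
    ]

    # Set the generation parameters
    sampling_params = SamplingParams(temperature=0.7, top_p=0.9)

    # Generate for the whole batch in a single call
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(f"Prompt: {output.prompt}")
        print(f"Generated text: {output.outputs[0].text}")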
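Once a batch has been processed, the per-document results still need to be turned into an ordering. One common approach for yes/no style rerankers, sketched here under assumptions, is to normalize the log-probabilities of the "yes" and "no" answers into a score and sort by it. How those log-probabilities are read from vLLM is outside this sketch, `relevance_score` and `rank_documents` are hypothetical helpers, and the numbers in the example are made up.

```python
import math


def relevance_score(logprob_yes, logprob_no):
    """Normalize the two log-probabilities into P(yes) over {yes, no}."""
    # Subtract the maximum first so the exponentials cannot underflow to 0/0.
    m = max(logprob_yes, logprob_no)
    yes = math.exp(logprob_yes - m)
    no = math.exp(logprob_no - m)
    return yes / (yes + no)


def rank_documents(documents, logprob_pairs):
    """Sort documents by descending relevance score.

    logprob_pairs holds one (logprob_yes, logprob_no) tuple per document.
    """
    scored = [
        (doc, relevance_score(yes, no))
        for doc, (yes, no) in zip(documents, logprob_pairs)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    docs = [
        "Neural networks mimic the structure of the human brain",
        "Backpropagation is key to training",
    ]
    # Made-up log-probabilities for illustration only; they are not model output.
    pairs = [(-0.1, -2.4), (-1.6, -0.2)]
    for doc, score in rank_documents(docs, pairs):
        print(f"{score:.3f}  {doc}")
```

Because the score depends only on the two answer tokens, it is comparable across documents for the same query, which is what makes the final sort meaningful.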