Qwen3-14b_int4_awq Hands-On Notes: Calling the vLLM API from Jupyter and Adding a Chainlit Frontend

news 2026/7/1 11:04:52

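1. Model Overview

Qwen3-14b_int4_awq is a quantized version of the Qwen3-14b model that uses int4 precision with AWQ (Activation-aware Weight Quantization). It was compressed with the AngelSlim tool, which substantially lowers the model's compute and memory requirements while keeping text-generation quality high.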

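Key features:

Model size reduced by about 75% (4-bit weights in place of 16-bit weights)
Inference 2-3x faster
Much lower memory footprint
Retains over 90% of the original model's generation quality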
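The 75% figure follows from the bit widths alone. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes roughly 14 billion parameters (rounded from the model name) and counts weights only, ignoring the KV cache, activations, and the small overhead of AWQ scales and zero-points.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: ~14 billion parameters, rounded from the model name.
NUM_PARAMS = 14e9

fp16_gb = weight_size_gb(NUM_PARAMS, 16)  # 16-bit baseline
int4_gb = weight_size_gb(NUM_PARAMS, 4)   # 4-bit quantized weights
reduction = 1 - int4_gb / fp16_gb

print(f"FP16: {fp16_gb:.1f} GB, INT4: {int4_gb:.1f} GB, reduction: {reduction:.0%}")
```

Actual GPU memory use will be higher than the INT4 figure, because vLLM also reserves memory for the KV cache.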

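2. Environment Setup and Deployment Check

2.1 Check the Model Service Status

Before making any calls, confirm that the model service has been deployed successfully. Run the following command in the webshell to view the log: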

cat /root/workspace/llm.log

A successful deployment shows log lines similar to the following:

INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
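The same check can be scripted from a notebook cell instead of reading the log by eye. This is a minimal sketch: the helper names `log_shows_startup` and `check_log_file` are illustrative, and the marker strings are taken from the sample log lines above, so they may need adjusting if your server prints different messages.

```python
from pathlib import Path

# Substrings taken from the sample log lines shown above.
STARTUP_MARKERS = ("Application startup complete", "Uvicorn running on")


def log_shows_startup(log_text: str) -> bool:
    """Return True when every startup marker appears in the log text."""
    return all(marker in log_text for marker in STARTUP_MARKERS)


def check_log_file(path: str = "/root/workspace/llm.log") -> bool:
    """Read the log file and report whether the server finished starting."""
    log_path = Path(path)
    if not log_path.exists():
        return False  # no log yet: the service has not been launched
    text = log_path.read_text(encoding="utf-8", errors="replace")
    return log_shows_startup(text)
```

Calling `check_log_file()` in a cell returns `True` once both markers have been written to the log.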

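2.2 Confirm the Model Has Loaded

Loading the model fully takes some time, depending on the hardware. You can confirm that it is ready in any of these ways:

Watch the progress messages in the log
Check that GPU memory usage has stabilized
Send a simple test request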
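The third option, sending a test request, can be automated by polling until the server answers. The sketch below queries the `/v1/models` route that vLLM's OpenAI-compatible server exposes. The function names, the 600-second timeout, and the 5-second interval are assumptions chosen for illustration.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, List


def list_models(base_url: str = "http://localhost:8000") -> List[str]:
    """Return the model ids reported by the server's /v1/models route."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        body = json.load(resp)
    return [entry["id"] for entry in body.get("data", [])]


def wait_until_ready(
    probe: Callable[[], List[str]] = list_models,
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
) -> List[str]:
    """Call probe() repeatedly until it returns at least one model id."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            models = probe()
            if models:
                return models
        except (urllib.error.URLError, ConnectionError, TimeoutError, ValueError):
            pass  # server is not accepting requests yet, keep waiting
        if time.monotonic() >= deadline:
            raise TimeoutError("model server did not become ready in time")
        time.sleep(interval_s)
```

The returned ids are also worth a look: the `model` field in later requests has to match one of them exactly.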

3. Calling the vLLM API from Jupyter

3.1 Install the Required Dependencies

In a Jupyter notebook, first install the required Python packages:

!pip install requests chainlit

3.2 Basic API Call Example

The following is a basic skeleton for calling the vLLM API:

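import requests

def generate_text(prompt, max_tokens=200):
    """Send a prompt to the vLLM completions endpoint and return the generated text."""
    api_url = "http://localhost:8000/v1/completions"
    payload = {
        "model": "Qwen3-14b-int4-awq",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    # json= serializes the payload and sets the Content-Type header
    response = requests.post(api_url, json=payload, timeout=120)
    response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
    return response.json()["choices"][0]["text"]

# Example call
result = generate_text("Explain quantum computing in simple terms")
print(result)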
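The call above reads only `choices[0]["text"]`. A completions response in the OpenAI-compatible format also carries a `finish_reason`, which shows whether generation stopped naturally or was cut off by `max_tokens`. The sketch below uses a hypothetical helper, `extract_completion`, and the sample dictionary is illustrative rather than captured server output.

```python
from typing import Any, Dict, Tuple


def extract_completion(body: Dict[str, Any]) -> Tuple[str, str]:
    """Return (text, finish_reason) from a completions response body."""
    choices = body.get("choices") or []
    if not choices:
        # Error bodies have no "choices"; include the body to aid debugging.
        raise RuntimeError(f"no completion in response: {body!r}")
    first = choices[0]
    return first["text"], first.get("finish_reason") or "unknown"


# Illustrative response shape (values are made up for the example).
sample_body = {
    "object": "text_completion",
    "model": "Qwen3-14b-int4-awq",
    "choices": [{"index": 0, "text": "Quantum computers use qubits", "finish_reason": "length"}],
    "usage": {"prompt_tokens": 8, "completion_tokens": 5, "total_tokens": 13},
}

text, reason = extract_completion(sample_body)
if reason == "length":
    print("Output was truncated; consider raising max_tokens.")
```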

3.3 Advanced Call Parameters

The vLLM API accepts several sampling parameters for tuning the generated output:

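advanced_payload = {
    "model": "Qwen3-14b-int4-awq",
    "prompt": "Write a popular-science article about deep learning",
    "max_tokens": 500,
    "temperature": 0.8,        # sampling randomness; higher values give more varied output
    "top_p": 0.9,              # nucleus sampling threshold
    "frequency_penalty": 0.5,  # penalizes tokens by how often they have already appeared
    "presence_penalty": 0.3,   # penalizes tokens that have appeared at all, encouraging variety
}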
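One way to keep these parameters manageable is a small builder that merges overrides into defaults and rejects out-of-range values before the request is sent. This is a sketch under assumptions: `build_payload` and `DEFAULTS` are illustrative names, and the ranges follow the OpenAI API documentation, which the vLLM server is expected to mirror.

```python
from typing import Any, Dict

# Illustrative defaults; adjust to taste.
DEFAULTS: Dict[str, Any] = {
    "max_tokens": 200,
    "temperature": 0.7,
    "top_p": 1.0,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
}


def build_payload(prompt: str, model: str = "Qwen3-14b-int4-awq", **overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the defaults and validate the sampling ranges."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    params = {**DEFAULTS, **overrides}
    if params["temperature"] < 0:
        raise ValueError("temperature must be non-negative")
    if not 0 < params["top_p"] <= 1:
        raise ValueError("top_p must be in (0, 1]")
    for name in ("frequency_penalty", "presence_penalty"):
        if not -2 <= params[name] <= 2:
            raise ValueError(f"{name} must be in [-2, 2]")
    return {"model": model, "prompt": prompt, **params}


payload = build_payload("Write a haiku about GPUs", temperature=0.8, top_p=0.9)
```

The resulting dictionary can be passed as `json=payload` to `requests.post`, in the same way as the basic example.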

4. Integrating the Chainlit Frontend

4.1 Create the Chainlit App

Create a new Python file (for example app.py) with the following content:

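import asyncio

import chainlit as cl
import requests

@cl.on_message
async def main(message: cl.Message):
    # Call the vLLM API in a worker thread so the blocking request
    # does not stall Chainlit's event loop
    response = await asyncio.to_thread(
        requests.post,
        "http://localhost:8000/v1/completions",
        json={
            "model": "Qwen3-14b-int4-awq",
            "prompt": message.content,  # recent Chainlit versions pass a cl.Message, not a str
            "max_tokens": 500,
        },
        timeout=120,
    )
    response.raise_for_status()
    # Send the response to the frontend
    await cl.Message(content=response.json()["choices"][0]["text"]).send()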

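4.2 Launch the Chainlit Interface

Run the following command in a terminal to start the frontend. Chainlit listens on port 8000 by default, which the vLLM server already occupies in this setup, so pass a different port:

chainlit run app.py --port 8001

Once it starts, Chainlit opens the chat interface in your browser (http://localhost:8001 with the command above).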

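4.3 Extending the Interface

Chainlit supports extensive interface customization, such as a welcome message and incremental output: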

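import asyncio

import chainlit as cl
import requests

def get_model_response(prompt: str) -> str:
    # Not defined in the original notes; assumed to wrap the same completions call as app.py
    response = requests.post(
        "http://localhost:8000/v1/completions",
        json={"model": "Qwen3-14b-int4-awq", "prompt": prompt, "max_tokens": 500},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["text"]

@cl.on_chat_start
async def start_chat():
    await cl.Message(
        content="Welcome to the Qwen3-14b chat interface! Please enter your question..."
    ).send()

@cl.on_message
async def handle_message(message: cl.Message):
    # Send an empty message first; it acts as the loading placeholder
    msg = cl.Message(content="")
    await msg.send()
    # Fetch the model response without blocking the event loop
    text = await asyncio.to_thread(get_model_response, message.content)
    # Simulated streaming: reveal the finished response word by word
    for token in text.split():
        await msg.stream_token(token + " ")
        await asyncio.sleep(0.05)  # non-blocking pause (time.sleep would freeze the app)
    await msg.update()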
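The handler above only simulates streaming by splitting a finished response. The server can also stream tokens as they are generated: with `"stream": true` in the request body, an OpenAI-compatible endpoint replies with server-sent events, one `data: {...}` line per chunk, ending with `data: [DONE]`. The parser below is a minimal sketch of that format; `iter_stream_text` is a hypothetical helper name and the sample lines are illustrative.

```python
import json
from typing import Iterable, Iterator, Union


def iter_stream_text(lines: Iterable[Union[str, bytes]]) -> Iterator[str]:
    """Yield text fragments from the SSE lines of a streamed completions response."""
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("text", "")
            if text:
                yield text


# Illustrative stream (made-up chunks in the completions streaming shape).
sample_lines = [
    'data: {"choices": [{"index": 0, "text": "Hello"}]}',
    "",
    b'data: {"choices": [{"index": 0, "text": " world"}]}',
    "data: [DONE]",
]
fragments = list(iter_stream_text(sample_lines))
```

With `requests`, the lines would come from `requests.post(url, json={..., "stream": True}, stream=True).iter_lines()`, and each fragment could be passed to `msg.stream_token` in place of the simulated loop.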