ChatGLM3-6B Deployment and Web Integration: Three Approaches with Gradio, Streamlit, and FastAPI

news 2026/3/26 21:51:09

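1. Project Overview

ChatGLM3-6B is an open-source large language model from Zhipu AI. Its long-context variant, ChatGLM3-6B-32K, handles contexts of up to 32k tokens. This article walks through deploying the model on a local server and building an interactive application on top of it with three mainstream Web frameworks: Gradio, Streamlit, and FastAPI.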

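2. Environment Setup and Model Deployment

2.1 Base Environment

    # Create and activate a Python virtual environment
    conda create -n chatglm3-6b python=3.8
    conda activate chatglm3-6b

    # Install the core dependencies
    pip install torch==2.0.0 transformers==4.37.2 sentencepiece==0.1.99

    # Packages used by the later sections (versions are not pinned in this article);
    # "uvicorn[standard]" pulls in the WebSocket support needed in section 3.3
    pip install modelscope gradio streamlit streamlit-chat fastapi "uvicorn[standard]" requests openai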
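Version mismatches are a common cause of loading errors, so it can help to confirm that the pinned packages are really the ones installed in the active environment. The following is a minimal sketch using only the standard library; the `check_pins` helper and the `PINS` table are illustrative names, not part of any of the libraries above.

```python
from importlib import metadata

# Pins taken from the pip command above.
PINS = {"torch": "2.0.0", "transformers": "4.37.2", "sentencepiece": "0.1.99"}

def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every package that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        # A local build tag such as "2.0.0+cu118" still counts as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches

if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means all three pins are satisfied.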

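2.2 Downloading and Loading the Model

Download the model with ModelScope:

    from modelscope import snapshot_download

    # snapshot_download returns the directory that actually holds the weights.
    # It is typically a subdirectory of cache_dir, so use the printed path
    # wherever this article writes "/path/to/model".
    model_dir = snapshot_download('ZhipuAI/chatglm3-6b', cache_dir='/path/to/model')
    print(model_dir)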

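Basic usage:

    from transformers import AutoModel, AutoTokenizer

    # trust_remote_code=True is needed because the model ships its own Python code
    tokenizer = AutoTokenizer.from_pretrained("/path/to/model", trust_remote_code=True)
    # Load the weights in half precision and move them to the GPU
    model = AutoModel.from_pretrained("/path/to/model", trust_remote_code=True).half().cuda()
    model = model.eval()

    # chat() returns the reply together with the updated conversation history
    response, history = model.chat(tokenizer, "Hello", history=[])
    print(response)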
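The `.half().cuda()` call stores the weights in 16-bit floats on the GPU, which roughly halves the memory needed compared with 32-bit loading. A back-of-the-envelope estimate follows; the parameter count of about 6 billion is an assumption read off the "6B" in the model name, and the figure covers the weights only (activations and the attention cache need additional memory).

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30

# Assumption: ~6e9 parameters, inferred from the "6B" in the model name.
NUM_PARAMS = 6e9

for label, nbytes in [("fp32", 4), ("fp16 (.half())", 2)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, nbytes):.1f} GiB")
# fp32: 22.4 GiB
# fp16 (.half()): 11.2 GiB
```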

3. Web Integration Options

3.1 Gradio

Gradio makes it quick to build a demo interface for an AI model:

    from transformers import AutoModel, AutoTokenizer
    import gradio as gr

    tokenizer = AutoTokenizer.from_pretrained("/path/to/model", trust_remote_code=True)
    model = AutoModel.from_pretrained("/path/to/model", trust_remote_code=True).half().cuda()
    model = model.eval()

    def predict(message, history):
        # gr.ChatInterface calls fn(message, history). This assumes the
        # pair-style history ([user, bot] lists) of Gradio 3.x/4.x, while
        # ChatGLM3 expects a list of {"role", "content"} dicts.
        glm_history = []
        for user_msg, bot_msg in history:
            glm_history.append({"role": "user", "content": user_msg})
            glm_history.append({"role": "assistant", "content": bot_msg})
        # stream_chat yields the reply generated so far; ChatInterface
        # replaces the displayed message with each yielded string.
        for response, _ in model.stream_chat(tokenizer, message, history=glm_history):
            yield response

    demo = gr.ChatInterface(predict)
    demo.queue().launch(server_port=6006)

Features:

Built-in chat interface component
Conversation history handled automatically
Streaming output support
Fast prototyping
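One practical detail: the history format that Gradio passes to the chat function depends on the Gradio version (lists of `[user, bot]` pairs in older releases, `{"role", "content"}` dicts in newer ones), while ChatGLM3 works with role/content dicts. A small converter that tolerates both shapes might look like the sketch below; `to_glm_history` is a hypothetical helper, not a Gradio or ChatGLM3 function.

```python
def to_glm_history(ui_history):
    """Convert UI chat history into ChatGLM3-style role/content dicts.

    Accepts either [user, bot] pairs or {"role", "content"} dicts.
    """
    glm_history = []
    for turn in ui_history:
        if isinstance(turn, dict):
            # Message-style history: keep only the two fields the model needs.
            glm_history.append({"role": turn["role"], "content": turn["content"]})
        else:
            # Pair-style history: either side may be None for a half-finished turn.
            user_msg, bot_msg = turn
            if user_msg is not None:
                glm_history.append({"role": "user", "content": user_msg})
            if bot_msg is not None:
                glm_history.append({"role": "assistant", "content": bot_msg})
    return glm_history
```

The result can be passed as the `history` argument of `model.stream_chat`.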

3.2 Streamlit

Streamlit is well suited to building data science applications:

    from transformers import AutoModel, AutoTokenizer
    import streamlit as st
    from streamlit_chat import message  # provided by the streamlit-chat package

    @st.cache_resource
    def load_model():
        # Cached, so the model is loaded once rather than on every rerun
        tokenizer = AutoTokenizer.from_pretrained("/path/to/model", trust_remote_code=True)
        model = AutoModel.from_pretrained("/path/to/model", trust_remote_code=True).half().cuda()
        return tokenizer, model.eval()

    tokenizer, model = load_model()

    if 'history' not in st.session_state:
        st.session_state['history'] = []

    user_input = st.text_input("Enter your question")
    if user_input:
        with st.spinner("The AI is thinking..."):
            for response, history in model.stream_chat(
                    tokenizer, user_input, history=st.session_state['history']):
                st.session_state['history'] = history

    # ChatGLM3 history is a list of {"role", "content"} dicts, not (question, answer) pairs
    for i, turn in enumerate(st.session_state['history']):
        message(turn["content"], is_user=(turn["role"] == "user"), key=f"msg_{i}")

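Advantages:

Built-in state management (st.session_state)
More polished UI components
Well suited to data analysis dashboards
Caching (st.cache_resource) avoids reloading the model on every rerun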
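Because the history kept in session state grows with every exchange while the model's context window is finite, a long-running session may eventually need to drop old turns. The sketch below shows one simple policy; `trim_history` and `max_turns` are hypothetical names, and the code assumes the history strictly alternates user and assistant messages (optionally preceded by system messages).

```python
def trim_history(history, max_turns):
    """Keep system messages plus the most recent `max_turns` exchanges."""
    system = [m for m in history if m["role"] == "system"]
    dialogue = [m for m in history if m["role"] != "system"]
    # One turn = one user message plus one assistant reply.
    # The guard matters: dialogue[-0:] would return the whole list.
    kept = dialogue[-2 * max_turns:] if max_turns > 0 else []
    return system + kept
```

In the Streamlit app this could be applied to the stored history before each call to `model.stream_chat`.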

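3.3 FastAPI + WebSocket

FastAPI is well suited to building production-grade API services:

    from fastapi import FastAPI, WebSocket, WebSocketDisconnect
    from transformers import AutoModel, AutoTokenizer
    import uvicorn

    app = FastAPI()

    tokenizer = AutoTokenizer.from_pretrained("/path/to/model", trust_remote_code=True)
    model = AutoModel.from_pretrained("/path/to/model", trust_remote_code=True).half().cuda()
    model = model.eval()

    @app.websocket("/ws")
    async def websocket_endpoint(websocket: WebSocket):
        await websocket.accept()
        try:
            while True:
                # Each request: {"query": str, "history": list (optional)}
                data = await websocket.receive_json()
                query = data['query']
                history = data.get('history', [])
                # Note: stream_chat is a blocking generator, so it occupies the
                # event loop while generating; other connections wait meanwhile.
                for response, history in model.stream_chat(tokenizer, query, history=history):
                    # status 202: partial reply, more messages follow
                    await websocket.send_json({
                        "response": response,
                        "history": history,
                        "status": 202
                    })
                # status 200: the reply is complete
                await websocket.send_json({"status": 200})
        except WebSocketDisconnect:
            # The client closed the connection; nothing to clean up
            pass
        except Exception as e:
            print(f"Error: {e}")

    if __name__ == "__main__":
        uvicorn.run(app, host="0.0.0.0", port=8000)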
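The endpoint above reads `data['query']` directly, so a malformed message raises an exception and ends the connection. A service exposed to other clients would probably validate the payload first. The following is a minimal sketch of such a check; `parse_request` is a hypothetical helper, and the history shape it enforces (dicts with `role` and `content`) is the one ChatGLM3 is assumed to use.

```python
def parse_request(data):
    """Validate one incoming WebSocket message and return (query, history)."""
    if not isinstance(data, dict):
        raise ValueError("message must be a JSON object")
    query = data.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("'query' must be a non-empty string")
    history = data.get("history", [])
    if not isinstance(history, list):
        raise ValueError("'history' must be a list")
    for item in history:
        if not (isinstance(item, dict) and "role" in item and "content" in item):
            raise ValueError("each history item needs 'role' and 'content'")
    return query, history
```

Inside the endpoint, a `ValueError` could be caught and reported back to the client as an error message instead of closing the socket.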

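Front-end HTML example:

    <!DOCTYPE html>
    <html>
    <head>
      <meta charset="utf-8">
      <title>ChatGLM3 WebSocket Chat</title>
    </head>
    <body>
      <input id="input" placeholder="Type a message...">
      <button onclick="sendMessage()">Send</button>
      <div id="messages"></div>
      <script>
        const ws = new WebSocket("ws://localhost:8000/ws");
        const messages = document.getElementById("messages");
        let history = [];  // conversation history returned by the server
        let aiDiv = null;  // element holding the reply currently being streamed

        function addLine(text) {
          const div = document.createElement("div");
          div.textContent = text;  // textContent avoids injecting HTML
          messages.appendChild(div);
          return div;
        }

        ws.onmessage = (event) => {
          const data = JSON.parse(event.data);
          if (data.status === 202) {
            // Each message carries the reply so far: overwrite rather than append
            if (!aiDiv) aiDiv = addLine("");
            aiDiv.textContent = `AI: ${data.response}`;
            history = data.history;
          } else {
            aiDiv = null;  // status 200: the reply is finished
          }
        };

        function sendMessage() {
          const input = document.getElementById("input");
          // Send the history back so the server can continue the conversation
          ws.send(JSON.stringify({query: input.value, history: history}));
          addLine(`User: ${input.value}`);
          input.value = "";
        }
      </script>
    </body>
    </html>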

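Key advantages:

True bidirectional real-time communication
Easy to integrate into an existing Web application
High-performance asynchronous handling
Standardized API interface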
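One design point worth noting: if each streamed message repeats the whole reply generated so far, the total data sent grows quadratically with the reply length. A server could instead send only the newly added text. The sketch below assumes the stream yields cumulative strings, as `stream_chat` appears to; the `"append"` and `"replace"` operation names are a hypothetical protocol, not part of FastAPI or ChatGLM3.

```python
def iter_updates(cumulative_chunks):
    """Turn a stream of growing strings into (operation, text) updates."""
    previous = ""
    for current in cumulative_chunks:
        if current == previous:
            continue  # nothing new to send
        if current.startswith(previous):
            # Normal case: the reply was extended, so send only the new tail.
            yield ("append", current[len(previous):])
        else:
            # The earlier text changed; ask the client to replace it all.
            yield ("replace", current)
        previous = current

# Example with a fake stream standing in for the model output.
print(list(iter_updates(["He", "Hello", "Hello", "Help"])))
# [('append', 'He'), ('append', 'llo'), ('replace', 'Help')]
```

The client would then concatenate `append` payloads and overwrite its buffer on `replace`.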

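4. Comparison and Selection Advice

Feature	Gradio	Streamlit	FastAPI
Development speed	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐
UI flexibility	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐⭐
Performance	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐⭐
Best suited for	Demos / prototypes	Data analysis apps	Production-grade API services
Learning curve	Gentlest	Moderate	Steepest
Extensibility	Limited	Moderate	High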

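Recommendations:

Quick demo: choose Gradio
Data science application: choose Streamlit
Enterprise integration: choose FastAPI
Custom WebSocket real-time communication: choose FastAPI, the only one of the three approaches shown here that defines its own WebSocket endpoint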

5. Advanced Features

5.1 Integrating the Zhipu Qingyan Agent API

    import requests

    def get_access_token(api_key, api_secret):
        # Exchange the API key and secret for an access token
        url = "https://chatglm.cn/chatglm/assistant-api/v1/get_token"
        response = requests.post(url, json={"api_key": api_key, "api_secret": api_secret})
        response.raise_for_status()  # fail early on an HTTP error
        return response.json()['result']['access_token']

    def send_message(assistant_id, access_token, prompt):
        # Send a prompt to the agent and print the streamed reply line by line
        url = "https://chatglm.cn/chatglm/assistant-api/v1/stream"
        headers = {"Authorization": f"Bearer {access_token}"}
        data = {"assistant_id": assistant_id, "prompt": prompt}
        with requests.post(url, json=data, headers=headers, stream=True) as response:
            response.raise_for_status()
            for line in response.iter_lines():
                if line:
                    print(line.decode('utf-8'))
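The loop above prints raw lines. If the stream follows the Server-Sent Events convention of `data:`-prefixed lines carrying JSON (an assumption here, since the article does not show the response format), the payloads could be decoded as in the sketch below. `iter_sse_json` is a hypothetical helper, and the fields inside each payload are left to the caller.

```python
import json

def iter_sse_json(lines):
    """Yield the decoded JSON payload of each `data:` line in an SSE-style stream."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives, `event:` lines, and comments
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue  # skip empty payloads and the conventional end marker
        try:
            yield json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore data lines that are not JSON
```

It would be used as `for event in iter_sse_json(response.iter_lines()): ...` in place of the print loop.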

5.2 OpenAI-Compatible API

Deploy an OpenAI-compatible API service with Docker:

    docker run -p 8000:8000 -e API_KEY=your_key vinlic/zhipuai-agent-to-openai:latest

Client example:

    from openai import OpenAI

    # Point the official client at the local service instead of api.openai.com
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="your_key")

    # The agent (assistant) ID is passed in the model field
    response = client.chat.completions.create(
        model="your_assistant_id",
        messages=[{"role": "user", "content": "Hello"}]
    )
    print(response.choices[0].message.content)
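For environments where the `openai` package is not available, the same call can be made as a plain HTTP request. The sketch below builds, without sending, the request the client is expected to issue; it assumes the container exposes the standard OpenAI `/chat/completions` route under the base URL, which the use of the OpenAI client above implies. `build_chat_request` is an illustrative helper name.

```python
import json
import urllib.request

def build_chat_request(base_url, api_key, model, messages):
    """Build (but do not send) a POST request for the chat completions route."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request(
    "http://localhost:8000/v1", "your_key", "your_assistant_id",
    [{"role": "user", "content": "Hello"}],
)
# To send it: pass req to urllib.request.urlopen, parse the JSON response,
# and read ["choices"][0]["message"]["content"].
```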