GLM-Image WebUI Deployment Tutorial: Integrating System Monitoring (GPU Temperature / VRAM / Load)

news 2026/4/3 4:44:49

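1. Introduction: Why Do You Need System Monitoring?

If you have used AI image generators, you have probably run into this: you type a description, click Generate full of anticipation, and then... the fans start roaring, the screen freezes, and all you can do is stare at the progress bar. You start to wonder: "Has the program hung? Will the GPU overheat? Do I have enough VRAM left?"

That uncertainty is stressful. With a large model like GLM-Image, generating one high-resolution image can take several minutes, sometimes more than ten. During that time you have no idea what is happening inside the system. Is the GPU working at full load? Is the temperature normal? Has the VRAM overflowed? You simply cannot tell.

What I am sharing today is a "dashboard" for the GLM-Image WebUI. Just as you watch the speedometer and fuel gauge while driving, you need to see the state of the system in real time while running an AI model. With this setup you can see at a glance, while an image is being generated:

The current GPU temperature
How much VRAM is used and how much is left
GPU utilization (load)
System memory and CPU usage

With this information you can let the model run with peace of mind, without worrying about hardware problems, and you can plan your generation jobs better.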

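2. Preparation: Check Your Environment

Before we start, let's confirm that your GLM-Image WebUI is already running properly. If you have not deployed it yet, follow the README that ships with the project for a quick start.

2.1 Confirm the WebUI Is Running

Open a terminal and run the following command to check whether the service is running:

ps aux | grep webui.py

If you see output similar to the following, the WebUI is running: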

root 12345 5.2 12.3 2456789 123456 ? Sl 10:30 2:15 python /root/build/webui.py
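Note that `ps aux | grep webui.py` can also list the `grep` process itself. The check can be sketched in Python as follows. The helper names `find_webui_processes` and `running_webui_pids` and the filtering rule are illustrative assumptions, not part of GLM-Image.

```python
import subprocess


def find_webui_processes(ps_output: str, script_name: str = "webui.py") -> list[int]:
    """Return PIDs from `ps aux` output whose command line mentions script_name.

    Lines that belong to a grep command are skipped.
    """
    pids = []
    for line in ps_output.splitlines():
        # `ps aux` prints 11 columns; the last one (COMMAND) may contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or malformed line
        command = fields[10]
        if script_name in command and not command.startswith("grep"):
            pids.append(int(fields[1]))
    return pids


def running_webui_pids() -> list[int]:
    """Run `ps aux` on this machine and return the matching PIDs."""
    result = subprocess.run(["ps", "aux"], capture_output=True, text=True, check=True)
    return find_webui_processes(result.stdout)
```

An empty list from `running_webui_pids()` means no matching process was found.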

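2.2 Check the Required Python Packages

The monitoring setup relies mainly on a few Python libraries. First check whether they are already installed:

python -c "import psutil; import pynvml; print('All dependencies are ready')"

If you get a ModuleNotFoundError, some packages are missing. Don't worry, we will install them together in section 4.1.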
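The one-liner above stops at the first missing package. A minimal sketch that reports every missing package at once, assuming a hypothetical helper `missing_packages` built on the standard-library `importlib.util.find_spec`:

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["psutil", "pynvml"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are ready")
```

This only checks that the modules can be located; it does not import them or verify their versions.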

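2.3 Get to Know Your GPU

Run this command to see your GPU model and the metrics it reports:

nvidia-smi --query-gpu=name,temperature.gpu,memory.total,memory.used --format=csv

You will see something like this:

name, temperature.gpu, memory.total [MiB], memory.used [MiB]
NVIDIA GeForce RTX 4090, 45, 24564 MiB, 1234 MiB

This means your RTX 4090 is currently at 45°C, has about 24 GB of VRAM in total, and is using about 1.2 GB.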
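If you want to consume this output from a script, it can be parsed with the standard `csv` module. The sketch below uses a hypothetical helper `parse_gpu_csv` and assumes the three numeric columns queried above, with values that may carry a unit suffix such as `MiB`; a field reported as `[N/A]` is not handled.

```python
import csv
import io


def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv` output into a list of dicts.

    Unit suffixes such as " MiB" are stripped from the numeric columns.
    """
    reader = csv.reader(io.StringIO(text.strip()), skipinitialspace=True)
    # Drop the "[MiB]" annotation from the header names.
    header = [h.split(" [")[0] for h in next(reader)]
    gpus = []
    for row in reader:
        if not row:
            continue
        record = dict(zip(header, (value.strip() for value in row)))
        for key in ("temperature.gpu", "memory.total", "memory.used"):
            if key in record:
                record[key] = int(record[key].split()[0])  # "24564 MiB" -> 24564
        gpus.append(record)
    return gpus


sample = """name, temperature.gpu, memory.total [MiB], memory.used [MiB]
NVIDIA GeForce RTX 4090, 45, 24564 MiB, 1234 MiB"""

gpu = parse_gpu_csv(sample)[0]
print(gpu["name"], f'{gpu["memory.used"] / 1024:.1f} GiB used')
```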

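3. Core Solution: Adding a Monitor Panel to the WebUI

Now for the main part. We are going to add a real-time monitoring area to the existing GLM-Image WebUI. The work has three parts: collecting system data, building the monitor panel, and integrating it into the WebUI.

3.1 Create the Data Collection Module

First, create a Python module dedicated to collecting system information. Create a new file named system_monitor.py: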

# system_monitor.py
import threading
import time
from datetime import datetime

import psutil

try:
    import pynvml
    HAS_NVIDIA = True
except ImportError:
    HAS_NVIDIA = False


class SystemMonitor:
    """System monitor that collects CPU, memory, and GPU statistics."""

    def __init__(self, update_interval=2):
        """
        Initialize the monitor.

        Args:
            update_interval: data refresh interval in seconds
        """
        self.update_interval = update_interval
        self.data = {
            'cpu_percent': 0,
            'memory_percent': 0,
            'memory_used_gb': 0,
            'memory_total_gb': 0,
            'gpu_temperature': 0,
            'gpu_memory_used_gb': 0,
            'gpu_memory_total_gb': 0,
            'gpu_utilization': 0,
            'timestamp': datetime.now().strftime('%H:%M:%S'),
        }
        self.running = False
        self.thread = None
        self._lock = threading.Lock()  # protects self.data across threads

        # Initialize the NVIDIA Management Library (NVML)
        self.has_gpu = False
        if HAS_NVIDIA:
            try:
                pynvml.nvmlInit()
                self.gpu_handle = pynvml.nvmlDeviceGetHandleByIndex(0)
                self.has_gpu = True
            except Exception as e:
                print(f"NVML initialization failed: {e}")

    def _update_data(self):
        """Collect one sample and publish it."""
        new = {}

        # CPU usage
        new['cpu_percent'] = psutil.cpu_percent(interval=0.1)

        # System memory
        memory = psutil.virtual_memory()
        new['memory_percent'] = memory.percent
        new['memory_used_gb'] = round(memory.used / (1024**3), 2)
        new['memory_total_gb'] = round(memory.total / (1024**3), 2)

        # GPU statistics (only if an NVIDIA GPU is available)
        if self.has_gpu:
            try:
                # GPU temperature
                new['gpu_temperature'] = pynvml.nvmlDeviceGetTemperature(
                    self.gpu_handle, pynvml.NVML_TEMPERATURE_GPU)

                # GPU memory (VRAM)
                memory_info = pynvml.nvmlDeviceGetMemoryInfo(self.gpu_handle)
                new['gpu_memory_used_gb'] = round(memory_info.used / (1024**3), 2)
                new['gpu_memory_total_gb'] = round(memory_info.total / (1024**3), 2)

                # GPU utilization
                util = pynvml.nvmlDeviceGetUtilizationRates(self.gpu_handle)
                new['gpu_utilization'] = util.gpu
            except Exception as e:
                print(f"Failed to read GPU information: {e}")
                self.has_gpu = False

        # Timestamp of this sample
        new['timestamp'] = datetime.now().strftime('%H:%M:%S')

        with self._lock:
            self.data.update(new)

    def start(self):
        """Start the monitoring thread."""
        if self.running:
            return
        self.running = True
        self.thread = threading.Thread(target=self._monitor_loop, daemon=True)
        self.thread.start()
        print("System monitor started")

    def _monitor_loop(self):
        """Monitoring loop."""
        while self.running:
            self._update_data()
            time.sleep(self.update_interval)

    def stop(self):
        """Stop monitoring."""
        self.running = False
        if self.thread:
            # Wait a little longer than one sleep interval so the loop can exit
            self.thread.join(timeout=self.update_interval + 1)
        if HAS_NVIDIA and self.has_gpu:
            pynvml.nvmlShutdown()

    def get_data(self):
        """Return a copy of the current monitoring data."""
        with self._lock:
            return self.data.copy()

    def get_status_text(self):
        """Return the status as formatted text."""
        data = self.get_data()
        lines = []
        lines.append(f"🕐 Updated: {data['timestamp']}")
        lines.append(f"CPU usage: {data['cpu_percent']}%")
        lines.append(f"🧠 Memory: {data['memory_used_gb']}GB / {data['memory_total_gb']}GB ({data['memory_percent']}%)")

        if self.has_gpu:
            # Color hint for the temperature
            temp = data['gpu_temperature']
            temp_status = "🟢"
            if temp > 80:
                temp_status = "🔴"
            elif temp > 70:
                temp_status = "🟡"
            lines.append(f"🎮 GPU temperature: {temp}°C {temp_status}")
            lines.append(f"GPU memory: {data['gpu_memory_used_gb']}GB / {data['gpu_memory_total_gb']}GB")
            lines.append(f"⚡ GPU utilization: {data['gpu_utilization']}%")
        else:
            lines.append("🎮 GPU: no NVIDIA GPU or driver detected")

        return "\n".join(lines)


# Create the global monitor instance
monitor = SystemMonitor()

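This module does several important things:

Automatic GPU detection: if the system has an NVIDIA GPU, it reads temperature, VRAM, and utilization
Periodic updates: data is refreshed every 2 seconds by default, which adds very little load to the main program
Non-blocking: sampling runs in a background thread, so it does not block your WebUI
Readable output: the data is converted into an easy-to-read text format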
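The background-sampling pattern described above can be sketched without any third-party dependency. The class `BackgroundSampler` and its `read_fn` parameter are hypothetical names used only for illustration. The two ideas it shows are publishing each sample under a lock, and handing readers a copy so they never see a half-updated dict.

```python
import threading


class BackgroundSampler:
    """Call read_fn periodically in a daemon thread and keep the latest result."""

    def __init__(self, read_fn, interval=2.0):
        self._read_fn = read_fn
        self._interval = interval
        self._lock = threading.Lock()
        self._latest = {}
        self._stop = threading.Event()
        self._thread = None

    def sample_once(self):
        """Take one sample and publish it atomically."""
        fresh = dict(self._read_fn())  # read outside the lock; it may be slow
        with self._lock:
            self._latest = fresh

    def _loop(self):
        while not self._stop.is_set():
            self.sample_once()
            self._stop.wait(self._interval)  # returns early once stop() is called

    def start(self):
        if self._thread is None:
            self._stop.clear()
            self._thread = threading.Thread(target=self._loop, daemon=True)
            self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join(timeout=self._interval + 1)
            self._thread = None

    def snapshot(self):
        """Return a copy of the latest sample."""
        with self._lock:
            return dict(self._latest)
```

Using `Event.wait` instead of `time.sleep` lets `stop()` interrupt the pause between samples, so shutdown does not have to wait for a full interval.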

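3.2 Modify the Main WebUI Program

Next, we need to modify the main GLM-Image WebUI program to add the monitor panel. Open /root/build/webui.py and add the monitoring code in the appropriate places.

First, import the monitor module at the top of the file:

# Add at the top of webui.py
import sys
import os

sys.path.append(os.path.dirname(os.path.abspath(__file__)))

try:
    from system_monitor import monitor
    HAS_MONITOR = True
except ImportError as e:
    print(f"Failed to import the monitor module: {e}")
    HAS_MONITOR = False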

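Then add the monitor component in the part of the file that builds the Gradio interface. Look for the place where the interface is defined and add code like the following inside it:

# Add the monitor component on top of the existing interface definition
import gradio as gr

# ... existing interface code ...

# Start sampling if the monitor module is available
if HAS_MONITOR:
    monitor.start()

# Function that returns the latest monitoring text
def update_monitor():
    if HAS_MONITOR:
        return monitor.get_status_text()
    return "Monitor module not loaded"

# Read-only status box. Passing a callable as `value` together with `every=2`
# asks Gradio to re-run it every 2 seconds while a client is connected.
# On Gradio 3.x the queue must be enabled (demo.queue()) for this to work.
monitor_output = gr.Textbox(
    label="System Monitor",
    value=update_monitor,
    every=2,
    lines=8,
    interactive=False,
    elem_id="system_monitor"
)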

3.3 Create a Complete Integrated Version

If you would rather not modify the original file, I recommend creating a new launcher file. Create webui_with_monitor.py:

#!/usr/bin/env python3
"""
GLM-Image WebUI with System Monitor
WebUI version with integrated system monitoring
"""

import os
import sys
from datetime import datetime

import gradio as gr
import torch
from diffusers import StableDiffusionPipeline

# Add the current directory to the import path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# Try to import the monitor module
try:
    from system_monitor import monitor
    HAS_MONITOR = True
    print("System monitor module loaded")
except ImportError as e:
    print(f"Failed to load the monitor module: {e}")
    print("Continuing without monitoring")
    HAS_MONITOR = False

# Model path configuration
MODEL_PATH = "/root/build/cache/huggingface/hub/models--zai-org--GLM-Image"
OUTPUT_DIR = "/root/build/outputs"

# Create the output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Model handle, filled in by load_model()
pipe = None


def load_model():
    """Load the GLM-Image model."""
    global pipe
    try:
        print("Loading the GLM-Image model...")
        # The model is loaded through a Diffusers pipeline here.
        # Note: GLM-Image may require a specific loading method;
        # adjust this to the actual model format.
        pipe = StableDiffusionPipeline.from_pretrained(
            MODEL_PATH,
            torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
            safety_checker=None,
            requires_safety_checker=False
        )
        if torch.cuda.is_available():
            pipe = pipe.to("cuda")
            print("Model loaded onto the GPU")
        else:
            print("Model is running on the CPU")
        return "Model loaded. You can start generating images."
    except Exception as e:
        return f"Failed to load the model: {str(e)}"


def generate_image(prompt, negative_prompt, width, height, num_steps, guidance_scale, seed):
    """Generate an image."""
    if pipe is None:
        return None, "Please load the model first"
    try:
        # Set the random seed (gr.Number may deliver a float, so cast it)
        seed = int(seed)
        if seed == -1:
            generator = None
        else:
            device = "cuda" if torch.cuda.is_available() else "cpu"
            generator = torch.Generator(device=device).manual_seed(seed)

        # Generate the image
        print(f"Generating image: {prompt[:50]}...")
        with torch.autocast("cuda" if torch.cuda.is_available() else "cpu"):
            image = pipe(
                prompt=prompt,
                negative_prompt=negative_prompt,
                width=width,
                height=height,
                num_inference_steps=num_steps,
                guidance_scale=guidance_scale,
                generator=generator
            ).images[0]

        # Save the image
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        seed_str = f"seed{seed}" if seed != -1 else "random"
        filename = f"{OUTPUT_DIR}/glm_{timestamp}_{seed_str}.png"
        image.save(filename)
        print(f"Image saved: {filename}")
        return image, f"Done! Saved to: {filename}"
    except torch.cuda.OutOfMemoryError:
        return None, "Out of GPU memory! Try a lower resolution or CPU offload."
    except Exception as e:
        return None, f"Generation failed: {str(e)}"


def get_system_status():
    """Return the system status text."""
    if HAS_MONITOR:
        return monitor.get_status_text()

    # Simple fallback when the monitor module is unavailable
    import time
    import psutil

    memory = psutil.virtual_memory()
    memory_used = round(memory.used / (1024**3), 2)
    memory_total = round(memory.total / (1024**3), 2)
    return "\n".join([
        f"🕐 Updated: {time.strftime('%H:%M:%S')}",
        f"CPU usage: {psutil.cpu_percent()}%",
        f"🧠 Memory: {memory_used}GB / {memory_total}GB ({memory.percent}%)",
        "🎮 GPU monitoring: monitor module not loaded",
    ])


# Build the interface
with gr.Blocks(title="GLM-Image WebUI with System Monitor", theme=gr.themes.Soft()) as demo:
    gr.Markdown("""
    # GLM-Image Text-to-Image
    ## With integrated system monitoring
    """)

    with gr.Row():
        # Left: control panel
        with gr.Column(scale=1):
            gr.Markdown("### ⚙ Control Panel")
            load_btn = gr.Button("Load model", variant="primary")
            load_status = gr.Textbox(label="Model status", interactive=False)
            gr.Markdown("---")
            prompt = gr.Textbox(
                label="Prompt",
                placeholder="Describe the image you want to generate...",
                lines=3
            )
            negative_prompt = gr.Textbox(
                label="🚫 Negative prompt (optional)",
                placeholder="Describe what you do not want...",
                lines=2
            )
            with gr.Row():
                width = gr.Slider(512, 2048, value=1024, step=64, label="Width")
                height = gr.Slider(512, 2048, value=1024, step=64, label="Height")
            num_steps = gr.Slider(20, 150, value=50, step=5, label="Inference steps")
            guidance_scale = gr.Slider(1.0, 20.0, value=7.5, step=0.5, label="Guidance scale")
            seed = gr.Number(value=-1, precision=0, label="Random seed (-1 for random)")
            generate_btn = gr.Button("Generate image", variant="primary")

        # Middle: image display
        with gr.Column(scale=2):
            gr.Markdown("### 🖼 Result")
            output_image = gr.Image(label="Generated image", type="pil")
            output_status = gr.Textbox(label="Generation status", interactive=False)

        # Right: system monitor
        with gr.Column(scale=1):
            gr.Markdown("### System Monitor")
            # A callable `value` plus `every=2` makes Gradio re-run it every
            # 2 seconds while a client is connected (requires the queue).
            monitor_display = gr.Textbox(
                label="Live status",
                value=get_system_status,
                every=2,
                lines=10,
                interactive=False
            )

    # Button events
    load_btn.click(load_model, outputs=load_status)
    generate_btn.click(
        generate_image,
        inputs=[prompt, negative_prompt, width, height, num_steps, guidance_scale, seed],
        outputs=[output_image, output_status]
    )

    # Example prompts
    gr.Markdown("### Example Prompts")
    examples = gr.Examples(
        examples=[
            ["A majestic dragon flying over a mystical mountain landscape at sunset, fantasy art, highly detailed, 8k", "", 1024, 1024, 50, 7.5, -1],
            ["Portrait of a cyberpunk samurai with neon lights, cinematic lighting, 8k ultra detailed", "blurry, low quality", 1024, 1024, 60, 8.0, 42],
            ["Cute cat wearing a spacesuit floating in space, digital art, vibrant colors", "ugly, deformed", 512, 512, 40, 6.0, -1]
        ],
        inputs=[prompt, negative_prompt, width, height, num_steps, guidance_scale, seed],
        outputs=[output_image, output_status],
        fn=generate_image
    )

# Start monitoring (start() prints its own confirmation message)
if HAS_MONITOR:
    monitor.start()

# Launch the interface
if __name__ == "__main__":
    demo.queue()  # needed on Gradio 3.x for the periodic `every=` refresh
    demo.launch(
        server_name="0.0.0.0",
        server_port=7860,
        share=False
    )

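This complete version includes everything:

The original GLM-Image generation features
A real-time system monitor panel
A better user interface layout
Error handling and status messages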

4. Deployment and Usage Guide

4.1 Install the Dependencies

First, make sure the required Python packages are installed. Create a requirements_monitor.txt file:

psutil>=5.9.0
pynvml>=11.5.0
gradio>=3.50.0
torch>=2.0.0
diffusers>=0.24.0
transformers>=4.35.0
accelerate>=0.24.0

Then install them:

pip install -r requirements_monitor.txt

4.2 File Layout

I suggest organizing the files as follows:

/root/build/
├── webui.py                  # original WebUI
├── webui_with_monitor.py     # new version with monitoring
├── system_monitor.py         # monitor module
├── start_with_monitor.sh     # new launch script
├── requirements_monitor.txt  # extra dependencies
├── outputs/                  # generated images
└── cache/                    # model cache

4.3 Create the Launch Script

Create a new launch script named start_with_monitor.sh:

#!/bin/bash
# Launch script for GLM-Image WebUI with System Monitor
set -e

echo "========================================"
echo "GLM-Image WebUI with System Monitor"
echo "========================================"

# Environment variables
export HF_HOME="/root/build/cache/huggingface"
export HUGGINGFACE_HUB_CACHE="/root/build/cache/huggingface/hub"
export TORCH_HOME="/root/build/cache/torch"
export HF_ENDPOINT="https://hf-mirror.com"

# Check dependencies
echo "Checking Python dependencies..."
python -c "import psutil, pynvml, gradio, torch, diffusers" 2>/dev/null || {
    echo "Missing packages, installing..."
    pip install -r /root/build/requirements_monitor.txt
}

# Launch the WebUI
echo "Launching the WebUI with system monitoring..."
cd /root/build
python webui_with_monitor.py

Make the script executable:

chmod +x /root/build/start_with_monitor.sh

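4.4 Launch and Access

You can now launch the WebUI with monitoring:

bash /root/build/start_with_monitor.sh

Wait for the program to start. You will see output similar to this:

========================================
GLM-Image WebUI with System Monitor
========================================
Checking Python dependencies...
Launching the WebUI with system monitoring...
System monitor module loaded
System monitor started
Running on local URL: http://0.0.0.0:7860

Open http://localhost:7860 in your browser to see the new interface with the integrated monitor panel.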
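Loading imports can take a while, so the page may not respond immediately after the script starts. A minimal sketch for waiting until the port accepts connections, assuming a hypothetical helper `wait_for_port` and the port 7860 used in this tutorial:

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=7860, timeout=60.0, poll_interval=0.5):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=poll_interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_interval)
```

This only confirms that something is listening on the port; it does not check that the Gradio page itself has finished loading.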

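5. Reading the Monitoring Data and Practical Tips

5.1 How to Read the Monitoring Data

The monitor panel shows the following categories of information. Here is what each one means:

CPU usage

Normal range: 10%-50% (possibly lower when idle)
Above 80%: the system is fairly busy, but this is usually fine
Sustained 100%: check whether other programs are hogging the CPU

Memory usage

Focus on the used amount and the percentage
Above 90%, the system may slow down
GLM-Image itself uses a fair amount of memory, which is normal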

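GPU temperature

Safe range: usually below 85°C
Ideal temperature: 60-75°C (under full load)
Above 80°C: consider improving cooling
Above 90°C: pause and check your cooling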
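The temperature thresholds above can be turned into a small helper. This is a sketch: the function name `classify_gpu_temperature` and the label strings are assumptions, while the 60, 80, and 90°C boundaries come from the list above. The 75-80°C band is not covered by the list and is treated as normal here.

```python
def classify_gpu_temperature(temp_c):
    """Map a GPU temperature in °C to a coarse advice string."""
    if temp_c > 90:
        return "critical: pause and check cooling"
    if temp_c > 80:
        return "hot: consider improving cooling"
    if temp_c >= 60:
        return "normal under load"
    return "cool or idle"


for reading in (45, 70, 85, 95):
    print(reading, "->", classify_gpu_temperature(reading))
```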

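GPU memory (VRAM)

This is one of the most important metrics
VRAM usage rises noticeably while GLM-Image is generating
If it is close to 100%, the next generation may fail
Keeping 1-2 GB of headroom is a safe margin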
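The headroom rule can be checked programmatically before starting a new generation. In the sketch below, the helper names and the 1.5 GB default are assumptions, with the default chosen from the middle of the 1-2 GB range mentioned above.

```python
def vram_headroom_gb(used_gb, total_gb):
    """Return the free VRAM in GB, rounded to two decimals."""
    return round(total_gb - used_gb, 2)


def has_safe_headroom(used_gb, total_gb, reserve_gb=1.5):
    """Return True if at least reserve_gb of VRAM is still free."""
    return vram_headroom_gb(used_gb, total_gb) >= reserve_gb


print(has_safe_headroom(22.0, 24.0))  # 2.0 GB free
print(has_safe_headroom(23.2, 24.0))  # 0.8 GB free
```

The inputs correspond to the `gpu_memory_used_gb` and `gpu_memory_total_gb` values reported by the monitor module.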

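GPU utilization

0%-10%: idle or light load
50%-80%: typical load during generation
90%-100%: working at full capacity
If utilization is very low during generation, there may be a bottleneck elsewhere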

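5.2 Tune Generation Parameters Based on the Monitor

The monitoring data does more than reassure you. It also helps you tune your generation settings:

When VRAM is running short

# Lower the resolution
width = 768   # down from 1024
height = 768

# Or reduce the number of inference steps
num_steps = 30  # down from 50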
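To see how much a resolution change actually saves, compare pixel counts: 768x768 has about 56% of the pixels of 1024x1024. The helpers below are a sketch with hypothetical names. The multiple of 64 and the minimum of 512 mirror the slider settings used in this tutorial, and real VRAM usage does not necessarily scale linearly with pixel count.

```python
def scale_resolution(width, height, factor, multiple=64, minimum=512):
    """Scale both sides by `factor`, snap down to `multiple`, clamp at `minimum`."""
    def snap(value):
        snapped = int(value * factor) // multiple * multiple
        return max(minimum, snapped)
    return snap(width), snap(height)


def pixel_ratio(old, new):
    """Return the pixel count of `new` relative to `old` (both are (w, h) tuples)."""
    return (new[0] * new[1]) / (old[0] * old[1])


smaller = scale_resolution(1024, 1024, 0.75)
print(smaller, pixel_ratio((1024, 1024), smaller))
```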

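When the GPU temperature is too high

# Give the GPU some time to rest
import time

def generate_with_cooldown(prompt, *args, **kwargs):
    # Remaining arguments are passed through to generate_image unchanged
    result = generate_image(prompt, *args, **kwargs)
    # Check the temperature after generating
    if monitor.get_data()['gpu_temperature'] > 80:
        print("GPU temperature is high, waiting for it to cool down...")
        time.sleep(60)  # wait 1 minute
    return result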
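A fixed one-minute pause may be longer or shorter than needed. An alternative is to poll the temperature until it drops below the threshold, with an upper bound on the wait. In the sketch below, `wait_for_cooldown`, its parameters, and the injected `read_temp` callable are all hypothetical.

```python
import time


def wait_for_cooldown(read_temp, threshold_c=80, poll_interval=5.0,
                      max_wait=300.0, sleep=time.sleep):
    """Block until read_temp() <= threshold_c or max_wait seconds have been slept.

    Returns the last temperature that was read.
    """
    waited = 0.0
    temp = read_temp()
    while temp > threshold_c and waited < max_wait:
        sleep(poll_interval)
        waited += poll_interval
        temp = read_temp()
    return temp
```

With the monitor module from section 3.1, the reader could be `lambda: monitor.get_data()['gpu_temperature']`. The `sleep` parameter exists only so the function can be exercised without real delays.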

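Tracking resource changes during generation

You can wrap the generation call to record resource usage before and after it:

import time

def log_generation_resources(prompt, *args, **kwargs):
    """Log resource usage before and after a generation."""
    before = monitor.get_data()
    start = time.monotonic()
    image, status = generate_image(prompt, *args, **kwargs)
    elapsed = time.monotonic() - start  # the 'timestamp' fields are strings, so time it separately
    after = monitor.get_data()
    print("Before/after comparison:")
    print(f"  VRAM usage: {before['gpu_memory_used_gb']}GB → {after['gpu_memory_used_gb']}GB")
    print(f"  GPU temperature: {before['gpu_temperature']}°C → {after['gpu_temperature']}°C")
    print(f"  Elapsed: about {elapsed:.1f} s")
    return image, status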

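5.3 Common Problems and Solutions

Problem 1: The monitor shows "no NVIDIA GPU detected"

Cause: the pynvml library may not be installed correctly, or there may be a problem with the NVIDIA driver
Fix:
pip install pynvml
nvidia-smi  # confirm the driver is working

Problem 2: The monitoring data does not update

Cause: the update thread may be stuck
Fix: restart the WebUI, or check the update logic in system_monitor.py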
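One way to confirm that the update thread has stalled is to compare the monitor's `timestamp` field (formatted as `HH:MM:SS`) with the current time. In the sketch below, `is_stale` and its `tolerance` parameter are hypothetical; it treats the data as stale once it is older than a few update intervals.

```python
from datetime import datetime, timedelta


def is_stale(timestamp, now=None, update_interval=2, tolerance=3):
    """Return True if an 'HH:MM:SS' timestamp is older than tolerance * update_interval seconds."""
    now = now or datetime.now()
    stamp = datetime.strptime(timestamp, "%H:%M:%S").replace(
        year=now.year, month=now.month, day=now.day)
    if stamp > now:
        # The sample was taken before midnight and "now" is after it.
        stamp -= timedelta(days=1)
    return (now - stamp).total_seconds() > tolerance * update_interval
```

For example, `is_stale(monitor.get_data()['timestamp'])` returning True for several checks in a row suggests the sampling loop is no longer running.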

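Problem 3: Monitoring slows down generation

Cause: the monitor updates too frequently
Fix: increase the update interval
monitor = SystemMonitor(update_interval=5)  # change from 2 seconds to 5 seconds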

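Problem 4: You want to monitor more metrics

Extension: you can add disk usage, network status, and so on
Add to the SystemMonitor class:
def _update_data(self):
    # existing code...
    # Add disk monitoring
    disk = psutil.disk_usage('/')
    self.data['disk_percent'] = disk.percent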