
High Availability for the LM Text-to-Image Web Service: Process Supervision and Automatic Restart with supervisor

news 2026/6/19 11:24:27

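1. Project Background and Requirements

The LM text-to-image web service is an AI image generation system built on the Tongyi-MAI/Z-Image base model. It gives users a convenient way to generate images from text. In production, keeping the service running continuously and stably is essential.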

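1.1 High-Availability Challenges

A long-running web service can fail in several ways:

Memory leaks that crash the process
Insufficient GPU memory that triggers exceptions
Network fluctuations that interrupt the service
Exhausted system resources that make the service unavailable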

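1.2 Solution Overview

This article shows how to use supervisor to provide:

Automatic process supervision
Automatic restart after a crash
Service status monitoring
Centralized log management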

2. Installing and Configuring supervisor

2.1 Installing supervisor

On Ubuntu/Debian:

sudo apt update
sudo apt install -y supervisor

Verify the installation:

supervisord --version

2.2 Configuring the LM Service

Create the configuration file /etc/supervisor/conf.d/lm-web.conf:

[program:lm-web]
command=/usr/bin/python3 /opt/lm-web/app.py
directory=/opt/lm-web
user=root
autostart=true
autorestart=true
startretries=3
stopwaitsecs=60
stdout_logfile=/var/log/lm-web.out.log
stderr_logfile=/var/log/lm-web.err.log
environment=PYTHONUNBUFFERED="1"

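Key parameters:

autorestart=true: restarts the process whenever it exits, whatever the exit code (use autorestart=unexpected to restart only on exit codes not listed in exitcodes)
startretries=3: number of consecutive failed start attempts allowed before supervisor gives up and marks the process FATAL
stopwaitsecs=60: seconds to wait for a graceful exit after the stop signal before supervisor sends SIGKILL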
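
The restart decision that autorestart controls can be modelled in a few lines. This is an illustrative sketch in Python, not supervisor's own code: should_restart is a hypothetical helper, it covers only a process that exits after it has reached RUNNING, and the default exitcodes of (0,) is an assumption that should be checked against the documentation for the installed version.

```python
def should_restart(autorestart: str, exit_code: int, exitcodes=(0,)) -> bool:
    """Model the autorestart decision for a process that exited from RUNNING.

    autorestart is the raw config value: "true", "false" or "unexpected".
    exitcodes lists the exit codes treated as expected (assumed default: 0).
    """
    if autorestart == "true":
        # Always restart, even after a clean exit.
        return True
    if autorestart == "false":
        # Never restart automatically.
        return False
    if autorestart == "unexpected":
        # Restart only when the exit code is not one of the expected ones.
        return exit_code not in exitcodes
    raise ValueError(f"unknown autorestart value: {autorestart!r}")


if __name__ == "__main__":
    for mode in ("true", "false", "unexpected"):
        for code in (0, 1):
            print(mode, code, should_restart(mode, code))
```

With autorestart=true, as in the configuration above, even a deliberate clean exit of app.py is followed by a restart; unexpected is the value to choose if a clean exit should be left alone.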

3. Service Management and Monitoring

3.1 Common Management Commands

# Reload the configuration
sudo supervisorctl reread
sudo supervisorctl update
# Start the service
sudo supervisorctl start lm-web
# Stop the service
sudo supervisorctl stop lm-web
# Restart the service
sudo supervisorctl restart lm-web
# Show status
sudo supervisorctl status

3.2 Monitoring and Logs

Follow the log in real time:

tail -f /var/log/lm-web.out.log

Check the service status:

sudo supervisorctl status lm-web

Expected output (example):

lm-web RUNNING pid 12345, uptime 0:10:00
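
For scripted monitoring, a status line like the one above can be split into fields. A minimal sketch, assuming the line layout shown in the example (name, state, then "pid N, uptime T" for a running process); parse_status_line is a hypothetical helper, not part of supervisor.

```python
import re
from typing import Optional

# name, whitespace, upper-case state, then free-form detail text.
STATUS_RE = re.compile(r"^(?P<name>\S+)\s+(?P<state>[A-Z]+)\s*(?P<detail>.*)$")
# Detail format shown for a running process.
RUNNING_RE = re.compile(r"pid (\d+), uptime (.+)$")


def parse_status_line(line: str) -> Optional[dict]:
    """Parse one line of `supervisorctl status` output into a dict."""
    match = STATUS_RE.match(line.strip())
    if match is None:
        return None
    info = {
        "name": match.group("name"),
        "state": match.group("state"),
        "pid": None,
        "uptime": None,
    }
    running = RUNNING_RE.search(match.group("detail"))
    if running:
        info["pid"] = int(running.group(1))
        info["uptime"] = running.group(2)
    return info


if __name__ == "__main__":
    print(parse_status_line("lm-web RUNNING pid 12345, uptime 0:10:00"))
```

A monitoring script could alert whenever the parsed state is anything other than RUNNING.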

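4. Advanced Configuration and Optimization

4.1 Resource Limits

To keep the service from using too many resources, limits can be passed to it as environment variables. supervisor only exports these variables to the process and does not enforce them, so app.py has to read and apply them. List both variables on one environment= line rather than repeating the key:

[program:lm-web]
...
; memory limit (MB) and GPU memory limit (MB)
environment=MEMORY_LIMIT="4096",GPU_MEMORY_LIMIT="16000"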
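
Variables set through environment= are only passed to the process, so the application has to read and apply them. Below is a minimal sketch of what that could look like inside app.py. Limits, load_limits, and gpu_fraction are hypothetical names, and the assumption that both values are in MB comes from the comments in the configuration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Limits:
    memory_mb: Optional[int]
    gpu_memory_mb: Optional[int]


def _read_mb(env: Mapping[str, str], key: str) -> Optional[int]:
    """Return the variable as a positive integer number of MB, or None if unset."""
    raw = env.get(key)
    if raw is None or raw.strip() == "":
        return None
    value = int(raw)
    if value <= 0:
        raise ValueError(f"{key} must be positive, got {value}")
    return value


def load_limits(env: Mapping[str, str] = os.environ) -> Limits:
    """Read the limits that supervisor exported to this process."""
    return Limits(
        memory_mb=_read_mb(env, "MEMORY_LIMIT"),
        gpu_memory_mb=_read_mb(env, "GPU_MEMORY_LIMIT"),
    )


def gpu_fraction(limit_mb: int, total_mb: int) -> float:
    """Convert an absolute GPU limit into a fraction of the device memory."""
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    return min(1.0, limit_mb / total_mb)


if __name__ == "__main__":
    print(load_limits())
```

How the values are enforced depends on the framework app.py uses. If it is PyTorch, which is an assumption here, the fraction could be handed to torch.cuda.set_per_process_memory_fraction.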

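4.2 Health Check Integration

supervisor has no built-in health-check option. It only notices when a process exits, and keys such as health_check in a [program:x] section are not part of its configuration. To restart a process that is still running but no longer responds, one option is the httpok event listener from the superlance package (pip install superlance). The example assumes the service exposes a /health endpoint on port 7860 and probes it every 60 seconds:

[eventlistener:httpok]
command=httpok -p lm-web http://localhost:7860/health
events=TICK_60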
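
The HTTP probe itself, the equivalent of running curl -sSf against /health, can be sketched in Python. This assumes the service answers a healthy request with a 2xx status; is_healthy is a hypothetical helper, and the /health path and port 7860 are taken from the example above.

```python
import http.client
import urllib.error
import urllib.request

# Ignore proxy settings from the environment: the probe targets a local service.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on url answers with a 2xx status within timeout."""
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP 4xx/5xx, malformed reply or bad URL.
        return False


if __name__ == "__main__":
    print(is_healthy("http://localhost:7860/health"))
```

A cron job or an external watchdog could call this and run supervisorctl restart lm-web after several consecutive failures.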

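4.3 Managing Multiple Processes

For a multi-worker setup, numprocs starts several instances of the same command. If app.py binds a fixed port such as 7860, the second instance will fail to bind it, so each instance needs its own port (for example derived from %(process_num)s in command):

[program:lm-web]
process_name=%(program_name)s_%(process_num)02d
numprocs=2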
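
supervisor expands process_name with Python-style %(...)s substitution, once per instance. The sketch below reproduces that expansion to show the names the two instances would get. expand_process_names is a hypothetical helper, and starting process_num at 0 assumes the default numprocs_start.

```python
def expand_process_names(
    program_name: str,
    numprocs: int,
    template: str = "%(program_name)s_%(process_num)02d",
    numprocs_start: int = 0,
) -> list:
    """Return the process names generated for numprocs instances."""
    return [
        template % {"program_name": program_name, "process_num": num}
        for num in range(numprocs_start, numprocs_start + numprocs)
    ]


if __name__ == "__main__":
    for name in expand_process_names("lm-web", 2):
        print(name)
```

With the configuration above, the instances are named lm-web_00 and lm-web_01, and supervisorctl addresses them as lm-web:lm-web_00 and lm-web:lm-web_01.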

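5. Exception Handling and Troubleshooting

5.1 Common Problems

Service restarts frequently:

Check the logs to find the root cause
Adjust startretries and stopwaitsecs
Consider adding resource limits

Port conflict:

# Find the process listening on port 7860
ss -ltnp | grep 7860
# Force-kill it, replacing <PID> with the process ID from the output above
kill -9 <PID>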
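
Before killing anything, it can help to confirm from a script that something is actually listening on the port. A minimal sketch, assuming the service listens on TCP at 127.0.0.1; port_in_use is a hypothetical helper.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value on failure.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("port 7860 in use:", port_in_use(7860))
```

Unlike ss -ltnp, this check does not report which process owns the port, so it only tells you whether a conflict exists.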

5.2 Log Analysis Tips

Log patterns to watch for:

Exceptions logged at ERROR level
Out-of-memory warnings
GPU-related errors
Request timeout records
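
The four kinds of log line listed above can be counted with a small scanner. This is a sketch only: the regular expressions are assumptions about how app.py words its messages and should be adjusted to the real log format, and count_patterns is a hypothetical helper.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed message wording; adapt these to the service's actual log format.
PATTERNS = {
    "error": re.compile(r"\bERROR\b"),
    "out_of_memory": re.compile(r"out of memory|MemoryError", re.IGNORECASE),
    "gpu": re.compile(r"\bCUDA\b|\bGPU\b", re.IGNORECASE),
    "timeout": re.compile(r"timed? ?out", re.IGNORECASE),
}


def count_patterns(lines: Iterable[str]) -> Counter:
    """Count how many lines match each pattern (a line may match several)."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2026-06-19 11:00:00 ERROR CUDA out of memory",
        "2026-06-19 11:00:05 WARNING request timed out after 30s",
    ]
    print(dict(count_patterns(sample)))
```

To scan the real log, pass an open file object, for example open("/var/log/lm-web.err.log"), since iterating over a file yields its lines.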

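6. Summary and Best Practices

6.1 Results

After deploying with supervisor:

Service availability rose to 99.9%
Recovery time after a failure dropped to seconds
Operations and maintenance became markedly more efficient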
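
To put the 99.9% figure in perspective, an availability target translates directly into a downtime budget. The arithmetic is sketched below; downtime_budget_hours is a hypothetical helper, and the 365-day year and 30-day month are conventions chosen for illustration.

```python
def downtime_budget_hours(availability: float, period_hours: float) -> float:
    """Return the downtime, in hours, allowed by an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours


if __name__ == "__main__":
    per_year = downtime_budget_hours(0.999, 365 * 24)
    per_month = downtime_budget_hours(0.999, 30 * 24)
    print(f"per year:  {per_year:.2f} hours")
    print(f"per month: {per_month * 60:.1f} minutes")
```

At 99.9%, the budget is roughly 8.76 hours per year, or about 43 minutes per 30-day month, so restarts that complete within seconds use little of it even if they happen often.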

6.2 Recommended Configuration

Suggested configuration for production:

[program:lm-web]
command=/usr/bin/python3 /opt/lm-web/app.py
directory=/opt/lm-web
user=root
autostart=true
autorestart=true
startretries=5
stopwaitsecs=120
stdout_logfile=/var/log/lm-web.out.log
stderr_logfile=/var/log/lm-web.err.log
environment=PYTHONUNBUFFERED="1",MEMORY_LIMIT="8192",GPU_MEMORY_LIMIT="20000"