Lightweight GLM-OCR Deployment: Running on CPU with FP16 Quantization for Edge Devices

news 2026/3/26 7:39:05

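1. Introduction

Picture an ordinary computer without a discrete GPU, or a resource-constrained edge device such as a Raspberry Pi or an industrial tablet. Now suppose you need to run a capable OCR (optical character recognition) model on it to process documents, recognize tables, and even parse complex mathematical formulas. Does that sound far-fetched?

Until recently, high-accuracy OCR models usually depended heavily on GPUs and often needed several gigabytes of VRAM, which made them impractical on resource-constrained edge devices. The GLM-OCR deployment approach described here is meant to remove that constraint.

GLM-OCR is an OCR model built on a multimodal architecture. Beyond recognizing text, it can understand table structure and parse mathematical formulas. The original model is about 2.5GB, so its hardware requirements are not trivial. With FP16 precision and CPU-mode tuning, however, it can run on an ordinary laptop or on even lighter hardware.

This article walks through a lightweight CPU deployment of GLM-OCR step by step. Whether you are a developer, a researcher, or a product manager who needs OCR at the edge, the approach brings document understanding to devices that have no GPU.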

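2. About the GLM-OCR Model

Before deploying, here is a short look at what GLM-OCR is and why it is worth the effort of running it on a CPU.

2.1 Architecture and Core Capabilities

GLM-OCR is more than a plain text recognizer. It is based on the GLM-V encoder-decoder architecture and designed specifically for complex document understanding, so it can both "see" the text and "understand" the structure and content of a document.

A few features stand out:

One model, several tasks: a single model handles text, table, and formula recognition, with no switching between tools.
Complex documents: it keeps reasonable accuracy on documents with messy layouts, busy backgrounds, or dense text.
Structured output: tables come back as structured data and formulas come back as LaTeX, which simplifies downstream processing.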

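2.2 Technical Highlights

GLM-OCR relies on several techniques. You do not need to understand them at the code level, but knowing them helps when using the model:

Multi-token prediction (MTP): conventional OCR models usually predict one character or word at a time, while GLM-OCR can predict several tokens at once, which improves training efficiency and recognition speed.
Stable full-task reinforcement learning: the model optimizes all tasks (text, tables, formulas) together with a more stable learning scheme, so no single task improves at the expense of the others.
Efficient visual encoder: it uses CogViT, pretrained on large-scale image-text data, as the visual encoder, which strengthens image understanding.
Lightweight cross-modal connector: combining text and image information is hard, and GLM-OCR addresses it with a small but effective connector.

Together these techniques keep the model relatively compact for what it does, which is what makes CPU deployment feasible.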

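3. Why CPU Mode plus FP16 Quantization?

GPUs are common now, so why bother with CPU deployment? There are several practical reasons.

3.1 What Edge Devices Actually Need

Not every setting can afford a high-performance GPU. Consider these cases:

Industrial sites: industrial PCs and inspection equipment on production lines usually have only a CPU.
Mobile devices: tablets and handheld terminals have limited GPU performance or no discrete GPU at all.
Cost-sensitive deployments: at scale, fitting every device with a GPU is too expensive.
Cloud constraints: some cloud providers charge a premium for GPU instances or do not offer them.

In these settings, CPU deployment is often the only workable option.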

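3.2 What FP16 Quantization Buys You

FP16 (half-precision floating point) is a key technique for running large models on constrained devices. Strictly speaking, going from FP32 to FP16 is a precision cast rather than integer quantization, but this article follows the common usage. It affects two things:

Lower memory use: parameters are stored in 16 bits (FP16) instead of 32 (FP32), which halves the memory taken by the weights. If the 2.5GB GLM-OCR checkpoint is stored in FP32, the weights alone shrink to roughly 1.3GB. Activations and framework overhead come on top of that, which is why section 4.1 budgets 3-4GB for loading.
Compute efficiency: this depends on the CPU. Processors with native half-precision arithmetic (for example recent Arm cores, or x86 chips with AVX512-FP16) can run FP16 faster than FP32. Many older CPUs lack such instructions, and on them FP16 inference can be slower than FP32 or hit operators that PyTorch does not implement in half precision. Benchmark on the target device, and fall back to FP32 or BF16 if needed.

For OCR, the accuracy cost of FP16 is usually small, and in most cases the recognized text is identical. It is still worth comparing FP16 and FP32 output on a sample of your own documents.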
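The two effects described above can be checked with a few lines of standard-library Python. This is only an illustrative sketch: `weight_bytes` and `to_fp16` are helper names made up for this example, and the parameter count is a hypothetical figure obtained by assuming the 2.5GB checkpoint is stored in FP32 (4 bytes per parameter).

```python
import struct


def weight_bytes(num_params: int, bytes_per_param: int) -> int:
    """Memory taken by the weights alone (no activations, no overhead)."""
    return num_params * bytes_per_param


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest FP16 value and convert it back.

    The struct format code "e" is IEEE 754 half precision.
    """
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Hypothetical parameter count: 2.5 GB / 4 bytes per FP32 parameter
NUM_PARAMS = 625_000_000

if __name__ == "__main__":
    print(f"FP32 weights: {weight_bytes(NUM_PARAMS, 4) / 1e9:.2f} GB")
    print(f"FP16 weights: {weight_bytes(NUM_PARAMS, 2) / 1e9:.2f} GB")
    # 0.1 is not exactly representable in FP16, so a small error appears
    print(f"0.1 stored as FP16: {to_fp16(0.1)!r}")
```

The rounding error on a single weight is tiny, which is consistent with the small accuracy impact described above, but it does not replace a side-by-side comparison on real documents.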

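3.3 Balancing Performance and Cost

Choosing CPU mode with FP16 is a trade-off between performance, cost, and deployment flexibility:

Cost: no extra hardware, since it uses the CPU you already have.
Deployment effort: no CUDA installation or GPU driver setup, so the environment is simpler.
Portability: one deployment runs on many machines, independent of GPU model.
Power: CPU-only inference usually draws less power than running a discrete GPU, which suits mobile devices.

CPU mode will not match GPU speed, but for workloads without strict latency requirements (background batch jobs, offline analysis) it is sufficient.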

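4. Environment Setup and Dependencies

Enough theory. Start by getting the environment ready.

4.1 Checking System Requirements

Confirm that your device meets these minimum requirements:

Operating system: Linux (Ubuntu 18.04+, CentOS 7+), Windows 10/11, or macOS
Memory: at least 8GB RAM (16GB or more recommended, since loading the model takes roughly 3-4GB)
Storage: at least 10GB free (model files and dependencies)
Python: 3.8-3.10 (3.10.19 is the best choice)

On Windows, WSL2 (Windows Subsystem for Linux) gives better compatibility. macOS users on Apple silicon (M1/M2/M3) should keep the ARM architecture in mind, although GLM-OCR supports it well.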
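The checklist above can be turned into a quick pre-flight script. The sketch below uses only the standard library; the function names and thresholds are illustrative choices taken from the list, and the RAM probe relies on `os.sysconf`, which is not available on Windows, so it returns `None` there.

```python
import os
import shutil
import sys

MIN_PYTHON = (3, 8)
MAX_PYTHON = (3, 10)
MIN_FREE_DISK_GB = 10
MIN_RAM_GB = 8


def python_version_ok(version=None) -> bool:
    """True if the (major, minor) version is within the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


def free_disk_gb(path=".") -> float:
    """Free space in GiB on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def total_ram_gb():
    """Total physical RAM in GiB, or None if the platform cannot report it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    return pages * page_size / 1024**3


if __name__ == "__main__":
    print("Python version OK:", python_version_ok())
    print("Enough free disk:", free_disk_gb() >= MIN_FREE_DISK_GB)
    ram = total_ram_gb()
    print("Enough RAM:", "unknown" if ram is None else ram >= MIN_RAM_GB)
```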

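4.2 Creating an Isolated Python Environment

I strongly recommend creating an isolated environment with Conda or venv so the system Python stays clean. With Conda:

```bash
# Create a Python 3.10 environment named glm-ocr
conda create -n glm-ocr python=3.10.19 -y

# Activate the environment
conda activate glm-ocr
```

If Conda is not installed, venv works as well:

```bash
# Create the virtual environment
python -m venv glm-ocr-env

# Activate it (Linux/macOS)
source glm-ocr-env/bin/activate

# Activate it (Windows)
glm-ocr-env\Scripts\activate
```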

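4.3 Installing Core Dependencies

GLM-OCR depends on a few key Python packages. Because we are running on CPU, some details matter during installation:

```bash
# Upgrade pip first
pip install --upgrade pip

# Install PyTorch (CPU build)
# See https://pytorch.org/get-started/locally/ for the current command
# The following is an example for Linux
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# Install transformers (a specific version is required)
pip install transformers==4.36.0

# Other dependencies. The specifiers are quoted so the shell does not
# treat ">" as an output redirect.
pip install "gradio>=4.0.0"          # web interface
pip install "pillow>=9.0.0"          # image handling
pip install "opencv-python>=4.8.0"   # optional image preprocessing
pip install "numpy>=1.24.0"
pip install "pandas>=1.5.0"          # table output handling
```

Important: the install command for the CPU build of PyTorch differs by operating system and Python version. Check the PyTorch website for the current command to make sure the installation is correct.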

4.4 Verifying the Environment

After installation, run a short test script to confirm the environment works:

```python
# test_env.py
import torch
import transformers
import gradio

print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"Gradio version: {gradio.__version__}")
# On a CPU-only install this prints False, which is expected here
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device: {torch.device('cuda' if torch.cuda.is_available() else 'cpu')}")
```

Run the script:

```bash
python test_env.py
```

You should see output similar to this:

```
PyTorch version: 2.1.0
Transformers version: 4.36.0
Gradio version: 4.13.0
CUDA available: False
Device: cpu
```

If so, the environment is ready.

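5. Deploying GLM-OCR in Practice

This is the core part: actually deploying the GLM-OCR model. Each step is covered in detail.

5.1 Getting the Model Files

The GLM-OCR model is available from the Hugging Face Model Hub. Because the model is large (2.5GB), I suggest downloading it locally first so it is not fetched again on every run.

```python
# download_model.py
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model cache path (avoids the default download location)
model_cache_dir = "./models/GLM-OCR"
os.makedirs(model_cache_dir, exist_ok=True)

# Model ID
model_id = "ZhipuAI/GLM-OCR"

print("Downloading the GLM-OCR model...")
print("This may take a while; the model is about 2.5GB")

try:
    # Download the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_id,
        cache_dir=model_cache_dir,
        trust_remote_code=True
    )

    # Download the model and load its weights as FP16
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        cache_dir=model_cache_dir,
        torch_dtype=torch.float16,   # key setting: FP16 precision
        low_cpu_mem_usage=True,      # reduce CPU memory use while loading
        trust_remote_code=True
    )

    print("Model download complete!")
    print(f"Model cache path: {model_cache_dir}")

    # Save an FP16 copy locally for later use
    model.save_pretrained(f"{model_cache_dir}/fp16")
    tokenizer.save_pretrained(f"{model_cache_dir}/fp16")

except Exception as e:
    print(f"Error during download: {e}")
    print("Check your network connection, or try a manual download")
```

If you run into network problems, you can also download manually:

Open the GLM-OCR page on Hugging Face
Download all model files (config.json, pytorch_model.bin, and so on)
Place them in the local ./models/GLM-OCR/fp16 directory, which is the path the serving script in section 5.2 loads from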
5.2 Creating the Serving Script

To simplify deployment, create a serving script that handles model loading and service startup.

```python
# serve_glm_ocr_cpu.py
import time
import warnings

import gradio as gr
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

warnings.filterwarnings("ignore")


class GLMOCRService:
    def __init__(self, model_path="./models/GLM-OCR/fp16"):
        """
        Initialize the GLM-OCR service.

        Args:
            model_path: local path to the model
        """
        self.model_path = model_path
        self.device = torch.device("cpu")  # force CPU

        print("Loading the GLM-OCR model...")
        start_time = time.time()

        # Load the tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_path,
            trust_remote_code=True
        )

        # Load the model (FP16 precision, CPU mode)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,   # FP16
            low_cpu_mem_usage=True,      # low-memory loading
            device_map="cpu",            # force CPU
            trust_remote_code=True
        )

        # Switch to evaluation mode
        self.model.eval()

        load_time = time.time() - start_time
        print(f"Model loaded in {load_time:.2f}s")
        print(f"Device: {self.device}")
        print("Model precision: FP16")

    def process_image(self, image, prompt_type="Text Recognition:"):
        """
        Process an image and return the recognition result.

        Args:
            image: PIL Image object or image path
            prompt_type: task label from the UI, or a raw prompt string

        Returns:
            The recognized text
        """
        if isinstance(image, str):
            image = Image.open(image).convert("RGB")

        # Map UI task labels to prompts. A raw prompt such as
        # "Table Recognition:" is passed through unchanged instead of
        # silently falling back to text recognition.
        prompts = {
            "Text recognition": "Text Recognition:",
            "Table recognition": "Table Recognition:",
            "Formula recognition": "Formula Recognition:"
        }
        prompt = prompts.get(prompt_type, prompt_type)

        # Build the inputs (this method comes from the model's remote code)
        inputs = self.model.build_conversation_input_ids(
            self.tokenizer,
            query=prompt,
            images=[image]
        )

        # Move tensors to the CPU device
        inputs = {k: v.to(self.device) if hasattr(v, 'to') else v
                  for k, v in inputs.items()}

        # Generate output
        with torch.no_grad():  # no gradients, less memory
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=512,  # cap the output length
                do_sample=False      # greedy decoding; sampling options do not apply
            )

        # Decode the output
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Strip the prompt from the result
        result = response.replace(prompt, "").strip()
        return result

    def batch_process(self, image_paths, prompt_type="Text Recognition:"):
        """
        Process a list of images one by one.

        Args:
            image_paths: list of image paths
            prompt_type: task type

        Returns:
            List of recognition results
        """
        results = []
        for i, img_path in enumerate(image_paths):
            print(f"Processing image {i+1}/{len(image_paths)}: {img_path}")
            try:
                result = self.process_image(img_path, prompt_type)
                results.append(result)
            except Exception as e:
                print(f"Error while processing {img_path}: {e}")
                results.append(f"Error: {str(e)}")
        return results


def create_gradio_interface():
    """Create the Gradio web interface."""
    # Initialize the service (this loads the model)
    service = GLMOCRService()

    def process_interface(image, task_type):
        """Handler for the Gradio interface."""
        if image is None:
            return "Please upload an image"
        try:
            return service.process_image(image, task_type)
        except Exception as e:
            return f"Processing failed: {str(e)}"

    # Build the interface
    with gr.Blocks(title="GLM-OCR CPU Edition", theme=gr.themes.Soft()) as demo:
        gr.Markdown("# 🎯 GLM-OCR Lightweight OCR System")
        gr.Markdown("### CPU mode · FP16 · complex document understanding")

        with gr.Row():
            with gr.Column(scale=1):
                image_input = gr.Image(
                    label="Upload image",
                    type="pil",
                    sources=["upload", "clipboard"]
                )
                task_type = gr.Radio(
                    choices=["Text recognition", "Table recognition", "Formula recognition"],
                    value="Text recognition",
                    label="Recognition task"
                )
                process_btn = gr.Button("Recognize", variant="primary")

                gr.Markdown("### Instructions")
                gr.Markdown("""
                1. Upload a PNG/JPG/WEBP image
                2. Choose the recognition task
                3. Click "Recognize"
                4. Read the result on the right

                **Supported tasks:**
                - 📝 Text recognition: ordinary text content
                - 📊 Table recognition: structured table data
                - ∫ Formula recognition: math formulas to LaTeX
                """)

            with gr.Column(scale=1):
                output_text = gr.Textbox(
                    label="Result",
                    lines=20,
                    max_lines=50,
                    interactive=False,
                    show_copy_button=True  # built-in copy button
                )
                clear_btn = gr.Button("Clear")

        # Wire up events
        process_btn.click(
            fn=process_interface,
            inputs=[image_input, task_type],
            outputs=output_text
        )
        # Return a plain string; a tuple would be shown literally
        clear_btn.click(fn=lambda: "", outputs=output_text)

        gr.Markdown("---")
        gr.Markdown("### Technical details")
        gr.Markdown("""
        - **Run mode**: CPU + FP16
        - **Model**: GLM-OCR (ZhipuAI)
        - **Memory use**: ~3-4GB
        - **Supported formats**: PNG, JPG, WEBP
        """)

    return demo


if __name__ == "__main__":
    # Create and start the service
    demo = create_gradio_interface()

    # Launch settings
    server_port = 7860
    server_name = "0.0.0.0"  # allow external access

    print("\n🚀 Starting the GLM-OCR service...")
    print(f"📡 Local address: http://localhost:{server_port}")
    print(f"🌐 External access: http://YOUR_IP:{server_port}")
    print("⏳ The first request may take a few extra seconds...")

    demo.launch(
        server_name=server_name,
        server_port=server_port,
        share=False,  # no public link
        debug=False
    )
```

5.3 Creating a Convenience Launcher

To make starting the service easier, add a shell script:

```bash
#!/bin/bash
# start_glm_ocr.sh

echo "========================================"
echo " GLM-OCR CPU Lightweight Launcher"
echo "========================================"

# Check for Python
if ! command -v python &> /dev/null; then
    echo "Error: Python not found. Install Python 3.8+ first."
    exit 1
fi

# Check dependencies
echo "Checking Python dependencies..."
if ! python -c "import torch, gradio, transformers, PIL" 2>/dev/null; then
    echo "Required Python packages are missing, installing..."
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
    # Quote the specifiers so ">" is not treated as a redirect
    pip install "transformers==4.36.0" "gradio>=4.0.0" "pillow>=9.0.0"
fi

# Check for model files
MODEL_DIR="./models/GLM-OCR/fp16"
if [ ! -d "$MODEL_DIR" ]; then
    echo "Model files not found. Run download_model.py first,"
    echo "or place the model files in $MODEL_DIR manually."
    read -p "Download the model now? (y/n): " -n 1 -r
    echo
    if [[ $REPLY =~ ^[Yy]$ ]]; then
        python download_model.py
    else
        echo "Prepare the model files manually and try again."
        exit 1
    fi
fi

# Create the log directory
mkdir -p ./logs

# Start the service
echo "Starting the GLM-OCR service..."
echo "The service will be available at http://localhost:7860"
echo "Press Ctrl+C to stop"
echo "----------------------------------------"

# Run the service and copy its output to a log file
LOG_FILE="./logs/glm_ocr_$(date +%Y%m%d_%H%M%S).log"
python serve_glm_ocr_cpu.py 2>&1 | tee "$LOG_FILE"
```

Make the script executable:

```bash
chmod +x start_glm_ocr.sh
```

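5.4 Starting the Service and Testing It

With everything in place, start the service:

```bash
./start_glm_ocr.sh
```

You should see output similar to this:

```
========================================
 GLM-OCR CPU Lightweight Launcher
========================================
Checking Python dependencies...
Starting the GLM-OCR service...
The service will be available at http://localhost:7860
Press Ctrl+C to stop
----------------------------------------
Loading the GLM-OCR model...
Model loaded in 45.32s
Device: cpu
Model precision: FP16

🚀 Starting the GLM-OCR service...
📡 Local address: http://localhost:7860
🌐 External access: http://YOUR_IP:7860
⏳ The first request may take a few extra seconds...
Running on local URL: http://0.0.0.0:7860
```

Open http://localhost:7860 in a browser and you will see a simple web interface.

Try it out:

Upload an image that contains text
Select "Text recognition"
Click "Recognize"
Wait a few seconds and read the result

If everything works, you should see the recognized text, and GLM-OCR is now running on your CPU device.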
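On a headless edge device there may be no browser to open. A simple way to confirm the service is listening is to probe the TCP port. This is a sketch using only the standard library; `is_port_open` and `wait_for_port` are illustrative helper names, and the port 7860 is the one configured in the serving script above.

```python
import socket
import time


def is_port_open(host="127.0.0.1", port=7860, timeout=1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host="127.0.0.1", port=7860, attempts=60, delay=2.0) -> bool:
    """Poll until the port opens; model loading can take a while on CPU."""
    for _ in range(attempts):
        if is_port_open(host, port):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    print("Service is up" if wait_for_port() else "Service did not start")
```

An open port only shows that the web server started; a real request is still needed to confirm that recognition works.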

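6. Performance Tuning and Practical Tips

The basic deployment is done, but a few techniques help GLM-OCR run faster and more reliably on CPU.

6.1 Reducing Memory Use

Memory is the biggest challenge in CPU deployment. Once loaded, GLM-OCR needs roughly 3-4GB. If your device is short on memory, try the following:

```python
# memory_optimized_service.py
import gc
from contextlib import contextmanager

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class OptimizedGLMOCRService:
    def __init__(self, model_path):
        # Release whatever Python can free before loading the model
        gc.collect()

        self.tokenizer = AutoTokenizer.from_pretrained(
            model_path, trust_remote_code=True
        )
        # 8-bit loading through bitsandbytes is another option, but that
        # library has historically targeted CUDA GPUs, so check its CPU
        # support before relying on it.
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            device_map="cpu",
            trust_remote_code=True
        )
        self.model.eval()

    @contextmanager
    def low_memory_mode(self):
        """Context manager that yields conservative generation settings."""
        generate_kwargs = {
            "max_new_tokens": 256,       # shorter outputs
            "do_sample": False,          # greedy decoding
            "repetition_penalty": 1.1,
        }
        try:
            yield generate_kwargs
        finally:
            # Reclaim memory after each request
            gc.collect()

    def process_with_low_memory(self, image, prompt):
        """Run recognition with the low-memory settings."""
        with self.low_memory_mode() as gen_kwargs:
            inputs = self.model.build_conversation_input_ids(
                self.tokenizer, query=prompt, images=[image]
            )
            with torch.no_grad():
                outputs = self.model.generate(**inputs, **gen_kwargs)
            response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
            return response.replace(prompt, "").strip()
```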

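6.2 Batch Processing

When you need to process many images, working through them in small batches keeps memory use bounded. The function below still runs images one at a time; the batch boundary is where memory is reclaimed.

```python
import gc


def optimized_batch_process(service, image_paths, batch_size=2):
    """
    Batch processing with bounded memory use.

    Args:
        service: a loaded GLMOCRService instance
        image_paths: list of image paths
        batch_size: images per batch; tune to the available memory
    """
    results = []
    total_batches = (len(image_paths) + batch_size - 1) // batch_size

    for i in range(0, len(image_paths), batch_size):
        batch = image_paths[i:i + batch_size]
        print(f"Processing batch {i // batch_size + 1}/{total_batches}")

        batch_results = []
        for img_path in batch:
            try:
                batch_results.append(service.process_image(img_path))
            except Exception as e:
                print(f"Failed to process {img_path}: {e}")
                batch_results.append(None)

        results.extend(batch_results)

        # Reclaim memory between batches
        gc.collect()

    return results
```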
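The slicing and batch-count arithmetic used above can be isolated into two small helpers, which makes it easy to test without loading the model. This is an illustrative sketch; `chunked` and `batch_count` are names chosen for this example.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def batch_count(total: int, batch_size: int) -> int:
    """Number of batches needed for `total` items (ceiling division)."""
    return (total + batch_size - 1) // batch_size


if __name__ == "__main__":
    paths = ["a.png", "b.png", "c.png", "d.png", "e.png"]
    for number, batch in enumerate(chunked(paths, 2), start=1):
        print(f"batch {number}/{batch_count(len(paths), 2)}: {batch}")
```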

6.3 Image Preprocessing

Suitable preprocessing can improve recognition accuracy, especially for low-quality images:

```python
import cv2
import numpy as np
from PIL import Image, ImageEnhance


def preprocess_image(image, enhance_contrast=True, denoise=True, resize_limit=2048):
    """
    Preprocess an image before recognition.

    Args:
        image: PIL Image object
        enhance_contrast: whether to boost contrast
        denoise: whether to denoise
        resize_limit: maximum size of the longer side

    Returns:
        The preprocessed PIL Image
    """
    # Convert to RGB (from RGBA or grayscale)
    if image.mode != 'RGB':
        image = image.convert('RGB')

    # Cap the image size to avoid oversized inputs
    width, height = image.size
    if max(width, height) > resize_limit:
        ratio = resize_limit / max(width, height)
        new_size = (int(width * ratio), int(height * ratio))
        # Image.Resampling requires Pillow 9.1 or newer
        image = image.resize(new_size, Image.Resampling.LANCZOS)

    # Boost contrast (helps with low-quality images)
    if enhance_contrast:
        enhancer = ImageEnhance.Contrast(image)
        image = enhancer.enhance(1.2)  # +20%

    # Light sharpening
    enhancer = ImageEnhance.Sharpness(image)
    image = enhancer.enhance(1.1)

    # Denoise with a light Gaussian blur (OpenCV)
    if denoise:
        img_array = np.array(image)
        img_array = cv2.GaussianBlur(img_array, (3, 3), 0)
        image = Image.fromarray(img_array)

    return image


def process_image_with_preprocess(service, image_path, prompt_type):
    """Run recognition on a preprocessed image using a loaded GLMOCRService."""
    # Load and preprocess the image
    image = Image.open(image_path).convert("RGB")
    processed_image = preprocess_image(image)

    # Recognize the preprocessed image
    return service.process_image(processed_image, prompt_type)
```
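The resize step above scales the image so that its longer side equals the limit while keeping the aspect ratio. The arithmetic can be sketched on its own, without Pillow; `fit_within` is a hypothetical helper, and the guard that keeps each side at least 1 pixel is an addition for extreme aspect ratios.

```python
def fit_within(width: int, height: int, limit: int = 2048):
    """Return (width, height) scaled so the longer side is at most `limit`."""
    longest = max(width, height)
    if longest <= limit:
        return width, height  # already small enough, leave unchanged
    ratio = limit / longest
    # Keep each side at least 1 pixel for very long, thin images
    return max(1, int(width * ratio)), max(1, int(height * ratio))


if __name__ == "__main__":
    print(fit_within(4096, 2048))   # landscape scan, halved
    print(fit_within(1000, 500))    # below the limit, unchanged
```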

6.4 Caching

For images that are processed repeatedly, add a cache:

```python
import hashlib
import os
import pickle

from serve_glm_ocr_cpu import GLMOCRService


class CachedGLMOCRService:
    def __init__(self, model_path, cache_dir="./cache"):
        self.service = GLMOCRService(model_path)
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def get_cache_key(self, image_path, prompt_type):
        """Build a cache key from the file contents and the prompt."""
        digest = hashlib.md5()
        with open(image_path, 'rb') as f:
            digest.update(f.read())
        # Hash the prompt as well so the key is a valid file name
        # (prompts such as "Text Recognition:" contain ':' and spaces)
        digest.update(prompt_type.encode("utf-8"))
        return digest.hexdigest()

    def process_with_cache(self, image_path, prompt_type):
        """Process an image, reusing a cached result when available."""
        cache_key = self.get_cache_key(image_path, prompt_type)
        cache_file = os.path.join(self.cache_dir, f"{cache_key}.pkl")

        # Check the cache (only load pickle files you wrote yourself)
        if os.path.exists(cache_file):
            print(f"Using cached result: {cache_key}")
            with open(cache_file, 'rb') as f:
                return pickle.load(f)

        # Process and cache
        result = self.service.process_image(image_path, prompt_type)
        with open(cache_file, 'wb') as f:
            pickle.dump(result, f)
        return result
```
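On memory-constrained devices, reading a large scan into memory just to hash it is wasteful. A possible variant, sketched below, hashes the file in fixed-size chunks and uses SHA-256; `file_digest` and `cache_key` are illustrative names, and the 1 MiB chunk size is an arbitrary choice.

```python
import hashlib


def file_digest(path, chunk_size=1024 * 1024) -> str:
    """SHA-256 of a file, read in chunks so memory use stays flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_key(path, prompt_type: str) -> str:
    """Filename-safe key combining the file contents and the prompt."""
    prompt_hash = hashlib.sha256(prompt_type.encode("utf-8")).hexdigest()[:12]
    return f"{file_digest(path)}_{prompt_hash}"
```

Because the key depends on the file contents rather than the file name, renamed copies of the same image share one cache entry.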

7. Practical Use Cases

After the theory and setup, here is what GLM-OCR in CPU mode can do in practice.

7.1 Digitizing and Archiving Documents

Suppose you have a stack of paper documents to digitize. Traditional OCR tools may struggle with complex layouts, while GLM-OCR can handle them:

```python
from datetime import datetime
from pathlib import Path

from serve_glm_ocr_cpu import GLMOCRService


def batch_digitize_documents(input_folder, output_folder):
    """
    Digitize every image in a folder.

    Args:
        input_folder: folder containing the images
        output_folder: folder for the output text files
    """
    input_path = Path(input_folder)
    output_path = Path(output_folder)
    output_path.mkdir(parents=True, exist_ok=True)

    # Supported image formats
    image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.tiff', '.webp'}

    # Compare lowercased suffixes so each file is listed exactly once,
    # including on case-insensitive filesystems
    image_files = sorted(
        p for p in input_path.iterdir()
        if p.is_file() and p.suffix.lower() in image_extensions
    )
    print(f"Found {len(image_files)} image files")

    # Initialize the OCR service
    ocr_service = GLMOCRService()

    for i, img_file in enumerate(image_files, 1):
        print(f"Processing [{i}/{len(image_files)}]: {img_file.name}")
        try:
            # Recognize text
            text_result = ocr_service.process_image(
                str(img_file), "Text Recognition:"
            )

            # Save the result
            output_file = output_path / f"{img_file.stem}.txt"
            with open(output_file, 'w', encoding='utf-8') as f:
                f.write(f"File name: {img_file.name}\n")
                f.write(f"Recognized at: {datetime.now()}\n")
                f.write("-" * 50 + "\n")
                f.write(text_result)
            print(f"  Saved: {output_file}")
        except Exception as e:
            print(f"  Failed: {e}")

    print("Batch processing complete!")
```

7.2 Extracting Table Data

Extracting table data from images is one of GLM-OCR's strengths:

```python
import json
from pathlib import Path

import pandas as pd

from serve_glm_ocr_cpu import GLMOCRService


def extract_table_from_image(image_path, output_format='csv'):
    """
    Extract table data from an image.

    Args:
        image_path: path to the image
        output_format: 'csv', 'excel', or 'json'

    Returns:
        The table as a DataFrame
    """
    # Initialize the service
    service = GLMOCRService()

    # Recognize the table
    print("Recognizing table structure...")
    table_result = service.process_image(image_path, "Table Recognition:")

    # GLM-OCR usually returns tables as structured text,
    # which has to be converted to a DataFrame
    try:
        stripped = table_result.strip()
        if stripped.startswith(('{', '[')):
            # The model returned JSON
            df = pd.DataFrame(json.loads(stripped))
        else:
            # Otherwise assume a Markdown-style table
            data = []
            for line in stripped.split('\n'):
                if '|' not in line:
                    continue
                # Keep empty cells so columns stay aligned
                row = [cell.strip() for cell in line.strip().strip('|').split('|')]
                # Skip the |---|---| separator row
                if all(cell and set(cell) <= set('-:') for cell in row):
                    continue
                data.append(row)
            if len(data) > 1:
                df = pd.DataFrame(data[1:], columns=data[0])
            else:
                df = pd.DataFrame([table_result], columns=['content'])
    except Exception as e:
        print(f"Failed to parse table: {e}")
        df = pd.DataFrame({'raw_result': [table_result]})

    # Save the result next to the image; with_suffix works for any
    # extension and never overwrites the source image
    fmt = output_format.lower()
    if fmt == 'csv':
        output_path = Path(image_path).with_suffix('.csv')
        df.to_csv(output_path, index=False, encoding='utf-8-sig')
        print(f"Table saved as CSV: {output_path}")
    elif fmt == 'excel':
        output_path = Path(image_path).with_suffix('.xlsx')
        df.to_excel(output_path, index=False)  # needs openpyxl installed
        print(f"Table saved as Excel: {output_path}")
    elif fmt == 'json':
        output_path = Path(image_path).with_suffix('.json')
        df.to_json(output_path, orient='records', force_ascii=False, indent=2)
        print(f"Table saved as JSON: {output_path}")

    return df


# Usage example
if __name__ == "__main__":
    table_data = extract_table_from_image("invoice.jpg", output_format='excel')
    print(f"Extracted {len(table_data)} rows")
    print(table_data.head())
```
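The Markdown-table branch is the part most likely to need adjusting to what the model actually returns, so it is worth isolating into a function that can be tested without pandas or the model. The sketch below assumes a pipe-delimited table with an optional separator row; `parse_markdown_table` is an illustrative helper, and it does not handle escaped pipes inside cells.

```python
def parse_markdown_table(text: str):
    """Parse a pipe-delimited table into a list of rows (lists of cells)."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if "|" not in line:
            continue  # not a table line
        # Drop the outer pipes, then split; empty cells are preserved
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip separator rows such as |---|:---:|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


if __name__ == "__main__":
    sample = """
    | Item | Price |
    |------|-------|
    | Pen  | 1.50  |
    | Ink  |       |
    """
    for row in parse_markdown_table(sample):
        print(row)
```

Keeping empty cells is a deliberate choice: dropping them would shift later cells into the wrong column.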

7.3 Recognizing Mathematical Formulas

For academic documents, formula recognition is especially useful:

```python
from datetime import datetime
from pathlib import Path

from serve_glm_ocr_cpu import GLMOCRService


def extract_formulas_from_document(image_path, output_latex=True):
    """
    Extract mathematical formulas from a document image.

    Args:
        image_path: path to the document image
        output_latex: whether to wrap results as LaTeX

    Returns:
        List of formulas
    """
    service = GLMOCRService()

    print("Recognizing formulas in the document...")
    formula_result = service.process_image(image_path, "Formula Recognition:")

    # GLM-OCR may return several formulas separated by a delimiter
    if '---' in formula_result:
        formula_list = formula_result.split('---')
    elif '\n\n' in formula_result:
        formula_list = formula_result.split('\n\n')
    else:
        formula_list = [formula_result]

    formulas = []
    for i, formula in enumerate(formula_list, 1):
        formula = formula.strip()
        if formula:
            formulas.append({
                'id': i,
                'formula': formula,
                'latex': f"${formula}$" if output_latex else formula
            })

    print(f"Recognized {len(formulas)} formulas")

    # Save the result
    if formulas:
        source = Path(image_path)
        # with_name works for any image extension
        output_file = source.with_name(f"{source.stem}_formulas.md")
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write("# Formula Recognition Results\n\n")
            f.write(f"Source file: {source.name}\n")
            f.write(f"Recognized at: {datetime.now()}\n\n")
            for formula in formulas:
                f.write(f"## Formula {formula['id']}\n\n")
                f.write(f"**Result**: {formula['formula']}\n\n")
                if output_latex:
                    f.write(f"**LaTeX**: `{formula['latex']}`\n\n")
                f.write("---\n\n")
        print(f"Formula results saved: {output_file}")

    return formulas


# Usage example
if __name__ == "__main__":
    formulas = extract_formulas_from_document("math_paper.png")
    for formula in formulas:
        print(f"Formula {formula['id']}: {formula['formula']}")
```
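Recognized LaTeX occasionally comes back truncated, for example when generation stops at the token limit. A cheap sanity check before saving is to confirm that the braces balance. The sketch below is one possible approach, not part of GLM-OCR: `split_formulas` and `braces_balanced` are hypothetical helpers, and the splitter assumes formulas are separated by a `---` line or a blank line, as in the function above.

```python
import re


def split_formulas(raw: str):
    """Split model output on a '---' line or a blank line."""
    parts = re.split(r"\n\s*---\s*\n|\n\s*\n", raw.strip())
    return [part.strip() for part in parts if part.strip()]


def braces_balanced(latex: str) -> bool:
    """True if every unescaped '{' has a matching '}' in the right order."""
    depth = 0
    escaped = False
    for ch in latex:
        if escaped:
            escaped = False      # skip the character after a backslash
        elif ch == "\\":
            escaped = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False     # closing brace with no opener
    return depth == 0


if __name__ == "__main__":
    output = "E = mc^2\n\n\\frac{a}{b"
    for formula in split_formulas(output):
        status = "ok" if braces_balanced(formula) else "possibly truncated"
        print(f"{formula}  [{status}]")
```

Balanced braces do not prove the formula is correct, but unbalanced ones are a reliable sign that the result should be re-run or reviewed.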