
GLM-4v-9b Hands-On Tutorial: Recognizing Text and Tables in Images with AI

news 2026/7/25 14:29:31


1. Introduction: Why Use GLM-4v-9b for Image-Text Recognition

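In everyday work and study, we often need to extract text or tables from images. Traditional OCR tools tend to handle only clean printed text and struggle with complex layouts, handwriting, and tabular data. GLM-4v-9b is a multimodal model with 9 billion parameters that accepts input at 1120×1120 resolution. It can read the text in an image and interpret table structure, turning visual information into editable text.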

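This tutorial starts from scratch and uses GLM-4v-9b to recognize text and tables in images. Compared with traditional OCR pipelines, GLM-4v-9b offers the following advantages: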

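High accuracy: its authors report that it outperforms mainstream models such as GPT-4-turbo on benchmark tests
Chinese optimization: tuned specifically for Chinese-language scenarios, with high accuracy on Chinese text
Table understanding: recognizes not only the text but also the structure and relationships within a table
Multi-turn dialogue: recognition results can be refined through follow-up conversation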

2. Environment Setup and Quick Deployment

2.1 Hardware Requirements

GLM-4v-9b can be deployed in several ways. The minimum hardware requirements are:

GPU: NVIDIA GPU (RTX 4090 or better) with ≥24GB of VRAM (FP16) or ≥9GB (INT4 quantized)
RAM: 32GB or more recommended
Storage: the model files take about 18GB (FP16) or 9GB (INT4)
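The VRAM thresholds above can be turned into a small pre-flight check. The following is a minimal sketch: the function name `choose_precision` is hypothetical, and the thresholds are simply the figures quoted in this section, not measured values.

```python
from typing import Optional

# Minimum GPU memory in GB, copied from the requirements listed in this section.
MIN_VRAM_GB = {"fp16": 24, "int4": 9}


def choose_precision(vram_gb: float) -> Optional[str]:
    """Return the highest precision whose threshold fits, or None if neither fits."""
    for precision in ("fp16", "int4"):  # ordered from most to least demanding
        if vram_gb >= MIN_VRAM_GB[precision]:
            return precision
    return None


if __name__ == "__main__":
    for vram in (48, 24, 12, 8):
        print(f"{vram} GB -> {choose_precision(vram)}")
```

On a machine with PyTorch installed, the available memory could be read from `torch.cuda.get_device_properties(0).total_memory` (in bytes) and passed in after converting to GB.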

2.2 One-Command Deployment

A prebuilt image is the recommended way to get started quickly, since it avoids installing dependencies by hand:

# Quick start with Docker (requires NVIDIA Docker support)
docker run --gpus all -p 7860:7860 -v /path/to/models:/models glm-4v-9b-webui

Once the service has started, open http://localhost:7860 in a browser to use the web interface.

2.3 Manual Installation (for Developers)

To install from source, follow these steps:

# Create a Python virtual environment
conda create -n glm4v python=3.10
conda activate glm4v

# Install dependencies
git clone https://github.com/THUDM/GLM-4
cd GLM-4
pip install -r requirements.txt

# Download the model (from HuggingFace or ModelScope)
git lfs install
git clone https://huggingface.co/THUDM/glm-4v-9b

3. Basic Usage: Recognizing Text in Images

3.1 Recognizing a Single Image

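The Python API makes image text recognition straightforward. The example calls the `chat` helper exposed by the model's remote code; if your checkpoint does not provide it, the model card on HuggingFace shows the alternative route of `tokenizer.apply_chat_template` followed by `model.generate`: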

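import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "THUDM/glm-4v-9b"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Load in bfloat16 as the model card does; without torch_dtype,
# transformers may load the weights in FP32 and need about twice the memory
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda().eval()

# Load the image and normalize it to RGB
image = Image.open("receipt.jpg").convert("RGB")

# Build the prompt
query = "Recognize all the text in the image and preserve the original layout"

# Get the recognition result
response, _ = model.chat(tokenizer, query=query, image=image)
print(response)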
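Scans and phone photos are often far larger than the 1120×1120 input resolution mentioned in the introduction. One option is to shrink them before inference while keeping the aspect ratio. The sketch below only computes the target size; the name `fit_within` is hypothetical, and whether pre-shrinking helps accuracy or only saves memory is an assumption to verify on your own images.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 1120) -> Tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side.

    Images that already fit are returned unchanged; nothing is ever upscaled.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    # Multiply before dividing to limit floating-point rounding error.
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height


if __name__ == "__main__":
    print(fit_within(3000, 4000))
    print(fit_within(800, 600))
```

With Pillow, the result can be applied as `image = image.resize(fit_within(*image.size))`, since `image.size` is a `(width, height)` tuple.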

3.2 Batch Processing

For multiple images, the following script processes them in a batch:

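import os
from concurrent.futures import ThreadPoolExecutor

from PIL import Image

# Reuses the `model` and `tokenizer` loaded in section 3.1


def process_image(img_path):
    image = Image.open(img_path).convert("RGB")
    response, _ = model.chat(tokenizer, "Recognize the text in the image", image=image)
    # Write as UTF-8 so non-ASCII text is saved correctly on every platform
    with open(f"{img_path}.txt", "w", encoding="utf-8") as f:
        f.write(response)


# Process every .jpg image in the directory
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [
        executor.submit(process_image, os.path.join("images", img))
        for img in os.listdir("images")
        if img.lower().endswith(".jpg")
    ]
    # result() re-raises any exception from a worker instead of hiding it
    for future in futures:
        future.result()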
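When all requests go to a single model on a single GPU, threads may add little, and one bad image should not stop the whole run. The following is a minimal sketch of a sequential variant that records failures and keeps going. The name `run_batch` and the `recognize` callback are hypothetical; in practice the callback would wrap the model call from section 3.1.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def run_batch(
    image_dir: str, recognize: Callable[[Path], str]
) -> Tuple[List[Path], Dict[Path, str]]:
    """Run `recognize` on each image in `image_dir`, one at a time.

    Returns the list of .txt files written and a map of image path to
    error message for every image that failed.
    """
    written: List[Path] = []
    failed: Dict[Path, str] = {}
    # sorted() snapshots the listing, so files written below are not revisited
    for img_path in sorted(Path(image_dir).iterdir()):
        if img_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        try:
            text = recognize(img_path)
        except Exception as exc:  # record the failure and move on
            failed[img_path] = str(exc)
            continue
        # Same naming as the script above: receipt.jpg -> receipt.jpg.txt
        out_path = img_path.with_name(img_path.name + ".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written, failed
```

Because the recognizer is passed in as a function, the loop can be tried out with a stub such as `lambda p: "dummy text"` before a GPU is involved.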

4. Advanced Usage: Table Recognition and Structured Output

4.1 Basic Table Recognition

GLM-4v-9b can interpret table structure and convert it to Markdown or CSV:

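# Recognize a table and convert it to Markdown
table_prompt = """Recognize the table in the image and output it as follows:
1. Convert it to a standard Markdown table
2. Keep the header row and the data in every column
3. Make sure the data is aligned"""

image = Image.open("financial_report.png").convert("RGB")
response, _ = model.chat(tokenizer, table_prompt, image=image)
print(response)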
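The Markdown returned by the model is still plain text. To load it into a spreadsheet or a script, it can be parsed into rows and written out as CSV. The following is a minimal sketch using only the standard library; the function names and the sample table are made up for illustration, and escaped pipes (`\|`) inside cells are not handled.

```python
import csv
import io
from typing import List


def parse_markdown_table(text: str) -> List[List[str]]:
    """Extract the cell values of a pipe-delimited Markdown table."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip any prose the model adds around the table
        body = line[1:]
        if body.endswith("|"):
            body = body[:-1]
        cells = [cell.strip() for cell in body.split("|")]
        # Skip the header separator row, e.g. |---|:---:|---:|
        if all(cell and "-" in cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


def rows_to_csv(rows: List[List[str]]) -> str:
    """Serialize rows as CSV text, quoting cells where needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = """Here is the table:
| Item | Q1 | Q2 |
|---|---:|---:|
| Revenue | 120 | 135 |
| Cost | 80 | 90 |
"""
    print(rows_to_csv(parse_markdown_table(sample)))
```

Using the `csv` module rather than joining with commas means cells that contain commas or quotes are escaped correctly.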

4.2 Analyzing Table Data

With multi-turn dialogue, you can run simple analyses directly on the recognized table data:

# Round 1: recognize the table
table_prompt = "Convert this table to Markdown"
response, history = model.chat(tokenizer, table_prompt, image=image)

# Round 2: analyze the data
analysis_prompt = "Based on the table above, compute the average of the third column"
analysis_result, _ = model.chat(tokenizer, analysis_prompt, history=history)
print(analysis_result)
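Language models can slip on arithmetic, so it may be worth recomputing simple statistics locally and comparing them with the model's answer. The following is a minimal sketch that averages one column of an already parsed table. The helper names are hypothetical, and the input is assumed to be a list of rows of strings with a header row first.

```python
from typing import List, Optional


def to_number(cell: str) -> Optional[float]:
    """Parse a cell such as '1,200' or '35%' into a float, or return None."""
    cleaned = cell.replace(",", "").replace("%", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def column_average(
    rows: List[List[str]], column_index: int, has_header: bool = True
) -> float:
    """Average the numeric cells of one column, ignoring non-numeric cells."""
    data = rows[1:] if has_header else rows
    values = []
    for row in data:
        if column_index < len(row):
            number = to_number(row[column_index])
            if number is not None:
                values.append(number)
    if not values:
        raise ValueError("no numeric values found in that column")
    return sum(values) / len(values)


if __name__ == "__main__":
    table = [["Item", "Q1", "Q2"], ["Revenue", "120", "135"], ["Cost", "80", "90"]]
    # Index 2 is the third column, matching the prompt in this section
    print(column_average(table, 2))
```

If the local result and the model's answer disagree, the mismatch is a prompt to recheck the recognized cells against the original image.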