
DeepSeek-OCR-2 Getting Started Guide: Exporting Training Datasets for Your Own OCR Fine-Tuning Tasks

news 2026/3/26 18:49:54


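1. About DeepSeek-OCR-2

DeepSeek-OCR-2 is a new-generation OCR model from the DeepSeek team, built on its DeepEncoder V2 technology. Its defining feature is that it reorders the reading sequence according to the image content, rather than scanning mechanically from left to right as traditional OCR does.

Think of traditional OCR as a pupil who can only read a page in a fixed order, whereas DeepSeek-OCR-2 behaves more like an experienced editor who understands the structure and meaning of a document and picks the most sensible reading path. This approach lets the model keep a high compression ratio while improving its results across the reported benchmarks.

In practice, the model needs only 256 to 1120 visual tokens to process a complex document page, with a reported accuracy of 91.09% on an authoritative benchmark. Besides plain text, it also handles complex document elements such as tables and charts well.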

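2. Environment Setup and Quick Deployment

2.1 System Requirements

Before you start, make sure your system meets the following basic requirements:

Operating system: Ubuntu 18.04+ (Linux) or Windows 10+ (vLLM targets Linux, so on Windows it generally needs WSL2)
Python: 3.8 or later
Memory: at least 16 GB RAM
GPU: NVIDIA GPU (RTX 3080 or better recommended)
Storage: at least 20 GB of free space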
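
Part of the checklist above can be verified from Python before installing anything. The following is a minimal sketch using only the standard library; the function name `check_environment` and its defaults are illustrative, and it covers only the Python version and free disk space (RAM and GPU checks need third-party packages and are left out).

```python
import shutil
import sys


def check_environment(min_python=(3, 8), min_free_gb=20, path="."):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Compare (major, minor) against the required minimum
    if sys.version_info[:2] < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # Free space on the filesystem that contains `path`
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free, {min_free_gb} GB required")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```

If nothing is printed, both checks passed for the current directory's filesystem.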

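2.2 Quick Installation

Open a terminal and run the following commands in order to set up the environment:

    # Create a virtual environment
    python -m venv ocr_env
    source ocr_env/bin/activate        # Linux/Mac
    # or: ocr_env\Scripts\activate     # Windows

    # Install the core dependencies
    pip install torch torchvision torchaudio
    pip install vllm gradio transformers
    pip install deepseek-ocr

Installation usually takes 5-10 minutes, depending on your network speed and hardware. If downloads are slow or fail, consider switching to a PyPI mirror in your region.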
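
After installation, it can help to confirm that the packages are actually importable from the active virtual environment. This is a small sketch using the standard library's `importlib.util.find_spec`; the helper name `missing_packages` is made up for illustration, and `deepseek_ocr` is simply the module name this article imports later.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names used by the scripts in this article
required = ["torch", "torchvision", "vllm", "gradio", "transformers", "deepseek_ocr"]
print("Missing:", missing_packages(required) or "none")
```

`find_spec` only locates the module, so this check is fast and does not load any GPU libraries.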

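3. Quick Start

3.1 Launching the Web Interface

With installation complete, let's first try out the model's basic functionality. Create a launch script:

    # start_ocr.py
    import gradio as gr
    from deepseek_ocr import DeepSeekOCR

    # Initialize the model once at startup
    model = DeepSeekOCR()


    def recognize_document(file):
        """Run OCR on the uploaded document and return the recognized text."""
        if file is None:
            return "Please upload a file first."
        # Depending on the Gradio version, gr.File passes either a path string
        # or a temp-file object with a .name attribute
        file_path = getattr(file, "name", file)
        try:
            # Call the model to recognize the document
            result = model.recognize(file_path)
            return result.text
        except Exception as e:
            return f"Recognition failed: {e}"


    # Build the Gradio interface
    interface = gr.Interface(
        fn=recognize_document,
        inputs=gr.File(label="Upload PDF document"),
        outputs=gr.Textbox(label="Recognition result"),
        title="DeepSeek-OCR-2 Document Recognition",
    )

    # Start the service (0.0.0.0 listens on all network interfaces)
    interface.launch(server_name="0.0.0.0", server_port=7860)

After running the script, open http://localhost:7860 in your browser to see the web interface.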
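
If the page does not load, a quick way to tell whether the server is up yet is to probe the port. The sketch below uses only the standard `socket` module; `is_port_open` is an illustrative helper, and 7860 is the port passed to `launch()` above.

```python
import socket


def is_port_open(host="127.0.0.1", port=7860, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError on refusal or timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_port_open()` in a loop with a short sleep gives a simple readiness check while the model is still loading.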

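3.2 Your First Recognition

In the web interface:

Click the "Upload PDF document" button and choose your test file
Click "Submit" to start recognition
Wait a few seconds for the result to appear

Loading the model for the first time can take a while (typically 1-3 minutes) because it has to be loaded into memory. Later recognitions are much faster; a typical document finishes within a few seconds.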

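4. Exporting a Training Dataset

4.1 Preparing the Export Environment

Before exporting the dataset, install a few additional tools:

    # Install the dependencies used for data export
    pip install pandas numpy pillow
    pip install xmltodict  # for handling annotation file formats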

4.2 Batch Export

Below is a practical batch export script that processes every document in a folder:

    # export_dataset.py
    import json
    import os
    from pathlib import Path

    from deepseek_ocr import DeepSeekOCR


    class DatasetExporter:
        def __init__(self, output_dir="ocr_training_data"):
            self.model = DeepSeekOCR()
            self.output_dir = output_dir
            # Create the output directories
            os.makedirs(f"{self.output_dir}/images", exist_ok=True)
            os.makedirs(f"{self.output_dir}/annotations", exist_ok=True)

        def export_single_document(self, pdf_path, doc_id):
            """Export the data for a single document."""
            try:
                # Recognize the document
                result = self.model.recognize(pdf_path)
                # Save images and annotations
                self._save_results(result, doc_id)
                return True
            except Exception as e:
                print(f"Export failed {pdf_path}: {e}")
                return False

        def _save_results(self, result, doc_id):
            """Save the recognition result."""
            # Save the text annotation
            annotation_data = {
                "doc_id": doc_id,
                "text": result.text,
                "bboxes": result.bboxes,  # text box coordinates
                "confidence": result.confidence_scores,
            }
            annotation_path = f"{self.output_dir}/annotations/{doc_id}.json"
            with open(annotation_path, "w", encoding="utf-8") as f:
                json.dump(annotation_data, f, ensure_ascii=False, indent=2)

            # Save page images, if the result provides any
            if hasattr(result, "images"):
                for i, img in enumerate(result.images):
                    img.save(f"{self.output_dir}/images/{doc_id}_{i}.png")

        def batch_export(self, input_folder):
            """Export every PDF document in a folder."""
            # Sort so that doc IDs are assigned in the same order on every run
            pdf_files = sorted(Path(input_folder).glob("*.pdf"))
            print(f"Found {len(pdf_files)} PDF files")

            success_count = 0
            for i, pdf_file in enumerate(pdf_files):
                print(f"Processing file {i + 1}/{len(pdf_files)}: {pdf_file.name}")
                if self.export_single_document(str(pdf_file), f"doc_{i}"):
                    success_count += 1

            print(f"Export finished: {success_count}/{len(pdf_files)} files succeeded")


    # Usage example
    if __name__ == "__main__":
        exporter = DatasetExporter()
        exporter.batch_export("./documents")  # path to your document folder
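
The script names documents by their position in the folder listing (`doc_0`, `doc_1`, ...), so adding or removing a PDF shifts the IDs of the files after it. One possible alternative, sketched here as an assumption rather than part of the original script, is to derive the ID from the file contents. The helper name `stable_doc_id` is hypothetical.

```python
import hashlib


def stable_doc_id(pdf_path, prefix="doc", length=12):
    """Derive an ID from the file's bytes so it does not depend on listing order."""
    digest = hashlib.sha256()
    with open(pdf_path, "rb") as f:
        # Read in 1 MiB chunks so large PDFs are not loaded into memory at once
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return f"{prefix}_{digest.hexdigest()[:length]}"
```

Passing `stable_doc_id(str(pdf_file))` instead of `f"doc_{i}"` would keep IDs stable across runs. A side effect worth knowing: two byte-identical PDFs get the same ID, so the second export overwrites the first.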

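4.3 Dataset Format

The exported dataset has the following structure:

    ocr_training_data/
    ├── images/              # document images
    │   ├── doc_0_0.png
    │   ├── doc_0_1.png
    │   └── ...
    ├── annotations/         # annotation files
    │   ├── doc_0.json
    │   ├── doc_1.json
    │   └── ...
    └── dataset_info.json    # dataset metadata

Note that the export script in section 4.2 creates only images/ and annotations/; dataset_info.json has to be written separately.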
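
The article does not specify what `dataset_info.json` contains. As a minimal sketch, assuming a simple summary is enough, the file could record a timestamp and the file counts. The function name `write_dataset_info` and the field names are illustrative, not a defined format.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_dataset_info(data_dir):
    """Summarise the export in <data_dir>/dataset_info.json and return the summary."""
    root = Path(data_dir)
    info = {
        # Field names are an assumption; adapt them to your training pipeline
        "created_at": datetime.now(timezone.utc).isoformat(),
        "num_documents": len(list((root / "annotations").glob("*.json"))),
        "num_images": len(list((root / "images").glob("*.png"))),
    }
    with open(root / "dataset_info.json", "w", encoding="utf-8") as f:
        json.dump(info, f, ensure_ascii=False, indent=2)
    return info
```

Calling `write_dataset_info("ocr_training_data")` after a batch export produces the third entry shown in the tree.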

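Each annotation JSON file contains the following information:

    {
      "doc_id": "doc_0",
      "text": "Full recognized text...",
      "bboxes": [
        {"x": 100, "y": 200, "width": 300, "height": 50, "text": "paragraph text"},
        ...
      ],
      "confidence": [0.95, 0.87, ...]
    }

This is a plain JSON layout that is easy to convert into the input format of most OCR training frameworks, so it can serve as the basis for model fine-tuning.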
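
Fine-tuning usually needs a training set and a held-out validation set. One simple way to split the exported documents, sketched here under the assumption that a per-document split is acceptable, is to hash each `doc_id` into a bucket. The helper `assign_split` is hypothetical and not part of any OCR framework.

```python
import hashlib


def assign_split(doc_id, val_fraction=0.1):
    """Map a document ID to 'train' or 'val' deterministically."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    # Interpret the first 4 bytes as a number in [0, 1)
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "val" if bucket < val_fraction else "train"
```

Because the result depends only on the ID, a document stays in the same split when more documents are exported later, which avoids validation data leaking into training between runs.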

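5. Post-Processing and Quality Checks

5.1 Data Cleaning Script

The exported data may need some cleaning before use:

    # data_cleaner.py
    import json
    import re
    from pathlib import Path


    class DataCleaner:
        # Common digit/letter confusions. They are corrected only between two
        # letters, because a blanket replacement would corrupt genuine numbers.
        # Slashes and backslashes are left alone (dates, paths, URLs).
        CONFUSIONS = {"0": "O", "1": "I", "5": "S", "|": "I"}
        CONFUSION_RE = re.compile(r"(?<=[A-Za-z])[015|](?=[A-Za-z])")

        @staticmethod
        def clean_text(text):
            """Clean up the recognized text."""
            # Collapse redundant spaces and newlines
            text = re.sub(r"\s+", " ", text).strip()
            # Fix common OCR confusions inside words
            return DataCleaner.CONFUSION_RE.sub(
                lambda m: DataCleaner.CONFUSIONS[m.group()], text
            )

        @staticmethod
        def filter_low_confidence(data, threshold=0.8):
            """Mask characters whose confidence is below the threshold with '?'."""
            text, scores = data["text"], data.get("confidence", [])
            # Assumes one score per character. zip() would silently truncate the
            # text if the lengths differed, so such records are left untouched.
            if len(text) != len(scores):
                return data
            cleaned_data = data.copy()
            cleaned_data["text"] = "".join(
                ch if score >= threshold else "?" for ch, score in zip(text, scores)
            )
            return cleaned_data


    def process_entire_dataset(data_dir):
        """Clean every annotation file in the dataset (files are overwritten)."""
        annotation_dir = Path(data_dir) / "annotations"

        for json_file in annotation_dir.glob("*.json"):
            with open(json_file, "r", encoding="utf-8") as f:
                data = json.load(f)

            # Filter first: clean_text changes the text length, after which the
            # scores no longer line up with the characters
            data = DataCleaner.filter_low_confidence(data)
            data["text"] = DataCleaner.clean_text(data["text"])

            # Save the cleaned data
            with open(json_file, "w", encoding="utf-8") as f:
                json.dump(data, f, ensure_ascii=False, indent=2)


    # Usage example
    if __name__ == "__main__":
        process_entire_dataset("ocr_training_data")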
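
The cleaning script overwrites the annotation files, so a mistake in a cleaning rule cannot be undone without exporting again. A more cautious variant, sketched below as an assumption, writes the cleaned files to a separate directory. `clean_to_new_dir` is an illustrative helper that takes any function mapping an annotation dict to a cleaned dict.

```python
import json
from pathlib import Path


def clean_to_new_dir(src_dir, dst_dir, clean_fn):
    """Apply clean_fn to every annotation in src_dir and write results to dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = 0
    for json_file in sorted(src.glob("*.json")):
        with open(json_file, "r", encoding="utf-8") as f:
            data = json.load(f)
        # The source file is only read, never modified
        with open(dst / json_file.name, "w", encoding="utf-8") as f:
            json.dump(clean_fn(data), f, ensure_ascii=False, indent=2)
        written += 1
    return written
```

Keeping the raw export intact also makes it possible to compare files before and after cleaning when tuning the rules.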

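5.2 Quality Check Tool

To verify the quality of the exported data, run the following check script. Run it on the raw export, before the cleaning step in 5.1: cleaning changes the text length, so the length comparison below would otherwise report false mismatches.

    # quality_check.py
    import json
    from pathlib import Path


    def check_dataset_quality(data_dir):
        """Check the dataset for missing files and inconsistent annotations."""
        annotation_dir = Path(data_dir) / "annotations"
        image_dir = Path(data_dir) / "images"
        issues = []

        # Check file completeness
        for json_file in sorted(annotation_dir.glob("*.json")):
            doc_id = json_file.stem

            # Check for matching image files; the trailing underscore keeps
            # doc_1 from also matching doc_10_0.png
            if not list(image_dir.glob(f"{doc_id}_*.png")):
                issues.append(f"Missing image files: {doc_id}")

            # Check the annotation content
            with open(json_file, "r", encoding="utf-8") as f:
                data = json.load(f)

            if not data.get("text", "").strip():
                issues.append(f"Empty text: {doc_id}")

            # Assumes one confidence score per character of text
            if "confidence" in data and len(data["confidence"]) != len(data.get("text", "")):
                issues.append(f"Confidence array length mismatch: {doc_id}")

        # Report the results
        if issues:
            print(f"Found {len(issues)} issues:")
            for issue in issues:
                print(f"  - {issue}")
        else:
            print("Dataset quality check passed!")

        return len(issues) == 0


    # Run the check
    if __name__ == "__main__":
        check_dataset_quality("ocr_training_data")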
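
The check above looks at files and text but not at the box geometry. A further check that is often useful before training is whether each box is non-empty and lies inside its page image. The sketch below assumes the `x`/`y`/`width`/`height` keys from the annotation format in section 4.3; the helper name `find_bad_bboxes` is illustrative, and the image size has to be supplied by the caller.

```python
def find_bad_bboxes(bboxes, image_width, image_height):
    """Return indices of boxes that are empty or extend outside the image."""
    bad = []
    for i, box in enumerate(bboxes):
        x, y, w, h = box["x"], box["y"], box["width"], box["height"]
        inside = x >= 0 and y >= 0 and x + w <= image_width and y + h <= image_height
        if w <= 0 or h <= 0 or not inside:
            bad.append(i)
    return bad
```

The image dimensions can be read with Pillow (`Image.open(path).size`), which is already among the dependencies installed in section 4.1.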

6. Advanced Tips

6.1 Custom Export Formats

If you need a specific data format, you can modify the export script:

    def export_to_coco_format(result, doc_id):
        """Export one document's result in COCO format."""
        coco_data = {
            "images": [],
            "annotations": [],
            "categories": [{"id": 1, "name": "text"}],
        }

        # Add image info; the COCO format expects integer IDs
        for i, img in enumerate(result.images):
            coco_data["images"].append({
                "id": i,
                "width": img.width,
                "height": img.height,
                "file_name": f"{doc_id}_{i}.png",
            })

        # Add annotation info
        for j, bbox in enumerate(result.bboxes):
            coco_data["annotations"].append({
                "id": j,
                # 'page_index' is not part of the annotation format shown in
                # section 4.3; fall back to the first page when it is absent
                "image_id": bbox.get("page_index", 0),
                "category_id": 1,
                "bbox": [bbox["x"], bbox["y"], bbox["width"], bbox["height"]],
                "area": bbox["width"] * bbox["height"],
                "iscrowd": 0,
                "text": bbox["text"],  # non-standard field carrying the transcription
            })

        return coco_data
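
The function builds one COCO dictionary per document, and annotation IDs restart at 0 for each of them, so the dictionaries cannot simply be concatenated into one training file. The following is a minimal sketch of a merge step that reassigns unique integer IDs; `merge_coco` is a hypothetical helper, and it assumes every annotation's `image_id` refers to an image in the same dictionary.

```python
def merge_coco(datasets):
    """Merge per-document COCO dicts, reassigning integer IDs so they stay unique."""
    merged = {"images": [], "annotations": [], "categories": [{"id": 1, "name": "text"}]}
    for data in datasets:
        id_map = {}  # old image ID -> new image ID, per document
        for image in data["images"]:
            new_id = len(merged["images"]) + 1
            id_map[image["id"]] = new_id
            merged["images"].append({**image, "id": new_id})
        for ann in data["annotations"]:
            merged["annotations"].append({
                **ann,
                "id": len(merged["annotations"]) + 1,
                "image_id": id_map[ann["image_id"]],
            })
    return merged
```

The merged dictionary can then be written with `json.dump` as a single annotation file.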

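6.2 Incremental Export

For large document collections, incremental export is recommended:

    from pathlib import Path


    def incremental_export(input_folder, output_dir, resume=True):
        """Incremental export that can resume after an interruption."""
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        log_path = Path(output_dir) / "processed_files.txt"

        processed_files = set()
        if resume:
            # Load the list of files that were already processed
            try:
                with open(log_path, "r", encoding="utf-8") as f:
                    processed_files = set(f.read().splitlines())
            except FileNotFoundError:
                pass

        pdf_files = [
            f for f in sorted(Path(input_folder).glob("*.pdf"))
            if f.name not in processed_files
        ]

        for pdf_file in pdf_files:
            # Process the file...

            # Record the file as processed
            with open(log_path, "a", encoding="utf-8") as f:
                f.write(f"{pdf_file.name}\n")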
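
The processing step is left open in the function above. One way to fill it in, sketched here as an assumption rather than the article's implementation, is to pass the per-file work in as a callback and log a file only when the callback reports success, so that failed files are retried on the next run. The name `incremental_run` and the `process` parameter are hypothetical.

```python
from pathlib import Path


def incremental_run(input_folder, output_dir, process, resume=True):
    """Call process(path) for each unprocessed PDF and log successes for resuming."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    log_path = out / "processed_files.txt"

    done = set()
    if resume and log_path.exists():
        done = set(log_path.read_text(encoding="utf-8").splitlines())

    newly_done = []
    for pdf_file in sorted(Path(input_folder).glob("*.pdf")):
        if pdf_file.name in done:
            continue
        if process(pdf_file):  # only successful files are logged
            # Append immediately so progress survives a crash mid-run
            with open(log_path, "a", encoding="utf-8") as f:
                f.write(f"{pdf_file.name}\n")
            newly_done.append(pdf_file.name)
    return newly_done
```

With the exporter from section 4.2, `process` could be a small function that calls `export_single_document` with the file path and a document ID and returns its boolean result.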