
PDF-Extract-Kit Developer Documentation: API Reference Guide

news 2026/3/26 22:09:28

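1. Overview

1.1 About the Tool

PDF-Extract-Kit is a deep-learning-based toolbox for intelligent PDF content extraction, with secondary development and feature integration by the developer "科哥". It is designed for research, education, and publishing scenarios, and it recognizes and structurally parses the key elements of a PDF document (text, formulas, tables, and images) with high accuracy.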

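Its core strengths are:
- Multimodal processing: integrates layout detection, OCR, formula recognition, and table parsing
- Modular architecture: each function runs independently but can also work with the others, which eases secondary development and system integration
- WebUI + API dual mode: provides a visual interface and also exposes the underlying API for programmatic calls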

💡 This guide focuses on how to use the API and is intended for developers who want to integrate PDF-Extract-Kit into their own systems.

2. System Architecture and Runtime Environment

2.1 Overall Architecture

PDF-Extract-Kit uses a decoupled frontend/backend architecture:

    [Frontend WebUI] ←→ [FastAPI backend service] ←→ [AI model engine]
                                   ↓
                      [Output result management]

All functional modules are served through a unified RESTful API. Model inference is implemented with PyTorch, OCR uses PaddleOCR, and object detection uses the YOLOv8 architecture.

2.2 Dependencies

Component	Required version
Python	≥3.8
PyTorch	≥1.12
CUDA	Optional (11.7+ recommended)
FastAPI	≥0.68
Uvicorn	≥0.15

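2.3 Startup (API Mode)

    # Option 1: start with the script (includes the API service)
    bash start_api.sh

    # Option 2: run the API service directly
    uvicorn api.server:app --host 0.0.0.0 --port 8000 --reload

By default the service is reachable at http://localhost:8000, and the Swagger documentation is available at http://localhost:8000/docs. The --reload flag restarts the server on code changes and is intended for development; omit it in production.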
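Since every endpoint in this guide is a JSON POST against the same base address, a small client helper can keep calling code short. The following is a minimal sketch using only the Python standard library; `build_url` and `call_api` are hypothetical helper names that are not part of PDF-Extract-Kit, and the 300-second timeout is an assumed value chosen because model inference can be slow.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address of the API service


def build_url(endpoint: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an endpoint path without doubling slashes."""
    return base_url.rstrip("/") + "/" + endpoint.lstrip("/")


def call_api(endpoint: str, payload: dict, timeout: float = 300.0) -> dict:
    """POST a JSON payload to the service and return the decoded JSON body.

    Hypothetical helper: the timeout is an assumption, not a documented value.
    """
    request = urllib.request.Request(
        build_url(endpoint),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, a call would look like `call_api("/api/v1/ocr", {"files": ["/img/page1.jpg"]})`.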

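3. Core API Reference

3.1 Layout Detection API

Description

Calls a YOLO model to analyze the page structure and identify regions such as titles, paragraphs, images, and tables.

Endpoint

POST /api/v1/layout-detect

Request parameters (JSON)

    {
      "file_path": "/path/to/input.pdf",
      "img_size": 1024,
      "conf_thres": 0.25,
      "iou_thres": 0.45,
      "output_dir": "./outputs/layout_detection"
    }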

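Parameter	Type	Required	Default	Description
file_path	string	Yes	-	Input file path (PDF or image)
img_size	int	No	1024	Size the image is scaled to
conf_thres	float	No	0.25	Confidence threshold (0~1)
iou_thres	float	No	0.45	IoU threshold for merging overlapping boxes
output_dir	string	No	./outputs/layout_detection	Output directory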
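To avoid round trips that end in a parameter error, the request body can be validated on the client before it is sent. The sketch below mirrors the parameters and defaults from the table; the class name `LayoutDetectRequest` and its `to_payload` method are hypothetical and not part of the service, and the range check on `iou_thres` relies only on the fact that an IoU value always lies between 0 and 1.

```python
from dataclasses import asdict, dataclass


@dataclass
class LayoutDetectRequest:
    """Request body for POST /api/v1/layout-detect with the documented defaults."""

    file_path: str
    img_size: int = 1024
    conf_thres: float = 0.25
    iou_thres: float = 0.45
    output_dir: str = "./outputs/layout_detection"

    def to_payload(self) -> dict:
        """Validate the fields and return a JSON-serializable dict."""
        if not self.file_path:
            raise ValueError("file_path is required")
        if self.img_size <= 0:
            raise ValueError("img_size must be positive")
        for name in ("conf_thres", "iou_thres"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")
        return asdict(self)


payload = LayoutDetectRequest(file_path="/path/to/input.pdf").to_payload()
```

The same pattern applies to the other endpoints by swapping in their parameter names and defaults.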

Example response

    {
      "status": "success",
      "message": "Layout detection completed.",
      "data": {
        "page_count": 1,
        "results": [
          {
            "page": 1,
            "elements": [
              { "type": "text", "bbox": [100, 200, 300, 250], "confidence": 0.92 },
              { "type": "table", "bbox": [150, 400, 500, 600], "confidence": 0.88 }
            ]
          }
        ],
        "visual_path": "./outputs/layout_detection/page_1_layout.jpg",
        "json_path": "./outputs/layout_detection/result.json"
      }
    }
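The response nests elements under pages, so downstream code usually has to flatten it first. As an illustration, the sketch below groups the detected elements by type; `elements_by_type` is a hypothetical helper name, and the `sample` dict simply repeats the example response shown in this section.

```python
from collections import defaultdict


def elements_by_type(response: dict) -> dict:
    """Map each element type to a list of (page, bbox, confidence) tuples."""
    if response.get("status") != "success":
        raise RuntimeError(response.get("message", "layout detection failed"))
    grouped = defaultdict(list)
    for page in response["data"]["results"]:
        for element in page["elements"]:
            grouped[element["type"]].append(
                (page["page"], element["bbox"], element["confidence"])
            )
    return dict(grouped)


# Example response from this section, used as test data.
sample = {
    "status": "success",
    "message": "Layout detection completed.",
    "data": {
        "page_count": 1,
        "results": [
            {
                "page": 1,
                "elements": [
                    {"type": "text", "bbox": [100, 200, 300, 250], "confidence": 0.92},
                    {"type": "table", "bbox": [150, 400, 500, 600], "confidence": 0.88},
                ],
            }
        ],
    },
}

grouped = elements_by_type(sample)
print(grouped["table"])
```

Grouping by type makes it easy to decide which pages to send on to the other endpoints, for example only the pages that contain a `table` element.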

3.2 Formula Detection API

Description

Locates the mathematical formulas in a document and distinguishes inline formulas from display (standalone) formulas.

Endpoint

POST /api/v1/formula-detect

Request parameters (JSON)

    {
      "file_path": "/path/to/document.pdf",
      "img_size": 1280,
      "conf_thres": 0.25,
      "iou_thres": 0.45,
      "output_dir": "./outputs/formula_detection"
    }

Parameter	Type	Required	Default	Description
file_path	string	Yes	-	A PDF or a single image
img_size	int	No	1280	A higher resolution helps detect small formulas
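conf_thres	float	No	0.25	Values below 0.15 are not recommended; lowering the threshold reduces missed detections but adds false positives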
iou_thres	float	No	0.45	Controls how overlapping boxes are merged
output_dir	string	No	./outputs/formula_detection	Custom output path

Example response

    {
      "status": "success",
      "data": {
        "total_formulas": 6,
        "pages": [
          {
            "page": 1,
            "formulas": [
              { "id": 1, "type": "display", "bbox": [200, 300, 400, 350], "confidence": 0.91 }
            ]
          }
        ],
        "visual_path": "./outputs/formula_detection/page_1_formula.jpg"
      }
    }
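Detected formulas typically have to be cropped out of the page image before they can be recognized. The sketch below computes padded crop rectangles from the response. It assumes that `bbox` is `[x_min, y_min, x_max, y_max]` in pixels, which the example values are consistent with but which this guide does not state; `padded_crop_boxes`, the page dimensions, and the 8-pixel padding are all hypothetical choices for illustration.

```python
def padded_crop_boxes(response: dict, page_width: int, page_height: int,
                      padding: int = 8) -> list:
    """Return (page, formula_id, crop_box) tuples with crop_box clamped to the page.

    Assumes bbox = [x_min, y_min, x_max, y_max] in pixels (not confirmed by the docs).
    """
    crops = []
    for page in response["data"]["pages"]:
        for formula in page["formulas"]:
            x_min, y_min, x_max, y_max = formula["bbox"]
            crop_box = (
                max(0, x_min - padding),
                max(0, y_min - padding),
                min(page_width, x_max + padding),
                min(page_height, y_max + padding),
            )
            crops.append((page["page"], formula["id"], crop_box))
    return crops


# Example response from this section, used as test data.
sample = {
    "status": "success",
    "data": {
        "total_formulas": 6,
        "pages": [
            {
                "page": 1,
                "formulas": [
                    {"id": 1, "type": "display",
                     "bbox": [200, 300, 400, 350], "confidence": 0.91}
                ],
            }
        ],
    },
}

print(padded_crop_boxes(sample, page_width=1280, page_height=1280))
```

A little padding around each box tends to help recognition when the detector clips superscripts or fraction bars, and clamping keeps the rectangle inside the page.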

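3.3 Formula Recognition API

Description

Converts formula images into LaTeX expressions; supports batch processing of multiple cropped formula images.

Endpoint

POST /api/v1/formula-recognize

Request parameters (JSON)

    {
      "image_dir": "./cropped_formulas/",
      "batch_size": 1,
      "output_dir": "./outputs/formula_recognition"
    }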

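Parameter	Type	Required	Default	Description
image_dir	string	Yes	-	Path of the folder that contains the formula images
batch_size	int	No	1	Batch size (set to 1 when GPU memory is limited)
output_dir	string	No	./outputs/formula_recognition	Where the results are saved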

Example response

    {
      "status": "success",
      "data": {
        "count": 3,
        "results": [
          { "filename": "eq_1.png", "latex": "E = mc^2" },
          { "filename": "eq_2.png", "latex": "\\sum_{i=1}^{n} x_i = \\frac{n(n+1)}{2}" }
        ],
        "output_file": "./outputs/formula_recognition/results.txt"
      }
    }
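Note that the doubled backslashes in the response are JSON escaping; after decoding, each `latex` string contains single backslashes and can be pasted into a LaTeX document. As a small illustration, the sketch below renders each recognized formula as a display-math block labelled with its source file. `formulas_to_latex_body` is a hypothetical helper name, and the `sample` dict repeats the decoded example response.

```python
def formulas_to_latex_body(response: dict) -> str:
    """Render each recognized formula as a display-math block with a source comment."""
    blocks = []
    for item in response["data"]["results"]:
        blocks.append(f"% {item['filename']}\n\\[\n{item['latex']}\n\\]")
    return "\n\n".join(blocks)


# Decoded example response from this section (raw strings keep the backslashes).
sample = {
    "status": "success",
    "data": {
        "count": 3,
        "results": [
            {"filename": "eq_1.png", "latex": "E = mc^2"},
            {"filename": "eq_2.png",
             "latex": r"\sum_{i=1}^{n} x_i = \frac{n(n+1)}{2}"},
        ],
    },
}

print(formulas_to_latex_body(sample))
```

Keeping the file name as a LaTeX comment makes it possible to trace a badly recognized formula back to its cropped image.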

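3.4 OCR Text Recognition API

Description

Uses the PaddleOCR engine to extract text from images; supports mixed Chinese and English recognition.

Endpoint

POST /api/v1/ocr

Request parameters (JSON)

    {
      "files": ["/img/page1.jpg", "/img/page2.jpg"],
      "lang": "ch",
      "draw_boxes": true,
      "output_dir": "./outputs/ocr"
    }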

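Parameter	Type	Required	Default	Description
files	array[string]	Yes	-	List of file paths
lang	string	No	ch	ch (Chinese and English), en (English)
draw_boxes	boolean	No	false	Whether to generate an annotated image with boxes
output_dir	string	No	./outputs/ocr	Output directory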

Example response

    {
      "status": "success",
      "data": [
        {
          "file": "page1.jpg",
          "text_lines": [
            "摘要：本文提出一种新的方法",
            "关键词：自然语言处理，OCR"
          ],
          "visual_path": "./outputs/ocr/page1_annotated.jpg"
        }
      ]
    }
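Unlike the other endpoints, `data` here is a list with one entry per input file. A minimal sketch of turning that list into plain text per file follows; `ocr_text_by_file` is a hypothetical helper name, and the `sample` dict repeats the example response.

```python
def ocr_text_by_file(response: dict) -> dict:
    """Map each file name to its recognized lines joined with newlines."""
    if response.get("status") != "success":
        raise RuntimeError("OCR request did not succeed")
    return {
        item["file"]: "\n".join(item["text_lines"])
        for item in response["data"]
    }


# Example response from this section, used as test data.
sample = {
    "status": "success",
    "data": [
        {
            "file": "page1.jpg",
            "text_lines": [
                "摘要：本文提出一种新的方法",
                "关键词：自然语言处理，OCR",
            ],
            "visual_path": "./outputs/ocr/page1_annotated.jpg",
        }
    ],
}

texts = ocr_text_by_file(sample)
print(texts["page1.jpg"])
```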

3.5 Table Parsing API

Description

Recognizes table structure and exports it as LaTeX, HTML, or Markdown.

Endpoint

POST /api/v1/table-parse

Request parameters (JSON)

    {
      "file_path": "/docs/paper.pdf",
      "format": "markdown",
      "output_dir": "./outputs/table_parsing"
    }

Parameter	Type	Required	Default	Description
file_path	string	Yes	-	Input file path
format	string	No	markdown	markdown/html/latex
output_dir	string	No	./outputs/table_parsing	Output path

Example response

    {
      "status": "success",
      "data": {
        "tables_found": 2,
        "results": [
          {
            "page": 3,
            "format": "markdown",
            "content": "| 年份 | 销量 |\n|------|------|\n| 2021 | 120 |",
            "output_path": "./outputs/table_parsing/table_1.md"
          }
        ]
      }
    }
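When `format` is `markdown`, the `content` field is a plain Markdown table that is easy to turn into rows for further processing. The sketch below handles only simple tables like the one in the example; it does not deal with escaped pipes or multi-line cells, and `markdown_table_to_rows` is a hypothetical helper name.

```python
def markdown_table_to_rows(content: str) -> list:
    """Split a simple Markdown table into a list of rows (lists of cell strings)."""
    rows = []
    for line in content.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip the header separator row, e.g. |------|------|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


# The "content" value from the example response in this section.
content = "| 年份 | 销量 |\n|------|------|\n| 2021 | 120 |"
rows = markdown_table_to_rows(content)
print(rows)
```

The first row holds the header cells, so `rows[0]` can serve as column names when loading the data into another structure.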

4. Practical Advice for Developers

4.1 Automated Batch Processing Pipeline

Combine several APIs into a fully automated PDF information extraction pipeline:

    import requests

    BASE_URL = "http://localhost:8000/api/v1"

    def extract_paper_data(pdf_path):
        # Step 1: layout detection
        resp = requests.post(f"{BASE_URL}/layout-detect", json={"file_path": pdf_path})
        resp.raise_for_status()
        layout = resp.json()

        # Step 2: collect the pages that contain a table, then parse them.
        # Each entry of "results" is a page; element types live in its "elements" list.
        table_pages = sorted({
            page["page"]
            for page in layout["data"]["results"]
            if any(e["type"] == "table" for e in page["elements"])
        })
        for page in table_pages:
            requests.post(f"{BASE_URL}/table-parse", json={
                "file_path": f"{pdf_path}[{page - 1}]",  # PDF page index starts at 0
                "format": "markdown",
            })

        # Step 3: formula detection
        formula_resp = requests.post(f"{BASE_URL}/formula-detect", json={"file_path": pdf_path})
        print("Total formulas:", formula_resp.json()["data"]["total_formulas"])

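4.2 Performance Tuning Tips

Scenario	Recommendation
Insufficient GPU memory	Lower `img_size` and set `batch_size=1`
Slow processing	Turn off unnecessary visualization output
Small text missed	Raise `img_size` to 1280 or above
Complex tables come out garbled	Set `conf_thres` to 0.3~0.4 to improve accuracy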

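4.3 Error Codes

code	Meaning	Solution
400	Parameter error	Check the required fields and their formats
404	File not found	Confirm that the path exists
500	Internal error	Check the backend logs for model loading problems
503	Model not ready	Wait for model initialization to finish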
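A client can act on these codes automatically: wait and retry on 503, and fail with a readable hint on everything else. The sketch below assumes that the codes are returned as HTTP status codes and that success is reported as 200, neither of which the table states explicitly. `call_with_retry` is a hypothetical helper that takes any zero-argument callable returning `(status_code, body)`, so it is independent of the HTTP library in use.

```python
import time

ERROR_HINTS = {
    400: "Parameter error: check the required fields and their formats.",
    404: "File not found: confirm that the path exists.",
    500: "Internal error: check the backend logs for model loading problems.",
    503: "Model not ready: wait for model initialization to finish.",
}


def call_with_retry(send, max_attempts: int = 5, base_delay: float = 2.0,
                    sleep=time.sleep):
    """Call send() until it returns a non-503 status or the attempts run out.

    Assumes the documented codes are HTTP status codes and 200 means success.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status == 503 and attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
            continue
        if status != 200:
            hint = ERROR_HINTS.get(status, "Unexpected error.")
            raise RuntimeError(f"HTTP {status}: {hint}")
        return body
```

Only 503 is retried because it is the one code in the table that resolves by waiting; the other codes point to a problem in the request or on the server that a retry would not fix.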