Goodbye to Manual Labeling Chores! Batch-Process Labelme JSON Files with Python Scripts (Cleanup Script Included)

news 2026/6/30 18:10:25

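In computer vision projects, data annotation is often one of the most time-consuming stages. Labelme, an open-source image annotation tool, is widely used for its flexibility and ease of use. As a project grows to hundreds or thousands of annotation files, however, checking and cleaning the JSON files by hand becomes a heavy burden. This article shows how to automate the processing of Labelme JSON files with Python scripts so that annotation data is much easier to manage.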

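1. Why Automate the Processing of Labelme JSON Files

A Labelme JSON file holds all the key information about an annotated image, including the image path, the annotated shapes, and the class labels. In real projects these files commonly suffer from the following problems:

Invalid annotation files: some images were never annotated, yet a JSON file exists for them
Annotation errors: bounding boxes extend beyond the image, or class names are spelled inconsistently
Data imbalance: some classes have far more annotations than others
Format problems: the JSON structure is not what downstream code expects, which breaks training later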
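The problems listed above can be detected with a read-only scan before anything is modified. The following is a minimal sketch, not a definitive validator: the function names `find_problems` and `scan_directory` and the `REQUIRED_KEYS` tuple are hypothetical, and the set of keys treated as required is an assumption you should adapt to your own files.

```python
import json
from pathlib import Path

# Assumption: these are the keys downstream code needs; adjust as required.
REQUIRED_KEYS = ("shapes", "imagePath")


def find_problems(data):
    """Return a list of human-readable problems in one parsed Labelme JSON value."""
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in data]
    shapes = data.get("shapes")
    if shapes is None:
        return problems
    if not isinstance(shapes, list):
        problems.append("'shapes' is not a list")
        return problems
    if not shapes:
        problems.append("no shapes (unannotated image)")
    for i, shape in enumerate(shapes):
        if not isinstance(shape, dict) or "label" not in shape or "points" not in shape:
            problems.append(f"shape {i} lacks 'label' or 'points'")
    return problems


def scan_directory(directory):
    """Map each problematic JSON file name to its list of problems (read-only)."""
    report = {}
    for path in sorted(Path(directory).glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            report[path.name] = [f"unreadable JSON: {exc}"]
            continue
        problems = find_problems(data)
        if problems:
            report[path.name] = problems
    return report
```

Because the scan never writes to disk, it is safe to run on the full dataset and review the report before deciding which cleanup steps to apply.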

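Checking for these problems by hand is slow and error-prone. Automating the work with Python scripts makes it possible to:

Quickly find and delete invalid annotation files
Automatically correct common annotation errors
Compute statistics on the annotation distribution
Convert annotation formats in bulk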

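2. Environment Setup and a Basic Script

2.1 Installing the Dependencies

The json and os modules ship with Python and need no installation. The third-party packages are installed with pip (of these, the scripts below import only OpenCV):

    pip install numpy opencv-python tqdm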

2.2 Basic File-Handling Script

The following script iterates over every JSON file in a given directory:

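    import json
    import os


    def process_labelme_json(directory):
        """Visit every Labelme JSON file in `directory`."""
        for filename in os.listdir(directory):
            if filename.endswith('.json'):
                filepath = os.path.join(directory, filename)
                # Read as UTF-8 explicitly instead of relying on the platform default
                with open(filepath, 'r', encoding='utf-8') as f:
                    data = json.load(f)
                # Add your processing logic here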

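3. Practical Data-Processing Scripts

3.1 Automatically Deleting JSON Files Without Annotations

During annotation, some JSON files end up containing no actual annotations. The following script finds and deletes them: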

    def remove_empty_annotations(directory):
        """Delete JSON files that contain no annotated shapes."""
        removed_count = 0
        for filename in os.listdir(directory):
            if filename.endswith('.json'):
                filepath = os.path.join(directory, filename)
                with open(filepath, 'r', encoding='utf-8') as f:
                    data = json.load(f)
                # .get() also covers files where the 'shapes' key is missing
                if not data.get('shapes'):
                    os.remove(filepath)
                    removed_count += 1
        print(f"Removed {removed_count} file(s) without annotations")
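Since deletion cannot be undone, it can help to separate "find" from "delete" and preview the result first. The sketch below is one possible way to do that; `find_empty_annotations`, `remove_files`, and the `dry_run` parameter are hypothetical names, and the code assumes each JSON file holds a top-level object.

```python
import json
import os


def find_empty_annotations(directory):
    """Return paths of JSON files whose 'shapes' list is missing or empty."""
    empty = []
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".json"):
            continue
        filepath = os.path.join(directory, filename)
        with open(filepath, "r", encoding="utf-8") as f:
            data = json.load(f)
        if not data.get("shapes"):
            empty.append(filepath)
    return empty


def remove_files(paths, dry_run=True):
    """Delete the given files, or only print them when dry_run is True."""
    for path in paths:
        if dry_run:
            print(f"[dry run] would delete {path}")
        else:
            os.remove(path)
    return len(paths)
```

A typical use would be to call `remove_files(find_empty_annotations(directory))` once to inspect the list, then repeat with `dry_run=False` after checking it.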

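3.2 Checking and Correcting Bounding-Box Boundaries

Bounding boxes that extend beyond the image are a common problem and can cause errors during training. The following script detects and corrects them: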

    import cv2


    def check_bbox_boundaries(directory):
        """Clamp rectangle annotations to the image bounds and save corrected files."""
        for filename in os.listdir(directory):
            if filename.endswith('.json'):
                filepath = os.path.join(directory, filename)
                with open(filepath, 'r', encoding='utf-8') as f:
                    data = json.load(f)

                image_path = os.path.join(directory, data['imagePath'])
                img = cv2.imread(image_path)
                if img is None:  # cv2.imread returns None when the image cannot be read
                    print(f"Skipping {filename}: cannot read {image_path}")
                    continue
                h, w = img.shape[:2]

                modified = False
                for shape in data['shapes']:
                    if shape['shape_type'] == 'rectangle':
                        # Clamp each corner to [0, w-1] x [0, h-1]
                        clamped = [[max(0, min(x, w - 1)), max(0, min(y, h - 1))]
                                   for x, y in shape['points']]
                        if clamped != shape['points']:
                            shape['points'] = clamped
                            modified = True

                # Rewrite the file only when something actually changed
                if modified:
                    with open(filepath, 'w', encoding='utf-8') as f:
                        json.dump(data, f, indent=2, ensure_ascii=False)
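Labelme JSON files normally also record the image size in the `imageWidth` and `imageHeight` fields, so the clamping step can be sketched without loading the image at all. This variant is an assumption-laden alternative rather than a replacement: it trusts those two fields to be present and correct, and `clamp_rectangle` and `clamp_shapes` are hypothetical helper names. It also reorders the two corners into top-left, bottom-right order, which the script above does not do.

```python
def clamp_rectangle(points, width, height):
    """Clamp a two-point rectangle to the image; return it as top-left, bottom-right."""
    (x1, y1), (x2, y2) = points
    xs = sorted(min(max(x, 0), width - 1) for x in (x1, x2))
    ys = sorted(min(max(y, 0), height - 1) for y in (y1, y2))
    return [[xs[0], ys[0]], [xs[1], ys[1]]]


def clamp_shapes(data):
    """Clamp every rectangle in a parsed Labelme dict in place.

    Assumes data contains 'imageWidth' and 'imageHeight'.
    Returns the number of shapes whose points changed.
    """
    width, height = data["imageWidth"], data["imageHeight"]
    changed = 0
    for shape in data.get("shapes", []):
        if shape.get("shape_type") != "rectangle" or len(shape["points"]) != 2:
            continue
        fixed = clamp_rectangle(shape["points"], width, height)
        if fixed != shape["points"]:
            shape["points"] = fixed
            changed += 1
    return changed
```

Keeping the geometry in a pure function makes the correction easy to unit-test separately from file I/O.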

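4. Advanced Data-Processing Techniques

4.1 Annotation Statistics

Knowing how annotations are distributed across classes is important for balanced training. The following script counts the annotations in each class: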

    from collections import defaultdict


    def count_annotations(directory):
        """Print the number of shapes per label, most frequent first."""
        label_counts = defaultdict(int)
        for filename in os.listdir(directory):
            if filename.endswith('.json'):
                filepath = os.path.join(directory, filename)
                with open(filepath, 'r', encoding='utf-8') as f:
                    data = json.load(f)
                for shape in data['shapes']:
                    label_counts[shape['label']] += 1

        print("Annotation counts per label:")
        for label, count in sorted(label_counts.items(), key=lambda x: x[1], reverse=True):
            print(f"{label}: {count}")
        return label_counts
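Once per-label counts are available, they can also be used to flag inconsistent spellings and to put a rough number on class imbalance. The sketch below assumes a plain mapping of label to count as input; `find_label_variants` and `imbalance_ratio` are hypothetical helpers, and treating labels that differ only in case or whitespace as the same class is an assumption that may not hold for every dataset.

```python
from collections import defaultdict


def find_label_variants(label_counts):
    """Group labels that differ only by letter case or whitespace.

    Returns {normalized_label: {original_label: count}} for groups
    that contain two or more distinct spellings.
    """
    groups = defaultdict(dict)
    for label, count in label_counts.items():
        key = " ".join(label.lower().split())  # assumed normalization rule
        groups[key][label] = count
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


def imbalance_ratio(label_counts):
    """Return max count divided by min count, or None for an empty mapping."""
    if not label_counts:
        return None
    return max(label_counts.values()) / min(label_counts.values())
```

Each reported group is a candidate for a rename pass; the ratio is only a coarse indicator, since how much imbalance is acceptable depends on the task.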

4.2 Renaming Classes in Bulk

When certain class names need to be unified, they can be renamed in bulk:

    def rename_labels(directory, old_name, new_name):
        """Rename every shape labeled `old_name` to `new_name` across all JSON files."""
        modified_count = 0
        for filename in os.listdir(directory):
            if filename.endswith('.json'):
                filepath = os.path.join(directory, filename)
                with open(filepath, 'r', encoding='utf-8') as f:
                    data = json.load(f)

                modified = False
                for shape in data['shapes']:
                    if shape['label'] == old_name:
                        shape['label'] = new_name
                        modified = True

                # Write back only the files that actually changed
                if modified:
                    modified_count += 1
                    with open(filepath, 'w', encoding='utf-8') as f:
                        json.dump(data, f, indent=2, ensure_ascii=False)
        print(f"Renamed '{old_name}' to '{new_name}' in {modified_count} file(s)")

5. A Complete Processing Pipeline

Combining the functions above gives a complete data-processing pipeline:

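    def full_pipeline(directory):
        # 1. Delete files without annotations
        remove_empty_annotations(directory)

        # 2. Check and correct bounding-box boundaries
        check_bbox_boundaries(directory)

        # 3. Count the annotation distribution
        count_annotations(directory)

        # 4. Example: rename a class in bulk
        # rename_labels(directory, "person", "pedestrian")

        print("Data-processing pipeline finished")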

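6. Practical Tips and Caveats

A few lessons from using these scripts in practice:

Back up the original data: always back up the original JSON files before running any bulk-modification script
Verify step by step: test the script on a small sample first, and process the full dataset only after confirming the result
Version control: keep annotation data under version control so that changes can be traced
Logging: add logging to the scripts so that every modification is recorded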
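The backup and logging tips above can be sketched as two small helpers. This is only one possible arrangement: `backup_json_files`, `setup_logger`, the backup folder naming scheme, and the logger name `labelme_cleanup` are all hypothetical choices rather than anything prescribed by Labelme.

```python
import logging
import shutil
import time
from pathlib import Path


def backup_json_files(directory, backup_root=None):
    """Copy every *.json file in `directory` into a new timestamped backup folder.

    The folder is created next to `directory` unless `backup_root` is given.
    """
    directory = Path(directory)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    backup_root = Path(backup_root) if backup_root else directory.parent
    backup_dir = backup_root / f"{directory.name}_backup_{stamp}"
    backup_dir.mkdir(parents=True, exist_ok=False)  # never overwrite an older backup
    for path in directory.glob("*.json"):
        shutil.copy2(path, backup_dir / path.name)
    return backup_dir


def setup_logger(log_path):
    """Return a logger that appends one timestamped line per message to `log_path`."""
    logger = logging.getLogger("labelme_cleanup")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

A cleanup script could call `backup_json_files(directory)` once at startup and then log each deletion or rewrite with `logger.info(...)`, which leaves both a restorable copy and an audit trail.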

Note: for large datasets, consider using multiple processes to speed things up, for example with Python's multiprocessing module.

A simple multiprocessing example:

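    import os
    from multiprocessing import Pool


    def process_single_file(filepath):
        # Implement the processing logic for a single file here
        pass


    def parallel_process(directory, num_workers=4):
        # Pass full paths so each worker can open its file from any working directory
        files = [os.path.join(directory, f)
                 for f in os.listdir(directory) if f.endswith('.json')]
        with Pool(num_workers) as p:
            p.map(process_single_file, files)

    # On platforms that spawn worker processes (Windows, macOS), call
    # parallel_process() from inside an `if __name__ == '__main__':` block.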

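7. Extensions and Custom Development

Depending on project needs, the scripts can be extended further:

Automatic dataset splitting: randomly split the data into training, validation, and test sets by ratio
Format conversion: convert Labelme JSON to other formats such as COCO or YOLO
Visual inspection: render annotation previews to make manual review easier
Quality assessment: compute annotation-consistency metrics to evaluate annotation quality

A simple dataset-splitting script: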

    import random
    import shutil


    def split_dataset(directory, train_ratio=0.7, val_ratio=0.2):
        """Randomly move JSON files (and same-named .jpg images) into train/val/test."""
        # Create the output directories
        for name in ('train', 'val', 'test'):
            os.makedirs(os.path.join(directory, name), exist_ok=True)

        # Collect all JSON files
        files = [f for f in os.listdir(directory) if f.endswith('.json')]
        random.shuffle(files)

        # Compute the split points
        train_end = int(len(files) * train_ratio)
        val_end = train_end + int(len(files) * val_ratio)

        for i, filename in enumerate(files):
            if i < train_end:
                dest_dir = 'train'
            elif i < val_end:
                dest_dir = 'val'
            else:
                dest_dir = 'test'

            # Move the JSON file
            shutil.move(os.path.join(directory, filename),
                        os.path.join(directory, dest_dir, filename))

            # Move the matching image if it exists (assumes a .jpg with the same base name)
            img_name = os.path.splitext(filename)[0] + '.jpg'
            src_img = os.path.join(directory, img_name)
            if os.path.exists(src_img):
                shutil.move(src_img, os.path.join(directory, dest_dir, img_name))

        # Format as percentages to avoid floating-point noise such as 10.000000000000009
        test_ratio = 1 - train_ratio - val_ratio
        print(f"Dataset split complete: train {train_ratio:.0%}, "
              f"val {val_ratio:.0%}, test {test_ratio:.0%}")
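Format conversion, another of the extensions listed above, can be sketched for the simplest case: turning Labelme rectangles into YOLO-style label lines (class index followed by normalized center x, center y, width, and height). This is a minimal sketch under stated assumptions: `rectangle_to_yolo` and `labelme_to_yolo_lines` are hypothetical helpers, the `class_ids` mapping is something you supply, and the image size is taken from the `imageWidth` and `imageHeight` fields of the JSON. Polygons and other shape types are skipped.

```python
def rectangle_to_yolo(points, img_width, img_height):
    """Convert a two-point rectangle to (x_center, y_center, width, height) in 0..1."""
    (x1, y1), (x2, y2) = points
    left, right = min(x1, x2), max(x1, x2)
    top, bottom = min(y1, y2), max(y1, y2)
    return (
        (left + right) / 2 / img_width,
        (top + bottom) / 2 / img_height,
        (right - left) / img_width,
        (bottom - top) / img_height,
    )


def labelme_to_yolo_lines(data, class_ids):
    """Build YOLO label lines for the rectangles in one parsed Labelme dict.

    class_ids maps label name -> integer class index (supplied by the caller).
    Non-rectangle shapes and shapes with unknown labels are skipped.
    """
    lines = []
    for shape in data.get("shapes", []):
        if shape.get("shape_type") != "rectangle" or shape.get("label") not in class_ids:
            continue
        xc, yc, w, h = rectangle_to_yolo(
            shape["points"], data["imageWidth"], data["imageHeight"]
        )
        lines.append(f"{class_ids[shape['label']]} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    return lines
```

Writing the returned lines to a `.txt` file with the same base name as the image gives the usual one-label-file-per-image layout.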

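In real projects, depending on how the team collaborates, more sophisticated features can be built on top of these scripts, such as annotation progress tracking and annotation quality assessment. Automation of this kind not only saves a great deal of time but also reduces human error and keeps data quality consistent.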
