A Hands-On Guide to Building a Crop Disease Recognition Model with the PlantVillage Dataset (Python Walkthrough)

news 2026/3/27 1:20:30

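Agricultural technology is advancing quickly, and artificial intelligence is changing how traditional farming works. Imagine a farmer photographing a crop leaf with a phone and immediately receiving a disease diagnosis along with treatment advice: that is the use case for a crop disease recognition model. For machine learning beginners and agricultural technology developers, the PlantVillage dataset is an excellent starting point. It contains more than 50,000 high-quality images of healthy and diseased crop leaves, covering 14 crops and 26 diseases.

This article walks you through building a practical crop disease recognition model in Python from scratch. Beyond the basic steps, it looks at the real challenges of data handling, shares hands-on tuning techniques, and provides a complete code implementation. Whether you want to learn about AI applications in agriculture or work through a complete machine learning project end to end, this guide should be a useful reference.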

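1. Environment Setup and Data Acquisition

Before starting the project, set up a suitable development environment. Python 3.8 or later is recommended, along with a dedicated virtual environment to avoid dependency conflicts: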

    python -m venv plant_disease
    source plant_disease/bin/activate   # Linux/Mac
    # or
    plant_disease\Scripts\activate      # Windows

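Install the required libraries (seaborn and flask are needed for the evaluation and deployment steps later on):

    pip install tensorflow opencv-python matplotlib numpy pandas scikit-learn seaborn flask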

The PlantVillage dataset is available from several sources. The most direct route is to download it from the GitHub repository:

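    import os
    import urllib.request
    import zipfile

    dataset_url = "https://github.com/spMohanty/PlantVillage-Dataset/archive/master.zip"
    download_path = "plantvillage.zip"

    # Download the dataset archive
    urllib.request.urlretrieve(dataset_url, download_path)

    # Extract it into the current directory, then delete the archive.
    # GitHub names the extracted folder after the repository and branch,
    # so check the folder name and adjust the paths used later if needed.
    with zipfile.ZipFile(download_path, 'r') as zip_ref:
        zip_ref.extractall(".")
    os.remove(download_path)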

After extraction, you will see a folder structure organized by crop and disease. For example:

    PlantVillage-Dataset/
    ├── color/
    │   ├── Apple___Apple_scab/
    │   ├── Apple___Black_rot/
    │   └── ... (other classes)
    └── grayscale/

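Important: the dataset comes in both color and grayscale versions. We generally use the color images, which usually give better recognition results. The full dataset is about 3.5 GB, so make sure you have enough disk space.

Note: if the download is slow, consider a mirror hosted on Kaggle, or download only the crop categories you are interested in.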
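If you only want some crops, the folder naming shown above (crop name, three underscores, condition) makes it easy to filter class folders. The following is a minimal sketch; the helper names `split_class_name` and `group_by_crop` are hypothetical, and the folder list mixes the two names from the tree above with two illustrative entries.

```python
from collections import defaultdict


def split_class_name(folder_name):
    """Split a 'Crop___Condition' folder name into (crop, condition)."""
    crop, _, condition = folder_name.partition("___")
    return crop, condition


def group_by_crop(folder_names):
    """Group condition names under their crop."""
    groups = defaultdict(list)
    for name in folder_names:
        crop, condition = split_class_name(name)
        groups[crop].append(condition)
    return dict(groups)


# The first two names appear in the tree above; the last two are illustrative.
example_folders = [
    "Apple___Apple_scab",
    "Apple___Black_rot",
    "Apple___healthy",
    "Tomato___Early_blight",
]

groups = group_by_crop(example_folders)
# Keep only the class folders that belong to one crop
apple_only = [n for n in example_folders if split_class_name(n)[0] == "Apple"]
print(groups)
print(apple_only)
```

In practice you would pass `os.listdir(...)` of the color folder instead of the hard-coded list.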

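2. Data Exploration and Preprocessing

With the data in hand, the first step is to understand its characteristics and distribution. PlantVillage contains 54,305 images covering 14 crops (apple, cherry, corn, and others) and 26 diseases, plus images of healthy plants.

Let's run an initial analysis in Python: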

    import os
    import matplotlib.pyplot as plt

    # Count the images in each class folder
    data_path = "PlantVillage-Dataset/color"
    category_counts = {}
    for category in sorted(os.listdir(data_path)):
        category_path = os.path.join(data_path, category)
        if os.path.isdir(category_path):  # skip stray files
            category_counts[category] = len(os.listdir(category_path))

    # Plot the class distribution
    plt.figure(figsize=(15, 6))
    plt.bar(list(category_counts.keys()), list(category_counts.values()))
    plt.xticks(rotation=90)
    plt.title("PlantVillage Dataset Category Distribution")
    plt.ylabel("Number of Images")
    plt.tight_layout()
    plt.show()

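This analysis reveals a key problem in the dataset: class imbalance. Some classes can have several times as many samples as others, which affects how well the model trains. We address this in a later step.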
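To put a number on the imbalance, one simple option is the ratio between the largest and smallest class. The sketch below assumes a dictionary of per-class counts like the one built in the analysis step; the function name `imbalance_summary` and the counts are made up for illustration and are not real PlantVillage figures.

```python
def imbalance_summary(counts):
    """Return the largest class, the smallest class, and their size ratio."""
    largest = max(counts, key=counts.get)
    smallest = min(counts, key=counts.get)
    return {
        "largest": largest,
        "smallest": smallest,
        "ratio": counts[largest] / counts[smallest],
    }


# Made-up counts, for illustration only
example_counts = {"class_a": 5000, "class_b": 1000, "class_c": 250}
summary = imbalance_summary(example_counts)
print(summary)
```

A ratio well above 1 suggests that plain accuracy will be dominated by the large classes, which is why per-class metrics and class weights come up later.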

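Image preprocessing is critical to model performance. A standard pipeline looks like this:

Resizing: bring all images to a common size (typically 224x224 or 299x299)
Data augmentation: increase data diversity through rotations, flips, and similar operations
Normalization: scale pixel values to the 0-1 range
Label encoding: convert text labels to numeric values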
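The normalization and label-encoding steps can be sketched in plain Python to show what happens to a single pixel and a single label. This is only an illustration with hypothetical helper names; it mirrors the fact that Keras's `flow_from_directory` assigns class indices to folder names in sorted order.

```python
def build_class_indices(class_names):
    """Map each class name to an integer index, in sorted order."""
    return {name: i for i, name in enumerate(sorted(class_names))}


def one_hot(index, num_classes):
    """Encode an integer label as a one-hot vector."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector


def normalize_pixel(value):
    """Scale an 8-bit pixel value (0-255) to the 0-1 range."""
    return value / 255.0


class_indices = build_class_indices(["Apple___Black_rot", "Apple___Apple_scab"])
label = one_hot(class_indices["Apple___Black_rot"], len(class_indices))
print(class_indices)
print(label)
print(normalize_pixel(255))
```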

TensorFlow's ImageDataGenerator implements these steps efficiently:

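    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # Augmentation settings for the training images
    train_datagen = ImageDataGenerator(
        rescale=1./255,
        rotation_range=40,
        width_shift_range=0.2,
        height_shift_range=0.2,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True,
        fill_mode='nearest',
        validation_split=0.2  # hold out 20% of the data for validation
    )

    # Validation images are only rescaled, never augmented. Using the same
    # validation_split keeps the train/validation file split identical.
    val_datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)

    # Create the training and validation data streams
    train_generator = train_datagen.flow_from_directory(
        data_path,
        target_size=(224, 224),
        batch_size=32,
        class_mode='categorical',
        subset='training'
    )
    val_generator = val_datagen.flow_from_directory(
        data_path,
        target_size=(224, 224),
        batch_size=32,
        class_mode='categorical',
        subset='validation',
        shuffle=False  # keep a fixed order so predictions line up with labels
    )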

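3. Building the Disease Recognition Model

Image classification tasks are usually handled with convolutional neural networks (CNNs). Given the size and complexity of the PlantVillage dataset, transfer learning from a pretrained model is a practical choice. Here we use EfficientNetB0 as an example: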

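    from tensorflow.keras.applications import EfficientNetB0
    from tensorflow.keras import layers, models

    # Load the pretrained model without its classification head
    base_model = EfficientNetB0(
        input_shape=(224, 224, 3),
        include_top=False,
        weights='imagenet'
    )

    # Freeze the base model weights
    base_model.trainable = False

    # Add a custom classification head
    model = models.Sequential([
        layers.Input(shape=(224, 224, 3)),
        # Keras EfficientNet models normalize inputs internally and expect
        # pixel values in [0, 255], so undo the generators' 1/255 rescaling
        layers.Rescaling(255.0),
        base_model,
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(len(train_generator.class_indices), activation='softmax')
    ])

    # Compile the model
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    model.summary()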

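The key points of this architecture are:

It uses the pretrained EfficientNetB0 as a feature extractor
A global average pooling layer reduces the number of parameters
A Dropout layer helps prevent overfitting
The output layer has one node per class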
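To see why global average pooling reduces the parameter count, compare the size of the first Dense layer with and without it. The arithmetic below assumes the base model's final feature map is 7x7x1280, which is what the standard EfficientNetB0 produces for 224x224 inputs; you can confirm the shape in the output of `model.summary()`.

```python
def dense_params(num_inputs, num_units):
    """Weights plus biases of a fully connected layer."""
    return num_inputs * num_units + num_units


# Assumed feature map shape for EfficientNetB0 at 224x224 input
height, width, channels = 7, 7, 1280
units = 256  # size of the Dense layer in the head

# Flattening keeps every spatial position; pooling averages them away
params_with_flatten = dense_params(height * width * channels, units)
params_with_gap = dense_params(channels, units)

print(params_with_flatten)  # 16056576
print(params_with_gap)      # 327936
```

Under these assumptions, pooling shrinks the Dense layer by roughly a factor of 49, the number of spatial positions in the feature map.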

To handle the class imbalance noted earlier, we can use class weights:

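    from sklearn.utils.class_weight import compute_class_weight
    import numpy as np

    # Compute a weight per class that is inversely proportional to its frequency
    class_weights = compute_class_weight(
        class_weight='balanced',
        classes=np.unique(train_generator.classes),
        y=train_generator.classes
    )
    # Keras expects a {class_index: weight} dictionary
    class_weight_dict = dict(enumerate(class_weights))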
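For intuition, scikit-learn's 'balanced' mode computes each weight as n_samples / (n_classes * class_count). The pure-Python sketch below reproduces that formula on a toy label list; the function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Toy labels: class 0 has three samples, class 1 has one
weights = balanced_class_weights([0, 0, 0, 1])
print(weights)
```

The rare class receives the larger weight, so each of its samples contributes more to the loss during training.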

4. Model Training and Evaluation

Now we can train the model. We use callbacks for early stopping and for saving the best model:

    from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

    callbacks = [
        # Stop when val_loss has not improved for 5 epochs, then restore the best weights
        EarlyStopping(patience=5, restore_best_weights=True),
        # Save the model only when val_loss improves
        ModelCheckpoint('best_model.h5', save_best_only=True)
    ]

    history = model.fit(
        train_generator,
        steps_per_epoch=train_generator.samples // train_generator.batch_size,
        validation_data=val_generator,
        validation_steps=val_generator.samples // val_generator.batch_size,
        epochs=30,
        callbacks=callbacks,
        class_weight=class_weight_dict
    )
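The patience mechanism can be illustrated without any framework. The sketch below is a simplified model of the idea, not Keras's actual implementation: it tracks the best validation loss so far and stops once `patience` consecutive epochs fail to improve on it. The loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember it and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Training ran to completion without triggering early stopping
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: the best value is at epoch 2
losses = [0.9, 0.6, 0.5, 0.52, 0.51, 0.55, 0.53, 0.54, 0.6]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=5)
print(stop_epoch, best_epoch)  # 7 2
```

With `restore_best_weights=True`, the model would end up with the weights from `best_epoch` rather than from the epoch where training stopped.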

Once training finishes, we can evaluate the model and plot the learning curves:

    # Evaluate the model on the validation set
    val_loss, val_acc = model.evaluate(val_generator)
    print(f"Validation Accuracy: {val_acc*100:.2f}%")

    # Plot the training history
    plt.figure(figsize=(12, 4))

    plt.subplot(1, 2, 1)
    plt.plot(history.history['accuracy'], label='Train Accuracy')
    plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
    plt.title('Accuracy over epochs')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.plot(history.history['loss'], label='Train Loss')
    plt.plot(history.history['val_loss'], label='Validation Loss')
    plt.title('Loss over epochs')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend()

    plt.tight_layout()
    plt.show()

For a more complete evaluation, we can generate a confusion matrix and a classification report:

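    from sklearn.metrics import classification_report, confusion_matrix
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns

    # Collect labels and predictions batch by batch so they stay aligned
    # even if the generator shuffles its data (the flow_from_directory default)
    val_generator.reset()
    y_true, y_pred = [], []
    for _ in range(len(val_generator)):
        x_batch, y_batch = next(val_generator)
        y_true.extend(np.argmax(y_batch, axis=1))
        y_pred.extend(np.argmax(model.predict(x_batch, verbose=0), axis=1))

    class_names = list(val_generator.class_indices.keys())
    label_ids = list(range(len(class_names)))

    # Classification report
    print(classification_report(y_true, y_pred, labels=label_ids,
                                target_names=class_names))

    # Confusion matrix (rows: true labels, columns: predicted labels)
    cm = confusion_matrix(y_true, y_pred, labels=label_ids)
    plt.figure(figsize=(12, 10))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=class_names, yticklabels=class_names)
    plt.title('Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.xticks(rotation=90)
    plt.yticks(rotation=0)
    plt.show()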
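With dozens of classes, the heatmap can be hard to read, so it may help to pull the numbers out directly. The sketch below assumes a confusion matrix in scikit-learn's layout (rows are true labels, columns are predicted labels) and uses a small made-up matrix; the helper names are hypothetical.

```python
def per_class_recall(cm):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


def most_confused_pairs(cm, class_names, top_n=3):
    """Largest off-diagonal cells as (count, true_class, predicted_class)."""
    pairs = [
        (cm[i][j], class_names[i], class_names[j])
        for i in range(len(cm))
        for j in range(len(cm))
        if i != j and cm[i][j] > 0
    ]
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs[:top_n]


# Made-up 3-class confusion matrix, for illustration only
cm = [
    [50, 5, 0],
    [8, 40, 2],
    [0, 1, 30],
]
names = ["class_a", "class_b", "class_c"]

recalls = per_class_recall(cm)
pairs = most_confused_pairs(cm, names, top_n=2)
print(recalls)
print(pairs)
```

For the real model, `cm.tolist()` and the class name list from the validation generator could be passed in instead.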

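5. Model Optimization and Deployment

With a baseline model in place, fine-tuning can improve performance further. Fine-tuning means unfreezing some of the base model's layers and training for additional epochs: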

    import tensorflow as tf

    # Unfreeze the last 20 layers of the base model; keep the rest frozen
    base_model.trainable = True
    for layer in base_model.layers[:-20]:
        layer.trainable = False

    # Recompile with a much smaller learning rate
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-5),
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )

    # Continue training
    history_fine = model.fit(
        train_generator,
        steps_per_epoch=train_generator.samples // train_generator.batch_size,
        validation_data=val_generator,
        validation_steps=val_generator.samples // val_generator.batch_size,
        epochs=10,
        callbacks=callbacks,
        class_weight=class_weight_dict
    )
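The slice `layers[:-20]` selects every layer except the last 20, so the freeze loop leaves only the final 20 layers trainable. The toy example below shows the same pattern with stand-in objects; `FakeLayer` is a hypothetical placeholder, not a Keras class, and the count of 100 layers is arbitrary.

```python
class FakeLayer:
    """Minimal stand-in for a layer with a name and a trainable flag."""

    def __init__(self, name):
        self.name = name
        self.trainable = False


fake_layers = [FakeLayer(f"layer_{i}") for i in range(100)]

# Step 1: mark everything trainable
for layer in fake_layers:
    layer.trainable = True

# Step 2: re-freeze all but the last 20
for layer in fake_layers[:-20]:
    layer.trainable = False

trainable_names = [layer.name for layer in fake_layers if layer.trainable]
print(len(trainable_names), trainable_names[0], trainable_names[-1])
```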

After optimization, the model can be deployed as a usable application. Here is an example of a simple web app built with Flask:

    from flask import Flask, request, render_template, jsonify
    from tensorflow.keras.models import load_model
    from tensorflow.keras.preprocessing import image
    import numpy as np
    import os

    app = Flask(__name__)
    model = load_model('best_model.h5')

    # train_generator does not exist in this standalone process, so rebuild the
    # index-to-class mapping from the dataset folders. flow_from_directory assigns
    # indices to class folders in sorted order, which sorted() reproduces here.
    data_path = "PlantVillage-Dataset/color"
    class_names = sorted(
        d for d in os.listdir(data_path)
        if os.path.isdir(os.path.join(data_path, d))
    )
    class_indices = dict(enumerate(class_names))

    @app.route('/')
    def home():
        return render_template('index.html')

    @app.route('/predict', methods=['POST'])
    def predict():
        if 'file' not in request.files:
            return jsonify({'error': 'No file uploaded'}), 400
        file = request.files['file']
        if file.filename == '':
            return jsonify({'error': 'No file selected'}), 400

        # Save the upload and preprocess it the same way as the training data
        img_path = 'temp.jpg'
        file.save(img_path)
        img = image.load_img(img_path, target_size=(224, 224))
        img_array = image.img_to_array(img)
        img_array = np.expand_dims(img_array, axis=0) / 255.0

        # Predict
        pred = model.predict(img_array)
        pred_class = class_indices[int(np.argmax(pred))]
        confidence = float(np.max(pred))

        return jsonify({
            'class': pred_class,
            'confidence': confidence
        })

    if __name__ == '__main__':
        app.run(debug=True)

The corresponding HTML template (templates/index.html) can contain a simple file upload form and an area for displaying the result.
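The /predict endpoint returns only the single most likely class. For a results area, it may be more informative to show the few most likely classes. The helper below is one possible sketch, not part of the original app: `top_k_predictions` is a hypothetical name, and the probabilities and class names are made up.

```python
def top_k_predictions(probabilities, class_names, k=3):
    """Return the k most likely classes with their confidence scores."""
    ranked = sorted(
        zip(class_names, probabilities),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [{"class": name, "confidence": float(p)} for name, p in ranked[:k]]


# Made-up softmax output for three classes
example_names = ["class_a", "class_b", "class_c"]
example_probs = [0.05, 0.7, 0.25]

top2 = top_k_predictions(example_probs, example_names, k=2)
print(top2)
```

Inside the Flask view, the first row of the model output (for example `pred[0].tolist()`) could be passed as `probabilities`, and the resulting list returned through `jsonify`.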