Hands-On Reproduction: Implementing the DeLong Test from Scratch with NumPy and SciPy (Full Code and Visualizations Included)

news 2026/6/17 16:45:20

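Implementing the DeLong Test from Scratch: The Statistics Behind AUC Comparison and a Python Implementation

When evaluating machine learning models, we often need to know whether the performance difference between two models is statistically significant. When the metric is AUC (area under the ROC curve), the DeLong test offers a nonparametric answer. This article starts from the mathematics, builds a complete DeLong test implementation step by step, and uses visualizations to make the abstract statistical concepts concrete.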

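1. Statistical Foundations of the DeLong Test

The DeLong test builds on the Mann-Whitney U statistic, a nonparametric statistic for comparing the distributions of two independent samples. When comparing AUCs, we are really assessing how differently two models rank positive samples relative to negative samples.

Key mathematical concepts:

U statistic: estimates the probability that an observation from one sample is larger than an observation from the other sample
Structural components: the contribution of each individual sample point to the overall AUC estimate
Covariance matrix: captures the correlation between the two models' AUC estimates

Note: the DeLong test assumes that both models' predictions are made on the same set of samples, so the paired design can be exploited to increase the power of the test.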
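
As a compact reference for the concepts above, the quantities can be written out as follows. The notation is an assumption chosen for this sketch and does not appear in the article's code: $X_1,\dots,X_m$ are a model's scores on the $m$ positive samples, $Y_1,\dots,Y_n$ its scores on the $n$ negative samples, and the superscripts $A$ and $B$ index the two models.

```latex
\psi(X, Y) =
\begin{cases}
1 & \text{if } Y < X \\
\tfrac{1}{2} & \text{if } Y = X \\
0 & \text{if } Y > X
\end{cases}
\qquad
\hat\theta = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\psi(X_i, Y_j)

V_{10,i} = \frac{1}{n}\sum_{j=1}^{n}\psi(X_i, Y_j),
\qquad
V_{01,j} = \frac{1}{m}\sum_{i=1}^{m}\psi(X_i, Y_j)

S_{10}^{AB} = \frac{1}{m-1}\sum_{i=1}^{m}
  \bigl(V_{10,i}^{A}-\hat\theta^{A}\bigr)\bigl(V_{10,i}^{B}-\hat\theta^{B}\bigr),
\qquad
S_{01}^{AB} = \frac{1}{n-1}\sum_{j=1}^{n}
  \bigl(V_{01,j}^{A}-\hat\theta^{A}\bigr)\bigl(V_{01,j}^{B}-\hat\theta^{B}\bigr)

S = \frac{S_{10}}{m} + \frac{S_{01}}{n},
\qquad
z = \frac{\hat\theta^{A}-\hat\theta^{B}}{\sqrt{S^{AA}+S^{BB}-2\,S^{AB}}}
```

Under the null hypothesis that the two AUCs are equal, $z$ is approximately standard normal, which is what turns the AUC difference into a p-value.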

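2. Building the Core Components of the DeLong Test

2.1 Implementing the Kernel Function

The kernel function is the basic building block of the DeLong test. It implements the pairwise comparison that underlies the Mann-Whitney statistic: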

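def _kernel(self, x: float, y: float) -> float:
    """
    Mann-Whitney kernel function.

    Args:
        x: the model's predicted probability for a positive sample
        y: the model's predicted probability for a negative sample

    Returns:
        0.5 if x == y, 1.0 if y < x, 0.0 if y > x
    """
    if y == x:
        return 0.5
    return float(y < x)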

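This small function captures the core logic of AUC. For a single positive-negative pair it reports whether the model ranks the positive sample higher; averaged over all such pairs, it estimates the probability that the model ranks a positive sample above a negative one.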
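
To make the averaging step concrete, here is a minimal vectorized sketch of the same kernel. The function name `kernel_matrix` and the toy scores are assumptions made for illustration and are not part of the article's class; NumPy broadcasting is used to compare every positive score with every negative score at once.

```python
import numpy as np

def kernel_matrix(pos_scores, neg_scores):
    """Pairwise Mann-Whitney kernel (hypothetical helper).

    Entry (i, j) is 1.0 if positive i scores above negative j,
    0.5 if the two scores tie, and 0.0 otherwise.
    """
    x = np.asarray(pos_scores, dtype=float)[:, None]  # shape (m, 1)
    y = np.asarray(neg_scores, dtype=float)[None, :]  # shape (1, n)
    return (x > y).astype(float) + 0.5 * (x == y)

# Toy scores, made up for this example only
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.4, 0.1]

psi = kernel_matrix(pos, neg)
auc = psi.mean()  # average of the kernel over all m * n pairs

print(psi)
print(f"AUC estimate: {auc:.4f}")
```

The mean of the matrix is the Mann-Whitney estimate of the AUC. The same matrix is reused when computing structural components, which is why a vectorized form can be convenient for larger datasets.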

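2.2 Computing the Structural Components

The structural components describe the marginal contribution of each sample point to the AUC estimate: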

def _structural_components(self, X: list, Y: list) -> tuple:
    """
    Compute the structural components V10 and V01.

    Args:
        X: list of predictions for the positive samples
        Y: list of predictions for the negative samples

    Returns:
        (V10, V01) tuple
    """
    V10 = [1/len(Y) * sum(self._kernel(x, y) for y in Y) for x in X]
    V01 = [1/len(X) * sum(self._kernel(x, y) for x in X) for y in Y]
    return V10, V01

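Statistical meaning of the structural components:

V10 is each positive sample's contribution to the AUC: the fraction of negative samples it outranks, with ties counted as one half
V01 is each negative sample's contribution to the AUC: the fraction of positive samples that outrank it, with ties counted as one half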
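
A small numerical check can make this interpretation tangible. The sketch below is an illustration under assumptions (the standalone function `structural_components` and the toy scores are hypothetical): it computes V10 and V01 as row and column means of the kernel matrix and confirms that both average to the same AUC estimate.

```python
import numpy as np

def structural_components(pos_scores, neg_scores):
    """Return (V10, V01) as NumPy arrays (hypothetical standalone helper)."""
    x = np.asarray(pos_scores, dtype=float)[:, None]  # shape (m, 1)
    y = np.asarray(neg_scores, dtype=float)[None, :]  # shape (1, n)
    psi = (x > y).astype(float) + 0.5 * (x == y)       # kernel matrix, shape (m, n)
    v10 = psi.mean(axis=1)  # one value per positive sample
    v01 = psi.mean(axis=0)  # one value per negative sample
    return v10, v01

# Toy scores, made up for this example only
v10, v01 = structural_components([0.9, 0.8, 0.4], [0.7, 0.4, 0.1])

print("V10:", v10)
print("V01:", v01)
print("mean(V10):", v10.mean(), "mean(V01):", v01.mean())
```

Both means equal the AUC estimate, so the spread of V10 and V01 around that value is what the variance estimate in the next section measures.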

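3. Covariance Matrix and the Z Test

3.1 Estimating the Covariance Matrix

The key to the DeLong test is a correct estimate of the covariance between the two AUC estimates: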

def _get_S_entry(self, V_A: list, V_B: list, auc_A: float, auc_B: float) -> float:
    """
    Compute a single entry of the covariance matrix.

    Args:
        V_A: structural components of model A
        V_B: structural components of model B
        auc_A: AUC of model A
        auc_B: AUC of model B

    Returns:
        the covariance matrix entry
    """
    return 1/(len(V_A)-1) * sum((a-auc_A)*(b-auc_B) for a, b in zip(V_A, V_B))
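
The article does not show how the individual entries are assembled into the variances and covariance used later, so the following is a sketch under the standard DeLong formulation, S = S10/m + S01/n, where m and n are the numbers of positive and negative samples. The function names and toy scores are hypothetical. `np.cov` normalizes by N - 1 by default, which matches the `1/(len(V_A)-1)` factor above because the mean of each structural component vector equals the AUC.

```python
import numpy as np

def structural_components(pos_scores, neg_scores):
    """Return (V10, V01) as NumPy arrays (hypothetical standalone helper)."""
    x = np.asarray(pos_scores, dtype=float)[:, None]
    y = np.asarray(neg_scores, dtype=float)[None, :]
    psi = (x > y).astype(float) + 0.5 * (x == y)
    return psi.mean(axis=1), psi.mean(axis=0)

def delong_covariance(pos_a, neg_a, pos_b, neg_b):
    """2x2 covariance matrix of the two AUC estimates (sketch, needs m, n >= 2)."""
    v10_a, v01_a = structural_components(pos_a, neg_a)
    v10_b, v01_b = structural_components(pos_b, neg_b)
    m, n = len(v10_a), len(v01_a)
    s10 = np.cov(np.vstack([v10_a, v10_b]))  # covariance over positive samples
    s01 = np.cov(np.vstack([v01_a, v01_b]))  # covariance over negative samples
    return s10 / m + s01 / n

# Toy scores for two models on the same 3 positives and 3 negatives (made up)
S = delong_covariance([0.9, 0.8, 0.4], [0.7, 0.4, 0.1],
                      [0.6, 0.7, 0.3], [0.5, 0.2, 0.4])
print(S)
```

In this layout `S[0, 0]` and `S[1, 1]` are the variances of the two AUC estimates and `S[0, 1]` is their covariance, which are the three quantities the Z score needs.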

3.2 Computing the Z Score

From the covariance matrix we can compute the standardized difference between the two AUCs:

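def _z_score(self, var_A: float, var_B: float, covar_AB: float, auc_A: float, auc_B: float) -> float:
    """
    Compute the standardized Z score.

    Args:
        var_A: variance of model A's AUC estimate
        var_B: variance of model B's AUC estimate
        covar_AB: covariance between the two AUC estimates
        auc_A: AUC of model A
        auc_B: AUC of model B

    Returns:
        the Z score
    """
    denominator = (var_A + var_B - 2*covar_AB)**0.5
    return (auc_A - auc_B) / (denominator + 1e-8)  # small constant guards against division by zero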
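
The article does not show how the Z score is converted into a p-value. One common choice, assumed here, is a two-sided test against the standard normal distribution; the helper name `two_sided_p_value` is hypothetical.

```python
from scipy import stats

def two_sided_p_value(z: float) -> float:
    """Two-sided p-value for a Z score under a standard normal null (sketch)."""
    # sf(x) = 1 - cdf(x); doubling the upper tail covers both directions
    return 2.0 * stats.norm.sf(abs(z))

alpha = 0.05
for z in (0.0, 1.0, 1.96, 3.0):
    p = two_sided_p_value(z)
    verdict = "significant" if p < alpha else "not significant"
    print(f"z = {z:4.2f}  p = {p:.4f}  {verdict} at alpha = {alpha}")
```

Using the survival function `sf` rather than `1 - cdf` tends to keep more precision for large values of |z|, where the p-value is very small.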

4. Complete Implementation and Visual Analysis

4.1 Class Design

We combine the components above into a single Python class:

import numpy as np
from scipy import stats
from typing import List, Tuple


class DelongTest:
    def __init__(self, preds1: np.ndarray, preds2: np.ndarray, label: np.ndarray, alpha: float = 0.05):
        """
        Initialize the DeLong test.

        Args:
            preds1: array of predicted probabilities from model 1
            preds2: array of predicted probabilities from model 2
            label: array of true labels (0 or 1)
            alpha: significance level (default 0.05)
        """
        self.preds1 = preds1
        self.preds2 = preds2
        self.label = label
        self.alpha = alpha
        self.z, self.p = self._compute_z_p()

    # All the methods defined earlier go here...

    def plot_roc_comparison(self):
        """Plot the ROC curves of the two models."""
        from sklearn.metrics import roc_curve, auc
        import matplotlib.pyplot as plt

        fpr1, tpr1, _ = roc_curve(self.label, self.preds1)
        fpr2, tpr2, _ = roc_curve(self.label, self.preds2)
        roc_auc1 = auc(fpr1, tpr1)
        roc_auc2 = auc(fpr2, tpr2)

        plt.figure(figsize=(8, 6))
        plt.plot(fpr1, tpr1, color='blue', label=f'Model 1 (AUC = {roc_auc1:.2f})')
        plt.plot(fpr2, tpr2, color='red', label=f'Model 2 (AUC = {roc_auc2:.2f})')
        plt.plot([0, 1], [0, 1], 'k--')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('ROC Curve Comparison')
        plt.legend(loc="lower right")
        plt.show()
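
The class calls `_compute_z_p` here and `_group_preds_by_label` in the plotting code, but neither method is listed in the article. The block below is a guess at what they might do, written as standalone functions under assumed names (`group_preds_by_label`, `compute_z_p`) rather than as the author's actual implementation. It assumes labels are coded 0/1, a two-sided test, and the standard DeLong covariance S = S10/m + S01/n.

```python
import numpy as np
from scipy import stats

def group_preds_by_label(preds, label):
    """Split predictions into (positive scores, negative scores). Hypothetical helper."""
    preds = np.asarray(preds, dtype=float)
    label = np.asarray(label)
    return preds[label == 1], preds[label == 0]

def _components(pos, neg):
    """Structural components (V10, V01) from the pairwise kernel matrix."""
    psi = (pos[:, None] > neg[None, :]).astype(float) + 0.5 * (pos[:, None] == neg[None, :])
    return psi.mean(axis=1), psi.mean(axis=0)

def compute_z_p(preds1, preds2, label):
    """Z score and two-sided p-value for AUC(model 1) - AUC(model 2). Sketch only."""
    pos_a, neg_a = group_preds_by_label(preds1, label)
    pos_b, neg_b = group_preds_by_label(preds2, label)
    v10_a, v01_a = _components(pos_a, neg_a)
    v10_b, v01_b = _components(pos_b, neg_b)
    auc_a, auc_b = v10_a.mean(), v10_b.mean()
    m, n = len(pos_a), len(neg_a)  # needs at least 2 positives and 2 negatives
    s = np.cov(np.vstack([v10_a, v10_b])) / m + np.cov(np.vstack([v01_a, v01_b])) / n
    var_diff = max(s[0, 0] + s[1, 1] - 2.0 * s[0, 1], 0.0)  # clamp rounding noise
    z = (auc_a - auc_b) / (np.sqrt(var_diff) + 1e-8)  # same epsilon as _z_score
    p = 2.0 * stats.norm.sf(abs(z))
    return float(z), float(p)

# Synthetic demo data: model 1 is random, model 2 separates the classes
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=100)
random_model = rng.uniform(0, 1, size=100)
good_model = np.clip(np.where(labels == 1,
                              rng.normal(0.7, 0.1, size=100),
                              rng.normal(0.3, 0.1, size=100)), 0, 1)

z_demo, p_demo = compute_z_p(random_model, good_model, labels)
print(f"Z-score: {z_demo:.4f}, P-value: {p_demo:.4g}")
```

A negative Z score here means the first model's AUC is lower than the second's. Inside the class, the same logic could be wrapped as methods that read `self.preds1`, `self.preds2`, and `self.label`.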

4.2 Visualizing Intermediate Results

To understand how the DeLong test works in more depth, we can plot the key intermediate variables:

def plot_structural_components(self):
    """Plot the structural components of both models."""
    import matplotlib.pyplot as plt  # imported locally, as in plot_roc_comparison

    X_A, Y_A = self._group_preds_by_label(self.preds1, self.label)
    X_B, Y_B = self._group_preds_by_label(self.preds2, self.label)
    V_A10, V_A01 = self._structural_components(X_A, Y_A)
    V_B10, V_B01 = self._structural_components(X_B, Y_B)

    plt.figure(figsize=(12, 5))

    # Left panel: components of the positive samples
    plt.subplot(1, 2, 1)
    plt.scatter(range(len(V_A10)), V_A10, label='Model 1 (Positive)')
    plt.scatter(range(len(V_B10)), V_B10, label='Model 2 (Positive)')
    plt.title('V10 Components (Positive Samples)')
    plt.xlabel('Sample Index')
    plt.ylabel('Component Value')
    plt.legend()

    # Right panel: components of the negative samples
    plt.subplot(1, 2, 2)
    plt.scatter(range(len(V_A01)), V_A01, label='Model 1 (Negative)')
    plt.scatter(range(len(V_B01)), V_B01, label='Model 2 (Negative)')
    plt.title('V01 Components (Negative Samples)')
    plt.xlabel('Sample Index')
    plt.ylabel('Component Value')
    plt.legend()

    plt.tight_layout()
    plt.show()

5. Practical Application and Case Study

5.1 Usage Example

Let us walk through a concrete example of applying the DeLong test:

import numpy as np

# Generate simulated data
np.random.seed(42)
n_samples = 100
true_labels = np.random.randint(0, 2, size=n_samples)

# Model 1 predictions (random guessing)
model1_preds = np.random.uniform(0, 1, size=n_samples)

# Model 2 predictions (able to separate the classes)
model2_preds = np.where(true_labels == 1,
                        np.random.normal(0.7, 0.1, size=n_samples),
                        np.random.normal(0.3, 0.1, size=n_samples))
model2_preds = np.clip(model2_preds, 0, 1)

# Run the DeLong test
delong_test = DelongTest(model1_preds, model2_preds, true_labels)
print(f"Z-score: {delong_test.z:.4f}, P-value: {delong_test.p:.4f}")

# Visual comparison
delong_test.plot_roc_comparison()
delong_test.plot_structural_components()