
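A Complete Guide to Data Preprocessing for Mathematical Modeling Contests: The Full Workflow from Cleaning to Augmentation, with Code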

news 2026/6/18 0:37:13


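In mathematical modeling contests (such as MCM/ICM Problem C), raw data often has missing values, outliers, and inconsistent units or scales. Careful preprocessing is the foundation of a successful model, and it is also where a paper can demonstrate scientific rigor. This article walks through the full set of preprocessing techniques, from basic cleaning to advanced augmentation, and provides reusable code templates to help you build a solid data foundation within the limited contest time.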

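1. Data Cleaning: Handling Missing Values and Outliers

Contest data is typically large, messy, incomplete, and peppered with outliers. Handling missing values comes first: use df.info() and df.isnull().sum() to get a quick overview of what is missing. Four common strategies cover most scenarios: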

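Direct deletion: suitable when very few values are missing or the feature is unimportant. The code is short:

# Drop rows that contain any missing value
df_clean = df.dropna(axis=0)
# Drop columns that are mostly empty: keep only columns with at least ~50% non-missing values
# (thresh is the minimum number of non-missing values and should be an integer)
df_clean = df.dropna(thresh=int(len(df) * 0.5), axis=1)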

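Choosing between dropping and filling usually starts from the per-column missing ratio. The following is a minimal sketch on a hypothetical toy DataFrame (the column names and the 0.5 cutoff are assumptions for illustration, mirroring the 50% rule above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'Notes' is mostly empty, 'Age' has a single gap
df = pd.DataFrame({
    "Age": [23, 35, np.nan, 41, 29],
    "Income": [3200.0, np.nan, 4100.0, np.nan, 3900.0],
    "Notes": [np.nan, np.nan, np.nan, "ok", np.nan],
})

# Fraction of missing values per column, largest first
missing_ratio = df.isnull().mean().sort_values(ascending=False)
print(missing_ratio)

# Keep only columns whose missing ratio does not exceed the cutoff
max_missing = 0.5  # assumed cutoff, tune per dataset
kept_cols = missing_ratio[missing_ratio <= max_missing].index
df_reduced = df[kept_cols]
print(list(df_reduced.columns))
```

Filtering on the ratio directly makes the cutoff explicit, which is also convenient to report in the paper.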

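Statistical imputation: the most common approach. Use the mean or the median (more robust to extreme values) for numeric columns, and the mode for categorical ones. Example:

# Fill with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Fill with the median
df['Income'] = df['Income'].fillna(df['Income'].median())
# Forward-fill with the previous observation (suited to time series);
# fillna(method='ffill') is deprecated in recent pandas versions
df['Price'] = df['Price'].ffill()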

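The mode-based fill for categorical columns mentioned above can be sketched as follows; the City column and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two gaps
df = pd.DataFrame({"City": ["NY", "London", np.nan, "NY", np.nan, "Paris"]})

# mode() ignores missing values and returns a Series (ties are possible),
# so take its first entry as the fill value
most_common = df["City"].mode()[0]
df["City"] = df["City"].fillna(most_common)
print(df["City"].tolist())
```

When several categories tie for most frequent, mode() returns them in sorted order, so the first entry is an arbitrary but reproducible choice.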

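Interpolation: a staple for time series. When points are missing along an ordered time axis, linear or spline interpolation is usually more reasonable than a single global mean:

# Linear interpolation: fills each gap along the straight line between its neighbors
df['Temperature'] = df['Temperature'].interpolate(method='linear')
# Polynomial interpolation (smoother, for nonlinear trends);
# requires SciPy and a numeric or datetime index
df['Temperature'] = df['Temperature'].interpolate(method='polynomial', order=2)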


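KNN imputation: an advanced option that can earn extra credit. It fills each gap from the most similar samples, which suits datasets whose features are correlated:

import pandas as pd
from sklearn.impute import KNNImputer
# KNNImputer works on numeric columns only; encode or exclude categorical columns first
imputer = KNNImputer(n_neighbors=5)
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)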

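Because KNN imputation relies on distances between rows, features on large scales can dominate the neighbor search. One possible refinement, shown here as a sketch on made-up data rather than as a required step, is to standardize before imputing and then map the result back to the original units:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data with one gap per column and very different scales
df = pd.DataFrame({
    "Height_cm": [170.0, 165.0, np.nan, 180.0, 175.0, 160.0],
    "Income": [52000.0, 48000.0, 51000.0, np.nan, 60000.0, 45000.0],
    "Age": [34.0, 29.0, 33.0, 45.0, np.nan, 27.0],
})

# StandardScaler ignores NaNs when fitting and keeps them in its output
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Impute in the scaled space so every feature contributes comparably to the distance
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X_scaled)

# Convert back to the original units
df_filled = pd.DataFrame(
    scaler.inverse_transform(X_imputed), columns=df.columns, index=df.index
)
print(df_filled.round(1))
```

The value n_neighbors=2 is only chosen to fit the tiny example; on real data it is worth comparing a few settings.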

Paper phrasing:
“Due to the incompleteness of the raw data, we employed the Linear Interpolation method (or KNN Imputation) to fill the missing values, ensuring the continuity of the time-series data and preserving the sample size.”

Outliers can severely distort model results. The boxplot (IQR) rule is a robust way to detect them:

Q1 = df['feature'].quantile(0.25)
Q3 = df['feature'].quantile(0.75)
IQR = Q3 - Q1
# Bounds outside which a value counts as an outlier
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Option 1: drop the outlying rows
df_clean = df[(df['feature'] >= lower_bound) & (df['feature'] <= upper_bound)]
# Option 2: capping (winsorization), which clips values beyond either bound to that bound
df['feature'] = df['feature'].clip(lower_bound, upper_bound)

For approximately normally distributed data, the 3σ rule is an alternative:

mean = df['feature'].mean()
std = df['feature'].std()
threshold = 3
# Select the outliers
outliers = df[(df['feature'] - mean).abs() > threshold * std]
# Remove the outliers
df_clean = df[(df['feature'] - mean).abs() <= threshold * std]

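The two rules can disagree on small samples, because a single extreme value inflates the mean and standard deviation that the 3σ rule depends on. A small illustration on made-up numbers:

```python
import pandas as pd

# Hypothetical sample: nine ordinary readings and one extreme value
s = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 100])

# IQR rule
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# 3-sigma rule
sigma_outliers = s[(s - s.mean()).abs() > 3 * s.std()]

print("IQR rule flags:", iqr_outliers.tolist())       # [100]
print("3-sigma rule flags:", sigma_outliers.tolist())  # []
```

Here the value 100 pulls the standard deviation up to about 28, so its own deviation of 79.6 stays below the 3σ limit of roughly 84, while the quartile-based bounds are unaffected.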

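2. Data Transformation: Standardization, Normalization, and Encoding

Inconsistent feature scales can badly hurt distance-based models (such as K-Means and KNN) as well as gradient-trained models such as neural networks. The main transformations are:

Min-Max normalization: rescales data to the [0, 1] interval while preserving the shape of the original distribution:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# fit_transform returns a NumPy array, not a DataFrame
df_scaled = scaler.fit_transform(df[['feature1', 'feature2']])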


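Z-score standardization: the most general-purpose method. It rescales each feature to zero mean and unit standard deviation, and it is less sensitive to outliers than Min-Max scaling, although the mean and standard deviation are still affected by them:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit_transform returns a NumPy array, not a DataFrame
df_scaled = scaler.fit_transform(df[['feature1', 'feature2']])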

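When the data is split into training and test sets, a common practice is to learn the scaling statistics from the training rows only and reuse them on the test rows. The sketch below uses synthetic data (the feature means and scales are arbitrary assumptions) and checks the result against the formula z = (x - mean) / std:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(loc=[50.0, 5000.0], scale=[5.0, 800.0], size=(200, 2))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# The same transformation written out by hand
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(manual, X_test_scaled))
```

Fitting on the full dataset instead would let test-set information leak into the preprocessing step.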

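Categorical features must be converted to numbers before a model can use them:

Label encoding: suitable for ordered categories (such as education level).

One-hot encoding: suitable for unordered categories (such as city or color), but watch out for the dimensionality blow-up when a column has many distinct values:

# Automatically generate dummy variables (for example City_NY, City_London)
df = pd.get_dummies(df, columns=['City'], prefix='City')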

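For the label encoding of ordered categories described above, an explicit mapping keeps the intended order under your control. A minimal sketch, where the Education column and its level ordering are assumptions for illustration:

```python
import pandas as pd

# Hypothetical ordered category
df = pd.DataFrame({"Education": ["Bachelor", "High School", "PhD", "Master", "Bachelor"]})

# Spell out the order explicitly instead of relying on alphabetical sorting
education_order = {"High School": 0, "Bachelor": 1, "Master": 2, "PhD": 3}
df["Education_Code"] = df["Education"].map(education_order)
print(df)
```

Any category missing from the mapping becomes NaN, which makes unexpected labels easy to spot.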

Paper phrasing:
“We utilized the Interquartile Range (IQR) method to identify outliers. To prevent information loss, we applied Winsorization (capping) to limit extreme values within the reasonable range $[Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR}]$.”


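3. Feature Analysis and Dimensionality Reduction: From Correlation to Principal Components

After preprocessing, feature analysis is the bridge between data and model. A correlation heatmap helps identify multicollinearity and select key features:

import seaborn as sns
import matplotlib.pyplot as plt
# Compute the Pearson correlation matrix (numeric columns only)
corr_matrix = df.corr(numeric_only=True)
# Draw the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()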

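Beyond reading the heatmap by eye, highly correlated pairs can be filtered programmatically. The sketch below is one common heuristic, not the only option; the synthetic features and the 0.9 threshold are assumptions chosen to match the example in the paper phrasing:

```python
import numpy as np
import pandas as pd

# Synthetic data: Feature_B is nearly a scaled copy of Feature_A
rng = np.random.default_rng(42)
a = rng.normal(size=300)
df = pd.DataFrame({
    "Feature_A": a,
    "Feature_B": 2 * a + rng.normal(scale=0.05, size=300),
    "Feature_C": rng.normal(size=300),
})

# Absolute correlations, keeping only the upper triangle so each pair appears once
corr = df.corr(numeric_only=True).abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose correlation exceeds the threshold
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df.drop(columns=to_drop)
print("Dropped:", to_drop)
```

Which member of a correlated pair to keep is a modeling decision; this sketch simply keeps the one that appears first.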

Paper phrasing:
“We calculated the Pearson correlation coefficient matrix to investigate the relationships between variables. As shown in Figure X (Heatmap), ‘Feature A’ and ‘Feature B’ exhibit a strong positive correlation ( $r > 0.9$ ), indicating multicollinearity. Thus, we removed ‘Feature B’ to simplify the model.”

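When there are too many features or they are highly correlated, principal component analysis (PCA) is an effective dimensionality reduction tool. ⚠️ Key step: standardize the data before running PCA, because PCA is driven by variance and unscaled features with large units would dominate the components. Code template:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
# 1. Standardize the data (required)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)  # assumes df contains only numeric features
# 2. Fit PCA
pca = PCA(n_components=0.95)  # a float in (0, 1) keeps enough components to explain 95% of the variance
# Alternatively: pca = PCA(n_components=2)  # force 2 dimensions, convenient for plotting
X_pca = pca.fit_transform(X_scaled)
# 3. Inspect the result
print(f"Number of components kept: {pca.n_components_}")
print(f"Explained variance ratio per component: {pca.explained_variance_ratio_}")
# 4. Plot for the paper: cumulative explained variance against the number of components.
# (A classic scree plot shows the per-component variance and is read at its elbow;
# this curve shows how many components are needed to reach a target share.)
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         np.cumsum(pca.explained_variance_ratio_), 'bo-')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Scree Plot for PCA')
plt.grid(True)
plt.show()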


Paper phrasing:
“To reduce the dimensionality and eliminate multicollinearity, we applied Principal Component Analysis (PCA). We standardized the data first. As illustrated in the Scree Plot, the first 3 principal components explain over 95% of the total variance, effectively representing the original dataset.”

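4. Feature Engineering: Creating Information, Not Just Processing It

Strong modelers are good at creating new features. For time series, a date column such as 2023-01-01 can yield Month, Weekday, Is_Weekend, and so on. You can also build ratio or interaction features (for example, GDP per capita = GDP / population) and rate-of-change features (growth rate = (this year - last year) / last year). Example code:

# Time feature example
df['Date'] = pd.to_datetime(df['Date_Str'])
df['Month'] = df['Date'].dt.month
# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
df['Is_Weekend'] = (df['Date'].dt.dayofweek >= 5).astype(int)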

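The ratio and rate-of-change features mentioned above can be sketched in the same style. The numbers below are made up so the results are easy to check by hand:

```python
import pandas as pd

# Hypothetical yearly data, already sorted by year
df = pd.DataFrame({
    "Year": [2020, 2021, 2022],
    "GDP": [1000.0, 1100.0, 1210.0],
    "Population": [50.0, 50.0, 55.0],
})

# Ratio feature: GDP per capita = GDP / population
df["GDP_per_Capita"] = df["GDP"] / df["Population"]

# Rate-of-change feature: (this year - last year) / last year
df["GDP_Growth"] = df["GDP"].pct_change()
print(df)
```

The first row of GDP_Growth is NaN because there is no previous year to compare with; decide explicitly whether to drop or fill it.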

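For forecasting problems, lag and rolling-window features are essential:

# Shift 'Sales' down by one row so each row sees the previous day's value as an input feature
df['Sales_Lag1'] = df['Sales'].shift(1)
# 3-day moving average; note that the window includes the current row
df['Sales_Rolling_Mean'] = df['Sales'].rolling(window=3).mean()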

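If the rolling feature is used to predict the same day's Sales, a window that contains the current value leaks the target into the input. One way to avoid this, sketched here on made-up numbers with the hypothetical column name Sales_Prev3_Mean, is to shift before rolling:

```python
import pandas as pd

# Hypothetical daily sales, already sorted by date
df = pd.DataFrame({"Sales": [10.0, 20.0, 30.0, 40.0, 50.0]})

# Shift first so the window covers only the three days before the current row
df["Sales_Prev3_Mean"] = df["Sales"].shift(1).rolling(window=3).mean()
print(df)
# Row 3 averages 10, 20, 30 -> 20.0; row 4 averages 20, 30, 40 -> 30.0
```

The first three rows are NaN because fewer than three previous days are available.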

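5. Handling Imbalanced Data and Advanced Augmentation Techniques

In classification problems, class imbalance (for example, fraud detection, where normal samples far outnumber fraudulent ones) biases a model toward the majority class. SMOTE addresses this by synthesizing new minority-class samples and is a widely used technique:

# Requires: pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
X = df.drop('Label', axis=1)
y = df['Label']
smote = SMOTE(random_state=42)
# In practice, resample only the training split so the evaluation data stays untouched
X_res, y_res = smote.fit_resample(X, y)
# X_res and y_res now contain the two classes in a 1:1 ratio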

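To explain in a paper how SMOTE creates samples, it helps to see the core idea: each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbors. The following is a simplified NumPy sketch of that idea only, not the imbalanced-learn implementation; the function name smote_like_oversample and the toy points are hypothetical:

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors
    base = rng.integers(0, n, size=n_new)                     # random starting samples
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]  # one random neighbor each
    gap = rng.random((n_new, 1))                              # position along the segment
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Hypothetical minority-class samples with two features
X_minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2], [1.8, 2.1]])
X_new = smote_like_oversample(X_minority, n_new=20)
print(X_new.shape)  # (20, 2)
```

Because every synthetic point is a convex combination of two real samples, it stays within the range already spanned by the minority class.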

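When data is extremely scarce, the following advanced augmentation techniques are worth considering, but the paper must justify their use:

Pseudo-labeling: uses a model's high-confidence predictions on unlabeled data to enlarge the training set:

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
# Assume X_train, y_train are the labeled data
# and X_test is the unlabeled data (X_train and X_test are DataFrames)
# 1. First round of training
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
# 2. Predict the unlabeled data (with probabilities)
probs = model.predict_proba(X_test)  # one probability column per class, e.g. [P(0), P(1)]
preds = model.predict(X_test)
# 3. Keep high-confidence samples: rows whose largest class probability exceeds 0.9
# (for binary classification this means P(1) > 0.9 or P(1) < 0.1)
high_confidence_idx = np.where(np.max(probs, axis=1) > 0.9)[0]
if len(high_confidence_idx) > 0:
    # Extract the pseudo-labeled samples
    X_pseudo = X_test.iloc[high_confidence_idx]
    y_pseudo = preds[high_confidence_idx]
    # 4. Merge them into the training set
    X_train_new = pd.concat([X_train, X_pseudo], axis=0)
    y_train_new = np.concatenate([y_train, y_pseudo], axis=0)
    # 5. Second round of training (may improve results; verify on held-out data)
    model_final = RandomForestClassifier(n_estimators=100)
    model_final.fit(X_train_new, y_train_new)
    print(f"Added {len(X_pseudo)} pseudo-labeled samples")
else:
    # No prediction was confident enough, so keep the first-round model
    model_final = model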


Paper phrasing:
“To fully utilize the unlabeled data, we adopted a Pseudo-Labeling strategy. We iteratively added high-confidence predictions (probability > 0.9) from the test set into the training set, essentially transforming the problem into a Semi-Supervised Learning task to improve model generalization.”

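Gaussian noise injection: adds small random noise to increase data diversity and improve model robustness:

import numpy as np
import pandas as pd
def add_gaussian_noise(data, noise_level=0.01):
    # data: DataFrame or NumPy array of numeric features
    noise = np.random.normal(0, noise_level, data.shape)
    return data + noise
# Create a noisy copy of the training features.
# noise_level is an absolute standard deviation; it corresponds to roughly 2% of
# each feature's spread only if the features were standardized beforehand.
X_noisy = add_gaussian_noise(X_train, noise_level=0.02)
# Stack the original and noisy samples (X_train is a DataFrame, y_train a Series)
X_final = pd.concat([X_train, X_noisy], axis=0, ignore_index=True)
y_final = pd.concat([y_train, y_train], axis=0, ignore_index=True)  # labels are unchanged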

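When the features are not standardized, a fixed noise level affects each column very differently. One possible variant, given here as a sketch with a hypothetical helper add_relative_noise and made-up data, scales the noise by each feature's own standard deviation:

```python
import numpy as np
import pandas as pd

def add_relative_noise(X, ratio=0.02, seed=0):
    """Add Gaussian noise whose std is `ratio` times each column's std."""
    rng = np.random.default_rng(seed)
    scale = X.std(axis=0).to_numpy()  # per-feature standard deviation
    noise = rng.normal(0.0, 1.0, size=X.shape) * ratio * scale
    return X + noise

# Hypothetical training data with features on different scales
X_train = pd.DataFrame({
    "Temp": [20.0, 22.0, 19.0, 25.0],
    "Pressure": [1010.0, 1005.0, 1020.0, 998.0],
})
y_train = pd.Series([0, 1, 0, 1])

X_noisy = add_relative_noise(X_train, ratio=0.02)
X_final = pd.concat([X_train, X_noisy], ignore_index=True)
y_final = pd.concat([y_train, y_train], ignore_index=True)  # labels are unchanged
print(X_final.shape, y_final.shape)
```

As with any augmentation, the noisy copies should be added to the training set only, never to the data used for evaluation.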

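More cutting-edge approaches, such as the GAN-based CTGAN, are complex to train and hard to interpret, so they are not recommended for a time-constrained contest.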


6. The Complete Workflow and Advice for Contest Use

For quick use during a contest, here is a framework function that consolidates the key steps:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
def preprocess_data(df):
    # Template: uncomment the steps you need and adapt the column names
    # 1. Work on a copy so the raw data stays intact
    data = df.copy()
    # 2. Missing values (median imputation shown here)
    # numeric_cols = data.select_dtypes(include=[np.number]).columns
    # data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].median())
    # 3. Outliers (IQR capping)
    # for col in ['Key_Feature_1', 'Key_Feature_2']:
    #     Q1 = data[col].quantile(0.25)
    #     Q3 = data[col].quantile(0.75)
    #     IQR = Q3 - Q1
    #     lower = Q1 - 1.5 * IQR
    #     upper = Q3 + 1.5 * IQR
    #     data[col] = data[col].clip(lower, upper)
    # 4. Categorical encoding (one-hot)
    # data = pd.get_dummies(data, columns=['Category_Col'])
    # 5. Standardization (Z-score)
    # scaler = StandardScaler()
    # cols_to_scale = ['Feature_1', 'Feature_2']
    # data[cols_to_scale] = scaler.fit_transform(data[cols_to_scale])
    return data
# Usage
# df_final = preprocess_data(df_raw)

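In the paper, the preprocessing section should clearly explain the reason behind each choice. For example, the IQR rule was chosen over the 3σ rule because the data distribution is unknown and the IQR rule does not assume normality; SMOTE was used because the classes are imbalanced.

Core advice: always run comparison experiments. Showing model performance before and after a preprocessing technique (such as data augmentation) is itself strong validation and can be a highlight of your paper.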

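A before-and-after comparison of this kind can be set up with cross-validation. The sketch below uses synthetic data and a KNN classifier purely as an illustration; the dataset, the model, and the inflated feature scale are all assumptions, and the actual scores depend on your data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data with one feature on a much larger scale
X, y = make_classification(n_samples=300, n_features=6, n_informative=4, random_state=0)
X[:, 0] *= 1000.0

# Same model with and without the preprocessing step
raw_model = KNeighborsClassifier(n_neighbors=5)
scaled_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 5-fold cross-validated accuracy for both variants
raw_scores = cross_val_score(raw_model, X, y, cv=5)
scaled_scores = cross_val_score(scaled_model, X, y, cv=5)
print(f"without scaling: {raw_scores.mean():.3f} +/- {raw_scores.std():.3f}")
print(f"with scaling:    {scaled_scores.mean():.3f} +/- {scaled_scores.std():.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the comparison is not contaminated by the validation folds.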

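Mastering the full workflow, from basic cleaning (missing values, outliers) through data transformation (standardization, encoding) to advanced operations (feature engineering, dimensionality reduction, imbalance handling), is the key to handling complex data in mathematical modeling contests. The strategies and code templates in this article aim at a robust, interpretable, and efficient preprocessing pipeline that turns messy data into a high-quality feature set and lays a solid foundation for modeling. Remember that clear data processing logic and documentation matter as much as the final model accuracy.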
