
Module 3, Data Cleaning and Preprocessing: 13. Handling Missing Values (Part 2): Filling Missing Values

news 2026/6/30 2:02:43

13. Handling Missing Values (Part 2): Filling Missing Values

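1. Overview

Dropping missing values discards information, so filling them in is the more commonly used approach. Pandas provides several filling methods, including fillna(), ffill(), bfill(), and interpolate(), so you can choose the strategy that best fits the characteristics of your data.

    import pandas as pd
    import numpy as np

    # Create sample data containing missing values
    # Columns: 姓名 = name, 年龄 = age, 工资 = salary, 部门 = department
    # Departments: 技术 = tech, 销售 = sales, 市场 = marketing
    np.random.seed(42)
    df = pd.DataFrame({
        '姓名': ['张三', '李四', '王五', '赵六', '钱七', '孙八', '周九', '吴十'],
        '年龄': [25, np.nan, 28, 32, np.nan, 27, 29, np.nan],
        '工资': [8000, 12000, np.nan, 15000, 11000, 9500, np.nan, 12500],
        '部门': ['技术', '销售', '技术', np.nan, '销售', '技术', '市场', '销售']
    })
    print("Original data:")
    print(df)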

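2. fillna() Basics

2.1 Filling with a fixed value

    # Fill with a constant
    print("Fill all missing values with 0:")
    print(df.fillna(0))

    # Fill with a string ('未知' = unknown); numeric columns that
    # receive the string are no longer numeric afterwards
    print("\nFill missing values with '未知':")
    print(df.fillna('未知'))

    # Fill with a dict: a different value for each column ('其他' = other)
    fill_values = {'年龄': 30, '工资': 10000, '部门': '其他'}
    print("\nFill each column with its own value:")
    print(df.fillna(fill_values))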
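Filling with a constant has a side effect that is easy to miss: a string placed into a numeric column changes that column's type, while a per-column dict avoids this. The sketch below illustrates the difference on a small hypothetical frame named `demo` (not part of the tutorial data); it also shows that fillna() returns a new object and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame, used only for this illustration
demo = pd.DataFrame({
    "age": [25.0, np.nan, 28.0],
    "dept": ["tech", None, "sales"],
})

# One string for every column: "age" stops being a numeric column
filled_all = demo.fillna("unknown")

# A per-column dict keeps "age" numeric
filled_by_col = demo.fillna({"age": 0, "dept": "unknown"})

print(filled_all.dtypes)
print(filled_by_col.dtypes)

# fillna() returned new frames; demo still has its missing values
print(demo.isna().sum())
```

If the numeric type matters for later calculations, the dict form is usually the safer choice.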

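2.2 Filling with a statistic

    # Fill with the mean
    print("Fill 年龄 (age) with the mean:")
    mean_age = df['年龄'].mean()
    df['年龄_filled'] = df['年龄'].fillna(mean_age)
    print(df[['姓名', '年龄', '年龄_filled']])

    # Fill with the median
    print("\nFill 工资 (salary) with the median:")
    median_salary = df['工资'].median()
    df['工资_filled'] = df['工资'].fillna(median_salary)
    print(df[['姓名', '工资', '工资_filled']])

    # Fill with the mode; mode() returns a Series, so take the first entry
    print("\nFill 部门 (department) with the mode:")
    mode_dept = df['部门'].mode()[0]
    df['部门_filled'] = df['部门'].fillna(mode_dept)
    print(df[['姓名', '部门', '部门_filled']])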
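The choice between mean and median matters most when a column contains extreme values. As a small illustration with made-up numbers (the `salary` and `dept` series below are hypothetical), one outlier pulls the mean far from the typical value while the median barely moves. The last lines show why `mode()[0]` is needed: mode() returns every tied value, sorted.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries: one extreme value and one missing entry
salary = pd.Series([8000, 9000, 9500, 10000, 100000, np.nan])

mean_fill = salary.fillna(salary.mean())      # pulled up by the outlier
median_fill = salary.fillna(salary.median())  # stays near the typical value

print(mean_fill.iloc[-1])    # 27300.0
print(median_fill.iloc[-1])  # 9500.0

# mode() returns a Series because ties are possible;
# [0] picks the first of the sorted tied values
dept = pd.Series(["tech", "sales", "tech", "sales", None])
print(dept.mode().tolist())  # ['sales', 'tech']
```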

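3. Forward and Backward Fill

3.1 Forward fill (ffill)

ffill fills each missing value with the most recent non-missing value before it.

    # Create ordered data (日期 = date, 销量 = sales volume)
    import pandas as pd
    import numpy as np

    time_series = pd.DataFrame({
        '日期': pd.date_range('2024-01-01', periods=10),
        '销量': [100, np.nan, np.nan, 130, 140, np.nan, 160, np.nan, np.nan, 190]
    })
    print("Original time series:")
    print(time_series)

    # Forward fill
    # (fillna(method='ffill') is deprecated since pandas 2.1; use ffill())
    print("\nForward fill (ffill):")
    time_series['销量_ffill'] = time_series['销量'].ffill()
    print(time_series)

    # Limit the number of consecutive fills
    # (no gap in this data is longer than 2, so the result matches plain
    # ffill; try limit=1 to see the second NaN of each gap left unfilled)
    print("\nFill at most 2 consecutive values:")
    time_series['销量_limit'] = time_series['销量'].ffill(limit=2)
    print(time_series)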

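3.2 Backward fill (bfill)

bfill fills each missing value with the next non-missing value after it.

    # Backward fill
    # (fillna(method='bfill') is deprecated since pandas 2.1; use bfill())
    print("Backward fill (bfill):")
    time_series['销量_bfill'] = time_series['销量'].bfill()
    print(time_series)

    # Apply to a whole DataFrame
    # (df has no meaningful row order, so this only demonstrates the call:
    # each gap receives the value from the row above it)
    print("\nForward fill on the whole DataFrame:")
    print(df.ffill())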

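3.3 Choosing a fill direction

Method / parameter	Description	Typical use
`ffill`	Fill with the previous value	Time series, ordered data
`bfill`	Fill with the next value	Time series, ordered data
`limit`	Cap the number of consecutive fills	Avoiding over-filling long gaps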
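Two edge cases are worth seeing on a tiny example: a forward fill cannot fill a gap at the very start of the data, and `limit` leaves the rest of a long gap empty. The series `s` below is a hypothetical example chosen to show both; chaining ffill() and bfill() is one common way to cover the leading gap.

```python
import numpy as np
import pandas as pd

# Hypothetical series: starts with a gap and contains a 3-value gap
s = pd.Series([np.nan, 10, np.nan, np.nan, np.nan, 50])

# The leading NaN has no earlier value to copy, so it stays missing
print(s.ffill().tolist())          # [nan, 10.0, 10.0, 10.0, 10.0, 50.0]

# limit=1 fills only the first NaN of each gap
print(s.ffill(limit=1).tolist())   # [nan, 10.0, 10.0, nan, nan, 50.0]

# Forward fill first, then backward fill whatever is left at the start
print(s.ffill().bfill().tolist())  # [10.0, 10.0, 10.0, 10.0, 10.0, 50.0]
```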

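4. Interpolation

4.1 Linear interpolation

interpolate() estimates each missing value from the values of its neighboring points.

    # Create sample data: a noisy sine curve
    import matplotlib.pyplot as plt

    np.random.seed(42)
    x = np.linspace(0, 10, 20)
    y = np.sin(x) + np.random.normal(0, 0.1, 20)

    # Remove a few points
    missing_idx = [3, 7, 12, 15]
    y_with_nan = y.copy()
    y_with_nan[missing_idx] = np.nan

    df_interp = pd.DataFrame({'x': x, 'y': y_with_nan})
    print("Original data:")
    print(df_interp)

    # Linear interpolation
    print("\nLinear interpolation:")
    df_interp['y_linear'] = df_interp['y'].interpolate(method='linear')
    print(df_interp)

    # Visual comparison
    plt.figure(figsize=(12, 4))
    plt.plot(x, y, 'o-', label='Original data', alpha=0.7)
    # Mark the removed points at their true values
    plt.plot(x[missing_idx], y[missing_idx], 'rx', label='Removed points', markersize=10)
    plt.plot(x, df_interp['y_linear'], 's--', label='Linear interpolation', alpha=0.7)
    plt.legend()
    plt.title('Linear interpolation result')
    plt.show()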
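To make the arithmetic concrete, here is a minimal worked example on hypothetical values. It shows three behaviors of interpolate() that are easy to overlook: with the default settings a trailing gap is filled with the last valid value, `limit_area="inside"` restricts filling to gaps surrounded by valid values, and `method="index"` uses the actual index spacing rather than treating rows as evenly spaced.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, np.nan, np.nan, 40, np.nan])

# Evenly spaced steps between 10 and 40; the trailing NaN repeats 40
linear = s.interpolate(method="linear")
print(linear.tolist())       # [10.0, 20.0, 30.0, 40.0, 40.0]

# Only fill gaps that have valid values on both sides
inside_only = s.interpolate(method="linear", limit_area="inside")
print(inside_only.tolist())  # [10.0, 20.0, 30.0, 40.0, nan]

# Unevenly spaced index: positions 0, 1 and 10
uneven = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])
by_position = uneven.interpolate(method="linear")  # ignores the index
by_index = uneven.interpolate(method="index")      # respects the index
print(by_position.iloc[1])   # 5.0
print(by_index.iloc[1])      # 1.0
```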

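4.2 Other interpolation methods

    # A slightly more complex example
    x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    y = np.array([10, np.nan, np.nan, 40, 50, np.nan, 70, 80, np.nan, 100])
    df_interp2 = pd.DataFrame({'x': x, 'y': y})
    print("Original data:")
    print(df_interp2)

    # Compare interpolation methods
    # ('quadratic', 'cubic' and 'polynomial' require SciPy to be installed)
    # Note: these sample points lie on a straight line, so the methods
    # should give nearly identical results; they diverge on curved data.
    methods = ['linear', 'quadratic', 'cubic', 'polynomial']
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    axes = axes.flatten()

    for i, method in enumerate(methods):
        if method == 'polynomial':
            # 'polynomial' needs an explicit order
            df_interp2[f'y_{method}'] = df_interp2['y'].interpolate(method=method, order=2)
        else:
            df_interp2[f'y_{method}'] = df_interp2['y'].interpolate(method=method)
        axes[i].plot(x, y, 'o-', label='Original', alpha=0.7)
        axes[i].plot(x, df_interp2[f'y_{method}'], 's--', label=method, alpha=0.7)
        axes[i].legend()
        axes[i].set_title(f'{method} interpolation')

    plt.tight_layout()
    plt.show()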

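5. Group-wise Filling

5.1 Filling with the group mean

    # Fill 工资 (salary) by 部门 (department)
    print("Original data:")
    print(df)

    # Average salary per department
    dept_mean = df.groupby('部门')['工资'].mean()
    print("\nAverage salary per department:")
    print(dept_mean)

    # Fill each gap with its department's mean.
    # fillna() on the original column keeps observed salaries even for rows
    # whose department is missing (groupby skips those rows, so a plain
    # transform would return NaN for them).
    group_mean = df.groupby('部门')['工资'].transform('mean')
    df['工资_group_fill'] = df['工资'].fillna(group_mean)
    print("\nAfter filling by department:")
    # 周九 stays NaN: 市场 has no observed salary to average
    print(df[['姓名', '部门', '工资', '工资_group_fill']])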

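5.2 Filling with the group median

    # Fill 年龄 (age) by 部门 (department) using the median
    group_median = df.groupby('部门')['年龄'].transform('median')
    # In this sample every missing age is in 销售, where no age is observed,
    # so the group median is NaN; fall back to the overall median.
    df['年龄_group_fill'] = df['年龄'].fillna(group_median).fillna(df['年龄'].median())
    print("Age filled by department:")
    print(df[['姓名', '部门', '年龄', '年龄_group_fill']])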
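Group-wise filling can leave gaps in two situations: the group has no observed values at all, or the grouping key itself is missing. A minimal sketch of a reusable helper that handles both is shown below. The function name `fill_by_group` and the sample frame `demo` are hypothetical and not part of pandas; the helper falls back to the overall statistic when the group statistic is unavailable.

```python
import numpy as np
import pandas as pd

def fill_by_group(frame, value_col, group_col, stat="mean"):
    """Fill gaps in value_col with the group statistic, then the overall one.

    Hypothetical helper for illustration; `stat` is any aggregation name
    accepted by pandas, such as "mean" or "median".
    """
    group_stat = frame.groupby(group_col)[value_col].transform(stat)
    overall = frame[value_col].agg(stat)
    # fillna() only touches missing entries, so observed values survive
    # even where group_stat itself is NaN
    return frame[value_col].fillna(group_stat).fillna(overall)

demo = pd.DataFrame({
    "dept": ["tech", "tech", "sales", "sales", None, "market"],
    "salary": [8000, np.nan, 12000, np.nan, 15000, np.nan],
})

result = fill_by_group(demo, "salary", "dept")
print(result)
```

Here the gaps in "tech" and "sales" take their group means, the row with no department keeps its observed 15000, and the "market" row, whose group has no observed salary, receives the overall mean.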

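6. Advanced Filling Strategies

6.1 Model-based filling with machine learning

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.impute import SimpleImputer

    # Prepare the data (this tiny sample only illustrates the workflow)
    df_ml = df.copy()

    # Split rows by whether 工资 (salary) is observed
    train_df = df_ml[df_ml['工资'].notna()]
    predict_df = df_ml[df_ml['工资'].isna()]

    if len(predict_df) > 0:
        # Features used to predict salary
        features = ['年龄']

        # The feature itself has gaps, so fill them with the training mean
        age_imputer = SimpleImputer(strategy='mean')
        train_age = age_imputer.fit_transform(train_df[features])
        predict_age = age_imputer.transform(predict_df[features])

        # Train on the rows with a known salary
        model = RandomForestRegressor(n_estimators=100, random_state=42)
        model.fit(train_age, train_df['工资'])

        # Predict the missing salaries and write them back
        predicted_salaries = model.predict(predict_age)
        df_ml.loc[predict_df.index, '工资'] = predicted_salaries
        print(f"Predicted salaries: {predicted_salaries}")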

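6.2 KNN imputation

    from sklearn.impute import KNNImputer

    # KNN imputation: each gap is filled from the 3 most similar rows
    knn_imputer = KNNImputer(n_neighbors=3)
    df_knn = df.copy()

    # KNNImputer works on numeric columns only
    numeric_cols = ['年龄', '工资']
    df_knn[numeric_cols] = knn_imputer.fit_transform(df_knn[numeric_cols])
    print("After KNN imputation:")
    print(df_knn)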

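7. Comparison of Filling Methods

Method	Pros	Cons	Best suited for
Fixed value	Simple and fast	May introduce bias	Placeholders, well-defined defaults
Mean/median	Preserves the column's mean (or median)	Reduces variance and distorts the distribution	Numeric data: mean for roughly symmetric distributions, median for skewed ones
Mode	Works for categorical variables	May oversimplify	Categorical variables
Forward/backward fill	Preserves continuity in ordered data	Unsuitable for unordered data with randomly scattered gaps	Time series
Linear interpolation	Smooth transitions	Assumes a linear relationship between neighbors	Continuous data
Group-wise fill	Accounts for differences between groups	Requires a meaningful grouping key	Data with clear groups
Machine learning	Can be more accurate when other features are informative	More complex to build and validate	Important features, large datasets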
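The variance cost of mean filling listed in the table is easy to check numerically. The sketch below uses simulated data (the distribution parameters, sample size, and 30% missing rate are arbitrary assumptions): the mean is unchanged after filling, but the standard deviation shrinks because every filled value sits exactly at the center.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated incomes, then 30% of the values removed at random
full = pd.Series(rng.normal(loc=8000, scale=2000, size=1000))
holed = full.copy()
holed.iloc[rng.choice(1000, size=300, replace=False)] = np.nan

filled = holed.fillna(holed.mean())

print(f"mean before: {holed.mean():.1f}, after: {filled.mean():.1f}")
print(f"std  before: {holed.std():.1f}, after: {filled.std():.1f}")
```

The larger the share of filled values, the stronger the shrinkage, which is one reason to prefer group-wise or model-based filling for columns with many gaps.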

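8. Complete Example: Filling Customer Data

    # Create customer data
    # Columns: 客户ID = customer ID, 年龄 = age, 收入 = income,
    # 消费次数 = purchase count, 城市 = city, 会员等级 = membership tier
    np.random.seed(42)
    customers = pd.DataFrame({
        '客户ID': range(1, 101),
        # Stored as float so NaN can be assigned without a dtype error
        '年龄': np.random.randint(18, 70, 100).astype(float),
        '收入': np.random.normal(8000, 2000, 100).round(0),
        '消费次数': np.random.poisson(10, 100).astype(float),
        '城市': np.random.choice(['北京', '上海', '广州', '深圳'], 100),
        '会员等级': np.random.choice(['普通', '黄金', '铂金', '钻石'], 100)
    })

    # Randomly insert 10 missing values into each column
    for col in ['年龄', '收入', '消费次数', '城市', '会员等级']:
        idx = np.random.choice(100, 10, replace=False)
        customers.loc[idx, col] = np.nan

    print("=" * 60)
    print("Filling missing values in customer data")
    print("=" * 60)
    print("\nMissing values before filling:")
    print(customers.isna().sum())

    # 1. Age: fill with the median
    customers['年龄'] = customers['年龄'].fillna(customers['年龄'].median())

    # 2. City: fill with the mode (done before step 3, which groups by city)
    customers['城市'] = customers['城市'].fillna(customers['城市'].mode()[0])

    # 3. Income: fill with the mean of the customer's city
    city_income_mean = customers.groupby('城市')['收入'].transform('mean')
    customers['收入'] = customers['收入'].fillna(city_income_mean)

    # 4. Purchase count: fill with the mean, rounded to an integer
    customers['消费次数'] = (
        customers['消费次数'].fillna(customers['消费次数'].mean()).round(0).astype(int)
    )

    # 5. Membership tier: fill with the mode
    customers['会员等级'] = customers['会员等级'].fillna(customers['会员等级'].mode()[0])

    print("\nMissing values after filling:")
    print(customers.isna().sum())
    print("\nData after filling:")
    print(customers.head(10))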

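9. Decision Flow for Filling

Missing values found
│
├─ Numeric variable
│   ├─ Small dataset → mean/median fill
│   ├─ Time series → forward/backward fill or interpolation
│   ├─ Grouped data → group-mean fill
│   └─ Large dataset → machine-learning imputation
│
├─ Categorical variable
│   ├─ Clear default exists → fixed-value fill
│   ├─ No default → mode fill
│   └─ Important variable → treat missing as its own category
│
└─ Missing rate too high (>70%)
    └─ Consider dropping the column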
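The flow above can be sketched as a small function that returns a suggested strategy for one column. This is only an illustration: the function name `suggest_fill`, its flags, and the `large_n` cutoff for a "large" dataset are assumptions made here, since the flow gives no numeric threshold other than the 70% missing rate. The important-variable branch is left out because importance cannot be inferred from the data alone.

```python
import numpy as np
import pandas as pd

def suggest_fill(series, is_time_ordered=False, has_group=False,
                 max_missing=0.7, large_n=10_000):
    """Suggest a fill strategy for one column (hypothetical helper).

    large_n is an assumed cutoff for "large dataset"; tune it to your data.
    """
    if series.isna().mean() > max_missing:
        return "consider dropping the column"
    if pd.api.types.is_numeric_dtype(series):
        if is_time_ordered:
            return "forward/backward fill or interpolation"
        if has_group:
            return "group mean"
        if len(series) >= large_n:
            return "machine-learning imputation"
        return "mean or median"
    # Categorical branch: a known default beats the mode when one exists
    return "fixed default if one exists, otherwise mode"

print(suggest_fill(pd.Series([1.0, np.nan, 3.0])))
print(suggest_fill(pd.Series([1.0, np.nan, 3.0]), is_time_ordered=True))
print(suggest_fill(pd.Series(["a", None, "a"])))
print(suggest_fill(pd.Series([np.nan, np.nan, np.nan, np.nan, 1.0])))
```

A function like this is best treated as a starting point for review rather than an automatic rule, since the right choice also depends on why the values are missing.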

10. Summary

Method	Function	Example
Fixed value	`fillna(value)`	`df.fillna(0)`
Mean fill	`fillna(mean)`	`df['col'].fillna(df['col'].mean())`
Median fill	`fillna(median)`	`df['col'].fillna(df['col'].median())`
Mode fill	`fillna(mode)`	`df['col'].fillna(df['col'].mode()[0])`
Forward fill	`ffill()`	`df.ffill()`
Backward fill	`bfill()`	`df.bfill()`
Linear interpolation	`interpolate()`	`df['col'].interpolate()`
Group-wise fill	`groupby().transform()`	`df['col'].fillna(df.groupby('group')['col'].transform('mean'))`