
Module 4 - Data Transformation and Manipulation - 25. Dummy Variables and Encoding

news 2026/7/3 5:38:35

25. Dummy Variables and Encoding

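1. Overview

In machine learning and data analysis, most algorithms accept only numeric input. Dummy variables and related encoding techniques convert categorical variables into numeric form, and are a standard step in feature engineering.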

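    import pandas as pd
    import numpy as np

    # Create sample data (columns: name, city, education, department, grade)
    df = pd.DataFrame({
        '姓名': ['张三', '李四', '王五', '赵六', '钱七'],
        '城市': ['北京', '上海', '广州', '深圳', '杭州'],
        '学历': ['本科', '硕士', '本科', '博士', '硕士'],
        '部门': ['技术', '销售', '技术', '市场', '销售'],
        '等级': ['A', 'B', 'A', 'C', 'B'],
    })
    print("Original data:")
    print(df)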

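2. One-Hot Encoding

2.1 get_dummies() basics

One-hot encoding turns each category into its own binary feature (0 or 1). In pandas 2.0 and later, get_dummies() returns boolean columns (True/False) by default; pass dtype=int to get 0/1 values.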

    # One-hot encode a single column
    one_hot = pd.get_dummies(df['城市'], prefix='城市')
    print("One-hot encoding of 城市 (city):")
    print(one_hot)

    # Merge back into the original DataFrame
    df_encoded = pd.concat([df, one_hot], axis=1)
    print("\nAfter merging:")
    print(df_encoded.head())

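2.2 One-hot encoding several columns

    # Encode several columns at once
    one_hot_multi = pd.get_dummies(df[['城市', '学历']], prefix=['城市', '学历'])
    print("Multi-column one-hot encoding:")
    print(one_hot_multi)

    # Encode the whole DataFrame: only object, string, and category columns
    # are encoded (here that includes 姓名, the name column)
    df_all_dummies = pd.get_dummies(df)
    print("\nWhole DataFrame encoded:")
    print(df_all_dummies)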
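Encoding the whole DataFrame also expands identifier-like columns such as 姓名. A minimal sketch of one way to avoid that, using the `columns` parameter of `pd.get_dummies()` on a small made-up DataFrame (the three rows below are illustrative, not the article's data):

```python
import pandas as pd

# Hypothetical toy data with the same column names as the article's example
df_small = pd.DataFrame({
    '姓名': ['张三', '李四', '王五'],
    '城市': ['北京', '上海', '北京'],
    '学历': ['本科', '硕士', '本科'],
})

# Only the listed columns are encoded; 姓名 passes through unchanged.
# dtype=int requests 0/1 instead of the boolean default in pandas >= 2.0.
encoded = pd.get_dummies(df_small, columns=['城市', '学历'], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The original 城市 and 学历 columns are replaced by their dummy columns, so no separate `pd.concat` step is needed.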

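2.3 Parameters

    # prefix: prefix for the new column names
    one_hot_prefix = pd.get_dummies(df['城市'], prefix='city')

    # prefix_sep: separator between the prefix and the category name
    one_hot_sep = pd.get_dummies(df['城市'], prefix='城市', prefix_sep='_')

    # drop_first: drop the first category (avoids multicollinearity)
    one_hot_drop = pd.get_dummies(df['城市'], drop_first=True)
    print("drop_first=True (first category dropped):")
    print(one_hot_drop)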
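To illustrate why `drop_first=True` relates to multicollinearity: the full set of dummy columns for one variable sums to 1 in every row, so any single column is determined by the others. A small sketch on a made-up Series (the data and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data
cities = pd.Series(['北京', '上海', '广州', '北京'], name='城市')

full = pd.get_dummies(cities, prefix='城市', dtype=int)
reduced = pd.get_dummies(cities, prefix='城市', drop_first=True, dtype=int)

# Every row of the full encoding sums to exactly 1
row_sums = full.sum(axis=1)
print(row_sums.tolist())  # [1, 1, 1, 1]

# The dropped column can be reconstructed from the remaining ones
dropped_col = [c for c in full.columns if c not in reduced.columns][0]
recovered = 1 - reduced.sum(axis=1)
print(dropped_col, (recovered == full[dropped_col]).all())
```

With an intercept term in a linear model, the full encoding is therefore redundant; dropping one column removes that redundancy without losing information.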

3. Label Encoding

3.1 Label encoding with map()

    # Define the encoding map by hand
    level_map = {'A': 1, 'B': 2, 'C': 3}
    df['等级_编码'] = df['等级'].map(level_map)
    print("Label encoding:")
    print(df[['等级', '等级_编码']])
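One pitfall of `map()` with a dictionary is that any category missing from the dictionary silently becomes `NaN`. The sketch below shows one way to detect and handle this; the Series `grades` and the unseen value `'D'` are made up for illustration, and the `-1` sentinel is just one possible convention.

```python
import pandas as pd

level_map = {'A': 1, 'B': 2, 'C': 3}
grades = pd.Series(['A', 'B', 'D', 'C'])  # 'D' is deliberately absent from level_map

encoded = grades.map(level_map)

# Values with no entry in the mapping come back as NaN
unmapped = grades[encoded.isna()].unique().tolist()
print(unmapped)  # ['D']

# One option: assign a sentinel code such as -1 to unknown categories
encoded_filled = encoded.fillna(-1).astype(int)
print(encoded_filled.tolist())  # [1, 2, -1, 3]
```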

3.2 Using factorize()

    # factorize assigns integer codes in order of first appearance
    df['城市_编码'], city_codes = pd.factorize(df['城市'])
    print("factorize encoding:")
    print(df[['城市', '城市_编码']])
    print(f"\nCode mapping: {dict(enumerate(city_codes))}")

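3.3 Using astype('category') and cat.codes

    # Convert to category dtype, then read the integer codes
    # (codes follow the sorted order of the category values)
    df['学历_编码'] = df['学历'].astype('category').cat.codes
    print("category encoding:")
    print(df[['学历', '学历_编码']])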
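Since `cat.codes` returns bare integers, it can help to keep the code-to-label lookup and to know how missing values are treated. A brief sketch, using a made-up Series that includes one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing entry
education = pd.Series(['本科', '硕士', np.nan, '博士', '硕士'])
as_cat = education.astype('category')

# Code -> label lookup, in the order pandas assigned the categories
code_to_label = dict(enumerate(as_cat.cat.categories))
print(code_to_label)

# Missing values are encoded as -1 rather than as a category of their own
codes = as_cat.cat.codes
print(codes.tolist())
```

`pd.factorize()` uses the same `-1` convention for missing values by default, so downstream code may need to handle that sentinel explicitly.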

4. Frequency Encoding

    # Relative frequency of each category
    city_freq = df['城市'].value_counts() / len(df)
    df['城市_频率'] = df['城市'].map(city_freq)
    print("Frequency encoding:")
    print(df[['城市', '城市_频率']])

    # Count encoding
    city_count = df['城市'].value_counts()
    df['城市_计数'] = df['城市'].map(city_count)
    print("\nCount encoding:")
    print(df[['城市', '城市_计数']])
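A limitation worth seeing concretely: categories that occur equally often receive the same encoded value and become indistinguishable. The sketch below uses a made-up Series and `value_counts(normalize=True)`, which returns relative frequencies directly:

```python
import pandas as pd

# Hypothetical toy data: 北京 and 上海 each appear twice
cities = pd.Series(['北京', '上海', '北京', '广州', '上海', '深圳'])

# normalize=True is equivalent to value_counts() / len(cities) here
freq = cities.value_counts(normalize=True)
encoded = cities.map(freq)
print(encoded.round(3).tolist())

# Equal frequencies collide: the two cities get the same encoded value
print(freq['北京'] == freq['上海'])  # True
```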

5. Target Encoding

Target encoding replaces each category with the mean of the target variable for that category.

    # Create data with a target variable
    df_target = pd.DataFrame({
        '城市': ['北京', '上海', '广州', '北京', '上海', '广州', '北京', '上海', '广州'],
        '目标': [1, 0, 1, 1, 0, 0, 1, 1, 0],
    })

    # Mean target value per city
    city_mean = df_target.groupby('城市')['目标'].mean()
    df_target['城市_目标编码'] = df_target['城市'].map(city_mean)
    print("Target encoding:")
    print(df_target)
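Because plain target encoding computes and applies the means on the same rows, small categories can end up memorizing the target. One common mitigation is to shrink each category mean toward the global mean. The following is a sketch only: the helper name `smoothed_target_encode`, the new column name, and the smoothing weight `m` are assumptions introduced for illustration.

```python
import pandas as pd

def smoothed_target_encode(df, col, target, m=5.0):
    """Blend each category mean with the global mean.

    m is a hypothetical smoothing weight: larger m pulls small
    categories more strongly toward the global mean.
    """
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(['sum', 'count'])
    smoothed = (stats['sum'] + m * global_mean) / (stats['count'] + m)
    return df[col].map(smoothed)

df_target = pd.DataFrame({
    '城市': ['北京', '上海', '广州', '北京', '上海', '广州', '北京', '上海', '广州'],
    '目标': [1, 0, 1, 1, 0, 0, 1, 1, 0],
})
df_target['城市_平滑编码'] = smoothed_target_encode(df_target, '城市', '目标', m=5.0)
print(df_target.drop_duplicates('城市'))
```

In practice the statistics would usually be computed on training data only (or out-of-fold) and then applied to validation or test rows, so that the target of a row does not leak into its own feature.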

6. Ordinal Encoding

Used for categorical variables whose values have a natural order.

    # Create ordered data (rating: 差 < 中 < 良 < 优; satisfaction: 不满意 < 一般 < 满意 < 非常满意)
    df_ordinal = pd.DataFrame({
        '评价': ['差', '中', '良', '优', '差', '良', '优', '中', '良'],
        '满意度': ['不满意', '一般', '满意', '非常满意', '一般', '满意', '非常满意', '不满意', '满意'],
    })

    # Ordinal encoding maps
    rating_map = {'差': 1, '中': 2, '良': 3, '优': 4}
    satisfaction_map = {'不满意': 1, '一般': 2, '满意': 3, '非常满意': 4}

    df_ordinal['评价_编码'] = df_ordinal['评价'].map(rating_map)
    df_ordinal['满意度_编码'] = df_ordinal['满意度'].map(satisfaction_map)
    print("Ordinal encoding:")
    print(df_ordinal)
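As an alternative to a hand-written dictionary, the order can be declared once with an ordered `pd.CategoricalDtype`, which keeps the ranking attached to the column itself. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical toy data using the same rating labels
ratings = pd.Series(['差', '中', '良', '优', '差'])

# Declare the order explicitly: 差 < 中 < 良 < 优
rating_type = pd.CategoricalDtype(categories=['差', '中', '良', '优'], ordered=True)
ratings_cat = ratings.astype(rating_type)

# cat.codes starts at 0, so add 1 to match a 1..4 scale
codes = ratings_cat.cat.codes + 1
print(codes.tolist())  # [1, 2, 3, 4, 1]

# Ordered categoricals also support comparisons against a category
at_least_good = ratings_cat >= '良'
print(at_least_good.tolist())  # [False, False, True, True, False]
```

Values outside the declared categories become missing (code `-1` before the `+ 1` shift), which makes unexpected labels easier to spot.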

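7. Comparison of Encoding Methods

Encoding	Description	Pros	Cons	Typical use
One-hot	One feature per category	No order assumed	Dimensionality blow-up	Few categories (<10)
Label	Integer codes	Simple, adds no dimensions	Introduces a spurious order	Tree models
Frequency	Replace with frequency	Simple, captures the distribution	May overfit; categories with equal frequency collide	High-cardinality categories
Target	Replace with target mean	Captures the relation to the target	Prone to overfitting	Supervised learning
Ordinal	Encode by rank	Preserves order information	The order must be known	Ordered categories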

8. Complete Example: Encoding Customer Features

    # Create customer data (columns: city, education, occupation, membership tier, purchased)
    np.random.seed(42)
    customers = pd.DataFrame({
        'customer_id': range(1, 101),
        '城市': np.random.choice(['北京', '上海', '广州', '深圳', '杭州', '成都', '武汉'], 100),
        '教育程度': np.random.choice(['高中', '大专', '本科', '硕士', '博士'], 100,
                                 p=[0.1, 0.2, 0.4, 0.2, 0.1]),
        '职业': np.random.choice(['技术', '销售', '市场', '人事', '财务', '管理'], 100),
        '会员等级': np.random.choice(['普通', '黄金', '铂金', '钻石'], 100),
        '是否购买': np.random.choice([0, 1], 100, p=[0.6, 0.4]),
    })

    print("=" * 60)
    print("Customer feature encoding")
    print("=" * 60)
    print("\nOriginal data:")
    print(customers.head())
    print(f"Original shape: {customers.shape}")

    # 1. One-hot encoding (suited to low-cardinality categories)
    print("\n1. One-hot encoding of city:")
    city_dummies = pd.get_dummies(customers['城市'], prefix='city')
    print(f"Shape after one-hot encoding: {city_dummies.shape}")

    # 2. Ordinal encoding of education
    print("\n2. Ordinal encoding of education:")
    edu_map = {'高中': 1, '大专': 2, '本科': 3, '硕士': 4, '博士': 5}
    customers['教育_编码'] = customers['教育程度'].map(edu_map)
    print(customers[['教育程度', '教育_编码']].drop_duplicates())

    # 3. Ordinal encoding of membership tier
    print("\n3. Ordinal encoding of membership tier:")
    level_map = {'普通': 1, '黄金': 2, '铂金': 3, '钻石': 4}
    customers['会员_编码'] = customers['会员等级'].map(level_map)
    print(customers[['会员等级', '会员_编码']].drop_duplicates())

    # 4. One-hot encoding of occupation (adds one column per occupation)
    print("\n4. One-hot encoding of occupation:")
    job_dummies = pd.get_dummies(customers['职业'], prefix='job')
    print(f"Shape of occupation one-hot encoding: {job_dummies.shape}")

    # 5. Target encoding (using 是否购买, purchased, as the target)
    print("\n5. Target encoding of city:")
    city_target_mean = customers.groupby('城市')['是否购买'].mean()
    customers['城市_目标编码'] = customers['城市'].map(city_target_mean)
    print(customers[['城市', '城市_目标编码']].drop_duplicates())

    # 6. Combine all encoded features
    print("\n6. Combined features:")
    # Keep the original ID
    final_features = customers[['customer_id']].copy()
    # Add the numeric features
    final_features['教育_编码'] = customers['教育_编码']
    final_features['会员_编码'] = customers['会员_编码']
    final_features['城市_目标编码'] = customers['城市_目标编码']
    # Add the one-hot features
    final_features = pd.concat([final_features, city_dummies, job_dummies], axis=1)

    print(f"Final feature shape: {final_features.shape}")
    print("\nFeature list:")
    print(final_features.columns.tolist())

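9. Guide to Choosing an Encoding

Categorical variable encoding
│
├─ Few categories (<10)
│   │
│   ├─ Unordered → one-hot encoding
│   └─ Ordered → ordinal encoding
│
├─ Many categories (≥10)
│   │
│   ├─ Tree models → label encoding
│   ├─ Linear models → frequency / target encoding
│   └─ Deep learning → embedding
│
└─ Target variable available
    │
    └─ Target encoding (watch for overfitting)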
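The first two branches of the guide above can be written as a small helper. This is only an illustrative sketch: the function name `choose_encoding`, its parameters, and the string labels it returns are assumptions, and the third branch (target encoding when a target is available) is left out because it overlaps with the other two.

```python
def choose_encoding(n_categories, ordered=False, model='tree'):
    """Suggest an encoding following the guide's first two branches.

    model is a hypothetical label: 'tree', 'linear', or 'deep'.
    """
    if n_categories < 10:
        # Few categories: the choice depends on whether an order exists
        return 'ordinal' if ordered else 'one-hot'
    # Many categories: the choice depends on the model family
    suggestions = {
        'tree': 'label',
        'linear': 'frequency or target',
        'deep': 'embedding',
    }
    if model not in suggestions:
        raise ValueError(f"unknown model type: {model!r}")
    return suggestions[model]

print(choose_encoding(5))                      # one-hot
print(choose_encoding(4, ordered=True))        # ordinal
print(choose_encoding(200, model='linear'))    # frequency or target
```

The threshold of 10 comes from the guide; it is a rule of thumb rather than a hard limit.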

10. Summary

Function/method	Purpose	Example
`pd.get_dummies()`	One-hot encoding	`pd.get_dummies(df['col'])`
`pd.get_dummies(drop_first=True)`	One-hot encoding (avoids collinearity)	`pd.get_dummies(df['col'], drop_first=True)`
`pd.factorize()`	Label encoding	`pd.factorize(df['col'])`
`astype('category').cat.codes`	Label encoding	`df['col'].astype('category').cat.codes`
`map(dict)`	Custom encoding	`df['col'].map({'A':1, 'B':2})`
`value_counts() / len(df)`	Frequency encoding	`df['col'].map(df['col'].value_counts()/len(df))`
`groupby().transform('mean')`	Target encoding	`df.groupby('col')['target'].transform('mean')`