Feature Engineering: From Data to Features

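1. Technical Analysis

1.1 Feature Engineering Workflow

Feature engineering is a core stage of any machine learning project. The workflow runs through five steps:

Data understanding → Feature extraction → Feature selection → Feature transformation → Feature validation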

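1.2 Feature Types

| Type | Description | Typical processing |
| --- | --- | --- |
| Numerical | Continuous values | Normalization, standardization |
| Categorical | Class labels | One-hot encoding, label encoding |
| Text | Text data | TF-IDF, Word2Vec |
| Temporal | Time data | Time differences, cyclical features |
| Spatial | Geographic data | Distance calculation, grid encoding |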
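
The temporal and spatial rows are not covered by the code later in this article, so here is a minimal sketch of what they could look like in plain Python. The function names (`cyclical_hour_features`, `days_between`, `haversine_km`, `grid_cell`) and the 0.1-degree cell size are assumptions made for illustration, not part of any library.

```python
import math
from datetime import datetime


def cyclical_hour_features(ts):
    """Map the hour of day onto a circle so 23:00 and 00:00 end up close together."""
    angle = 2 * math.pi * ts.hour / 24
    return math.sin(angle), math.cos(angle)


def days_between(start, end):
    """Time-difference feature: elapsed time in days (may be fractional)."""
    return (end - start).total_seconds() / 86400


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (latitude, longitude) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def grid_cell(lat, lon, cell_deg=0.1):
    """Grid encoding: bucket a coordinate into a (row, col) cell of cell_deg degrees."""
    return math.floor(lat / cell_deg), math.floor(lon / cell_deg)


hour_sin, hour_cos = cyclical_hour_features(datetime(2024, 1, 1, 6))
elapsed = days_between(datetime(2024, 1, 1), datetime(2024, 1, 2, 12))
cell = grid_cell(31.23, 121.47)
```

The sine/cosine pair is used instead of the raw hour because a raw hour treats 23 and 0 as far apart, while on the circle they are neighbours.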

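1.3 Feature Selection Methods

Feature selection methods fall into three families:

  • Filter methods: score features with statistical measures, independently of any model
  • Wrapper methods: score feature subsets by the performance of a model trained on them
  • Embedded methods: use what a model learns during training, such as feature importances or coefficients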
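
To make the filter family concrete, the sketch below ranks features by their absolute Pearson correlation with the target and keeps the top k. It is one possible filter criterion among many; the helper names `pearson` and `filter_top_k` and the toy data are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def filter_top_k(columns, target, k):
    """Filter method: score each feature without training a model, keep the k best."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data, invented for this example
columns = {
    "age": [22, 35, 47, 51, 63],
    "noise": [5, 1, 4, 2, 3],
    "constant": [1, 1, 1, 1, 1],
}
target = [0, 0, 1, 1, 1]

selected, scores = filter_top_k(columns, target, k=1)
```

Because each feature is scored on its own, this is cheap on wide data, but it cannot see interactions between features; wrapper and embedded methods trade speed for that ability.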

2. Core Implementation

2.1 Feature Extraction

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class FeatureExtractor:
    """Registry that runs several named extractors over the same data."""

    def __init__(self):
        self.extractors = {}

    def add_extractor(self, name, extractor):
        self.extractors[name] = extractor

    def extract(self, data, fit=True):
        # fit=True learns statistics from `data` (training). Pass fit=False at
        # inference time so already-fitted extractors are only applied.
        features = {}
        for name, extractor in self.extractors.items():
            if fit and hasattr(extractor, 'fit_transform'):
                features[name] = extractor.fit_transform(data)
            elif hasattr(extractor, 'transform'):
                features[name] = extractor.transform(data)
            else:
                features[name] = extractor(data)  # plain callable
        return features


class NumericalFeatureExtractor:
    """Standardizes numeric columns (zero mean, unit variance)."""

    def __init__(self, columns=None):
        self.columns = columns
        self.scaler = StandardScaler()

    def _select(self, df):
        # Use the configured columns, or fall back to every numeric column.
        if self.columns:
            return df[self.columns]
        return df.select_dtypes(include=[np.number])

    def fit_transform(self, df):
        return self.scaler.fit_transform(self._select(df))

    def transform(self, df):
        return self.scaler.transform(self._select(df))


class CategoricalFeatureExtractor:
    """One-hot encodes categorical columns; unseen categories become all zeros."""

    def __init__(self, columns=None):
        self.columns = columns
        # sparse_output requires scikit-learn >= 1.2; older releases used sparse=False.
        self.encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

    def _select(self, df):
        # Use the configured columns, or fall back to every object-dtype column.
        if self.columns:
            return df[self.columns]
        return df.select_dtypes(include=['object'])

    def fit_transform(self, df):
        return self.encoder.fit_transform(self._select(df))

    def transform(self, df):
        return self.encoder.transform(self._select(df))


class TextFeatureExtractor:
    """TF-IDF features for a single text column."""

    def __init__(self, column, max_features=5000):
        self.column = column
        self.vectorizer = TfidfVectorizer(max_features=max_features)

    def fit_transform(self, df):
        # toarray() densifies the sparse matrix; this can be memory-heavy on large corpora.
        return self.vectorizer.fit_transform(df[self.column]).toarray()

    def transform(self, df):
        return self.vectorizer.transform(df[self.column]).toarray()
```
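
Each extractor above exposes both `fit_transform` and `transform`. The distinction matters: statistics should be learned from training data only and then reused on new data. The sketch below spells that out with a hypothetical `TinyStandardScaler`, a plain-Python stand-in written only to show what a standard scaler stores; it is not scikit-learn's implementation.

```python
import math


class TinyStandardScaler:
    """Hypothetical stand-in for a standard scaler, showing what fit() learns."""

    def fit(self, values):
        n = len(values)
        self.mean_ = sum(values) / n
        var = sum((v - self.mean_) ** 2 for v in values) / n  # population variance
        self.scale_ = math.sqrt(var) or 1.0  # constant columns: avoid dividing by zero
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.scale_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)


train = [10.0, 20.0, 30.0]
test = [40.0]

scaler = TinyStandardScaler()
train_scaled = scaler.fit_transform(train)  # statistics come from the training data only
test_scaled = scaler.transform(test)        # test data reuses them; nothing is re-fitted
```

Calling `fit_transform` again on the test set would silently rescale it with different statistics, which is a common source of train/test mismatch.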

2.2 Feature Selection

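```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif


class FeatureSelector:
    """Selects k features with a filter method (mutual information) or a wrapper method (RFE)."""

    def __init__(self, method='filter', k=10):
        self.method = method
        self.k = k
        self.selector = None

    def fit(self, X, y):
        if self.method == 'filter':
            self.selector = SelectKBest(score_func=mutual_info_classif, k=self.k)
        elif self.method == 'rfe':
            estimator = RandomForestClassifier()
            self.selector = RFE(estimator, n_features_to_select=self.k)
        else:
            raise ValueError(f"Unknown selection method: {self.method!r}")
        self.selector.fit(X, y)
        return self

    def transform(self, X):
        return self.selector.transform(X)

    def get_selected_features(self):
        # Both SelectKBest and RFE expose get_support(); ranking_ is only a fallback.
        if hasattr(self.selector, 'get_support'):
            return self.selector.get_support(indices=True)
        return self.selector.ranking_


class FeatureImportanceAnalyzer:
    """Fits a model and returns (feature name, importance) pairs, most important first."""

    def __init__(self, model):
        self.model = model

    def analyze(self, X, y, feature_names):
        self.model.fit(X, y)
        if hasattr(self.model, 'feature_importances_'):
            importances = self.model.feature_importances_
        elif hasattr(self.model, 'coef_'):
            # coef_ may be 1-D (regressors) or 2-D (one row per class);
            # average the absolute weights over rows to cover both cases.
            importances = np.abs(np.atleast_2d(self.model.coef_)).mean(axis=0)
        else:
            return None
        indices = np.argsort(importances)[::-1]
        return [(feature_names[i], importances[i]) for i in indices]


class DimensionalityReducer:
    """Thin wrapper around PCA, t-SNE, or UMAP."""

    def __init__(self, method='pca', n_components=2):
        self.method = method
        self.n_components = n_components
        # Imports are local so optional dependencies are only needed when used.
        if method == 'pca':
            from sklearn.decomposition import PCA
            self.reducer = PCA(n_components=n_components)
        elif method == 'tsne':
            from sklearn.manifold import TSNE
            self.reducer = TSNE(n_components=n_components)
        elif method == 'umap':
            import umap  # provided by the umap-learn package
            self.reducer = umap.UMAP(n_components=n_components)
        else:
            raise ValueError(f"Unknown reduction method: {method!r}")

    def fit_transform(self, X):
        return self.reducer.fit_transform(X)

    def transform(self, X):
        # scikit-learn's TSNE has no transform(); it cannot project unseen data.
        if not hasattr(self.reducer, 'transform'):
            raise NotImplementedError(
                f"{self.method} cannot transform new data; use fit_transform instead"
            )
        return self.reducer.transform(X)
```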

2.3 Feature Validation

```python
import pickle

import pandas as pd


class FeatureValidator:
    """Basic data-quality checks on a feature DataFrame."""

    def check_missing_values(self, df):
        # Missing-value count per column, only for columns that have any.
        missing = df.isnull().sum()
        return missing[missing > 0]

    def check_cardinality(self, df, threshold=100):
        # Columns with more than `threshold` distinct values.
        return [col for col in df.columns if df[col].nunique() > threshold]

    def check_feature_correlation(self, df, threshold=0.8):
        # numeric_only=True (pandas >= 1.5) keeps string columns from raising in corr().
        corr_matrix = df.corr(numeric_only=True).abs()
        cols = corr_matrix.columns
        high_corr = []
        for i in range(len(cols)):
            for j in range(i):  # lower triangle: each pair is visited once
                if corr_matrix.iloc[i, j] > threshold:
                    high_corr.append((cols[i], cols[j], corr_matrix.iloc[i, j]))
        return high_corr


class FeatureDriftDetector:
    """Flags numeric columns whose mean moved by more than `threshold` (relative change)."""

    def detect_drift(self, reference_data, current_data, threshold=0.05, eps=1e-12):
        drift_scores = []
        for col in reference_data.columns:
            if col not in current_data.columns:
                continue
            if not pd.api.types.is_numeric_dtype(reference_data[col]):
                continue
            ref_mean = reference_data[col].mean()
            curr_mean = current_data[col].mean()
            # Divide by |ref_mean| so negative means work, and guard against zero.
            diff = abs(ref_mean - curr_mean) / max(abs(ref_mean), eps)
            if diff > threshold:
                drift_scores.append((col, diff))
        return drift_scores


class FeatureStore:
    """In-memory feature registry with pickle persistence."""

    def __init__(self):
        self.features = {}

    def add_feature(self, name, feature):
        self.features[name] = feature

    def get_feature(self, name):
        return self.features.get(name)

    def save(self, path):
        with open(path, 'wb') as f:
            pickle.dump(self.features, f)

    @classmethod
    def load(cls, path):
        # pickle can execute arbitrary code on load; only open files you trust.
        with open(path, 'rb') as f:
            features = pickle.load(f)
        store = cls()
        store.features = features
        return store
```
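
The `FeatureStore` above persists features with pickle, which should only be loaded from trusted files. When the stored features are plain lists of numbers or strings, JSON is one possible alternative. The `JsonFeatureStore` below is a hypothetical variant sketched under that assumption; it will not handle NumPy arrays or fitted transformers without extra conversion.

```python
import json
import os
import tempfile


class JsonFeatureStore:
    """Hypothetical feature store that persists plain lists as JSON instead of pickle."""

    def __init__(self):
        self.features = {}

    def add_feature(self, name, values):
        self.features[name] = list(values)  # keep a JSON-serializable copy

    def get_feature(self, name):
        return self.features.get(name)

    def save(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.features, f)

    @classmethod
    def load(cls, path):
        store = cls()
        with open(path, "r", encoding="utf-8") as f:
            store.features = json.load(f)
        return store


store = JsonFeatureStore()
store.add_feature("age_scaled", [-1.2, 0.0, 1.2])

# Round-trip through a temporary file to show save() and load() together.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.json")
    store.save(path)
    restored = JsonFeatureStore.load(path)
```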

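3. Performance Comparison

3.1 Feature Selection Methods

| Method | Speed | Effectiveness | Best suited for |
| --- | --- | --- | --- |
| Filter | | | High-dimensional data |
| RFE | Slower | Best | Medium-dimensional data |
| Embedded | | | General purpose |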

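3.2 Dimensionality Reduction Methods

| Method | Information retained | Speed | Visualization quality |
| --- | --- | --- | --- |
| PCA | | | |
| t-SNE | | | |
| UMAP | | | Best |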

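3.3 Feature Encoding Methods

| Method | Dimension growth | Speed | Best suited for |
| --- | --- | --- | --- |
| One-Hot | | | Low cardinality |
| Label | | | Ordered categories |
| Embedding | Controllable | | High cardinality |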
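
The "dimension growth" column is easiest to see by encoding the same column three ways. The sketch below uses plain Python. A learned embedding needs a trained model, so the hashing trick stands in for it here as a different technique that shares the relevant property: the output width is chosen by you rather than by the number of categories. All function names and the sample data are made up for illustration.

```python
import hashlib


def label_encode(values):
    """One integer per category (sorted order); the output stays a single column."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """One column per distinct category; width grows with cardinality."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return rows, categories


def hash_encode(values, n_buckets=8):
    """Hashing trick: fixed width regardless of cardinality, at the cost of collisions."""
    rows = []
    for v in values:
        digest = hashlib.md5(v.encode("utf-8")).hexdigest()
        row = [0] * n_buckets
        row[int(digest, 16) % n_buckets] = 1
        rows.append(row)
    return rows


cities = ["beijing", "shanghai", "beijing", "shenzhen"]
labels, mapping = label_encode(cities)
one_hot, categories = one_hot_encode(cities)
hashed = hash_encode(cities, n_buckets=8)
```

Label encoding implies an order between the integers, which is why it fits ordered categories; one-hot avoids that implied order but adds a column per category.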

4. Best Practices

4.1 Feature Engineering Pipeline

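```python
import numpy as np


def build_feature_pipeline(config):
    """Build the extractor list from a config dict (extractor classes come from section 2.1)."""
    extractors = []
    if config.get('numerical', True):
        extractors.append(NumericalFeatureExtractor())
    if config.get('categorical', True):
        # With default column selection this also picks up the text column;
        # pass explicit columns to keep the two apart.
        extractors.append(CategoricalFeatureExtractor())
    if config.get('text', False):
        extractors.append(TextFeatureExtractor('text'))
    return extractors


class FeatureEngineeringPipeline:
    """Runs extraction, then optional selection and dimensionality reduction."""

    def __init__(self, extractors, selector=None, reducer=None, target_col='target'):
        self.extractors = extractors
        self.selector = selector
        self.reducer = reducer
        self.target_col = target_col

    def fit_transform(self, data):
        # Split off the label first so it cannot leak into the numeric features.
        y = data[self.target_col] if self.target_col in data.columns else None
        inputs = data.drop(columns=[self.target_col], errors='ignore')
        X = np.hstack([e.fit_transform(inputs) for e in self.extractors])
        if self.selector:
            if y is None:
                raise ValueError(f"Selector needs a '{self.target_col}' column")
            self.selector.fit(X, y)
            X = self.selector.transform(X)
        if self.reducer:
            X = self.reducer.fit_transform(X)
        return X

    def transform(self, data):
        # Reuse everything learned in fit_transform; nothing is re-fitted here.
        inputs = data.drop(columns=[self.target_col], errors='ignore')
        X = np.hstack([e.transform(inputs) for e in self.extractors])
        if self.selector:
            X = self.selector.transform(X)
        if self.reducer:
            X = self.reducer.transform(X)
        return X
```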

4.2 Feature Validation Pipeline

```python
class FeatureValidationPipeline:
    """Runs all quality checks from section 2.3 and collects the issues found."""

    def __init__(self):
        self.validator = FeatureValidator()
        self.drift_detector = FeatureDriftDetector()

    def validate(self, df):
        issues = {}

        missing = self.validator.check_missing_values(df)
        if len(missing) > 0:
            issues['missing_values'] = missing.to_dict()

        high_card = self.validator.check_cardinality(df)
        if len(high_card) > 0:
            issues['high_cardinality'] = high_card

        high_corr = self.validator.check_feature_correlation(df)
        if len(high_corr) > 0:
            issues['high_correlation'] = high_corr

        return issues

    def detect_drift(self, reference, current):
        return self.drift_detector.detect_drift(reference, current)
```
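
The drift check used here compares column means, so it can miss a change in the shape of a distribution when the mean stays put. The Population Stability Index (PSI) is a commonly used alternative that compares binned distributions. The sketch below is a minimal plain-Python version under stated assumptions: equal-width bins taken from the reference sample, out-of-range values clamped to the edge bins, and a small `eps` to avoid taking the log of zero. The function name `psi` and the sample data are illustrative.

```python
import math


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width and derived from the reference sample.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # constant reference: fall back to one wide bin

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp to the edge bins
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


reference = [i / 100 for i in range(100)]      # spread evenly over [0, 1)
same = list(reference)
shifted = [0.5 + i / 200 for i in range(100)]  # concentrated in the upper half

psi_same = psi(reference, same)
psi_shifted = psi(reference, shifted)
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cut-offs are conventions rather than guarantees and are worth tuning per feature.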

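5. Summary

Feature engineering is key to the success of a machine learning project:

  1. Feature extraction: derive valuable features from raw data
  2. Feature selection: keep the most informative features
  3. Feature validation: make sure feature quality holds up
  4. Feature storage: manage and reuse features

Takeaways from the comparisons:

  • UMAP gives the best results for dimensionality-reduction visualization
  • RFE gives the best feature selection results but is slower
  • One-Hot encoding suits low-cardinality categorical features
  • A feature store is recommended for managing features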