
Processing SEED-VIG EEG Data with Python: A Complete Workflow from PERCLOS Labels to EEG Feature Extraction

news 2026/3/26 6:39:16

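In neural engineering and driving-safety research, the SEED-VIG dataset is widely used for its high-quality multimodal physiological recordings. It contains EEG, EOG, and eye-tracking data, which makes it a valuable resource for developing fatigue-detection algorithms. This article walks through the full workflow from loading the data to feature engineering, focusing on three practical questions: How do you process EEG data stored as .npy files efficiently in Python? How do you align PERCLOS labels precisely with EEG features? Which feature-extraction methods are most likely to improve model performance?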

1. Environment setup and data loading

Start by setting up a Python environment with the following core packages:

    pip install numpy scipy matplotlib mne pandas scikit-learn imbalanced-learn

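After unpacking the dataset you will typically see these key files:

EEG_Feature_5Bands.npy: PSD/DE features for 5 frequency bands
PERCLOS_labels.npy: continuous fatigue labels
channel_names.txt: names of the 62 electrode channels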

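Watch memory usage when loading the data with NumPy. For large .npy files, memory-mapped mode is recommended, because the array is then read from disk on demand instead of being loaded in full: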

    import numpy as np

    # Memory-map the large feature file so it is read from disk on demand
    eeg_data = np.load('EEG_Feature_5Bands.npy', mmap_mode='r')
    labels = np.load('PERCLOS_labels.npy')

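Data dimensions:

File	Shape	Description
EEG_5Bands	62×885×5	62 channels × 885 samples × 5 bands
PERCLOS	885	Fatigue score at each time point

Note: The dimension order may differ between versions of the dataset, so check eeg_data.shape before indexing.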
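
One way to make that shape check explicit is to locate the axis whose length matches the label vector and move it to a fixed position. This is a minimal sketch: the helper name `move_sample_axis` is hypothetical, and the arrays are synthetic stand-ins for the real files.

```python
import numpy as np

def move_sample_axis(eeg, labels):
    """Return a view of eeg whose axis 1 has the same length as labels."""
    n_samples = labels.shape[0]
    # Find every axis whose length equals the number of labels
    matches = [ax for ax, size in enumerate(eeg.shape) if size == n_samples]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one axis of length {n_samples}, got shape {eeg.shape}"
        )
    return np.moveaxis(eeg, matches[0], 1)

# Synthetic stand-ins: a samples-first array and a matching label vector
demo_eeg = np.zeros((885, 62, 5))
demo_labels = np.zeros(885)
aligned = move_sample_axis(demo_eeg, demo_labels)
print(aligned.shape)
```

The helper only fixes the position of the sample axis. Which of the remaining axes holds channels and which holds bands still has to be confirmed against the dataset documentation.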

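2. Data visualization and quality checks

Inspect the data for obvious problems before doing any feature engineering. MNE offers a convenient visualization pipeline: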

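    import mne

    # Assumes channel_names.txt holds one channel name per line
    channel_names = np.loadtxt('channel_names.txt', dtype=str).tolist()
    info = mne.create_info(ch_names=channel_names, sfreq=200, ch_types='eeg')
    # eeg_data[:, :, 0] is the delta-band feature series, not raw voltage
    raw = mne.io.RawArray(np.asarray(eeg_data[:, :, 0], dtype=float), info)
    raw.plot_psd(fmax=50, spatial_colors=True)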

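Common data problems and how to handle them:

Bad channels: consider interpolation when more than 20% of channels are noisy
Baseline drift: apply a 0.5 Hz high-pass filter
Transient artifacts: detect them with a moving-window standard deviation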

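    # Automatic artifact detection example (global z-score per channel)
    import numpy as np
    from scipy import stats

    def detect_artifacts(data, threshold=3):
        # data: (channels, samples); z-score each channel along the time axis
        z_scores = np.abs(stats.zscore(data, axis=1))
        # Flag a sample when any channel exceeds the threshold
        return np.any(z_scores > threshold, axis=0)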
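
The list above names a moving-window standard deviation for transient artifacts, while the snippet uses a global z-score. A minimal sketch of the moving-window variant follows. The function name, the window length, and the median/MAD threshold are illustrative assumptions, and the test signal is synthetic.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def moving_std_artifacts(signal, window=20, k=3.0):
    """Flag windows of a 1-D signal whose standard deviation is unusually large.

    Returns a boolean array of length len(signal) - window + 1,
    one entry per window start index.
    """
    # Standard deviation of every length-`window` slice of the signal
    stds = sliding_window_view(signal, window).std(axis=-1)
    # Robust threshold: median plus k scaled median absolute deviations
    median = np.median(stds)
    mad = np.median(np.abs(stds - median))
    return stds > median + k * 1.4826 * mad

# Synthetic example: a sine wave with a short burst added at samples 200..204
demo_signal = np.sin(0.3 * np.arange(500))
demo_signal[200:205] += 10.0
flags = moving_std_artifacts(demo_signal)
print(np.flatnonzero(flags))
```

A robust threshold is used here so that the artifact windows themselves do not inflate the baseline they are compared against.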

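3. Feature engineering in depth

SEED-VIG already provides PSD and DE features, but modeling often calls for a custom feature set. Three more advanced feature-extraction approaches follow: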
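
For readers who want to recompute a DE-style feature themselves: under the common assumption that a band-limited EEG segment is approximately Gaussian, differential entropy reduces to 0.5 · log(2πe·σ²), where σ² is the variance of the band-filtered signal. The sketch below applies that formula to a synthetic signal. The helper name `band_de`, the filter order, and the band edges are assumptions for illustration and may differ from how the dataset's own features were computed.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_de(signal, fs, low, high, order=4):
    """Differential entropy of a band-passed 1-D signal under a Gaussian assumption."""
    # Zero-phase Butterworth band-pass filter
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, signal)
    # DE of a Gaussian with variance sigma^2 is 0.5 * log(2 * pi * e * sigma^2)
    return 0.5 * np.log(2 * np.pi * np.e * np.var(filtered))

# Synthetic 10 Hz oscillation sampled at 200 Hz for 10 seconds
fs = 200
t = np.arange(0, 10, 1 / fs)
x = np.sin(2 * np.pi * 10 * t)

de_alpha = band_de(x, fs, 8, 13)   # band containing the oscillation
de_beta = band_de(x, fs, 14, 30)   # band without it
print(de_alpha, de_beta)
```

Because the oscillation sits inside the 8 to 13 Hz band, that band's DE comes out higher than the neighbouring band's.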

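3.1 Cross-band coupling features

Computing functional connectivity between frequency bands can reveal changes in brain networks under fatigue: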

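    import numpy as np
    from scipy.signal import coherence

    def calc_band_connectivity(data, band_pairs):
        # data: (channels, samples, bands); band_pairs: list of (band, band) index pairs
        conn_matrix = np.zeros((len(band_pairs), data.shape[0]))
        for i, (b1, b2) in enumerate(band_pairs):
            for ch in range(data.shape[0]):
                # Magnitude-squared coherence between two band series of one channel
                f, Cxy = coherence(data[ch, :, b1], data[ch, :, b2])
                conn_matrix[i, ch] = np.mean(Cxy)
        return conn_matrix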

3.2 Time-varying feature extraction

Use a sliding window to capture how features change over time:

    import numpy as np
    from scipy import stats

    def sliding_window_features(data, window_size=30, step=5):
        # data: (channels, samples); returns (n_windows, 3, channels)
        n_windows = (data.shape[1] - window_size) // step + 1
        features = []
        for i in range(n_windows):
            window = data[:, i*step : i*step + window_size]
            features.append([
                np.mean(window, axis=1),
                np.std(window, axis=1),
                stats.skew(window, axis=1),
            ])
        return np.stack(features)
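
For long recordings the Python loop can be replaced by a strided view. The following is a sketch of that alternative, not the article's method: `window_stats` is a hypothetical name, it computes only the mean and standard deviation, and its output layout differs from the function above (statistics first, windows last).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def window_stats(data, window_size=30, step=5):
    """Per-window mean and std for data of shape (channels, samples).

    Returns an array of shape (2, channels, n_windows).
    """
    # All length-`window_size` windows along the sample axis, then keep every `step`-th
    windows = sliding_window_view(data, window_size, axis=1)[:, ::step]
    return np.stack([windows.mean(axis=-1), windows.std(axis=-1)], axis=0)

# Synthetic data: 2 channels x 885 samples
demo = np.arange(2 * 885, dtype=float).reshape(2, 885)
out = window_stats(demo)
print(out.shape)
```

With 885 samples, a window of 30, and a step of 5, the window count is (885 - 30) // 5 + 1 = 172, the same as in the loop version.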

3.3 Multimodal feature fusion

Associate the EEG features with the PERCLOS labels over time:

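    def create_fusion_features(eeg, labels, window_size=30, step=5):
        # eeg: (channels, samples); labels: (samples,)
        eeg_features = sliding_window_features(eeg, window_size, step)
        n_windows, n_stats = eeg_features.shape[:2]
        # Mean PERCLOS over the same samples that each EEG window covers
        label_mean = np.array([labels[i*step : i*step + window_size].mean()
                               for i in range(n_windows)])
        # Append the label value as one extra column: (n_windows, 3, channels + 1)
        label_col = np.broadcast_to(label_mean[:, None, None], (n_windows, n_stats, 1))
        # Caution: this column leaks the target if PERCLOS is also what you predict
        return np.concatenate([eeg_features, label_col], axis=2)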

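4. Key preprocessing before modeling

Once the feature matrix is built, the following steps directly affect model performance:

Channel selection: pick key brain regions based on prior knowledge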

    frontal_channels = ['Fp1', 'Fp2', 'F7', 'F8']
    # Boolean mask over the channel axis (axis 0)
    channel_mask = np.array([name in frontal_channels for name in channel_names])
    selected_data = eeg_data[channel_mask]

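Comparison of normalization strategies:

Method	When to use	Implementation
Z-score	Features are approximately Gaussian	`sklearn.preprocessing.StandardScaler`
Robust	Outliers are present	`sklearn.preprocessing.RobustScaler`
MinMax	A fixed output range is required	`sklearn.preprocessing.MinMaxScaler`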
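
To make the differences in the table concrete, the sketch below fits each scaler on a tiny training column that contains one outlier and then transforms a held-out value. The helper `scale_with` and the toy numbers are illustrative assumptions. Fitting on the training rows only is the usual way to keep test-set statistics out of preprocessing.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

def scale_with(scaler, X_train, X_test):
    """Fit the scaler on training rows only, then transform both splits."""
    scaler.fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test)

X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100.0 is an outlier
X_test = np.array([[2.5]])

results = {
    type(scaler).__name__: scale_with(scaler, X_train, X_test)
    for scaler in (StandardScaler(), RobustScaler(), MinMaxScaler())
}
for name, (train_scaled, test_scaled) in results.items():
    print(name, train_scaled.ravel().round(2), test_scaled.ravel().round(2))
```

RobustScaler centers on the median and scales by the interquartile range, so the outlier barely changes how the ordinary values are scaled, whereas StandardScaler and MinMaxScaler compress them toward each other.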

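Balancing classes:

    from imblearn.over_sampling import SMOTE

    # One flattened row per window; labels must hold one value per row of X
    X = features.reshape(len(features), -1)
    y = (labels > 0.5).astype(int)  # binarize PERCLOS at 0.5
    # Resample the training split only, so synthetic samples never reach the test set
    X_resampled, y_resampled = SMOTE().fit_resample(X, y)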

5. In practice: building a fatigue-detection pipeline

Combine the steps above into an end-to-end workflow:

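    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import TimeSeriesSplit, cross_val_score

    # ChannelSelector and BandConnectivityExtractor are custom transformers
    # (not part of scikit-learn) that must be defined before this runs
    pipeline = Pipeline([
        ('channel_selector', ChannelSelector(frontal_channels)),
        ('feature_extractor', BandConnectivityExtractor()),
        ('scaler', StandardScaler()),
        ('classifier', GradientBoostingClassifier(n_estimators=100)),
    ])

    # Time-series cross-validation: each test fold comes after its training data
    tscv = TimeSeriesSplit(n_splits=5)
    scores = cross_val_score(pipeline, X, y, cv=tscv, scoring='f1')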
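
The pipeline relies on a ChannelSelector transformer that the article does not define. One possible shape for it is sketched below, assuming the input is laid out as (samples, channels, features). The constructor signature here, which takes the full channel list as a second argument, is an assumption and differs from the one-argument call shown in the pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ChannelSelector(BaseEstimator, TransformerMixin):
    """Keep a subset of channels and flatten each sample to one feature row.

    Hypothetical sketch: expects X with shape (n_samples, n_channels, n_features).
    """

    def __init__(self, keep, channel_names):
        self.keep = keep
        self.channel_names = channel_names

    def fit(self, X, y=None):
        # Resolve channel names to positions along the channel axis
        self.indices_ = [
            i for i, name in enumerate(self.channel_names) if name in self.keep
        ]
        if not self.indices_:
            raise ValueError("none of the requested channels were found")
        return self

    def transform(self, X):
        X = np.asarray(X)
        # Select channels, then flatten (channels, features) into one row per sample
        return X[:, self.indices_, :].reshape(X.shape[0], -1)

# Toy input: 2 samples, 3 channels, 2 features per channel
demo_X = np.arange(12).reshape(2, 3, 2)
selector = ChannelSelector(keep=['Fp1', 'Fp2'], channel_names=['Fp1', 'Cz', 'Fp2'])
selected = selector.fit(demo_X).transform(demo_X)
print(selected)
```

Raising an error when no channel matches is a deliberate choice: a montage without the requested electrodes then fails loudly instead of producing an empty feature matrix.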

Typical performance optimization path: