
Stop Drawing Plain Scatter Plots! Build a PCA Biplot in 5 Minutes with Python's sklearn and matplotlib (Confidence Ellipses Included)

news 2026/6/24 8:18:00

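Publication-Quality PCA Visualization in Python: A Complete Guide from Biplots to Confidence Ellipses

Once your PCA is done, have you ever struggled to present the results elegantly to an advisor or a client? Why do the PCA biplots in top journals look more professional and carry more information than yours? This article moves past the plain scatter plot and builds, in Python, a PCA figure suitable for an academic paper.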

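1. Why do your PCA figures never look professional enough?

Most data analysts stop at a basic scatter plot after running PCA. It shows how the samples are distributed, but it leaves out key information:

Unclear variable contributions: you cannot see which original variables drive each principal component
No statistical context: without confidence regions it is hard to judge how stable the grouping is
Low information density: several dimensions of information are squeezed into a bare 2D plot

Take a PCA figure from a paper in Nature Communications as an example. What makes it look professional is that it:

Shows sample points and variable loadings together
Marks the spread of each class with confidence ellipses
Labels each principal component with its explained variance ratio
Uses the visual style that academic journals prefer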

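# Elements of a typical PCA figure in an academic paper
academic_pca_figure = {
    "biplot": True,              # samples and variables in one plot
    "confidence_ellipse": True,  # usually a 95% region
    "explained_variance": True,  # axis labels carry the percentage
    "academic_style": True,      # Times New Roman, sensible whitespace
}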

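2. A basic biplot in five minutes: what does it add to a scatter plot?

A biplot is the most complete way to visualize PCA. A single figure shows:

Sample points: how the samples are distributed after dimensionality reduction
Variable vectors: the direction and importance of each original variable in principal-component space

2.1 A basic biplot

A basic implementation with sklearn and matplotlib: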

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Prepare the data
iris = load_iris()
X = iris.data
y = iris.target
features = iris.feature_names

# Standardize, then fit PCA
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
pca = PCA()
X_pca = pca.fit_transform(X_std)

# Plotting function
def basic_biplot(X_pca, y, pca, feature_names):
    plt.figure(figsize=(8, 6), dpi=100)

    # Sample points, colored by class
    scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, alpha=0.8)

    # Variable vectors: one arrow per original feature
    for comp, name in zip(pca.components_.T, feature_names):
        plt.arrow(0, 0, comp[0]*0.8, comp[1]*0.8, color='r', head_width=0.03)
        plt.text(comp[0]*0.85, comp[1]*0.85, name, color='darkred', fontsize=10)

    # Formatting: axis labels carry the explained variance
    plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)")
    plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)")
    plt.grid(linestyle='--', alpha=0.6)
    plt.legend(*scatter.legend_elements(), title="Classes")
    plt.show()

basic_biplot(X_pca, y, pca, features)

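2.2 How to read a biplot

Vector length: how strongly the variable contributes to the two plotted components
Vector angle: approximates the correlation between variables (small angle = positive correlation, near 90° = little correlation, near 180° = negative correlation)
Sample position: its projection onto a variable vector approximates the sample's value for that variable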
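The angle rule can be checked numerically. The sketch below is an illustration rather than part of the article's code: it assumes the variable vectors are scaled by the square root of each component's variance (the name `loadings` is introduced here; the arrows in the basic biplot use the unscaled `pca.components_`). It compares the cosine of the angle between variable vectors with the actual correlations. With all components the two agree; with only PC1 and PC2 the cosine is an approximation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
pca = PCA().fit(X_std)

# Scaled loadings: one row per original variable, one column per component
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

def cosine_matrix(vectors):
    """Cosine of the angle between every pair of row vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

corr = np.corrcoef(X_std, rowvar=False)   # actual correlations
cos_full = cosine_matrix(loadings)        # using all components
cos_2d = cosine_matrix(loadings[:, :2])   # using only what the biplot shows

# Petal length vs petal width: correlation and its 2D approximation
print(np.round(corr[2, 3], 3), np.round(cos_2d[2, 3], 3))
```

The closer the first two components come to explaining all of the variance, the better the 2D cosine tracks the correlation.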

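Tip: variable vectors usually need rescaling to match the spread of the sample points; otherwise they can crowd into the center of the plot. The example above uses a fixed factor of 0.8, a cosmetic choice that should be tuned per dataset.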
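Instead of hand-picking the factor, it can be derived from the data. A minimal sketch, assuming the goal is for the longest arrow to reach a fixed fraction of the distance to the farthest sample point; the helper `arrow_scale` and its `fraction` parameter are hypothetical names, not part of sklearn or matplotlib.

```python
import numpy as np

def arrow_scale(scores, vectors, fraction=0.8):
    """Factor that stretches the longest 2D vector to `fraction` of the
    distance between the origin and the farthest sample point."""
    max_score = np.linalg.norm(scores[:, :2], axis=1).max()
    max_vector = np.linalg.norm(vectors[:, :2], axis=1).max()
    return fraction * max_score / max_vector

# Toy data: scores spread over a few units, vectors shorter than 1
scores = np.array([[3.0, 0.0], [0.0, -4.0], [1.0, 1.0]])
vectors = np.array([[0.6, 0.0], [0.0, 0.5]])

scale = arrow_scale(scores, vectors)
print(scale)
```

In the biplot, the same call would take `X_pca` and `pca.components_.T`, and each arrow would be multiplied by the returned factor.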

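3. Adding confidence ellipses: the key to judging group stability

Confidence ellipses are a hallmark of publication-quality PCA figures. They show at a glance:

The region each class occupies in the reduced space
How much the classes overlap (how hard they are to separate)
The statistical level behind each region (usually 95%)

3.1 How a confidence ellipse is built

A confidence ellipse assumes the points follow a multivariate normal distribution. Its parameters are:

Center: the mean of the sample points
Orientation: set by the eigenvectors of the covariance matrix
Semi-axes: proportional to the square roots of the eigenvalues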

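from matplotlib.patches import Ellipse

def confidence_ellipse(x, y, ax, n_std=2, **kwargs):
    # 2x2 covariance of the points; eigh is the solver for symmetric matrices
    cov = np.cov(x, y)
    lambda_, v = np.linalg.eigh(cov)
    lambda_ = np.sqrt(lambda_)  # standard deviations along the ellipse axes

    # n_std scales the axes; in 2D, 2 covers about 86%, about 2.45 covers 95%
    ellipse = Ellipse(xy=(np.mean(x), np.mean(y)),
                      width=lambda_[0]*n_std*2,
                      height=lambda_[1]*n_std*2,
                      angle=np.degrees(np.arctan2(v[1, 0], v[0, 0])),
                      **kwargs)
    return ax.add_patch(ellipse)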
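Under the bivariate normal assumption, the share of points inside an ellipse of `n_std` standard deviations is 1 - exp(-n_std²/2). That relation converts between `n_std` and a coverage level. The sketch below applies only to the 2D case, and `coverage_of` and `n_std_for` are names introduced here for illustration.

```python
import math

def coverage_of(n_std):
    """Share of a bivariate normal inside the n_std ellipse."""
    return 1.0 - math.exp(-n_std**2 / 2.0)

def n_std_for(coverage):
    """Ellipse scale needed to enclose the given share of a bivariate normal."""
    return math.sqrt(-2.0 * math.log(1.0 - coverage))

print(round(coverage_of(2.0), 4))   # 0.8647
print(round(n_std_for(0.95), 4))    # 2.4477
```

Passing `n_std=n_std_for(0.95)` to `confidence_ellipse` would therefore draw a 95% region instead of the roughly 86% region that the default produces.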

3.2 A biplot with confidence ellipses

Combine the function above with the biplot:

from matplotlib.colors import ListedColormap

def enhanced_biplot(X_pca, y, pca, feature_names, n_std=2, ax=None):
    # Draw into an existing Axes if one is given, otherwise create a figure
    if ax is None:
        fig, ax = plt.subplots(figsize=(9, 7), dpi=120)
    else:
        fig = ax.figure
    colors = ['#1f77b4', '#ff7f0e', '#2ca02c']

    # Confidence ellipse for each class
    for i in np.unique(y):
        x_data = X_pca[y == i, 0]
        y_data = X_pca[y == i, 1]
        confidence_ellipse(x_data, y_data, ax, n_std=n_std,
                           alpha=0.2, color=colors[i])

    # Sample points, using the same colors as the ellipses
    scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], c=y,
                         cmap=ListedColormap(colors), edgecolor='k', lw=0.5)

    # Variable vectors
    for comp, name in zip(pca.components_.T, feature_names):
        ax.arrow(0, 0, comp[0]*0.7, comp[1]*0.7,
                 color='red', head_width=0.03, alpha=0.8)
        ax.text(comp[0]*0.75, comp[1]*0.75, name, color='darkred',
                fontsize=11, fontfamily='Times New Roman')

    # Academic styling
    ax.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)",
                  fontfamily='Times New Roman', size=12)
    ax.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)",
                  fontfamily='Times New Roman', size=12)
    ax.grid(linestyle='--', alpha=0.4)
    ax.legend(*scatter.legend_elements(),
              prop={'family': 'Times New Roman', 'size': 10})

    fig.tight_layout()
    return fig

fig = enhanced_biplot(X_pca, y, pca, features)
fig.savefig('professional_pca.png', dpi=300, bbox_inches='tight')
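The overlap that the ellipses show visually can also be put into numbers. A point lies inside a class's `n_std` ellipse exactly when its Mahalanobis distance to that class is at most `n_std`. The sketch below uses this to tabulate, for the iris data, what share of each class falls inside each ellipse; `inside_ellipse` and `overlap` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X_std = (iris.data - iris.data.mean(axis=0)) / iris.data.std(axis=0)
scores = PCA(n_components=2).fit_transform(X_std)
y = iris.target

def inside_ellipse(points, group, n_std=2.0):
    """True where `points` fall inside the n_std ellipse fitted to `group`."""
    center = group.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(group, rowvar=False))
    diff = points - center
    # Squared Mahalanobis distance of every point to the group
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    return d2 <= n_std**2

classes = np.unique(y)
# overlap[a, b]: share of class b's points inside class a's ellipse
overlap = np.array([[inside_ellipse(scores[y == b], scores[y == a]).mean()
                     for b in classes] for a in classes])
print(np.round(overlap, 2))
```

The diagonal gives each ellipse's coverage of its own class, and the off-diagonal entries quantify how much two classes intrude on each other.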

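4. Advanced techniques: raising the information density

4.1 Adding variable contribution bars

Place a bar chart of each variable's contribution to a principal component next to the biplot: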

def add_contribution_bars(pca, feature_names, ax=None):
    if ax is None:
        fig, ax = plt.subplots(figsize=(5, 4))

    # Contribution of each variable: squared loadings, normalized per component
    contribution = pca.components_**2
    contribution = contribution / contribution.sum(axis=1, keepdims=True)

    # Horizontal bar chart for PC1
    y_pos = np.arange(len(feature_names))
    ax.barh(y_pos, contribution[0, :], color='steelblue', alpha=0.7)

    # Styling
    ax.set_yticks(y_pos)
    ax.set_yticklabels(feature_names, fontfamily='Times New Roman')
    ax.set_xlabel('Contribution to PC1', fontfamily='Times New Roman')
    ax.grid(axis='x', linestyle='--', alpha=0.6)
    return ax

# Usage example (enhanced_biplot must accept an `ax` argument for the left panel)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6),
                               gridspec_kw={'width_ratios': [2, 1]})
enhanced_biplot(X_pca, y, pca, features, ax=ax1)
add_contribution_bars(pca, features, ax=ax2)
plt.tight_layout()
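The bars above cover PC1 only, while the biplot shows PC1 and PC2 together. One common convention for a combined figure is to weight each component's contributions by that component's variance. The sketch below follows that convention on toy numbers; `plane_contribution` is a hypothetical helper, not an sklearn function.

```python
import numpy as np

def plane_contribution(components, explained_variance, axes=(0, 1)):
    """Variable contributions to several components at once,
    weighted by the variance of each component."""
    axes = list(axes)
    squared = components[axes] ** 2
    per_pc = squared / squared.sum(axis=1, keepdims=True)  # each row sums to 1
    weights = explained_variance[axes]
    return (per_pc * weights[:, None]).sum(axis=0) / weights.sum()

# Toy example: two orthonormal components over three variables
components = np.array([[1.0, 0.0, 0.0],
                       [0.0, 0.6, 0.8]])
explained_variance = np.array([3.0, 1.0])

contrib = plane_contribution(components, explained_variance)
print(np.round(contrib, 2))   # [0.75 0.09 0.16]
```

With the fitted model from earlier, the call would be `plane_contribution(pca.components_, pca.explained_variance_)`, and the result can be passed to `ax.barh` in place of `contribution[0, :]`.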

4.2 Interactive visualization

Create an interactive PCA figure with Plotly:

import plotly.express as px
import pandas as pd

# Build a DataFrame from the first two component scores
df = pd.DataFrame(X_pca[:, :2], columns=['PC1', 'PC2'])
df['Species'] = iris.target_names[y]
df['Feature_Vectors'] = [''] * len(df)

# Append the variable vectors as extra rows (scaled by 3 for visibility)
for i, feat in enumerate(features):
    df.loc[len(df)] = [pca.components_[0, i]*3, pca.components_[1, i]*3,
                       'Vector', feat]

# Create the figure: samples are circles, vector tips are triangles
symbol_map = {name: 'circle' for name in iris.target_names}
symbol_map['Vector'] = 'triangle-up'
fig = px.scatter(df, x='PC1', y='PC2', color='Species',
                 symbol='Species', symbol_map=symbol_map,
                 hover_name='Feature_Vectors', width=800, height=600)

# Add one ellipse per class. Plotly draws ellipses with shape type 'circle';
# these are axis-aligned with fixed half-widths, a rough placeholder rather
# than a true confidence ellipse
for i, name in enumerate(iris.target_names):
    group = df[df['Species'] == name]
    cx, cy = group['PC1'].mean(), group['PC2'].mean()
    fig.add_shape(type='circle',
                  x0=cx-0.5, y0=cy-0.3, x1=cx+0.5, y1=cy+0.3,
                  opacity=0.2, fillcolor=px.colors.qualitative.Plotly[i])

fig.update_layout(font_family="Times New Roman")
fig.show()
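A Plotly `circle` shape is defined by an axis-aligned bounding box, so it cannot reproduce the rotated ellipses from section 3. One workaround is to compute the ellipse boundary as coordinates and draw it as a line. The sketch below assumes the same covariance-based construction as `confidence_ellipse`; `ellipse_points` is a hypothetical helper and the sample data is synthetic.

```python
import numpy as np

def ellipse_points(x, y, n_std=2.0, n_points=100):
    """Boundary of the covariance ellipse of (x, y) as two coordinate arrays."""
    center = np.array([np.mean(x), np.mean(y)])
    eigvals, eigvecs = np.linalg.eigh(np.cov(x, y))
    t = np.linspace(0.0, 2.0 * np.pi, n_points)
    circle = np.stack([np.cos(t), np.sin(t)])   # unit circle, shape (2, n_points)
    # Stretch by n_std standard deviations per axis, rotate, then shift
    stretched = n_std * np.sqrt(eigvals)[:, None] * circle
    boundary = eigvecs @ stretched + center[:, None]
    return boundary[0], boundary[1]

# Synthetic correlated data for illustration
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)

ex, ey = ellipse_points(x, y)
print(ex.shape, ey.shape)
```

The two arrays could then be added to the figure with `fig.add_trace(go.Scatter(x=ex, y=ey, mode='lines', fill='toself'))`, where `go` is `plotly.graph_objects`, once per class.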

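5. Checklist for publication figures

Finally, here is the checklist I use when preparing figures for a paper:

Consistent fonts: use Times New Roman or Arial throughout the figure
Sufficient resolution: at least 300 dpi, saved as PDF or PNG
Axis labels: include the explained variance percentage
Clear legend: explain every symbol and color
Correct proportions: avoid distorting the plot (set an equal aspect ratio)
Sensible whitespace: avoid clipping at the edges, use tight_layout()
Accessible colors: keep color-blind readers in mind, use a ColorBrewer palette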
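Several checklist items can be set once through matplotlib's rcParams instead of on every call. The sketch below is one possible configuration, not a journal requirement: the dictionary name `publication_style`, the font size, and the serif fallback are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: running without a display
import matplotlib.pyplot as plt

publication_style = {
    "font.family": "serif",
    "font.serif": ["Times New Roman", "DejaVu Serif"],  # fallback if Times is missing
    "font.size": 11,
    "savefig.dpi": 300,
    "savefig.bbox": "tight",
}

# rc_context applies the settings only inside the with-block
with plt.rc_context(publication_style):
    fig, ax = plt.subplots(figsize=(6, 5))
    ax.plot([0, 1], [0, 1])
    ax.set_aspect("equal")   # one unit on PC1 is drawn as long as one unit on PC2
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    fig.tight_layout()
    dpi_in_context = plt.rcParams["savefig.dpi"]

print(dpi_in_context)
```

Using `plt.rc_context` keeps the style local to the figures that need it; `plt.rcParams.update(publication_style)` would apply it to the whole session instead.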

# Saving figures for publication
def save_for_publication(fig, filename):
    # Raster copy at 300 dpi
    fig.savefig(
        f"{filename}.png",
        dpi=300,
        bbox_inches='tight',
        facecolor='white'
    )
    # Vector copy
    fig.savefig(
        f"{filename}.pdf",
        format='pdf',
        bbox_inches='tight',
        facecolor='white'
    )


http://www.jsqmd.com/news/681363/