当前位置：首页 > news >正文

Python实战：手把手教你用cell2location分析空间单细胞转录组数据（附完整代码）

news 2026/7/7 14:58:50

Python实战：从零掌握cell2location空间单细胞转录组分析全流程

空间单细胞转录组技术正在彻底改变我们对组织微环境的认知。想象一下，你不仅能知道组织中存在哪些细胞类型，还能精确看到它们在组织中的空间分布——这正是cell2location这类工具的魅力所在。作为生物信息学领域的新锐工具，cell2location通过整合单细胞和空间转录组数据，实现了细胞类型在空间位置上的高分辨率定位。本文将带你从零开始，用Python完整实现这套分析流程。

1. 环境准备与数据加载

在开始分析之前，我们需要搭建一个稳定的Python环境。推荐使用conda创建独立环境，避免依赖冲突：

conda create -n cell2loc python=3.8 conda activate cell2loc pip install cell2location scanpy matplotlib numpy pandas

准备好环境后，让我们先了解下数据的基本结构。空间转录组数据通常以h5ad格式存储，包含基因表达矩阵和空间坐标信息。单细胞参考数据集则提供细胞类型的转录特征。以下是数据加载的标准操作：

import scanpy as sc import cell2location # 加载Visium空间数据 adata_vis = sc.read_h5ad("./data/visium_data.h5ad") # 加载单细胞参考数据集 adata_ref = sc.read_h5ad("./data/scRNA_seq.h5ad")

注意：确保两个数据集的基因命名一致。如果空间数据使用Ensembl ID而单细胞数据使用基因符号，需要进行ID转换。

2. 数据预处理关键步骤

高质量的数据预处理是分析成功的关键。以下是几个需要特别注意的环节：

2.1 基因过滤与质量控制

cell2location对输入数据的质量要求较高。我们需要先对单细胞参考数据集进行严格过滤：

from cell2location.utils.filtering import filter_genes # 过滤低表达基因 selected = filter_genes( adata_ref, cell_count_cutoff=5, cell_percentage_cutoff2=0.03, nonz_mean_cutoff=1.12 ) adata_ref = adata_ref[:, selected].copy()

过滤标准解释：

cell_count_cutoff: 基因至少在多少个细胞中表达
cell_percentage_cutoff2: 基因在细胞群中的表达比例阈值
nonz_mean_cutoff: 非零表达量的平均值阈值

2.2 线粒体基因处理

线粒体基因的高表达通常与细胞状态异常相关，可能干扰空间定位结果：

# 识别线粒体基因 adata_ref.var['mt'] = [gene.startswith('MT-') for gene in adata_ref.var['SYMBOL']] # 保留表达信息但排除在分析之外 adata_ref.obsm['mt'] = adata_ref[:, adata_ref.var['mt'].values].X.toarray() adata_ref = adata_ref[:, ~adata_ref.var['mt'].values]

3. 构建回归模型与训练

预处理完成后，我们可以开始构建cell2location的核心模型：

3.1 模型初始化

# 设置参考数据的模型参数 cell2location.models.RegressionModel.setup_anndata( adata=adata_ref, batch_key='Sample', labels_key='celltype' ) # 创建回归模型 from cell2location.models import RegressionModel mod = RegressionModel(adata_ref)

3.2 模型训练与评估

训练过程可能需要较长时间，取决于数据规模和硬件配置：

# 转换数据类型为整数 adata_vis.X = np.round(adata_vis.X).astype(int) adata_ref.X = np.round(adata_ref.X).astype(int) # 开始训练 mod.train( max_epochs=150, batch_size=1000, accelerator='cpu' # 使用GPU可加速 ) # 保存训练历史图 mod.plot_history(20) plt.savefig('training_history.pdf')

训练完成后，检查模型质量至关重要。重点关注两个指标：

评估指标	理想状态	问题表现
ELBO损失	平稳下降后收敛	剧烈波动或持续上升
重建精度	散点沿对角线分布	点分散远离对角线

4. 空间定位与结果可视化

模型训练完成后，就可以进行细胞类型的空间定位了：

4.1 运行cell2location模型

# 准备空间数据 cell2location.models.Cell2location.setup_anndata( adata=adata_vis, batch_key="slice" ) # 训练定位模型 mod.train( max_epochs=30000, batch_size=None, train_size=1, accelerator='cpu' ) # 导出后验分布结果 adata_vis = mod.export_posterior( adata_vis, sample_kwargs={ 'num_samples': 1000, 'batch_size': mod.adata.n_obs, 'accelerator': 'cpu' } )

4.2 结果可视化

cell2location提供了多种可视化方式展示细胞类型的空间分布：

# 选择特定切片 from cell2location.utils import select_slide slide = select_slide(adata_vis, 'slice_1', batch_key='slice') # 绘制空间分布图 with mpl.rc_context({'axes.facecolor':'black', 'figure.figsize':[4.5,5]}): sc.pl.spatial( slide, cmap='magma', color=['B_cells', 'T_cells', 'Macrophages'], ncols=3, size=1.3, img_key='hires', vmin=0, vmax='p99.2' ) plt.savefig('cell_abundance.png')

对于更复杂的可视化需求，可以使用cell2location的高级绘图函数：

from cell2location.plt import plot_spatial # 多细胞类型组合展示 clust_labels = ['B_cells', 'T_cells', 'Macrophages'] with mpl.rc_context({'figure.figsize':(15,15)}): fig = plot_spatial( adata=slide, color=clust_labels, labels=clust_labels, show_img=True, style='fast', max_color_quantile=0.992, circle_diameter=6 ) plt.savefig('combined_cell_plot.png')

5. 实战技巧与问题排查

在实际分析中，经常会遇到各种问题。以下是几个常见问题的解决方案：

5.1 性能优化技巧

内存不足：尝试减小batch_size或使用GPU加速
训练缓慢：减少max_epochs或使用更强大的计算资源
结果不稳定：增加num_samples提高后验估计的可靠性

5.2 常见报错处理

报错信息	可能原因	解决方案
"Gene names mismatch"	基因ID不一致	统一使用Ensembl ID或基因符号
"NaN values detected"	数据包含缺失值	检查并过滤低质量细胞/基因
"CUDA out of memory"	GPU显存不足	减小batch_size或使用CPU模式

5.3 参数调优指南

cell2location有几个关键参数会影响分析结果：

# 回归模型关键参数 mod = RegressionModel( adata_ref, # 增加这些值可以提高模型灵敏度 n_prior_g=1/50, mean_prior_s=1/5000 ) # 定位模型关键参数 mod = Cell2location( adata_vis, # 调整这些参数影响细胞丰度估计 N_cells_per_location=10, detection_alpha=200 )

6. 高级应用与结果解读

掌握了基础流程后，我们可以进一步探索cell2location的高级功能：

6.1 肿瘤微环境分析

通过比较肿瘤区域和正常组织的细胞组成差异，可以识别肿瘤特异的免疫细胞浸润模式：

# 提取肿瘤区域数据 tumor_region = adata_vis[adata_vis.obs['region'] == 'tumor'].copy() # 计算细胞类型比例 celltype_proportions = tumor_region.obsm['q05_cell_abundance_w_sf'].mean(axis=0) # 可视化 plt.figure(figsize=(10,6)) celltype_proportions.sort_values().plot(kind='barh') plt.title('Tumor Microenvironment Cell Composition') plt.xlabel('Relative Abundance')

6.2 时间序列分析

对于多个时间点的数据，可以追踪细胞类型分布的动态变化：

# 按时间点分组计算 time_points = adata_vis.obs['time_point'].unique() results = [] for t in time_points: subset = adata_vis[adata_vis.obs['time_point'] == t] means = subset.obsm['means_cell_abundance_w_sf'].mean(axis=0) results.append(means) # 创建变化趋势图 trend_df = pd.DataFrame(results, index=time_points) trend_df.plot(marker='o', figsize=(12,6)) plt.title('Cell Type Dynamics Over Time') plt.ylabel('Abundance')

6.3 结果生物学解读

最后，我们需要将计算结果转化为生物学见解。例如：

免疫细胞共定位：如果T细胞和B细胞在特定区域共现，可能提示三级淋巴结构的形成
基质细胞分布：成纤维细胞的空间分布模式可能反映组织纤维化程度
恶性细胞浸润：肿瘤边界处恶性细胞的扩散模式可以评估侵袭性

# 计算细胞类型共现 from sklearn.metrics import pairwise_distances co_occurrence = 1 - pairwise_distances( adata_vis.obsm['means_cell_abundance_w_sf'].T, metric='cosine' ) sns.clustermap(co_occurrence, cmap='viridis') plt.title('Cell Type Co-occurrence Pattern')

通过这套完整的分析流程，我们不仅获得了细胞类型的空间分布图谱，还能深入挖掘组织微环境的结构特征和细胞间相互作用规律。

查看全文

http://www.jsqmd.com/news/521301/