当前位置: 首页 > news >正文

利用公共数据控进行单细胞转录组学分析

基本分析框架: 

打开一个单细胞转录组资源(GSE条目),下载如下两项

image

 count matrix(表达矩阵)是单细胞转录组最核心的原始输入之一(.txt.gz文件直接可以读入R语言,不需要解压缩)

每一行 = 一个基因
每一列 = 一个细胞
每个数字 = 这个基因在这个细胞里被检测到的 UMI count / 表达计数

这类单细胞表达矩阵通常就是“基因 × 细胞”的 count matrix,里面的数值代表每个基因在每个细胞里的 reads 或 UMI 计数;UMI 比普通 read count 更能减少 PCR 扩增偏倚。

另一个 metadata 文件包含每个细胞的信息。

然后读入这两个文件,进行预处理:

setwd("路径")
# ===============================
library(data.table)
library(Seurat)
library(Matrix)
library(dplyr)
library(ggplot2)
library(patchwork)# ===============================
# 1. 读取 metadata
# ===============================
meta <- fread("GSE157474_Corneal_Limbus_cell_metadata.txt.gz",data.table = FALSE)head(meta)
colnames(meta)
table(meta$cluster)# 设置 rownames
rownames(meta) <- meta$cell# 这个数据里 count matrix 的细胞名是 .1,
# metadata 里是 -1,所以需要统一格式
meta$cell_count_name <- gsub("-1$", ".1", meta$cell)# ===============================
# 2. 只选 limbal epithelial cells
# ===============================
meta_epi <- meta %>%filter(cluster == "LEpC")length(meta_epi$cell_count_name)
head(meta_epi$cell_count_name)
# ===============================
# 3. 先读 count matrix 的表头
# ===============================
header <- names(fread("GSE157474_Corneal_Limbus_count_matrix.txt.gz",nrows = 0
))# 找到 LEpC 对应的列
cols_to_read <- c("gene", intersect(meta_epi$cell_count_name, header))length(cols_to_read)# ===============================
# 4. 只读取 gene + LEpC 细胞列
# ===============================
count_epi <- fread("GSE157474_Corneal_Limbus_count_matrix.txt.gz",select = cols_to_read,data.table = FALSE
)# 第一列是基因名
gene_names <- count_epi$gene
count_epi$gene <- NULLrownames(count_epi) <- gene_names# ===============================
# 5. 转为 matrix / sparse matrix
# ===============================
count_mat <- as.matrix(count_epi)
storage.mode(count_mat) <- "integer"count_sparse <- Matrix(count_mat, sparse = TRUE)# 删除大对象,释放内存
rm(count_epi, count_mat)
gc()# ===============================
# 6. 对齐 metadata
# ===============================
meta_epi2 <- meta_epi[match(colnames(count_sparse), meta_epi$cell_count_name), ]rownames(meta_epi2) <- meta_epi2$cell_count_name# 检查是否完全对齐
all(rownames(meta_epi2) == colnames(count_sparse))
# ===============================
# 7. 创建 Seurat 对象
# ===============================
epi <- CreateSeuratObject(
counts = count_sparse,
meta.data = meta_epi2,
project = "GSE157474_LEpC"
)epi

然后进行标准Seurat 流程

# 8. 标准 Seurat 流程
# ===============================
epi <- NormalizeData(epi)
epi <- FindVariableFeatures(epi, selection.method = "vst", nfeatures = 2000)
epi <- ScaleData(epi)
epi <- RunPCA(epi, npcs = 30)# 看 PCA elbow
ElbowPlot(epi, ndims = 30)

输出如下:

image

 这个 PCA elbow plot 是用来帮你决定:后面做聚类和 UMAP 时,应该用前多少个 PC。

横轴 PC = 第几个主成分
纵轴 Standard Deviation = 这个 PC 解释的数据变化量有多大
前面的 PC 解释的是真正主要的生物学差异,越往后越可能是细碎噪音。

以上面的图为例,大概是这样:

  • PC1 到 PC5:下降非常快
  • PC5 到 PC10:还在下降,但变缓
  • PC10 到 PC15:开始明显趋于平缓
  • PC15 后面:基本慢慢变平

所以这张图的 elbow 大概在:PC 10–15

因此后面这样:
# ===============================
# 9. 聚类 + UMAP
# ===============================
epi <- FindNeighbors(epi, dims = 1:15) #这里根据之前的PCA 肘~来定
epi <- FindClusters(epi, resolution = 0.3)
epi <- RunUMAP(epi, dims = 1:15)# ===============================
# 10. 画 UMAP
# ===============================
p1 <- DimPlot(epi, reduction = "umap", label = TRUE) +ggtitle("LEpC subclusters")p2 <- DimPlot(epi, reduction = "umap", group.by = "sample") +ggtitle("LEpC by sample")p1 + p2ggsave("Fig1_LEpC_UMAP_subclusters_sample.pdf",width = 12, height = 5)

Seurat 的 DimPlot() 本质上就是把降维后的细胞画成二维散点图:每一个点代表一个细胞,位置由表达谱相似性决定,颜色可以按 cluster 或 sample 分组

输出结果如下

image

 

左边是按 Seurat 重新聚类的结果。你这里把 LEpC 分成了 0–8 共 9 个亚群

可以这样理解:

图上 cluster含义
0–8 LEpC 内部不同表达状态的细胞群
每个点 一个 limbal epithelial cell
点越近 表达谱越相似
点越远 表达状态差异越大

右图是同一批细胞、同一个 UMAP 坐标,但颜色换成了样本来源

下面看看我们关注的不同基因簇在不同的Cluster中的表达

# ===============================
# 11. 分模块定义 marker
# ===============================marker_list <- list("LSC / basal progenitor" = c("KRT14", "KRT5", "KRT15", "TP63","ITGA6", "ITGB4", "ABCB5","GPHA2", "CXCL14", "IFITM3", "S100A2"),"Quiescence / progenitor-associated" = c("NOTCH1", "CAV1", "SLC6A6"),"TAC / cycling" = c("MKI67", "TOP2A", "PCNA", "MCM2", "MCM5", "STMN1"),"Corneal differentiation" = c("KRT3", "KRT12", "ALDH3A1", "ALDH1A1", "AQP5"),"Conjunctival / mucosal" = c("KRT13", "KRT19", "MUC1", "MUC4", "MUC16"),"Junction / adhesion" = c("CDH1", "TJP1", "CLDN1", "OCLN"),"Mechanotransduction / osmotic stress" = c("YAP1", "WWTR1", "CTGF", "CYR61", "NFAT5")
)# ===============================
#  展平成一个向量,并去重
# ===============================marker_genes <- unique(unlist(marker_list))# 只保留数据里存在的基因
marker_genes_use <- marker_genes[marker_genes %in% rownames(epi)]# 看看哪些基因不存在
setdiff(marker_genes, marker_genes_use)# 看看最终用于画图的基因
marker_genes_use# ===============================
# 12. DotPlot
# ===============================
p_dot <- DotPlot(epi,features = marker_genes_use,group.by = "seurat_clusters"
) +RotatedAxis() +ggtitle("Marker expression across LEpC subclusters") +theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9),axis.text.y = element_text(size = 10),plot.title = element_text(hjust = 0.5, face = "bold"))p_dotggsave(  "Fig2_LEpC_marker_DotPlot_expanded.pdf",p_dot,width = 18,height = 6)

输出如下,可以看到每一簇的细胞相应的基因的表达量,便于挑选自己想要的细胞簇:

image

根据这个结果定义每个细胞簇的可能类别,并且针对性开始分析差异基因

可以画一个带有名字的散点图

# ===============================
# 13. 给 cluster 改名字
# ===============================
new_cluster_ids <- c("0" = "Transitional/differentiating epi","1" = "Basal epithelial-like","2" = "Transitional epithelial-like","3" = "Basal/progenitor-like","4" = "Junction-high differentiating epi","5" = "Candidate LSC/basal progenitor","6" = "Conjunctival/mucosal-like epi","7" = "Progenitor-related transitional","8" = "Basal/progenitor or stress-like"
)
epi$cell_state <- plyr::mapvalues(x = as.character(epi$seurat_clusters),from = names(new_cluster_ids),to = as.vector(new_cluster_ids)
)DimPlot(epi,reduction = "umap",group.by = "cell_state",label = TRUE,repel = TRUE
) +ggtitle("Annotated LEpC subclusters")

这样子:

image

下面找出每个Cluster最特别的基因:

现在的 DotPlot 是“用我们预先选好的 marker 去看 cluster”。

也就是我们问:

我关心的这些基因在各 cluster 里高不高?

但是 FindAllMarkers() 是反过来问:

不带预设偏见,每个 cluster 自己最突出的基因是什么?

DotPlot 是“带着问题看”;FindAllMarkers 是“让数据自己告诉你每群的特征”。

代码如下:

# ===============================
# 14. 导出每个 cluster 的 top markers
# ===============================Idents(epi) <- "seurat_clusters"lep_markers <- FindAllMarkers(epi,only.pos = TRUE,min.pct = 0.25,logfc.threshold = 0.25)write.csv(lep_markers,"Table_LEpC_subcluster_markers.csv",row.names = FALSE)top20 <- lep_markers %>%dplyr::group_by(cluster) %>%dplyr::slice_max(order_by = avg_log2FC, n = 20)write.csv(top20,"Table_LEpC_top20_markers_each_cluster.csv",row.names = FALSE) 

 这一步相对耗时较长,它会一个簇一个簇来分析

image

 image

 然后可以对于不符合我们期望的Cluster进行一个排除(比如某细胞簇是基质细胞,而我们希望分析的群体是上皮细胞),再次进行UMAP分簇和DotPlot

然后用最准确的分簇,挑出最符合我们期待的细胞群,将它与其他符合预期的非目标群进行比较(例如有高度分化属性的干细胞和不同分化终末细胞)

 下面代码导出不同族群之间的差异基因:

# ===============================
# 15. 比较cluster 5 vs cluster 0 + 4 +6 
# ===============================
Idents(epi) <- "seurat_clusters"deg_c5_vs_diff <- FindMarkers(epi,ident.1 = "5",ident.2 = c("0", "4"),min.pct = 0.25,logfc.threshold = 0.25)write.csv(deg_c5_vs_diff,"DEG_cluster5_LSC_like_vs_cluster0_4_differentiated.csv")deg_c5_vs_c6 <- FindMarkers(epi,ident.1 = "5",ident.2 = "6",min.pct = 0.25,logfc.threshold = 0.25)write.csv(deg_c5_vs_c6,"DEG_cluster5_LSC_like_vs_cluster6_mucosal_like.csv")

 然后就可以看看差异基因里面有没有我们感兴趣的基因

image

 后面的分析就很个性化了,看个人的需要,分享一种基于UMAP的热点基因热图

# ===============================
# 1. 定义机制相关 gene signatures
# ===============================stemness_basal_genes <- c("KRT14", "KRT5", "KRT15", "TP63","ITGA6", "ITGB4", "ITGA3", "COL7A1","SLC6A6", "SLC5A3", "S100A2", "IFITM3","NOTCH1", "CAV1"
)osmotic_volume_genes <- c("NFAT5", "SLC5A3", "SLC6A6", "SLC38A1", "SLC38A2","SLC2A1", "AQP3", "AQP5","LRRC8A", "SLC12A2", "SLC12A6", "SLC9A1","HSPA1A", "HSPA1B", "HSP90AA1","DDIT3", "ATF3", "JUN", "FOS"
)mechanotransduction_genes <- c("ITGA3", "ITGA6", "ITGB1", "ITGB4","LAMA3", "LAMB3", "LAMC2", "COL7A1","CAV1", "CAV2","FLNA", "TLN1", "PXN", "VCL","ROCK1", "ROCK2", "MYL9","YAP1", "WWTR1", "TEAD1", "TEAD3","CYR61", "CTGF"
)stress_mapk_ap1_genes <- c("HSPA1A", "HSPA1B", "HSP90AA1", "DNAJB1","ATF3", "JUN", "JUNB", "FOS", "FOSB","DUSP1", "DUSP5", "DUSP6","EGR1", "DDIT3", "HMOX1","CXCL8", "IL6", "CCL20"
)junction_barrier_genes <- c("TJP1", "TJP2", "OCLN", "F11R","CDH1", "CLDN1", "CLDN4", "CLDN7","DSG1", "DSG3", "DSP", "JUP","EPCAM"
)differentiation_mucosal_genes <- c("KRT3", "KRT12", "ALDH3A1", "ALDH1A1","AQP5", "KRT13", "KRT19","MUC1", "MUC4", "MUC16","CEACAM5", "CEACAM7", "KRT4"
)# ===============================
# 2. 只保留数据里存在的基因
# ===============================keep_genes <- function(x) {x[x %in% rownames(epi)]
}stemness_basal_genes <- keep_genes(stemness_basal_genes)
osmotic_volume_genes <- keep_genes(osmotic_volume_genes)
mechanotransduction_genes <- keep_genes(mechanotransduction_genes)
stress_mapk_ap1_genes <- keep_genes(stress_mapk_ap1_genes)
junction_barrier_genes <- keep_genes(junction_barrier_genes)
differentiation_mucosal_genes <- keep_genes(differentiation_mucosal_genes)# 看看每组剩下多少基因
sapply(list(Stemness_basal = stemness_basal_genes,Osmotic_volume = osmotic_volume_genes,Mechanotransduction = mechanotransduction_genes,Stress_MAPK_AP1 = stress_mapk_ap1_genes,Junction_barrier = junction_barrier_genes,Differentiation_mucosal = differentiation_mucosal_genes),length
)
# ===============================
# 3. AddModuleScore 给每个Cluster的每个模块打分
# ===============================epi <- AddModuleScore(epi, features = list(stemness_basal_genes),name = "Stemness_basal_score")
epi <- AddModuleScore(epi, features = list(osmotic_volume_genes),name = "Osmotic_volume_score")
epi <- AddModuleScore(epi, features = list(mechanotransduction_genes),name = "Mechanotransduction_score")
epi <- AddModuleScore(epi, features = list(stress_mapk_ap1_genes),name = "Stress_MAPK_AP1_score")
epi <- AddModuleScore(epi, features = list(junction_barrier_genes),name = "Junction_barrier_score")
epi <- AddModuleScore(epi, features = list(differentiation_mucosal_genes),name = "Differentiation_mucosal_score")
#=============================
#画UMAP上面的SCORE
#=============================
p_feature_score <- FeaturePlot(epi,features = score_features,reduction = "umap",ncol = 3)p_feature_score
ggsave("Fig4_LEpC_module_scores_UMAP.pdf",p_feature_score,width = 14,height = 8)

 出来是这样 

image

 其实打完分之后呈现方式也可以是提琴图

# ===============================
# 4. Violin plot
# ===============================score_features <- c("Stemness_basal_score1","Osmotic_volume_score1","Mechanotransduction_score1","Stress_MAPK_AP1_score1","Junction_barrier_score1","Differentiation_mucosal_score1")p_score <- VlnPlot(epi,features = score_features,group.by = "seurat_clusters",pt.size = 0,ncol = 2)p_scoreggsave("Fig3_LEpC_osmo_mechano_stress_module_scores.pdf",p_score,width = 12,height = 10)

 出来是这样

image

 

http://www.jsqmd.com/news/700305/

相关文章:

  • 《SRE:Google 运维解密》读书笔记19: SRE中的软件工程 - 当SRE从“运维”走向“开发”
  • JOULWATT杰华特 JW1386VQDFA#TR DFN 转换器
  • 如何快速掌握PCL启动器:面向Minecraft新手的完整教程
  • 036、Python多线程编程:threading模块基础
  • Qwen3-TTS开源大模型部署:多用户并发语音合成负载测试报告
  • DeepSeek V4降AI完全手册,2026年4月从0到95分实测 - 我要发一区
  • Windows麦克风全局静音控制:MicMute的技术实现与高效应用指南
  • 儿童怎么掏耳朵?怎么给小孩掏耳屎?儿童掏耳朵神器推荐2026
  • HsMod插件:重新定义你的炉石传说游戏体验
  • MinGW-w64企业级技术架构深度解析:构建Windows生产环境部署的最佳实践
  • 如何用XUnity.AutoTranslator打破游戏语言壁垒:三步实现无缝翻译体验
  • 如何通过计算机视觉技术重新定义科研图表数据分析范式
  • 如何配置表中某列的排序权重_全文索引配置与权重分配
  • 破解近视低龄化难题 赵阳眼科以专业医疗守护青少年眼健康 - 外贸老黄
  • C++入门第一节
  • DeepSeek V4写的论文知网AI率高怎么办?2026年4月攻略 - 我要发一区
  • GitHub 9.5k Star!教你免费使用 Claude Code,终端 VSCode 皆可用
  • 在测试过程中,如何定位一个问题出现的原因
  • 5分钟掌握抖音下载器:新手必备的无水印批量下载完整指南
  • FlightSpy:如何用开源工具实现全天候机票价格智能监控?
  • Gemma-4-26B-A4B-it-GGUF效果展示:256K上下文下完整解析GitHub仓库README+源码逻辑
  • TIDAL Downloader Next Generation终极指南:解锁24-bit/192kHz无损音乐下载
  • 设计模式(学习笔记)(第二章,创建型模式)
  • 军队文职《管理学》| 组织行为学—刷题练习(40题精编)
  • 江西单招标杆机构,大圣学成教学成绩优异,成绩好,师资强,规模大,学成有保障 - 新闻快传
  • qiankun
  • FPGA音频处理平台Tiliqua的设计与应用
  • Linux入门攻坚——75、运维监控阶段工具之zabbix-2
  • Python3 模块精讲:Matplotlib—— 数据可视化、绘图从零基础到实战精通
  • 实测DeepSeek V4降AI 5款工具,2026年4月嘎嘎降AI最稳 - 我要发一区