Technical Pitfalls (1): Using MetaPhlAn 4 and StrainPhlAn 4 Together to Analyze Strain-Level Transmission

news 2026/7/5 15:00:55

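A few words up front

My research focuses on patterns of bacterial transmission between people, covering mother-to-infant, within-household, and within-population transmission. Most longitudinal or cross-sectional studies of this kind come from abroad; they are still rare in China. So I am drawing on my own work, and what modest knowledge I have, to help fill the gap. Please go easy on me, and feel free to get in touch at any time.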

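The official introduction to StrainPhlAn

The official page, "StrainPhlAn - A strain-level population genomics tool for metagenomic data", provides the workflow and code, but many readers struggle to get the full code because GitHub is hard to reach. If you are interested, you can try editing your hosts file to access GitHub directly. Enough preamble; on with the workflow.

StrainPhlAn analyzes strain-level genetic variation by identifying single-nucleotide variants (SNVs) in MetaPhlAn marker genes. By analyzing multiple samples, it lets researchers track specific strains across environments or hosts, examine strain-level population structure, and build strain-level phylogenetic trees. (Taken from the official introduction.)

Figure: The basic StrainPhlAn analysis workflow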

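Install MetaPhlAn 4, configure the database, and run StrainPhlAn 4

(Note: raw-data quality control and host-read removal are skipped here; I assume you have already done them.)

StrainPhlAn 4 requires at least four samples. Its main input is the *.sam.bz2 file produced by each MetaPhlAn 4 run (a bzip2-compressed SAM alignment file; the key option is -s metaphlan/${i}.sam.bz2). Once MetaPhlAn 4 has finished, gather the .sam.bz2 files into one folder. You can also use the MetaPhlAn profiles to see which SGBs (species) are present in each sample and at what abundance, and pick the target species for strain analysis accordingly.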
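To make the "which species is present in enough samples" step concrete, here is a minimal sketch that counts how many MetaPhlAn profiles contain a given clade at non-zero abundance. The function name count_clade_samples is made up for illustration, and the sketch assumes the usual tab-separated profile layout in which column 1 is the full clade path (levels joined by "|") and column 3 is the relative abundance; check this against your own profiles before relying on it.

```shell
# Count how many MetaPhlAn profiles (*.txt in a directory) contain a clade
# with relative abundance > 0.
# Assumption: column 1 = clade path joined by "|", column 3 = relative abundance.
# Usage: count_clade_samples <clade_label> <profile_dir>
count_clade_samples() {
    local clade="$1" dir="$2" n=0 f
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue
        if awk -F'\t' -v c="$clade" '
            /^#/ { next }                       # skip header/comment lines
            {
                k = split($1, parts, "|")       # last element = deepest level
                if (parts[k] == c && $3 + 0 > 0) { found = 1; exit }
            }
            END { exit(found ? 0 : 1) }
        ' "$f"; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Example: how many samples carry B. longum at species level?
# count_clade_samples s__Bifidobacterium_longum metaphlan_infantis
```

If the number printed is below four, that species is unlikely to be a workable StrainPhlAn target for the cohort, given the four-sample minimum mentioned above.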

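My subjects are mothers and their infants, sampled longitudinally, and my main aim is to resolve strain transmission of Bifidobacterium longum subsp. longum between mother and infant. I therefore used the specific database from the 2024 Nature Communications paper "Longitudinal quantification of Bifidobacterium longum subsp. infantis reveals late colonization in the infant gut independent of maternal milk HMO composition" [2]. You can also download and configure a general-purpose database from the MetaPhlAn 4 database site (a major bug is covered in detail further down).

Download location: ftp: Index of /biobakery4/metaphlan_databases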

### Using the general-purpose database
mkdir -p metaphlan metaphlan_infantis
for i in $(tail -n +2 metadata.txt | cut -f1); do
    metaphlan --input_type fastq temp1/${i}_R1_val_1.fq,temp1/${i}_R2_val_2.fq \
        --bowtie2db ./metaphlan_databases \
        --bowtie2out metaphlan_infantis/${i}_metagenome.bowtie2.bz2 \
        -s metaphlan/${i}.sam.bz2 \
        --output metaphlan_infantis/${i}.txt
done

### Using the specific custom database
# for i in $(tail -n +2 metadata.txt | cut -f1); do
#     metaphlan --input_type fastq temp1/${i}_R1_val_1.fq,temp1/${i}_R2_val_2.fq \
#         --index mpa_vOct22_CHOCOPhlAnSGB_lon_subsp \
#         --bowtie2db ./meta/MetaPhlAn-B.infantis-main \
#         --bowtie2out metaphlan_infantis/${i}_metagenome.bowtie2.bz2 \
#         -s metaphlan/${i}.sam.bz2 \
#         --output metaphlan_infantis/${i}.txt
# done

#### Copy all .sam.bz2 files into one folder
#### (decompressing them afterwards is optional: the software reads .bz2 directly, I am just fussy about it)
mkdir -p strainphlan/sams
for i in $(tail -n +2 metadata.txt | cut -f1); do
    cp ./metaphlan/${i}.sam.bz2 ./strainphlan/sams/${i}.sam.bz2
done
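Since the loops above take sample IDs from the first column of metadata.txt (skipping the header line), a quick check that every listed sample actually produced a SAM file can save a confusing failure later. The following is only a sketch; missing_sams is a hypothetical helper name, and the file layout is the one used in the loops above.

```shell
# Print the IDs of samples listed in a metadata file that have no
# non-empty <id>.sam.bz2 in the given directory.
# Usage: missing_sams <metadata_file> <sam_dir>
missing_sams() {
    local meta="$1" dir="$2" id
    # Skip the header line, take column 1, test each expected file.
    tail -n +2 "$meta" | cut -f1 | while IFS= read -r id; do
        [ -n "$id" ] || continue
        [ -s "$dir/$id.sam.bz2" ] || echo "$id"
    done
}

# Example:
# missing_sams metadata.txt metaphlan
```

An empty result means every sample in the metadata has a SAM file to hand to the next step.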

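Extract the marker genes of a specific SGB

With the steps above done, the StrainPhlAn 4 part begins: first extract the consensus markers from the samples, then extract the markers of the specific SGB from the database. Both feed the tree-building step later.

#### Extract consensus markers from the samples
#### -d takes the database; substitute the general-purpose database here if that is what you use
mkdir -p consensus_markers
sample2markers.py -d ./db -i sams/*.sam.bz2 -o consensus_markers --nproc 64

### Extract markers from the database
### The standard SGB name can be looked up in the MetaPhlAn 4 output
### s__ denotes the species level
mkdir -p clade_markers
extract_markers.py -d ./meta/MetaPhlAn-B.infantis-main/mpa_vOct22_CHOCOPhlAnSGB_lon_subsp.pkl -c s__Bifidobacterium_longum -o clade_markers

### I analyzed subsp. longum
### extract_markers.py -d /home/shzuZQH/zqh_data/track_data/meta/MetaPhlAn-B.infantis-main/mpa_vOct22_CHOCOPhlAnSGB_lon_subsp.pkl -c t__subsp.longum -o clade_markers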

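A headache of a bug

Next comes the main StrainPhlAn 4 run. This is the most important part, so read on:

-d specifies the database. This option is critical: if you use a custom database that has no matching _size.txt.bz2 file, you get the following error: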

FileNotFoundError: [Errno 2] No such file or directory: '/home/zxy/miniconda3/envs/metaphlan4/lib/python3.9/site-packages/metaphlan/utils/mpa_vOct22_CHOCOPhlAnSGB_lon_subsp_size.txt.bz2'

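This .txt.bz2 file records the number of marker genes for each SGB in the corresponding database. It is a delimiter-separated text file: the first column is the SGB ID or SGB taxon name, and the second column is the number of marker genes for that SGB.

This step was painful for me: because I used a custom database, the matching .txt.bz2 file had not been generated, and MetaPhlAn 4 may not supply it when a database is configured. To see which database versions have a file, look in /home/zxy/miniconda3/envs/metaphlan4/lib/python3.9/site-packages/metaphlan/utils/ (the path will differ on your system).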
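Checking for the file before launching a long run is cheap. Below is a small sketch, assuming only the <index>_size.txt.bz2 naming seen in the error message; list_size_indexes and has_size_file are hypothetical helper names, not part of MetaPhlAn.

```shell
# List the database indexes that have a *_size.txt.bz2 file in a directory.
# Usage: list_size_indexes <utils_dir>
list_size_indexes() {
    local dir="$1" f
    for f in "$dir"/*_size.txt.bz2; do
        [ -e "$f" ] || continue
        f="${f##*/}"                 # strip the directory part
        echo "${f%_size.txt.bz2}"    # strip the suffix, leaving the index name
    done
}

# Succeed only if a non-empty size file exists for the given index.
# Usage: has_size_file <utils_dir> <index_name>
has_size_file() {
    [ -s "$1/${2}_size.txt.bz2" ]
}

# Example (the python one-liner just locates the installed package's utils folder):
# utils_dir=$(python -c 'import os, metaphlan; print(os.path.join(os.path.dirname(metaphlan.__file__), "utils"))')
# list_size_indexes "$utils_dir"
# has_size_file "$utils_dir" mpa_vOct22_CHOCOPhlAnSGB_lon_subsp || echo "size file missing"
```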

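If the file for your database is missing, you can write a script yourself (with AI help, say) that extracts each SGB ID and its marker-gene count from the database's .pkl file. I built mpa_vOct22_CHOCOPhlAnSGB_lon_subsp_size.txt.bz2 from mpa_vOct22_CHOCOPhlAnSGB_lon_subsp.pkl this way and placed it in the installation directory /home/zxy/miniconda3/envs/metaphlan4/lib/python3.9/site-packages/metaphlan/utils/, after which the code ran normally. It is worth comparing your file against one shipped for an official database to confirm the layout.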
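Whatever script produces the table, a sanity check before compressing it helps catch layout mistakes. The sketch below assumes a tab as the delimiter (the description above only says "delimiter-separated", so confirm against a shipped file) and an integer in the second column; validate_size_table is a hypothetical name.

```shell
# Succeed only if every line of a table has exactly two tab-separated fields,
# a non-empty first field, and a non-negative integer in the second field.
# Assumption: the delimiter is a tab.
# Usage: validate_size_table <table.tsv>
validate_size_table() {
    awk -F'\t' '
        NF != 2 || $1 == "" || $2 !~ /^[0-9]+$/ { bad = 1; exit }
        { n++ }
        END { exit((bad || n == 0) ? 1 : 0) }   # an empty table also fails
    ' "$1"
}

# Example: validate, then compress under the name StrainPhlAn asked for.
# validate_size_table lon_subsp_size.tsv &&
#     bzip2 -c lon_subsp_size.tsv > mpa_vOct22_CHOCOPhlAnSGB_lon_subsp_size.txt.bz2
```

The check says nothing about whether the numbers themselves are right; it only guards the shape of the file.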

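-s gives the consensus marker files produced in the previous step by sample2markers.py. The official StrainPhlAn 4 documentation says .pkl files are generated, but sometimes .json.bz2 files are produced instead; this does not affect the results, and the later steps run normally.

-m gives the marker-gene file for the specific SGB extracted in the previous step; normally there is just one .fna file.

-r gives additional reference genomes. For example, if you have isolates from the same samples, adding their genomes can improve accuracy. Note that each .fna file needs to be compressed to .fna.bz2 (bzip2 ref_genome.fna).

-c specifies the target clade to analyze (species or SGB).

--phylophlan_mode fast sets the PhyloPhlAn run mode: fast favors speed, while the alternative, accurate, favors accuracy.

Adjust the other parameters as needed.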
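The inputs described above can be checked in a few lines before the real run. This is a sketch under the assumptions stated in this post (at least four samples, one consensus marker file per sample in a single folder, one clade marker .fna); strainphlan_preflight is a hypothetical helper, not a StrainPhlAn command.

```shell
# Check the inputs for a StrainPhlAn run before launching it.
# Usage: strainphlan_preflight <consensus_dir> <clade_markers_fna> [min_samples]
strainphlan_preflight() {
    local dir="$1" fna="$2" min="${3:-4}" n=0 f
    # Count regular files in the consensus marker folder (one per sample).
    for f in "$dir"/*; do
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    if [ "$n" -lt "$min" ]; then
        echo "only $n consensus marker file(s) in $dir; need at least $min" >&2
        return 1
    fi
    if [ ! -s "$fna" ]; then
        echo "clade marker file missing or empty: $fna" >&2
        return 1
    fi
    echo "ok: $n samples"
}

# Example:
# strainphlan_preflight consensus_markers clade_markers/s__Bifidobacterium_longum.fna
```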

The main StrainPhlAn 4 run

#### Run the main StrainPhlAn 4 workflow
mkdir -p output
strainphlan -d ./meta/MetaPhlAn-B.infantis-main/mpa_vOct22_CHOCOPhlAnSGB_lon_subsp.pkl \
    -s consensus_markers/* \
    -m clade_markers/s__Bifidobacterium_longum.fna \
    -r reference_genomes/*.fna.bz2 \
    -o output \
    -c s__Bifidobacterium_longum \
    --phylophlan_mode fast \
    --nproc 64

The Bifidobacterium longum I chose has two subspecies in this database, so a selection prompt appeared; as the log shows, it is triggered because the clade was specified at the species level. I do not yet know whether other species with subspecies produce the same prompt. Nothing to worry about: pick an option and carry on.

(metaphlan4) zxy@FMBL:/home/strainphlan$ strainphlan -d ./meta/MetaPhlAn-B.infantis-main/mpa_vOct22_CHOCOPhlAnSGB_lon_subsp.pkl -s consensus_markers/* -m clade_markers/s__Bifidobacterium_longum.fna -r reference_genomes/*.fna.bz2 -o output -c s__Bifidobacterium_longum --phylophlan_mode fast --nproc 64
Sun Jun 14 21:56:49 2026: Start StrainPhlAn 4.1.2 execution
Sun Jun 14 21:56:49 2026: The clade has been specified at the species level, starting interactive clade selection...
Sun Jun 14 21:56:49 2026: Loading MetaPhlAn mpa_vOct22_CHOCOPhlAnSGB_lon_subsp database...
Sun Jun 14 21:57:07 2026: Done.
Sun Jun 14 21:57:07 2026: Available SGBs for species "s__Bifidobacterium_longum":
Sun Jun 14 21:57:07 2026: [1] t__subsp.longum (128 genomes)
Sun Jun 14 21:57:07 2026: [2] t__subsp.infantis (119 genomes)
Sun Jun 14 21:57:07 2026: [3] Exit
Select option:
## Enter the option number to continue

Run complete

When the run finishes, the files below are generated; for this run I selected Bifidobacterium longum subsp. infantis. For a description of each file, see the official StrainPhlAn 4 site.

(metaphlan4) zxy@FMBL:/home/strainphlan/output$ ls
RAxML_bestTree.t__subsp.infantis.StrainPhlAn4.tre
RAxML_info.t__subsp.infantis.StrainPhlAn4.tre
RAxML_log.t__subsp.infantis.StrainPhlAn4.tre
RAxML_parsimonyTree.t__subsp.infantis.StrainPhlAn4.tre
RAxML_result.t__subsp.infantis.StrainPhlAn4.tre
t__subsp.infantis.info
t__subsp.infantis.polymorphic
t__subsp.infantis.StrainPhlAn4_concatenated.aln

Output files:

Output Type	Description	Format
Marker Files	Extracted marker sequences	FASTA/PKL
Alignments	Multiple sequence alignments	FASTA
Phylogenetic Trees	Strain-level phylogeny	Newick (TRE)
Distance Matrices	Pairwise distances	Tab-delimited
Visualizations	Tree and ordination plots	PNG/PDF

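Interpreting the results

The main result from StrainPhlAn 4 is the phylogenetic tree file, which lets you analyze how closely related the strains of a given SGB are across populations or samples. For example, the figure for this section shows a phylogenetic tree of the Bifidobacterium longum subsp. infantis strains found in study cohorts from different countries, used to study how B. infantis strains differ between countries. Each sample is represented by its dominant strain and colored by cohort. The result shows that in some countries the strains from all samples are highly similar.

Other researchers have studied the genetic diversity of Bifidobacterium bifidum (SGB17256) strains among household members, between mothers and infants, or between twins.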
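Before reading anything into a tree, it helps to confirm which samples actually ended up as leaves, since samples without enough markers for the clade may be dropped along the way. The sketch below lists leaf names from a plain Newick file with standard text tools; newick_leaves is a hypothetical helper, and it assumes a tree without internal node labels or support values, so a tree with bootstrap labels would need a proper Newick parser instead.

```shell
# List the leaf names of a Newick tree, one per line, sorted.
# Assumption: internal nodes carry no labels or support values.
# Usage: newick_leaves <tree_file>
newick_leaves() {
    tr -d '\n' < "$1" |          # join the tree onto one line
        tr '(),' '\n\n\n' |      # split on brackets and commas
        sed 's/:.*$//' |         # drop branch lengths
        grep -Ev '^;?$' |        # drop empty tokens and the final ";"
        sort
}

# Example:
# newick_leaves output/RAxML_bestTree.t__subsp.infantis.StrainPhlAn4.tre
```

Comparing this list with the sample IDs in metadata.txt shows at a glance which mothers or infants are missing from the tree.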

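Overall, the StrainPhlAn 4 workflow is simpler than inStrain's and gives higher precision for a specific species or SGB. The drawback is that species have to be analyzed one at a time (sob), which is a disaster for someone like me who needs to study large cohorts and hundreds of species. Finally, this workflow suits researchers with a grounding in basic metagenomic analysis and skips many intermediate steps; if you have questions or want to discuss anything, feel free to leave a comment.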

Main references

[1] Selma-Royo M, Dubois L, Manara S, et al. Birth mode and environment-dependent microbiota transmission dynamics are complemented by breastfeeding during the first year[J]. Cell Host & Microbe, 2024, 32(6): 996-1010.e4. DOI: 10.1016/J.CHOM.2024.05.005.
[2] Ennis D, Shmorak S, Jantscher-Krenn E, et al. Longitudinal quantification of Bifidobacterium longum subsp. infantis reveals late colonization in the infant gut independent of maternal milk HMO composition[J]. Nature Communications, 2024, 15(1): 894. DOI: 10.1038/S41467-024-45209-Y.
[3] Valles-Colomer M, Blanco-Míguez A, Manghi P, et al. The person-to-person transmission landscape of the gut and oral microbiomes[J]. Nature, 2023, 614(7946): 125-135. DOI: 10.1038/S41586-022-05620-1.
