当前位置：首页 > news >正文

叶绿体基因组深度图还能这么看？用Python+R一键生成带结构注释的覆盖度报告

news 2026/7/26 0:20:58

叶绿体基因组深度分析自动化：Python+R全流程可视化实战

叶绿体基因组分析是植物基因组研究中的重要环节，而深度覆盖分析则是评估测序质量和组装完整性的关键指标。传统的手工分析方法效率低下且容易出错，特别是在处理大批量样本时。本文将介绍如何利用Python和R构建一个自动化流程，从原始深度数据到带有结构注释的专业级可视化报告，实现一键生成论文级图表的目标。

1. 环境准备与数据预处理

在开始自动化流程之前，需要确保系统环境配置正确并准备好输入数据。这一阶段的工作将为后续分析奠定基础。

1.1 软件环境配置

首先需要安装以下必备软件和Python/R包：

# 基础工具安装 conda create -n chloroplast python=3.8 r=4.1 conda activate chloroplast conda install -c bioconda bowtie2 samtools pip install rpy2 pandas numpy

R环境中需要安装的包可以通过以下命令完成：

install.packages(c("ggplot2", "data.table", "cowplot", "dplyr"))

1.2 输入数据规范

我们的自动化流程需要三类输入文件：

参考基因组：组装好的叶绿体基因组fasta文件
测序数据：原始测序文件（fastq格式）
结构注释信息：包含LSC、SSC、IRa、IRb区域坐标的文本文件

结构注释文件示例（regions.txt）：

LSC 1 84846 IRb 84847 110900 SSC 110901 129191 IRa 129192 155245

2. 深度数据自动化处理

深度数据的处理是分析流程的核心环节，本节将介绍如何使用Python实现深度文件的自动化处理和统计计算。

2.1 序列比对与深度计算

使用bowtie2和samtools进行序列比对和深度计算：

import subprocess def calculate_depth(reference, forward_read, reverse_read, output_prefix, threads=20): # 构建索引 subprocess.run(f"bowtie2-build {reference} {output_prefix}_ref", shell=True) # 序列比对 subprocess.run( f"bowtie2 --very-sensitive-local -x {output_prefix}_ref " f"-1 {forward_read} -2 {reverse_read} -p {threads} " f"-S {output_prefix}.sam", shell=True ) # SAM转BAM并排序 subprocess.run(f"samtools view -h -F 4 {output_prefix}.sam -@ {threads} | " f"samtools sort -o {output_prefix}_sorted.bam -@ {threads}", shell=True) # 生成深度文件 subprocess.run(f"samtools depth {output_prefix}_sorted.bam > {output_prefix}_depth.txt", shell=True) return f"{output_prefix}_depth.txt"

2.2 深度数据步长合并

为提高可视化效果，我们需要对原始深度数据进行步长合并：

import pandas as pd def merge_depth_by_window(depth_file, window_size=2000, output_file=None): """按指定窗口大小合并深度数据""" df = pd.read_csv(depth_file, sep='\t', header=None, names=['contig', 'pos', 'depth']) # 计算窗口平均值 result = [] for i in range(0, len(df), window_size): window = df.iloc[i:i+window_size] middle_pos = window['pos'].median() avg_depth = window['depth'].mean() result.append([middle_pos, avg_depth]) merged_df = pd.DataFrame(result, columns=['position', 'depth']) if output_file: merged_df.to_csv(output_file, sep='\t', index=False) return merged_df

提示：窗口大小可根据基因组大小调整，对于较小的叶绿体基因组(约150kb)，2000bp的窗口通常效果良好。

3. 基因组结构整合与可视化

将深度数据与基因组结构信息结合，生成专业级可视化图表是本节的重点内容。

3.1 结构信息解析

首先需要解析基因组结构信息：

def parse_region_file(region_file): """解析基因组结构区域文件""" regions = {} with open(region_file) as f: for line in f: parts = line.strip().split() regions[parts[0]] = (int(parts[1]), int(parts[2])) return regions

3.2 R可视化脚本生成

使用rpy2直接在Python中调用R进行可视化：

from rpy2.robjects import pandas2ri from rpy2.robjects.packages import importr import rpy2.robjects as ro def plot_depth_with_regions(depth_data, merged_data, regions, output_plot): """生成带有结构注释的深度图""" pandas2ri.activate() ggplot2 = importr('ggplot2') cowplot = importr('cowplot') # 准备R环境 ro.r.assign('depth_data', depth_data) ro.r.assign('merged_data', merged_data) ro.r.assign('regions', regions) ro.r.assign('output_plot', output_plot) r_script = """ library(ggplot2) library(cowplot) # 准备区域数据 regions_df <- data.frame( region = names(regions), start = sapply(regions, function(x) x[1]), end = sapply(regions, function(x) x[2]), color = c("lightblue", "lightgreen", "lightpink", "lightgray"), border = c("blue", "green", "red", "black") ) # 原始深度图 p1 <- ggplot(depth_data, aes(x = V2, y = V3)) + geom_bar(stat = 'identity', fill = "lightblue") + ylim(0, max(depth_data$V3) * 1.1) + theme_classic() + xlab("Genome Position (bp)") + ylab("Read Depth") + geom_rect( data = regions_df, aes(xmin = start, xmax = end, ymin = 0, ymax = max(depth_data$V3) * 0.02), fill = color, colour = border, size = 0.5, inherit.aes = FALSE ) # 合并窗口深度图 p2 <- ggplot(merged_data, aes(x = position, y = depth)) + geom_bar(stat = 'identity', width = 800, fill = "lightblue") + ylim(0, max(merged_data$depth) * 1.1) + theme_classic() + xlab("Genome Position (bp)") + ylab("Mean Depth (2000bp window)") + geom_rect( data = regions_df, aes(xmin = start, xmax = end, ymin = 0, ymax = max(merged_data$depth) * 0.02), fill = color, colour = border, size = 0.5, inherit.aes = FALSE ) # 合并图形 final_plot <- plot_grid(p1, p2, nrow = 2, labels = c("A", "B")) # 保存图形 ggsave(output_plot, final_plot, width = 10, height = 8, dpi = 300) """ ro.r(r_script)

4. 批量处理与报告生成

对于大规模分析项目，批量处理能力至关重要。本节将介绍如何扩展流程以处理多个样本。

4.1 样本元数据管理

建议使用CSV文件管理样本信息：

sample_id,reference,forward_read,reverse_read,region_file sample1,ref.fasta,sample1_R1.fq.gz,sample1_R2.fq.gz,regions.txt sample2,ref.fasta,sample2_R1.fq.gz,sample2_R2.fq.gz,regions.txt

4.2 批量处理脚本

import pandas as pd from multiprocessing import Pool def process_sample(row): """处理单个样本的完整流程""" sample_id = row['sample_id'] print(f"Processing {sample_id}...") # 计算深度 depth_file = calculate_depth( row['reference'], row['forward_read'], row['reverse_read'], sample_id ) # 合并窗口 merged_file = f"{sample_id}_merged.txt" merge_depth_by_window(depth_file, output_file=merged_file) # 解析区域 regions = parse_region_file(row['region_file']) # 可视化 plot_file = f"{sample_id}_depth_plot.png" depth_data = pd.read_csv(depth_file, sep='\t', header=None) merged_data = pd.read_csv(merged_file, sep='\t') plot_depth_with_regions(depth_data, merged_data, regions, plot_file) return plot_file def batch_process(metadata_file, threads=4): """批量处理多个样本""" metadata = pd.read_csv(metadata_file) with Pool(threads) as p: results = p.map(process_sample, [row for _, row in metadata.iterrows()]) print(f"Processing completed. Results saved to: {results}")

4.3 HTML报告生成

使用Python生成包含所有样本结果的HTML报告：

from jinja2 import Template def generate_html_report(sample_results, output_file="report.html"): """生成HTML格式的报告""" template_str = """ <!DOCTYPE html> <html> <head> <title>叶绿体基因组深度分析报告</title> <style> body { font-family: Arial, sans-serif; margin: 20px; } h1 { color: #2e6c80; } .sample { margin-bottom: 30px; border-bottom: 1px solid #ccc; padding-bottom: 20px; } img { max-width: 100%; height: auto; } </style> </head> <body> <h1>叶绿体基因组深度分析报告</h1> <p>生成日期: {{ date }}</p> <p>共分析 {{ count }} 个样本</p> {% for sample in samples %} <div class="sample"> <h2>样本: {{ sample.id }}</h2> <img src="{{ sample.plot }}" alt="深度分布图"> </div> {% endfor %} </body> </html> """ template = Template(template_str) html = template.render( date=pd.Timestamp.now().strftime("%Y-%m-%d"), count=len(sample_results), samples=[{"id": f"sample{i+1}", "plot": res} for i, res in enumerate(sample_results)] ) with open(output_file, 'w') as f: f.write(html)

5. 高级功能与定制化

为满足不同研究需求，我们可以进一步扩展基础流程的功能。

5.1 深度分布统计

添加深度分布统计功能：

def depth_distribution_stats(depth_file): """计算深度分布统计量""" df = pd.read_csv(depth_file, sep='\t', header=None, names=['contig', 'pos', 'depth']) stats = { 'mean_depth': df['depth'].mean(), 'median_depth': df['depth'].median(), 'min_depth': df['depth'].min(), 'max_depth': df['depth'].max(), 'coverage_10x': (df['depth'] >= 10).mean() * 100, 'coverage_30x': (df['depth'] >= 30).mean() * 100 } return stats

5.2 区域特异性深度分析

针对不同基因组区域进行深度比较：

def region_specific_depth(depth_file, region_file): """计算各区域的深度统计""" depth_df = pd.read_csv(depth_file, sep='\t', header=None, names=['contig', 'pos', 'depth']) regions = parse_region_file(region_file) results = [] for region, (start, end) in regions.items(): region_depth = depth_df[(depth_df['pos'] >= start) & (depth_df['pos'] <= end)]['depth'] results.append({ 'region': region, 'mean_depth': region_depth.mean(), 'median_depth': region_depth.median(), 'coverage_10x': (region_depth >= 10).mean() * 100 }) return pd.DataFrame(results)

5.3 交互式可视化

使用plotly创建交互式可视化：

def interactive_depth_plot(depth_data, merged_data, regions): """生成交互式深度图""" import plotly.graph_objects as go from plotly.subplots import make_subplots fig = make_subplots(rows=2, cols=1, subplot_titles=("Raw Depth", "Window Average (2000bp)")) # 原始深度 fig.add_trace( go.Bar(x=depth_data['pos'], y=depth_data['depth'], name='Raw Depth'), row=1, col=1 ) # 窗口平均 fig.add_trace( go.Bar(x=merged_data['position'], y=merged_data['depth'], name='Window Average'), row=2, col=1 ) # 添加区域标记 colors = {'LSC': 'blue', 'IRb': 'green', 'SSC': 'red', 'IRa': 'gray'} for region, (start, end) in regions.items(): fig.add_vrect( x0=start, x1=end, fillcolor=colors[region], opacity=0.2, line_width=1, line_color=colors[region], row=1, col=1 ) fig.add_vrect( x0=start, x1=end, fillcolor=colors[region], opacity=0.2, line_width=1, line_color=colors[region], row=2, col=1 ) fig.update_layout(height=800, showlegend=False) return fig

查看全文

http://www.jsqmd.com/news/934662/