
How do you align two distributions when zero has a special physical meaning?

news 2026/3/27 3:29:10

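When the value 0 in a dataset carries physical meaning (for example, 0 means no response, while positive and negative values indicate opposite effects), we cannot simply apply a global rescaling such as min-max normalization, because the shift it involves can move the zero point.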
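To make the zero-drift problem concrete, here is a minimal sketch of a global min-max rescaling applied to a few made-up scores. The function name `global_minmax_rescale` and the sample values are illustrative assumptions, not part of the original data.

```python
import numpy as np

def global_minmax_rescale(x, target_min, target_max):
    """Map [x.min(), x.max()] linearly onto [target_min, target_max]."""
    x = np.asarray(x, dtype=float)
    unit = (x - x.min()) / (x.max() - x.min())  # rescale to [0, 1]
    return unit * (target_max - target_min) + target_min

# Hypothetical source scores; the middle one is exactly 0 ("no response")
source = np.array([-1.0, 0.0, 4.0])
rescaled = global_minmax_rescale(source, target_min=-6.0, target_max=2.0)

print(rescaled)     # the end points land on -6 and 2 as intended
print(rescaled[1])  # but the original 0 is now about -4.4
```

The bounds match the target range, but the "no response" score is now reported as a clearly negative effect, which is the drift that zero-anchored scaling avoids.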

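Instead, anchor the transformation at 0 and stretch each side separately:

Negative half-axis: stretch the protein scores so that their most negative value (lower bound) maps to the most negative DNA value.
Positive half-axis: stretch the protein scores so that their largest positive value (upper bound) maps to the largest positive DNA value.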

This method is called zero-anchored segmented rescaling: a piecewise linear scaling with 0 as the fixed point.

\[x_{new} = \begin{cases} x \times \frac{Min_{DNA}}{Min_{Protein}} & \text{if } x < 0 \\ x \times \frac{Max_{DNA}}{Max_{Protein}} & \text{if } x \ge 0 \end{cases} \]
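The piecewise formula can be sketched as a small vectorized function. This is an illustration under stated assumptions: the function name `zero_anchored_rescale`, its parameters, and the sample arrays are hypothetical, and an empty or all-zero side falls back to a scale factor of 1.0.

```python
import numpy as np

def zero_anchored_rescale(x, target_neg_min, target_pos_max):
    """Scale negatives and non-negatives separately, keeping 0 fixed."""
    x = np.asarray(x, dtype=float)
    neg = x[x < 0]
    pos = x[x >= 0]
    # Source extremes on each side of 0 (0.0 if that side is empty)
    src_neg_min = neg.min() if neg.size else 0.0
    src_pos_max = pos.max() if pos.size else 0.0
    # Ratio target extreme / source extreme; 1.0 when the source extreme is 0
    scale_neg = target_neg_min / src_neg_min if src_neg_min != 0 else 1.0
    scale_pos = target_pos_max / src_pos_max if src_pos_max != 0 else 1.0
    return np.where(x < 0, x * scale_neg, x * scale_pos)

# Made-up protein scores, aligned to a target range of [-6, 2]
protein = np.array([-2.0, -1.0, 0.0, 1.0, 4.0])
aligned = zero_anchored_rescale(protein, target_neg_min=-6.0, target_pos_max=2.0)
print(aligned)
```

Here the negative side is stretched by 3 (from -6 / -2) and the positive side is compressed by 0.5 (from 2 / 4), so the values become -6, -3, 0, 0.5, and 2. Zero stays at zero and every score keeps its sign.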

Core algorithm logic

The implementation is as follows:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Placeholder paths (replace with your actual files)
output_path = 'your_dna_file.parquet'
input_path = 'your_protein_file.parquet'

sns.set_theme(style="whitegrid")
plt.figure(figsize=(12, 7))

# ---------------------------------------------------------
# 1. Load the data
# ---------------------------------------------------------
df_dna = pd.read_parquet(output_path)
df_protein = pd.read_parquet(input_path)

# ---------------------------------------------------------
# 2. Compute the bounds
# ---------------------------------------------------------
# Overall DNA bounds
dna_min = df_dna['score'].min()
dna_max = df_dna['score'].max()
# Extremes of the DNA scores on each side of 0; these are the alignment targets.
# Note: if a side has no values, .min()/.max() on the empty selection returns NaN.
dna_neg_min = df_dna[df_dna['score'] < 0]['score'].min()
dna_pos_max = df_dna[df_dna['score'] >= 0]['score'].max()

# Protein bounds
prot_neg_min = df_protein[df_protein['score'] < 0]['score'].min()
prot_pos_max = df_protein[df_protein['score'] >= 0]['score'].max()

print("=== Bounds ===")
print(f"DNA     negative extreme: {dna_neg_min:.4f} | positive extreme: {dna_pos_max:.4f}")
print(f"Protein negative extreme: {prot_neg_min:.4f} | positive extreme: {prot_pos_max:.4f}")

# ---------------------------------------------------------
# 3. Segmented transfer
# ---------------------------------------------------------
# Compute the scaling factors
# Negative half-axis: target negative extreme / source negative extreme
scale_neg = dna_neg_min / prot_neg_min if prot_neg_min != 0 else 1.0
# Positive half-axis: target positive extreme / source positive extreme
scale_pos = dna_pos_max / prot_pos_max if prot_pos_max != 0 else 1.0

print("\n=== Scaling factors ===")
print(f"Negative half-axis (x < 0) : {scale_neg:.4f}x")
print(f"Positive half-axis (x >= 0): {scale_pos:.4f}x")

def apply_segmented_scaling(x):
    # Values below 0 use the negative factor; 0 and above use the positive one
    if x < 0:
        return x * scale_neg
    else:
        return x * scale_pos

# Apply the transformation
df_protein['aligned_score'] = df_protein['score'].apply(apply_segmented_scaling)

# ---------------------------------------------------------
# 4. Plot
# ---------------------------------------------------------

# DNA (target): red filled histogram
sns.histplot(
    df_dna['score'], color='tab:red', label='Target: DNA', kde=True,
    stat="density", alpha=0.2, element="step", linewidth=0
)

# Original protein (source): blue filled histogram
sns.histplot(
    df_protein['score'], color='tab:blue', label='Source: Protein (Original)', kde=True,
    stat="density", alpha=0.2, element="step", linewidth=0
)

# Aligned protein: solid green line
sns.histplot(
    df_protein['aligned_score'], color='tab:green', label='Transformed: Protein (Bound-Aligned)', kde=True,
    stat="density", alpha=0.4, element="step", linewidth=1.5, linestyle='-',
    fill=False  # outline only, which is easier to read
)

# ---------------------------------------------------------
# 5. Add guide lines to verify that the bounds align
# ---------------------------------------------------------
# Zero anchor line
plt.axvline(0, color='black', linestyle='-', linewidth=1, alpha=0.5, label='Zero Anchor')

# Negative-extreme guide lines (the left edges of green and red should coincide)
plt.axvline(dna_neg_min, color='tab:red', linestyle=':', alpha=0.6, ymax=0.3)
plt.axvline(df_protein['aligned_score'].min(), color='tab:green', linestyle='--', alpha=0.6, ymax=0.3)
plt.text(dna_neg_min, 0.01, ' Neg Boundary', color='black', fontsize=9, ha='right')

# Positive-extreme guide lines (the right edges of green and red should coincide)
plt.axvline(dna_pos_max, color='tab:red', linestyle=':', alpha=0.6, ymax=0.3)
plt.axvline(df_protein['aligned_score'].max(), color='tab:green', linestyle='--', alpha=0.6, ymax=0.3)
plt.text(dna_pos_max, 0.01, 'Pos Boundary ', color='black', fontsize=9, ha='left')

plt.title('Segmented Distribution Alignment (Zero-Anchored)', fontsize=16)
plt.xlabel('Score', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.legend(loc='upper right')

plt.show()
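The guide lines check the alignment visually; the same conditions can also be checked numerically. The sketch below is one possible way to do that. The helper `check_alignment` and the small arrays are hypothetical, and it assumes both datasets contain values on both sides of 0.

```python
import numpy as np

def check_alignment(source, aligned, target, tol=1e-9):
    """Return pass/fail flags for the properties the scaling should preserve."""
    source = np.asarray(source, dtype=float)
    aligned = np.asarray(aligned, dtype=float)
    target = np.asarray(target, dtype=float)
    return {
        # Every source value that was exactly 0 must still be 0
        "zero_fixed": bool(np.all(aligned[source == 0] == 0)),
        # No score may change sign
        "signs_kept": bool(np.all(np.sign(aligned) == np.sign(source))),
        # The aligned extremes must coincide with the target extremes
        "neg_bound_matches": bool(np.isclose(aligned.min(), target[target < 0].min(), atol=tol)),
        "pos_bound_matches": bool(np.isclose(aligned.max(), target[target >= 0].max(), atol=tol)),
    }

# Made-up stand-ins for the DNA (target) and protein (source) score columns
dna = np.array([-6.0, -1.0, 0.0, 0.5, 2.0])
protein = np.array([-2.0, -1.0, 0.0, 1.0, 4.0])

scale_neg = dna[dna < 0].min() / protein[protein < 0].min()
scale_pos = dna[dna >= 0].max() / protein[protein >= 0].max()
aligned = np.where(protein < 0, protein * scale_neg, protein * scale_pos)

report = check_alignment(protein, aligned, dna)
print(report)
```

With real data, the same function could be called on `df_protein['score']`, `df_protein['aligned_score']`, and `df_dna['score']`; any flag that comes back `False` points to a side with missing values or a degenerate scale factor.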