A Guide to Building Antibody Developability Prediction Models with p-IgGen

news 2026/4/29 3:38:41

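1. Building an Antibody Developability Prediction Model

In biopharma, developing an antibody drug is complex and expensive. Conventional experimental workflows can take months to assess a candidate's expression level, stability, aggregation propensity and other key parameters. In this post I show how to use embeddings from the protein language model p-IgGen to quickly build machine learning models that predict antibody developability. As a computational first pass, this can replace weeks of bench evaluation with a few minutes of prediction.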

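2. Core Concepts and Technology Choices

2.1 Basic Structure of Antibody Sequences

An antibody is built from heavy and light chains, joined by disulfide bonds into a Y-shaped structure. The variable domains of the two chains, VH and VL, pair to determine the antibody's antigen-binding properties, and they also influence its biophysical properties. In our dataset, every sample contains a paired VH and VL amino acid sequence.

Key tip: the interplay between VH and VL matters. Optimizing one chain in isolation can disrupt the pairing and destabilize the overall structure.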

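2.2 Developability Metrics

The GDPa1 dataset provides five experimentally measured properties:

Titer: antibody expression level in mammalian cells
HIC: retention time in hydrophobic interaction chromatography, which reflects hydrophobicity and aggregation propensity
PR_CHO: polyreactivity, measured against Chinese hamster ovary (CHO) cell proteins
Tm2: thermal stability (melting temperature) of the CH2 domain
AC-SINS_pH7.4: self-interaction propensity, measured by affinity-capture self-interaction nanoparticle spectroscopy at pH 7.4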

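2.3 Choosing the Model Architecture

We use p-IgGen as the base model, for three reasons:

It is built specifically for antibody sequences, so it is more specialized than general-purpose protein language models
It can process paired VH and VL sequences together
Its pretraining has already captured antibody-specific sequence patterns, which may carry over to function and biophysical properties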

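3. Data Preparation and Feature Engineering

3.1 Loading and Cleaning the Data

from datasets import load_dataset
import pandas as pd

# Load the dataset and convert the training split to a DataFrame
df = load_dataset("ginkgo-datapoints/GDPa1")["train"].to_pandas()

# Count missing values for each assay
print(df[["Titer", "HIC", "PR_CHO", "Tm2", "AC-SINS_pH7.4"]].isna().sum())

# Pick one target assay and drop antibodies that have no measurement for it
target = "HIC"
df = df.dropna(subset=[target])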

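3.2 Sequence Preprocessing

Antibody sequences need a specific format before they can be fed to p-IgGen:

Prefix the sequence with "1" as the start marker
Space-separate the residues of each chain and concatenate them, VH first and VL second
Append "2" as the end marker

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ollieturnbull/p-IgGen")

# "1" marks the start and "2" the end; residues within each chain are space-separated
sequences = [
    "1" + " ".join(heavy) + " ".join(light) + "2"
    for heavy, light in zip(
        df["vh_protein_sequence"], df["vl_protein_sequence"]
    )
]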
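To see exactly what string this expression produces, here is a minimal sketch that wraps it in a helper. The name `format_paired_sequence` and the three-residue toy chains are illustrative only and are not part of p-IgGen. Note that the expression puts no space after the start marker, between the last VH residue and the first VL residue, or before the end marker; whether the tokenizer is sensitive to those separators is not verified here.

```python
def format_paired_sequence(heavy: str, light: str) -> str:
    """Format one VH/VL pair with the same expression used for `sequences`."""
    return "1" + " ".join(heavy) + " ".join(light) + "2"


# Toy chains, far shorter than real variable domains
toy = format_paired_sequence("EVQ", "DIQ")
print(toy)  # 1E V QD I Q2
```

Printing one formatted sequence and the tokens the tokenizer returns for it is a cheap way to confirm the input looks the way the model expects before extracting embeddings.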

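4. Model Training and Evaluation

4.1 Extracting Embedding Features

A GPU speeds up feature extraction considerably: for 242 sequences, extraction took about 60 seconds on a CPU versus 1.1 seconds on a GPU. The last hidden layer is averaged over the token axis, which gives one fixed-length vector per antibody.

import torch
from tqdm.auto import tqdm
from transformers import AutoModelForCausalLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained("ollieturnbull/p-IgGen").to(device)

batch_size = 16
mean_pooled_embeddings = []
for i in tqdm(range(0, len(sequences), batch_size)):
    batch = tokenizer(
        sequences[i:i + batch_size],
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(
            batch["input_ids"].to(device),
            return_rep_layers=[-1],
            output_hidden_states=True,
        )
    # Last hidden layer, shape (batch, tokens, hidden_size)
    embeddings = outputs["hidden_states"][-1].detach().cpu().numpy()
    # Average over the token axis; padded positions are included in this mean
    mean_pooled_embeddings.append(embeddings.mean(axis=1))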
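Because `padding=True` pads the shorter sequences in each batch, a plain mean over the token axis also averages the padded positions. One common alternative is to weight positions by the tokenizer's attention mask. The sketch below shows the arithmetic on plain Python lists so it runs without a model. `masked_mean_pool` is a hypothetical helper; in the real pipeline the same idea would be applied to the hidden-state array and `batch["attention_mask"]`, assuming the tokenizer returns one.

```python
def masked_mean_pool(hidden, mask):
    """Average token vectors per sequence, skipping positions where mask is 0.

    hidden: list of sequences, each a list of token vectors (lists of floats)
    mask:   list of sequences, each a list of 0/1 flags (1 = real token)
    """
    pooled = []
    for token_vectors, flags in zip(hidden, mask):
        kept = [vec for vec, flag in zip(token_vectors, flags) if flag]
        if not kept:
            raise ValueError("sequence has no unmasked tokens")
        dim = len(kept[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(dim)])
    return pooled


# Two sequences padded to length 3; the second has one padded position.
hidden = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 2.0], [4.0, 4.0], [0.0, 0.0]],
]
mask = [[1, 1, 1], [1, 1, 0]]
print(masked_mean_pool(hidden, mask))  # [[3.0, 4.0], [3.0, 3.0]]
```

With an unmasked mean the second sequence would pool to [2.0, 2.0], so the padded zero vector visibly drags the average down. Whether this changes downstream accuracy on GDPa1 is an open question that only a comparison run can answer.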

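4.2 Training a Ridge Regression Model

We choose ridge regression over ordinary linear regression because it copes better with collinearity among features. That matters here: the embeddings are high-dimensional and correlated, while the dataset holds only a few hundred antibodies.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X = np.concatenate(mean_pooled_embeddings)
y = df[target].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lm = Ridge()  # default regularization strength, alpha=1.0
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)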
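For reference, ridge regression differs from ordinary least squares by a single penalty term. Write $X$ for the embedding matrix, $y$ for the measured values, $w$ for the weights and $\alpha$ for the regularization strength (scikit-learn's `Ridge` defaults to $\alpha = 1$), and ignore the unpenalized intercept for simplicity:

```latex
\hat{w} = \arg\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\quad\Longrightarrow\quad
\hat{w} = \left( X^\top X + \alpha I \right)^{-1} X^\top y
```

When features are collinear, or when there are more embedding dimensions than antibodies, $X^\top X$ is singular or nearly so. Adding $\alpha I$ with $\alpha > 0$ keeps the system invertible and shrinks the weights toward zero.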

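4.3 Choosing an Evaluation Metric

We use the Spearman rank correlation rather than the Pearson correlation, because it:

Does not assume a linear relationship, only a monotonic one
Is more robust to outliers
Suits the nonlinear relationships that are common in biological data

from scipy.stats import spearmanr
import matplotlib.pyplot as plt
import seaborn as sns

rho = spearmanr(y_pred, y_test).statistic
sns.scatterplot(x=y_test, y=y_pred)
plt.title(f"Predicted vs. measured\nSpearman correlation: {rho:.2f}")
plt.xlabel(f"Measured {target}")
plt.ylabel(f"Predicted {target}")
plt.show()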
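To make the difference concrete, here is a small standard-library sketch (not the SciPy implementation) that computes Spearman's rho as the Pearson correlation of ranks, with tied values sharing their average rank. The helper names and the toy data are illustrative.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


measured = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [v ** 3 for v in measured]  # monotonic but strongly nonlinear
print(round(spearman(measured, predicted), 3))  # 1.0
print(pearson(measured, predicted) < 1.0)       # True
```

A model that ranks every antibody correctly scores a perfect rho even when its predictions are on a distorted scale, which fits the usual goal of prioritizing candidates.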

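5. Advanced Validation Strategy

5.1 Isotype-Stratified Cross-Validation

A plain random split tends to overestimate test performance, because the training and test sets can contain highly similar antibody sequences. We instead use cross-validation folds that are built from antibody clusters and stratified by isotype. The dataset supplies these folds as a column:

fold_col = "hierarchical_cluster_IgG_isotype_stratified_fold"
fold_values = df[fold_col].to_numpy()
unique_folds = [f for f in np.unique(fold_values) if f == f]  # NaN != NaN, so this drops NaN

per_fold_stats = []
y_pred_all = np.full(len(df), np.nan)
y_true_all = np.full(len(df), np.nan)

for f in unique_folds:
    # Hold out fold f; rows with no fold label (NaN) always land on the training side
    test_idx = np.where(fold_values == f)[0]
    train_idx = np.where(fold_values != f)[0]
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    lm = Ridge()
    lm.fit(X_train, y_train)
    y_pred = lm.predict(X_test)

    y_pred_all[test_idx] = y_pred
    y_true_all[test_idx] = y_test
    rho = spearmanr(y_test, y_pred).statistic
    per_fold_stats.append((int(f), rho, len(y_test)))

# Pooled correlation over every antibody that was held out
mask = ~np.isnan(y_true_all)
overall_rho = spearmanr(y_true_all[mask], y_pred_all[mask]).statistic
print(per_fold_stats, overall_rho)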
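The splitting logic in that loop can be checked in isolation. The following sketch is a toy, standard-library version of leave-one-fold-out splitting; `leave_one_fold_out` and the label list are made up for illustration, with `None` standing in for a missing fold label.

```python
def leave_one_fold_out(fold_labels):
    """Yield (fold, train_indices, test_indices) for each labeled fold.

    Rows whose label is None are never tested and stay in every training set,
    mirroring how the cross-validation loop treats NaN fold labels.
    """
    folds = sorted({f for f in fold_labels if f is not None})
    for fold in folds:
        test_idx = [i for i, f in enumerate(fold_labels) if f == fold]
        train_idx = [i for i, f in enumerate(fold_labels) if f != fold]
        yield fold, train_idx, test_idx


labels = [0, 1, 0, 2, None, 1]
splits = list(leave_one_fold_out(labels))
for fold, train_idx, test_idx in splits:
    # No index may appear on both sides of a split
    assert not set(train_idx) & set(test_idx)
    print(fold, train_idx, test_idx)
# 0 [1, 3, 4, 5] [0, 2]
# 1 [0, 2, 3, 4] [1, 5]
# 2 [0, 1, 2, 4, 5] [3]
```

Each labeled row is tested exactly once, so the pooled predictions cover the labeled set without any row being scored by a model that saw it during training.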

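5.2 Handling Isotype Effects

The IgG subclasses (IgG1, IgG2, IgG4) differ systematically in CH2 domain stability. Suggestions:

Add the isotype to the model as an extra feature
Standardize the target within each isotype
Use isotype-stratified sampling so that training and test sets have matching distributions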
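As an illustration of the second suggestion, here is a minimal standard-library sketch of within-group standardization. The function name, the Tm2 values and the isotype labels are all hypothetical toy inputs, not values from GDPa1.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_within_groups(values, groups):
    """Z-score each value against the mean and std of its own group."""
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    stats = {g: (mean(v), pstdev(v)) for g, v in by_group.items()}
    result = []
    for value, group in zip(values, groups):
        mu, sigma = stats[group]
        # A group with zero spread carries no within-group signal
        result.append(0.0 if sigma == 0 else (value - mu) / sigma)
    return result


# Hypothetical Tm2 values (degrees C) and isotype labels
tm2 = [70.0, 72.0, 74.0, 64.0, 66.0]
isotype = ["IgG1", "IgG1", "IgG1", "IgG4", "IgG4"]
z = standardize_within_groups(tm2, isotype)
print([round(v, 3) for v in z])  # [-1.225, 0.0, 1.225, -1.0, 1.0]
```

In a real pipeline the group means and standard deviations should come from the training fold only and then be applied to the test fold, otherwise test information leaks into training. Predictions made on the standardized scale also need to be mapped back before they are compared with raw measurements.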

6. Practical Use and Submission

6.1 Generating Predictions for the Test Set

testset_df = pd.read_csv("heldout-set-sequences.csv")
testset_sequences = [
    "1" + " ".join(heavy) + " ".join(light) + "2"
    for heavy, light in zip(
        testset_df["vh_protein_sequence"], testset_df["vl_protein_sequence"]
    )
]

testset_embeddings = []
for i in tqdm(range(0, len(testset_sequences), batch_size)):
    batch = tokenizer(
        testset_sequences[i:i + batch_size],
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    with torch.no_grad():
        outputs = model(
            batch["input_ids"].to(device),
            return_rep_layers=[-1],
            output_hidden_states=True,
        )
    embeddings = outputs["hidden_states"][-1].detach().cpu().numpy()
    testset_embeddings.append(embeddings.mean(axis=1))
testset_embeddings = np.concatenate(testset_embeddings)

# Refit on every labeled antibody; after cross-validation, lm would otherwise be the last fold's model
lm = Ridge()
lm.fit(X, y)
testset_y = lm.predict(testset_embeddings)

6.2 Submission Format

The submission file should contain the antibody name, the VH and VL sequences, and the predicted value:

testset_submission = testset_df[
    ["antibody_name", "vh_protein_sequence", "vl_protein_sequence"]
].copy()
testset_submission[target] = testset_y
testset_submission.to_csv("testset_submission.csv", index=False)
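Before uploading, it can help to sanity-check the file. The sketch below is a hypothetical checker, not an official validator: it only verifies the columns named in this section plus the target column, and that every prediction is a finite number. Any further requirements of the actual submission system are not covered.

```python
import csv
import io
import math

REQUIRED = ["antibody_name", "vh_protein_sequence", "vl_protein_sequence"]


def check_submission(csv_text, target):
    """Return a list of problems found in a submission CSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED + [target] if c not in header]
    if problems:
        return problems
    # Line 1 is the header, so data rows start at line 2
    for line_no, row in enumerate(reader, start=2):
        try:
            value = float(row[target])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: non-numeric {target}")
            continue
        if not math.isfinite(value):
            problems.append(f"line {line_no}: {target} is not finite")
    return problems


good = "antibody_name,vh_protein_sequence,vl_protein_sequence,HIC\nab1,EVQ,DIQ,2.5\n"
bad = "antibody_name,vh_protein_sequence,vl_protein_sequence,HIC\nab1,EVQ,DIQ,\n"
print(check_submission(good, "HIC"))  # []
print(check_submission(bad, "HIC"))   # ['line 2: non-numeric HIC']
```

To check the real file, read it with `open("testset_submission.csv").read()` and pass the text together with the same `target` used for training.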