
Batch Mapping PubChem SIDs/CIDs to SMILES with a Python Script for Data Enrichment

news 2026/7/22 14:53:58

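1. Why fetch SMILES data in bulk

In drug discovery and bioinformatics, a SMILES string works like an ID number for a compound. It is a notation that encodes molecular structure in ASCII characters, so software can parse a compound's structural features quickly. Imagine you have PubChem SIDs or CIDs for 5,000 compounds and need to attach the matching SMILES to each one: doing it by hand is slow and error-prone.

I ran into exactly this on an antiviral screening project last year. The team had collected CIDs for more than 3,000 potentially active compounds, but the original data table had no structural information. At 30 seconds per compound, doing it manually would have taken at least 25 hours of nonstop work (3,000 × 30 s = 90,000 s). With a Python script doing the lookups in batches, it took under 2 hours including debugging.

SMILES data matters for the analysis that follows. Molecular docking needs SMILES to generate 3D structures, and QSAR modeling uses SMILES as the basis for computing molecular descriptors. Without this structural information, much computational chemistry work cannot begin.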

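2. Preparation and environment setup

2.1 Installing the required Python libraries

Good work needs good tools. We need three:

pandas: data handling
requests: HTTP requests
tqdm (optional): progress bars

Installation is a single command: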

pip install pandas requests tqdm

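2.2 Understanding PubChem's API

PubChem offers several ways to get data. We mainly use two:

PUG REST API: suited to small and medium query volumes (this article sends at most 100 CIDs per request)
FTP bulk download: suited to very large volumes (subsets of the whole database)

A practical tip: test the API call in a browser first. For example, open this link: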

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/property/CanonicalSMILES/JSON

It returns the SMILES data for CID 2244, aspirin.
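The URL pattern and the shape of the JSON reply can be sketched without touching the network. The helper names `build_property_url` and `extract_property` are my own, and the sample reply is hand-written in the shape the code later in this article expects, not fetched live.

```python
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(cid, prop="CanonicalSMILES"):
    """Return the PUG REST URL for one property of one compound."""
    return f"{PUG_REST}/compound/cid/{int(cid)}/property/{prop}/JSON"


def extract_property(payload, prop="CanonicalSMILES"):
    """Return the requested property from a decoded reply, or None if absent."""
    rows = payload.get("PropertyTable", {}).get("Properties", [])
    if not rows:
        return None
    return rows[0].get(prop)


# Hand-written reply (assumed shape), using aspirin as in the link above.
sample_reply = {
    "PropertyTable": {
        "Properties": [
            {"CID": 2244, "CanonicalSMILES": "CC(=O)OC1=CC=CC=C1C(=O)O"}
        ]
    }
}

print(build_property_url(2244))
print(extract_property(sample_reply))
```

Keeping URL construction and reply parsing in separate small functions makes both easy to check before any real request is sent.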

3. Core implementation

3.1 Single-query function

Start with the basic operation, fetching the SMILES for a CID:

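    import requests

    def get_smiles_by_cid(cid):
        """Return the SMILES string for one CID, or None if the lookup fails."""
        base_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
        try:
            response = requests.get(
                f"{base_url}/compound/cid/{cid}/property/CanonicalSMILES/JSON",
                timeout=10,
            )
            response.raise_for_status()  # turn HTTP errors such as 404 or 503 into exceptions
            props = response.json()["PropertyTable"]["Properties"][0]
            # Assumption: newer PubChem replies may label this field
            # "ConnectivitySMILES", so accept either key.
            return props.get("CanonicalSMILES") or props.get("ConnectivitySMILES")
        except Exception as e:
            print(f"Error fetching CID {cid}: {e}")
            return None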

A few key points about this function:

A 10-second timeout avoids long waits
try-except catches network errors
A return value of None signals a failed lookup
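The function above starts from a CID. When the input is a SID instead, one extra lookup is needed first. The sketch below assumes PUG REST's `substance/sid/<sid>/cids/JSON` endpoint and an `InformationList` reply shape; the helper names are mine, and the SID and CID values in the sample are placeholders, not real records.

```python
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_sid_to_cid_url(sid):
    """Return the PUG REST URL that lists the CIDs linked to one SID."""
    return f"{PUG_REST}/substance/sid/{int(sid)}/cids/JSON"


def parse_sid_to_cids(payload):
    """Map each SID in a decoded reply to its list of CIDs (possibly empty)."""
    mapping = {}
    for info in payload.get("InformationList", {}).get("Information", []):
        # A substance with no linked compound has no "CID" entry at all.
        mapping[info["SID"]] = list(info.get("CID", []))
    return mapping


# Hand-written reply with placeholder identifiers (assumed shape).
sample_reply = {
    "InformationList": {
        "Information": [
            {"SID": 111, "CID": [2244]},
            {"SID": 222},
        ]
    }
}

print(build_sid_to_cid_url(111))
print(parse_sid_to_cids(sample_reply))
```

SIDs that come back with an empty list have no linked compound record, so they are worth logging separately instead of being passed on to the SMILES lookup.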

3.2 Batch query optimization

Looping over single queries is too slow. I recommend using PubChem's batch query support instead:

    import requests

    def get_batch_smiles(cid_list, chunk_size=100):
        """Look up SMILES for many CIDs, sending chunk_size CIDs per POST request."""
        base_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
        all_results = {}
        for i in range(0, len(cid_list), chunk_size):
            chunk = cid_list[i:i + chunk_size]
            cid_string = ",".join(map(str, chunk))
            try:
                # POST keeps the CID list out of the URL
                response = requests.post(
                    f"{base_url}/compound/cid/property/CanonicalSMILES/JSON",
                    data={"cid": cid_string},
                    timeout=30,
                )
                response.raise_for_status()
                data = response.json()
                for item in data["PropertyTable"]["Properties"]:
                    # Assumption: the field may be named "ConnectivitySMILES" in newer replies
                    smiles = item.get("CanonicalSMILES") or item.get("ConnectivitySMILES")
                    all_results[item["CID"]] = smiles
            except Exception as e:
                print(f"Error in batch {i // chunk_size}: {e}")
        return all_results

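The optimizations here:

Processing in chunks (100 CIDs per chunk by default)
Using POST requests so the URL does not get too long
Merging all results into a single dictionary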
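The chunking and merging steps listed above can be pulled out into small helpers and checked offline. The names `chunked`, `build_cid_form`, and `missing_cids` are hypothetical; the last one is an addition of mine for spotting CIDs that never came back.

```python
def chunked(items, size):
    """Yield consecutive slices of items, each at most size long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_cid_form(chunk):
    """Return the form body for one POST request: a comma-separated CID list."""
    return {"cid": ",".join(str(int(c)) for c in chunk)}


def missing_cids(requested, results):
    """Return the requested CIDs that are absent from the merged results."""
    return [c for c in requested if c not in results]


cids = [2244, 2519, 3672, 5090, 1983]
for chunk in chunked(cids, 2):
    print(build_cid_form(chunk))

print(missing_cids(cids, {2244: "CC(=O)OC1=CC=CC=C1C(=O)O"}))
```

Comparing the requested list against the merged dictionary is a cheap way to see which chunks failed, since a failed batch leaves no trace in the results.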

4. Data enrichment in practice

4.1 Processing the raw data file

Suppose we have a CSV file "compounds.csv" with the following structure:

CID	Name	MW
2244	Aspirin	180.16

The processing script:

    import pandas as pd
    from tqdm import tqdm  # used by the later snippets

    def enhance_data(input_file, output_file):
        df = pd.read_csv(input_file)

        # Make sure the CID column exists
        if "CID" not in df.columns:
            raise ValueError("CID column not found in input file")

        # Collect the distinct CIDs
        cid_list = df["CID"].unique().tolist()

        # Batch query
        print("Fetching SMILES from PubChem...")
        smiles_dict = get_batch_smiles(cid_list)

        # Add the SMILES column
        df["SMILES"] = df["CID"].map(smiles_dict)

        # Save the result
        df.to_csv(output_file, index=False)
        print(f"Results saved to {output_file}")

        # Report the success rate
        success_rate = df["SMILES"].notna().mean()
        print(f"Success rate: {success_rate:.2%}")
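One thing that can trip up `enhance_data` is the type of the CID column: if any cell is empty, pandas reads the column as floats, and a value like 2244.0 no longer matches the integer keys in the result dictionary. A minimal sketch of a cleaning step follows; `normalize_cid` is a hypothetical helper, not part of the script above.

```python
import math


def normalize_cid(value):
    """Return value as a positive integer CID, or None if it cannot be one."""
    if value is None:
        return None
    if isinstance(value, float):
        # Reject NaN and fractional values; accept whole floats such as 2244.0.
        if math.isnan(value) or not value.is_integer():
            return None
        value = int(value)
    try:
        cid = int(str(value).strip())
    except ValueError:
        return None
    return cid if cid > 0 else None


for raw in ["2244", 2244.0, " 2244 ", float("nan"), "abc", 0]:
    print(repr(raw), "->", normalize_cid(raw))
```

Applying something like this to the CID values before building the request list keeps bad cells out of the query and makes the later `map` step line up.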

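4.2 Error handling and retries

Network requests will fail sometimes, so solid error handling matters. My improved version includes:

Automatic retries for failed requests
A record of failed CIDs for later processing
Progress display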

    import time
    from tqdm import tqdm

    def robust_get_smiles(cid_list, max_retries=3):
        results = {}
        failed_cids = []
        for cid in tqdm(cid_list, desc="Processing CIDs"):
            for attempt in range(max_retries):
                # get_smiles_by_cid catches its own exceptions and returns None on failure
                smiles = get_smiles_by_cid(cid)
                if smiles:
                    results[cid] = smiles
                    break
                time.sleep(0.5 * (attempt + 1))  # brief pause before the next attempt
            else:
                # Every attempt failed: record the CID for later processing
                failed_cids.append(cid)
        if failed_cids:
            print(f"Failed to fetch {len(failed_cids)} CIDs")
            with open("failed_cids.txt", "w") as f:
                f.write("\n".join(map(str, failed_cids)))
        return results
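The retry idea generalizes to any lookup function. Here is a minimal sketch with exponential backoff, which is my substitution for a fixed retry loop. `retry_with_backoff` and its parameters are hypothetical names, and the `sleep` argument exists only so the waits can be swapped out when testing.

```python
import time


def retry_with_backoff(func, *args, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call func(*args) until it returns a non-None value or retries run out."""
    for attempt in range(max_retries):
        try:
            result = func(*args)
        except Exception:
            result = None  # treat an exception like a failed lookup
        if result is not None:
            return result
        if attempt < max_retries - 1:
            # Wait 0.5 s, 1 s, 2 s, ... between attempts.
            sleep(base_delay * 2 ** attempt)
    return None


# Stand-in for a flaky lookup: fails twice, then succeeds.
attempts = []

def flaky_lookup(cid):
    attempts.append(cid)
    return None if len(attempts) < 3 else "CC(=O)OC1=CC=CC=C1C(=O)O"

print(retry_with_backoff(flaky_lookup, 2244, base_delay=0.01))
```

Doubling the wait gives a busy server progressively more room to recover, at the cost of a longer worst-case run time per CID.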

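5. Advanced techniques and performance tuning

5.1 Multithreading

With tens of thousands of records, a single thread is too slow. I implemented a multithreaded version with concurrent.futures: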

    from concurrent.futures import ThreadPoolExecutor
    from tqdm import tqdm

    def threaded_batch_query(cid_list, workers=8):
        with ThreadPoolExecutor(max_workers=workers) as executor:
            # executor.map yields results in the same order as the input
            results = list(tqdm(
                executor.map(get_smiles_by_cid, cid_list),
                total=len(cid_list),
            ))
        return dict(zip(cid_list, results))

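Points to note:

Use 4 to 8 threads. PubChem's usage policy allows at most 5 requests per second, so adding more threads mostly gets you throttled
tqdm shows progress
Results stay aligned with the input CID list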
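Thread count alone does not bound the request rate, so a shared limiter is one way to keep a pool of workers under PubChem's limit of 5 requests per second. This is a minimal sketch; the `RateLimiter` class is hypothetical, and its `clock` and `sleep` parameters exist only so the timing can be tested without real waiting.

```python
import threading
import time


class RateLimiter:
    """Space out calls so that at most max_per_second start each second."""

    def __init__(self, max_per_second=5, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def wait(self):
        """Block until the time slot reserved for this caller arrives."""
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_slot)
            # Reserve the following slot for whichever thread calls next.
            self._next_slot = slot + self._interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)


limiter = RateLimiter(max_per_second=5)
start = time.monotonic()
for _ in range(3):
    limiter.wait()  # a worker would send its request right after this
print(f"3 slots took {time.monotonic() - start:.2f} s")
```

Each worker would call `limiter.wait()` just before sending its request. The lock covers only the slot bookkeeping, so threads sleep in parallel instead of queuing behind one another.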

5.2 Local caching

Repeatedly querying the same compound wastes resources. I added a local cache:

    import json
    from pathlib import Path

    class SmilesCache:
        def __init__(self, cache_file="smiles_cache.json"):
            self.cache_file = Path(cache_file)
            self.cache = self._load_cache()

        def _load_cache(self):
            if self.cache_file.exists():
                with open(self.cache_file) as f:
                    return json.load(f)
            return {}

        def save_cache(self):
            with open(self.cache_file, "w") as f:
                json.dump(self.cache, f)

        def get_smiles(self, cid):
            cid = str(cid)
            if cid in self.cache:
                return self.cache[cid]
            smiles = get_smiles_by_cid(cid)
            if smiles:
                self.cache[cid] = smiles
            return smiles

Usage:

    cache = SmilesCache()
    smiles = cache.get_smiles(2244)  # first call goes to the network
    smiles = cache.get_smiles(2244)  # second call is served from the cache
    cache.save_cache()               # save before exiting
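Because the cache is only written at exit, a crash in the middle of the write could leave a truncated JSON file behind. One possible safeguard, sketched here under my own assumptions (the function `save_cache_atomically` is hypothetical), is to write to a temporary file and then swap it into place.

```python
import json
import os
import tempfile
from pathlib import Path


def save_cache_atomically(cache, cache_file):
    """Write cache as JSON to a temp file, then swap it into place."""
    cache_file = Path(cache_file)
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=cache_file.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(cache, f)
        os.replace(tmp_name, cache_file)
    except BaseException:
        os.unlink(tmp_name)  # do not leave a stray temp file behind
        raise


with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "smiles_cache.json"
    save_cache_atomically({"2244": "CC(=O)OC1=CC=CC=C1C(=O)O"}, target)
    print(target.read_text())
```

With this approach a reader of the cache file sees either the old complete file or the new complete file, never a half-written one.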

6. Complete project structure

After several iterations I packaged this functionality as a reusable Python package with the following layout:

    pubchem_smiles/
    ├── __init__.py
    ├── api.py      # core API calls
    ├── cache.py    # cache management
    ├── cli.py      # command-line interface
    ├── utils.py    # utility functions
    └── tests/      # unit tests

Typical usage:

    from pubchem_smiles import SmilesEnhancer

    enhancer = SmilesEnhancer(cache_enabled=True)
    df = enhancer.enhance_dataframe(df, id_column="CID")

Or run it directly from the command line:

python -m pubchem_smiles.cli -i input.csv -o output.csv --cid-column CID
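The contents of cli.py are not shown here, so the following is only a guess at how the argument parsing for that command might look with argparse. The short flags and `--cid-column` come from the command above; the long forms `--input` and `--output`, the default column name, and the help texts are my assumptions.

```python
import argparse


def build_parser():
    """Return a parser for the -i / -o / --cid-column options shown above."""
    parser = argparse.ArgumentParser(
        prog="pubchem_smiles.cli",
        description="Add a SMILES column to a CSV file of PubChem CIDs.",
    )
    parser.add_argument("-i", "--input", required=True, help="input CSV file")
    parser.add_argument("-o", "--output", required=True, help="output CSV file")
    parser.add_argument(
        "--cid-column", default="CID", help="name of the column holding CIDs"
    )
    return parser


# Parse the same arguments as the command line above.
args = build_parser().parse_args(
    ["-i", "input.csv", "-o", "output.csv", "--cid-column", "CID"]
)
print(args.input, args.output, args.cid_column)
```

argparse converts `--cid-column` to the attribute name `cid_column` on its own, so no `dest` argument is needed.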

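7. Troubleshooting common problems

In practice I have run into a few typical problems:

Timeouts: increase the timeout parameter and add retry logic
Nonexistent CIDs: validate CIDs first, for example with get_compounds() from the PubChemPy library
API rate limiting: add a delay (e.g. time.sleep(0.1)) to avoid rapid-fire requests
Inconsistent SMILES formats: use CanonicalSMILES throughout rather than IsomericSMILES, keeping in mind that CanonicalSMILES carries no stereochemistry or isotope information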
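Several of the problems above can be told apart by HTTP status code before deciding what to do next. The mapping below is my reading of the PUG REST status codes (404 for an unknown record, 503 for a busy server, 504 for a server-side timeout); the function name and category labels are hypothetical.

```python
RETRYABLE_STATUSES = {503, 504}  # server busy, server-side timeout


def classify_status(status_code):
    """Sort an HTTP status code into ok / not_found / retry / error."""
    if status_code == 200:
        return "ok"
    if status_code == 404:
        return "not_found"  # the CID does not exist, so retrying will not help
    if status_code in RETRYABLE_STATUSES:
        return "retry"      # temporary condition, worth another attempt later
    return "error"          # e.g. 400 for a malformed request


for code in (200, 404, 503, 504, 400):
    print(code, classify_status(code))
```

Separating "not found" from "retry" avoids burning retries on CIDs that will never resolve.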

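One important caveat: PubChem's API has usage limits. Its usage policy allows at most 5 requests per second and 400 requests per minute. For large-scale queries, I suggest:

Contacting PubChem about high-volume needs (as far as I know, PUG REST has no API key that raises the quota)
Downloading the full dataset over FTP
Processing in batches with delays in between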
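PubChem also reports how close a client is to being throttled in a response header, which can drive the delay between batches. The sketch below assumes the header is named `X-Throttling-Control` and carries text of the form shown in the sample string, which I wrote from memory of the documentation; check both against a real response. The pause lengths are arbitrary values of mine.

```python
import re

_STATUS_RE = re.compile(
    r"(Request Count|Request Time|Service) status:\s*(\w+)\s*\((\d+)%\)"
)

# Hypothetical pause lengths per colour; tune them for your own workload.
PAUSE_SECONDS = {"Green": 0.0, "Yellow": 1.0, "Red": 5.0, "Black": 60.0}


def parse_throttling_header(value):
    """Return {status name: (colour, percent)} parsed from the header text."""
    return {
        name: (colour, int(percent))
        for name, colour, percent in _STATUS_RE.findall(value or "")
    }


def suggested_pause(statuses):
    """Return the longest pause called for by any reported status."""
    pauses = (PAUSE_SECONDS.get(colour, 5.0) for colour, _ in statuses.values())
    return max(pauses, default=0.0)


# Assumed header format, written by hand for illustration.
sample_header = (
    "Request Count status: Green (0%), "
    "Request Time status: Yellow (55%), "
    "Service status: Green (20%)"
)
statuses = parse_throttling_header(sample_header)
print(statuses)
print(suggested_pause(statuses))
```

With requests, the raw value would come from something like `response.headers.get("X-Throttling-Control")`. Sleeping for `suggested_pause(...)` between batches slows the script down only when the server says it needs to.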

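8. Further applications

The approach is not limited to fetching SMILES. With small changes it can be used for:

Molecular descriptor calculation: fetching physicochemical properties such as LogP and TPSA
Compound classification: fetching MeSH terms or ChEBI classifications
Cross-database mapping: converting PubChem CIDs to ChEMBL IDs
Literature links: fetching research papers associated with a compound

For example, to fetch SMILES and molecular weight together: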

    import requests

    def get_multiple_properties(cid):
        # URL shown abbreviated; the full form follows the pattern from section 2.2
        url = f"https://pubchem.../property/MolecularWeight,CanonicalSMILES/JSON"
        response = requests.get(url)
        data = response.json()
        props = data["PropertyTable"]["Properties"][0]
        return {
            "MW": props["MolecularWeight"],
            "SMILES": props["CanonicalSMILES"],
        }
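A fuller version of the same idea, covering several CIDs and several properties in one request, might look like the sketch below. It reuses the base URL from section 2.2; the helper names are hypothetical, and the sample reply is hand-written. I have assumed that MolecularWeight arrives as a string, so it is converted with `float()` before use.

```python
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_properties_url(cids, props):
    """Return one PUG REST URL asking for several properties of several CIDs."""
    cid_part = ",".join(str(int(c)) for c in cids)
    return f"{PUG_REST}/compound/cid/{cid_part}/property/{','.join(props)}/JSON"


def parse_property_table(payload, props):
    """Return {CID: {property: value}} from a decoded reply."""
    table = {}
    for row in payload.get("PropertyTable", {}).get("Properties", []):
        table[row["CID"]] = {p: row.get(p) for p in props}
    return table


wanted = ["MolecularWeight", "CanonicalSMILES"]

# Hand-written reply (assumed shape); the values are those of aspirin.
sample_reply = {
    "PropertyTable": {
        "Properties": [
            {
                "CID": 2244,
                "MolecularWeight": "180.16",
                "CanonicalSMILES": "CC(=O)OC1=CC=CC=C1C(=O)O",
            }
        ]
    }
}

print(build_properties_url([2244], wanted))
table = parse_property_table(sample_reply, wanted)
print(float(table[2244]["MolecularWeight"]), table[2244]["CanonicalSMILES"])
```

Using `row.get(p)` means a property missing from the reply shows up as None instead of raising, which keeps one incomplete record from aborting the whole batch.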