
Question Answering Systems: From Retrieval to Generative Models

1. Technical Analysis

1.1 Types of QA Systems

QA systems can be divided into several types:

  • Retrieval-based: look up answers in a knowledge base
  • Extractive: extract an answer span from a given text
  • Generative: generate the answer directly
  • Multimodal: combine text and visual inputs

1.2 Architecture Comparison

| Type       | Architecture | Characteristics | Representative Model |
|------------|--------------|-----------------|----------------------|
| Retrieval  | TF-IDF/BM25  | Simple and fast | Elasticsearch        |
| Extractive | BERT         | Accurate        | BERT-QA              |
| Generative | T5/GPT       | Flexible        | T5-QA                |
| Multimodal | ViLT         | Multimodal      | ViLT-QA              |

1.3 QA Task Types

Common benchmarks illustrate the main task types:

  • SQuAD: extractive QA
  • HotpotQA: multi-hop QA
  • TriviaQA: open-domain QA
  • VQA: visual question answering
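For concreteness, here is what one extractive (SQuAD-style) training record looks like. The field names follow the public SQuAD JSON layout; the passage and question themselves are invented for illustration:

```python
# One SQuAD-style record: the answer is a span of the context,
# located by its character offset (answer_start).
example = {
    "context": "The Transformer architecture was introduced in 2017.",
    "qas": [
        {
            "id": "q1",
            "question": "When was the Transformer architecture introduced?",
            "answers": [{"text": "2017", "answer_start": 47}],
        }
    ],
}

# Sanity check: the offset really points at the answer text.
ans = example["qas"][0]["answers"][0]
start = ans["answer_start"]
assert example["context"][start:start + len(ans["text"])] == ans["text"]
```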

2. Core Implementation

2.1 Retrieval-Based QA

```python
import torch
import torch.nn as nn
import numpy as np
from rank_bm25 import BM25Okapi


class RetrievalQA:
    """Sparse retrieval: rank documents against the query with BM25."""

    def __init__(self, documents):
        self.documents = documents
        self.tokenized_docs = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(self.tokenized_docs)

    def retrieve(self, query, top_k=5):
        tokenized_query = query.lower().split()
        scores = self.bm25.get_scores(tokenized_query)
        top_indices = np.argsort(scores)[::-1][:top_k]  # highest score first
        return [(self.documents[i], scores[i]) for i in top_indices]

    def answer(self, query, top_k=1):
        results = self.retrieve(query, top_k)
        return results[0][0] if results else None


class DenseRetrieval(nn.Module):
    """Dense retrieval: rank documents by dot product of [CLS] embeddings."""

    def __init__(self, model_name='bert-base-uncased'):
        super().__init__()
        from transformers import BertModel, BertTokenizer
        self.model = BertModel.from_pretrained(model_name)
        self.tokenizer = BertTokenizer.from_pretrained(model_name)

    def encode(self, texts):
        inputs = self.tokenizer(
            texts, padding=True, truncation=True,
            max_length=512, return_tensors='pt'
        )
        outputs = self.model(**inputs)
        # Use the [CLS] token's hidden state as the text embedding.
        return outputs.last_hidden_state[:, 0, :]

    def retrieve(self, query, documents, top_k=5):
        with torch.no_grad():  # inference only, no gradients needed
            query_embedding = self.encode([query])
            doc_embeddings = self.encode(documents)
        scores = torch.matmul(query_embedding, doc_embeddings.T).squeeze(0)
        top_indices = torch.argsort(scores, descending=True)[:top_k]
        return [(documents[i], scores[i].item()) for i in top_indices]
```
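A quick usage sketch of the BM25 retriever above, on an invented three-document corpus (assumes the `rank_bm25` package is installed):

```python
docs = [
    "BM25 is a ranking function used by search engines.",
    "BERT is a pretrained Transformer encoder.",
    "T5 frames every NLP task as text-to-text generation.",
]
qa = RetrievalQA(docs)

# Top-2 documents with their BM25 scores.
for doc, score in qa.retrieve("what is BM25", top_k=2):
    print(f"{score:.2f}  {doc}")

print(qa.answer("what is BM25"))  # -> the BM25 document
```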

2.2 Extractive QA

```python
class ExtractiveQA(nn.Module):
    """Extractive reader: predict start/end positions of the answer span."""

    def __init__(self, model_name='bert-base-uncased'):
        super().__init__()
        from transformers import BertForQuestionAnswering, BertTokenizer
        self.model = BertForQuestionAnswering.from_pretrained(model_name)
        # Load the tokenizer once instead of on every predict() call.
        self.tokenizer = BertTokenizer.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        return outputs.start_logits, outputs.end_logits

    def predict(self, question, context):
        inputs = self.tokenizer(
            question, context, padding=True, truncation=True,
            max_length=512, return_tensors='pt'
        )
        with torch.no_grad():
            start_logits, end_logits = self.forward(
                inputs['input_ids'], inputs['attention_mask']
            )
        start_idx = torch.argmax(start_logits)
        end_idx = torch.argmax(end_logits)
        if end_idx < start_idx:  # guard against an inverted span
            return ""
        tokens = self.tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
        return self.tokenizer.convert_tokens_to_string(tokens[start_idx:end_idx + 1])


class QAWithRetrieval:
    """Retrieve candidate documents, then read an answer out of each."""

    def __init__(self, documents):
        self.retriever = DenseRetrieval()
        self.extractor = ExtractiveQA()
        self.documents = documents

    def answer(self, question):
        candidates = self.retriever.retrieve(question, self.documents, top_k=3)
        for doc, _ in candidates:
            answer = self.extractor.predict(question, doc)
            if answer.strip():
                return answer
        return "No answer found"
```
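A usage sketch for the extractive reader. Note that a plain `bert-base-uncased` checkpoint has a randomly initialized QA head, so meaningful answers require a SQuAD-fine-tuned checkpoint; the example strings here are invented:

```python
# Swap in a SQuAD-fine-tuned checkpoint for real use.
reader = ExtractiveQA('bert-base-uncased')
context = "The Eiffel Tower is located in Paris and was completed in 1889."
print(reader.predict("Where is the Eiffel Tower located?", context))

# Retrieval + reading over a small document set:
pipeline = QAWithRetrieval([context, "BM25 is a ranking function."])
print(pipeline.answer("When was the Eiffel Tower completed?"))
```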

2.3 Generative QA

```python
class GenerativeQA(nn.Module):
    """Generative QA: T5 produces the answer as free-form text."""

    def __init__(self, model_name='t5-base'):
        super().__init__()
        from transformers import T5ForConditionalGeneration, T5Tokenizer
        self.model = T5ForConditionalGeneration.from_pretrained(model_name)
        self.tokenizer = T5Tokenizer.from_pretrained(model_name)

    def generate(self, question, context=None):
        # T5 expects a task prefix; context is optional (closed-book QA).
        if context:
            input_text = f"question: {question} context: {context}"
        else:
            input_text = f"question: {question}"
        inputs = self.tokenizer(
            input_text, padding=True, truncation=True,
            max_length=512, return_tensors='pt'
        )
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs, max_length=100, num_beams=5, early_stopping=True
            )
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)


class OpenDomainQA:
    """Retrieve supporting documents, then generate an answer from them."""

    def __init__(self, retriever, generator):
        self.retriever = retriever
        self.generator = generator

    def answer(self, question, documents):
        candidates = self.retriever.retrieve(question, documents, top_k=3)
        context = "\n".join([doc for doc, _ in candidates])
        return self.generator.generate(question, context)
```
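A usage sketch for the generative model. `t5-base` saw SQuAD's `question: ... context: ...` format during its multi-task pretraining, though answer quality still improves markedly with QA fine-tuning (example strings invented; `T5Tokenizer` requires the `sentencepiece` package):

```python
gen = GenerativeQA('t5-base')  # downloads weights on first run
print(gen.generate(
    "Who designed the Eiffel Tower?",
    context="The Eiffel Tower was designed by Gustave Eiffel's company.",
))

# End-to-end open-domain QA: dense retrieval supplies the context.
odqa = OpenDomainQA(DenseRetrieval(), gen)
docs = ["The Eiffel Tower was designed by Gustave Eiffel's company."]
print(odqa.answer("Who designed the Eiffel Tower?", docs))
```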

3. Performance Comparison

3.1 Comparison of QA System Types

| Type       | Accuracy | Flexibility | Training Data | Inference Speed |
|------------|----------|-------------|---------------|-----------------|
| Retrieval  | —        | —           | —             | Very fast       |
| Extractive | —        | —           | —             | —               |
| Generative | —        | Very high   | —             | —               |

3.2 Performance on Different QA Datasets

| Dataset  | Extractive | Generative | Retrieval + Generation |
|----------|------------|------------|------------------------|
| SQuAD v1 | 92%        | 88%        | 90%                    |
| SQuAD v2 | 83%        | 79%        | 81%                    |
| HotpotQA | 75%        | 72%        | 78%                    |
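The percentages above are answer-level F1 scores. The sketch below shows the token-overlap F1 that SQuAD-style evaluation is based on, in simplified form: the official script additionally strips punctuation and articles before comparing:

```python
from collections import Counter

def squad_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between two answer strings (simplified SQuAD metric)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(squad_f1("in Paris France", "Paris"))  # 0.5
```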

3.3 Effect of Model Size

| Model      | Parameters | F1  | Inference Time (ms) |
|------------|------------|-----|---------------------|
| BERT-base  | 110M       | 89% | 50                  |
| BERT-large | 340M       | 93% | 150                 |
| T5-base    | 220M       | 87% | 100                 |
| T5-large   | 770M       | 91% | 300                 |
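The parameter counts and latencies above can be sanity-checked with a short script like this sketch. Absolute timings depend heavily on hardware and sequence length, so the table's numbers are best read as relative:

```python
import time
import torch
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
model.eval()

# Parameter count: ~110M for BERT-base.
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.0f}M")

# Rough single-example latency on a 384-token input (hardware-dependent).
input_ids = torch.randint(0, 30000, (1, 384))
with torch.no_grad():
    t0 = time.perf_counter()
    model(input_ids=input_ids)
    print(f"latency: {(time.perf_counter() - t0) * 1000:.0f} ms")
```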

4. Best Practices

4.1 Choosing a QA System

```python
def select_qa_system(task_type, documents=None):
    """Dispatch on task type; documents are only needed by retrieval systems."""
    if task_type == 'retrieval':
        return RetrievalQA(documents or [])
    elif task_type == 'extractive':
        return ExtractiveQA()
    elif task_type == 'generative':
        return GenerativeQA()
    else:
        return QAWithRetrieval(documents or [])


class QASystemFactory:
    @staticmethod
    def create(config):
        if config['type'] == 'retrieval':
            return RetrievalQA(config['documents'])
        elif config['type'] == 'extractive':
            return ExtractiveQA(config['model_name'])
        elif config['type'] == 'generative':
            return GenerativeQA(config['model_name'])
        elif config['type'] == 'hybrid':
            return OpenDomainQA(
                DenseRetrieval(config['retriever_model']),
                GenerativeQA(config['generator_model'])
            )
        raise ValueError(f"unknown QA system type: {config['type']}")
```
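A usage sketch for the factory; the config keys mirror the ones `QASystemFactory.create` reads, and the model names are the defaults used earlier in this article:

```python
# Hybrid system: dense retriever + T5 generator.
config = {
    'type': 'hybrid',
    'retriever_model': 'bert-base-uncased',
    'generator_model': 't5-base',
}
qa_system = QASystemFactory.create(config)

# Extractive system over a (preferably SQuAD-fine-tuned) checkpoint:
reader = QASystemFactory.create({'type': 'extractive',
                                 'model_name': 'bert-base-uncased'})
```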

4.2 QA Training Pipeline

```python
class QATrainer:
    def __init__(self, model, optimizer, scheduler, loss_fn):
        self.model = model
        self.optimizer = optimizer
        self.scheduler = scheduler
        self.loss_fn = loss_fn

    def train_step(self, batch):
        self.optimizer.zero_grad()
        start_logits, end_logits = self.model(
            batch['input_ids'], batch['attention_mask']
        )
        # Average the cross-entropy over the start and end heads.
        loss = (self.loss_fn(start_logits, batch['start_positions'])
                + self.loss_fn(end_logits, batch['end_positions'])) / 2
        loss.backward()
        self.optimizer.step()
        self.scheduler.step()
        return loss.item()

    def evaluate(self, dataloader):
        self.model.eval()
        total_f1, n_examples = 0.0, 0
        with torch.no_grad():
            for batch in dataloader:
                start_logits, end_logits = self.model(
                    batch['input_ids'], batch['attention_mask']
                )
                start_pred = torch.argmax(start_logits, dim=1)
                end_pred = torch.argmax(end_logits, dim=1)
                for i in range(len(start_pred)):
                    # Token-overlap F1 between predicted and gold spans.
                    pred = set(range(start_pred[i].item(), end_pred[i].item() + 1))
                    gold = set(range(batch['start_positions'][i].item(),
                                     batch['end_positions'][i].item() + 1))
                    overlap = len(pred & gold)
                    precision = overlap / len(pred) if pred else 0
                    recall = overlap / len(gold) if gold else 0
                    if precision + recall > 0:
                        total_f1 += 2 * precision * recall / (precision + recall)
                    n_examples += 1
        # Average F1 per example, not per batch.
        return total_f1 / max(n_examples, 1)
```
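A sketch of wiring the trainer together, assuming batches shaped as `train_step` expects. `AdamW` plus `get_linear_schedule_with_warmup` is the usual transformers fine-tuning setup, and `CrossEntropyLoss` matches the integer start/end position targets:

```python
import torch.nn as nn
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

model = ExtractiveQA('bert-base-uncased')
optimizer = AdamW(model.parameters(), lr=3e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000
)
trainer = QATrainer(model, optimizer, scheduler, nn.CrossEntropyLoss())

# Typical loop (train_dataloader yields dicts with input_ids,
# attention_mask, start_positions, end_positions):
# for batch in train_dataloader:
#     loss = trainer.train_step(batch)
```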

5. Summary

Question answering is an important NLP application:

  1. Retrieval-based: simple and fast, suited to small knowledge bases
  2. Extractive: accurate, suited to scenarios where a context passage is provided
  3. Generative: flexible, can produce natural-language answers
  4. Hybrid: combines retrieval and generation for the best overall results

The comparison data points to a few conclusions:

  • Generative models perform better on open-domain QA
  • Extractive models are more accurate on closed-domain QA
  • A hybrid architecture is the recommended default
  • Pretrained models are the foundation of modern QA systems
