Hands-On RAG System Optimization: Engineering Practices for Retrieval-Augmented Generation
From basics to advanced techniques: a deep dive into RAG optimization strategies and hard-won lessons
Preface
RAG (Retrieval-Augmented Generation) is one of the most practical ways to build LLM applications today. The idea is simple: first retrieve relevant documents, then have the LLM generate an answer grounded in the retrieved results.
But many developers find in practice that retrieval is inaccurate, generation drifts off topic, and latency is too high. Based on my experience optimizing several RAG projects, this article shares a full-pipeline optimization approach, from vector retrieval to generation quality.
What you will learn:
- Analyzing the core bottlenecks of a RAG system
- Comparing document chunking strategies
- Vector retrieval optimization techniques
- Reranking retrieved results
- Prompt engineering for generation
- Evaluation metrics and monitoring
1. A Review of the Basic RAG Architecture
1.1 The Standard RAG Pipeline
```python
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA

# 1. Load documents
loader = PyPDFLoader("knowledge.pdf")
documents = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# 3. Embed and store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)

# 4. Retrieve + generate
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
llm = ChatOpenAI(model="gpt-4")

# Assemble the RAG chain
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever, chain_type="stuff")
result = qa_chain.invoke({"query": "What is RAG?"})
```
1.2 Common Issues
| Problem | Symptom | Cause |
|---|---|---|
| Inaccurate retrieval | Irrelevant content is returned | Poor chunking; wrong choice of embedding model |
| Lost information | Key information is cut off at chunk boundaries | chunk_size too small; no overlap |
| Hallucination | The LLM fabricates content that does not exist | Low-quality retrieval results; unconstrained prompt |
| High latency | Responses take more than 5 seconds | Vector store too large; no caching |
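For the latency row in particular, one common mitigation is caching embeddings so repeated queries skip the embedding API call entirely. A minimal pure-Python sketch; `EmbeddingCache` and `embed_fn` are illustrative names of my own, and `embed_fn` stands in for whatever embedding call your stack uses:

```python
import hashlib

class EmbeddingCache:
    """Cache query embeddings so repeated queries skip the embedding call."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # any callable: str -> list[float]
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def embed(self, text: str):
        # Hash the text so arbitrarily long queries make fixed-size keys
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = self.embed_fn(text)
        return self._cache[key]

# Usage with a dummy embedding function:
cache = EmbeddingCache(lambda t: [float(len(t))])
cache.embed("What is RAG?")
cache.embed("What is RAG?")  # second call is served from the cache
print(cache.hits, cache.misses)  # → 1 1
```

In production you would typically back this with Redis or an LRU with a size bound, but the hit/miss structure is the same.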
2. Document Chunking Optimization
2.1 Comparing Chunking Strategies
```python
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
    MarkdownHeaderTextSplitter,
    PythonCodeTextSplitter,
)

# Strategy 1: recursive character splitting (recommended default)
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", "。", "!", "?", ".", " ", ""],
)

# Strategy 2: split by headings (structured documents)
headers_to_split = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split)

# Strategy 3: code-aware splitting
code_splitter = PythonCodeTextSplitter(chunk_size=1000, chunk_overlap=100)
```
2.2 Choosing a Chunk Size
```python
# Experiment: how chunk_size affects retrieval quality
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def evaluate_chunk_sizes(docs, query, sizes=[200, 500, 1000, 2000]):
    results = {}
    for size in sizes:
        splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 10)
        chunks = splitter.split_documents(docs)

        # Similarity between the query and every chunk
        embeddings = OpenAIEmbeddings()
        query_vec = embeddings.embed_query(query)
        chunk_vecs = embeddings.embed_documents([c.page_content for c in chunks])
        similarities = cosine_similarity([query_vec], chunk_vecs)[0]

        results[size] = {
            "avg_similarity": np.mean(similarities),
            "max_similarity": np.max(similarities),
            "chunk_count": len(chunks),
        }
    return results

# Observed results (for reference only):
# chunk_size=200:  avg_sim=0.78, but information is badly fragmented
# chunk_size=500:  avg_sim=0.82, the sweet spot
# chunk_size=1000: avg_sim=0.80, more complete context
# chunk_size=2000: avg_sim=0.75, more noise
```
My rules of thumb:
- General documents: 500-800 tokens
- Technical docs / code: 800-1200 tokens
- Chat transcripts: 300-500 tokens
- Legal contracts: split by clause, not by a fixed size
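These rules of thumb can be encoded as a small helper so the choice lives in one place. The document-type labels, preset values within the ranges above, and the function name are my own illustration, not a standard API:

```python
def suggest_chunk_params(doc_type: str) -> dict:
    """Map a document type to chunk_size / chunk_overlap (in tokens),
    following the rules of thumb above; overlap is ~15% of chunk size."""
    presets = {
        "general":   {"chunk_size": 600,  "chunk_overlap": 90},
        "technical": {"chunk_size": 1000, "chunk_overlap": 150},
        "chat":      {"chunk_size": 400,  "chunk_overlap": 60},
    }
    if doc_type == "legal":
        # Legal contracts need clause-level splitting, not a fixed size
        raise ValueError("split legal documents by clause, not a fixed size")
    return presets.get(doc_type, presets["general"])

print(suggest_chunk_params("technical"))  # → {'chunk_size': 1000, 'chunk_overlap': 150}
```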
2.3 Setting the Overlap
```python
# ❌ Wrong: no overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0,  # information gets cut at chunk boundaries
)

# ❌ Wrong: overlap too large
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=400,  # 80% overlap wastes storage and tokens
)

# ✅ Right: 10-20% overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=75,  # 15% overlap
)
```
2.4 Semantic Chunking
```python
# Smart chunking based on semantic similarity
from langchain_experimental.text_splitter import SemanticChunker

semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)

# Pro: splits at semantic boundaries, keeping context coherent
# Con: requires embedding API calls, which adds cost
```
3. Choosing an Embedding Model
3.1 Comparing Mainstream Models
```python
# Model comparison (MTEB benchmark)
models = {
    "text-embedding-3-small": {"dim": 1536, "price": "$0.02 / 1M tokens"},
    "text-embedding-3-large": {"dim": 3072, "price": "$0.13 / 1M tokens"},
    "bge-large-zh-v1.5": {"dim": 1024, "price": "free, open source"},
    "m3e-base": {"dim": 768
```