
[RAG] [retrievers14] Router Retriever

news 2026/7/17 22:03:58

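Goal

This example shows how to use RouterRetriever to build a routing system that picks the most suitable retriever for each query at run time.

Routed selection: the most appropriate retriever is chosen automatically from the type and content of the query
Multiple retrievers: several retriever types are combined, including a list retriever, a vector retriever, and a keyword retriever
Single and multi selection: both a single selector (PydanticSingleSelector) and a multi selector (PydanticMultiSelector) are supported
LLM-based decisions: the selection relies on the reasoning ability of a large language model

A router retriever makes the retrieval system more flexible: each query is sent to the retrieval strategy that fits it, which can improve both efficiency and accuracy.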
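The routing idea can be sketched without any library. The block below is a minimal illustration, not LlamaIndex code: `Tool`, `route`, `toy_selector`, and the sample `DOCS` are hypothetical names, and the keyword rule in `toy_selector` stands in for the decision an LLM would make from the tool descriptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

DOCS = ["chunk about childhood", "chunk about RISD", "chunk about YC"]


@dataclass
class Tool:
    """A retriever plus the description a selector would read (hypothetical)."""
    name: str
    description: str
    retrieve: Callable[[str], List[str]]


def list_all(query: str) -> List[str]:
    """List-style retrieval: return every chunk regardless of the query."""
    return list(DOCS)


def best_match(query: str) -> List[str]:
    """Crude stand-in for vector search: first chunk sharing a word with the query."""
    words = set(query.lower().split())
    matches = [d for d in DOCS if words & set(d.lower().split())]
    return matches[:1]


def toy_selector(query: str, tools: Sequence[Tool]) -> List[int]:
    """Stand-in for the LLM: tool 0 for 'all'-style queries, otherwise tool 1."""
    return [0] if "all" in query.lower().split() else [1]


def route(query: str, tools: Sequence[Tool],
          select: Callable[[str, Sequence[Tool]], List[int]]) -> List[str]:
    """Ask the selector which tools to use, then run only those."""
    results: List[str] = []
    for index in select(query, tools):
        results.extend(tools[index].retrieve(query))
    return results


tools = [
    Tool("list", "Returns every chunk.", list_all),
    Tool("vector", "Returns the closest chunk.", best_match),
]
```

Swapping `toy_selector` for a function that returns several indices turns the same `route` function into a multi-selection router.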

Tech Stack and Core Dependencies

Core libraries

llama-index-llms-openai
llama-index

Routing and selection

RouterRetriever
PydanticSingleSelector
PydanticMultiSelector
LLMSingleSelector
LLMMultiSelector

Retrievers and indexes

RetrieverTool
VectorStoreIndex
SummaryIndex
SimpleKeywordTableIndex
SentenceSplitter

Environment Setup

Install dependencies

%pip install llama-index-llms-openai
!pip install llama-index

Configure the environment

# NOTE: This is ONLY necessary in jupyter notebook.
import nest_asyncio

nest_asyncio.apply()

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().handlers = []
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

Note: a valid OpenAI API key is required to run this example. When running in a Jupyter notebook, nest_asyncio is also needed to handle nested event loops.
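A small guard can make a missing key fail early with a clear message. The sketch below assumes the key is supplied through the `OPENAI_API_KEY` environment variable; `require_openai_key` is a hypothetical helper, not part of any library.

```python
import os
from typing import Mapping


def require_openai_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the key from OPENAI_API_KEY, or raise if it is missing or blank."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the example."
        )
    return key
```

Calling `require_openai_key()` at the top of the notebook surfaces the problem before any index is built.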

Implementation

1. Data preparation and index construction

Download Paul Graham's essay and build three different kinds of index over it:

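from llama_index.core import (
    SimpleDirectoryReader,
    SimpleKeywordTableIndex,
    StorageContext,
    SummaryIndex,
    VectorStoreIndex,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI

# Download the data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

# Load the documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

# Initialize the LLM and the splitter, then split the documents into nodes
llm = OpenAI(model="gpt-4")
splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)

# Initialize the storage context and register the nodes in its docstore
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

# Build three different kinds of index over the same nodes
summary_index = SummaryIndex(nodes, storage_context=storage_context)
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)
keyword_index = SimpleKeywordTableIndex(nodes, storage_context=storage_context)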
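To make the difference between the three index types concrete, here is a library-free sketch of the retrieval behavior each one stands for. It is a toy under stated assumptions: `CHUNKS` is made-up sample text, word counts replace real embeddings, and the function names are hypothetical rather than LlamaIndex APIs.

```python
import math
import re
from collections import Counter
from typing import Dict, List

CHUNKS = [
    "I studied painting at RISD",
    "I worked at Interleaf on software",
    "We started YC to fund startups",
]


def tokens(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def list_retrieve(query: str) -> List[str]:
    """Summary/list style: ignore the query and return every chunk."""
    return list(CHUNKS)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def vector_retrieve(query: str, top_k: int = 1) -> List[str]:
    """Vector style: rank chunks by similarity to the query and keep top_k."""
    q = Counter(tokens(query))
    ranked = sorted(
        CHUNKS, key=lambda c: cosine(q, Counter(tokens(c))), reverse=True
    )
    return ranked[:top_k]


def keyword_retrieve(query: str) -> List[str]:
    """Keyword style: return chunks that share an indexed word with the query."""
    table: Dict[str, List[str]] = {}
    for chunk in CHUNKS:
        for word in set(tokens(chunk)):
            table.setdefault(word, []).append(chunk)
    hits: List[str] = []
    for word in tokens(query):
        for chunk in table.get(word, []):
            if chunk not in hits:
                hits.append(chunk)
    return hits
```

The list style trades precision for completeness, the vector style returns only the closest matches, and the keyword style depends on exact terms appearing in both the query and the chunk.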

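2. Creating the retriever tools

Create a retriever for each index and wrap it in a RetrieverTool. The description of each tool is what the selector reads when deciding where to route a query: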

from llama_index.core.tools import RetrieverTool

# Create the retrievers
list_retriever = summary_index.as_retriever()
vector_retriever = vector_index.as_retriever()
keyword_retriever = keyword_index.as_retriever()

# Wrap each retriever in a tool with a description
list_tool = RetrieverTool.from_defaults(
    retriever=list_retriever,
    description=(
        "Will retrieve all context from Paul Graham's essay on What I Worked"
        " On. Don't use if the question only requires more specific context."
    ),
)
vector_tool = RetrieverTool.from_defaults(
    retriever=vector_retriever,
    description=(
        "Useful for retrieving specific context from Paul Graham essay on What"
        " I Worked On."
    ),
)
keyword_tool = RetrieverTool.from_defaults(
    retriever=keyword_retriever,
    description=(
        "Useful for retrieving specific context from Paul Graham essay on What"
        " I Worked On (using entities mentioned in query)"
    ),
)

3. Single-selector router retriever

Use PydanticSingleSelector to build a router retriever that picks exactly one retriever per query:

from llama_index.core.selectors import (
    PydanticMultiSelector,
    PydanticSingleSelector,
)
from llama_index.core.retrievers import RouterRetriever

# Create the single-selector router retriever
retriever = RouterRetriever(
    selector=PydanticSingleSelector.from_defaults(llm=llm),
    retriever_tools=[
        list_tool,
        vector_tool,
    ],
)

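PydanticSingleSelector uses the OpenAI function calling API to produce a structured selection object instead of parsing raw JSON, which is more reliable. At the time of the original example, this was supported on the gpt-4-0613 and gpt-3.5-turbo-0613 models.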
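The benefit of a structured selection can be sketched with the standard library alone. The block below is an illustration, not the selector's actual implementation: `SingleSelection`, its `index` and `reason` fields, and `parse_selection` are hypothetical, and the JSON string stands in for the arguments a function call would return.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SingleSelection:
    """Hypothetical schema for one routing decision."""
    index: int
    reason: str


def parse_selection(arguments_json: str, num_tools: int) -> SingleSelection:
    """Validate structured arguments instead of scraping free-form text."""
    data = json.loads(arguments_json)
    index = data["index"]
    # bool is a subclass of int, so reject it explicitly.
    if (
        not isinstance(index, int)
        or isinstance(index, bool)
        or not 0 <= index < num_tools
    ):
        raise ValueError(
            f"index must be an int in [0, {num_tools}), got {index!r}"
        )
    return SingleSelection(index=index, reason=str(data.get("reason", "")))
```

Because the schema is checked up front, an out-of-range or malformed choice is rejected before any retriever runs.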

4. Single-selector query examples

Run two different kinds of query and observe which retriever the router picks:

# Query 1: ask for all context about the author's life
nodes = retriever.retrieve(
    "Can you give me all the context regarding the author's life?"
)
# Output: Selecting retriever 0: This choice is most relevant as it mentions retrieving all context from the essay...

# Query 2: ask for a specific detail
nodes = retriever.retrieve("What did Paul Graham do after RISD?")
# Output: Selecting retriever 1: The question asks for a specific detail from Paul Graham's essay...

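Observation: for the first query the router chose the list retriever (list_tool) because the query asks for "all the context"; for the second it chose the vector retriever (vector_tool) because the query asks for a specific detail.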

5. Multi-selector router retriever

Use PydanticMultiSelector to build a router retriever that may pick more than one retriever:

# Create the multi-selector router retriever
retriever = RouterRetriever(
    selector=PydanticMultiSelector.from_defaults(llm=llm),
    retriever_tools=[list_tool, vector_tool, keyword_tool],
)

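PydanticMultiSelector can select several retrievers for one query, which is especially useful for complex queries that need information retrieved from more than one angle.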
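When several retrievers answer the same query, their results have to be combined, and the same node may come back more than once. The sketch below shows one plausible way to merge them; it is an assumption for illustration, not a description of what RouterRetriever does internally. `Hit` and `merge_hits` are hypothetical names, and keeping the highest score per node is a choice made for this example.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, float]  # (node_id, score); hypothetical shape


def merge_hits(result_sets: Iterable[List[Hit]]) -> List[Hit]:
    """Union hits from several retrievers, keeping one entry per node id.

    If a node appears more than once, the highest score wins.
    """
    best: Dict[str, float] = {}
    for hits in result_sets:
        for node_id, score in hits:
            if node_id not in best or score > best[node_id]:
                best[node_id] = score
    # Highest-scoring nodes first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

For example, merging a vector result with a keyword result that overlaps on one node yields each node once, ordered by score.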

6. Multi-selector query example

Run a query that mentions several entities and observe which retrievers the multi-selector router picks:

# Query: a complex query that mentions several entities
nodes = retriever.retrieve(
    "What were noteable events from the authors time at Interleaf and YC?"
)
# Output:
# Selecting retriever 1: This choice is relevant as it allows for retrieving specific context...
# Selecting retriever 2: This choice is also relevant as it allows for retrieving specific context using entities...
# query keywords: ['interleaf', 'events', 'noteable', 'yc']
# > Extracted keywords: ['interleaf', 'yc']

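Observation: for a query that mentions several entities (Interleaf and YC), the multi-selector router chose both the vector retriever and the keyword retriever, and the keyword retriever extracted keywords from the query for its lookup.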
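The two keyword lines in the log suggest a two-step process: candidate terms are pulled from the query, and only those known to the index are kept. The sketch below illustrates that reading under stated assumptions; it is not the library's extraction code. `STOPWORDS`, `candidate_keywords`, and `keep_indexed` are hypothetical, and the stopword list is deliberately tiny.

```python
import re
from typing import Iterable, List, Set

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS: Set[str] = {"a", "an", "and", "at", "of", "the", "to", "what"}


def candidate_keywords(query: str, stopwords: Set[str] = STOPWORDS) -> List[str]:
    """Step 1: split the query into lowercase words and drop stopwords and repeats."""
    seen: Set[str] = set()
    out: List[str] = []
    for word in re.findall(r"[a-z0-9]+", query.lower()):
        if word not in stopwords and word not in seen:
            seen.add(word)
            out.append(word)
    return out


def keep_indexed(candidates: Iterable[str], indexed: Set[str]) -> List[str]:
    """Step 2: keep only the candidates that exist in the keyword table."""
    return [word for word in candidates if word in indexed]
```

With an index that knows `interleaf`, `yc`, and `risd`, a query about events at Interleaf and YC keeps only the two entity names, which mirrors the shape of the log output.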

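Results

We compared how the single-selector and multi-selector router retrievers behave on different kinds of query:

Single-selector router retriever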

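Query 1: "Can you give me all the context regarding the author's life?"
Selected: list retriever (list_tool)
Result: returns every node in the document, which suits requests for comprehensive information
Query 2: "What did Paul Graham do after RISD?"
Selected: vector retriever (vector_tool)
Result: returns the nodes most relevant to the query, with similarity scores of about 0.79 to 0.80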

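Multi-selector router retriever

Query: "What were noteable events from the authors time at Interleaf and YC?"
Selected: vector retriever + keyword retriever
Extracted keywords: ['interleaf', 'yc']
Result: combines semantic search with keyword matching and returns a broader set of relevant nodes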

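Conclusion: the router retriever picks a suitable retriever based on the content and type of each query. Single selection fits simple queries; multi selection fits complex queries that mention several entities or need retrieval from more than one angle. Because the selection is made by an LLM, the router can interpret the intent of the query and justify its choice.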

Routing decision examples

# Single-selector routing decisions
Selecting retriever 0: This choice is most relevant as it mentions retrieving all context from the essay, which could include information about the author's life.
Selecting retriever 1: The question asks for a specific detail from Paul Graham's essay on 'What I Worked On'. Therefore, the second choice, which is useful for retrieving specific context, is the most relevant.

# Multi-selector routing decisions
Selecting retriever 1: This choice is relevant as it allows for retrieving specific context from the essay, which is needed to answer the question about notable events at Interleaf and YC.
Selecting retriever 2: This choice is also relevant as it allows for retrieving specific context using entities mentioned in the query, which in this case are 'Interleaf' and 'YC'.
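Log lines in this shape are easy to collect for later analysis, for example to check how often each retriever is chosen. The sketch below assumes the `Selecting retriever N: reason` format shown in the logs; `parse_decisions` and `selection_counts` are hypothetical helpers.

```python
import re
from collections import Counter
from typing import List, Tuple

# Matches lines such as "Selecting retriever 1: some reason".
DECISION_LINE = re.compile(r"^Selecting retriever (\d+): (.*)$")


def parse_decisions(log_text: str) -> List[Tuple[int, str]]:
    """Extract (retriever index, reason) pairs; other log lines are ignored."""
    decisions: List[Tuple[int, str]] = []
    for line in log_text.splitlines():
        match = DECISION_LINE.match(line.strip())
        if match:
            decisions.append((int(match.group(1)), match.group(2).strip()))
    return decisions


def selection_counts(log_text: str) -> Counter:
    """Count how many times each retriever index was selected."""
    return Counter(index for index, _ in parse_decisions(log_text))
```

Feeding the captured log output into `selection_counts` gives a quick view of whether the tool descriptions steer queries the way they were intended to.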