
[RAG][retrievers09] Pathway Retriever: Real-Time Data Indexing and Retrieval

news 2026/7/22 3:21:52

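Case Objective

This case shows how to use the Pathway framework to build a real-time data indexing and retrieval system that continuously monitors dynamic data sources and keeps its index up to date. Pathway is an open-source data processing framework that lets developers build data transformation pipelines and machine learning applications that work with live data sources and changing data.

With PathwayRetriever, we can connect to a data index that is updated in real time and get the latest retrieval results without manually rebuilding the index. This is especially valuable for applications whose data changes frequently, such as document collaboration and real-time data streams.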
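
To make "updated without rebuilding" concrete, here is a minimal sketch of an index that is patched in place when one document changes. It is a toy keyword index written for illustration only; the class `ToyLiveIndex` and its methods are hypothetical and do not reflect how Pathway implements its index.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLiveIndex:
    """Keyword index that is patched in place when a document changes."""

    docs: dict[str, str] = field(default_factory=dict)
    postings: dict[str, set[str]] = field(default_factory=dict)

    def upsert(self, doc_id: str, text: str) -> None:
        # Drop stale postings for this document only, then re-add it.
        self.delete(doc_id)
        self.docs[doc_id] = text
        for word in set(text.lower().split()):
            self.postings.setdefault(word, set()).add(doc_id)

    def delete(self, doc_id: str) -> None:
        old = self.docs.pop(doc_id, None)
        if old is None:
            return
        for word in set(old.lower().split()):
            ids = self.postings.get(word)
            if ids is not None:
                ids.discard(doc_id)
                if not ids:
                    del self.postings[word]

    def search(self, query: str) -> list[str]:
        # Rank documents by how many distinct query words they contain.
        counts: dict[str, int] = {}
        for word in set(query.lower().split()):
            for doc_id in self.postings.get(word, ()):
                counts[doc_id] = counts.get(doc_id, 0) + 1
        return sorted(counts, key=lambda d: (-counts[d], d))


index = ToyLiveIndex()
index.upsert("a.txt", "pathway processes streaming data")
index.upsert("b.txt", "static batch report")
print(index.search("streaming data"))   # ['a.txt']
index.upsert("b.txt", "report on streaming pipelines")
print(index.search("streaming data"))   # ['a.txt', 'b.txt']
```

Only the postings of the changed document are touched, so the second search sees the edit to `b.txt` without re-indexing `a.txt`.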

Tech Stack and Core Dependencies

llama-index-retrievers-pathway
pathway
llama-index-embeddings-openai
llama-index-core
llama-index-llms-openai

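Environment Setup

# Install the required dependencies (run in a shell)
pip install llama-index-retrievers-pathway pathway
pip install llama-index-embeddings-openai
# The query engine in Step 5 also relies on these packages from the dependency list
pip install llama-index-core llama-index-llms-openai
# Set the API key (run in Python)
import os
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"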
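
Because the placeholder key is easy to leave in place by accident, a small guard can fail early with a readable message. This is a minimal sketch; the helper name `require_env` is hypothetical and not part of any library used here.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    # Treat the tutorial's placeholder string as "not configured".
    if not value or value == "your_openai_api_key":
        raise RuntimeError(f"{name} is not set to a real value")
    return value
```

Calling `require_env("OPENAI_API_KEY")` before building the embedding model surfaces a missing key at startup rather than on the first API request.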

Implementation

1. Using the public demo pipeline

Step 1

Connect to the public demo pipeline provided by Pathway:
from llama_index.retrievers.pathway import PathwayRetriever

# Connect to the public demo pipeline
retriever = PathwayRetriever(
    url="https://demo-document-indexing.pathway.stream"
)

# Run a retrieval and print the score and the first 100 characters of each hit
results = retriever.retrieve("what is pathway")
for result in results:
    print(f"Score: {result.score}, Text: {result.text[:100]}...")
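
The print loop above can be wrapped in a small formatter that also copes with a missing score and with multi-line text. This is a sketch under the assumption that each result exposes `.score` and `.text`, as the loop above uses; the function `summarize` is hypothetical, and the stand-in objects only mimic those two attributes.

```python
from types import SimpleNamespace


def summarize(results, width=100):
    """Return one line per result: rank, score (if any), and a text preview."""
    lines = []
    for rank, item in enumerate(results, start=1):
        score = "n/a" if item.score is None else f"{item.score:.3f}"
        # Collapse newlines and repeated spaces so each hit fits on one line.
        preview = " ".join(item.text.split())
        if len(preview) > width:
            preview = preview[:width] + "..."
        lines.append(f"{rank}. score={score} {preview}")
    return lines


# Stand-in objects; with the real retriever, pass `results` directly.
fake = [
    SimpleNamespace(score=0.8123, text="Pathway is a data\nprocessing framework."),
    SimpleNamespace(score=None, text="x" * 120),
]
for line in summarize(fake):
    print(line)
```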

2. Building a custom data processing pipeline

Step 2

Define the data sources:

import pathway as pw

# List of data sources to index
data_sources = []

# Add a local file system source; streaming mode keeps watching ./data for changes
data_sources.append(
    pw.io.fs.read(
        "./data",
        format="binary",
        mode="streaming",
        with_metadata=True,
    )
)

# More sources can be added, such as Google Drive or SharePoint
# data_sources.append(
#     pw.io.gdrive.read(
#         object_id="your_folder_id",
#         service_user_credentials_file="credentials.json",
#         with_metadata=True
#     )
# )
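
The file system source above watches `./data`, so that folder needs to exist and contain at least one document before retrieval returns anything useful. A minimal sketch for seeding it follows; the helper `seed_data_dir` and the file name `sample.txt` are hypothetical choices for illustration.

```python
from pathlib import Path


def seed_data_dir(root="./data"):
    """Create the watched folder and drop one small text file into it."""
    folder = Path(root)
    folder.mkdir(parents=True, exist_ok=True)
    sample = folder / "sample.txt"
    # Leave an existing file alone so repeated runs do not overwrite edits.
    if not sample.exists():
        sample.write_text(
            "Pathway is a data processing framework for live data.\n",
            encoding="utf-8",
        )
    return sample
```

Calling `seed_data_dir()` once before starting the pipeline is enough; files added or edited in the folder later should be picked up by the streaming source.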

Step 3

Create the document indexing pipeline:

from pathway.xpacks.llm.vector_store import VectorStoreServer
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import TokenTextSplitter

# Initialize the embedding model
embed_model = OpenAIEmbedding(embed_batch_size=10)

# Define the transformations: split into chunks first, then embed each chunk
transformations_example = [
    TokenTextSplitter(
        chunk_size=150,
        chunk_overlap=10,
        separator=" ",
    ),
    embed_model,
]

# Create the vector store server
processing_pipeline = VectorStoreServer.from_llamaindex_components(
    *data_sources,
    transformations=transformations_example,
)

# Host and port the server listens on
PATHWAY_HOST = "127.0.0.1"
PATHWAY_PORT = 8754

# Run the server in a background thread so the script can continue
processing_pipeline.run_server(
    host=PATHWAY_HOST,
    port=PATHWAY_PORT,
    with_cache=False,
    threaded=True
)
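
To show what `chunk_size=150` and `chunk_overlap=10` mean in the splitter above, here is a simplified sketch that windows over whitespace-separated words. It is an approximation for intuition only: the real TokenTextSplitter counts tokenizer tokens rather than words, and `chunk_words` is a hypothetical helper, not its implementation.

```python
def chunk_words(text, chunk_size=150, chunk_overlap=10):
    """Split text into word windows where neighbours share `chunk_overlap` words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    # Each new window starts this many words after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


words = " ".join(f"w{i}" for i in range(300))
chunks = chunk_words(words)
print(len(chunks), [len(c.split()) for c in chunks])  # 3 [150, 150, 20]
```

The last 10 words of one chunk reappear as the first 10 words of the next, which helps a sentence that straddles a boundary stay retrievable.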

Step 4

Connect to the custom pipeline:

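# Imported in Step 1; repeated here so this step can run on its own
from llama_index.retrievers.pathway import PathwayRetriever

# Connect to the custom pipeline started in Step 3
retriever = PathwayRetriever(host=PATHWAY_HOST, port=PATHWAY_PORT)

# Run a retrieval and print the score and the first 100 characters of each hit
results = retriever.retrieve("what is pathway")
for result in results:
    print(f"Score: {result.score}, Text: {result.text[:100]}...")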
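
Since the server in Step 3 runs in a background thread, a retrieval issued immediately afterwards may hit a port that is not listening yet. One way to guard against that is to poll the port first. This is a minimal sketch using only the standard library; `wait_for_port` is a hypothetical helper, and an open port only shows that the server accepts connections, not that every document has been indexed.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A typical use would be `if wait_for_port(PATHWAY_HOST, PATHWAY_PORT): ...` just before creating the retriever.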

3. Using the retriever in a query engine

Step 5

Create the query engine:

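from llama_index.core.query_engine import RetrieverQueryEngine

# Create a query engine on top of the retriever
query_engine = RetrieverQueryEngine.from_args(retriever)

# Run a query; the engine retrieves passages and asks the LLM to answer from them
response = query_engine.query("Tell me about Pathway")
print(str(response))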
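
Conceptually, a retriever-backed query engine retrieves passages, places them in a prompt, and passes that prompt to an LLM. The sketch below illustrates that flow with stub callables. The prompt template and the functions `build_prompt` and `answer` are assumptions made for illustration, not LlamaIndex's actual prompt or internals.

```python
def build_prompt(question, contexts):
    """Join retrieved passages and the question into a single prompt string."""
    numbered = "\n".join(f"[{i}] {c.strip()}" for i, c in enumerate(contexts, start=1))
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{numbered}\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer(question, retrieve, complete):
    """Retrieve passages, build a prompt, and hand it to a completion function."""
    contexts = retrieve(question)
    return complete(build_prompt(question, contexts))


# Stub callables stand in for the retriever and the LLM.
demo = answer(
    "What is Pathway?",
    retrieve=lambda q: ["Pathway is a data processing framework."],
    complete=lambda prompt: f"(prompt had {len(prompt.splitlines())} lines)",
)
print(demo)  # (prompt had 5 lines)
```

Because the passages are fetched at query time, an answer produced this way reflects whatever the live index holds at that moment.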