[RAG] [vector_stores001] Alibaba Cloud OpenSearch Vector Store: A Complete Example

news 2026/6/24 21:01:46

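This example shows how to integrate LlamaIndex with Alibaba Cloud OpenSearch Vector Search Edition to store and retrieve vectors, as the basis for a document question-answering system.

1. Goals

This example aims to:

Set up the Alibaba Cloud OpenSearch vector store: configure LlamaIndex to use Alibaba Cloud OpenSearch as its vector database.
Index and store documents: load document content and store it in the Alibaba Cloud OpenSearch vector database.
Query and retrieve: run natural-language queries against the stored documents and get relevant answers.
Filter by metadata: show how metadata filters narrow the search results.
Connect to an existing store: show how to connect to a vector store that already exists and build an index on top of it.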

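2. Tech Stack and Core Dependencies

LlamaIndex: a framework for building LLM-based applications
llama-index-vector-stores-alibabacloud-opensearch: the LlamaIndex vector store plugin for Alibaba Cloud OpenSearch
OpenAI API: used to generate text embeddings (LlamaIndex also uses OpenAI as its default LLM when generating answers)
Alibaba Cloud OpenSearch Vector Search Edition: provides a high-performance vector search service

Installing the core dependencies

    %pip install llama-index-vector-stores-alibabacloud-opensearch
    %pip install llama-index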

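3. Environment Setup

Before running this example, complete the following setup:

Alibaba Cloud OpenSearch instance: create and configure an OpenSearch Vector Search Edition instance on Alibaba Cloud
OpenAI API key: obtain a valid OpenAI API key for generating embeddings
Connection details: have the OpenSearch instance's endpoint, instance_id, username, and password ready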
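One way to keep the connection details out of the notebook is to read them from environment variables. The sketch below is only an illustration: the variable names (OPENSEARCH_ENDPOINT and so on), the OpenSearchSettings class, and the load_settings helper are hypothetical and not part of LlamaIndex or the Alibaba Cloud plugin. The resulting values could then be passed to AlibabaCloudOpenSearchConfig in section 4.4.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class OpenSearchSettings:
    """Connection details needed by the example (hypothetical helper type)."""

    endpoint: str
    instance_id: str
    username: str
    password: str
    table_name: str = "llama"


# Hypothetical variable names; choose whatever fits your deployment.
REQUIRED_VARS = (
    "OPENSEARCH_ENDPOINT",
    "OPENSEARCH_INSTANCE_ID",
    "OPENSEARCH_USERNAME",
    "OPENSEARCH_PASSWORD",
)


def load_settings(env=None) -> OpenSearchSettings:
    """Read the settings from a mapping (os.environ by default).

    Raises RuntimeError listing every missing variable, so a bad setup
    fails before any network call is attempted.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return OpenSearchSettings(
        endpoint=env["OPENSEARCH_ENDPOINT"],
        instance_id=env["OPENSEARCH_INSTANCE_ID"],
        username=env["OPENSEARCH_USERNAME"],
        password=env["OPENSEARCH_PASSWORD"],
        table_name=env.get("OPENSEARCH_TABLE_NAME", "llama"),
    )
```

Calling load_settings() with no argument reads from the real environment; passing a plain dict makes the helper easy to test.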

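4. Implementation

4.1 Import the required libraries and configure logging

    import logging
    import sys

    # Send INFO-level log output to stdout so it shows up in the notebook.
    # basicConfig already attaches a stdout handler, so no extra handler is needed.
    logging.basicConfig(stream=sys.stdout, level=logging.INFO)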

4.2 Configure the OpenAI API

    import getpass

    import openai

    # Prompt for the key so it is never hard-coded in the notebook.
    OPENAI_API_KEY = getpass.getpass("OpenAI API Key:")
    openai.api_key = OPENAI_API_KEY

4.3 Prepare the sample data

    # Create the data directory and download the sample document
    !mkdir -p 'data/paul_graham/'
    !wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

    # Load the documents
    from llama_index.core import SimpleDirectoryReader

    documents = SimpleDirectoryReader("./data/paul_graham").load_data()
    print(f"Total documents: {len(documents)}")

4.4 Set up the Alibaba Cloud OpenSearch vector store

    # Allow nested event loops (works around async I/O issues in notebooks)
    import nest_asyncio

    nest_asyncio.apply()

    # Import the required classes
    from llama_index.core import StorageContext, VectorStoreIndex
    from llama_index.vector_stores.alibabacloud_opensearch import (
        AlibabaCloudOpenSearchStore,
        AlibabaCloudOpenSearchConfig,
    )

    # Configure the OpenSearch connection parameters
    config = AlibabaCloudOpenSearchConfig(
        endpoint="*****",  # replace with your OpenSearch endpoint
        instance_id="*****",  # replace with your instance ID
        username="your_username",  # replace with your username
        password="your_password",  # replace with your password
        table_name="llama",  # replace with your table name
    )

    # Create the vector store and build the index from the loaded documents
    vector_store = AlibabaCloudOpenSearchStore(config)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(
        documents, storage_context=storage_context
    )

4.5 Run a query

    # Create a query engine and run a query
    query_engine = index.as_query_engine()
    response = query_engine.query("What did the author do growing up?")

    # Display the result
    from IPython.display import Markdown, display

    display(Markdown(f"{response}"))

4.6 Connect to an existing store

    # Connect to a vector store that already exists
    from IPython.display import Markdown, display
    from llama_index.core import VectorStoreIndex
    from llama_index.vector_stores.alibabacloud_opensearch import (
        AlibabaCloudOpenSearchStore,
        AlibabaCloudOpenSearchConfig,
    )

    config = AlibabaCloudOpenSearchConfig(
        endpoint="***",
        instance_id="***",
        username="your_username",
        password="your_password",
        table_name="llama",
    )
    vector_store = AlibabaCloudOpenSearchStore(config)

    # Build the index from the vectors already in the store (no re-embedding)
    index = VectorStoreIndex.from_vector_store(vector_store)
    query_engine = index.as_query_engine()
    response = query_engine.query(
        "What did the author study prior to working on AI?"
    )
    display(Markdown(f"{response}"))

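4.7 Metadata filtering

    # Create documents that carry metadata
    def my_file_metadata(file_name: str):
        """Attach different metadata depending on the input file name."""
        if "essay" in file_name:
            source_type = "essay"
        elif "dinosaur" in file_name:
            source_type = "dinos"
        else:
            source_type = "other"
        return {"source_type": source_type}

    # Load the documents (same directory as in 4.3) and build the index
    md_documents = SimpleDirectoryReader(
        "./data/paul_graham", file_metadata=my_file_metadata
    ).load_data()

    # Reuse the vector store configured earlier for the metadata-aware index
    md_storage_context = StorageContext.from_defaults(vector_store=vector_store)
    md_index = VectorStoreIndex.from_documents(
        md_documents, storage_context=md_storage_context
    )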
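Because my_file_metadata is a plain function, it can be checked on its own before any documents are indexed. The snippet below repeats the function from this section and calls it on a few paths; only paul_graham_essay.txt comes from this example, while dinosaur_facts.txt and notes.txt are made-up file names used purely for illustration.

```python
def my_file_metadata(file_name: str):
    """Attach different metadata depending on the input file name."""
    if "essay" in file_name:
        source_type = "essay"
    elif "dinosaur" in file_name:
        source_type = "dinos"
    else:
        source_type = "other"
    return {"source_type": source_type}


# The last two paths are hypothetical and exist only to exercise each branch.
samples = [
    "data/paul_graham/paul_graham_essay.txt",
    "data/misc/dinosaur_facts.txt",
    "data/misc/notes.txt",
]
results = {path: my_file_metadata(path)["source_type"] for path in samples}

for path, source_type in results.items():
    print(f"{path} -> {source_type}")
```

The essay path maps to "essay", the dinosaur path to "dinos", and anything else to "other", which is the value the filter in the next step matches against.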

4.8 Query with a metadata filter

    # Add a filter to the query engine
    from llama_index.core.vector_stores import MetadataFilter, MetadataFilters

    md_query_engine = md_index.as_query_engine(
        filters=MetadataFilters(
            filters=[MetadataFilter(key="source_type", value="essay")]
        )
    )
    md_response = md_query_engine.query(
        "How long it took the author to write his thesis?"
    )
    display(Markdown(f"{md_response}"))
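Conceptually, an equality filter such as MetadataFilter(key="source_type", value="essay") restricts retrieval to chunks whose metadata has that exact value. The pure-Python sketch below illustrates only that semantics; it is not how OpenSearch evaluates the filter on the server, and the matches helper and the sample nodes are hypothetical.

```python
from typing import Any


def matches(metadata: dict[str, Any], required: dict[str, Any]) -> bool:
    """Return True when every required key is present with an equal value."""
    return all(metadata.get(key) == value for key, value in required.items())


# Hypothetical chunks standing in for stored nodes.
nodes = [
    {"text": "chunk from the essay", "metadata": {"source_type": "essay"}},
    {"text": "chunk about dinosaurs", "metadata": {"source_type": "dinos"}},
    {"text": "chunk without a type", "metadata": {}},
]

# Only nodes tagged source_type == "essay" remain eligible for ranking.
kept = [node for node in nodes if matches(node["metadata"], {"source_type": "essay"})]

for node in kept:
    print(node["text"])
```

Nodes that lack the key are excluded as well, so documents indexed without metadata will not show up in filtered queries under this reading.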

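5. Results

With this example you can:

Vectorize document content and store it in the Alibaba Cloud OpenSearch vector database
Query the documents in natural language, for example "What did the author do growing up?"
Get answers grounded in the document content
Use metadata filters to narrow the search results
Connect to an existing vector store and build an index on top of it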

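6. Implementation Approach

The core steps of this example are:

Data preparation: download the sample document as the data source
Vector store configuration: use the AlibabaCloudOpenSearchStore class to configure Alibaba Cloud OpenSearch as the vector database
Document loading: use SimpleDirectoryReader to load the documents
Index creation: use VectorStoreIndex to vectorize the document content and store it
Query execution: run natural-language queries through query_engine
Metadata management: attach metadata to documents and filter queries by it
Persistent connection: show how to connect to an existing vector store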
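The retrieval step behind query execution can be pictured as a nearest-neighbor search over embedding vectors. The sketch below is a toy illustration under stated assumptions: the three-dimensional vectors are invented, cosine similarity is assumed as the metric (the source does not say which metric the OpenSearch table uses), and the search runs in memory rather than in OpenSearch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (score, text) pairs most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Invented embeddings; real ones come from the embedding model.
store = [
    ("chunk about programming", [0.9, 0.1, 0.0]),
    ("chunk about painting", [0.1, 0.9, 0.0]),
    ("chunk about startups", [0.0, 0.2, 0.9]),
]

results = top_k([1.0, 0.0, 0.0], store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

In the actual pipeline the retrieved chunks are then passed to the LLM as context for answer generation.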

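7. Possible Extensions

Batch document processing: extend the system to process many documents in bulk
Advanced metadata filtering: support richer filter conditions such as range queries and combinations of several conditions
Hybrid search: combine vector search with traditional text search to improve retrieval precision
Performance tuning: optimize indexing and query performance for large document collections
Real-time updates: support live document updates and incremental indexing
Multilingual support: extend the system to process and query documents in multiple languages
Custom embedding models: replace OpenAI with other embedding models, such as local models or other cloud services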
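For the hybrid search idea, one common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below is a client-side illustration only: the source does not describe any hybrid feature of the Alibaba Cloud store, the document IDs are made up, and k=60 is just the conventional default for RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in
    (rank starts at 1); higher totals rank first, ties break by ID.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Made-up result lists: one from vector search, one from keyword search.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents that appear near the top of both lists rise in the fused ranking, which is why doc_b and doc_a end up ahead of the documents found by only one method.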

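8. Summary

This example showed how to integrate LlamaIndex with Alibaba Cloud OpenSearch Vector Search Edition to build a document question-answering system on top of a vector store. By vectorizing document content and storing it in Alibaba Cloud OpenSearch, we get semantic search and answers grounded in the indexed documents. Alibaba Cloud OpenSearch Vector Search Edition provides a high-performance, highly available vector search service, which makes it a good fit for enterprise applications that need to process large document collections and offer intelligent question answering.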
