
Installing and Getting Started with the Chroma Vector Database

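Contents

Goal

Chroma version

Official site

Installing Chroma

Hands-on

Minimal example

Adding data

Adding metadata (metadatas)

Adding external embeddings

Deleting data

Deleting by ID

Deleting by where

Modifying data

The upsert method

The update method

Custom embedding model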


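Goal

Get a working grasp of the Chroma vector database: adding, deleting, updating, and querying data, plus plugging in a custom embedding model. Chroma can be used in Client-Server Mode or through an in-process client (Chroma Clients). This introduction uses the in-process client, created with chromadb.Client(), which keeps its data in memory.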


Chroma version

1.5.5


Official site

https://docs.trychroma.com/docs/overview/getting-started


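Installing Chroma

Install Chroma with pip. If PyPI is slow or unreachable from your network, you can point pip at a mirror in mainland China. The mirror URL uses https here, because pip refuses a plain http index unless it is explicitly marked as a trusted host.

pip install chromadb -i https://mirrors.aliyun.com/pypi/simple/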

Hands-on

Minimal example

Step 1: create a collection.

import chromadb

chroma_client = chromadb.Client()

# Create a collection
collection = chroma_client.create_collection(name="first_collection")

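Step 2: add data to the collection.

# Chroma stores the text and handles embedding and indexing automatically.
# The eight documents below correspond to the eight IDs.
collection.add(
    ids=["123456", "123457", "123458", "123459", "123460", "123461", "123462", "123463"],
    documents=[
        "I enjoy reading Jin Yong's wuxia novels.",
        "I have a lot of work to do today.",
        "Artificial intelligence is very hard to learn.",
        "The animated series Fanren Xiuxian Zhuan is great.",
        "Stocks rose sharply today.",
        "International oil prices keep rising.",
        "Gold prices keep rising.",
        "Jordan is undeniably the greatest player in NBA history.",
    ],
)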

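Step 3: query the collection for related data. Note that the first time the default embedding function runs, Chroma downloads the all-MiniLM-L6-v2 embedding model.

# Query the collection with a list of query texts; Chroma returns the n most similar results.
results = collection.query(
    query_texts=["The economic base determines the superstructure."],
    n_results=3,
)
print(results)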
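The dictionary printed by the query nests each field one level deeper than you might expect: every key holds one inner list per query text. The sketch below walks a hand-written dictionary of that shape. The IDs come from the example above, but the distance values are invented for illustration, and `top_hits` is a hypothetical helper, not a Chroma function.

```python
# Hand-written stand-in for a query result; the distances are invented.
results = {
    "ids": [["123460", "123461", "123462"]],
    "documents": [[
        "Stocks rose sharply today.",
        "International oil prices keep rising.",
        "Gold prices keep rising.",
    ]],
    "distances": [[0.9, 1.1, 1.2]],
}


def top_hits(results, query_index=0):
    """Return (id, document, distance) tuples for one query text."""
    return list(zip(
        results["ids"][query_index],
        results["documents"][query_index],
        results["distances"][query_index],
    ))


for doc_id, document, distance in top_hits(results):
    print(doc_id, distance, document)
```

With several query texts, `query_index` selects which query's hits to read.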


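Adding data

Adding metadata (metadatas)

A vector database does not store vectors alone; metadata matters just as much. Vectors only solve similarity search, while metadata makes retrieval controllable from a business standpoint. Typical uses are recording the document source, business category, and time information, enforcing permission and security controls, and constraining the search scope (so that different users or companies asking the same question get different results). In the code below, the permission field in the metadata records which role each document is intended for.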

import chromadb

chroma_client = chromadb.Client()

# Create a collection
collection = chroma_client.create_collection(name="first_collection")

# Chroma stores the text and handles embedding and indexing automatically.
# Each of the eight IDs has one document and one metadata dict.
collection.add(
    ids=["123456", "123457", "123458", "123459", "123460", "123461", "123462", "123463"],
    documents=[
        "Refund rule: e-commerce orders can be refunded without reason within 7 days.",
        "Financial product redemption rule: wealth management products settle on T+2; early redemption may incur a fee.",
        "Employee leave policy: annual leave must be requested 3 days in advance; sick leave requires a medical certificate.",
        "Server incident process: P1 incidents require a response within 10 minutes and activation of the emergency plan.",
        "VIP refund policy: VIP users get a priority refund review channel, processed within 24 hours.",
        "Singapore tax note: the GST rate is 9% and applies to all consumer transactions.",
        "Data access policy: sensitive data is accessible only to the admin role; regular users have no access.",
        "Shipping rule: standard delivery takes 3-5 days; express delivery arrives within 24 hours.",
    ],
    metadatas=[
        {"tenant_id": "shop_A", "business_unit": "ecommerce", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "user"},
        {"tenant_id": "shop_A", "business_unit": "fintech", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "user"},
        {"tenant_id": "company_hr", "business_unit": "hr", "region": "global",
         "language": "en", "doc_type": "handbook", "permission": "employee"},
        {"tenant_id": "company_ops", "business_unit": "devops", "region": "global",
         "language": "en", "doc_type": "runbook", "permission": "engineer"},
        {"tenant_id": "shop_A", "business_unit": "ecommerce", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "vip"},
        {"tenant_id": "gov_sg", "business_unit": "tax", "region": "SG",
         "language": "en", "doc_type": "regulation", "permission": "public"},
        {"tenant_id": "company_it", "business_unit": "security", "region": "global",
         "language": "en", "doc_type": "policy", "permission": "admin"},
        {"tenant_id": "shop_A", "business_unit": "logistics", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "user"},
    ],
)
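Metadata pays off at query time, when a `where` filter restricts the search to records whose metadata matches. The sketch below is a plain-Python model of that matching for equality, `$eq`, `$in`, and `$and`, which are among the filter forms Chroma documents. `matches` is a hypothetical helper written for this illustration, not Chroma's implementation, and the records carry only two of the metadata keys used above.

```python
def matches(metadata, where):
    """Return True if a metadata dict satisfies a where-style filter."""
    for key, condition in where.items():
        if key == "$and":
            # Every sub-filter must match.
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # Operator form, e.g. {"permission": {"$in": ["user", "vip"]}}.
            for op, operand in condition.items():
                if op == "$in":
                    if metadata.get(key) not in operand:
                        return False
                elif op == "$eq":
                    if metadata.get(key) != operand:
                        return False
                else:
                    raise ValueError(f"operator not modelled in this sketch: {op}")
        elif metadata.get(key) != condition:
            # Shorthand equality form, e.g. {"tenant_id": "shop_A"}.
            return False
    return True


records = {
    "123456": {"tenant_id": "shop_A", "permission": "user"},
    "123458": {"tenant_id": "company_hr", "permission": "employee"},
    "123460": {"tenant_id": "shop_A", "permission": "vip"},
    "123462": {"tenant_id": "company_it", "permission": "admin"},
}

where = {"$and": [{"tenant_id": "shop_A"}, {"permission": {"$in": ["user", "vip"]}}]}
visible_ids = [doc_id for doc_id, meta in records.items() if matches(meta, where)]
print(visible_ids)
```

In a real query the same dictionary would be passed as the `where` argument, so that similarity search only considers the matching records.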

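Adding external embeddings

An add operation must supply documents, embeddings, or both. (If the documents are stored elsewhere or are very large, adding only embeddings and metadata is recommended; that case is not demonstrated here.) Metadata is optional. When only documents are supplied, Chroma generates the vectors with the collection's embedding function, as the minimal example already showed. If you have obtained vectors for your documents some other way, you can store them in Chroma directly; in that case Chroma's embedding model does not regenerate vectors and overwrite them.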

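import chromadb

chroma_client = chromadb.Client()

# Create a collection
collection = chroma_client.create_collection(name="first_collection")

# Embeddings are supplied explicitly, so Chroma stores them as given.
# The eight documents and eight embeddings below correspond to the eight IDs.
collection.add(
    ids=["123456", "123457", "123458", "123459", "123460", "123461", "123462", "123463"],
    documents=[
        "I enjoy reading Jin Yong's wuxia novels.",
        "I have a lot of work to do today.",
        "Artificial intelligence is very hard to learn.",
        "The animated series Fanren Xiuxian Zhuan is great.",
        "Stocks rose sharply today.",
        "International oil prices keep rising.",
        "Gold prices keep rising.",
        "Jordan is undeniably the greatest player in NBA history.",
    ],
    embeddings=[
        [1.1, 2.3, 3.2],
        [4.5, 6.9, 4.4],
        [1.1, 2.3, 3.2],
        [4.6, 6.3, 4.4],
        [2.0, 3.1, 5.6],
        [7.2, 1.4, 0.9],
        [3.3, 8.8, 2.1],
        [5.5, 2.2, 9.7],
    ],
)

# The collection holds hand-written 3-dimensional embeddings. With query_texts,
# Chroma would embed the query with the default embedding_function (384 dimensions)
# and fail with a dimension mismatch, so query_embeddings is used instead,
# with a vector of the same dimension (3).
results = collection.query(
    query_embeddings=[[0.1, 0.2, 0.3]],
    n_results=3,
    include=["embeddings", "documents", "distances"],
)
print(results)

# Embedding dimension
print(len(results["embeddings"][0][0]))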
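To see why particular IDs come back, the ranking can be reproduced by hand. The sketch below assumes the collection uses Chroma's default `l2` space, which the Chroma documentation defines as squared Euclidean distance. `squared_l2` and `nearest` are hypothetical helpers, and Chroma itself searches through an index rather than this brute-force scan.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query, ids, embeddings, n_results):
    """Brute-force top-n search, smallest distance first."""
    scored = [(squared_l2(query, vec), doc_id) for doc_id, vec in zip(ids, embeddings)]
    scored.sort(key=lambda pair: pair[0])  # stable sort keeps insertion order on ties
    return scored[:n_results]


ids = ["123456", "123457", "123458", "123459", "123460", "123461", "123462", "123463"]
embeddings = [
    [1.1, 2.3, 3.2],
    [4.5, 6.9, 4.4],
    [1.1, 2.3, 3.2],
    [4.6, 6.3, 4.4],
    [2.0, 3.1, 5.6],
    [7.2, 1.4, 0.9],
    [3.3, 8.8, 2.1],
    [5.5, 2.2, 9.7],
]

for distance, doc_id in nearest([0.1, 0.2, 0.3], ids, embeddings, n_results=3):
    print(doc_id, round(distance, 2))
```

The first and third vectors are identical, so they tie for the smallest distance. The order in which Chroma reports tied records is not something this sketch predicts.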

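Deleting data

Deleting by ID

import chromadb

chroma_client = chromadb.Client()

# Create a collection
collection = chroma_client.create_collection(name="first_collection")

# Chroma stores the text and handles embedding and indexing automatically.
# The eight documents below correspond to the eight IDs.
collection.add(
    ids=["123456", "123457", "123458", "123459", "123460", "123461", "123462", "123463"],
    documents=[
        "I enjoy reading Jin Yong's wuxia novels.",
        "I have a lot of work to do today.",
        "Artificial intelligence is very hard to learn.",
        "The animated series Fanren Xiuxian Zhuan is great.",
        "Stocks rose sharply today.",
        "International oil prices keep rising.",
        "The Lakers are the champions.",
        "Jordan is undeniably the greatest player in NBA history.",
    ],
)

# Query the collection with a list of query texts; Chroma returns the n most similar results.
results = collection.query(
    query_texts=["Who is the greatest NBA player?"],
    n_results=3,
)
print(results)

# Delete the Jordan document by ID, then run the same query again.
collection.delete(
    ids=["123463"],
)

results = collection.query(
    query_texts=["Who is the greatest NBA player?"],
    n_results=3,
)
print(results)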

Deleting by where

import chromadb

chroma_client = chromadb.Client()

# Create a collection
collection = chroma_client.create_collection(name="first_collection")

# Chroma stores the text and handles embedding and indexing automatically.
# Each of the eight IDs has one document and one metadata dict.
collection.add(
    ids=["123456", "123457", "123458", "123459", "123460", "123461", "123462", "123463"],
    documents=[
        "Refund rule: e-commerce orders can be refunded without reason within 7 days.",
        "Financial product redemption rule: wealth management products settle on T+2; early redemption may incur a fee.",
        "Employee leave policy: annual leave must be requested 3 days in advance; sick leave requires a medical certificate.",
        "Server incident process: P1 incidents require a response within 10 minutes and activation of the emergency plan.",
        "VIP refund policy: VIP users get a priority refund review channel, processed within 24 hours.",
        "Singapore tax note: the GST rate is 9% and applies to all consumer transactions.",
        "Data access policy: sensitive data is accessible only to the admin role; regular users have no access.",
        "Shipping rule: standard delivery takes 3-5 days; express delivery arrives within 24 hours.",
    ],
    metadatas=[
        {"tenant_id": "shop_A", "business_unit": "ecommerce", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "user"},
        {"tenant_id": "shop_A", "business_unit": "fintech", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "user"},
        {"tenant_id": "company_hr", "business_unit": "hr", "region": "global",
         "language": "en", "doc_type": "handbook", "permission": "employee"},
        {"tenant_id": "company_ops", "business_unit": "devops", "region": "global",
         "language": "en", "doc_type": "runbook", "permission": "engineer"},
        {"tenant_id": "shop_A", "business_unit": "ecommerce", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "vip"},
        {"tenant_id": "gov_sg", "business_unit": "tax", "region": "SG",
         "language": "en", "doc_type": "regulation", "permission": "public"},
        {"tenant_id": "company_it", "business_unit": "security", "region": "global",
         "language": "en", "doc_type": "policy", "permission": "admin"},
        {"tenant_id": "shop_A", "business_unit": "logistics", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "user"},
    ],
)

results = collection.query(
    query_texts=["Within how many days can an e-commerce order be refunded without reason?"],
    n_results=1,
)
print(results)

# Delete every record whose tenant_id is shop_A, then run the same query again.
collection.delete(
    where={
        "tenant_id": "shop_A",
    }
)

results = collection.query(
    query_texts=["Within how many days can an e-commerce order be refunded without reason?"],
    n_results=1,
)
print(results)
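The second query can no longer return the refund rule, because the `where` filter removed every record whose `tenant_id` is `shop_A`. The sketch below models that with a plain dictionary that keeps only the `tenant_id` values from the example. It illustrates the expected effect and is not how Chroma stores or deletes records.

```python
# tenant_id of each record in the example above
tenant_by_id = {
    "123456": "shop_A",
    "123457": "shop_A",
    "123458": "company_hr",
    "123459": "company_ops",
    "123460": "shop_A",
    "123461": "gov_sg",
    "123462": "company_it",
    "123463": "shop_A",
}

where = {"tenant_id": "shop_A"}

# Keep only the records that do NOT match the filter.
remaining = {
    doc_id: tenant
    for doc_id, tenant in tenant_by_id.items()
    if tenant != where["tenant_id"]
}
print(sorted(remaining))
```

Four of the eight records belong to `shop_A`, so four remain, and the refund query has to pick its single result from those.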

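Modifying data

upsert overwrites a record if its ID already exists and inserts it otherwise; update only modifies records that already exist.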
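The difference is easiest to see on a toy store. The sketch below models a collection as a dictionary from ID to document; `ToyCollection` is a hypothetical class for illustration only. It assumes, as the Chroma documentation describes, that `update` skips IDs that do not exist rather than inserting them.

```python
class ToyCollection:
    """Dictionary-backed stand-in that mimics upsert/update semantics."""

    def __init__(self):
        self.documents = {}

    def upsert(self, ids, documents):
        # Overwrite existing IDs, insert missing ones.
        for doc_id, document in zip(ids, documents):
            self.documents[doc_id] = document

    def update(self, ids, documents):
        # Modify existing IDs only; unknown IDs are skipped.
        for doc_id, document in zip(ids, documents):
            if doc_id in self.documents:
                self.documents[doc_id] = document


toy = ToyCollection()
toy.upsert(ids=["123456"], documents=["refund within 7 days"])
toy.upsert(ids=["123456"], documents=["refund within 30 days"])  # overwrites
toy.upsert(ids=["888888"], documents=["The game is a lot of fun."])  # inserts
toy.update(ids=["999999"], documents=["never stored"])  # skipped
print(sorted(toy.documents))
```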

The upsert method

import chromadb

chroma_client = chromadb.Client()

# Create a collection
collection = chroma_client.create_collection(name="first_collection")

# Chroma stores the text and handles embedding and indexing automatically.
# Each of the eight IDs has one document and one metadata dict.
collection.add(
    ids=["123456", "123457", "123458", "123459", "123460", "123461", "123462", "123463"],
    documents=[
        "Refund rule: e-commerce orders can be refunded without reason within 7 days.",
        "Financial product redemption rule: wealth management products settle on T+2; early redemption may incur a fee.",
        "Employee leave policy: annual leave must be requested 3 days in advance; sick leave requires a medical certificate.",
        "Server incident process: P1 incidents require a response within 10 minutes and activation of the emergency plan.",
        "VIP refund policy: VIP users get a priority refund review channel, processed within 24 hours.",
        "Singapore tax note: the GST rate is 9% and applies to all consumer transactions.",
        "Data access policy: sensitive data is accessible only to the admin role; regular users have no access.",
        "Shipping rule: standard delivery takes 3-5 days; express delivery arrives within 24 hours.",
    ],
    metadatas=[
        {"tenant_id": "shop_A", "business_unit": "ecommerce", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "user"},
        {"tenant_id": "shop_A", "business_unit": "fintech", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "user"},
        {"tenant_id": "company_hr", "business_unit": "hr", "region": "global",
         "language": "en", "doc_type": "handbook", "permission": "employee"},
        {"tenant_id": "company_ops", "business_unit": "devops", "region": "global",
         "language": "en", "doc_type": "runbook", "permission": "engineer"},
        {"tenant_id": "shop_A", "business_unit": "ecommerce", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "vip"},
        {"tenant_id": "gov_sg", "business_unit": "tax", "region": "SG",
         "language": "en", "doc_type": "regulation", "permission": "public"},
        {"tenant_id": "company_it", "business_unit": "security", "region": "global",
         "language": "en", "doc_type": "policy", "permission": "admin"},
        {"tenant_id": "shop_A", "business_unit": "logistics", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "user"},
    ],
)

results = collection.query(
    query_texts=["Within how many days can an e-commerce order be refunded without reason?"],
    n_results=1,
    include=["metadatas", "documents", "distances"],
)
print(results)

# ID 123456 already exists, so upsert overwrites its document.
collection.upsert(
    ids=["123456"],
    documents=["Refund rule: e-commerce orders can be refunded without reason within 30 days."],
)

results = collection.query(
    query_texts=["Within how many days can an e-commerce order be refunded without reason?"],
    n_results=1,
    include=["metadatas", "documents", "distances"],
)
print(results)

# ID 888888 does not exist yet, so upsert inserts a new record.
collection.upsert(
    ids=["888888"],
    documents=["The game is a lot of fun."],
)

results = collection.query(
    query_texts=["Is the game fun?"],
    n_results=1,
    include=["metadatas", "documents", "distances"],
)
print(results)

The update method

import chromadb

chroma_client = chromadb.Client()

# Create a collection
collection = chroma_client.create_collection(name="first_collection")

# Chroma stores the text and handles embedding and indexing automatically.
# Each of the eight IDs has one document and one metadata dict.
collection.add(
    ids=["123456", "123457", "123458", "123459", "123460", "123461", "123462", "123463"],
    documents=[
        "Refund rule: e-commerce orders can be refunded without reason within 7 days.",
        "Financial product redemption rule: wealth management products settle on T+2; early redemption may incur a fee.",
        "Employee leave policy: annual leave must be requested 3 days in advance; sick leave requires a medical certificate.",
        "Server incident process: P1 incidents require a response within 10 minutes and activation of the emergency plan.",
        "VIP refund policy: VIP users get a priority refund review channel, processed within 24 hours.",
        "Singapore tax note: the GST rate is 9% and applies to all consumer transactions.",
        "Data access policy: sensitive data is accessible only to the admin role; regular users have no access.",
        "Shipping rule: standard delivery takes 3-5 days; express delivery arrives within 24 hours.",
    ],
    metadatas=[
        {"tenant_id": "shop_A", "business_unit": "ecommerce", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "user"},
        {"tenant_id": "shop_A", "business_unit": "fintech", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "user"},
        {"tenant_id": "company_hr", "business_unit": "hr", "region": "global",
         "language": "en", "doc_type": "handbook", "permission": "employee"},
        {"tenant_id": "company_ops", "business_unit": "devops", "region": "global",
         "language": "en", "doc_type": "runbook", "permission": "engineer"},
        {"tenant_id": "shop_A", "business_unit": "ecommerce", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "vip"},
        {"tenant_id": "gov_sg", "business_unit": "tax", "region": "SG",
         "language": "en", "doc_type": "regulation", "permission": "public"},
        {"tenant_id": "company_it", "business_unit": "security", "region": "global",
         "language": "en", "doc_type": "policy", "permission": "admin"},
        {"tenant_id": "shop_A", "business_unit": "logistics", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "user"},
    ],
)

results = collection.query(
    query_texts=["Within how many days can an e-commerce order be refunded without reason?"],
    n_results=1,
    include=["metadatas", "documents", "distances"],
)
print(results)

# ID 123456 exists, so update modifies its document.
collection.update(
    ids=["123456"],
    documents=["Refund rule: e-commerce orders can be refunded without reason within 30 days."],
)

results = collection.query(
    query_texts=["Within how many days can an e-commerce order be refunded without reason?"],
    n_results=1,
    include=["metadatas", "documents", "distances"],
)
print(results)

# ID 888888 does not exist; update only modifies existing records, so nothing is inserted.
collection.update(
    ids=["888888"],
    documents=["Zhang San's hobby is playing basketball."],
)

results = collection.query(
    query_texts=["What is Zhang San's hobby?"],
    n_results=1,
    include=["metadatas", "documents", "distances"],
)
print(results)

Custom embedding model

Adapter class

import requests


class MyOllamaEmbeddingFunction:
    def __init__(self, model="qwen3-embedding:8b"):
        self.model = model
        self.url = "http://localhost:11434/api/embeddings"
        self.session = requests.Session()

    def _embed(self, texts):
        # This Ollama endpoint embeds one prompt per request.
        embeddings = []
        for text in texts:
            res = self.session.post(self.url, json={
                "model": self.model,
                "prompt": text,
            })
            res.raise_for_status()  # fail loudly on HTTP errors
            embeddings.append(res.json()["embedding"])
        return embeddings

    # Used by add
    def embed_documents(self, input):
        return self._embed(input)

    # Used by query
    def embed_query(self, input):
        return self._embed(input)

    # Callable entry point, the form Chroma's EmbeddingFunction interface defines
    def __call__(self, input):
        return self._embed(input)
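For trying out the wiring without a running Ollama server, a deterministic stand-in with the same `__call__(input)` shape can help. The sketch below is a hypothetical `HashEmbeddingFunction`; its name, dimension, and hashing scheme are all assumptions made for this illustration. Its vectors carry no semantic meaning, so it is only suitable for checking that a custom function is being invoked and that dimensions line up.

```python
import hashlib


class HashEmbeddingFunction:
    """Toy embedding function: maps each text to a fixed-size vector via SHA-256."""

    def __init__(self, dimension=8):
        if not 1 <= dimension <= 32:
            raise ValueError("dimension must be between 1 and 32")
        self.dimension = dimension

    def __call__(self, input):
        embeddings = []
        for text in input:
            digest = hashlib.sha256(text.encode("utf-8")).digest()  # 32 bytes
            # Scale the first `dimension` bytes into the range [0, 1].
            embeddings.append([byte / 255 for byte in digest[: self.dimension]])
        return embeddings


embed = HashEmbeddingFunction(dimension=8)
vectors = embed(["refund rules", "refund rules", "shipping rules"])
print(len(vectors), len(vectors[0]))
```

Identical texts always map to identical vectors, which makes it easy to confirm that vectors stored by add, upsert, and update all come from the same function.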

Test

import chromadb

from chroma_test.MyOllamaEmbeddingFunction import MyOllamaEmbeddingFunction

client = chromadb.Client()
embedding_fn = MyOllamaEmbeddingFunction()

collection = client.create_collection(
    name="rag_collection",
    embedding_function=embedding_fn,
)

collection.add(
    ids=["123456", "123457", "123458", "123459", "123460", "123461", "123462", "123463"],
    documents=[
        "Refund rule: e-commerce orders can be refunded without reason within 7 days.",
        "Financial product redemption rule: wealth management products settle on T+2; early redemption may incur a fee.",
        "Employee leave policy: annual leave must be requested 3 days in advance; sick leave requires a medical certificate.",
        "Server incident process: P1 incidents require a response within 10 minutes and activation of the emergency plan.",
        "VIP refund policy: VIP users get a priority refund review channel, processed within 24 hours.",
        "Singapore tax note: the GST rate is 9% and applies to all consumer transactions.",
        "Data access policy: sensitive data is accessible only to the admin role; regular users have no access.",
        "Shipping rule: standard delivery takes 3-5 days; express delivery arrives within 24 hours.",
    ],
    metadatas=[
        {"tenant_id": "shop_A", "business_unit": "ecommerce", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "user"},
        {"tenant_id": "shop_A", "business_unit": "fintech", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "user"},
        {"tenant_id": "company_hr", "business_unit": "hr", "region": "global",
         "language": "en", "doc_type": "handbook", "permission": "employee"},
        {"tenant_id": "company_ops", "business_unit": "devops", "region": "global",
         "language": "en", "doc_type": "runbook", "permission": "engineer"},
        {"tenant_id": "shop_A", "business_unit": "ecommerce", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "vip"},
        {"tenant_id": "gov_sg", "business_unit": "tax", "region": "SG",
         "language": "en", "doc_type": "regulation", "permission": "public"},
        {"tenant_id": "company_it", "business_unit": "security", "region": "global",
         "language": "en", "doc_type": "policy", "permission": "admin"},
        {"tenant_id": "shop_A", "business_unit": "logistics", "region": "SG",
         "language": "en", "doc_type": "policy", "permission": "user"},
    ],
)

results = collection.query(
    query_texts=["Please explain the refund rules."],
    n_results=1,
    include=["embeddings", "documents", "distances"],
)
print(results)

# Embedding dimension
print(len(results["embeddings"][0][0]))

# Check whether upsert and update embed with Chroma's default model or with our custom one.
collection.upsert(
    ids=["888888"],
    documents=["The game is a lot of fun."],
)

results = collection.query(
    query_texts=["Is the game fun?"],
    n_results=1,
    include=["embeddings", "documents", "distances"],
)
print(results)
print(len(results["embeddings"][0][0]))

collection.update(
    ids=["888888"],
    documents=["The game is not fun."],
)

results = collection.query(
    query_texts=["Is the game fun?"],
    n_results=1,
    include=["embeddings", "documents", "distances"],
)
print(results)
print(len(results["embeddings"][0][0]))

