
Does llama-index similarity scoring also take the document name into account?

Brief conclusion

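By default, the document name (usually stored in the metadata as `file_name`) is injected into the node text and takes part in embedding generation, so it does influence the similarity score. You can exclude it with `excluded_embed_metadata_keys`, so that similarity is computed from the body text plus any metadata you have not excluded [1][2].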

Details

Default behavior: metadata takes part in embedding

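  • The `metadata` of a `Document` (including `file_name`) is inherited by its nodes and, by default, injected into the text when embeddings are generated, so the similarity score reflects this metadata as well [1].
  • The injection format is controlled by `metadata_seperator`, `metadata_template`, and `text_template`. With the defaults (`"\n"`, `"{key}: {value}"`, and `"{metadata_str}\n\n{content}"`), a node is rendered as its `key: value` lines, a blank line, and then the content [3].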
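To make the default format concrete, here is a minimal pure-Python sketch that emulates how the three documented templates combine. It is not llama_index code: the function name `render_node_text` and the sample metadata are hypothetical, and only the three default strings come from the documentation excerpt cited under Citations.

```python
# Emulation of the three documented defaults; not the library's implementation.
METADATA_SEPARATOR = "\n"                     # documented default of metadata_seperator
METADATA_TEMPLATE = "{key}: {value}"          # documented default of metadata_template
TEXT_TEMPLATE = "{metadata_str}\n\n{content}" # documented default of text_template


def render_node_text(
    metadata,
    content,
    separator=METADATA_SEPARATOR,
    metadata_template=METADATA_TEMPLATE,
    text_template=TEXT_TEMPLATE,
):
    """Format each key/value pair, join the pairs, then prepend them to the content."""
    metadata_str = separator.join(
        metadata_template.format(key=key, value=value)
        for key, value in metadata.items()
    )
    return text_template.format(metadata_str=metadata_str, content=content)


# Hypothetical sample document.
default_text = render_node_text(
    {"file_name": "report.pdf", "category": "finance"}, "Body text"
)
print(default_text)
```

The printed string is what an embedding model would receive under these assumptions, which is why a file name can shift the resulting vector.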

How to keep the document name from affecting similarity

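  • Setting `document.excluded_embed_metadata_keys = ["file_name"]` makes the embedding model ignore `file_name`, so similarity is based only on the body text and any other metadata that has not been excluded [2].
  • If you still want the LLM to see the file name during generation, leave `file_name` out of `excluded_llm_metadata_keys`. That list is set separately and has no effect on the embedding computation [4].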

Example

```python
from llama_index.core import Document

doc = Document(
    text="Body text",
    metadata={"file_name": "report.pdf"},
    excluded_embed_metadata_keys=["file_name"],  # ignore the file name when embedding
    excluded_llm_metadata_keys=[],  # still visible to the LLM (optional)
)
```
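The effect of the two exclusion lists can be sketched without the library. The helper `visible_text` below is hypothetical and only emulates the default templates; it also assumes that a node whose metadata is fully excluded is rendered as the bare content.

```python
def visible_text(metadata, content, excluded_keys):
    """Emulate the text one consumer (embedding model or LLM) would receive."""
    kept = {k: v for k, v in metadata.items() if k not in excluded_keys}
    if not kept:
        # Assumption: with nothing left to inject, only the content is returned.
        return content
    metadata_str = "\n".join(f"{k}: {v}" for k, v in kept.items())
    return f"{metadata_str}\n\n{content}"


metadata = {"file_name": "report.pdf"}

# Mirrors excluded_embed_metadata_keys=["file_name"]
embed_view = visible_text(metadata, "Body text", ["file_name"])
# Mirrors excluded_llm_metadata_keys=[]
llm_view = visible_text(metadata, "Body text", [])

print(repr(embed_view))
print(repr(llm_view))
```

With the real library, the comparable check is to call `doc.get_content(metadata_mode=MetadataMode.EMBED)` and `doc.get_content(metadata_mode=MetadataMode.LLM)`, with `MetadataMode` imported from `llama_index.core.schema`, and compare the two strings.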

Metadata filtering at retrieval time (does not affect similarity)

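  • At retrieval time you can use `filters` on `file_name` to narrow the candidate nodes, but this is a pre-filtering step: it changes neither the embedding vectors nor the similarity computation [5].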
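The distinction between pre-filtering and scoring can be illustrated with a toy retriever. This is a simplified sketch, not how any particular vector store is implemented; the two-dimensional vectors, the `retrieve` helper, and the file names are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical nodes with precomputed embeddings.
nodes = [
    {"file_name": "report.pdf", "embedding": [1.0, 0.0]},
    {"file_name": "notes.txt", "embedding": [0.6, 0.8]},
]
query_embedding = [1.0, 0.0]


def retrieve(nodes, query_embedding, file_name=None):
    """Drop non-matching nodes first, then score the survivors."""
    candidates = [
        n for n in nodes if file_name is None or n["file_name"] == file_name
    ]
    scored = [
        (n["file_name"], cosine(n["embedding"], query_embedding))
        for n in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


unfiltered = dict(retrieve(nodes, query_embedding))
filtered = dict(retrieve(nodes, query_embedding, file_name="notes.txt"))

# The filter removes candidates; the score of a surviving node is unchanged.
print(unfiltered["notes.txt"] == filtered["notes.txt"])
```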

Notes

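  • If you use `SimpleDirectoryReader(file_metadata=filename_fn)`, the file name is written into `metadata` automatically and takes part in embedding by default. To exclude it, set `excluded_embed_metadata_keys` on every `Document` after they are built [6].
  • Node embeddings are generated at indexing time according to the settings above. At retrieval time, the `retrieve` method compares against these stored vectors directly and does not re-decide whether metadata takes part [7].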
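Because embeddings are fixed at indexing time, the exclusion has to be applied before the index is built. A minimal sketch of doing this in bulk follows; `exclude_from_embedding` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the snippet runs without the library. In real use, `documents` would be the list returned by `SimpleDirectoryReader(...).load_data()`.

```python
from types import SimpleNamespace


def exclude_from_embedding(documents, keys):
    """Add each key to every document's excluded_embed_metadata_keys list."""
    for doc in documents:
        for key in keys:
            if key not in doc.excluded_embed_metadata_keys:
                doc.excluded_embed_metadata_keys.append(key)
    return documents


# Stand-ins that expose the same attribute as a Document.
documents = [
    SimpleNamespace(metadata={"file_name": "a.txt"}, excluded_embed_metadata_keys=[]),
    SimpleNamespace(
        metadata={"file_name": "b.txt"},
        excluded_embed_metadata_keys=["file_name"],
    ),
]

exclude_from_embedding(documents, ["file_name"])
print([doc.excluded_embed_metadata_keys for doc in documents])
```

Run this before building the index; documents that were already indexed would need to be re-embedded for the change to take effect.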

Citations

File: docs/src/content/docs/framework/module_guides/loading/documents_and_nodes/usage_documents.md (L36-41)

### Metadata

Documents also offer the chance to include useful metadata. Using the `metadata` dictionary on each document, additional information can be included to help inform responses and track down sources for query responses. This information can be anything, such as filenames or categories. If you are integrating with a vector database, keep in mind that some vector databases require that the keys must be strings, and the values must be flat (either `str`, `float`, or `int`).

Any information set in the `metadata` dictionary of each document will show up in the `metadata` of each source node created from the document. Additionally, this information is included in the nodes, enabling the index to utilize it on queries and responses. By default, the metadata is injected into the text for both embedding and LLM model calls.

File: docs/src/content/docs/framework/module_guides/loading/documents_and_nodes/usage_documents.md (L59-70)

3. Set the filename automatically using the `SimpleDirectoryReader` and `file_metadata` hook. This will automatically run the hook on each document to set the `metadata` field:

```python
from llama_index.core import SimpleDirectoryReader

filename_fn = lambda filename: {"file_name": filename}

# automatically sets the metadata of each document according to filename_fn
documents = SimpleDirectoryReader(
    "./data", file_metadata=filename_fn
).load_data()
```

File: docs/src/content/docs/framework/module_guides/loading/documents_and_nodes/usage_documents.md (L95-103)

#### Customizing LLM Metadata Text

Typically, a document might have many metadata keys, but you might not want all of them visible to the LLM during response synthesis. In the above examples, we may not want the LLM to read the `file_name` of our document. However, the `file_name` might include information that will help generate better embeddings. A key advantage of doing this is to bias the embeddings for retrieval without changing what the LLM ends up reading.

We can exclude it like so:

```python
document.excluded_llm_metadata_keys = ["file_name"]
```

File: docs/src/content/docs/framework/module_guides/loading/documents_and_nodes/usage_documents.md (L113-119)

#### Customizing Embedding Metadata Text

Similar to customizing the metadata visible to the LLM, we can also customize the metadata visible to embeddings. In this case, you can specifically exclude metadata visible to the embedding model, in case you DON'T want particular text to bias the embeddings.

```python
document.excluded_embed_metadata_keys = ["file_name"]
```

File: docs/src/content/docs/framework/module_guides/loading/documents_and_nodes/usage_documents.md (L131-143)

As you know by now, metadata is injected into the actual text of each document/node when sent to the LLM or embedding model. By default, the format of this metadata is controlled by three attributes:

1. `Document.metadata_seperator` -> default = `"\n"`

   When concatenating all key/value fields of your metadata, this field controls the separator between each key/value pair.

2. `Document.metadata_template` -> default = `"{key}: {value}"`

   This attribute controls how each key/value pair in your metadata is formatted. The two variables `key` and `value` string keys are required.

3. `Document.text_template` -> default = `{metadata_str}\n\n{content}`

   Once your metadata is converted into a string using `metadata_seperator` and `metadata_template`, this templates controls what that metadata looks like when joined with the text content of your document/node. The `metadata` and `content` string keys are required.

File: docs/src/content/docs/framework/optimizing/basic_strategies/basic_strategies.md (L90-111)

## Metadata Filters

Before throwing your documents into a vector index, it can be useful to attach metadata to them. While this metadata can be used later on to help track the sources to answers from the `response` object, it can also be used at query time to filter data before performing the top-k similarity search.

Metadata filters can be set manually, so that only nodes with the matching metadata are returned:

```python
from llama_index.core import VectorStoreIndex, Document
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter

documents = [
    Document(text="text", metadata={"author": "LlamaIndex"}),
    Document(text="text", metadata={"author": "John Doe"}),
]

filters = MetadataFilters(
    filters=[ExactMatchFilter(key="author", value="John Doe")]
)

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(filters=filters)
```