Does llama-index similarity also take the document name into account?

news 2026/7/8 4:44:11

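Brief conclusion

By default, the document name (usually stored in metadata as file_name) is injected into the node text and takes part in embedding generation, so it affects similarity scores. You can exclude it with excluded_embed_metadata_keys so that similarity is based on the body content (plus any metadata you have not excluded) [1][2].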

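Details

Default behavior: metadata is included in the embedding

A Document's metadata (including file_name) is inherited by its nodes and, by default, injected into the text when embeddings are generated, so similarity calculations include this metadata [1].
The injection format is controlled by metadata_seperator, metadata_template, and text_template. With the defaults ("\n", "{key}: {value}", and "{metadata_str}\n\n{content}"), each pair is rendered as "key: value", the pairs are joined by newlines, and the result is placed before the content, separated by a blank line [3].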
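To make the template mechanics concrete, here is a minimal pure-Python sketch of how the three settings could combine into the text that is embedded. It is not LlamaIndex's implementation: `compose_node_text` is a hypothetical helper, and returning the bare content when no metadata remains is an assumption made for illustration.

```python
from typing import Dict, Iterable


def compose_node_text(
    content: str,
    metadata: Dict[str, str],
    excluded_keys: Iterable[str] = (),
    metadata_separator: str = "\n",
    metadata_template: str = "{key}: {value}",
    text_template: str = "{metadata_str}\n\n{content}",
) -> str:
    """Hypothetical helper: mimic how metadata is prepended to node text."""
    excluded = set(excluded_keys)
    # Format every non-excluded pair, then join the pairs with the separator.
    metadata_str = metadata_separator.join(
        metadata_template.format(key=key, value=value)
        for key, value in metadata.items()
        if key not in excluded
    )
    # Assumption: when no metadata is left, only the body text is used.
    if not metadata_str:
        return content
    return text_template.format(metadata_str=metadata_str, content=content)


metadata = {"file_name": "report.pdf", "category": "finance"}
with_name = compose_node_text("Quarterly revenue grew.", metadata)
without_name = compose_node_text(
    "Quarterly revenue grew.", metadata, excluded_keys=["file_name"]
)

print(with_name)
print("---")
print(without_name)
```

The first string contains the line "file_name: report.pdf" and the second does not, which is the difference an embedding model would see.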

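How to keep the document name from affecting similarity

Setting document.excluded_embed_metadata_keys = ["file_name"] makes the embedding text omit file_name, so similarity is based only on the body text and any metadata that has not been excluded [2].
excluded_llm_metadata_keys is configured separately and does not affect embeddings: leave file_name out of that list if you still want the LLM to see the file name during generation [4].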

Example

    from llama_index.core import Document

    doc = Document(
        text="Body text",
        metadata={"file_name": "report.pdf"},
        excluded_embed_metadata_keys=["file_name"],  # ignore the file name when embedding
        excluded_llm_metadata_keys=[],  # file name stays visible to the LLM (optional)
    )

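Metadata filtering at retrieval time (does not affect similarity)

You can use filters at retrieval time to restrict candidate nodes by file_name, but this is a pre-filtering step; it does not change the embedding vectors or the similarity computation [5].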
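The distinction between filtering and scoring can be illustrated with a small pure-Python sketch. The node dictionaries, the two-dimensional vectors, and the `retrieve` function are hypothetical stand-ins, not LlamaIndex objects; the point is only that the filter narrows the candidate set while each surviving node keeps the same score.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical nodes: each carries metadata and an already-stored embedding.
nodes = [
    {"id": "n1", "metadata": {"file_name": "report.pdf"}, "embedding": [1.0, 0.0]},
    {"id": "n2", "metadata": {"file_name": "notes.txt"}, "embedding": [0.8, 0.6]},
    {"id": "n3", "metadata": {"file_name": "report.pdf"}, "embedding": [0.0, 1.0]},
]


def retrieve(
    query_embedding: List[float],
    candidates: List[dict],
    filters: Optional[Dict[str, str]] = None,
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Pre-filter on exact metadata matches, then rank the rest by similarity."""
    wanted = filters or {}
    kept = [
        node
        for node in candidates
        if all(node["metadata"].get(key) == value for key, value in wanted.items())
    ]
    scored = [(node["id"], cosine(query_embedding, node["embedding"])) for node in kept]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


query = [1.0, 0.0]
unfiltered = retrieve(query, nodes, top_k=3)
filtered = retrieve(query, nodes, filters={"file_name": "report.pdf"}, top_k=3)

print(unfiltered)
print(filtered)
```

In this sketch the filtered result drops n2, while n1 and n3 receive exactly the scores they had without the filter.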

Notes

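If you use SimpleDirectoryReader(file_metadata=filename_fn), the file name is written to metadata automatically and is embedded by default; to exclude it, set excluded_embed_metadata_keys on every Document after they are constructed [6].
Node embeddings are generated at indexing time according to the settings above; at retrieval time, the retrieve method compares those stored vectors directly and does not re-evaluate whether metadata should be included [7].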
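One practical consequence of the second note is that exclusions need to be in place before indexing. The toy sketch below illustrates this under loud assumptions: `ToyDoc`, `toy_embed` (a bag-of-words counter standing in for a real embedding model), and `build_index` are all hypothetical and are not LlamaIndex APIs.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a document with embed-time exclusions."""
    text: str
    metadata: Dict[str, str]
    excluded_embed_metadata_keys: List[str] = field(default_factory=list)


def embed_text(doc: ToyDoc) -> str:
    """Build the text that would be embedded, honoring the exclusions."""
    meta = "\n".join(
        f"{key}: {value}"
        for key, value in doc.metadata.items()
        if key not in doc.excluded_embed_metadata_keys
    )
    return f"{meta}\n\n{doc.text}" if meta else doc.text


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().replace(":", " ").replace(".", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(docs: List[ToyDoc]) -> List[Counter]:
    """Embeddings are computed once, here, and stored."""
    return [toy_embed(embed_text(doc)) for doc in docs]


def score(index: List[Counter], query: str) -> List[float]:
    """Retrieval only compares the query against the stored vectors."""
    query_vector = toy_embed(query)
    return [cosine(query_vector, vector) for vector in index]


doc = ToyDoc(text="revenue grew last quarter", metadata={"file_name": "budget.pdf"})
index = build_index([doc])
before = score(index, "budget")[0]

# Changing the exclusion after indexing does not touch the stored vector.
doc.excluded_embed_metadata_keys = ["file_name"]
stale = score(index, "budget")[0]

# Rebuilding the index picks up the new setting.
rebuilt = score(build_index([doc]), "budget")[0]

print(before, stale, rebuilt)
```

Here the query matches only through the file name, so the score stays positive until the index is rebuilt with the exclusion applied.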

Citations

File:docs/src/content/docs/framework/module_guides/loading/documents_and_nodes/usage_documents.md (L36-41)

### Metadata

Documents also offer the chance to include useful metadata. Using the `metadata` dictionary on each document, additional information can be included to help inform responses and track down sources for query responses. This information can be anything, such as filenames or categories. If you are integrating with a vector database, keep in mind that some vector databases require that the keys must be strings, and the values must be flat (either `str`, `float`, or `int`).

Any information set in the `metadata` dictionary of each document will show up in the `metadata` of each source node created from the document. Additionally, this information is included in the nodes, enabling the index to utilize it on queries and responses. By default, the metadata is injected into the text for both embedding and LLM model calls.

File:docs/src/content/docs/framework/module_guides/loading/documents_and_nodes/usage_documents.md (L59-70)

3. Set the filename automatically using the `SimpleDirectoryReader` and `file_metadata` hook. This will automatically run the hook on each document to set the `metadata` field:

```python
from llama_index.core import SimpleDirectoryReader

filename_fn = lambda filename: {"file_name": filename}

# automatically sets the metadata of each document according to filename_fn
documents = SimpleDirectoryReader(
    "./data", file_metadata=filename_fn
).load_data()
```

**File:** docs/src/content/docs/framework/module_guides/loading/documents_and_nodes/usage_documents.md (L95-103)

#### Customizing LLM Metadata Text

Typically, a document might have many metadata keys, but you might not want all of them visible to the LLM during response synthesis. In the above examples, we may not want the LLM to read the `file_name` of our document. However, the `file_name` might include information that will help generate better embeddings. A key advantage of doing this is to bias the embeddings for retrieval without changing what the LLM ends up reading.

We can exclude it like so:

```python
document.excluded_llm_metadata_keys = ["file_name"]
```

**File:** docs/src/content/docs/framework/module_guides/loading/documents_and_nodes/usage_documents.md (L113-119)

#### Customizing Embedding Metadata Text

Similar to customizing the metadata visible to the LLM, we can also customize the metadata visible to embeddings. In this case, you can specifically exclude metadata visible to the embedding model, in case you DON'T want particular text to bias the embeddings.

```python
document.excluded_embed_metadata_keys = ["file_name"]
```

**File:** docs/src/content/docs/framework/module_guides/loading/documents_and_nodes/usage_documents.md (L131-143)

As you know by now, metadata is injected into the actual text of each document/node when sent to the LLM or embedding model. By default, the format of this metadata is controlled by three attributes:

1. `Document.metadata_seperator` -> default = `"\n"`

   When concatenating all key/value fields of your metadata, this field controls the separator between each key/value pair.

2. `Document.metadata_template` -> default = `"{key}: {value}"`

   This attribute controls how each key/value pair in your metadata is formatted. The two variables `key` and `value` string keys are required.

3. `Document.text_template` -> default = `{metadata_str}\n\n{content}`

   Once your metadata is converted into a string using `metadata_seperator` and `metadata_template`, this templates controls what that metadata looks like when joined with the text content of your document/node. The `metadata` and `content` string keys are required.

File:docs/src/content/docs/framework/optimizing/basic_strategies/basic_strategies.md (L90-111)

## Metadata Filters

Before throwing your documents into a vector index, it can be useful to attach metadata to them. While this metadata can be used later on to help track the sources to answers from the `response` object, it can also be used at query time to filter data before performing the top-k similarity search.

Metadata filters can be set manually, so that only nodes with the matching metadata are returned:

```python
from llama_index.core import VectorStoreIndex, Document
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter

documents = [
    Document(text="text", metadata={"author": "LlamaIndex"}),
    Document(text="text", metadata={"author": "John Doe"}),
]

filters = MetadataFilters(
    filters=[ExactMatchFilter(key="author", value="John Doe")]
)

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(filters=filters)
```
