Concept-Enhanced Multimodal RAG: Towards Interpretable and Accurate Radiology Report Generation

news 2026/7/4 17:47:16

Authors: Marco Salmè, Federico Siciliano, Fabrizio Silvestri, Paolo Soda, Rosa Sicilia, Valerio Guarrasi

Deep-Dive Summary:

Abstract

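Radiology report generation (RRG) with vision-language models (VLMs) promises to reduce the documentation burden, improve reporting consistency, and speed up clinical workflows. Clinical adoption remains limited, however, by a lack of interpretability and by a tendency to hallucinate findings that do not match the imaging evidence. Existing work usually treats interpretability and accuracy as separate goals: concept-based explainability techniques focus on transparency, while retrieval-augmented generation (RAG) methods target factual grounding through external retrieval. The authors propose Concept-Enhanced Multimodal RAG (CEMRAG), a unified framework that decomposes visual representations into interpretable clinical concepts and integrates them with multimodal RAG. The approach uses enriched contextual prompts for RRG and improves both interpretability and factual accuracy. Experiments on MIMIC-CXR and IU X-Ray, covering multiple VLM architectures, training regimes, and retrieval configurations, show consistent gains over both conventional RAG and concept-only baselines on clinical accuracy metrics and standard NLP measures. These results challenge the assumed trade-off between interpretability and performance, showing that transparent visual concepts can enhance rather than compromise the diagnostic accuracy of medical VLMs. The modular design decomposes interpretability into visual transparency and structured language model conditioning, providing a principled path toward clinically trustworthy AI-assisted radiology. Project page: https://github.com/marcosal30/cemrag-rrg.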

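Keywords: radiology report generation, vision-language models, medical imaging, interpretability, retrieval-augmented generation, multimodal AI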

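2 Related Work

2.1 Interpretability of Vision-Language Models

Interpretability of VLMs is essential for clinical deployment. Current approaches fall into two groups: implicit explanation mechanisms (such as rationale generation and chain-of-thought reasoning) and explicit concept representations. Implicit methods often produce plausible-sounding justifications rather than a faithful account of the underlying computation. Explicit concept representations, such as concept bottleneck models, are transparent but require extensive manual annotation.

Recent work such as SpLiCE decomposes visual representations into clinical concepts drawn from a domain-specific vocabulary, giving scalable and transparent explanations without sacrificing representational flexibility.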
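To make the idea of decomposing a visual representation into concepts concrete, the sketch below fits a sparse, non-negative combination of concept embeddings to an image embedding using a projected proximal-gradient loop. It is a toy illustration, not the SpLiCE implementation: the function name `sparse_concept_weights`, the penalty, the step size, and the three-dimensional embeddings are all assumptions made for this example, and the fixed step size only converges when it is small relative to the concept dictionary.

```python
from typing import Dict, List


def dot(u: List[float], v: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sparse_concept_weights(
    image_emb: List[float],
    concept_embs: Dict[str, List[float]],
    l1_penalty: float = 0.05,  # assumed value, controls sparsity
    step: float = 0.1,         # assumed value, must be small enough to converge
    n_iters: int = 500,
) -> Dict[str, float]:
    """Approximate min_w 0.5*||sum_j w_j c_j - z||^2 + l1*sum_j w_j with w >= 0."""
    names = list(concept_embs)
    weights = [0.0] * len(names)
    for _ in range(n_iters):
        # Reconstruct the image embedding from the current concept weights.
        recon = [
            sum(w * concept_embs[n][d] for w, n in zip(weights, names))
            for d in range(len(image_emb))
        ]
        residual = [r - z for r, z in zip(recon, image_emb)]
        # Gradient step, L1 shrinkage, then projection onto w >= 0.
        weights = [
            max(0.0, w - step * (dot(concept_embs[n], residual) + l1_penalty))
            for w, n in zip(weights, names)
        ]
    # Only concepts with a strictly positive weight are reported.
    return {n: w for n, w in zip(names, weights) if w > 0.0}


# Hypothetical concept vocabulary with orthonormal toy embeddings.
concepts = {
    "effusion": [1.0, 0.0, 0.0],
    "cardiomegaly": [0.0, 1.0, 0.0],
    "pneumothorax": [0.0, 0.0, 1.0],
}
weights = sparse_concept_weights([0.8, 0.3, 0.0], concepts)
```

The non-negativity constraint keeps every reported concept readable as "present to this degree", and the L1 penalty drops concepts that contribute nothing, which is what makes the resulting short list usable as an explanation.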

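2.2 Multimodal Retrieval-Augmented Generation in the Medical Domain

Multimodal RAG mitigates factual hallucination in medical VLMs by anchoring generation in existing clinical knowledge. For RRG, frameworks such as MMed-RAG and RULE significantly reduce hallucinations by supplying concrete clinical examples.

Although RAG offers indirect interpretability, this transparency is passive. Retrieval typically relies on global similarity matching and lacks explicit guidance on which anatomical structures or pathological patterns to prioritize.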
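Global similarity matching of the kind described here can be sketched as a top-k cosine-similarity search over whole-image embeddings. The snippet is a minimal illustration under assumed names (`cosine`, `retrieve_top_k`) and toy two-dimensional embeddings; it does not reproduce the retrieval code of MMed-RAG, RULE, or the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def retrieve_top_k(
    query_emb: Sequence[float],
    corpus: Sequence[Tuple[Sequence[float], str]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k reports whose image embeddings are most similar to the query."""
    scored = [(report, cosine(query_emb, emb)) for emb, report in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical corpus of (image embedding, report text) pairs.
corpus = [
    ([1.0, 0.0], "Small left pleural effusion."),
    ([0.0, 1.0], "No acute cardiopulmonary process."),
    ([1.0, 1.0], "Mild cardiomegaly with small effusion."),
]
neighbours = retrieve_top_k([1.0, 0.0], corpus, k=2)
```

Because the score is computed on a single vector per image, nothing in the ranking says which anatomical region or finding drove the match, which is the gap that explicit concepts are meant to fill.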

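2.3 Limitations and Motivation

Existing approaches treat transparency and factual accuracy as separate goals. The central hypothesis of this work is that interpretable visual concepts can act as a semantic guidance mechanism: by steering retrieval and generation toward the clinically relevant content of the input image, they improve transparency and accuracy at the same time.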

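4 Experimental Setup

4.1 Datasets

MIMIC-CXR: a large-scale dataset with more than 370,000 chest X-ray images. The authors use 156,344 frontal views.
IU X-ray: a smaller dataset with 7,470 images. The authors use 3,307 frontal projections.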

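4.2 Model Configurations and Experimental Conditions

The authors use CXR-CLIP as the base alignment model and SpLiCE for concept extraction.
Two architecture configurations are evaluated:

LLaVA-Med: both the vision encoder and the LLM (Mistral-7B) are medically pretrained.
CXR-CLIP + Mistral-7B: a medically pretrained CLIP paired with the base Mistral-7B.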

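Four prompting strategies are evaluated:

Image-Only: uses visual features alone.
Concepts: adds 5 medical keywords extracted with SpLiCE.
Multimodal RAG: adds the reports of 3 similar cases. For MIMIC-CXR, retrieval is in-domain; for IU X-ray, because the dataset is small, retrieval is cross-domain (from MIMIC-CXR).
CEMRAG: the proposed method, which integrates the SpLiCE concepts with multimodal RAG.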
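The prompting strategies can be pictured as different ways of assembling the text context that accompanies the image. The sketch below is an assumption-laden illustration: the function `build_prompt`, the strategy labels, and the prompt wording are hypothetical, and the combined `cemrag` branch follows the abstract's description of integrating concepts with multimodal RAG rather than any template given in the source. Only the limits of 5 concepts and 3 retrieved reports come from the text.

```python
from typing import List, Sequence

MAX_CONCEPTS = 5  # number of SpLiCE keywords stated in the summary
MAX_REPORTS = 3   # number of retrieved reports stated in the summary
STRATEGIES = ("image_only", "concepts", "rag", "cemrag")  # hypothetical labels


def build_prompt(
    strategy: str,
    concepts: Sequence[str] = (),
    reports: Sequence[str] = (),
) -> str:
    """Assemble the text context for one strategy (wording is illustrative only)."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    parts: List[str] = []
    # Concept keywords are included for the concept-only and combined strategies.
    if strategy in ("concepts", "cemrag"):
        parts.append("Key concepts: " + ", ".join(concepts[:MAX_CONCEPTS]))
    # Retrieved reports are included for the RAG and combined strategies.
    if strategy in ("rag", "cemrag"):
        for i, report in enumerate(reports[:MAX_REPORTS], start=1):
            parts.append(f"Similar report {i}: {report}")
    # Every strategy ends with the same instruction; the image goes in separately.
    parts.append("Write the findings for the given chest X-ray.")
    return "\n".join(parts)


example = build_prompt(
    "cemrag",
    concepts=["effusion", "cardiomegaly"],
    reports=["Small left pleural effusion."],
)
```

Keeping the four variants in one function makes the comparison explicit: the strategies differ only in which context blocks are present, so any difference in report quality can be attributed to the added concepts, the added reports, or their combination.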

Original Abstract: Radiology Report Generation (RRG) through Vision-Language Models (VLMs) promises to reduce documentation burden, improve reporting consistency, and accelerate clinical workflows. However, their clinical adoption remains limited by the lack of interpretability and the tendency to hallucinate findings misaligned with imaging evidence. Existing research typically treats interpretability and accuracy as separate objectives, with concept-based explainability techniques focusing primarily on transparency, while Retrieval-Augmented Generation (RAG) methods targeting factual grounding through external retrieval. We present Concept-Enhanced Multimodal RAG (CEMRAG), a unified framework that decomposes visual representations into interpretable clinical concepts and integrates them with multimodal RAG. This approach exploits enriched contextual prompts for RRG, improving both interpretability and factual accuracy. Experiments on MIMIC-CXR and IU X-Ray across multiple VLM architectures, training regimes, and retrieval configurations demonstrate consistent improvements over both conventional RAG and concept-only baselines on clinical accuracy metrics and standard NLP measures. These results challenge the assumed trade-off between interpretability and performance, showing that transparent visual concepts can enhance rather than compromise diagnostic accuracy in medical VLMs. Our modular design decomposes interpretability into visual transparency and structured language model conditioning, providing a principled pathway toward clinically trustworthy AI-assisted radiology.

PDF Link: 2602.15650v1