Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning

news 2026/3/27 3:56:04

Authors: Enwei Tong, Yuanchao Bai, Yao Zhu, Junjun Jiang, Xianming Liu

Deep-Dive Summary:

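Abstract

Vision-language models (VLMs) typically produce a very large number of visual tokens, which substantially increases inference latency and GPU memory usage. Training-free token pruning offers a practical remedy, but existing methods still struggle to balance local evidence and global context under extreme compression. This paper proposes Focus-Scan-Refine (FSR), a human-inspired, plug-and-play pruning framework that mimics how humans answer visual questions: first **Focus** on the key evidence, then **Scan** the global context when needed, and finally **Refine** the scanned context by aggregating relevant details.

FSR first focuses on key evidence by combining visual saliency with instruction relevance, which avoids a bias toward regions that are visually salient but irrelevant to the query. It then scans for complementary context conditioned on the focused set, selecting the tokens that differ most from the focused evidence. Finally, without increasing the token budget, FSR aggregates nearby informative tokens into the scan anchors through similarity-based assignment and score-weighted merging. Experiments across multiple VLM backbones and benchmarks show that FSR consistently improves the accuracy-efficiency trade-off over existing state-of-the-art pruning methods.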
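
The three stages can be illustrated with a small NumPy sketch. This is not the authors' implementation: the product used to combine saliency and relevance, the greedy max-min selection in the scan stage, and the names `focus_scan_refine`, `focus_frac`, and `min_sim` are all assumptions made for illustration, and the paper's own hyperparameters are not modeled here.

```python
import numpy as np


def _unit(x):
    # L2-normalize rows so that dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)


def focus_scan_refine(feats, saliency, relevance, budget,
                      focus_frac=0.5, min_sim=0.0):
    """Toy three-stage pruning. Returns (kept_features, kept_indices).

    focus_frac and min_sim are hypothetical knobs, not the paper's parameters.
    Scores are assumed to be strictly positive.
    """
    n = feats.shape[0]
    assert 0 < budget <= n
    unit = _unit(feats)
    score = saliency * relevance  # assumed way of combining the two cues

    # Focus: keep tokens that are both salient and relevant to the query.
    n_focus = max(1, int(round(budget * focus_frac)))
    focus = np.argsort(-score)[:n_focus]

    # Scan: greedily add the token least similar to anything kept so far.
    n_scan = budget - n_focus
    max_sim = (unit @ unit[focus].T).max(axis=1)
    max_sim[focus] = np.inf  # never pick an already kept token
    scan = []
    for _ in range(n_scan):
        j = int(np.argmin(max_sim))
        scan.append(j)
        max_sim = np.maximum(max_sim, unit @ unit[j])
        max_sim[j] = np.inf
    scan = np.array(scan, dtype=int)

    kept = np.concatenate([focus, scan])
    out = feats[kept].astype(float)
    if n_scan == 0:
        return out, kept

    # Refine: assign each dropped token to its most similar scan anchor and
    # merge with score weights, so the output still has `budget` tokens.
    dropped = np.setdiff1d(np.arange(n), kept)
    sims = unit[dropped] @ unit[scan].T
    owner = sims.argmax(axis=1)
    near = sims.max(axis=1) >= min_sim
    for a in range(n_scan):
        idx = np.concatenate([[scan[a]], dropped[(owner == a) & near]])
        w = score[idx] / score[idx].sum()
        out[n_focus + a] = w @ feats[idx]
    return out, kept
```

Calling `focus_scan_refine(feats, saliency, relevance, budget=64)` on an `(N, d)` feature matrix returns 64 rows: the focused tokens unchanged, followed by the scan anchors after merging. Raising `min_sim` restricts merging to tokens that are genuinely close to an anchor.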

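2. Related Work

Attention-based pruning: e.g., FastV, LLaVA-PruMerge, and SparseVLM. These methods use cross-attention or [CLS] attention to estimate token importance, but they tend to favor salient regions and overlook subtle global information.
Similarity-based pruning: e.g., DivPrune and DART. These methods reduce redundancy through diversity-based selection in feature space and emphasize global coverage, but they often miss the fine-grained local details needed for precise reasoning.
Joint attention-similarity pruning: e.g., VisionZip and CDPruner. Although they try to trade off the two, under a very tight token budget they still struggle to retain both the most critical local evidence and the necessary global context.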

4. Experiments

4.1 Experimental Setup

Models: the LLaVA family (1.5, NeXT, Video) and Qwen2.5-VL.
Benchmarks: image tasks including VQAv2, GQA, ScienceQA, POPE, MME, and MMBench, and video tasks including MLVU and MVBench.
Default hyperparameters: $\alpha = 3$, $\beta = 1$, $\rho = 0.9$, $\kappa = 1$.

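4.2 Main Results

As the table below shows, on LLaVA-1.5-7B with 64 retained tokens (an 88.9% reduction), FSR outperforms all competing methods on MM-Vet and maintains the highest and most stable average performance across benchmarks.

Table 1: Performance comparison of different pruning methods on LLaVA-1.5-7B.

Method	VQA V2	GQA	POPE	MME	Avg.
LLaVA-1.5-7B (100% tokens)	78.5	61.9	85.9	1862	100%
Retain 192 Tokens (↓ 66.7%)
CDPruner (NIPS25)	77.2	60.3	87.3	1784	98.5%
FSR (Ours)	77.4	60.2	87.1	1803	99.1%
Retain 64 Tokens (↓ 88.9%)
CDPruner (NIPS25)	75.4	58.6	87.5	1710	95.7%
FSR (Ours)	75.4	58.2	85.7	1701	96.1%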

4.2.1 FSR on Standard Benchmarks

We first evaluate FSR on LLaVA-1.5-7B. Table 1 reports the performance of different pruning methods under three token budgets (retaining 192, 128, and 64 visual tokens, corresponding to reduction rates of 66.7%, 77.8%, and 88.9%, respectively). With 192 retained tokens, FSR achieves the highest average score of 99.1%, ahead of CDPruner (98.5%) and VisPruner (98.2%).
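
The reduction rates quoted for the 192, 128, and 64 token budgets follow from simple arithmetic once the unpruned token count is known. The sketch below assumes 576 visual tokens per image for LLaVA-1.5 (a 24 × 24 patch grid), a figure that this summary does not state; `reduction_ratio` is an illustrative helper name.

```python
def reduction_ratio(total_tokens: int, kept_tokens: int) -> float:
    """Fraction of visual tokens removed by pruning."""
    if not 0 < kept_tokens <= total_tokens:
        raise ValueError("kept_tokens must be in (0, total_tokens]")
    return 1.0 - kept_tokens / total_tokens


# Assumption: LLaVA-1.5 encodes each image as a 24 x 24 grid of patch tokens.
LLAVA_15_TOKENS = 576

for kept in (192, 128, 64):
    print(kept, f"{reduction_ratio(LLAVA_15_TOKENS, kept):.1%}")
# 192 66.7%
# 128 77.8%
# 64 88.9%
```

The same helper applied to the 2,880 tokens reported for LLaVA-NeXT gives identical rates for budgets of 960, 640, and 320, which is why the two models share the same three reduction levels.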

Table 2 Performance comparison of different pruning methods on LLaVA-NeXT-7B. Avg. represents the average relative performance maintained across all tested benchmarks compared to the unpruned baseline. The best results are highlighted in bold.

Method	VQA V2	GQA	SQA-IMG	VQA-Text	POPE	MME	MMB-EN	MMB-CN	MMVet	Avg.
LLaVA-NeXT-7B	81.3	62.5	67.6	60.3	86.8	1883	65.9	57.4	39.2	100.0%
Retain 960 Tokens (↓ 66.7%)
HoloV (NIPS2025)	78.9	61.3	66.2	57.4	86.9	1713	50.9	42.3	34.4	91.7%
VisPruner (ICCV2025)	80.0	62.1	68.2	60.2	87.1	1807	65.8	58.2	38.5	99.2%
CDPruner (NIPS2025)	80.5	62.7	68.5	59.1	87.1	1799	66.9	57.6	39.0	99.4%
FSR	80.5	62.6	68.5	60.3	87.1	1806	66.9	58.3	41.1	100.0%
Retain 640 Tokens (↓ 77.8%)
FastV (ECCV24)	77.0	58.9	67.4	58.1	79.5	1667	63.1	53.5	39.5	94.4%
DivPrune (CVPR25)	79.3	61.9	67.8	57.0	86.9	1734	65.8	57.3	38.0	97.7%
HoloV (NIPS2025)	79.3	61.2	63.8	57.6	86.2	1768	64.3	56.7	38.9	97.0%
VisPruner (ICCV2025)	78.8	61.1	68.3	60.0	85.9	1828	64.9	57.3	38.5	98.5%
CDPruner (NIPS2025)	79.8	62.6	68.0	58.5	87.3	1800	66.2	57.6	41.0	99.3%
FSR	79.7	62.3	67.9	60.0	87.0	1833	66.3	57.9	41.9	99.9%
Retain 320 Tokens (↓ 88.9%)
FastV (ECCV24)	61.5	49.8	66.6	52.2	49.5	1302	53.4	42.5	20.0	74.9%
DivPrune (CVPR25)	77.2	61.1	67.7	56.2	84.7	1687	63.9	55.7	34.8	95.2%
HoloV (NIPS2025)	77.2	59.8	66.2	57.0	83.4	1753	65.5	57.0	36.5	96.0%
VisPruner (ICCV2025)	75.9	58.7	68.6	59.0	81.4	1753	63.8	55.8	36.3	95.4%
CDPruner (NIPS2025)	78.4	61.4	67.7	57.4	87.3	1773	65.4	55.6	36.7	97.3%
FSR	77.9	60.9	68.1	58.1	86.1	1783	64.9	56.1	39.3	97.6%
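
The Avg. column is described as the average relative performance against the unpruned baseline. One plausible reading, sketched below, is the unweighted mean of per-benchmark ratios. This is an assumption: the summary does not spell out the paper's exact averaging or rounding, and `average_relative_performance` is an illustrative name.

```python
def average_relative_performance(pruned, baseline):
    """Unweighted mean of per-benchmark pruned/baseline score ratios."""
    if len(pruned) != len(baseline):
        raise ValueError("score lists must have the same length")
    return sum(p / b for p, b in zip(pruned, baseline)) / len(baseline)


# Rows copied from Table 2 (nine benchmark columns, Avg. excluded).
baseline = [81.3, 62.5, 67.6, 60.3, 86.8, 1883, 65.9, 57.4, 39.2]
cdpruner_960 = [80.5, 62.7, 68.5, 59.1, 87.1, 1799, 66.9, 57.6, 39.0]

print(f"{average_relative_performance(cdpruner_960, baseline):.1%}")
```

For this row the simple mean comes out near, but not exactly at, the 99.4% listed in the table, so the published averages may use a slightly different aggregation or unrounded scores.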

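In the extreme setting of 64 retained tokens (an 88.9% reduction), FSR remains remarkably stable, preserving 96.1% of the original performance and consistently leading on complex reasoning tasks such as MMVet and MMBench-EN. This suggests that FSR effectively balances salient local details with background context and preserves semantic integrity.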

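4.2.2 FSR with High-Resolution Inputs

We apply FSR to LLaVA-NeXT-7B with the input resolution fixed at 672 × 672 (2,880 visual tokens in total). As Table 2 shows, with 960 retained tokens (a 66.7% reduction), FSR matches the full-token upper bound. Even in the most aggressive setting of 320 retained tokens (an 88.9% reduction), FSR still leads with 97.6% of the original performance retained. This indicates that FSR makes effective use of the fine-grained features in high-resolution images and stays accurate under a constrained token budget.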

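4.2.3 FSR on Advanced Architectures

We evaluate FSR on Qwen2.5-VL-7B, which natively supports dynamic resolution and token merging. Despite the stronger baseline, FSR still achieves the best accuracy-efficiency trade-off. At token reductions of 80% and 90%, FSR retains 91.9% and 84.0% of the original performance, respectively, clearly ahead of HoloV and FastV. Its advantage is most pronounced on benchmarks that require integrated multimodal reasoning, such as MMVet and MME.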

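4.2.4 FSR for Video Understanding

On LLaVA-Video-7B-Qwen2, FSR outperforms HoloV in average retained performance at pruning ratios from 50% to 80%. At a 60% pruning ratio in particular, FSR retains 99.6% of the original performance. This indicates that FSR's strategy of balancing local evidence with global context extends to the temporal dimension and robustly preserves key spatiotemporal cues.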

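4.2.5 FSR on Larger Models

Results on LLaVA-1.5-13B and LLaVA-NeXT-13B are given in Tables 5 and 6. On LLaVA-NeXT-13B with 640 retained tokens (a 77.8% reduction), FSR reaches an average score of 101.7%, slightly above the unpruned baseline. This suggests that filtering out redundant tokens reduces noise and can lead to more accurate reasoning.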

Table 3 Performance comparison of different pruning methods on Qwen2.5-VL-7B. Avg. represents the average relative performance maintained across all tested benchmarks compared to the unpruned baseline. The best results are highlighted in bold.

Method	GQA	SQA-IMG	VQA-Text	POPE	MME	MMB-EN	MMB-CN	MMVet	Avg.
Qwen2.5-VL-7B	60.8	88.9	77.6	86.5	2328	83.5	81.4	64.4	100.0%
Reduction Ratio: ↓ 80%
FastV (ECCV24)	56.8	83.1	70.7	81.0	2102	76.8	75.4	57.4	92.0%
HoloV (NIPS2025)	59.5	87.8	73.8	85.1	2179	81.1	78.9	55.5	95.6%
FSR	60.2	87.9	76.0	86.1	2258	81.5	79.1	61.7	97.9%
Reduction Ratio: ↓ 60%
FastV (ECCV24)	56.3	83.1	68.8	80.2	2063	75.7	73.5	51.4	89.8%
HoloV (NIPS2025)	59.0	87.2	71.9	84.4	2177	79.7	77.8	52.1	94.2%
FSR	59.9	87.5	75.1	85.2	2227	80.3	78.5	57.5	96.4%
Reduction Ratio: ↓ 80%
FastV (ECCV24)	54.2	82.2	61.0	77.5	1915	72.5	70.0	44.7	84.6%
HoloV (NIPS2025)	57.1	86.0	64.5	81.3	2008	76.3	73.4	45.3	88.6%
FSR	58.3	86.7	70.3	83.2	2089	78.7	74.9	49.8	91.9%
Reduction Ratio: ↓ 90%
FastV (ECCV24)	50.8	80.0	53.0	72.2	1794.7	68.2	65.1	37.1	78.3%
HoloV (NIPS2025)	53.6	84.4	55.7	76.4	1831	72.3	68.9	38.9	82.1%
FSR	54.1	84.5	61.0	77.3	1907	71.7	68.3	41.4	84.0%

Table 4 Performance comparison of different pruning methods on LLaVA-Video-7B-qwen2 with 32 frames per video. Avg. represents the average percentage of performance maintained. “w/o” and “w/” indicate without and with subtitles.

Method Metric	MMVU val	MMWorld test	MLVU test	MVBench test	all+w/o	all+w/	long	Avg.
Upper Bound: All Tokens (100%)
LLaVA-Video-7B-qwen2	44.0	30.0	50.1	60.8	62.6	62.4	51.8	100%
Reduction Ratio: ↓ 50%
HoloV (NIPS2025)	44.2	31.5	49.1	59.4	61.7	61.6	51.3	99.2%
FSR	46.0	31.1	50.2	59.7	61.9	62.0	51.6	100.3%
Reduction Ratio: ↓ 60%
HoloV (NIPS2025)	43.4	30.8	49.1	59.3	61.4	61.0	51.3	98.5%
FSR	44.6	31.1	50.0	59.4	61.6	61.5	52.2	99.6%
Reduction Ratio: ↓ 70%
HoloV (NIPS2025)	43.7	31.0	48.5	59.0	60.6	61.2	51.2	98.2%
FSR	44.6	31.6	47.6	59.2	61.3	61.5	52.0	98.9%
Reduction Ratio: ↓ 80%
HoloV (NIPS2025)	44.0	32.9	46.5	58.3	60.4	60.8	51.6	98.0%
FSR	43.4	33.3	46.5	58.5	60.2	60.9	52.3	98.2%

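4.3 Efficiency Analysis

On a single NVIDIA RTX 3090 GPU with only 64 retained tokens, FSR saves substantial resources: FLOPs drop by about 75%, KV cache memory shrinks by nearly 9×, and the prefill stage runs 3.9× faster. Among all compared methods, FSR offers the best accuracy-efficiency trade-off, with the lowest decoding latency (22.317 ms) and negligible system overhead.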
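
A back-of-the-envelope estimate shows where a KV cache saving of this size can come from. The sketch below assumes a 7B LLaMA-style decoder (32 layers, hidden size 4096, standard multi-head attention, fp16 cache) and 576 unpruned visual tokens. None of these figures come from this summary, and `kv_cache_bytes` is an illustrative helper rather than a function from any library.

```python
def kv_cache_bytes(num_tokens, num_layers=32, hidden_size=4096,
                   bytes_per_value=2):
    """Approximate KV cache size for a decoder with full multi-head attention.

    Defaults are assumptions: 32 layers, hidden size 4096, fp16 values.
    """
    # Each layer caches one key and one value vector per token.
    return 2 * num_layers * num_tokens * hidden_size * bytes_per_value


full = kv_cache_bytes(576)   # assumed unpruned visual token count
pruned = kv_cache_bytes(64)  # 64 retained tokens, as in the text

print(full / 2**20, pruned / 2**20, full / pruned)
# 288.0 32.0 9.0
```

Counting visual tokens alone gives exactly 9×. A measured figure also includes text tokens, which are not pruned, so it would presumably land somewhat below that, in line with the "nearly 9×" reported.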

Original Abstract: Vision-language models (VLMs) often generate massive visual tokens that greatly increase inference latency and memory footprint; while training-free token pruning offers a practical remedy, existing methods still struggle to balance local evidence and global context under aggressive compression. We propose Focus-Scan-Refine (FSR), a human-inspired, plug-and-play pruning framework that mimics how humans answer visual questions: focus on key evidence, then scan globally if needed, and refine the scanned context by aggregating relevant details. FSR first focuses on key evidence by combining visual importance with instruction relevance, avoiding the bias toward visually salient but query-irrelevant regions. It then scans for complementary context conditioned on the focused set, selecting tokens that are most different from the focused evidence. Finally, FSR refines the scanned context by aggregating nearby informative tokens into the scan anchors via similarity-based assignment and score-weighted merging, without increasing the token budget. Extensive experiments across multiple VLM backbones and vision-language benchmarks show that FSR consistently improves the accuracy-efficiency trade-off over existing state-of-the-art pruning methods. The source codes can be found at https://github.com/ILOT-code/FSR

PDF Link: 2602.05809v1