[Paper Review] LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Posted by Euisuk's Dev Log on December 12, 2025


https://arxiv.org/pdf/2403.15388

๋…ผ๋ฌธ ์ •๋ณด


  1. Introduction: A New Paradigm for LMM Efficiency

1.1 The Rise of Large Multimodal Models (LMMs)

  • Large Language Models (LLMs) such as GPT-4, LLaMA, and Mistral show strong reasoning ability. They are high-capacity Transformer architectures pretrained on large-scale text corpora.
  • Large Multimodal Models (LMMs) inherit the text-generation ability of an LLM and add a vision encoder such as CLIP-ViT that converts image patches into visual tokens. The visual tokens are fed to the LLM as a prefix context, enabling visual reasoning.

[Vision Encoder] → Visual Tokens (prefix) → [LLM] → Text response
(CLIP-ViT)         (576 tokens)             (Vicuna/LLaMA)

1.2 The Computational Cost of LMMs

LMM inference requires substantial computation.

Breaking this cost down by component:

| Component | Parameters | Note |
|---|---|---|
| Vision Encoder (ViT-L) | ~0.3B | Relatively small |
| LLM (LLaMA/Vicuna) | 7B-13B | Main cost driver |

🔎 Key insight: Because the vision encoder is tiny compared to the LLM, reducing the LLM's inference cost is the key to making the whole LMM efficient.


1.3 ๊ธฐ์กด ์ ‘๊ทผ๋ฒ•๊ณผ ํ•œ๊ณ„

์ด์ „ ์—ฐ๊ตฌ๋“ค์€ ์ด๋Ÿฌํ•œ LLM ๋น„์šฉ์„ ์ค„์ด๊ธฐ ์œ„ํ•ด ์•„๋ž˜์™€ ๊ฐ™์€ ์‹œ๋„๋“ค์„ ์ˆ˜ํ–‰ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

์ ‘๊ทผ๋ฒ• ๋ฐฉ๋ฒ• ํ•œ๊ณ„
Small LLM ์‚ฌ์šฉ Phi-2 ๊ธฐ๋ฐ˜ MobileVLM, TinyGPT-V LLM ์ถ”๋ก  ๋Šฅ๋ ฅ ํฌ์ƒ, VQAv2/MMBench์—์„œ ํฐ ์„ฑ๋Šฅ ๊ฒฉ์ฐจ
Quantization 4-bit, 8-bit ์••์ถ• ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋Š” ์ค„์ง€๋งŒ ๋‹ค๋ฅธ ๋ฌธ์ œ ๋ฏธํ•ด๊ฒฐ

1.4 ๊ฐ„๊ณผ๋œ ๋น„์šฉ ์›์ฒœ: Input Context Length

ํ•˜์ง€๋งŒ, ์œ„ ์—ฐ๊ตฌ๋“ค์—์„œ ๊ฐ„๊ณผํ•œ ๋‚ด์šฉ์œผ๋กœ โ€œLLM์˜ ๋น„์šฉ์€ ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์ž…๋ ฅ ์ปจํ…์ŠคํŠธ ๊ธธ์ด์—์„œ๋„ ๋ฐœ์ƒํ•œ๋‹ค.โ€๋ผ๋Š” ์‚ฌ์‹ค์„ ์ง€์ ํ•ฉ๋‹ˆ๋‹ค.

LLM = Transformer ์•„ํ‚คํ…์ฒ˜:
LLM์€ Transformer ๊ธฐ๋ฐ˜์ด๋ฉฐ, ํ•ต์‹ฌ ์—ฐ์‚ฐ์ธ Self-Attention์€ ์ž…๋ ฅ๋œ ๋ชจ๋“  ํ† ํฐ ์Œ ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.

Attention(Q,K,V)=softmax(QKTdk)โ‹…V\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \cdot VAttention(Q,K,V)=softmax(dkโ€‹โ€‹QKTโ€‹)โ‹…V

์—ฌ๊ธฐ์„œ QKTQK^TQKT ์—ฐ์‚ฐ์€ N ร— N ์–ดํ…์…˜ ๋งคํŠธ๋ฆญ์Šค๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค (N = ์ž…๋ ฅ ํ† ํฐ ์ˆ˜). ๋”ฐ๋ผ์„œ Self-Attention์˜ ๊ณ„์‚ฐ ๋ณต์žก๋„๋Š” ์ž…๋ ฅ ๊ธธ์ด์— ๋Œ€ํ•ด O(Nยฒ)์ž…๋‹ˆ๋‹ค.

LMM์—์„œ์˜ ๋ฌธ์ œ:

  • LMM์€ ๊ณ ์ •๋œ ๋Œ€๋Ÿ‰์˜ visual tokens์„ prefix๋กœ ์‚ฌ์šฉ
    • LLaVA-1.5: 576 visual tokens
    • Video-LLaVA: 2,048+ tokens (๊ณ ํ•ด์ƒ๋„/๋น„๋””์˜ค ์ฒ˜๋ฆฌ ์‹œ)
  • ์œ„์™€ ๊ฐ™์€ ๊ตฌ์กฐ๋กœ ์ธํ•˜์—ฌ Visual tokens ์ˆ˜๊ฐ€ ๋Š˜์–ด๋‚ ์ˆ˜๋ก LLM์˜ ์–ดํ…์…˜ ์—ฐ์‚ฐ๋Ÿ‰์ด ์ œ๊ณฑ์œผ๋กœ ์ฆ๊ฐ€

๐Ÿ”Ž ํ•ต์‹ฌ ์งˆ๋ฌธ: Prefix visual tokens์˜ ์ˆ˜๋ฅผ ์ค„์ด๋ฉด์„œ๋„ ์„ฑ๋Šฅ์„ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€?


1.5 ํ•ต์‹ฌ ๊ด€์ฐฐ: Visual Tokens์˜ Redundancy

๋ณธ ์—ฐ๊ตฌ์—์„œ ๋ฐœ๊ฒฌํ•œ ์ค‘์š”ํ•œ ํ˜„์ƒ:

๊ด€์ฐฐ 1: Sparse Attention Distribution

Vision Encoder์˜ self-attention์—์„œ [CLS] ํ† ํฐ๊ณผ spatial patches ๊ฐ„์˜ ์–ดํ…์…˜์ด sparseํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ์†Œ์ˆ˜์˜ visual tokens๋งŒ์ด ํ•ต์‹ฌ ์‹œ๊ฐ ์ •๋ณด์™€ ์—ฐ๊ด€๋จ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

๊ด€์ฐฐ 2: ๋Œ€๋ถ€๋ถ„์˜ Visual Tokens์€ Redundant

๊ธฐ์กด ์—ฐ๊ตฌ(Bolya et al., 2023; Liu et al., 2022)์™€ ์ผ๊ด€๋˜๊ฒŒ, ๋Œ€๋ถ€๋ถ„์˜ visual tokens์€ ์„ฑ๋Šฅ ์ €ํ•˜ ์—†์ด ์ œ๊ฑฐ(prune)๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.


1.6 ์ œ์•ˆ ๋ฐฉ๋ฒ•: PruMerge ๊ฐœ์š”

์ด๋Ÿฌํ•œ sparse similarity๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ค‘์š”ํ•œ visual tokens์„ ์ ์‘์ ์œผ๋กœ ์„ ํƒํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค.

PruMerge์˜ ํ•ต์‹ฌ ์•„์ด๋””์–ด:

  1. Adaptive Token Selection (Prune)

    • Interquartile Range (IQR) ๊ธฐ๋ฐ˜ outlier detection
    • [CLS] ์–ดํ…์…˜ ๊ฐ’์ด ๋†’์€ ํ† ํฐ์„ ์ค‘์š” ํ† ํฐ์œผ๋กœ ์„ ๋ณ„
  2. Token Merging (Merge)

    • IQR๋กœ 32๊ฐœ ํ† ํฐ๋งŒ ์„ ํƒํ•˜๋ฉด, ๋‚˜๋จธ์ง€ 544๊ฐœ ํ† ํฐ์˜ ์ •๋ณด๊ฐ€ ์™„์ „ํžˆ ์†์‹ค๋  ์ˆ˜ ์žˆ์Œ
    • ์ด๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด k-nearest neighbor ๊ธฐ๋ฐ˜ ํด๋Ÿฌ์Šคํ„ฐ๋ง
    • ์„ ํƒ๋œ ํ† ํฐ์„ weighted averaging์œผ๋กœ ์—…๋ฐ์ดํŠธ
    • ์ œ๊ฑฐ๋œ ํ† ํฐ์˜ ์ •๋ณด๋ฅผ ๋ณด์กด์„ ๋ชฉ์ ์œผ๋กœ ํ•จ
  3. PruMerge+ (ํ™•์žฅ)

    • ๋„ˆ๋ฌด ๊ณต๊ฒฉ์ ์ธ ์••์ถ•์œผ๋กœ ์ธํ•œ ์„ฑ๋Šฅ ์ €ํ•˜ ๊ฐ€๋Šฅ์„ฑ ์กด์žฌ
    • Spatial-uniform sampling ์ถ”๊ฐ€
    • ๋” ํฌ๊ด„์ ์ด๊ณ  ๋Œ€ํ‘œ์„ฑ ์žˆ๋Š” ํ† ํฐ ์„ ํƒ ๋ณด์žฅ

๐Ÿ“ท ๋ณธ ๋…ผ๋ฌธ์˜ ๊ธฐ์—ฌ์ 

  1. Visual token redundancy ๋ถ„์„: [CLS]-spatial attention์˜ sparsity ๊ด€์ฐฐ
  2. PruMerge ์ œ์•ˆ: ์ ์‘์  ํ† ํฐ ์„ ํƒ ๋ฐ ๋ณ‘ํ•ฉ ์ „๋žต
  3. Plug-and-play ์ ์šฉ: ๊ธฐ์กด LMM์— ์ถ”๊ฐ€ ํ•™์Šต ์—†์ด ์ ์šฉ ๊ฐ€๋Šฅ
  4. ๋‹ค์–‘ํ•œ ๋ชจ๋‹ฌ๋ฆฌํ‹ฐ ํ™•์žฅ: ์ด๋ฏธ์ง€๋ฟ ์•„๋‹ˆ๋ผ ๋น„๋””์˜ค(Video-LLaVA)์—๋„ ์ ์šฉ ๊ฐ€๋Šฅ

2.1 Efficient Large Multimodal Models

1. Compact architectures (smaller models)

MobileVLM / MobileVLM-v2

Goal: an LMM that can run on mobile devices

https://arxiv.org/abs/2402.03766

Typical LMM:  Vision Encoder → [Vicuna-7B] → response
MobileVLM:    Vision Encoder → [MobileLLaMA-1.4B] → response
                                      ↑
                                5× smaller LLM

Characteristics:

  • Mobile-optimized LLM backbone
  • Lightweight projector design

TinyGPT-V

Goal: strong performance even with a small LLM

https://arxiv.org/abs/2312.16862

Vanilla LLaVA:  Vision Encoder → [Vicuna-7B] → response
TinyGPT-V:      Vision Encoder → [Phi-2 (2.7B)] → response
                                       ↑
                               Microsoft's small LLM

Characteristics:

  • Leverages Phi-2's strong reasoning ability
  • About 3× smaller than a 7B model

LLaVA-Phi

Goal: an efficient Phi-based LMM

https://dl.acm.org/doi/abs/10.1145/3688863.3689575

Vision Encoder → [Phi-2] → response

Characteristics:

  • Small backbone + improved vocabulary
  • Aims for better generalization

TinyLLaVA

Goal: a study of architecture choices and training optimization

https://arxiv.org/abs/2402.14289

Questions explored:

  • Which vision encoder is optimal?
  • Which projector is optimal?
  • Which training strategy is optimal?

Conclusion: with the right optimization, small models can approach the performance of large ones


MoE-LLaVA

Goal: improve efficiency with a Mixture of Experts

https://arxiv.org/abs/2401.15947

Dense LLM:  all parameters are always active

MoE-LLM:    Expert 1  Expert 2  Expert 3  Expert 4
                ↑         ↑
            only the experts chosen by the router are active (sparse)

Characteristics:

  • Many total parameters, but only a fraction used at inference
  • Less computation while maintaining performance

2. Quantization & Compression

4/8-bit quantization

Goal: save memory and computation by lowering parameter precision

Original (FP16):
W = [0.1234, -0.5678, 0.9012, ...]  ← each number stored in 16 bits

INT8 quantization:
W = [0.12, -0.57, 0.90, ...]        ← each number approximated in 8 bits

INT4 quantization:
W = [0.1, -0.6, 0.9, ...]           ← each number approximated in 4 bits

Limitation:

  • Compression target: the model's parameters (weights)
  • The token count is unchanged → attention cost stays the same

3. Vision-Language Connectors

Modules that convert vision-encoder outputs into LLM inputs.

MLP Projector (LLaVA)

https://arxiv.org/abs/2304.08485

Visual Token (1024-dim) → [Linear → GELU → Linear] → LLM Token (4096-dim)

Characteristics:

  • The simplest design
  • No change in token count (576 → 576)

Q-Former (BLIP-2)

https://arxiv.org/abs/2301.12597

Visual Tokens (576)
        ↓
   [Q-Former]  ← cross-attention with 32 learnable query tokens
        ↓
Query Outputs (32)

Characteristics:

  • Learnable queries "interrogate" the visual information
  • Token count reduced (576 → 32)
  • But the number of queries is fixed (not adaptive)

Resampler (Flamingo)

https://arxiv.org/abs/2204.14198

Visual Tokens (variable)
        ↓
   [Perceiver Resampler]  ← cross-attention with latent tokens
        ↓
Fixed-size Output (64)

Characteristics:

  • Handles inputs of varying resolution
  • Fixed number of output tokens

Connector comparison

| Connector | Input tokens | Output tokens | Adaptive? |
|---|---|---|---|
| MLP (LLaVA) | 576 | 576 | ✗ |
| Q-Former (BLIP-2) | 576 | 32 | ✗ (fixed) |
| Resampler (Flamingo) | variable | 64 | ✗ (fixed) |
| PruMerge | 576 | ~32 (varies) | ✓ (adaptive) |

2.2 Token Reduction Methods

Sparse Attention

Linformer

  • Problem: the O(N²) complexity of self-attention
  • Solution: project Key and Value to a lower dimension

https://arxiv.org/abs/2006.04768

Standard attention:
Q (N×d) @ K^T (d×N) = N×N matrix  → O(N²)

Linformer:
K' = E @ K  (k×d, where k << N)
V' = F @ V  (k×d)
Q @ K'^T = N×k matrix  → O(N×k) ≈ O(N)

Limitation: hard to apply directly to LMMs (prefix structure)


Reformer

  • Problem: attention cost over long sequences
  • Solution: Locality-Sensitive Hashing (LSH)

https://arxiv.org/pdf/2001.04451

Standard: compute attention over every token pair

Reformer:
  1. LSH groups similar tokens into buckets
  2. Attention is computed only within each bucket

[Tokens] → [LSH Hashing] → [Bucket 1] [Bucket 2] [Bucket 3]
                               ↓          ↓          ↓
                            Attention  Attention  Attention
                            (within)   (within)    (within)

Limitation: all tokens are still kept; only the computation pattern changes


Token Merging

ToMe (Bolya et al., 2023)

  • Goal: progressively reduce the token count inside the ViT
  • Method: merge similar tokens via bipartite matching

https://arxiv.org/abs/2210.09461

ViT Block 1: 576 tokens
      ↓ (merge)
ViT Block 2: 500 tokens
      ↓ (merge)
ViT Block 3: 450 tokens
      ↓ (merge)
...
Final: 1 token (class token)

Bipartite matching:

Split the tokens into two groups:

Group A: [T1, T3, T5, ...]
Group B: [T2, T4, T6, ...]

Match similar pairs, then merge:
T1 + T2 → T'1
T3 + T4 → T'2
...
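The matching-and-merging step above can be sketched as follows (my simplification of ToMe: the real method uses a proportional/weighted merge and handles ties; here matched pairs are simply averaged):

```python
import numpy as np

# Bipartite soft matching: alternate tokens into groups A and B, match each
# A token to its most similar B token, merge the r best pairs by averaging.
def tome_merge(tokens: np.ndarray, r: int) -> np.ndarray:
    a, b = tokens[0::2], tokens[1::2]
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a_n @ b_n.T                         # cosine similarity, |A| x |B|
    best_b = sim.argmax(axis=1)               # best match in B for each A token
    best_sim = sim.max(axis=1)
    merge_a = np.argsort(-best_sim)[:r]       # the r most similar A tokens merge
    merged = b.copy()
    for i in merge_a:
        merged[best_b[i]] = (merged[best_b[i]] + a[i]) / 2
    keep_a = np.delete(a, merge_a, axis=0)    # unmerged A tokens survive
    return np.concatenate([keep_a, merged], axis=0)

x = np.random.default_rng(0).standard_normal((576, 64))
print(tome_merge(x, r=76).shape)              # (500, 64): 576 -> 500, as in one ViT block
```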

๊ธฐ์กด Token Reduction vs PruMerge ๋น„๊ต

ํ•ญ๋ชฉ ToMe (๊ธฐ์กด) PruMerge
์ ์šฉ ์œ„์น˜ ViT ๋‚ด๋ถ€ (layer-by-layer) ViT ์ถœ๋ ฅ ํ›„ (ํ•œ ๋ฒˆ์—)
๋ชฉํ‘œ ViT ์—ฐ์‚ฐ ๊ฐ€์† LLM ์—ฐ์‚ฐ ๊ฐ€์†
์ถœ๋ ฅ Single [CLS] token Multiple visual tokens
๊ฐ์†Œ ๋ฐฉ์‹ ์ ์ง„์  (576โ†’500โ†’450โ†’โ€ฆ) ํ•œ ๋ฒˆ์— (576โ†’32)
Adaptive โœ— (๊ณ ์ • ๋น„์œจ) โœ“ (์ด๋ฏธ์ง€๋ณ„ ๋‹ค๋ฆ„)

  1. Method: Token Pru-Merging

3.1 Preliminaries

Vision Transformers (ViTs)

๊ตฌ์กฐ:

Input Image
  โ†“ (Patch embedding)
Patch Tokens (576 tokens for 336ร—336 image with 14ร—14 patches)
  + Class Token ([CLS])
  โ†“
Transformer Blocks (ร—24 for ViT-L/14)
  โ”œโ”€ Multi-head Self-Attention (MSA)
  โ”œโ”€ Feed-Forward Network (FFN)
  โ”œโ”€ Skip connections
  โ””โ”€ Layer Normalization
  โ†“
Output Tokens

Self-attention mechanism:

# Compute Query, Key, Value
Q = X · Wq
K = X · Wk
V = X · Wv

# Compute attention
A = softmax(Q · K^T / √dk)
Y = A · V

Class token attention:

# attention between the [CLS] token and the visual tokens
a_cls = softmax(q_cls · K^T / √dk)

Key observation:

  • The distribution of a_cls is very sparse
  • Only a few visual tokens receive high attention values
  • Most tokens get near-zero attention

Large Multimodal Models (LMMs)

Pipeline:

Image X_v → [Vision Encoder] → Z_v → [Projector W] → H_v
                                                       ↓
Text X_q  → [Tokenizer] ──────────────────────────→  H_q
                                                       ↓
                                              [LLM f_θ] → Response Y_a

Computational cost:

  • N tokens → N × N attention matrix
  • Quadratic complexity: O(N²)
  • The more visual tokens, the steeper the cost

💼 The goal of LLaVA-PruMerge: reduce the number of visual tokens to cut the LLM's computational cost


3.2 Adaptive Important Token Selection via Outlier Detection

Key question: "How do we judge the importance of each visual token?"


Two extreme paradigms

| Paradigm | Token count | Characteristic |
|---|---|---|
| LMM | 576 (all used) | Detailed visual representation |
| CLIP | 1 ([CLS] only) | Maximally compressed representation |

Finding the middle ground: inspecting [CLS]-to-visual attention

To balance these two extremes, the paper inspects the attention between the [CLS] token and the visual tokens.

Observation (Figure 3a):

  • Y axis: log(class attention value)
  • X axis: visual token index (0-575)

Distribution:

  • Most tokens: near-zero attention
  • A few tokens: very high attention (outliers)

Key finding: the attention distribution is very sparse
→ only a few visual tokens carry the key visual information

Observation (Figure 3b):

  • PruMerge: selects only where the information matters → efficient, but some information may be lost
  • PruMerge+: important regions + uniform sampling → slightly more tokens, guaranteed coverage

IQR (Interquartile Range)-based outlier detection

In a sparse attention distribution, outlier = important token.

Algorithm:

# 1. Compute quartiles of the attention values
Q1 = percentile(a_cls, 25)  # first quartile
Q3 = percentile(a_cls, 75)  # third quartile

# 2. Compute the IQR
IQR = Q3 - Q1

# 3. Compute the upper fence (threshold)
upper_fence = Q3 + 1.5 * IQR

# 4. Outliers = important tokens
important_indices = where(a_cls > upper_fence)

https://docs.oracle.com/cloud/help/ko/pbcs_common/PFUSU/insights_metrics_IQR.htm#PFUSU-GUID-CF37CAEA-730B-4346-801E-64612719FF6B

์™œ IQR์ธ๊ฐ€?

  • Attention score๋Š” ์–‘์ˆ˜์ด๋ฏ€๋กœ upper fence๋งŒ ์‚ฌ์šฉ
  • ๊ฐ ์ด๋ฏธ์ง€์˜ ๋ถ„ํฌ์— ๋”ฐ๋ผ threshold๊ฐ€ ์ž๋™ ์กฐ์ ˆ
  • ํ†ต๊ณ„์ ์œผ๋กœ ๊ฒ€์ฆ๋œ robustํ•œ outlier detection

Properties of adaptive selection

Automatic adjustment to image complexity:

| Image type | Property | Tokens selected |
|---|---|---|
| Complex image (lots of text) | Many attention outliers | More (40-50) |
| Simple image (sky + sign) | Few attention outliers | Fewer (10-20) |

The average token count per benchmark is reported in Table 4.

Baselines compared (at the same token count):

| Method | Description |
|---|---|
| LLaVA-PruMerge | IQR-based adaptive selection |
| Sequential | Select the first N tokens in order |
| Spatial | Select N tokens on a uniform spatial grid (e.g., 5×8, 8×5) |
Sequential selection:
[T1, T2, T3, ..., T40] ← only the first 40 tokens

Image patch order:
┌──────────────────────────┐
│ T1  T2  T3  ... T24    │ ← only the top rows selected
│ T25 T26 T27 ... T48    │
│ ...                    │ ← the bottom is ignored entirely
│ T553 ... T576          │
└──────────────────────────┘

Spatial selection (5×8 = 40):
┌──────────────────────────┐
│ ■  ·  ·  ·  ■  ·  ·  · │
│ ·  ·  ·  ·  ·  ·  ·  · │
│ ■  ·  ·  ·  ■  ·  ·  · │
│ ·  ·  ·  ·  ·  ·  ·  · │
│ ■  ·  ·  ·  ■  ·  ·  · │
└──────────────────────────┘

PruMerge selection:
┌──────────────────────────┐
│ ·  ·  ■  ■  ■  ·  ·  · │
│ ·  ·  ■  ■  ■  ·  ·  · │  ← concentrated on text/object regions
│ ·  ·  ■  ■  ■  ·  ·  · │
│ ·  ·  ·  ·  ·  ·  ·  · │
│ ·  ·  ·  ·  ·  ·  ·  · │
└──────────────────────────┘

Task: TextVQA (40 tokens)

| Approach | Performance |
|---|---|
| LLaVA-PruMerge | 54.00 |
| Sequential | 42.72 |
| Spatial (5×8) | 46.85 |
| Spatial (8×5) | 47.42 |

Task: MME (40 tokens)

| Approach | Performance |
|---|---|
| LLaVA-PruMerge | 1250.07 |
| Sequential | 703.60 |
| Spatial (5×8) | 1180.23 |
| Spatial (8×5) | 1142.32 |

Task: POPE (35 tokens)

| Approach | Performance |
|---|---|
| LLaVA-PruMerge | 76.2 |
| Sequential | 11.7 |
| Spatial (5×7) | 69.8 |
| Spatial (7×5) | 71.1 |

Task: ScienceQA (16 tokens)

| Approach | Performance |
|---|---|
| LLaVA-PruMerge | 68.07 |
| Sequential | 64.20 |
| Spatial (4×4) | 66.29 |

Using the penultimate layer

Why the penultimate (second-to-last) layer rather than the last one?

  • Last layer: specialized for classification
  • Penultimate layer: richer feature representation

3.3 Token Supplement via Similar Key Clustering

"While pruned tokens may initially seem extraneous, they hold potential value for the perception capabilities of the LLM backbone."

Problem: information loss from pruned tokens

If the pruned tokens are discarded entirely:

  • Information is lost when a large object dominates the scene
  • The model's representation ability can degrade

Example: a large object dominating the frame

┌─────────────────────────────────────┐
│  🐘 🐘 🐘 🐘 🐘 🐘 🐘 🐘 🐘 🐘      │
│  🐘 🐘 🐘 🐘 🐘 🐘 🐘 🐘 🐘 🐘      │
│  🐘 🐘 🐘 🐘 🐘 🐘 🐘 🐘 🐘 🐘      │  ← the elephant fills most of the image
│  🐘 🐘 🐘 🐘 🐘 🐘 🐘 🐘 🐘 🐘      │
│  🌿 🌿 🌿 🌿 🌿 🌿 🌿 🌿 🌿 🌿      │
└─────────────────────────────────────┘

Result of IQR outlier selection:

  • Only distinctive parts such as the elephant's eyes and ears are selected (5-10 tokens)
  • The rest of the elephant's body is pruned

Problem: not enough information left to represent the whole elephant

Solution: instead of discarding the pruned tokens, merge them into the selected tokens so that their features survive.


Measuring token similarity with key vectors

"Since the key vector of each patch token already contains information summarized in the self-attention module, the final layer's key vector serves as the representation."

Why key vectors?

  • In self-attention, a token's key vector already summarizes its information
  • It can be reused without extra computation

Similarity:

$$\text{Sim}(y_i, y_j) = \mathbf{k}_i \cdot \mathbf{k}_j^T$$

Vectorized over all tokens: $\mathbf{K}\mathbf{K}^T$

Similarity matrix (576 × 576):
# K: key vectors of all tokens [576, d_k]
# d_k: key dimension (e.g., 64)

              T0     T1     T2     T3    ...   T575
        ┌──────────────────────────────────────────┐
   T0   │  1.0    0.3    0.1    0.8   ...   0.2  │
   T1   │  0.3    1.0    0.7    0.2   ...   0.4  │
   T2   │  0.1    0.7    1.0    0.1   ...   0.3  │
   T3   │  0.8    0.2    0.1    1.0   ...   0.5  │
   ...  │  ...    ...    ...    ...   ...   ...  │
   T575 │  0.2    0.4    0.3    0.5   ...   1.0  │
        └──────────────────────────────────────────┘

similarity_matrix[i][j] = similarity between token i and token j
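The matrix above, computed directly on toy key vectors (random here; the real ones come from CLIP-ViT's penultimate layer). Normalizing the keys first turns K @ K.T into cosine similarity:

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.standard_normal((576, 64))                 # key vectors [576, d_k]
K = K / np.linalg.norm(K, axis=1, keepdims=True)   # unit-normalize each row
similarity_matrix = K @ K.T                        # [576, 576] cosine similarities
print(similarity_matrix.shape)                     # (576, 576)
```

Each diagonal entry is 1.0 (every token is maximally similar to itself), matching the diagram.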

K-nearest-neighbor clustering & weighted merge

Procedure (NumPy version of the merge step):

import numpy as np

def token_merge(K, Y, a_cls, unpruned_indices, k=32):
    """
    K: key vectors [576, d_k]
    Y: token features [576, d]
    a_cls: class attention [576]
    unpruned_indices: indices selected by IQR [m]
    k: number of neighbors
    """

    # Step 1: similarity matrix
    similarity_matrix = K @ K.T  # [576, 576]

    # Step 2: merge around each cluster center
    merged_tokens = []

    for p in unpruned_indices:
        # k nearest neighbors by similarity
        sims = similarity_matrix[p]               # [576]
        neighbor_idx = np.argsort(sims)[-k:]      # top-k indices

        # class-attention weights
        weights = a_cls[neighbor_idx]             # [k]

        # weighted sum
        merged = (weights @ Y[neighbor_idx]) / weights.sum()
        merged_tokens.append(merged)

    return np.stack(merged_tokens)  # [m, d]

Key points:

  • Selected tokens = cluster centers
  • Pruned tokens are merged into their most similar center
  • Class attention serves as the weight → important information contributes more

(๊ฐœ์ธ์˜๊ฒฌ) ์ฝ”๋“œ ๊ตฌํ˜„์ฒด์—์„œ๋Š” k๋ฅผ 32๋กœ ๊ณ ์ •ํ•ด๋‘๋Š”๋ฐ, ์ด๋ฅผ dynamicํ•˜๊ฒŒ ๋ฐ”๊พธ๋Š”๊ฒŒ ๋” ๋งž์ง€ ์•Š์„๊นŒ? ๐Ÿค” (๋งํฌ)

# 1. Cosine similarity (normalized dot product instead of raw KK^T)
cos_sim_matrix = torch.bmm(key_others_norm, rest_Keys.transpose(1, 2))
## bmm: batched matrix multiplication

# 2. Select the top-k nearest neighbors ← this is the KNN step!
_, cluster_indices = torch.topk(cos_sim_matrix, k=int(32), dim=2, largest=True)
## topk: returns the k largest (or smallest) values and their indices

3.4 PruMerge+: Bridging the Efficiency-Performance Gap

Problem: PruMerge's performance gap

PruMerge achieves ~14× compression (5.5% of tokens), but:

  • A marginal performance drop versus the original LLaVA
  • Tokens can concentrate in particular regions

Solution: add spatial uniform sampling

PruMerge+ strategy:

Final Tokens = Attention-based Outliers + Spatially-uniform Samples

์•Œ๊ณ ๋ฆฌ์ฆ˜:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
# Step 1: IQR๋กœ outlier ratio ๊ณ„์‚ฐ
if if_adaptive:
    reduction_ratio = outlier_dectection(cls_attn)  # ์˜ˆ: 0.05 (5%)

# Step 2: Top-k๋กœ outlier ์„ ํƒ
_, idx = torch.topk(cls_attn, int(N * reduction_ratio), dim=1, largest=True)
# idx: [B, ~32] โ† IQR ๊ธฐ๋ฐ˜ ์„ ํƒ๋œ ์ธ๋ฑ์Šค

# Step 3: Spatial Uniform Sampling
if if_adaptive:
    step_length = int(1 / reduction_ratio)  # ์˜ˆ: 1/0.05 = 20
    
    # ๊ท ๋“ฑ ๊ฐ„๊ฒฉ์œผ๋กœ ์ƒ˜ํ”Œ๋ง (step_length/3 ๊ฐ„๊ฒฉ)
    arithmetic_sequence = torch.arange(0, 575, int(step_length / 3))
    # ์˜ˆ: step=20 โ†’ step/3โ‰ˆ6 โ†’ [0, 6, 12, 18, 24, ..., 570]
    
    # ์ด๋ฏธ ์„ ํƒ๋œ ์ธ๋ฑ์Šค ์ œ์™ธ (์ค‘๋ณต ์ œ๊ฑฐ)
    original_tensor_1d = idx.flatten()
    filtered_sequence = [x for x in arithmetic_sequence if x not in original_tensor_1d]
    
    # Step 4: Union (ํ•ฉ์ง‘ํ•ฉ)
    concatenated_tensor = torch.cat((idx, filtered_sequence.unsqueeze(0)), dim=1)
    idx = concatenated_tensor  # ์ตœ์ข… ์ธ๋ฑ์Šค

ํšจ๊ณผ:

  • ๊ณต๊ฐ„์ ์œผ๋กœ underrepresented ์˜์—ญ ๋ณด์™„
  • ๋” comprehensiveํ•œ visual representation

PruMerge vs. PruMerge+

| Aspect | PruMerge | PruMerge+ |
|---|---|---|
| Compression | ~14× (5.5%) | ~4× (25%) |
| Selection | IQR outliers only | Outliers + spatial uniform |
| Spatial coverage | Can be skewed | Guaranteed uniform |

Performance comparison (Vicuna-7B):

| Metric | LLaVA-1.5 | PruMerge | PruMerge+ |
|---|---|---|---|
| VQAv2 | 78.5 | 72.0 | 76.8 |
| ScienceQA | 66.8 | 68.5 | 68.3 |
| TextVQA | 58.2 | 56.0 | 57.1 |
| POPE | 85.9 | 76.3 | 84.0 |
| MME | 1510.7 | 1350.3 | 1462.4 |
| MMBench | 64.3 | 60.9 | 64.9 |

Trade-off:

  • PruMerge: maximum efficiency (14× compression), slight performance drop
  • PruMerge+: efficiency-performance balance (4× compression, near-original performance)

3.5 Algorithm Summary

Algorithm 1: Token PruMerge and PruMerge+

# Input: K, Q (penultimate layer), Y (output tokens), n (token count)
# Output: Y' (m tokens, m << n)

def token_prumerge(K, Q, Y, n):
    # Step 1: Calculate class attention
    a_cls = calculate_attention(Q_cls, K)  # Eq 3.2

    # Step 2: Adaptive token selection via IQR
    indices = IQR_outlier_detection(a_cls)  # Sec 3.2
    m = len(indices)
    selected_indices = indices

    # Step 3 (Optional - PruMerge+): Spatial sampling
    if PRUMERGE_PLUS:
        r_o = m / n
        spatial_indices = spatial_uniform_sample(r_o)
        selected_indices = indices + spatial_indices
        m = len(selected_indices)

    # Step 4: Token merging via k-NN clustering
    Y_prime = []
    for p in selected_indices:
        # Calculate similarity
        similarities = cosine_similarity(
            K[p],
            K[others]
        )

        # Find k nearest neighbors
        neighbor_indices = topk(similarities, k=32)

        # Weighted averaging
        weights = a_cls[neighbor_indices]
        y_p_prime = weighted_average(
            Y[neighbor_indices],
            weights=weights
        )

        Y_prime.append(y_p_prime)

    return Y_prime  # m tokens

ํ•ต์‹ฌ ๋‹จ๊ณ„:

  1. AITS: IQR๋กœ ์ค‘์š” ํ† ํฐ ์„ ํƒ
  2. (Optional) Spatial sampling
  3. TS: k-NN clustering + weighted merging
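Putting the steps together as a runnable NumPy sketch (toy inputs of my own; the real method takes Y, K, and a_cls from CLIP-ViT's penultimate layer, and the PruMerge+ branch is omitted):

```python
import numpy as np

# End-to-end sketch of Algorithm 1: IQR selection (AITS) + k-NN merge (TS).
def prumerge(Y, K, a_cls, k=32):
    # Steps 1-2 (AITS): IQR outlier detection on the [CLS] attention values.
    q1, q3 = np.percentile(a_cls, [25, 75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    selected = np.where(a_cls > upper_fence)[0]

    # Step 4 (TS): merge each selected token with its k nearest neighbors
    # under key-vector similarity, weighted by class attention.
    K_n = K / np.linalg.norm(K, axis=1, keepdims=True)
    sim = K_n @ K_n.T
    merged = []
    for p in selected:
        nbrs = np.argsort(-sim[p])[:k]          # k most similar tokens
        w = a_cls[nbrs]
        merged.append((w @ Y[nbrs]) / w.sum())  # attention-weighted average
    return np.stack(merged), selected

rng = np.random.default_rng(0)
Y = rng.standard_normal((576, 1024))            # token features
K = rng.standard_normal((576, 64))              # key vectors
a_cls = np.ones(576)
a_cls[::20] = 50.0                              # pretend every 20th patch is salient
Y_prime, idx = prumerge(Y, K, a_cls)
print(Y_prime.shape)                            # (29, 1024): 576 -> 29 tokens
```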

  4. Experiments

4.1 Main Results

Experimental setup

Base model: LLaVA-1.5 (7B, 13B)

  • CLIP ViT-L/14 vision encoder
  • Vicuna-7B / Vicuna-13B LLM
  • 336×336 resolution
  • Original: 576 visual tokens

Training:

  • LoRA fine-tuning (1 epoch)
  • LLaVA-1.5 instruction data
  • Trained with the reduced visual tokens

Evaluation benchmarks:

  1. VQAv2: visual question answering
  2. ScienceQA (SQAI): multimodal reasoning
  3. TextVQA (VQAT): OCR-based QA
  4. POPE: hallucination evaluation
  5. MME: perception & cognition
  6. MMBench (MMB): comprehensive evaluation

์„ฑ๋Šฅ ๋น„๊ต

Table 1: 6๊ฐœ ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ

Method LLM VQAv2 SQAI VQAT POPE MME MMB
Existing Methods ย  ย  ย  ย  ย  ย  ย 
BLIP-2 Vicuna-13B 41.0 61.0 42.5 85.3 1293.8 -
InstructBLIP Vicuna-13B - 63.1 50.7 78.9 1212.8 -
Qwen-VL-Chat Qwen-7B 78.2 68.2 61.5 - 1487.5 60.6
LLaVA-1.5 Baselines ย  ย  ย  ย  ย  ย  ย 
LLaVA-1.5 Vicuna-7B 78.5 66.8 58.2 85.9 1510.7 64.3
+ PruMerge (5.5%) Vicuna-7B 72.0 68.5 56.0 76.3 1350.3 60.9
+ PruMerge+ (25%) Vicuna-7B 76.8 68.3 57.1 84.0 1462.4 64.9
LLaVA-1.5 Vicuna-13B 80.0 71.6 61.3 85.9 1531.3 67.7
+ PruMerge (5.5%) Vicuna-13B 72.8 71.0 58.4 78.5 1428.2 62.3
+ PruMerge+ (25%) Vicuna-13B 77.8 71.0 58.6 84.4 1485.5 65.7

์ฃผ์š” ๋ฐœ๊ฒฌ:

  1. PruMerge+ (25% tokens):

    • VQAv2: 76.8 (์›๋ณธ 78.5 ๋Œ€๋น„ -1.7)
    • ScienceQA: 68.3 (์›๋ณธ 66.8 ๋Œ€๋น„ +1.5)
    • MME: 1462.4 (์›๋ณธ 1510.7 ๋Œ€๋น„ -48.3)
    • MMBench: 64.9 (์›๋ณธ 64.3 ๋Œ€๋น„ +0.6)
    • โ†’ Comparable performance
  2. PruMerge (5.5% tokens):

    • ScienceQA: 68.5 (์›๋ณธ ๋Œ€๋น„ +1.7)
    • POPE: 76.3 (์›๋ณธ 85.9 ๋Œ€๋น„ -9.6)
    • โ†’ ์ผ๋ถ€ ํƒœ์Šคํฌ์—์„œ ์„ฑ๋Šฅ ํ–ฅ์ƒ!
  3. vs. Previous Methods:

    • BLIP-2, InstructBLIP ๋Œ€๋น„ ํ›จ์”ฌ ์šฐ์ˆ˜
    • Qwen-VL-Chat๊ณผ comparable

์™œ ์ผ๋ถ€ ํƒœ์Šคํฌ์—์„œ ์„ฑ๋Šฅ ํ–ฅ์ƒ?

ScienceQA์—์„œ ํ–ฅ์ƒ ์ด์œ :

  • ์ค‘์š”ํ•œ ์‹œ๊ฐ ์ •๋ณด์— ์ง‘์ค‘
  • Redundant tokens ์ œ๊ฑฐ๋กœ signal-to-noise ๋น„์œจ ํ–ฅ์ƒ
  • ์ถ”๋ก ์— ํ•„์š”ํ•œ ํ•ต์‹ฌ features๋งŒ ์„ ํƒ

POPE์—์„œ PruMerge๊ฐ€ ์•ฝํ•œ ์ด์œ :

  • Object presence detection ํ•„์š”
  • Spatial coverage ์ค‘์š”
  • Aggressive pruning (5.5%)์œผ๋กœ ์ผ๋ถ€ ๊ฐ์ฒด ์ •๋ณด ์†์‹ค
  • โ†’ PruMerge+๊ฐ€ ์ด ๋ฌธ์ œ ํ•ด๊ฒฐ (84.0)

4.2 Efficiency Analysis

Computational cost (Table 2)

Environment: Tesla V100 GPU
Methodology: theoretical analysis based on the roofline model

LLaVA-1.5 (Vicuna-7B):

| Config | FLOPs (TB) | Prefill Time (ms) | Total Memory (GB) | Activation (GB) |
|---|---|---|---|---|
| FP16 Original | 9.3 | 88.6 | 23.3 | 4.60 |
| FP16 + PruMerge | 0.91 | 15.3 | 13.7 | 0.28 |
| Speedup | 10.2× | 5.8× | 1.7× | 16.4× |
| INT4 Original | 2.3 | 151.6 | 5.9 | 1.20 |
| INT4 + PruMerge | 0.28 | 14.9 | 3.5 | 0.07 |
| Speedup | 8.2× | 10.2× | 1.7× | 17.1× |

LLaVA-1.5 (Vicuna-13B):

| Config | FLOPs (TB) | Prefill Time (ms) | Total Memory (GB) | Activation (GB) |
|---|---|---|---|---|
| FP16 Original | 18.2 | 170.5 | 41.6 | 7.30 |
| FP16 + PruMerge | 1.80 | 29.5 | 26.6 | 0.44 |
| Speedup | 10.1× | 5.8× | 1.6× | 16.6× |
| INT4 Original | 4.6 | 294.9 | 10.5 | 1.80 |
| INT4 + PruMerge | 0.45 | 29.0 | 6.8 | 0.11 |
| Speedup | 10.2× | 10.2× | 1.5× | 16.4× |

ํ•ต์‹ฌ ํšจ์œจ์„ฑ ํ–ฅ์ƒ:

  1. FLOPs ๊ฐ์†Œ: ~10๋ฐฐ

    • Quadratic complexity ํšจ๊ณผ: O(nยฒ) โ†’ O(mยฒ)
    • 576ยฒ โ†’ 40ยฒ โ‰ˆ 331,776 โ†’ 1,600
  2. Prefill Time: 5.8~10.2๋ฐฐ ๋นจ๋ผ์ง

    • FP16: 88.6ms โ†’ 15.3ms
    • INT4: 151.6ms โ†’ 14.9ms
    • INT4 + PruMerge๊ฐ€ ๊ฐ€์žฅ ๋น ๋ฆ„!
  3. Memory ์ ˆ๊ฐ:

    • Total: 1.5~1.7๋ฐฐ ๊ฐ์†Œ
    • Activation: 16๋ฐฐ ์ด์ƒ ๊ฐ์†Œ
  4. Quantization๊ณผ์˜ ์‹œ๋„ˆ์ง€:

    • INT4 quantization ์ ์šฉ ์‹œ ๋” ๋น ๋ฅธ ์†๋„
    • Orthogonal techniques๋กœ ๊ฒฐํ•ฉ ๊ฐ€๋Šฅ

Scenario analysis

Assumptions:

  • Image: 336×336 (576 visual tokens)
  • Text prompt: 40 tokens
  • After PruMerge: 40 visual tokens

Token counts:

Original:  576 (visual) + 40 (text) = 616 tokens
PruMerge:   40 (visual) + 40 (text) =  80 tokens

Reduction: 616 → 80 (7.7× fewer tokens)

Attention computation:

Original:  616² = 379,456 operations
PruMerge:   80² =   6,400 operations

Speedup: 59.3× in the attention matrix computation

4.3 Generalization to Video LLMs

Integration with Video-LLaVA

Video-LLaVA characteristics:

  • 8 frames per video clip
  • 16×16 patches per frame
  • 2,048 visual tokens (8 × 256)
  • 4× more tokens than LLaVA-1.5

Applying PruMerge (training-free):

  • Applied at inference time only
  • No additional training required
  • Usable immediately

Results (Table 3)

Video QA benchmarks:

| Method | LLM | MSVD-QA Acc | Score | MSRVTT-QA Acc | Score | ActivityNet-QA Acc | Score |
|---|---|---|---|---|---|---|---|
| FrozenBiLM | 1B | 32.2 | - | 16.8 | - | 24.7 | - |
| VideoChat | 7B | 56.3 | 2.8 | 45.0 | 2.5 | - | 2.2 |
| LLaMA-Adapter | 7B | 54.9 | 3.1 | 43.8 | 2.7 | 34.2 | 2.7 |
| Video-LLaMA | 7B | 51.6 | 2.5 | 29.6 | 1.8 | 12.4 | 1.1 |
| Video-ChatGPT | 7B | 64.9 | 3.3 | 49.3 | 2.8 | 35.2 | 2.7 |
| Video-LLaVA (original) | 7B | 70.7 | 3.9 | 59.2 | 3.5 | 45.3 | 3.3 |
| + PruMerge (12.5%) | 7B | 71.1 | 3.9 | 58.4 | 3.5 | 48.3 | 3.4 |
| + PruMerge+ (25%) | 7B | 71.1 | 3.9 | 59.3 | 3.6 | 47.7 | 3.4 |

Striking findings:

  1. Performance improves:

    • MSVD-QA: 70.7 → 71.1 (+0.4)
    • ActivityNet-QA: 45.3 → 48.3 (+3.0)
    • Fewer tokens, yet better performance!
  2. Token compression:

    • Original: 2,048 tokens
    • PruMerge: 256 tokens (12.5%)
    • PruMerge+: 512 tokens (25%)
    • 8× / 4× compression
  3. Training-free:

    • No retraining on video data
    • Applied at inference time only
    • Usable immediately

Insight:

  • Video tokens also carry significant redundancy
  • Both temporal and spatial redundancy can be exploited
  • Future direction: temporal token reduction

4.4 Ablation Study

Terminology

The two modules of PruMerge

  • AITS: Adaptive Important Token Selection
    • Selects important tokens via IQR
  • TS: Token Supplement
    • Merges pruned information back in via k-NN

4.4.1 Token Sampling Strategy Analysis (Table 4)

Strategies compared:

  1. LLaVA-PruMerge: IQR-based adaptive sampling
  2. Sequential: select the first N tokens
  3. Spatial: place N tokens on a uniform spatial grid

Results (compared at the same token count):

TextVQA (40 tokens):

  • PruMerge: 54.00
  • Sequential: 42.72
  • Spatial 5×8: 46.85
  • Spatial 8×5: 47.42
  • → PruMerge is 11.3 points above Sequential

MME (40 tokens):

  • PruMerge: 1250.07
  • Sequential: 703.60
  • Spatial 5×8: 1180.23
  • Spatial 8×5: 1142.32
  • → PruMerge is 77.7% above Sequential

POPE (35 tokens):

  • PruMerge: 76.2
  • Sequential: 11.7 (!)
  • Spatial 5×7: 69.8
  • Spatial 7×5: 71.1
  • Spatial 6×6: 67.9
  • → PruMerge is 6.5× higher than Sequential

ScienceQA (16 tokens):

  • PruMerge: 68.07
  • Sequential: 64.20
  • Spatial 4×4: 66.29
  • → PruMerge is 3.87 points above Sequential

Analysis:

Sequential's problem:

  • The first N tokens cover only one region of the image
  • Severe spatial bias
  • Nearly random guessing on POPE (11.7)

Spatial's strength:

  • Covers the whole image
  • Balanced representation
  • Far better than Sequential

Why PruMerge wins:

  • Attention-guided selection
  • Concentrates on information-dense regions
  • Adaptive to image complexity
  • The gap is largest on TextVQA (OCR)
    • Tokens concentrate on text regions
    • Fine-grained information is preserved

4.4.2 Effectiveness of Each Module (Table 5)

์‹คํ—˜ ์„ค์ •:

  • ๊ณ ์ •: 40 tokens (6.9%)
  • Vicuna-7B ๋ชจ๋ธ
  • 4๊ฐœ ๋ฒค์น˜๋งˆํฌ

Module ์กฐํ•ฉ:

Method SQAI VQAT POPE MME
LLaVA-1.5 (baseline) 66.8 58.2 85.9 1510.7
w. AITS only 66.5 54.8 75.7 1221.6
w. AITS & TS 68.5 56.0 76.3 1350.3

๋ถ„์„:

AITS (Adaptive Important Token Selection) ๋‹จ๋…:

  • SQA: 66.5 (baseline 66.8)
  • TextVQA: 54.8 (baseline 58.2)
  • POPE: 75.7 (baseline 85.9)
  • MME: 1221.6 (baseline 1510.7)
  • โ†’ ํ† ํฐ ์„ ํƒ๋งŒ์œผ๋กœ๋Š” ์„ฑ๋Šฅ ์ €ํ•˜

AITS + TS (Token Supplement):

  • SQA: 68.5 (baseline ๋Œ€๋น„ +1.7)
  • TextVQA: 56.0 (baseline ๋Œ€๋น„ -2.2)
  • POPE: 76.3 (baseline ๋Œ€๋น„ -9.6)
  • MME: 1350.3 (baseline ๋Œ€๋น„ -160.4)
  • โ†’ Token merging์ด ํ•„์ˆ˜์ !

TS์˜ ํšจ๊ณผ:

  • SQA: +2.0 (66.5 โ†’ 68.5)
  • TextVQA: +1.2 (54.8 โ†’ 56.0)
  • POPE: +0.6 (75.7 โ†’ 76.3)
  • MME: +128.7 (1221.6 โ†’ 1350.3)
  • โ†’ ๋ชจ๋“  ํƒœ์Šคํฌ์—์„œ ๊ฐœ์„ 

ํ•ต์‹ฌ Insight:

  • Token selection๋งŒ์œผ๋กœ๋Š” ๋ถ€์กฑ
  • Merging์ด pruned tokens ์ •๋ณด ๋ณด์กด
  • k-NN clustering + weighted averaging ํšจ๊ณผ
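The TS step can be sketched as nearest-neighbor assignment plus attention-weighted averaging. This is my simplification (each pruned token merges into its single most similar kept token; the function name and toy data are illustrative):

```python
import numpy as np

def token_supplement(tokens, attn, keep_idx):
    """TS sketch: assign each pruned token to its most similar kept token
    (cosine similarity), then replace each kept token with the
    attention-weighted average of its cluster."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    pruned_idx = np.setdiff1d(np.arange(len(tokens)), keep_idx)
    sim = normed[pruned_idx] @ normed[keep_idx].T   # (pruned, kept)
    owner = sim.argmax(axis=1)                      # nearest kept token
    merged = []
    for k, ki in enumerate(keep_idx):
        cluster = np.concatenate([[ki], pruned_idx[owner == k]])
        w = attn[cluster] / attn[cluster].sum()     # attention weights
        merged.append(w @ tokens[cluster])          # weighted average
    return np.stack(merged)

# Toy run: 8 tokens of dim 4, keep tokens 0 and 3.
rng = np.random.default_rng(1)
tokens = rng.normal(size=(8, 4))
attn = rng.uniform(0.1, 1.0, size=8)
out = token_supplement(tokens, attn, np.array([0, 3]))   # shape (2, 4)
```

The output keeps the compressed token count (here 2) while folding the pruned tokens' information into semantically similar survivors, which is why TS recovers accuracy across all four benchmarks above.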

4.4.3 Training Analysis (Table 6)

๋น„๊ต:

  1. Training-free: PruMerge๋งŒ ์ ์šฉ, ํ•™์Šต X
  2. LoRA fine-tuning: PruMerge + LoRA 1 epoch

๊ฒฐ๊ณผ (40 tokens, Vicuna-7B):

Method SQAI VQAT POPE MME
LLaVA-1.5 (baseline) 66.8 58.2 85.9 1510.7
w.o. LoRA-FT 68.0 54.0 76.2 1250.1
w. LoRA-FT 68.5 56.0 76.3 1350.3

๋ถ„์„:

Training-free ์„ฑ๋Šฅ:

  • SQA: 68.0 (baseline ๋Œ€๋น„ +1.2)
  • TextVQA: 54.0 (baseline ๋Œ€๋น„ -4.2)
  • POPE: 76.2 (baseline ๋Œ€๋น„ -9.7)
  • MME: 1250.1 (baseline ๋Œ€๋น„ -260.6)
  • โ†’ ์ผ๋ถ€ ํƒœ์Šคํฌ๋Š” ์ฆ‰์‹œ ์‚ฌ์šฉ ๊ฐ€๋Šฅ

Fine-tuning ํšจ๊ณผ:

  • SQA: +0.5 (68.0 โ†’ 68.5)
  • TextVQA: +2.0 (54.0 โ†’ 56.0)
  • POPE: +0.1 (76.2 โ†’ 76.3)
  • MME: +100.2 (1250.1 โ†’ 1350.3)
  • โ†’ ๋ชจ๋“  ํƒœ์Šคํฌ์—์„œ ๊ฐœ์„ 

Trade-off:

  • Training-free: ๋น ๋ฅธ ์ ์šฉ, ์ผ๋ถ€ ์„ฑ๋Šฅ ์ €ํ•˜
  • Fine-tuning: ์ตœ๊ณ  ์„ฑ๋Šฅ, ์ถ”๊ฐ€ ํ•™์Šต ํ•„์š” (1 epoch)

์‹ค์šฉ์  ์„ ํƒ:

  • Resource ์ถฉ๋ถ„: Fine-tuning ๊ถŒ์žฅ
  • ๋น ๋ฅธ ์ ์šฉ ํ•„์š”: Training-free๋กœ ์‹œ์ž‘

5. Summary

5.1 Adaptive Token Selection

ํ•ต์‹ฌ ํ˜์‹ :

  • IQR-based outlier detection: ํ†ต๊ณ„์ ์œผ๋กœ ๊ฒ€์ฆ๋œ ๋ฐฉ๋ฒ•
  • Image-specific adaptation: ์ด๋ฏธ์ง€๋งˆ๋‹ค ๋‹ค๋ฅธ ์ˆ˜์˜ ํ† ํฐ
  • Learned importance: ๋ชจ๋ธ์ด ํ•™์Šตํ•œ attention pattern ํ™œ์šฉ

์žฅ์ :

  • Manual threshold ๋ถˆํ•„์š”
  • Robust to different image types
  • Computation-efficient (๋‹จ์ˆœ ํ†ต๊ณ„ ๊ณ„์‚ฐ)

5.2 Token Merging via k-NN

ํ•ต์‹ฌ ํ˜์‹ :

  • Information preservation: Pruned tokens ์ •๋ณด ๋ณด์กด
  • Similarity-based clustering: Semantic ์œ ์‚ฌ๋„ ๊ธฐ๋ฐ˜
  • Weighted aggregation: Attention์œผ๋กœ ๊ฐ€์ค‘

์žฅ์ :

  • Lossless์— ๊ฐ€๊นŒ์šด ์••์ถ•
  • Semantic consistency ์œ ์ง€
  • Large objects ์ •๋ณด ๋ณด์กด

5.3 PruMerge+ Hybrid Strategy

ํ•ต์‹ฌ ํ˜์‹ :

  • Attention + Spatial: ๋‘ ๊ฐ€์ง€ ์›์น™ ๊ฒฐํ•ฉ
  • Balanced coverage: ์ „์ฒด ์ด๋ฏธ์ง€ ์ปค๋ฒ„๋ฆฌ์ง€
  • Performance-efficiency trade-off: ์„ ํƒ ๊ฐ€๋Šฅ

์žฅ์ :

  • Minimal performance drop
  • Spatial bias ๋ฐฉ์ง€
  • Flexible deployment
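The hybrid can be sketched as the union of the IQR-selected outlier tokens and a uniform spatial sample; with stride 2 on the 24×24 grid, the spatial part alone gives 12×12 = 144 tokens, i.e. the 25% setting. The exact sampling pattern below is my assumption, not the paper's implementation:

```python
import numpy as np

def prumerge_plus_indices(cls_attn, grid=24, stride=2):
    """PruMerge+ sketch: attention-selected IQR outliers, plus every
    `stride`-th grid cell for guaranteed full-image coverage."""
    q1, q3 = np.percentile(cls_attn, [25, 75])
    outliers = np.where(cls_attn > q3 + 1.5 * (q3 - q1))[0]
    coords = np.arange(0, grid, stride)
    spatial = np.array([r * grid + c for r in coords for c in coords])
    return np.union1d(outliers, spatial)   # sorted, deduplicated

# Toy run: one highly attended token that the spatial grid would miss.
rng = np.random.default_rng(0)
attn = rng.uniform(0.0, 0.01, size=576)
attn[13] = 0.5                      # row 0, col 13 -- off the stride-2 grid
idx = prumerge_plus_indices(attn)   # 144 spatial tokens + the outlier
```

Taking the union means the spatial sample bounds the worst case (no region is ever dropped) while the attention term still captures salient off-grid tokens, which is how PruMerge+ avoids Sequential-style spatial bias at minimal extra cost.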

5.4 Plug-and-Play Design

ํ•ต์‹ฌ ํ˜์‹ :

  • Vision encoder level: ์•„ํ‚คํ…์ฒ˜ ๋…๋ฆฝ์ 
  • Training-free option: ์ฆ‰์‹œ ์‚ฌ์šฉ ๊ฐ€๋Šฅ
  • Modular implementation: ์‰ฌ์šด ํ†ตํ•ฉ

์žฅ์ :

  • LLaVA-1.5, Video-LLaVA ๋“ฑ ์ฆ‰์‹œ ์ ์šฉ
  • Minimal code changes
  • Research-friendly

6. Limitations and Future Directions

Current limitations (per the paper)

1. Not Entirely Lossless

  • Visual token compression is not fully lossless
  • A marginal performance gap remains relative to the original LLaVA
  • PruMerge+ (25% of tokens) closes most of the gap, but not all of it

2. Limited validation on large-scale models

  • Constrained by academic-setting computational resources
  • Not yet validated on large models such as LLaVA-NeXT with Yi-34B

Future directions (per the paper)

1. Fully Lossless Compression

  • Develop a fully lossless token-compression algorithm
  • Goal: eliminate the performance gap entirely

2. Scaling to larger models

  • Apply to large models such as LLaVA-NeXT with a Yi-34B backbone
  • Validate generalization and broader impact

7. Conclusion

LLaVA-PruMerge substantially improves the efficiency of Large Multimodal Models:

Key contributions:

  1. Adaptive token selection: IQR-based outlier detection
  2. Information-preserving merging: k-NN clustering + weighted averaging
  3. PruMerge+: hybrid attention + spatial strategy
  4. 14× / 4× compression: large token reductions while maintaining performance

Significance:

  • The first efficiency work to target the number of visual tokens
  • Immediately applicable in a plug-and-play fashion
  • Training-free option enables rapid deployment
  • Directly applicable to video LLMs as well

Practical impact:

  • ~10× FLOPs reduction
  • 5.8-10.2× faster prefill
  • 50% memory savings
  • Orthogonal to quantization (the two can be combined)

LLaVA-PruMerge strikes a balance between efficiency and performance, an important step toward the practical deployment of LMMs.