[Paper Review] LLaVA: Visual Instruction Tuning - A New Paradigm for Multimodal AI

Posted by Euisuk's Dev Log on December 12, 2025


https://arxiv.org/pdf/2304.08485

Paper Information

  • Title: Visual Instruction Tuning
  • Authors: Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
  • Affiliations: University of Wisconsin–Madison, Microsoft Research, Columbia University
  • Venue: NeurIPS 2023
  • Paper: https://arxiv.org/abs/2304.08485
  • GitHub: https://github.com/haotian-liu/LLaVA

  1. Introduction: The Birth of Visual Instruction Tuning

Research Background

๋Œ€ํ˜• ์–ธ์–ด ๋ชจ๋ธ(LLM)์€ instruction tuning์„ ํ†ตํ•ด zero-shot ์„ฑ๋Šฅ์„ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œ์ผฐ์Šต๋‹ˆ๋‹ค. ChatGPT์™€ GPT-4์˜ ์„ฑ๊ณต์€ ์–ธ์–ด ์˜์—ญ์—์„œ instruction-following์˜ ๊ฐ•๋ ฅํ•จ์„ ์ž…์ฆํ–ˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์˜์—ญ์—์„œ์˜ instruction tuning์€ ๊ฑฐ์˜ ํƒ๊ตฌ๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.

Prior computer vision research typically:

  • ๊ฐ ํƒœ์Šคํฌ๋ฅผ ๋…๋ฆฝ์ ์œผ๋กœ ํ•ด๊ฒฐ (classification, detection, segmentation ๋“ฑ)
  • ๊ณ ์ •๋œ ์ธํ„ฐํŽ˜์ด์Šค์™€ ์ œํ•œ์ ์ธ ์ƒํ˜ธ์ž‘์šฉ์„ฑ
  • ์–ธ์–ด๋ฅผ ์ด๋ฏธ์ง€ ์„ค๋ช…์—๋งŒ ํ™œ์šฉ

LLaVA์˜ ํ•ต์‹ฌ ์•„์ด๋””์–ด

LLaVA (Large Language and Vision Assistant) is an end-to-end trained multimodal model that unifies vision and language.

์ฃผ์š” ๊ธฐ์—ฌ์ :

  1. Multimodal Instruction-Following Data

    • Uses language-only GPT-4 to generate vision-language instruction data
    • Converts image-text pairs into an instruction-following format
    • Builds 158K high-quality samples in total
  2. Large Multimodal Model (LMM)

    • Connects the CLIP vision encoder with the Vicuna language model
    • Acquires multimodal instruction-following ability via end-to-end training
    • Achieves 92.53% accuracy on ScienceQA when ensembled with GPT-4 (a new SOTA)
  3. Evaluation Benchmarks

    • LLaVA-Bench (COCO): consistent evaluation
    • LLaVA-Bench (In-the-Wild): diverse, challenging real-world tasks

Results

  • 85.1% relative score compared to GPT-4 (on a synthetic multimodal instruction-following dataset)
  • Exhibits multimodal behavior similar to GPT-4 on unseen images/instructions
  • All data, code, and models released as open source

Multimodal Instruction-Following Agents

์‚ฌ๋žŒ์—๊ฒŒ โ€œ์ด ์‚ฌ์ง„์—์„œ ๊ณ ์–‘์ด๋ฅผ ์ฐพ์•„์„œ ๋นจ๊ฐ„์ƒ‰์œผ๋กœ ์น ํ•ด์ค˜โ€๋ผ๊ณ  ๋งํ•˜๋ฉด ๋Œ€๋ถ€๋ถ„ ์‰ฝ๊ฒŒ ํ•ด๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ AI์—๊ฒŒ๋Š” ์ด๋ฏธ์ง€ ์ดํ•ด, ์ž์—ฐ์–ด ํ•ด์„, ํ–‰๋™ ์‹คํ–‰์ด๋ผ๋Š” ์„ธ ๊ฐ€์ง€ ๋Šฅ๋ ฅ์ด ๋™์‹œ์— ํ•„์š”ํ•œ ์–ด๋ ค์šด ๋ฌธ์ œ์ž…๋‹ˆ๋‹ค. ์—ฐ๊ตฌ์ž๋“ค์€ ์ด ๋ฌธ์ œ๋ฅผ ํฌ๊ฒŒ ๋‘ ๋ฐฉํ–ฅ์œผ๋กœ ์ ‘๊ทผํ•ด์™”์Šต๋‹ˆ๋‹ค.

The first is end-to-end trained models.

A single neural network handles everything from input to output; Vision-Language Navigation (VLN), Habitat 2.0, and InstructPix2Pix are representative examples.

  • VLN์€ โ€œ๊ฑฐ์‹ค๋กœ ๊ฐ€์„œ ๋นจ๊ฐ„ ์†ŒํŒŒ ์˜†์— ์„œโ€์™€ ๊ฐ™์€ ์ง€์‹œ๋ฅผ ๋ฐ›์•„ ๋กœ๋ด‡์ด ์‹ค์ œ๋กœ ์ด๋™ํ•˜๋Š” ํƒœ์Šคํฌ๋ฅผ ๋‹ค๋ฃจ๊ณ , InstructPix2Pix๋Š” โ€œ์ด ์‚ฌ์ง„์„ ํ‘๋ฐฑ์œผ๋กœ ๋ฐ”๊ฟ”์ค˜โ€ ๊ฐ™์€ ํŽธ์ง‘ ์ง€์‹œ๋ฅผ ๋ฐ›์•„ ์ด๋ฏธ์ง€๋ฅผ ์ง์ ‘ ์ˆ˜์ •ํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๋Ÿฐ ๋ชจ๋ธ๋“ค์€ ์ถ”๋ก  ์†๋„๊ฐ€ ๋น ๋ฅด๊ณ  ํ•™์Šต๋œ ํƒœ์Šคํฌ์—์„œ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์ด์ง€๋งŒ, ๊ฐ ๋ชจ๋ธ์ด ํŠน์ • ํƒœ์Šคํฌ์—๋งŒ ํŠนํ™”๋˜์–ด ์žˆ๋‹ค๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์ƒˆ๋กœ์šด ํƒœ์Šคํฌ๊ฐ€ ํ•„์š”ํ•˜๋ฉด ์ฒ˜์Œ๋ถ€ํ„ฐ ๋ณ„๋„์˜ ๋ชจ๋ธ์„ ํ•™์Šต์‹œ์ผœ์•ผ ํ•ฉ๋‹ˆ๋‹ค.

The second is system-based approaches.

LLM์„ ์ง€ํœ˜์ž(Orchestrator)๋กœ ํ™œ์šฉํ•˜์—ฌ ์—ฌ๋Ÿฌ ์ „๋ฌธ ๋ชจ๋ธ์„ ์ˆœ์ฐจ์ ์œผ๋กœ ํ˜ธ์ถœํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. Visual ChatGPT, MM-REACT, VisProg, ViperGPT ๋“ฑ์ด ์—ฌ๊ธฐ์— ํ•ด๋‹นํ•ฉ๋‹ˆ๋‹ค.

  • For example, when a user asks Visual ChatGPT to "remove the background and convert it to an oil-painting style," ChatGPT parses the instruction, removes the background with Segment Anything, and converts the style with Stable Diffusion. This can handle diverse tasks, but calling several models in sequence is slow, and errors from earlier models propagate to the models that follow.

LLaVA๋Š” ์ด ๋‘ ์ ‘๊ทผ๋ฒ• ์‚ฌ์ด์—์„œ ๊ท ํ˜•์ ์„ ์ฐพ์Šต๋‹ˆ๋‹ค. End-to-End ๋ชจ๋ธ์˜ ํšจ์œจ์„ฑ(๋‹จ์ผ ๋ชจ๋ธ, ๋น ๋ฅธ ์ถ”๋ก )์„ ์œ ์ง€ํ•˜๋ฉด์„œ๋„, ์‹œ์Šคํ…œ ๊ธฐ๋ฐ˜ ์ ‘๊ทผ์˜ ๋ฒ”์šฉ์„ฑ(๋‹ค์–‘ํ•œ ํƒœ์Šคํฌ ์ฒ˜๋ฆฌ)์„ ๊ฐ–์ถ”๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์ž…๋‹ˆ๋‹ค.

Instruction Tuning in NLP

Instruction Tuning์€ ์‚ฌ์ „ํ•™์Šต๋œ LLM์„ ์ž์—ฐ์–ด ์ง€์‹œ๋ฌธ ํ˜•ํƒœ์˜ ๋ฐ์ดํ„ฐ๋กœ ์ถ”๊ฐ€ ํ•™์Šต์‹œ์ผœ, ๋‹ค์–‘ํ•œ ํƒœ์Šคํฌ๋ฅผ zero-shot์œผ๋กœ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋งŒ๋“œ๋Š” ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค.

  • GPT-3 evolved into InstructGPT, T5 into FLAN-T5, and PaLM into FLAN-PaLM. If you feed the base GPT-3 "Translate 'hello' to Korean," it simply continues the sentence, whereas InstructGPT understands the instruction and correctly responds "์•ˆ๋…•ํ•˜์„ธ์š”."

https://medium.com/@lmpo/an-overview-instruction-tuning-for-llms-440228e7edab

LLaVA๋Š” ์ด Instruction Tuning ์•„์ด๋””์–ด๋ฅผ ๋น„์ „ ๋ถ„์•ผ์— ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. ํ•ต์‹ฌ์€ ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ์–ด๋–ป๊ฒŒ ํ™•๋ณดํ•˜๋А๋ƒ์ธ๋ฐ, LLaVA๋Š” GPT-4๋ฅผ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ๊ธฐ๋กœ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค. GPT-4๋Š” ์ด๋ฏธ์ง€๋ฅผ ์ง์ ‘ ๋ณผ ์ˆ˜ ์—†์ง€๋งŒ, ์ด๋ฏธ์ง€์˜ ์บก์…˜์ด๋‚˜ ๋ฐ”์šด๋”ฉ ๋ฐ•์Šค ์ •๋ณด๋ฅผ ํ…์ŠคํŠธ๋กœ ์ œ๊ณต๋ฐ›์œผ๋ฉด ํ•ด๋‹น ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ ๊ณ ํ’ˆ์งˆ Q&A ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” NLP์—์„œ ๊ฒ€์ฆ๋œ Self-Instruct ๋ฐฉ์‹์˜ ๋ณ€ํ˜•์œผ๋กœ, LLaVA๋Š” ์ด๋ ‡๊ฒŒ ์ƒ์„ฑ๋œ visual instruction-following ๋ฐ์ดํ„ฐ๋กœ vision-language ๋ชจ๋ธ์„ ํŠœ๋‹ํ•ฉ๋‹ˆ๋‹ค.

Existing Large Multimodal Models (LMMs)

Even before LLaVA, Large Multimodal Models that process images and text together already existed. Flamingo is the one most often cited as the starting point of this line of work.

https://arxiv.org/abs/2204.14198

Flamingo is symbolic enough to be called the "multimodal GPT-3 moment": trained on large-scale image-text data, it demonstrated zero-shot task transfer and in-context learning.

(Note) NLP models before GPT-3 (2020) required fine-tuning on task-specific data in order to perform a new task.

  • ๊ทธ๋Ÿฐ๋ฐ GPT-3๋Š” ๋ณ„๋„ fine-tuning ์—†์ด ํ”„๋กฌํ”„ํŠธ์— ๋ช‡ ๊ฐœ์˜ ์˜ˆ์‹œ๋งŒ ๋ณด์—ฌ์ฃผ๋ฉด(few-shot) ์ƒˆ๋กœ์šด ํƒœ์Šคํฌ๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๊ฒŒ ๋ฐ”๋กœ in-context learning์ด๊ณ , โ€œscaling์ด ๊ณง ๋Šฅ๋ ฅ์ด๋‹คโ€๋ผ๋Š” ํŒจ๋Ÿฌ๋‹ค์ž„์„ ์—ด์—ˆ์ฃ .

Just as GPT-3 opened up the possibility of few-shot learning in NLP, Flamingo drove a similar paradigm shift in vision-language.

Flamingo was the first to make GPT-3-style few-shot in-context learning possible in the multimodal domain:

  • ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์˜ˆ์‹œ ๋ช‡ ๊ฐœ๋งŒ ํ”„๋กฌํ”„ํŠธ์— ๋„ฃ์–ด์ฃผ๋ฉด
  • Fine-tuning ์—†์ด VQA, ์บก์…”๋‹, ๋ถ„๋ฅ˜ ๋“ฑ ๋‹ค์–‘ํ•œ ํƒœ์Šคํฌ ์ˆ˜ํ–‰
  • ์‹ฌ์ง€์–ด ์ผ๋ถ€ ๋ฒค์น˜๋งˆํฌ์—์„œ fine-tuned ๋ชจ๋ธ๋“ค์„ ๋Šฅ๊ฐ€

Flamingo ์ดํ›„๋กœ ๋‹ค์–‘ํ•œ image-text ์Œ ๊ธฐ๋ฐ˜ ํ•™์Šต ๋ชจ๋ธ๋“ค์ด ๋“ฑ์žฅํ–ˆ์Šต๋‹ˆ๋‹ค. BLIP-2๋Š” frozen image encoder์™€ LLM์„ Q-Former๋ผ๋Š” ๊ฒฝ๋Ÿ‰ ๋ชจ๋“ˆ๋กœ ์—ฐ๊ฒฐํ•˜์—ฌ ํšจ์œจ์ ์ธ ํ•™์Šต์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ–ˆ๊ณ , FROMAGe๋Š” ํ…์ŠคํŠธ ์ƒ์„ฑ๊ณผ ์ด๋ฏธ์ง€ ๊ฒ€์ƒ‰(retrieval)์„ ๋ชจ๋‘ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ๋Šฅ๋ ฅ์„ ๊ฐ–์ถ”์—ˆ์Šต๋‹ˆ๋‹ค. KOSMOS-1์€ ๋งˆ์ดํฌ๋กœ์†Œํ”„ํŠธ์—์„œ ๋ฐœํ‘œํ•œ ๋ชจ๋ธ๋กœ ๋‹ค์–‘ํ•œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ํƒœ์Šคํฌ์—์„œ ๊ฐ•๋ ฅํ•œ ์„ฑ๋Šฅ์„ ๋ณด์˜€์œผ๋ฉฐ, PaLM-E๋Š” ๊ตฌ๊ธ€์—์„œ ๋กœ๋ด‡ ์ œ์–ด์™€ ๊ฐ™์€ embodied AI ํƒœ์Šคํฌ๋ฅผ ์œ„ํ•ด ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์˜คํ”ˆ์†Œ์Šค ์ง„์˜์—์„œ๋„ OpenFlamingo์™€ LLaMA-Adapter ๋“ฑ์ด ๊ณต๊ฐœ๋˜๋ฉด์„œ ์—ฐ๊ตฌ ์ปค๋ฎค๋‹ˆํ‹ฐ์˜ ์ ‘๊ทผ์„ฑ์ด ๋†’์•„์กŒ์Šต๋‹ˆ๋‹ค.

  • BLIP-2, FROMAGe, and the like were trained on specific objectives such as contrastive learning or captioning and only operated within that scope. Flamingo was more flexible thanks to few-shot in-context learning, but even that follows example patterns rather than directly understanding natural-language instructions. In other words, no model had been explicitly trained to naturally follow complex instructions like "analyze the emotions of the people in this image."

NLP ๋ถ„์•ผ์—์„œ GPT-3๊ฐ€ InstructGPT๋กœ ๋ฐœ์ „ํ•˜๋ฉฐ ์‚ฌ์šฉ์ž ์ง€์‹œ๋ฅผ ํ›จ์”ฌ ์ž˜ ๋”ฐ๋ฅด๊ฒŒ ๋œ ๊ฒƒ์ฒ˜๋Ÿผ, ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๋ชจ๋ธ์—๋„ ์ด๋Ÿฐ instruction tuning์ด ํ•„์š”ํ–ˆ์ง€๋งŒ ์•„์ง ์ฒด๊ณ„์ ์œผ๋กœ ์—ฐ๊ตฌ๋˜์ง€ ์•Š์€ ์ƒํƒœ์˜€์Šต๋‹ˆ๋‹ค.

  • As a result, these models performed adequately on language-only tasks, yet did relatively poorly on multimodal tasks that require answering complex questions about an image or visual reasoning.

LLaVA emerged precisely to close this gap. Its core goal is to systematically study the visual instruction tuning that existing models had been missing, and to use it to raise performance on multimodal tasks.


  3. GPT-assisted Visual Instruction Data Generation

๋ฌธ์ œ์ : ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ Instruction ๋ฐ์ดํ„ฐ ๋ถ€์กฑ

  • Image-text pairs (CC, LAION) are plentiful
  • But multimodal instruction-following data is very limited
  • Why: manual collection is time-consuming and the task is ill-defined

Solution: GPT-4-Assisted Data Generation

https://arxiv.org/pdf/2304.08485

GPT-4๋Š” ์ด๋ฏธ์ง€๋ฅผ ์ง์ ‘ ๋ณผ ์ˆ˜ ์—†์œผ๋‹ˆ๊นŒ, ํ…์ŠคํŠธ๋กœ ์ด๋ฏธ์ง€ ์ •๋ณด๋ฅผ ์„ค๋ช…ํ•ด์ค๋‹ˆ๋‹ค. (์ด๋ฏธ์ง€ ์ƒ๋‹จ: GPT-4์—๊ฒŒ ์ฃผ๋Š” ์ž…๋ ฅ (Context))

Symbolic Representation

The image is encoded into a form that the language-only GPT-4 can understand:

1. Captions (image descriptions)

  • Describe the visual scene from multiple perspectives
  • Example:

    A group of people standing outside of a black vehicle with various luggage.
    Luggage surrounds a vehicle in an underground parking area.
    People try to fit all of their luggage in an SUV.


2. Bounding Boxes (object locations)

  • ๊ฐ์ฒด ๊ฐœ๋…๊ณผ ๊ณต๊ฐ„ ์ •๋ณด ์ธ์ฝ”๋”ฉ
  • ์˜ˆ์‹œ:

    person: [0.681, 0.242, 0.774, 0.694]
    backpack: [0.384, 0.696, 0.485, 0.914]
    suitcase: [0.758, 0.413, 0.845, 0.69]


Based on this context, GPT-4 generates three types of Q&A data, as sketched below. (Bottom of the figure: the output (response) GPT-4 generates.)
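A minimal sketch of how such a generation request could be assembled, assuming the modern `openai` Python client; the prompt wording and model name are illustrative, not the paper's exact prompts:

# Illustrative sketch: turn captions + bounding boxes into a text-only
# context and ask a language-only model for instruction-following Q&A.
# Prompt wording and model name are assumptions, not the paper's originals.
from openai import OpenAI

captions = [
    "A group of people standing outside of a black vehicle with various luggage.",
    "Luggage surrounds a vehicle in an underground parking area.",
]
boxes = {
    "person": [0.681, 0.242, 0.774, 0.694],
    "backpack": [0.384, 0.696, 0.485, 0.914],
}

context = "Captions:\n" + "\n".join(captions) + "\nObjects:\n" + "\n".join(
    f"{name}: {coords}" for name, coords in boxes.items()
)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": (
            "You are given captions and bounding boxes describing an image. "
            "Generate a Q&A conversation about the image, answering as if "
            "you could see it.")},
        {"role": "user", "content": context},
    ],
)
print(response.choices[0].message.content)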

Three Data Types

Response type 1: Conversation

  • A conversation between a person and the assistant
  • Questions about object types, counts, actions, locations, relative positions, etc.
  • Only questions with definite answers are included
Q: What type of vehicle is featured in the image?
A: The image features a black sport utility vehicle (SUV)...

Response type 2: Detailed Description

  • ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ ํ’๋ถ€ํ•˜๊ณ  ํฌ๊ด„์ ์ธ ์„ค๋ช…
  • GPT-4๊ฐ€ ์งˆ๋ฌธ ๋ชฉ๋ก ์ƒ์„ฑ ํ›„ ์„ ๋ณ„
Q: (implicitly, "Describe the image")
A: The image is an underground parking area with a black SUV parked. 
   There are three people in the scene, with one person standing closer 
   to the left side of the vehicle...

Response type 3: Complex Reasoning

  • ์‹œ๊ฐ ์ฝ˜ํ…์ธ  ๊ธฐ๋ฐ˜ ์‹ฌ์ธต ์ถ”๋ก  ์งˆ๋ฌธ
  • ๋‹จ๊ณ„๋ณ„ ๋…ผ๋ฆฌ์  ์ถ”๋ก  ๊ณผ์ • ํ•„์š”
Q: What challenges do these people face?
A: They are facing the challenge of fitting all their luggage into the 
   black SUV. There are multiple suitcases and backpacks to be packed, 
   which suggests that the group has a significant amount of belongings...

  4. Visual Instruction Tuning

4.1 Architecture: A 3-Component Design

Overall Structure

https://arxiv.org/pdf/2304.08485

Components

1. Vision Encoder: CLIP ViT-L/14

  • Pre-trained and kept frozen
  • Input: 336×336 px images
  • Output: grid features ($Z_v$)
    • 576 patch tokens (24×24 grid)
    • 1 class token
    • Dimension: 1024

2. Projection Layer

$H_v = W \cdot Z_v$, with $Z_v = g(X_v)$

  • Trainable projection matrix $W$
  • Maps vision features into the language embedding space
  • Lightweight design enables fast experiment iteration (see the sketch below)
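A minimal sketch of this projection in PyTorch, using the dimensions quoted above (1024-d CLIP features, 4096-d LLM embeddings):

import torch
import torch.nn as nn

# Trainable projection W mapping frozen CLIP grid features Z_v into the
# LLM word embedding space: H_v = W · Z_v.
Z_v = torch.randn(1, 576, 1024)   # 576 patch tokens from CLIP ViT-L/14
W = nn.Linear(1024, 4096)         # the trainable projection matrix W
H_v = W(Z_v)                      # visual tokens for the language model
print(H_v.shape)                  # torch.Size([1, 576, 4096])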

3. Language Model: Vicuna

  • ๊ณต๊ฐœ๋œ ์ฒดํฌํฌ์ธํŠธ ์ค‘ ์ตœ๊ณ ์˜ instruction-following ๋Šฅ๋ ฅ
  • ํŒŒ๋ผ๋ฏธํ„ฐ ฯ•๋กœ ํ‘œํ˜„

Design Philosophy

  • Simple but effective:
    • Uses a linear projection
  • More sophisticated schemes are possible:
    • Flamingo's gated cross-attention
    • BLIP-2's Q-Former
  • Left as future work

4.2 Training: 2-Stage Procedure

๋ฐ์ดํ„ฐ ํ˜•์‹

๊ฐ ์ด๋ฏธ์ง€ Xv\mathbf{X}_vXvโ€‹์— ๋Œ€ํ•ด multi-turn conversation ๋ฐ์ดํ„ฐ (Xq1,Xa1,โ‹ฏโ€‰,XqT,XaT)(\mathbf{X}_q^1, \mathbf{X}_a^1, \cdots, \mathbf{X}_q^T, \mathbf{X}_a^T)(Xq1โ€‹,Xa1โ€‹,โ‹ฏ,XqTโ€‹,XaTโ€‹)๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ TTT๋Š” ์ด ํ„ด ์ˆ˜์ด๊ณ , Xq\mathbf{X}_qXqโ€‹๋Š” ์งˆ๋ฌธ, Xa\mathbf{X}_aXaโ€‹๋Š” ์‘๋‹ต์ž…๋‹ˆ๋‹ค.

The instruction $\mathbf{X}_{\text{instruct}}^t$ for the $t$-th turn is constructed as follows:

  • First turn (t=1): randomly choose $[\mathbf{X}_q^1, \mathbf{X}_v]$ or $[\mathbf{X}_v, \mathbf{X}_q^1]$ (shuffling the question-image order so the model gets used to both)
  • Later turns (t>1): only $\mathbf{X}_q^t$ (the image has already been provided)

Input Sequence Format
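The unified multi-turn sequence looks as follows (reconstructed from Table 2 of the paper, which sets <STOP> to ###):

X_system-message <STOP>
Human: X_instruct^1 <STOP> Assistant: X_a^1 <STOP>
Human: X_instruct^2 <STOP> Assistant: X_a^2 <STOP>
...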

Training Objective

LLM์˜ ๊ธฐ์กด auto-regressive training objective๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ๊ธธ์ด LLL์ธ ์‹œํ€€์Šค์—์„œ target answer Xa\mathbf{X}_aXaโ€‹์˜ ํ™•๋ฅ ์„ ์ตœ๋Œ€ํ™”ํ•ฉ๋‹ˆ๋‹ค:

$$p(\mathbf{X}_a \mid \mathbf{X}_v, \mathbf{X}_{\text{instruct}}) = \prod_{i=1}^{L} p_\theta(x_i \mid \mathbf{X}_v, \mathbf{X}_{\text{instruct},<i}, \mathbf{X}_{a,<i})$$

์ค‘์š”ํ•œ ์ ์€ Assistant์˜ ์‘๋‹ต ํ† ํฐ(xix_ixiโ€‹)๋งŒ loss ๊ณ„์‚ฐ์— ์‚ฌ์šฉ๋œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. Human์˜ ์งˆ๋ฌธ์ด๋‚˜ ์ด๋ฏธ์ง€ ํ† ํฐ์—๋Š” loss๋ฅผ ๊ฑธ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

Stage 1: Pre-training for Feature Alignment

๋ชฉ์ :

  • Visual tokenizer ํ•™์Šต
    (์ด๋ฏธ์ง€ features๋ฅผ LLM word embedding ๊ณต๊ฐ„์— ์ •๋ ฌ)

๋ฐ์ดํ„ฐ:

  • CC3M ํ•„ํ„ฐ๋ง โ†’ 595K image-text ์Œ
  • Concept coverage์™€ ํ•™์Šต ํšจ์œจ์„ฑ ๊ท ํ˜•

Training setup:

X_system-message <STOP>
Human: X_instruct <STOP> Assistant: X_a <STOP>

- X_instruct = [X_q, X_v] or [X_v, X_q] (chosen at random)
- X_q: a question asking for a description of the image (randomly sampled)
- X_v: image features
- X_a: the original caption (ground truth)

ํŒŒ๋ผ๋ฏธํ„ฐ:

  • Frozen: vision encoder, LLM
  • Trainable: only $W$ (the projection matrix)
  • Learning rate: 2e-3
  • Batch size: 128
  • Epochs: 1
  • Time: 4 hours (8×A100)

๊ฒฐ๊ณผ: ์ด๋ฏธ์ง€ features HvH_vHvโ€‹๊ฐ€ LLM word embedding๊ณผ ์ •๋ ฌ๋จ

Stage 2: Fine-tuning End-to-End

๋ชฉ์ :

  • Instruction-following ๋Šฅ๋ ฅ ํš๋“

ํŒŒ๋ผ๋ฏธํ„ฐ:

  • Frozen: Vision encoder
  • Trainable: W + ฯ• (projection + LLM)

Multi-turn conversation format:

X_system-message <STOP>
Human: X¹_instruct <STOP> Assistant: X¹_a <STOP>
Human: X²_instruct <STOP> Assistant: X²_a <STOP>
...

- X_instruct = [X_q, X_v] or [X_v, X_q] (chosen at random, first turn only)
- X_q: question
- X_v: image features
- X_a: assistant response (ground truth)

Composition of $\mathbf{X}_{\text{instruct}}^t$:

  • t=1 (first turn): $[\mathbf{X}_q^1, \mathbf{X}_v]$ or $[\mathbf{X}_v, \mathbf{X}_q^1]$ (random)
  • t>1: $\mathbf{X}_q^t$

Loss Computation:

$$p(\mathbf{X}_a \mid \mathbf{X}_v, \mathbf{X}_{\text{instruct}}) = \prod_{i=1}^{L} p_\theta(x_i \mid \mathbf{X}_v, \mathbf{X}_{\text{instruct},<i}, \mathbf{X}_{a,<i})$$
  • Only the predicted tokens ($x_i$) enter the loss computation
  • Auto-regressive training objective

๊ฒฐ๊ณผ: LLM์ด visual context๋ฅผ ์ดํ•ดํ•˜๋ฉด์„œ ๋‹ค์–‘ํ•œ instruction์„ ๋”ฐ๋ฅด๋Š” Multimodal Chatbot ๋Šฅ๋ ฅ ํš๋“


Use Case 1: Multimodal Chatbot

๋ฐ์ดํ„ฐ: 158K language-image instruction-following samples

  • Conversation (multi-turn)
  • Detailed description (single-turn)
  • Complex reasoning (single-turn)
  • Uniform sampling during training

Training setup:

  • Learning rate: 2e-5
  • Batch size: 32
  • Epochs: 3
  • Time: 10 hours (8×A100)

Use Case 2: Science QA

๋ฐ์ดํ„ฐ์…‹: ScienceQA benchmark

  • 21K multimodal multiple-choice questions
  • 3 subjects, 26 topics, 127 categories, 379 skills
  • Provides natural-language explanations and image context

ํ˜•์‹: Single-turn conversation

  • X_instruct: question & context
  • X_a: reasoning process + answer (an illustrative sample follows below)
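A hypothetical single-turn sample in this format (the question and wording are made up for illustration), with the reasoning placed before the final answer:

X_instruct: Question: Which of these objects conducts electricity?
            Context: The image shows a copper wire and a rubber band.
            Options: (A) copper wire (B) rubber band
X_a:        Copper is a metal, and metals conduct electricity, while
            rubber is an insulator. The answer is A.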

Training setup:

  • Visual features: taken from before the last layer
  • Output order: reasoning → answer
  • Epochs: 12
  • Time: 4 hours (8×A100)

  5. Experiments: Validating Performance

5.1 Evaluating the Multimodal Chatbot

Qualitative Analysis: Testing on Examples from the GPT-4 Report

Example: Extreme Ironing

LLaVA ์‘๋‹ต:

The unusual aspect of this image is a man ironing clothes on the back
of a minivan or van. This is not a typical place to perform this activity,
as one would usually iron clothes in a more stationary and safe location,
such as a home, using a regular ironing board. The scene depicted in the
image is peculiar as it involves a makeshift ironing setup on a vehicle,
which can be both unsafe and unconventional...

GPT-4 ์‘๋‹ต:

The unusual thing about this image is that a man is ironing clothes on
an ironing board attached to the roof of a moving taxi.

BLIP-2 ์‘๋‹ต:

a man is sitting on the back of a yellow cab

OpenFlamingo ์‘๋‹ต:

The man is drying his clothes on the hood of his car.

๋ถ„์„:

  • ✅ LLaVA: follows the instruction precisely and provides a detailed explanation
  • ✅ GPT-4: concise but accurate
  • ❌ BLIP-2, OpenFlamingo: plain image descriptions that ignore the instruction

Quantitative Evaluation: GPT-4 as Judge

Evaluation method:

  1. Build triplets: (image, ground-truth description, question)
  2. Generate model responses
  3. Use text-only GPT-4 as the judge
  • Generate a reference answer from the ground-truth description
  • Compare the two responses (helpfulness, relevance, accuracy, detail)
  • Score on a 1-10 scale (a sketch of this protocol follows below)
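A sketch of what this judging step can look like in code; the judge prompt here is an assumption paraphrasing the paper's description, not the exact prompt:

# Illustrative GPT-4-as-judge call; the relative scores reported below are
# (candidate total) / (GPT-4 reference total) × 100.
from openai import OpenAI

client = OpenAI()

def judge(question, context, reference_answer, candidate_answer):
    prompt = (
        "Evaluate two answers to a question about an image.\n"
        f"Image description: {context}\n"
        f"Question: {question}\n"
        f"Assistant 1 (reference): {reference_answer}\n"
        f"Assistant 2 (candidate): {candidate_answer}\n"
        "Judge helpfulness, relevance, accuracy and level of detail. "
        "First output two scores from 1 to 10, then a short explanation."
    )
    out = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return out.choices[0].message.content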

LLaVA-Bench (COCO): 90 questions

| Training Data | Conversation | Detail | Complex | All |
|---|---|---|---|---|
| Full data | 83.1 | 75.3 | 96.5 | 85.1 |
| Detail + Complex | 81.5 (-1.6) | 73.3 (-2.0) | 90.8 (-5.7) | 81.9 (-3.2) |
| Conv + 5% Detail + 10% Complex | 81.0 (-2.1) | 68.4 (-7.1) | 91.5 (-5.0) | 80.5 (-4.4) |
| Conversation | 76.5 (-6.6) | 59.8 (-16.2) | 84.9 (-12.4) | 73.8 (-11.3) |
| No Instruction Tuning | 22.0 (-61.1) | 24.0 (-51.3) | 18.5 (-78.0) | 21.5 (-63.6) |

Key findings:

  1. Instruction tuning improves scores by more than 50 points
  2. Adding detailed description + complex reasoning data yields another 7 points
  3. Reasoning ability also complements conversation ability
  4. Using all three data types gives the best performance

LLaVA-Bench (In-the-Wild): 60 questions

| Model | Conversation | Detail | Complex | All |
|---|---|---|---|---|
| OpenFlamingo | 19.3 ± 0.5 | 19.0 ± 0.5 | 19.1 ± 0.7 | 19.1 ± 0.4 |
| BLIP-2 | 54.6 ± 1.4 | 29.1 ± 1.2 | 32.9 ± 0.7 | 38.1 ± 1.0 |
| LLaVA | 57.3 ± 1.9 | 52.5 ± 6.3 | 81.7 ± 1.8 | 67.3 ± 2.0 |

Results:

  • +29% over BLIP-2
  • +48% over OpenFlamingo
  • Reaches 81.7% of text-only GPT-4 on complex reasoning

Limitations Analysis

Challenging examples:

  1. Ramen example: recognizing the restaurant's name
    • Requires multilingual understanding and broad knowledge
    • Describing the side dishes requires multimodal information retrieval
  2. Fridge example: recognizing the yogurt brand
    • Requires processing high-resolution images
    • Demands broad knowledge coverage

An interesting failure case:

  • "Is there strawberry-flavored yogurt?" → "Yes"
  • In reality, the fridge contains yogurt and strawberries separately
  • LLaVA perceives the image as a "bag of patches"
  • It fails to capture the more complex semantic relations

5.2 The ScienceQA Benchmark

Results (test-set accuracy, %)

| Model | NAT | SOC | LAN | TXT | IMG | NO | G1-6 | G7-12 | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Human | 90.23 | 84.97 | 87.48 | 89.60 | 87.50 | 88.10 | 91.59 | 82.42 | 88.40 |
| GPT-3.5 (CoT) | 75.44 | 70.87 | 78.09 | 74.68 | 67.43 | 79.93 | 78.23 | 69.68 | 75.17 |
| LLaMA-Adapter | 84.37 | 88.30 | 84.36 | 83.72 | 80.32 | 86.90 | 85.83 | 84.05 | 85.19 |
| MM-CoT_Large | 95.91 | 82.00 | 90.82 | 95.26 | 88.80 | 92.89 | 92.44 | 90.31 | 91.68 |
| LLaVA | 90.36 | 95.95 | 88.00 | 89.49 | 88.00 | 90.66 | 90.93 | 90.90 | 90.92 |
| LLaVA+GPT-4 (judge) | 91.56 | 96.74 | 91.09 | 90.62 | 88.99 | 93.52 | 92.73 | 92.16 | 92.53 |

Key results:

  1. LLaVA ๋‹จ๋…์œผ๋กœ 90.92% (SoTA MM-CoT_Large์™€ ๊ทผ์ ‘)
  2. GPT-4 judge ์•™์ƒ๋ธ”๋กœ 92.53% (์ƒˆ๋กœ์šด SOTA)
  3. Text-only GPT-4(82.69%)๊ฐ€ multimodal ์„ฑ๋Šฅ ํ–ฅ์ƒ์— ๊ธฐ์—ฌ
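A sketch of the "GPT-4 as the judge" ensemble the paper describes: when the two models disagree, text-only GPT-4 is queried again with the question and both candidate answers (ask_gpt4 below is a stand-in for such a call):

def ensemble_answer(question, llava_answer, gpt4_answer, ask_gpt4):
    # If the two models agree, keep the shared answer.
    if llava_answer == gpt4_answer:
        return llava_answer
    # Otherwise ask GPT-4 to arbitrate, given the question and both answers.
    prompt = (
        f"Question: {question}\n"
        f"Candidate answer 1: {gpt4_answer}\n"
        f"Candidate answer 2: {llava_answer}\n"
        "Considering both candidates, give your own final answer."
    )
    return ask_gpt4(prompt)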

Design Ablations

| Variant | Before last | Last |
|---|---|---|
| Best | 90.92 | 89.96 (-0.96) |
| Predict answer first | - | 89.77 (-1.15) |
| Train from scratch | 85.81 (-5.11) | - |
| 7B model | 89.84 (-1.08) | - |

Findings:

  1. Visual features: the layer before the last is 0.96% better
    • Last layer: focuses on global/abstract properties
    • Before last: preserves local detail
  2. Chain-of-Thought: the reasoning-first strategy

    • Speeds up convergence (6 epochs vs 12 epochs)
    • Has little effect on final performance
  3. Pre-training: contributes a 5.11% improvement

    • Multimodal feature alignment
    • Preserves pre-trained knowledge
  4. Model size: 13B > 7B (a 1.08% gap)


  6. Code Implementation Analysis

The following summarizes an analysis of the code at https://github.com/haotian-liu/LLaVA.

6.1 ๋ฆฌํฌ์ง€ํ† ๋ฆฌ ๊ตฌ์กฐ

LLaVA/
├── llava/                    # core package
│   ├── model/                # model architecture
│   │   ├── llava_arch.py     # core vision-language model
│   │   ├── builder.py        # model loading and instantiation
│   │   ├── multimodal_encoder/
│   │   │   └── clip_encoder.py   # CLIP vision encoder
│   │   └── multimodal_projector/
│   │       └── builder.py    # vision-language bridge
│   ├── train/                # training scripts
│   │   ├── train.py          # main training pipeline
│   │   └── llava_trainer.py  # custom trainer
│   ├── serve/                # inference serving
│   │   ├── cli.py
│   │   ├── gradio_web_server.py  # web UI
│   │   └── model_worker.py
│   ├── mm_utils.py           # multimodal utilities
│   └── conversation.py       # conversation management
├── scripts/                  # training/evaluation scripts
│   ├── pretrain.sh
│   ├── finetune.sh
│   └── finetune_lora.sh
└── predict.py                # inference interface

6.2 Core Architecture Implementation

1. Vision Encoder (CLIP)

# Load the CLIP vision model
self.vision_tower = CLIPVisionModel.from_pretrained(vision_tower_name)
self.vision_tower.requires_grad_(False)  # ⭐ always frozen

def forward(self, images):
    image_features = self.vision_tower(images, output_hidden_states=True)
    return image_features  # [batch, num_patches, 1024]

The vision encoder is CLIP ViT-L/14 and stays frozen in both Stage 1 and Stage 2; a sketch of the feature selection it feeds the projector follows below.
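A self-contained sketch of the "before last layer" feature selection discussed in the ScienceQA ablation, mirroring mm_vision_select_layer=-2 in the repo; dropping the CLS token leaves the 576 patch tokens:

import torch
from transformers import CLIPVisionModel

# Penultimate-layer patch features from the frozen CLIP tower.
vision_tower = CLIPVisionModel.from_pretrained(
    "openai/clip-vit-large-patch14-336")
vision_tower.requires_grad_(False)

pixel_values = torch.randn(1, 3, 336, 336)   # stand-in for a preprocessed image
with torch.no_grad():
    out = vision_tower(pixel_values, output_hidden_states=True)

features = out.hidden_states[-2][:, 1:]  # layer before last, CLS token dropped
print(features.shape)                    # torch.Size([1, 576, 1024])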

2. Multimodal Projector

def build_vision_projector(config):
    projector_type = config.mm_projector_type
    
    if projector_type == 'linear':
        # simple linear map: 1024 -> 4096
        return nn.Linear(config.mm_hidden_size, config.hidden_size)
    
    if projector_type == 'mlp2x_gelu':
        # 2-layer MLP: Linear -> GELU -> Linear
        return nn.Sequential(
            nn.Linear(config.mm_hidden_size, config.hidden_size),
            nn.GELU(),
            nn.Linear(config.hidden_size, config.hidden_size)
        )

The projector converts the vision encoder output (1024-d) into the LLM embedding space (4096-d); a self-contained usage sketch follows below.
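A minimal usage sketch of the mlp2x_gelu variant, assuming the dimensions quoted above (1024 → 4096):

import torch
import torch.nn as nn

projector = nn.Sequential(          # the 'mlp2x_gelu' shape from the snippet above
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)
Z_v = torch.randn(1, 576, 1024)     # frozen CLIP grid features
H_v = projector(Z_v)                # visual tokens in the LLM embedding space
print(H_v.shape)                    # torch.Size([1, 576, 4096])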

3. ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ๊ฒฐํ•ฉ (ํ•ต์‹ฌ ๋กœ์ง)

def prepare_inputs_labels_for_multimodal(self, input_ids, images, labels, ...):
    # 1) Encode the images
    image_features = self.encode_images(images)  # [batch, 256, 4096]
    
    # 2) Locate the IMAGE_TOKEN_INDEX positions
    image_token_indices = torch.where(input_ids == IMAGE_TOKEN_INDEX)[0]
    
    # 3) Merge text embeddings with image features
    for i, idx in enumerate(image_token_indices):
        # text before the image token
        cur_new_input_embeds.append(embed_tokens(input_ids[:idx]))
        # insert the image features
        cur_new_input_embeds.append(image_features[i])
        # text after the image token
        cur_new_input_embeds.append(embed_tokens(input_ids[idx+1:]))
    
    # 4) Mark image positions with IGNORE_INDEX to exclude them from the loss
    image_labels = torch.full((num_patches,), IGNORE_INDEX)
    
    return torch.cat(cur_new_input_embeds, dim=0)

ํ•ต์‹ฌ ํ๋ฆ„:

Input: "Human: <image> Describe this photo"
        ↓
1. Find the position of the <image> token
2. Insert image_features (256 tokens) at that position
3. Exclude the image tokens from the loss (IGNORE_INDEX)
        ↓
Output: [text embeddings] + [256 image tokens] + [text embeddings]

6.3 Training Pipeline

Stage 1 vs Stage 2: Key Settings

# Stage 1: train only the projection
if model_args.tune_mm_mlp_adapter:
    model.requires_grad_(False)                      # freeze everything
    for p in model.get_model().mm_projector.parameters():
        p.requires_grad = True                       # only the projector is trainable

# Stage 2: train projection + LLM
if training_args.freeze_mm_mlp_adapter:
    for p in model.get_model().mm_projector.parameters():
        p.requires_grad = False                      # projector frozen (optional)

ํ•™์Šต ํ๋ฆ„ ์š”์•ฝ

def train():
    # 1) Load the model
    model = LlavaLlamaForCausalLM.from_pretrained(model_name_or_path)
    
    # 2) Initialize the vision tower (CLIP)
    model.get_model().initialize_vision_modules(model_args)
    vision_tower = model.get_vision_tower()
    vision_tower.to(dtype=torch.bfloat16, device=device)
    
    # 3) Load the dataset
    data_module = make_supervised_data_module(tokenizer, data_args)
    
    # 4) Run training
    trainer = LLaVATrainer(model=model, tokenizer=tokenizer, **data_module)
    trainer.train()
    
    # 5) Save the model
    trainer.save_state()

6.4 Data Processing

๋ฐ์ดํ„ฐ ํ˜•์‹ (JSON)

{
  "image": "image_001.jpg",
  "conversations": [
    {"from": "human", "value": "<image>\nDescribe this photo"},
    {"from": "gpt", "value": "There is a cat in this photo..."}
  ]
}

Core Preprocessing Logic

class LazySupervisedDataset(Dataset):
    def __getitem__(self, i):
        # 1) Load and preprocess the image
        image = Image.open(image_path).convert('RGB')
        image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
        
        # 2) Normalize the <image> token position
        #    "Describe this photo <image>" -> "<image>\nDescribe this photo"
        sentence['value'] = DEFAULT_IMAGE_TOKEN + '\n' + sentence['value']
        
        # 3) Convert the conversation into input_ids and labels
        data_dict = preprocess(sources, tokenizer, has_image=True)
        
        return {
            'input_ids': data_dict['input_ids'],
            'labels': data_dict['labels'],      # loss only on assistant responses
            'image': image
        }

Labels (What Enters the Loss)

Input:  [Human: <image> Describe it] [Assistant: It is a cat]
labels: [      IGNORE_INDEX        ] [   actual token IDs   ]
                 ↑                             ↑
          excluded from loss             used for loss
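A minimal sketch of how such labels can be built for a conversation (a simplification of the preprocess functions in llava/train/train.py; the token IDs and spans are made up):

import copy

IGNORE_INDEX = -100

def mask_labels(input_ids, spans):
    """input_ids: token IDs of the whole conversation.
    spans: (start, end, is_assistant) ranges, one per turn."""
    labels = copy.deepcopy(input_ids)
    for start, end, is_assistant in spans:
        if not is_assistant:
            labels[start:end] = [IGNORE_INDEX] * (end - start)
    return labels

# Toy conversation: tokens 0-4 are the human turn, 5-8 the assistant answer.
input_ids = [1, 319, 13563, 1724, 29973, 450, 4996, 260, 2]
labels = mask_labels(input_ids, [(0, 5, False), (5, 9, True)])
print(labels)  # [-100, -100, -100, -100, -100, 450, 4996, 260, 2]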

6.5 Inference Pipeline

์ถ”๋ก  ํ๋ฆ„

class Predictor:
    def setup(self):
        # Load the model
        self.tokenizer, self.model, self.image_processor, _ = load_pretrained_model(
            model_path="liuhaotian/llava-v1.5-13b"
        )
    
    def predict(self, image, prompt, temperature=0.2, max_tokens=1024):
        # 1) Preprocess the image
        image_tensor = process_images([image], self.image_processor)
        image_tensor = image_tensor.to(self.model.device, dtype=torch.float16)
        
        # 2) Build the conversation
        conv = conv_templates["vicuna_v1"].copy()
        inp = DEFAULT_IMAGE_TOKEN + "\n" + prompt   # "<image>\n<user question>"
        conv.append_message(conv.roles[0], inp)      # Human
        conv.append_message(conv.roles[1], None)     # Assistant (to be generated)
        
        # 3) Tokenization
        input_ids = tokenizer_image_token(
            conv.get_prompt(),
            self.tokenizer,
            IMAGE_TOKEN_INDEX,
            return_tensors='pt'
        ).to(self.model.device)
        
        # 4) Generate
        with torch.inference_mode():
            output = self.model.generate(
                inputs=input_ids,
                images=image_tensor,
                temperature=temperature,
                max_new_tokens=max_tokens
            )
        
        return self.tokenizer.decode(output[0])

Input Form at Inference

[System Message] <STOP>
Human: <image>
Describe this photo <STOP>
Assistant:
    ↓
[system tokens] + [256 image tokens] + [question tokens]
    ↓
The LLM starts generating the next tokens

  7. Key Features and Innovations

7.1 Data-Centric Approach

  • Leverages GPT-4: a language-only model generates vision-language data
  • Symbolic representation: captions + bounding boxes
  • Ensures diversity: conversation, detailed description, complex reasoning
  • Scalable: applicable to many more image-text pairs

7.2 Simple yet Effective Architecture

  • Frozen components: vision encoder and LLM stay fixed
  • Lightweight connector: a linear projection (a 2-layer MLP in LLaVA-1.5)
  • Fast training: 4-10 hours (8×A100)
  • Efficiency: alignment is achieved in the pre-training stage

7.3 Two-Stage Training Strategy

Stage 1 (Feature Alignment):

  • 595K caption pairs
  • Trains only the projection matrix
  • Serves as a visual tokenizer

Stage 2 (Instruction Tuning):

  • 158K instruction data
  • Trains the LLM + projection
  • Vision encoder frozen

7.4 Emergent Capabilities

Out-of-distribution Generalization:

  • Recognizes Elon Musk (absent from the training data)
  • Understands memes
  • OCR ability (barely present in the training data)

Multi-turn Conversation:

  • Maintains context
  • Handles follow-up questions
  • Gives detailed explanations

  8. Limitations and Future Directions

Current Limitations

  1. Hallucination: generates outputs that contradict the facts
  2. High-resolution images: struggles to recognize fine text/brands
  3. Complex semantics: the "bag of patches" problem
    • e.g., strawberries + yogurt → strawberry-flavored yogurt (wrong)
  4. Multilingual: limited multilingual support

Future Research Directions

  1. ๋” ์ •๊ตํ•œ Connector:

    • Gated cross-attention
    • Q-former
    • Multi-scale features
  2. ๊ณ ํ•ด์ƒ๋„ ์ฒ˜๋ฆฌ:

    • Patch-based processing
    • Adaptive resolution
  3. Larger datasets:

    • Diverse domains
    • More languages
    • Complex reasoning
  4. Model Scaling:

    • 65B+ LLM variants
    • Larger vision encoders

  9. Broader Impact

Risks

Malicious Input:

  • The OpenAI Filter API blocks harmful text
  • An NSFW filter blocks inappropriate images

Hallucination:

  • Caution is needed in critical applications such as medicine

Biases:

  • Biases can transfer from CLIP and LLaMA/Vicuna

Energy Consumption:

  • Currently trained on a small dataset (595K)
  • Energy consumption must be considered when scaling

Benefits

Research Community:

  • ๋ชจ๋“  ์ž์‚ฐ ์˜คํ”ˆ์†Œ์Šค๋กœ ๊ณต๊ฐœ
  • Reproducibility ํ™•๋ณด
  • Community ๊ธฐ์—ฌ ๊ฐ€๋Šฅ

Accessibility:

  • Unifies diverse vision-language tasks
  • User-friendly interface

  10. Conclusion

LLaVA introduced a new paradigm called visual instruction tuning:

Key contributions:

  1. Generated 158K multimodal instruction samples with GPT-4
  2. A simple yet effective 3-component architecture
  3. 92.53% on ScienceQA (SOTA)
  4. LLaVA-Bench: the first multimodal instruction-following benchmark

์˜์˜:

  • Charts a path from language-only LLMs to multimodal LMMs
  • Proves the effectiveness of a data-centric approach
  • Demonstrates the viability of end-to-end training

Open source:

  • ๋ฐ์ดํ„ฐ, ์ฝ”๋“œ, ๋ชจ๋ธ ๋ชจ๋‘ ๊ณต๊ฐœ
  • Community-driven improvement ๊ฐ€๋Šฅ

LLaVA is an important paper that charted a direction for the democratization of multimodal AI.
Thanks for reading :)