[Paper Review] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Posted by Euisuk's Dev Log on August 29, 2025


Original post: https://velog.io/@euisuk-chung/Paper-Review-Qwen-VL-A-Versatile-Vision-Language-Model-for-Understanding-Localization-Text-Reading-and-Beyond

https://arxiv.org/abs/2308.12966


Abstract

๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ํ…์ŠคํŠธ์™€ ์ด๋ฏธ์ง€๋ฅผ ๋ชจ๋‘ ์ธ์‹ํ•˜๊ณ  ์ดํ•ดํ•˜๋„๋ก ์„ค๊ณ„๋œ ๋Œ€๊ทœ๋ชจ vision-language ๋ชจ๋ธ(LVLM)์ธ Qwen-VL ์‹œ๋ฆฌ์ฆˆ๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. Qwen-LM์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์‹œ์ž‘ํ•˜์—ฌ, ์„ธ์‹ฌํ•˜๊ฒŒ ์„ค๊ณ„๋œ (i) visual receptor, (ii) input-output interface, (iii) 3๋‹จ๊ณ„ training pipeline, (iv) ๋‹ค๊ตญ์–ด multimodal ์ •์ œ ์ฝ”ํผ์Šค๋ฅผ ํ†ตํ•ด visual capacity๋ฅผ ๋ถ€์—ฌํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ธฐ์กด์˜ ์ด๋ฏธ์ง€ ์„ค๋ช… ๋ฐ ์งˆ์˜์‘๋‹ต์„ ๋„˜์–ด์„œ, image-caption-box tuple์„ ์ •๋ ฌํ•˜์—ฌ Qwen-VL์˜ grounding ๋ฐ text-reading ๋Šฅ๋ ฅ์„ ๊ตฌํ˜„ํ–ˆ์Šต๋‹ˆ๋‹ค. Qwen-VL๊ณผ Qwen-VL-Chat์„ ํฌํ•จํ•œ ๊ฒฐ๊ณผ ๋ชจ๋ธ๋“ค์€ ๋น„์Šทํ•œ ๋ชจ๋ธ ๊ทœ๋ชจ์˜ generalist ๋ชจ๋ธ๋“ค ์ค‘์—์„œ ๋‹ค์–‘ํ•œ visual-centric benchmark(์˜ˆ: ์ด๋ฏธ์ง€ ์บก์…”๋‹, ์งˆ์˜์‘๋‹ต, visual grounding)์™€ ๋‹ค์–‘ํ•œ ์„ค์ •(์˜ˆ: zero-shot, few-shot)์—์„œ ์ƒˆ๋กœ์šด ๊ธฐ๋ก์„ ๋‹ฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ ์‹ค์ œ ๋Œ€ํ™” benchmark์—์„œ๋„ instruction-tuned๋œ Qwen-VL-Chat์ด ๊ธฐ์กด vision-language chatbot๋“ค์— ๋น„ํ•ด ์šฐ์ˆ˜์„ฑ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๋ชจ๋“  ๋ชจ๋ธ์€ ํ–ฅํ›„ ์—ฐ๊ตฌ๋ฅผ ์ด‰์ง„ํ•˜๊ธฐ ์œ„ํ•ด ๊ณต๊ฐœ๋ฉ๋‹ˆ๋‹ค.

  1. Introduction

Recently, Large Language Models (LLMs) (Brown et al., 2020; OpenAI, 2023; Anil et al., 2023; Gao et al., 2023; Qwen, 2023) have attracted great attention for their powerful text generation and comprehension abilities. These models can be further aligned with user intent through instruction fine-tuning, exhibiting strong interactive capabilities and the potential to boost productivity as intelligent assistants.

However, native large language models live only in the pure-text world and lack the ability to handle other common modalities such as images, speech, and video, which greatly restricts their application scope. Motivated by this, a group of Large Vision Language Models (LVLMs) (Alayrac et al., 2022; Chen et al., 2022; Li et al., 2023c; Dai et al., 2023; Huang et al., 2023; Peng et al., 2023; Zhu et al., 2023; Liu et al., 2023; Ye et al., 2023b,a; Chen et al., 2023a; Li et al., 2023a; Zhang et al., 2023; Sun et al., 2023; OpenAI, 2023) have been developed to enhance large language models with the ability to perceive and understand visual signals. These large-scale vision-language models show promising potential for solving real-world vision-central problems.

๊ทธ๋Ÿผ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  LVLM์˜ ํ•œ๊ณ„์™€ ์ž ์žฌ๋ ฅ์„ ํƒ๊ตฌํ•˜๊ธฐ ์œ„ํ•œ ๋งŽ์€ ์—ฐ๊ตฌ๊ฐ€ ์ˆ˜ํ–‰๋˜์—ˆ์Œ์—๋„, ํ˜„์žฌ ์˜คํ”ˆ์†Œ์Šค LVLM๋“ค์€ ํ•ญ์ƒ ๋ถ€์ ์ ˆํ•œ ํ›ˆ๋ จ๊ณผ ์ตœ์ ํ™”๋กœ ์ธํ•ด ์–ด๋ ค์›€์„ ๊ฒช๊ณ  ์žˆ์œผ๋ฉฐ, ์ด๋Š” ๋…์  ๋ชจ๋ธ๋“ค(Chen et al., 2022, 2023b; OpenAI, 2023)๋ณด๋‹ค ํ›จ์”ฌ ๋’ค์ฒ˜์ ธ ์žˆ์–ด ์˜คํ”ˆ์†Œ์Šค ์ปค๋ฎค๋‹ˆํ‹ฐ์—์„œ์˜ LVLM์— ๋Œ€ํ•œ ์ถ”๊ฐ€์ ์ธ ํƒ๊ตฌ์™€ ์‘์šฉ์„ ์ €ํ•ดํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๋”์šฑ์ด ์‹ค์ œ ์‹œ๊ฐ์  ์‹œ๋‚˜๋ฆฌ์˜ค๋Š” ์ƒ๋‹นํžˆ ๋ณต์žกํ•˜๋ฏ€๋กœ, ์„ธ๋ฐ€ํ•œ ์‹œ๊ฐ์  ์ดํ•ด๊ฐ€ LVLM์ด ์‚ฌ๋žŒ๋“ค์„ ํšจ๊ณผ์ ์ด๊ณ  ์ •ํ™•ํ•˜๊ฒŒ ๋„์šธ ์ˆ˜ ์žˆ๋Š” ํ•ต์‹ฌ์ ์ธ ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ด ๋ฐฉํ–ฅ์œผ๋กœ๋Š” ์†Œ์ˆ˜์˜ ์‹œ๋„๋งŒ์ด ์ด๋ฃจ์–ด์กŒ์œผ๋ฉฐ(Peng et al., 2023; Chen et al., 2023a), ๋Œ€๋ถ€๋ถ„์˜ ์˜คํ”ˆ์†Œ์Šค LVLM๋“ค์€ ์—ฌ์ „ํžˆ ๊ฑฐ์นœ ๋ฐฉ์‹์œผ๋กœ ์ด๋ฏธ์ง€๋ฅผ ์ธ์‹ํ•˜๊ณ  ์žˆ์œผ๋ฉฐ object grounding์ด๋‚˜ text reading๊ณผ ๊ฐ™์€ ์„ธ๋ฐ€ํ•œ ์ธ์‹์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ๋Šฅ๋ ฅ์ด ๋ถ€์กฑํ•ฉ๋‹ˆ๋‹ค.

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ๋ฌธ์ œ์— ๋Œ€ํ•œ ํ•ด๊ฒฐ์ฑ…์„ ๋ชจ์ƒ‰ํ•˜๊ณ  ์˜คํ”ˆ์†Œ์Šค Qwen ํŒจ๋ฐ€๋ฆฌ์˜ ์ตœ์‹  ๊ตฌ์„ฑ์›์ธ Qwen-VL ์‹œ๋ฆฌ์ฆˆ๋ฅผ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค. Qwen-VL๋“ค์€ Qwen-7B (Qwen, 2023) ์–ธ์–ด ๋ชจ๋ธ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ ๊ณ ์„ฑ๋Šฅ์ด๊ณ  ๋‹ค์žฌ๋‹ค๋Šฅํ•œ vision-language foundation ๋ชจ๋ธ ์‹œ๋ฆฌ์ฆˆ์ž…๋‹ˆ๋‹ค. ์–ธ์–ด ์ •๋ ฌ๋œ visual encoder์™€ ์œ„์น˜ ์ธ์‹ adapter๋ฅผ ํฌํ•จํ•œ ์ƒˆ๋กœ์šด visual receptor๋ฅผ ๋„์ž…ํ•˜์—ฌ LLM basement์— visual capacity๋ฅผ ๋ถ€์—ฌํ–ˆ์Šต๋‹ˆ๋‹ค. ์ „์ฒด ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜์™€ input-output interface๋Š” ์ƒ๋‹นํžˆ ๊ฐ„๊ฒฐํ•˜๋ฉฐ, ๋ฐฉ๋Œ€ํ•œ image-text corpus collection์—์„œ ์ „์ฒด ๋ชจ๋ธ์„ ์ตœ์ ํ™”ํ•˜๊ธฐ ์œ„ํ•ด 3๋‹จ๊ณ„ training pipeline์„ ์ •๊ตํ•˜๊ฒŒ ์„ค๊ณ„ํ–ˆ์Šต๋‹ˆ๋‹ค.

์‚ฌ์ „ ํ›ˆ๋ จ๋œ checkpoint์ธ Qwen-VL์€ ์‹œ๊ฐ์  ์ž…๋ ฅ์„ ์ธ์‹ํ•˜๊ณ  ์ดํ•ดํ•˜๋ฉฐ, ์ฃผ์–ด์ง„ prompt์— ๋”ฐ๋ผ ์›ํ•˜๋Š” ์‘๋‹ต์„ ์ƒ์„ฑํ•˜๊ณ , ์ด๋ฏธ์ง€ ์บก์…”๋‹, ์งˆ์˜์‘๋‹ต, ํ…์ŠคํŠธ ์ง€ํ–ฅ ์งˆ์˜์‘๋‹ต, visual grounding๊ณผ ๊ฐ™์€ ๋‹ค์–‘ํ•œ vision-language ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Qwen-VL-Chat์€ Qwen-VL์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ instruction-tuned vision-language chatbot์ž…๋‹ˆ๋‹ค. ๊ทธ๋ฆผ 2์— ๋‚˜ํƒ€๋‚œ ๋ฐ”์™€ ๊ฐ™์ด, Qwen-VL-Chat์€ ์‚ฌ์šฉ์ž์™€ ์ƒํ˜ธ์ž‘์šฉํ•˜๊ณ  ์‚ฌ์šฉ์ž์˜ ์˜๋„์— ๋”ฐ๋ผ ์ž…๋ ฅ ์ด๋ฏธ์ง€๋ฅผ ์ธ์‹ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ตฌ์ฒด์ ์œผ๋กœ, Qwen-VL ์‹œ๋ฆฌ์ฆˆ ๋ชจ๋ธ์˜ ํŠน์ง•์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

โ€ข ์ตœ๊ณ ์˜ ์„ฑ๋Šฅ: Qwen-VL๋“ค์€ ๋น„์Šทํ•œ ๊ทœ๋ชจ์˜ counterpart๋“ค์— ๋น„ํ•ด ๋‹ค์–‘ํ•œ vision-centric ์ดํ•ด benchmark์—์„œ ์ตœ๊ณ  ์ˆ˜์ค€์˜ ์ •ํ™•๋„๋ฅผ ๋‹ฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ Qwen-VL์˜ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์€ ๊ธฐ์กด benchmark(์˜ˆ: captioning, question-answering, grounding) ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์ตœ๊ทผ์— ๋„์ž…๋œ ์ผ๋ถ€ ๋Œ€ํ™” benchmark์—์„œ๋„ ํ™•์ธ๋ฉ๋‹ˆ๋‹ค.

• Multilingual: Like Qwen-LM, Qwen-VLs are trained on multilingual image-text data, with a considerable amount of the corpus in English and Chinese. In this way, Qwen-VLs naturally support English, Chinese, and multilingual instructions.

• Multi-image: During the training phase, arbitrarily interleaved image-text data is allowed as Qwen-VL's input. This feature allows Qwen-VL-Chat to compare, understand, and analyze the context when multiple images are given.

โ€ข ์„ธ๋ฐ€ํ•œ ์‹œ๊ฐ์  ์ดํ•ด: ํ›ˆ๋ จ์—์„œ ์‚ฌ์šฉํ•œ ๋” ๋†’์€ ํ•ด์ƒ๋„์˜ ์ž…๋ ฅ ํฌ๊ธฐ์™€ ์„ธ๋ฐ€ํ•œ corpus ๋•๋ถ„์—, Qwen-VL๋“ค์€ ๋†’์€ ๊ฒฝ์Ÿ๋ ฅ์„ ๊ฐ€์ง„ ์„ธ๋ฐ€ํ•œ ์‹œ๊ฐ์  ์ดํ•ด ๋Šฅ๋ ฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๊ธฐ์กด vision-language generalist๋“ค๊ณผ ๋น„๊ตํ•ด, Qwen-VL๋“ค์€ grounding, text-reading, ํ…์ŠคํŠธ ์ง€ํ–ฅ ์งˆ์˜์‘๋‹ต, ์„ธ๋ฐ€ํ•œ ๋Œ€ํ™” ์„ฑ๋Šฅ์—์„œ ํ›จ์”ฌ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์ž…๋‹ˆ๋‹ค.

  2. Methodology

2.1 ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜

Qwen-VL์˜ ์ „์ฒด ๋„คํŠธ์›Œํฌ ์•„ํ‚คํ…์ฒ˜๋Š” ์„ธ ๊ฐ€์ง€ ๊ตฌ์„ฑ ์š”์†Œ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์œผ๋ฉฐ, ๋ชจ๋ธ parameter์˜ ์„ธ๋ถ€ ์‚ฌํ•ญ์€ ํ‘œ 1์— ๋‚˜์™€ ์žˆ์Šต๋‹ˆ๋‹ค:

Large Language Model: Qwen-VL adopts a large language model as its foundational component. The model is initialized with the pre-trained weights of Qwen-7B (Qwen, 2023).

Visual Encoder: The visual encoder of Qwen-VL uses the Vision Transformer (ViT) (Dosovitskiy et al., 2021) architecture, initialized with the pre-trained weights of OpenCLIP's ViT-bigG (Ilharco et al., 2021). During training and inference, input images are resized to a specific resolution. The visual encoder processes an image by splitting it into patches with a stride of 14, producing a set of image features.
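
As a quick sanity check on the feature lengths this patching implies, here is a minimal sketch (the resolutions and stride come from the paper; the helper function is just for illustration):

```python
def num_vit_patches(resolution: int, patch_stride: int = 14) -> int:
    """Number of patches a square image yields when split with the given stride."""
    side = resolution // patch_stride
    return side * side

# Stage-1 pre-training uses 224x224 inputs; later stages use 448x448.
print(num_vit_patches(224))  # 256 patches -> 256 raw image features
print(num_vit_patches(448))  # 1024 patches, which the adapter then compresses to 256
```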

Position-aware Vision-Language Adapter: To alleviate the efficiency issue arising from long image feature sequences, Qwen-VL introduces a vision-language adapter that compresses the image features. This adapter consists of a randomly initialized single-layer cross-attention module. The module uses a group of trainable vectors (embeddings) as query vectors and the image features from the visual encoder as keys for the cross-attention operation. This mechanism compresses the visual feature sequence to a fixed length of 256. An ablation on the number of queries is given in Appendix E.2. Furthermore, considering the importance of positional information for fine-grained image understanding, 2D absolute positional encodings are incorporated into the cross-attention mechanism's query-key pairs to mitigate the potential loss of positional details during compression. The compressed image feature sequence of length 256 is subsequently fed into the large language model.
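
To make the adapter's mechanics concrete, below is a minimal PyTorch sketch of a single-layer cross-attention resampler with 256 learnable queries. The module structure, feature dimensions, and the simplification of adding the 2D positional encoding only on the key side are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class PositionAwareResampler(nn.Module):
    """Single-layer cross-attention that compresses image features to a fixed length."""

    def __init__(self, dim: int = 1024, num_queries: int = 256, num_heads: int = 8):
        super().__init__()
        # Trainable query embeddings; their count fixes the output sequence length.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats: torch.Tensor, pos_emb: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, N, dim) patch features from the ViT
        # pos_emb:     (N, dim) 2D absolute positional encoding for the patch grid
        b = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # The paper injects positional information into the query-key pairs; this
        # sketch adds it on the key side only, as a simplification.
        k = image_feats + pos_emb.unsqueeze(0)
        out, _ = self.attn(query=q, key=k, value=image_feats)
        return out  # (B, 256, dim), subsequently fed into the LLM

# Example: 1024 patch features (448x448 input) compressed to 256 visual tokens.
resampler = PositionAwareResampler()
feats = torch.randn(2, 1024, 1024)
pos = torch.randn(1024, 1024)
print(resampler(feats, pos).shape)  # torch.Size([2, 256, 1024])
```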

2.2 ์ž…๋ ฅ ๋ฐ ์ถœ๋ ฅ

Image Input: ์ด๋ฏธ์ง€๋Š” visual encoder์™€ adapter๋ฅผ ํ†ตํ•ด ์ฒ˜๋ฆฌ๋˜์–ด ๊ณ ์ • ๊ธธ์ด์˜ ์ด๋ฏธ์ง€ feature sequence๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€ feature ์ž…๋ ฅ๊ณผ ํ…์ŠคํŠธ feature ์ž…๋ ฅ์„ ๊ตฌ๋ณ„ํ•˜๊ธฐ ์œ„ํ•ด, ๋‘ ๊ฐœ์˜ ํŠน์ˆ˜ ํ† ํฐ(์™€ )์ด ์ด๋ฏธ์ง€ feature sequence์˜ ์‹œ์ž‘๊ณผ ๋์— ๊ฐ๊ฐ ์ถ”๊ฐ€๋˜์–ด ์ด๋ฏธ์ง€ ์ฝ˜ํ…์ธ ์˜ ์‹œ์ž‘๊ณผ ๋์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

Bounding Box Input and Output: To enhance the model's fine-grained visual understanding and grounding capacity, Qwen-VL's training involves data in the form of region descriptions, questions, and detections. Unlike conventional tasks involving image-text descriptions or questions, this task requires the model to accurately understand and generate region descriptions in a designated format. For a given bounding box, a normalization process (into the range [0, 1000)) is applied, and the result is transformed into a specified string format: "(X_topleft, Y_topleft),(X_bottomright, Y_bottomright)". The string is tokenized as text and requires no additional positional vocabulary. To distinguish detection strings from regular text strings, two special tokens (<box> and </box>) are added at the beginning and end of the bounding box string. In addition, to appropriately associate bounding boxes with their corresponding descriptive words or sentences, another set of special tokens (<ref> and </ref>) is introduced, marking the content the bounding box refers to.
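
A small sketch of how such a grounded string might be assembled. The token spellings <box>, </box>, <ref>, </ref> and the [0, 1000) normalization follow the paper; the helper function and the example coordinates are hypothetical:

```python
def to_box_string(box, img_w, img_h):
    """Normalize an (x1, y1, x2, y2) pixel box to [0, 1000) and format it as text."""
    x1, y1, x2, y2 = box
    nx1, ny1 = int(x1 / img_w * 1000), int(y1 / img_h * 1000)
    nx2, ny2 = int(x2 / img_w * 1000), int(y2 / img_h * 1000)
    return f"<box>({nx1},{ny1}),({nx2},{ny2})</box>"

# A grounded phrase pairs the referred text with its box:
phrase = "the dog on the left"
grounded = f"<ref>{phrase}</ref>" + to_box_string((50, 120, 410, 630), img_w=1024, img_h=768)
print(grounded)
# <ref>the dog on the left</ref><box>(48,156),(400,820)</box>
```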

  3. Training

๊ทธ๋ฆผ 3์— ๋‚˜ํƒ€๋‚œ ๋ฐ”์™€ ๊ฐ™์ด, Qwen-VL ๋ชจ๋ธ์˜ ํ›ˆ๋ จ ๊ณผ์ •์€ ์„ธ ๋‹จ๊ณ„๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค: ๋‘ ๋‹จ๊ณ„์˜ pre-training๊ณผ ๋งˆ์ง€๋ง‰ instruction fine-tuning ํ›ˆ๋ จ ๋‹จ๊ณ„์ž…๋‹ˆ๋‹ค.

3.1 Pre-training

The first pre-training stage mainly utilizes a large-scale, weakly labeled, web-crawled set of image-text pairs. The pre-training dataset is composed of several publicly accessible sources and some in-house data, and an effort was made to clean the dataset of certain patterns. As summarized in Table 2, the original dataset contains a total of 5 billion image-text pairs; after cleaning, 1.4 billion remain, of which 77.3% is English (text) data and 22.7% is Chinese (text) data.

์ด ๋‹จ๊ณ„์—์„œ๋Š” large language model์„ ๋™๊ฒฐํ•˜๊ณ  vision encoder์™€ VL adapter๋งŒ ์ตœ์ ํ™”ํ•ฉ๋‹ˆ๋‹ค. ์ž…๋ ฅ ์ด๋ฏธ์ง€๋Š” 224 ร— 224๋กœ ํฌ๊ธฐ๊ฐ€ ์กฐ์ •๋ฉ๋‹ˆ๋‹ค. ํ›ˆ๋ จ ๋ชฉ์ ์€ ํ…์ŠคํŠธ ํ† ํฐ์˜ cross-entropy๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ตœ๋Œ€ learning rate๋Š” 2e^-4์ด๋ฉฐ, ํ›ˆ๋ จ ๊ณผ์ •์€ image-text pair์— ๋Œ€ํ•ด batch size 30720์„ ์‚ฌ์šฉํ•˜๊ณ , ์ „์ฒด ์ฒซ ๋ฒˆ์งธ pre-training ๋‹จ๊ณ„๋Š” 50,000 step ๋™์•ˆ ์ง€์†๋˜์–ด ์•ฝ 15์–ต ๊ฐœ์˜ image-text sample์„ ์†Œ๋น„ํ•ฉ๋‹ˆ๋‹ค. ๋” ๋งŽ์€ hyperparameter๋Š” ๋ถ€๋ก C์— ์ž์„ธํžˆ ๋‚˜์™€ ์žˆ๊ณ , ์ด ๋‹จ๊ณ„์˜ ์ˆ˜๋ ด ๊ณก์„ ์€ ๊ทธ๋ฆผ 6์— ๋‚˜ํƒ€๋‚˜ ์žˆ์Šต๋‹ˆ๋‹ค.

3.2 Multi-task Pre-training

In the second, multi-task pre-training stage, high-quality, fine-grained VL annotation data is introduced with a larger input resolution and interleaved image-text data. As summarized in Table 3, seven tasks are trained simultaneously on Qwen-VL. For text generation, an in-house collected corpus is used to maintain the LLM's capability. The captioning data is the same as in Table 2, except with far fewer samples and excluding LAION-COCO. For the VQA task, a mixture of publicly available data is used, including GQA (Hudson and Manning, 2019), VGQA (Krishna et al., 2017), VQAv2 (Goyal et al., 2017), DVQA (Kafle et al., 2018), OCR-VQA (Mishra et al., 2019), and DocVQA (Mathew et al., 2021). Following Kosmos-2, the GRIT (Peng et al., 2023) dataset is used for the grounding task with minor modifications. For the duality tasks of reference grounding and grounded captioning, training samples are constructed from GRIT (Peng et al., 2023), Visual Genome (Krishna et al., 2017), RefCOCO (Kazemzadeh et al., 2014), RefCOCO+, and RefCOCOg (Mao et al., 2016). To improve text-oriented tasks, PDF and HTML format data are collected from Common Crawl, and synthetic OCR data in English and Chinese is generated with natural-scenery backgrounds, following (Kim et al., 2022). Finally, interleaved image-text data is simply constructed by packing data from the same task into sequences of length 2048 (a sketch of this packing follows below).
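
The interleaved construction at the end of the paragraph can be pictured as greedy packing of tokenized same-task samples into fixed-length sequences; a rough sketch under that assumption (the function and truncation policy are illustrative, not the paper's data pipeline):

```python
def pack_samples(tokenized_samples, max_len: int = 2048):
    """Greedily pack tokenized samples of one task into sequences of at most max_len tokens."""
    sequences, current = [], []
    for sample in tokenized_samples:
        if current and len(current) + len(sample) > max_len:
            sequences.append(current)
            current = []
        current.extend(sample[:max_len])  # truncate overly long single samples
        if len(current) >= max_len:
            sequences.append(current[:max_len])
            current = current[max_len:]
    if current:
        sequences.append(current)
    return sequences

# pack_samples([[1]*700, [2]*900, [3]*600]) -> two packed sequences of lengths 1600 and 600
```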

visual encoder์˜ ์ž…๋ ฅ ํ•ด์ƒ๋„๋ฅผ 224 ร— 224์—์„œ 448 ร— 448๋กœ ์ฆ๊ฐ€์‹œ์ผœ ์ด๋ฏธ์ง€ ๋‹ค์šด์ƒ˜ํ”Œ๋ง์œผ๋กœ ์ธํ•œ ์ •๋ณด ์†์‹ค์„ ์ค„์ž…๋‹ˆ๋‹ค. ๋˜ํ•œ ๋” ๋†’์€ ํ•ด์ƒ๋„์˜ vision transformer์— ๋Œ€ํ•ด window attention๊ณผ global attention์„ ๋ถ€๋ก E.3์—์„œ ablationํ•ฉ๋‹ˆ๋‹ค. large language model์˜ ์ž ๊ธˆ์„ ํ•ด์ œํ•˜๊ณ  ์ „์ฒด ๋ชจ๋ธ์„ ํ›ˆ๋ จํ–ˆ์Šต๋‹ˆ๋‹ค. ํ›ˆ๋ จ ๋ชฉ์ ์€ pre-training ๋‹จ๊ณ„์™€ ๋™์ผํ•ฉ๋‹ˆ๋‹ค.

3.3 Supervised Fine-tuning

์ด ๋‹จ๊ณ„์—์„œ๋Š” instruction fine-tuning์„ ํ†ตํ•ด Qwen-VL ์‚ฌ์ „ ํ›ˆ๋ จ ๋ชจ๋ธ์„ fine-tuneํ•˜์—ฌ instruction following ๋ฐ ๋Œ€ํ™” ๋Šฅ๋ ฅ์„ ํ–ฅ์ƒ์‹œ์ผœ ์ƒํ˜ธ์ž‘์šฉ ๊ฐ€๋Šฅํ•œ Qwen-VL-Chat ๋ชจ๋ธ์„ ๋งŒ๋“ค์—ˆ์Šต๋‹ˆ๋‹ค. multi-modal instruction tuning ๋ฐ์ดํ„ฐ๋Š” ์ฃผ๋กœ LLM self-instruction์„ ํ†ตํ•ด ์ƒ์„ฑ๋œ caption ๋ฐ์ดํ„ฐ๋‚˜ ๋Œ€ํ™” ๋ฐ์ดํ„ฐ์—์„œ ๋‚˜์˜ค๋ฉฐ, ์ด๋Š” ์ข…์ข… single-image ๋Œ€ํ™”์™€ ์ถ”๋ก ๋งŒ์„ ๋‹ค๋ฃจ๊ณ  ์ด๋ฏธ์ง€ ๋‚ด์šฉ ์ดํ•ด์— ์ œํ•œ๋ฉ๋‹ˆ๋‹ค. localization๊ณผ multi-image ์ดํ•ด ๋Šฅ๋ ฅ์„ Qwen-VL ๋ชจ๋ธ์— ํ†ตํ•ฉํ•˜๊ธฐ ์œ„ํ•ด manual annotation, ๋ชจ๋ธ ์ƒ์„ฑ, strategy concatenation์„ ํ†ตํ•ด ์ถ”๊ฐ€ ๋Œ€ํ™” ๋ฐ์ดํ„ฐ set์„ ๊ตฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค. ๋ชจ๋ธ์ด ์ด๋Ÿฌํ•œ ๋Šฅ๋ ฅ์„ ๋” ๋„“์€ ๋ฒ”์œ„์˜ ์–ธ์–ด์™€ ์งˆ๋ฌธ ์œ ํ˜•์œผ๋กœ ํšจ๊ณผ์ ์œผ๋กœ ์ „์ดํ•œ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ–ˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ ํ›ˆ๋ จ ์ค‘์— multi-modal๊ณผ ์ˆœ์ˆ˜ ํ…์ŠคํŠธ ๋Œ€ํ™” ๋ฐ์ดํ„ฐ๋ฅผ ํ˜ผํ•ฉํ•˜์—ฌ ๋Œ€ํ™” ๋Šฅ๋ ฅ์—์„œ ๋ชจ๋ธ์˜ ๋ณดํŽธ์„ฑ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค. instruction tuning ๋ฐ์ดํ„ฐ๋Š” 350k๊ฐœ์ž…๋‹ˆ๋‹ค. ์ด ๋‹จ๊ณ„์—์„œ๋Š” visual encoder๋ฅผ ๋™๊ฒฐํ•˜๊ณ  language model๊ณผ adapter module์„ ์ตœ์ ํ™”ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋‹จ๊ณ„์˜ ๋ฐ์ดํ„ฐ ํ˜•์‹์€ ๋ถ€๋ก B.2์—์„œ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

  4. Evaluation

๋ณธ ์„น์…˜์—์„œ๋Š” ๋‹ค์–‘ํ•œ multi-modal ์ž‘์—…์— ๋Œ€ํ•œ ์ „๋ฐ˜์ ์ธ ํ‰๊ฐ€๋ฅผ ์ˆ˜ํ–‰ํ•˜์—ฌ ๋ชจ๋ธ์˜ ์‹œ๊ฐ์  ์ดํ•ด ๋Šฅ๋ ฅ์„ ์ข…ํ•ฉ์ ์œผ๋กœ ํ‰๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ์ดํ•˜์—์„œ Qwen-VL์€ multi-task ํ›ˆ๋ จ ํ›„์˜ ๋ชจ๋ธ์„ ์˜๋ฏธํ•˜๊ณ , Qwen-VL-Chat์€ supervised fine-tuning (SFT) ๋‹จ๊ณ„ ํ›„์˜ ๋ชจ๋ธ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

ํ‘œ 9๋Š” ์‚ฌ์šฉ๋œ ํ‰๊ฐ€ benchmark์™€ ํ•ด๋‹น metric์— ๋Œ€ํ•œ ์ƒ์„ธํ•œ ์š”์•ฝ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

4.1 Image Captioning and General Visual Question Answering

Image caption๊ณผ ์ผ๋ฐ˜์ ์ธ visual question answering (VQA)์€ vision-language ๋ชจ๋ธ์„ ์œ„ํ•œ ๋‘ ๊ฐ€์ง€ ๊ธฐ์กด ์ž‘์—…์ž…๋‹ˆ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ, image caption์€ ๋ชจ๋ธ์ด ์ฃผ์–ด์ง„ ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ ์„ค๋ช…์„ ์ƒ์„ฑํ•˜๋„๋ก ์š”๊ตฌํ•˜๊ณ , ์ผ๋ฐ˜์ ์ธ VQA๋Š” ์ฃผ์–ด์ง„ image-question pair์— ๋Œ€ํ•œ ๋‹ต๋ณ€์„ ์ƒ์„ฑํ•˜๋„๋ก ์š”๊ตฌํ•ฉ๋‹ˆ๋‹ค.

image caption ์ž‘์—…์˜ ๊ฒฝ์šฐ, Nocaps (Agrawal et al., 2019)์™€ Flickr30K (Young et al., 2014)๋ฅผ benchmark๋กœ ์„ ํƒํ•˜๊ณ  CIDEr score (Vedantam et al., 2015)๋ฅผ metric์œผ๋กœ ๋ณด๊ณ ํ•ฉ๋‹ˆ๋‹ค. โ€œDescribe the image in English:โ€๋ผ๋Š” prompt๋กœ caption ์ƒ์„ฑ์— greedy search๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

์ผ๋ฐ˜์ ์ธ VQA์˜ ๊ฒฝ์šฐ, VQAv2 (Goyal et al., 2017), OKVQA (Marino et al., 2019), GQA (Hudson and Manning, 2019), ScienceQA (Image Set) (Lu et al., 2022b), VizWiz VQA (Gurari et al., 2018)๋ฅผ ํฌํ•จํ•œ ๋‹ค์„ฏ ๊ฐœ์˜ benchmark๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. VQAv2, OKVQA, GQA, VizWiz VQA์˜ ๊ฒฝ์šฐ, ๋ชจ๋ธ์˜ ์ถœ๋ ฅ ๊ณต๊ฐ„์— ์ œ์•ฝ ์—†์ด greedy decoding strategy์™€ โ€œ{question} Answer:โ€๋ผ๋Š” prompt๋กœ ๊ฐœ๋ฐฉํ˜• ๋‹ต๋ณ€ ์ƒ์„ฑ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ScienceQA์˜ ๊ฒฝ์šฐ, ๋ชจ๋ธ์˜ ์ถœ๋ ฅ์„ ๊ฐ€๋Šฅํ•œ ์„ ํƒ์ง€๋กœ ์ œํ•œํ•˜๊ณ (๊ฐœ๋ฐฉํ˜•์ด ์•„๋‹Œ), ๊ฐ€์žฅ ๋†’์€ ์‹ ๋ขฐ๋„๋ฅผ ๊ฐ€์ง„ ์„ ํƒ์ง€๋ฅผ ๋ชจ๋ธ์˜ ์˜ˆ์ธก์œผ๋กœ ์„ ํƒํ•˜๋ฉฐ, Top-1 ์ •ํ™•๋„๋ฅผ ๋ณด๊ณ ํ•ฉ๋‹ˆ๋‹ค.

image caption๊ณผ ์ผ๋ฐ˜์ ์ธ VQA ์ž‘์—…์˜ ์ „๋ฐ˜์ ์ธ ์„ฑ๋Šฅ์€ ํ‘œ 4์— ๋ณด๊ณ ๋ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ์—์„œ ๋ณด๋“ฏ์ด, Qwen-VL๊ณผ Qwen-VL-Chat ๋ชจ๋‘ ๋‘ ์ž‘์—… ๋ชจ๋‘์—์„œ ์ด์ „ generalist ๋ชจ๋ธ๋“ค์— ๋น„ํ•ด ๋ช…๋ฐฑํžˆ ๋” ๋‚˜์€ ๊ฒฐ๊ณผ๋ฅผ ๋‹ฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ, zero-shot image caption ์ž‘์—…์—์„œ Qwen-VL์€ Flickr30K karpathy-test split์—์„œ state-of-the-art ์„ฑ๋Šฅ(์ฆ‰, 85.8 CIDEr score)์„ ๋‹ฌ์„ฑํ–ˆ์œผ๋ฉฐ, ํ›จ์”ฌ ๋งŽ์€ parameter๋ฅผ ๊ฐ€์ง„ ์ด์ „ generalist ๋ชจ๋ธ๋“ค(์˜ˆ: 80B parameter๋ฅผ ๊ฐ€์ง„ Flamingo-80B)์„ ๋Šฅ๊ฐ€ํ•˜๊ธฐ๊นŒ์ง€ ํ–ˆ์Šต๋‹ˆ๋‹ค.

์ผ๋ฐ˜์ ์ธ VQA benchmark์—์„œ๋„ ๋ชจ๋ธ๋“ค์ด ๋‹ค๋ฅธ ๋ชจ๋ธ๋“ค์— ๋น„ํ•ด ๋šœ๋ ทํ•œ ์žฅ์ ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. VQAv2, OKVQA, GQA benchmark์—์„œ Qwen-VL์€ ๊ฐ๊ฐ 79.5, 58.6, 59.3์˜ ์ •ํ™•๋„๋ฅผ ๋‹ฌ์„ฑํ•˜์—ฌ ์ตœ๊ทผ ์ œ์•ˆ๋œ LVLM๋“ค์„ ํฐ ํญ์œผ๋กœ ๋Šฅ๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ์ฃผ๋ชฉํ•  ์ ์€ Qwen-VL์ด ScienceQA์™€ VizWiz ๋ฐ์ดํ„ฐ์…‹์—์„œ๋„ ๊ฐ•ํ•œ zero-shot ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

4.2 Text-oriented Visual Question Answering

ํ…์ŠคํŠธ ์ง€ํ–ฅ ์‹œ๊ฐ์  ์ดํ•ด๋Š” ์‹ค์ œ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ ๊ด‘๋ฒ”์œ„ํ•œ ์‘์šฉ ์ „๋ง์„ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. TextVQA (Sidorov et al., 2020), DocVQA (Mathew et al., 2021), ChartQA (Masry et al., 2022), AI2Diagram (Kembhavi et al., 2016), OCR-VQA (Mishra et al., 2019)๋ฅผ ํฌํ•จํ•œ ์—ฌ๋Ÿฌ benchmark์—์„œ ํ…์ŠคํŠธ ์ง€ํ–ฅ visual question answering์— ๋Œ€ํ•œ ๋ชจ๋ธ์˜ ๋Šฅ๋ ฅ์„ ํ‰๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๊ฒฐ๊ณผ๋Š” ํ‘œ 5์— ๋‚˜์™€ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด์ „ generalist ๋ชจ๋ธ๋“ค๊ณผ ์ตœ๊ทผ LVLM๋“ค์— ๋น„ํ•ด, ๋ชจ๋ธ๋“ค์ด ๋Œ€๋ถ€๋ถ„์˜ benchmark์—์„œ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋ฉฐ, ์ข…์ข… ํฐ ํญ์œผ๋กœ ์•ž์„œ๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

4.3 Referring Expression Comprehension

Referring expression comprehension benchmarks such as RefCOCO (Kazemzadeh et al., 2014), RefCOCOg (Mao et al., 2016), RefCOCO+ (Mao et al., 2016), and GRIT (Gupta et al., 2022) are evaluated to demonstrate the models' fine-grained image understanding and localization abilities. Specifically, the referring expression comprehension task requires the model to localize the target object under the guidance of a description. The results are shown in Table 6. Compared to previous generalist models and recent LVLMs, the models obtain top-tier results on all benchmarks.

4.4 Vision-Language ์ž‘์—…์—์„œ์˜ Few-shot Learning

๋ชจ๋ธ์€ ๋งŒ์กฑ์Šค๋Ÿฌ์šด in-context learning(a.k.a., few-shot learning) ๋Šฅ๋ ฅ๋„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๊ทธ๋ฆผ 4์— ๋‚˜ํƒ€๋‚œ ๋ฐ”์™€ ๊ฐ™์ด, Qwen-VL์€ ๋น„์Šทํ•œ ์ˆ˜์˜ parameter๋ฅผ ๊ฐ€์ง„ ๋ชจ๋ธ๋“ค(Flamingo-9B(Alayrac et al., 2022), OpenFlamingo-9B, IDEFICS-9B)๊ณผ ๋น„๊ตํ–ˆ์„ ๋•Œ OKVQA (Marino et al., 2019), Vizwiz (Gurari et al., 2018), TextVQA (Sidorov et al., 2020), Flickr30k (Young et al., 2014)์—์„œ in-context few-shot learning์„ ํ†ตํ•ด ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค. Qwen-VL์˜ ์„ฑ๋Šฅ์€ ํ›จ์”ฌ ํฐ ๋ชจ๋ธ๋“ค(Flamingo-80B์™€ IDEFICS-80B)๊ณผ๋„ ๋น„๊ตํ•  ๋งŒํ•ฉ๋‹ˆ๋‹ค. ๋” ๋‚˜์€ ๊ฒฐ๊ณผ๊ฐ€ ๋‹ฌ์„ฑ๋  ์ˆ˜ ์žˆ์Œ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  RICES (Yang et al., 2022b)์™€ ๊ฐ™์€ ์ •๊ตํ•œ few-shot exemplar ๊ตฌ์„ฑ ๋ฐฉ๋ฒ•์€ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ  naive random sample์„ ์ฑ„ํƒํ•˜์—ฌ few-shot exemplar๋ฅผ ๊ตฌ์„ฑํ–ˆ๋‹ค๋Š” ์ ์„ ์ฐธ๊ณ ํ•˜์‹œ๊ธฐ ๋ฐ”๋ž๋‹ˆ๋‹ค.

4.5 Instruction Following in Real-world User Behavior

์ด์ „์˜ ๊ธฐ์กด vision-language ํ‰๊ฐ€ ์™ธ์—๋„, ์‹ค์ œ ์‚ฌ์šฉ์ž ํ–‰๋™ ํ•˜์—์„œ Qwen-VL-Chat ๋ชจ๋ธ์˜ capacity๋ฅผ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด TouchStone (Bai et al., 2023), SEED-Bench (Li et al., 2023b), MME (Fu et al., 2023)์— ๋Œ€ํ•œ ํ‰๊ฐ€๋ฅผ ์ถ”๊ฐ€๋กœ ์ˆ˜ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค. TouchStone์€ ๊ฐœ๋ฐฉํ˜• vision-language instruction-following benchmark์ž…๋‹ˆ๋‹ค. TouchStone benchmark์—์„œ ์˜์–ด์™€ ์ค‘๊ตญ์–ด ๋ชจ๋‘์—์„œ ๋‹ค๋ฅธ instruction-tuned LVLM๋“ค๊ณผ Qwen-VL-Chat์˜ instruction-following ๋Šฅ๋ ฅ์„ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค. SEED-Bench๋Š” Multimodal LLM์„ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•œ ์ •ํ™•ํ•œ ์ธ๊ฐ„ annotation์ด ์žˆ๋Š” 19K๊ฐœ์˜ ๊ฐ๊ด€์‹ ๋ฌธ์ œ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์œผ๋ฉฐ, ๊ณต๊ฐ„์  ๋ฐ ์‹œ๊ฐ„์  ์ดํ•ด๋ฅผ ๋ชจ๋‘ ํฌํ•จํ•˜๋Š” 12๊ฐœ์˜ ํ‰๊ฐ€ ์ฐจ์›์„ ๋‹ค๋ฃน๋‹ˆ๋‹ค. MME๋Š” ์ด 14๊ฐœ์˜ subtask์—์„œ perception๊ณผ cognition ๋Šฅ๋ ฅ์„ ๋ชจ๋‘ ์ธก์ •ํ•ฉ๋‹ˆ๋‹ค.

์„ธ benchmark์— ๋Œ€ํ•œ ๊ฒฐ๊ณผ๋Š” ํ‘œ 7์— ๋‚˜์™€ ์žˆ์Šต๋‹ˆ๋‹ค. Qwen-VL-Chat์€ ์„ธ ๋ฐ์ดํ„ฐ์…‹ ๋ชจ๋‘์—์„œ ๋‹ค๋ฅธ LVLM๋“ค์— ๋น„ํ•ด ๋ช…๋ฐฑํ•œ ์žฅ์ ์„ ๋‹ฌ์„ฑํ–ˆ์œผ๋ฉฐ, ์ด๋Š” ๋ชจ๋ธ์ด ๋‹ค์–‘ํ•œ ์‚ฌ์šฉ์ž instruction์„ ์ดํ•ดํ•˜๊ณ  ๋‹ต๋ณ€ํ•˜๋Š” ๋ฐ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค๋Š” ๊ฒƒ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. SEED-Bench์—์„œ๋Š” ๋‹จ์ˆœํžˆ ๋„ค ๊ฐœ์˜ frame์„ ์ƒ˜ํ”Œ๋งํ•˜์—ฌ ๋ชจ๋ธ์˜ ์‹œ๊ฐ์  ๋Šฅ๋ ฅ์ด ๋น„๋””์˜ค ์ž‘์—…์— ํšจ๊ณผ์ ์œผ๋กœ ์ „์ด๋  ์ˆ˜ ์žˆ์Œ์„ ๋ฐœ๊ฒฌํ–ˆ์Šต๋‹ˆ๋‹ค. TouchStone์—์„œ ์ œ์‹œ๋œ ์ „๋ฐ˜์ ์ธ ์ ์ˆ˜ ๋ฉด์—์„œ, ๋ชจ๋ธ์€ ๋‹ค๋ฅธ LVLM๋“ค์— ๋น„ํ•ด ํŠนํžˆ ์ค‘๊ตญ์–ด ๋Šฅ๋ ฅ์—์„œ ๋ช…ํ™•ํ•œ ์žฅ์ ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๋Šฅ๋ ฅ์˜ ๊ด‘๋ฒ”์œ„ํ•œ ๋ฒ”์ฃผ ๋ฉด์—์„œ, ๋ชจ๋ธ์€ ์ดํ•ด์™€ ์ธ์‹์—์„œ ๋” ๋‘๋“œ๋Ÿฌ์ง„ ์žฅ์ ์„ ๋ณด์ด๋ฉฐ, ํŠนํžˆ ํ…์ŠคํŠธ ์ธ์‹๊ณผ ์ฐจํŠธ ๋ถ„์„๊ณผ ๊ฐ™์€ ์˜์—ญ์—์„œ ๊ทธ๋ ‡์Šต๋‹ˆ๋‹ค. ๋” ์ž์„ธํ•œ ์ •๋ณด๋Š” TouchStone ๋ฐ์ดํ„ฐ์…‹์„ ์ฐธ์กฐํ•˜์‹œ๊ธฐ ๋ฐ”๋ž๋‹ˆ๋‹ค.

  5. Related Work

์ตœ๊ทผ ๋ช‡ ๋…„๊ฐ„ ์—ฐ๊ตฌ์ž๋“ค์€ vision-language learning์— ์ƒ๋‹นํ•œ ๊ด€์‹ฌ์„ ๋ณด์—ฌ์™”์œผ๋ฉฐ, ํŠนํžˆ multi-task generalist ๋ชจ๋ธ ๊ฐœ๋ฐœ์—์„œ ๊ทธ๋ ‡์Šต๋‹ˆ๋‹ค. CoCa (Yu et al., 2022)๋Š” image-text retrieval๊ณผ vision-language ์ƒ์„ฑ ์ž‘์—…์„ ๋™์‹œ์— ๋‹ค๋ฃจ๊ธฐ ์œ„ํ•ด encoder-decoder ๊ตฌ์กฐ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. OFA (Wang et al., 2022a)๋Š” ์‚ฌ์šฉ์ž ์ง€์ • ์ž‘์—… instruction์„ ์‚ฌ์šฉํ•˜์—ฌ ํŠน์ • vision-language ์ž‘์—…์„ sequence-to-sequence ์ž‘์—…์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. Unified I/O (Lu et al., 2022a)๋Š” segmentation๊ณผ depth estimation๊ณผ ๊ฐ™์€ ๋” ๋งŽ์€ ์ž‘์—…์„ ํ†ตํ•ฉ๋œ ํ”„๋ ˆ์ž„์›Œํฌ๋กœ ๋„์ž…ํ•ฉ๋‹ˆ๋‹ค.

Another category of research focuses on building vision-language representation models. CLIP (Radford et al., 2021) leverages contrastive learning and massive amounts of data to align images and language in a semantic space, yielding strong generalization across a wide range of downstream tasks. BEiT-3 (Wang et al., 2022b) employs a mixture-of-experts (MoE) structure and a unified masked token prediction objective, achieving state-of-the-art results on various visual-language tasks. Beyond vision-language learning, ImageBind (Girdhar et al., 2023) and ONE-PEACE (Wang et al., 2023) align more modalities, such as speech, into a unified semantic space, producing more general representation models.

์ƒ๋‹นํ•œ ์ง„์ „์„ ๋‹ฌ์„ฑํ–ˆ์Œ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ , ์ด์ „ vision-language ๋ชจ๋ธ๋“ค์€ ์—ฌ์ „ํžˆ instruction following์—์„œ์˜ ๋‚ฎ์€ ๊ฒฌ๊ณ ์„ฑ, ๋ฏธ์ง€์˜ ์ž‘์—…์—์„œ ์ œํ•œ๋œ ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ, in-context ๋Šฅ๋ ฅ์˜ ๋ถ€์กฑ๊ณผ ๊ฐ™์€ ์—ฌ๋Ÿฌ ํ•œ๊ณ„๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. Large Language Model (LLM)์˜ ๊ธ‰์†ํ•œ ๋ฐœ์ „๊ณผ ํ•จ๊ป˜, ์—ฐ๊ตฌ์ž๋“ค์€ LLM์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋” ๊ฐ•๋ ฅํ•œ large vision-language model (LVLM)์„ ๊ตฌ์ถ•ํ•˜๊ธฐ ์‹œ์ž‘ํ–ˆ์Šต๋‹ˆ๋‹ค. BLIP-2 (Li et al., 2023c)๋Š” ๋™๊ฒฐ๋œ vision foundation ๋ชจ๋ธ๊ณผ LLM์„ ์ •๋ ฌํ•˜๊ธฐ ์œ„ํ•ด Q-Former๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ํ•œํŽธ, LLAVA (Liu et al., 2023)์™€ MiniGPT4 (Zhu et al., 2023)๋Š” LVLM์—์„œ instruction following ๋Šฅ๋ ฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด visual instruction tuning์„ ๋„์ž…ํ•ฉ๋‹ˆ๋‹ค. ์ถ”๊ฐ€์ ์œผ๋กœ, mPLUG-DocOwl (Ye et al., 2023a)์€ ๋””์ง€ํ„ธ ๋ฌธ์„œ ๋ฐ์ดํ„ฐ๋ฅผ ๋„์ž…ํ•˜์—ฌ LVLM์— ๋ฌธ์„œ ์ดํ•ด ๋Šฅ๋ ฅ์„ ํ†ตํ•ฉํ•ฉ๋‹ˆ๋‹ค. Kosmos2 (Peng et al., 2023), Shikra (Chen et al., 2023a), BuboGPT (Zhao et al., 2023)๋Š” visual grounding ๋Šฅ๋ ฅ์œผ๋กœ LVLM์„ ๋”์šฑ ํ–ฅ์ƒ์‹œ์ผœ ์ง€์—ญ ์„ค๋ช…๊ณผ localization์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” image captioning, visual question answering, OCR, document understanding, visual grounding ๋Šฅ๋ ฅ์„ Qwen-VL์— ํ†ตํ•ฉํ•ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ ๋ชจ๋ธ์€ ์ด๋Ÿฌํ•œ ๋‹ค์–‘ํ•œ ์Šคํƒ€์ผ์˜ ์ž‘์—…์—์„œ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.

  1. ๊ฒฐ๋ก  ๋ฐ ํ–ฅํ›„ ์—ฐ๊ตฌ

The authors have released the Qwen-VL series, a set of large-scale multilingual vision-language models that aims to facilitate multimodal research. Qwen-VL outperforms comparable models across various benchmarks and supports multilingual conversations, multi-image interleaved conversations, grounding in Chinese, and fine-grained recognition. Going forward, they are dedicated to further improving Qwen-VL's capabilities along several key dimensions:

โ€ข ์Œ์„ฑ ๋ฐ ๋น„๋””์˜ค์™€ ๊ฐ™์€ ๋” ๋งŽ์€ modality์™€ Qwen-VL์„ ํ†ตํ•ฉํ•ฉ๋‹ˆ๋‹ค.
โ€ข ๋ชจ๋ธ ํฌ๊ธฐ, ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ ๋ฐ ๋” ๋†’์€ ํ•ด์ƒ๋„๋ฅผ ํ™•์žฅํ•˜์—ฌ Qwen-VL์„ ์ฆ๊ฐ•ํ•˜๊ณ , multimodal ๋ฐ์ดํ„ฐ ๋‚ด์—์„œ ๋” ๋ณต์žกํ•˜๊ณ  ๋ณต์žกํ•œ ๊ด€๊ณ„๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.
โ€ข ํŠนํžˆ ๊ณ ํ’ˆ์งˆ ์ด๋ฏธ์ง€์™€ ์œ ์ฐฝํ•œ ์Œ์„ฑ ์ƒ์„ฑ์—์„œ multi-modal ์ƒ์„ฑ์— ๋Œ€ํ•œ Qwen-VL์˜ ๊ธฐ๋Ÿ‰์„ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค.


