[Paper Review] LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models

Posted by Euisuk's Dev Log on September 15, 2025

์›๋ณธ ๊ฒŒ์‹œ๊ธ€: https://velog.io/@euisuk-chung/Paper-Review-Learning-Strong-Open-Vocabulary-Object-Detectors-under-theSupervision-of-Large-Language-Models

https://arxiv.org/pdf/2501.18954

๋ณธ ๋ฆฌ๋ทฐ๋Š” ์›๋ฌธ์„ ์ตœ๋Œ€ํ•œ ์ง์—ญํ•œ ๋‚ด์šฉ์ž…๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ โ€œ์šฐ๋ฆฌ๋Š”โ€์€ ์ €์ž๋ฅผ ์ง€์นญํ•ฉ๋‹ˆ๋‹ค. ์ฐธ๊ณ  ๋ถ€ํƒ๋“œ๋ฆฝ๋‹ˆ๋‹ค.

Abstract

์ตœ๊ทผ open-vocabulary detector๋“ค์€ ํ’๋ถ€ํ•œ region-level ์ฃผ์„ ๋ฐ์ดํ„ฐ๋กœ ์œ ๋งํ•œ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

(์ฐธ๊ณ ) Region-level ์ฃผ์„ ๋ฐ์ดํ„ฐ๋ž€?

  • Region-level ์ฃผ์„ ๋ฐ์ดํ„ฐ๋Š” ์ด๋ฏธ์ง€์—์„œ ํŠน์ • ์˜์—ญ(bounding box)๊ณผ ๊ทธ ์˜์—ญ์— ํ•ด๋‹นํ•˜๋Š” ํ…์ŠคํŠธ ์„ค๋ช…์„ ์—ฐ๊ฒฐํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ๋งํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด:
    • ์ด๋ฏธ์ง€์—์„œ ์‚ฌ๋žŒ์ด ์žˆ๋Š” ์˜์—ญ โ†’ โ€œpersonโ€
    • ์ด๋ฏธ์ง€์—์„œ ์ž๋™์ฐจ๊ฐ€ ์žˆ๋Š” ์˜์—ญ โ†’ โ€œcarโ€
    • ์ด๋ฏธ์ง€์—์„œ ๋‚˜๋ฌด๊ฐ€ ์žˆ๋Š” ์˜์—ญ โ†’ โ€œtreeโ€

๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” large language model๊ณผ ํ•จ๊ป˜ ๊ฐ ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ image-level ์ƒ์„ธ caption์„ ์ƒ์„ฑํ•˜์—ฌ co-trainingํ•˜๋Š” open-vocabulary detector๊ฐ€ ์„ฑ๋Šฅ์„ ๋”์šฑ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

(์ฐธ๊ณ ) Region-level vs Image-level

  • Region-level๋งŒ ์‚ฌ์šฉํ•œ ๊ฒฝ์šฐ:
    • โ€œ์‚ฌ๋žŒโ€, โ€œ์ฃผ๋ฐฉโ€, โ€œ์ ‘์‹œโ€
  • Image-level caption ์ถ”๊ฐ€ํ•œ ๊ฒฝ์šฐ:
    • โ€œ์ด๋ฏธ์ง€์—๋Š” ๋‘ ์‚ฌ๋žŒ์ด ์ฃผ๋ฐฉ์— ์žˆ์Šต๋‹ˆ๋‹ค. ์™ผ์ชฝ ์‚ฌ๋žŒ์€ ๋นจ๊ฐ„์ƒ‰, ํŒŒ๋ž€์ƒ‰, ํฐ์ƒ‰ ๋ฌด๋Šฌ์˜ ์ฒดํฌ ์…”์ธ ๋ฅผ ์ž…๊ณ  ์žˆ์œผ๋ฉฐโ€ฆ ์˜ค๋ฅธ์ชฝ ์‚ฌ๋žŒ์€ ์ง„ํ•œ ํŒŒ๋ž€์ƒ‰ ํ‹ฐ์…”์ธ ๋ฅผ ์ž…๊ณ  ์‹ฑํฌ๋Œ€์—์„œ ์„ค๊ฑฐ์ง€๋ฅผ ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹คโ€ฆโ€

์ด ๋ชฉํ‘œ๋ฅผ ๋‹ฌ์„ฑํ•˜๊ธฐ ์œ„ํ•ด ๋จผ์ € GroundingCap-1M ๋ฐ์ดํ„ฐ์…‹์„ ์ˆ˜์ง‘ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ฐ์ดํ„ฐ์…‹์˜ ๊ฐ ์ด๋ฏธ์ง€๋Š” ๊ด€๋ จ grounding label๊ณผ image-level ์ƒ์„ธ caption์ด ํ•จ๊ป˜ ์ œ๊ณต๋ฉ๋‹ˆ๋‹ค. ์ด ๋ฐ์ดํ„ฐ์…‹์„ ํ†ตํ•ด ํ‘œ์ค€ grounding loss์™€ caption generation loss๋ฅผ ํฌํ•จํ•œ ํ›ˆ๋ จ ๋ชฉํ‘œ๋กœ open-vocabulary detector๋ฅผ fine-tuningํ•ฉ๋‹ˆ๋‹ค.

Large language model์„ ํ™œ์šฉํ•˜์—ฌ ๊ฐ ๊ด€์‹ฌ ์˜์—ญ์— ๋Œ€ํ•œ region-level ์งง์€ caption๊ณผ ์ „์ฒด ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ image-level ๊ธด caption์„ ๋ชจ๋‘ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. Large language model์˜ ์ง€๋„ํ•˜์— ๊ฒฐ๊ณผ๋กœ ๋‚˜์˜จ detector์ธ LLMDet์€ ๊ธฐ์ค€์„ ์„ ๋ช…ํ™•ํ•œ ์ฐจ์ด๋กœ ๋Šฅ๊ฐ€ํ•˜๋ฉฐ, ๋›ฐ์–ด๋‚œ open-vocabulary ๋Šฅ๋ ฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๋˜ํ•œ ๊ฐœ์„ ๋œ LLMDet์ด ๋” ๊ฐ•๋ ฅํ•œ large multi-modal model์„ ๊ตฌ์ถ•ํ•˜์—ฌ ์ƒํ˜ธ ์ด์ต์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

  1. Introduction

Open-vocabulary object detection์€ ์‚ฌ์šฉ์ž ์ž…๋ ฅ์˜ ํ…์ŠคํŠธ ๋ ˆ์ด๋ธ”์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์ž„์˜์˜ ํด๋ž˜์Šค๋ฅผ ํƒ์ง€ํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•˜๋ฉฐ, ์ด๋Š” ์ „ํ†ต์ ์ธ closed-set object detection๋ณด๋‹ค ๋” ์ผ๋ฐ˜์ ์ธ ํƒ์ง€ ์ž‘์—…์ž…๋‹ˆ๋‹ค.

  • GLIP์€ region-word contrastive pre-training์„ ํ†ตํ•ด object detection๊ณผ phrase grounding์„ ์ฒ˜์Œ์œผ๋กœ ํ†ตํ•ฉํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ณต์‹ํ™”๋Š” ๊ด‘๋ฒ”์œ„ํ•œ ๊ฐœ๋…์„ ๋‹ค๋ฃจ๋Š” ๋ฐฉ๋Œ€ํ•œ grounding ๋ฐ image-text ๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ ์ด์ต์„ ์–ป์–ด ํ•™์Šต๋œ ํ‘œํ˜„์„ ์˜๋ฏธ๋ก ์ ์œผ๋กœ ํ’๋ถ€ํ•˜๊ฒŒ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

https://arxiv.org/pdf/2112.03857

ํ›„์† ์—ฐ๊ตฌ๋“ค์€ ํšจ๊ณผ์ ์ธ vision-language fusion๊ณผ ์„ธ์‹ฌํ•˜๊ฒŒ ์„ค๊ณ„๋œ word embedding ๋ฐ negative sample์„ ํ†ตํ•œ ์„ธ๋ฐ€ํ•œ region-word alignment์— ์ดˆ์ ์„ ๋งž์ท„์Šต๋‹ˆ๋‹ค. Pretraining ๋ฐ์ดํ„ฐ์™€ ์—ฐ์‚ฐ์„ ํ™•์žฅํ•จ์œผ๋กœ์จ ๊ธฐ์กด open-vocabulary object detector๋“ค์€ ๋‹ค์–‘ํ•œ ๋ฒค์น˜๋งˆํฌ์—์„œ ๋†€๋ผ์šด zero-shot ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

์ตœ๊ทผ ์—ฐ๊ตฌ๋“ค์€ grounding ์ž‘์—…์„ ๋‹ค๋ฅธ ์–ธ์–ด ์ž‘์—…๊ณผ ํ†ตํ•ฉํ•˜๋Š” ๊ฒƒ์ด language knowledge๋กœ ์‹œ๊ฐ์  ํ‘œํ˜„์„ ํ’๋ถ€ํ•˜๊ฒŒ ํ•˜์—ฌ ๋” ๊ฐ•๋ ฅํ•œ open-vocabulary detector๋ฅผ ๋งŒ๋“ ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

  • GLIPv2๋Š” grounding loss์™€ masked language modeling loss ํ•˜์—์„œ ๋ชจ๋ธ์„ pre-trainํ•ฉ๋‹ˆ๋‹ค. ์ด์–ด์„œ CapDet๊ณผ DetCLIPv3๋Š” dense captioning๊ณผ grounding์„ ํ†ตํ•ฉํ•˜๋Š” ๊ฒƒ๋„ open-vocabulary ๋Šฅ๋ ฅ์„ ํ–ฅ์ƒ์‹œํ‚จ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

GLIPv2 - https://arxiv.org/pdf/2206.05836

CapDet - https://arxiv.org/pdf/2303.02489

DetCLIPv3 - https://arxiv.org/pdf/2404.09216

ํ•˜์ง€๋งŒ ์ด๋“ค์€ ๊ฐ object์— ๋Œ€ํ•ด ์งง์€ caption์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค(์˜ˆ: ๊ฑฐ์นœ ์„ค๋ช…๊ณผ ๊ณ„์ธต์  ํด๋ž˜์Šค ๋ ˆ์ด๋ธ”). ์ด๋Š” ๊ฑฐ์น ๊ณ  ๊ฐœ๋ณ„์ ์ด๋ฉฐ ๊ฐ์ฒด ๊ฐ„์˜ ์—ฐ๊ด€์„ฑ์ด ๋ถ€์กฑํ•ฉ๋‹ˆ๋‹ค.

  • ๋ฐ˜๋ฉด, ํ’๋ถ€ํ•œ ์„ธ๋ถ€์‚ฌํ•ญ๊ณผ ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ ํฌ๊ด„์  ์ดํ•ด๋ฅผ ํฌํ•จํ•˜๋Š” ๊ธด image-level caption์€ ์งง์€ region-level ์„ค๋ช…๋ณด๋‹ค ๋” ๋งŽ์€ ์ •๋ณด๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

2.1. Open-Vocabulary Object Detection

Open-vocabulary object detection(OVD)์—์„œ detector๋Š” ์ œํ•œ๋œ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์…‹์—์„œ ํ›ˆ๋ จ๋˜์ง€๋งŒ ์ž„์˜์˜ ํ…Œ์ŠคํŠธ ์‹œ์  ์‚ฌ์šฉ์ž ์ž…๋ ฅ ํด๋ž˜์Šค๋ฅผ ํƒ์ง€ํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•ฉ๋‹ˆ๋‹ค. ์ž„์˜์˜ ํด๋ž˜์Šค๋ฅผ ํƒ์ง€ํ•˜๊ธฐ ์œ„ํ•ด open-vocabulary object detection์€ ํด๋ž˜์Šค ์ด๋ฆ„์œผ๋กœ ๋ณด์ง€ ๋ชปํ•œ ํด๋ž˜์Šค๋ฅผ ํƒ์ง€ํ•  ์ˆ˜ ์žˆ๋„๋ก vision-language ์ž‘์—…์œผ๋กœ ๊ณต์‹ํ™”๋ฉ๋‹ˆ๋‹ค.

  • CLIP์˜ ์ธ์ƒ์ ์ธ zero-shot ๋Šฅ๋ ฅ์— ๋™๊ธฐ๋ฅผ ๋ฐ›์•„, detector๋ฅผ CLIP๊ณผ ์ •๋ ฌํ•˜๊ฑฐ๋‚˜ CLIP์„ ๋ชจ๋ธ์˜ ์ผ๋ถ€๋กœ ํ†ตํ•ฉํ•˜๋Š” ๊ฒƒ์ด OVD๋ฅผ ๋‹ค๋ฃจ๋Š” ์ง์ ‘์ ์ธ ๋ฐฉํ–ฅ์ž…๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ CLIP์€ image-level ๋ชฉํ‘œ๋กœ pre-train๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— CLIP์˜ ํŠน์ง•์ด OVD์— ์™„๋ฒฝํ•˜๊ฒŒ ์ ํ•ฉํ•˜์ง€๋Š” ์•Š์Šต๋‹ˆ๋‹ค.

๋Œ€์•ˆ์ ์œผ๋กœ, ๋‹ค์–‘ํ•œ ๋ฆฌ์†Œ์Šค๋กœ๋ถ€ํ„ฐ์˜ ๋ฐฉ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ๋กœ object-aware visual-language ๊ณต๊ฐ„์„ ๊ตฌ์ถ•ํ•˜๋Š” ๊ฒƒ์ด ์ธ์ƒ์ ์ธ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ masked language modeling์ด๋‚˜ dense captioning๊ณผ ๊ฐ™์€ ๋‹ค๋ฅธ ์–ธ์–ด ์ž‘์—…๊ณผ์˜ multi-task learning์ด ๋” ๋‚˜์€ vision-language alignment๋ฅผ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์–ด detector์˜ open-vocabulary ๋Šฅ๋ ฅ์„ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค.

2.2. Large Vision-Language Model

์ตœ๊ทผ large vision-language model๋“ค์€ large language model์— ๋›ฐ์–ด๋‚œ ์‹œ๊ฐ์  ์ธ์‹ ๋ฐ ์ดํ•ด ๋Šฅ๋ ฅ์„ ๋ถ€์—ฌํ•ฉ๋‹ˆ๋‹ค.

์ผ๋ฐ˜์ ์ธ large vision-language model์€ ์„ธ ๋ถ€๋ถ„์œผ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค: (1) vision token์„ ์ถ”์ถœํ•˜๋Š” vision foundation model, (2) vision feature๋ฅผ language ๊ณต๊ฐ„์œผ๋กœ ๋งคํ•‘ํ•˜๋Š” projector, ๊ทธ๋ฆฌ๊ณ  (3) ์‹œ๊ฐ์  ๋ฐ ํ…์ŠคํŠธ ์ž…๋ ฅ์„ ๋ชจ๋‘ ์ดํ•ดํ•˜๋Š” large language model์ž…๋‹ˆ๋‹ค.

  3. The GroundingCap-1M Dataset

Data Formulation

LLMDet ํ›ˆ๋ จ์„ ์ง€์›ํ•˜๊ธฐ ์œ„ํ•ด ๊ฐ ํ›ˆ๋ จ ์ƒ˜ํ”Œ์„ quadruple (I, Tg, B, Tc)๋กœ ๊ณต์‹ํ™”ํ•ฉ๋‹ˆ๋‹ค.

  • ์—ฌ๊ธฐ์„œ I๋Š” ์ด๋ฏธ์ง€, Tg๋Š” ์งง์€ grounding ํ…์ŠคํŠธ, B๋Š” grounding ํ…์ŠคํŠธ์˜ ๊ตฌ๋ฌธ์— ๋งคํ•‘๋œ ์ฃผ์„์ด ์žˆ๋Š” bounding box๋“ค, Tc๋Š” ์ „์ฒด ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ ์ƒ์„ธํ•œ caption์ž…๋‹ˆ๋‹ค.

์ „์ฒด ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ ์ƒ์„ธํ•œ caption์„ ์ˆ˜์ง‘ํ•  ๋•Œ ๋‘ ๊ฐ€์ง€ ํ•ต์‹ฌ ์›์น™์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

  1. Caption์€ ๊ฐ€๋Šฅํ•œ ํ•œ ๋งŽ์€ ์„ธ๋ถ€์‚ฌํ•ญ์„ ํฌํ•จํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

Caption์ด object ์œ ํ˜•, ์งˆ๊ฐ, ์ƒ‰์ƒ, object์˜ ๋ถ€๋ถ„, object ๋™์ž‘, ์ •ํ™•ํ•œ object ์œ„์น˜, ์ด๋ฏธ์ง€์˜ ํ…์ŠคํŠธ๋ฅผ ์„ค๋ช…ํ•˜์—ฌ ์ •๋ณด๊ฐ€ ํ’๋ถ€ํ•˜๋„๋ก ๊ธฐ๋Œ€ํ•ฉ๋‹ˆ๋‹ค.

  1. Caption์€ ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ ์‚ฌ์‹ค์  ์„ธ๋ถ€์‚ฌํ•ญ๋งŒ ํฌํ•จํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

๋„ˆ๋ฌด ๋งŽ์€ ์ƒ์ƒ์ ์ด๊ฑฐ๋‚˜ ์ถ”๋ก ์ ์ธ caption์€ ์ •๋ณด ๋ฐ€๋„๋ฅผ ๊ฐ์†Œ์‹œํ‚ค๊ฑฐ๋‚˜ ๋ชจ๋ธ ํ•™์Šต์„ ๋ฐฉํ•ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Dataset Construction

๊ตฌ์ถ• ๋น„์šฉ์„ ์ ˆ์•ฝํ•˜๊ธฐ ์œ„ํ•ด bounding box๋‚˜ ์ƒ์„ธํ•œ caption์ด ์žˆ๋Š” ๊ธฐ์กด ๋ฐ์ดํ„ฐ์…‹์—์„œ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค. ์ด์ „ ์—ฐ๊ตฌ๋“ค์„ ๋”ฐ๋ผ object detection ๋ฐ์ดํ„ฐ์…‹, grounding ๋ฐ์ดํ„ฐ์…‹, image-text ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ˆ˜์ง‘ํ•ฉ๋‹ˆ๋‹ค.

GroundingCap-1M์€ ์—ฌ๋Ÿฌ ๊ธฐ์กด ๋ฐ์ดํ„ฐ์…‹์„ ์กฐํ•ฉํ•˜๊ณ  ์ƒˆ๋กœ์šด ์ •๋ณด๋ฅผ ์ถ”๊ฐ€ํ•˜์—ฌ ๋งŒ๋“  ํ†ตํ•ฉ ๋ฐ์ดํ„ฐ์…‹์ž…๋‹ˆ๋‹ค.

  • Object detection ๋ฐ์ดํ„ฐ์…‹ โ†’ caption ์ถ”๊ฐ€
  • Grounding ๋ฐ์ดํ„ฐ์…‹ โ†’ ์ƒ์„ธํ•œ caption ์ถ”๊ฐ€
  • Image-text ๋ฐ์ดํ„ฐ์…‹ โ†’ bounding box ์ถ”๊ฐ€

Object detection ๋ฐ์ดํ„ฐ์…‹์˜ ๊ฒฝ์šฐ:

  • COCO์™€ V3Det ๋ฐ์ดํ„ฐ์…‹์„ ์„ ํƒ
  • COCO์˜ ์ƒ์„ธํ•œ caption์€ ShareGPT4V(168k)์™€ ASv2(42k)์—์„œ ์ˆ˜์ง‘
  • V3Det์˜ caption์€ Qwen2-VL-72b๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ƒ์„ฑ

Grounding ๋ฐ์ดํ„ฐ์…‹์˜ ๊ฒฝ์šฐ:

  • GQA์™€ Flickr30k๋ฅผ ํฌํ•จํ•˜๋Š” GoldG๋ฅผ ์„ ํƒ
  • ๊ณ„์‚ฐ์„ ์ ˆ์•ฝํ•˜๊ณ  negative๋ฅผ ์ฆ๊ฐ€์‹œํ‚ค๊ธฐ ์œ„ํ•ด ๋™์ผํ•œ ์ด๋ฏธ์ง€์˜ ์ผ๋ถ€ grounding ํ…์ŠคํŠธ๋ฅผ ๋ณ‘ํ•ฉ
  • ์ƒ์„ธํ•œ caption๋„ Qwen2-VL-72b๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ƒ์„ฑ

Image-text ๋ฐ์ดํ„ฐ์…‹์˜ ๊ฒฝ์šฐ:

  • LLaVA-OneVision๊ณผ ShareGPT4V์˜ caption์ด ์žˆ๋Š” LCS-558k ์‚ฌ์šฉ
  • ์ด ๋ฐ์ดํ„ฐ์…‹์˜ ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ pseudo box๋ฅผ ์ƒ์„ฑํ•˜๊ธฐ ์œ„ํ•ด ๋จผ์ € ์ „ํ†ต์ ์ธ ์–ธ์–ด parser๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ caption์—์„œ ๋ช…์‚ฌ๊ตฌ๋ฅผ ์ถ”์ถœํ•œ ๋‹ค์Œ MM Grounding DINO๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๊ฐ ๊ตฌ๋ฌธ์— ๋Œ€ํ•œ bounding box๋ฅผ ์ƒ์„ฑ

์ตœ์ข… ๋ฐ์ดํ„ฐ์…‹์ธ GroundingCap-1M์€ 112๋งŒ ๊ฐœ์˜ ์ƒ˜ํ”Œ์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.

Quality Verification

๋ฐ์ดํ„ฐ ์ˆ˜์ง‘ ๊ณผ์ •์—์„œ prompt๋ฅผ ์‹ ์ค‘ํ•˜๊ฒŒ ์„ ํƒํ•˜๊ณ  ์ ‘๊ทผ ๊ฐ€๋Šฅํ•œ ์ตœ๊ณ  ๋ชจ๋ธ(Qwen2VL-72b)์„ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋ฐ์ดํ„ฐ์…‹์— ์–ด๋А ์ •๋„ ๋…ธ์ด์ฆˆ๊ฐ€ ์žˆ๋Š” ๊ฒƒ์€ ๋ถˆ๊ฐ€ํ”ผํ•˜๋ฏ€๋กœ ๋ช‡ ๊ฐ€์ง€ ํ›„์ฒ˜๋ฆฌ๋ฅผ ๋„์ž…ํ•˜์—ฌ ๋ฐ์ดํ„ฐ์…‹์„ ์ •๋ฆฌํ•ฉ๋‹ˆ๋‹ค:

์ ์šฉ ํ›„์ฒ˜๋ฆฌ ๋ฐฉ๋ฒ•

  1. Caption ๋ชจ๋ธ์ด ์ƒ์ƒ์  ๋‚ด์šฉ์„ ์„ค๋ช…ํ•˜์ง€ ์•Š๋„๋ก promptํ–ˆ์ง€๋งŒ, ๋ชจ๋ธ์ด ์—ฌ์ „ํžˆ โ€œindicatingโ€, โ€œsuggestingโ€, โ€œpossiblyโ€์™€ ๊ฐ™์€ ๋ช…๋ฐฑํ•œ ๋‹จ์–ด๋กœ ์ถœ๋ ฅํ•˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ์–ด ์ถ”์ธก์  ๋‹จ์–ด๊ฐ€ ํฌํ•จ๋œ ํ•˜์œ„ ๋ฌธ์žฅ์„ ์‚ญ์ œ
  2. ์˜๋ฏธ ์—†๋Š” caption์„ ํ•„ํ„ฐ๋งํ•˜๋Š” ๊ทœ์น™ ์„ค๊ณ„
  3. Caption์ด ์„ธ๋ถ€์‚ฌํ•ญ์œผ๋กœ ํ’๋ถ€ํ•˜๋„๋ก ๋ณด์žฅํ•˜๊ธฐ ์œ„ํ•ด ์ฒ˜์Œ ์ƒ์„ฑ๋œ caption์ด 100 token ๋ฏธ๋งŒ์ธ ์ด๋ฏธ์ง€์— ๋Œ€ํ•ด Qwen2VL-72b๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ caption์„ ์žฌ์ƒ์„ฑ

ํ›„์ฒ˜๋ฆฌ ํ›„ ๊ฐ caption์€ ํ‰๊ท  ์•ฝ 115๋‹จ์–ด๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.

  1. Large Language Model์˜ ์ง€๋„ํ•˜์— LLMDet ํ›ˆ๋ จ

์ „์ฒด ์‹œ์Šคํ…œ ๊ตฌ์กฐ

LLMDet์€ 3๊ฐœ์˜ ๋…๋ฆฝ์ ์œผ๋กœ ์‚ฌ์ „ ํ›ˆ๋ จ๋œ ๋ชจ๋“ˆ์„ ๊ฒฐํ•ฉํ•œ ์‹œ์Šคํ…œ์ž…๋‹ˆ๋‹ค:

๐Ÿ” Detector (MM Grounding DINO)

  • ์—ญํ• : ์ด๋ฏธ์ง€ โ†’ ์‹œ๊ฐ์  ํŠน์ง• ์ถ”์ถœ + ๊ฐ์ฒด ํƒ์ง€
  • ์ƒํƒœ: ์ด๋ฏธ ์™„์ „ํžˆ ํ›ˆ๋ จ๋œ ์ƒํƒœ

๐Ÿ”— Projector

  • ์—ญํ• : ์‹œ๊ฐ์  ํŠน์ง• โ†’ ์–ธ์–ด ๋ชจ๋ธ์ด ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋Š” ํ˜•ํƒœ๋กœ ๋ณ€ํ™˜
  • ์ƒํƒœ: ์ฒ˜์Œ์—๋Š” ๋ฌด์ž‘์œ„ ์ดˆ๊ธฐํ™” (ํ•™์Šต ํ•„์š”)

๐Ÿค– LLM (Large Language Model)

  • ์—ญํ• : ๋ณ€ํ™˜๋œ ์‹œ๊ฐ์  ํŠน์ง• โ†’ ์ž์—ฐ์–ด caption ์ƒ์„ฑ
  • ์ƒํƒœ: ์ด๋ฏธ ์™„์ „ํžˆ ํ›ˆ๋ จ๋œ ์ƒํƒœ

๋‹จ๊ณ„๋ณ„ ํ›ˆ๋ จ ์ „๋žต

๐Ÿ“ Step 1: Alignment Training (์ •๋ ฌ ํ•™์Šต)

๋ชฉํ‘œ: Detector์™€ LLM ์‚ฌ์ด์˜ โ€œ๋ฒˆ์—ญ๊ธฐโ€ ์—ญํ• ์„ ํ•˜๋Š” Projector ํ•™์Šต

Image → Detector (🔒 frozen) → Projector (🔄 trained) → LLM (🔒 frozen) → Caption
  • ํ•™์Šต ๋‚ด์šฉ:

    • ์ž…๋ ฅ: Detector์˜ ์ „์ฒด feature map
    • ์ถœ๋ ฅ: ์ด๋ฏธ์ง€ ์ „์ฒด์— ๋Œ€ํ•œ ์ƒ์„ธํ•œ caption
    • Loss: Language modeling loss๋งŒ ์‚ฌ์šฉ
  • ์™œ Projector๋งŒ ํ•™์Šตํ•˜๋Š”๊ฐ€?

    • Detector์™€ LLM์˜ ๊ธฐ์กด ์ง€์‹์„ ๋ณด์กด
    • ๊ณ„์‚ฐ ํšจ์œจ์„ฑ (์ž‘์€ ๋ชจ๋“ˆ๋งŒ ํ•™์Šต)
    • ํ•™์Šต ์•ˆ์ •์„ฑ ํ™•๋ณด

๐Ÿ“ Step 2: End-to-End Training (ํ†ตํ•ฉ ํ•™์Šต)

๋ชฉํ‘œ: ์ „์ฒด ์‹œ์Šคํ…œ์„ ํ•˜๋‚˜์˜ ํ†ตํ•ฉ๋œ ๊ฐ์ฒด ํƒ์ง€๊ธฐ๋กœ ๋ฐœ์ „

Image → Detector (🔄 trained) → Projector (🔄 trained) → LLM (🔄 LoRA) → Caption

๋™์‹œ์— 3๊ฐ€์ง€ ์ž‘์—… ์ˆ˜ํ–‰:

  1. ๊ธฐ์กด Grounding ์ž‘์—… (Detector ์ฃผ๋„)

    • โ€œyoung manโ€์ด๋ผ๋Š” ํ…์ŠคํŠธ โ†” ํ•ด๋‹น ์˜์—ญ ๋งค์นญ
    • Loss: Lalign+LboxL_{align} + L_{box}Lalignโ€‹+Lboxโ€‹
  2. Image-level Caption Generation (์ „์ฒด ํ˜‘๋ ฅ)

    • ์ „์ฒด ์ด๋ฏธ์ง€ โ†’ โ€œ์ด๋ฏธ์ง€์—๋Š” ๋‘ ์‚ฌ๋žŒ์ด ์ฃผ๋ฐฉ์—์„œโ€ฆโ€
    • Loss: LlmimageL_{lm}^{image}Llmimageโ€‹
  3. Region-level Caption Generation (์„ธ๋ฐ€ํ•œ ๋งค์นญ)

    • ํŠน์ • ์˜์—ญ โ†’ โ€œyoung manโ€, โ€œdishesโ€ ๋“ฑ
    • Loss: LlmregionL_{lm}^{region}Llmregionโ€‹
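
A sketch of how the LoRA part of Step 2 could be wired with the Hugging Face `peft` library; the rank, alpha, and target module names are assumptions, since the paper does not give them:

```python
from peft import LoraConfig, get_peft_model

# Low-rank adapters on the LLM's attention projections; only these adapters
# (not the full LLM) receive gradients during end-to-end training.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# llm = get_peft_model(llm, lora_cfg)  # `llm` is the pretrained language model

# Each batch then combines the four loss terms from the three tasks:
# total_loss = l_align + l_box + l_lm_image + l_lm_region
```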

์™œ ์ด๋Ÿฐ ๋ณต์žกํ•œ ๊ตฌ์กฐ๊ฐ€ ํ•„์š”ํ•œ๊ฐ€?

๐ŸŽฏ Image-level vs Region-level์˜ ์ƒํ˜ธ ๋ณด์™„

  • Image-level๋งŒ์œผ๋กœ๋Š” ๋ถ€์กฑํ•œ ์ด์œ :

    • LLM์ด โ€œdishesโ€๋ผ๊ณ  ๋งํ–ˆ์„ ๋•Œ, ์ด๋ฏธ์ง€์˜ ์–ด๋А ๋ถ€๋ถ„์ธ์ง€ ๋ชจํ˜ธํ•จ
    • ์ „์ฒด์ ์ธ ๋งฅ๋ฝ์€ ์ž˜ ์ดํ•ดํ•˜์ง€๋งŒ ์ •ํ™•ํ•œ ์œ„์น˜ ๋งคํ•‘์ด ์–ด๋ ค์›€
  • Region-level์˜ ํ•„์š”์„ฑ:

    • Object query โ†’ Cross-attention โ†’ Feature map์—์„œ ์ •๋ณด ์ˆ˜์ง‘
    • โ€œ์ด ํŠน์ • ์˜์—ญ์€ ์ •ํ™•ํžˆ โ€˜young manโ€™์ด๋‹คโ€๋ผ๋Š” ๋ช…ํ™•ํ•œ ๋งคํ•‘ ์ œ๊ณต

๐Ÿ”„ Cross-Attention์˜ ์—ญํ• 

  • ๋ฌธ์ œ: Object query ํ•˜๋‚˜๋งŒ์œผ๋กœ๋Š” ์ •๋ณด๊ฐ€ ๋ถ€์กฑ
  • ํ•ด๊ฒฐ: Cross-attention์„ ํ†ตํ•ด ์ „์ฒด feature map์—์„œ ๊ด€๋ จ ์ •๋ณด ์ˆ˜์ง‘
1
Object Query (์ œํ•œ๋œ ์ •๋ณด) + Cross-Attention โ†’ Feature Map (ํ’๋ถ€ํ•œ ์ •๋ณด) โ†’ ์ •ํ™•ํ•œ Region Caption

์ตœ์ข… Loss ํ•จ์ˆ˜

$$L_{total} = L_{align} + L_{box} + L_{lm}^{image} + L_{lm}^{region}$$

  • $L_{align} + L_{box}$: maintains the original object detection performance
  • $L_{lm}^{image}$: improves holistic context understanding
  • $L_{lm}^{region}$: improves precise region-word mapping
  5. Experiment

5.1. Implementation Details

MM Grounding DINO(MM-GDINO)๋ฅผ ๊ธฐ์ค€ ๋ชจ๋ธ๋กœ ์„ ํƒํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ์™„์ „ํžˆ ์˜คํ”ˆ์†Œ์Šค์ด๋ฉฐ SOTA ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. Pre-trained checkpoint๋ฅผ ๋‹ค์‹œ ๋กœ๋“œํ•˜๊ณ  GroundingCap-1M ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ grounding loss์™€ caption generation loss์˜ ์ง€๋„ํ•˜์— ๋ชจ๋ธ์„ fine-tuningํ•ฉ๋‹ˆ๋‹ค.

Large language model์€ LLaVA-OneVision-0.5b-ov์—์„œ ์ดˆ๊ธฐํ™”๋ฉ๋‹ˆ๋‹ค. ๋ฉ”๋ชจ๋ฆฌ์™€ ํ›ˆ๋ จ ํšจ์œจ์„ฑ์„ ์ ˆ์•ฝํ•˜๊ธฐ ์œ„ํ•ด image-level generation์˜ ์ตœ๋Œ€ token ๊ธธ์ด๋ฅผ 1600์œผ๋กœ, region-level generation์˜ ๊ฒƒ์„ 40์œผ๋กœ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€๋‹น caption generation์„ ์œ„ํ•œ ์ตœ๋Œ€ ์˜์—ญ ์ˆ˜๋Š” 16์œผ๋กœ ์ œํ•œ๋ฉ๋‹ˆ๋‹ค.

5.2. Zero-Shot Detection Transfer Ability

LVIS์—์„œ์˜ Zero-shot ์„ฑ๋Šฅ:

  • LVIS๋Š” 1203๊ฐœ ํด๋ž˜์Šค๋ฅผ ๊ฐ€์ง„ detection ๋ฐ์ดํ„ฐ์…‹์ž…๋‹ˆ๋‹ค.
  • ์ƒˆ๋กœ์šด ํ›ˆ๋ จ ๋ชฉํ‘œ์™€ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ LLMDet์€ ๋‹ค์–‘ํ•œ backbone์—์„œ LVIS minival์—์„œ ๊ธฐ์ค€์„  MM-GDINO๋ฅผ 3.3%/3.8%/14.3% AP์™€ 3.1%/3.3%/17.0% APr๋กœ ๋Šฅ๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.

ODinW์—์„œ์˜ Zero-shot ์„ฑ๋Šฅ:

  • ODinW๋Š” ๋‹ค์–‘ํ•œ ๋„๋ฉ”์ธ๊ณผ ์–ดํœ˜์— ๊ฑธ์นœ 35๊ฐœ ๋ฐ์ดํ„ฐ์…‹์˜ ๋ชจ์Œ์œผ๋กœ, open-vocabulary detection์„ ์œ„ํ•œ ๋„์ „์ ์ด๊ณ  ํฌ๊ด„์ ์ธ ๋ฒค์น˜๋งˆํฌ์ž…๋‹ˆ๋‹ค.
  • LLMDet์€ ODinW35์—์„œ ๊ฐ€์žฅ ๋†’์€ AP๋ฅผ ์–ป์–ด ๊ด‘๋ฒ”์œ„ํ•œ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ์˜ ๋›ฐ์–ด๋‚œ ์ „์ด ๋Šฅ๋ ฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

COCO-O์—์„œ์˜ Zero-shot ์„ฑ๋Šฅ:

  • COCO-O๋Š” COCO์™€ ๋™์ผํ•œ 80๊ฐœ ํด๋ž˜์Šค๋ฅผ ๊ณต์œ ํ•˜์ง€๋งŒ sketch, weather, cartoon, painting, tattoo, handmake์™€ ๊ฐ™์€ ๋‹ค๋ฅธ ๋„๋ฉ”์ธ์„ ๊ฐ€์ง„ ๋ฐ์ดํ„ฐ์…‹์ž…๋‹ˆ๋‹ค.
  • LLMDet์€ ์—ฌ์ „ํžˆ MM-GDINO๋ฅผ 2.1% AP๋กœ ๋Šฅ๊ฐ€ํ•˜์—ฌ ๋„๋ฉ”์ธ ๋ณ€ํ™”์— ๋” ๊ฐ•๊ฑดํ•จ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

Referring expression comprehension ๋ฐ์ดํ„ฐ์…‹์—์„œ์˜ Zero-shot ์„ฑ๋Šฅ:

  • Referring expression comprehension(REC)์€ ๊ตฌ๋ฌธ์œผ๋กœ ์–ธ๊ธ‰๋œ ๊ฐ์ฒด๋ฅผ ์ง€์—ญํ™”ํ•˜๋Š” ์ž‘์—…์œผ๋กœ, ํฌ๊ด„์ ์ธ ์–ธ์–ด ์ดํ•ด์™€ ์„ธ๋ฐ€ํ•œ vision-language alignment๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
  • ์ƒ์„ธํ•œ caption์„ ์‚ฌ์šฉํ•˜์—ฌ LLM๊ณผ co-trainingํ•จ์œผ๋กœ์จ LLMDet์€ ํ’๋ถ€ํ•œ ์‹œ๊ฐ์  ์„ธ๋ถ€์‚ฌํ•ญ์„ ํ’๋ถ€ํ•œ vision-language alignment๋กœ ๋ชจ๋ธ๋งํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

5.3. Ablation Study

์ฃผ์š” ๊ตฌ์„ฑ ์š”์†Œ์˜ ํšจ๊ณผ:

  • Grounding annotation๋งŒ์œผ๋กœ fine-tuningํ•˜๋ฉด ์„ฑ๋Šฅ์ด 41.4% AP์—์„œ 43.8% AP๋กœ ํ–ฅ์ƒ๋ฉ๋‹ˆ๋‹ค.
  • Region-level generation๋งŒ์œผ๋กœ๋Š” ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋˜์ง€ ์•Š๋Š”๋ฐ, ์ด๋Š” LLMDet์˜ region-level caption์ด ์˜์—ญ์˜ ํด๋ž˜์Šค ์ด๋ฆ„์ด๋‚˜ grounding phrase์ผ ๋ฟ์ด์–ด์„œ ์ถ”๊ฐ€ ์ •๋ณด๋ฅผ ์ œ๊ณตํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.
  • Image-level generation๋งŒ ์‚ฌ์šฉํ•ด๋„ ์„ฑ๋Šฅ์ด ์•ฝ๊ฐ„ ํ–ฅ์ƒ๋ฉ๋‹ˆ๋‹ค.
  • Image-level๊ณผ region-level generation์„ ๋ชจ๋‘ ๊ฒฐํ•ฉํ•˜๋ฉด LLM์˜ ์ง€๋„ ์‹ ํ˜ธ์˜ ์ด์ต์„ ์™„์ „ํžˆ ๋ฐœํœ˜ํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ƒ์„ธํ•œ caption์—์„œ ํ•™์Šต๋œ ํ’๋ถ€ํ•œ vision-language ํ‘œํ˜„์ด rare class ์ธ์‹์— ๋„์›€์ด ๋˜์–ด 3.9% APr๋ฅผ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค.

๋‹ค๋ฅธ Large Language Model์˜ ํšจ๊ณผ:

  • ๊ธฐ๋ณธ์ ์œผ๋กœ Qwen2-0.5b-instruct์—์„œ fine-tuning๋œ LLaVA-OneVision-0.5b-ov์˜ LLM์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • LLaVA-OneVision-0.5b-ov์˜ LLM์€ ํ’๋ถ€ํ•œ multi-modal ๋ฐ์ดํ„ฐ๋กœ pre-train๋˜์—ˆ์ง€๋งŒ ๋‹ค๋ฅธ vision encoder๋ฅผ ์‚ฌ์šฉํ•˜๋”๋ผ๋„ pretraining์ด ์—ฌ์ „ํžˆ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค.
  • ํŠนํžˆ rare class์—์„œ (+2.2% APr) ํ–ฅ์ƒ๋ฉ๋‹ˆ๋‹ค.

์ƒ์„ฑ๋œ Caption ํ’ˆ์งˆ์˜ ํšจ๊ณผ:

  • GroundingCap-1M์˜ caption์ด ๊ฐ€์žฅ ๋†’์€ ์ƒ์„ธ์„ฑ ์ ์ˆ˜์™€ ์ ๋‹นํ•œ ํ™˜๊ฐ์„ ๊ฐ€์ง€๊ณ  ์žˆ์–ด ๋ฐ์ดํ„ฐ์…‹์˜ ๋›ฐ์–ด๋‚œ ํ’ˆ์งˆ์„ ๊ฒ€์ฆํ•ฉ๋‹ˆ๋‹ค.
  • ์ธ๊ฐ„์ด ์ฃผ์„ํ•œ caption์€ ํ™˜๊ฐ์ด ์ ์ง€๋งŒ(0.90 vs 1.34), LLaVA caption์—์„œ๋„ ์—ฌ์ „ํžˆ ํ™˜๊ฐ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

  6. Conclusion

๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ๊ธฐ์กด open-vocabulary detector์˜ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ์ƒˆ๋กœ์šด ํ›ˆ๋ จ ๋ชฉํ‘œ๋ฅผ ํƒ๊ตฌํ–ˆ์Šต๋‹ˆ๋‹ค. Large language model์„ ํ™œ์šฉํ•˜์—ฌ image-level ์ƒ์„ธ caption๊ณผ region-level ๊ฑฐ์นœ grounding phrase๋ฅผ ๋ชจ๋‘ ์ƒ์„ฑํ•จ์œผ๋กœ์จ detector๋Š” ์ƒ์„ธํ•œ caption์œผ๋กœ๋ถ€ํ„ฐ ๋” ๋งŽ์€ ์ •๋ณด์™€ ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ ํฌ๊ด„์  ์ดํ•ด๋ฅผ ๋ฐ›๊ณ  ํ’๋ถ€ํ•œ vision-language ํ‘œํ˜„์„ ๊ตฌ์ถ•ํ•ฉ๋‹ˆ๋‹ค.

๊ฒฐ๊ณผ๋กœ ๋‚˜์˜จ detector์ธ LLMDet์€ ๊ด‘๋ฒ”์œ„ํ•œ ๋ฒค์น˜๋งˆํฌ์—์„œ ์ตœ์ฒจ๋‹จ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ ๊ฐœ์„ ๋œ LLMDet์ด ๊ฐ•๋ ฅํ•œ large multi-modal model์„ ๊ตฌ์ถ•ํ•˜์—ฌ ์ƒํ˜ธ ์ด์ต์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ๊ฐ€ ์ตœ๊ณ  ์„ฑ๋Šฅ์˜ large language model๋กœ vision model์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ํ†ต์ฐฐ์„ ์ œ๊ณตํ•˜๊ธฐ๋ฅผ ํฌ๋งํ•ฉ๋‹ˆ๋‹ค.


