[Paper Review] MM-Grounding-DINO: An Open and Comprehensive Pipeline for Unified Object Grounding and Detection

Posted by Euisuk's Dev Log on September 15, 2025

Original post: https://velog.io/@euisuk-chung/Paper-Review-MM-Groundung-DINO-An-Open-and-Comprehensive-Pipeline-for-Unified-Object-Grounding-and-Detection

https://arxiv.org/pdf/2401.02361

This review is as literal a translation of the original paper as possible. Here, “we” refers to the authors. Please keep this in mind.

Abstract

Grounding-DINO is a state-of-the-art open-set detection model that handles a variety of vision tasks, including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC).

  • Because of its effectiveness, it has been widely adopted as the mainstream architecture for a variety of downstream applications.
  • Despite its importance, however, the original Grounding-DINO model lacks comprehensive public technical details because its training code has not been released.

Image from the Grounding-DINO paper

To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly pipeline built with the MMDetection toolbox.

(Note) MMDetection is an open-source object detection toolbox, a computer vision library developed mainly on top of PyTorch. It is a framework that makes it easy to train, evaluate, and deploy a variety of object detection algorithms.

The model adopts abundant vision datasets for pre-training and a variety of detection and grounding datasets for fine-tuning. We provide a comprehensive analysis of each reported result, along with detailed settings for reproduction.

  • Extensive experiments on the benchmarks mentioned show that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We release all of our models to the research community.

https://github.com/open-mmlab/mmdetection/tree/main/configs/mm_grounding_dino

  1. Introduction

The object detection task typically involves feeding an image into a model to obtain proposals and then matching them with text through multi-modal alignment, which makes it a key component of most state-of-the-art multi-modal understanding architectures.

Currently, object detection can be subdivided into three sub-tasks according to the type of input text: Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC).

  • Following the zero-shot setting, Open-Vocabulary Detection (OVD) models are trained on base categories but must predict both base and novel categories within a large language vocabulary.
  • The Phrase Grounding (PG) task takes as input not just a category but a phrase that describes all the candidate categories, and outputs the corresponding boxes.
  • The main goal of the Referring Expression Comprehension (REC) task is to accurately identify the target designated by a given textual description and then mark its position with a bounding box.

In recent years, numerous vision grounding and detection models have been explored to solve the tasks above.

  • Among these grounding models, Grounding-DINO has become the mainstream architecture thanks to its superior performance.
  • Built on the closed-set detector DINO, Grounding-DINO-Large achieves state-of-the-art zero-shot performance on COCO (mAP 52.5) without any COCO training data.

Grounding-DINO integrates the vision and language modalities at multiple stages, including the feature enhancer, the query selection module, and the decoder.

  • This deep fusion approach significantly improves object detection in the open-set context, and the DETR-based structure makes it an end-to-end network without any hard-coded modules.

  2. Approach

This section introduces the model and the datasets in detail. Unless otherwise specified, MM-G denotes MM-Grounding-DINO and G-DINO denotes Grounding-DINO.

2.1 Model

As mentioned, our model is based on Grounding-DINO and is almost unchanged. Our framework is shown in Figure 3. Given images of shape [Batchsize, 3, H, W] and text descriptions, our model can align the descriptions with the corresponding generated bounding boxes.

The components of the model are as follows:

  • A text backbone for extracting text features
  • An image backbone for extracting image features
  • A feature enhancer that deeply fuses image and text features
  • A language-guided query selection module for query initialization
  • A cross-modality decoder for box refinement

Feature Extraction and Fusion:

  • Given an image-text pair, we use the image backbone to extract image features at multiple scales and, at the same time, the text backbone to extract text features.
  • Both sets of features are then fed into the feature enhancer module for cross-modality fusion.

Language-Guided Query Selection:

  • To make the best use of text as guidance for object detection, Grounding-DINO designed a language-guided query selection module.
  • This module selects num_query proposals as decoder queries based on their cosine similarity with the input text features.
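
The selection step described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: features are plain Python lists rather than batched tensors, each image token is scored by its best cosine match over all text tokens, and the names `cosine` and `select_queries` are hypothetical helpers, not part of Grounding-DINO or MMDetection.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_queries(image_feats, text_feats, num_query):
    """Return indices of the num_query image tokens most similar to any text token."""
    # Score each image token by its best match over all text tokens.
    scores = [max(cosine(img, txt) for txt in text_feats) for img in image_feats]
    # Rank tokens by descending score; sorted() is stable, so ties keep index order.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return ranked[:num_query]

# Toy example: 4 image tokens and 2 text tokens in a 2-D feature space.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
text_feats = [[1.0, 0.1], [0.9, 0.0]]
selected = select_queries(image_feats, text_feats, num_query=2)
print(selected)
```

In this toy input, the tokens at indices 0 and 2 point most nearly in the same direction as the text tokens, so they are the ones kept as decoder queries.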

Cross-modality Decoder:

  • The cross-modality decoder layer of Grounding-DINO is designed to further combine text and image features for cross-modality learning.
  • After self-attention, the architecture applies an image cross-attention layer, a text cross-attention layer, and an FFN layer, in that order.

์ฐจ์ด์ :

MM-G์™€ G-DINO์˜ ์ฃผ์š” ์ฐจ์ด์ ์€ contrastive embedding module์— ์žˆ์Šต๋‹ˆ๋‹ค.

  • CLIP์—์„œ ์˜๊ฐ์„ ๋ฐ›์•„, contrastive embedding module์„ ์ดˆ๊ธฐํ™”ํ•  ๋•Œ bias๋ฅผ ์ถ”๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค.
  • ์ด๋Š” ์ดˆ๊ธฐ loss ๊ฐ’์„ ํฌ๊ฒŒ ์ค„์ด๊ณ  ๋ชจ๋ธ์˜ ์ˆ˜๋ ด์„ ๊ฐ€์†ํ™”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
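
A small calculation shows why such a bias helps. The review does not state the bias value, so the sketch below assumes a prior probability of 0.01 for a positive match, a common choice for focal-loss-style initialization, and uses a plain binary cross-entropy on a single negative region-text pair. Most region-token pairs are negatives at the start of training, so this is the case that dominates the initial loss.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_negative(logit):
    """Binary cross-entropy for a negative (label 0) region-text pair."""
    return -math.log(1.0 - sigmoid(logit))

# Assumed prior probability of a positive match at initialization (not from the paper).
prior = 0.01
bias = -math.log((1.0 - prior) / prior)

loss_without_bias = bce_negative(0.0)       # similarity logit starts near 0
loss_with_bias = bce_negative(0.0 + bias)   # same logit shifted by the bias

print(f"bias = {bias:.3f}")
print(f"loss without bias = {loss_without_bias:.4f}")
print(f"loss with bias    = {loss_with_bias:.4f}")
```

Without the bias, a logit near zero means a predicted probability of 0.5 and a loss of ln 2 per negative pair. With the bias, the predicted probability starts at the assumed prior and the per-pair loss is far smaller.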

2.2 Dataset Preparation

Our data format is inspired by that of Open Grounding-DINO and adapted to the MMDetection format. Because MM-Grounding-DINO is designed to handle three tasks using datasets with different kinds of annotations, we divided the 15 datasets we used into three groups.

OVD datasets:

  • The training datasets include COCO, Objects365V1, Objects365V2, V3Det, and Open-Images; the evaluation datasets include COCO, LVIS, and ODinW12/35.

PG datasets:

  • The training datasets include GQA, GRIT, and Flickr30K Entities; Flickr30K Entities is also used for evaluation.

REC datasets:

  • The training datasets include RefCOCO, RefCOCO+, and RefCOCOg. For evaluation, we use a broader set of datasets that includes RefCOCO, RefCOCO+, RefCOCOg, gRefCOCO, and the Description Detection Dataset (D³).
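
For reference, the grouping above can be written down as a small lookup table. The dictionary layout and the name `DATASETS` are illustrative only, and counting ODinW12/35 as a single dataset is an assumption; under that assumption the number of distinct datasets matches the 15 stated above.

```python
# Dataset groups as listed in the text; the structure itself is illustrative.
DATASETS = {
    "OVD": {
        "train": ["COCO", "Objects365V1", "Objects365V2", "V3Det", "Open-Images"],
        "eval": ["COCO", "LVIS", "ODinW"],  # ODinW12/35 counted once (assumption)
    },
    "PG": {
        "train": ["GQA", "GRIT", "Flickr30K Entities"],
        "eval": ["Flickr30K Entities"],
    },
    "REC": {
        "train": ["RefCOCO", "RefCOCO+", "RefCOCOg"],
        "eval": ["RefCOCO", "RefCOCO+", "RefCOCOg", "gRefCOCO", "D³"],
    },
}

# Collect every dataset name once, regardless of task or split.
unique = {
    name
    for group in DATASETS.values()
    for split in group.values()
    for name in split
}
print(len(unique))
```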

2.3 Training Settings

Text input rules:

  • For OVD training, we concatenate all the categories of the detection dataset into one long string, such as “People. Ball. Racket. Cat.”.
  • For the PG and REC tasks, following M-DETR, we annotate every object mentioned in the text during the pre-training stage, which slightly changes how the model is applied to these tasks.
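
The OVD text rule can be illustrated with a short sketch. The function names are hypothetical, and the character-span helper is an addition for illustration (it shows one way a category can be tied back to its position in the prompt); neither is taken from the MM-Grounding-DINO code.

```python
def build_ovd_prompt(categories):
    """Join category names into one period-separated text prompt."""
    return " ".join(f"{name}." for name in categories)

def category_spans(categories):
    """Character span (start, end) of each category name inside the prompt."""
    spans, cursor = [], 0
    for name in categories:
        spans.append((cursor, cursor + len(name)))
        cursor += len(name) + 2  # skip the trailing ". " separator
    return spans

categories = ["People", "Ball", "Racket", "Cat"]
prompt = build_ovd_prompt(categories)
spans = category_spans(categories)
print(prompt)
print(spans)
```

Slicing the prompt with each span returns the original category name, which is what lets a box label be matched against a specific part of the text.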

๋ชจ๋ธ ์ข…๋ฅ˜:

  • Grounding-DINO์™€ ์œ ์‚ฌํ•˜๊ฒŒ, ์–ธ์–ด ์ธ์ฝ”๋”๋กœ ์ž˜ ์‚ฌ์ „ ํ›ˆ๋ จ๋œ BERT-based-uncased ๋ชจ๋ธ์„, ์ด๋ฏธ์ง€ backbone์œผ๋กœ Swin Transformer๋ฅผ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค.

๋ฐ์ดํ„ฐ ์ฆ๊ฐ•:

  • ๋žœ๋ค ๋ฆฌ์‚ฌ์ด์ฆˆ, ๋žœ๋ค ํด๋ฆฝ, ๋žœ๋ค ํ”Œ๋ฆฝ ์™ธ์—๋„ ๋ฐ์ดํ„ฐ ์ฆ๊ฐ•์—์„œ ๋žœ๋ค negative sample์„ ๋„์ž…ํ•ฉ๋‹ˆ๋‹ค.
  • ๋‹ค๋ฅธ ์ด๋ฏธ์ง€์—์„œ ๋žœ๋คํ•˜๊ฒŒ ์ƒ˜ํ”Œ๋ง๋œ ์นดํ…Œ๊ณ ๋ฆฌ๋‚˜ ํ…์ŠคํŠธ ์„ค๋ช…์„ negative ์˜ˆ์ œ๋กœ ground-truth ์„ค๋ช…๊ณผ ํ•จ๊ป˜ positive ์˜ˆ์ œ๋กœ ์—ฐ๊ฒฐํ•ฉ๋‹ˆ๋‹ค.
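
The negative-sampling idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the number of negatives, and the shuffling of the final order are all assumptions made for the example.

```python
import random

def add_negative_samples(positives, candidate_pool, num_negatives, rng):
    """Mix ground-truth categories with randomly drawn negatives and build a prompt."""
    # Negatives are drawn only from categories that are absent from this image.
    pool = [c for c in candidate_pool if c not in positives]
    negatives = rng.sample(pool, min(num_negatives, len(pool)))
    names = positives + negatives
    rng.shuffle(names)  # assumption: avoid positives always appearing first
    labels = {name: (name in positives) for name in names}
    prompt = " ".join(f"{name}." for name in names)
    return prompt, labels

rng = random.Random(0)  # fixed seed so the example is repeatable
prompt, labels = add_negative_samples(
    positives=["cat", "ball"],
    candidate_pool=["cat", "dog", "racket", "car", "ball", "tree"],
    num_negatives=2,
    rng=rng,
)
print(prompt)
print(labels)
```

The resulting prompt contains both categories that appear in the image and categories that do not, so the model also has to learn to leave some phrases unmatched.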

Computing resources:

  • We trained MM-G-Tiny on 32 NVIDIA 3090 GPUs for 30 epochs with a total batch size of 128.

  3. Main Results

3.1 Zero-shot Transfer

In the zero-shot setting, MM-G models are first trained on base datasets and then evaluated on novel datasets.

COCO benchmark:

  • We evaluated MM-Grounding-DINO pre-trained on the O365 dataset and on other PG/REC datasets.
  • Following Grounding-DINO, we used the COCO dataset to establish a zero-shot learning baseline.
  • Table 3 compares MM-Grounding-DINO-Tiny with Grounding-DINO-Tiny.

The results show that even MM-G(a) (mAP 48.5), trained only on O365, can outperform G-DINO(c) (mAP 48.4), which was trained on O365, Gold-G, and Cap4M.

  • MM-G-T(c), trained on Objects365, Gold-G, and GRIT, reaches 50.5 mAP, a 2.1 AP improvement over G-DINO(c) on the COCO benchmark.

LVIS benchmark:

  • LVIS is a long-tail detection dataset that contains more than 1,000 distinct categories for evaluation.
  • Table 4 compares MM-Grounding-DINO-Tiny with Grounding-DINO-Tiny.
  • We observed that MM-G(a), trained on O365 and GoldG without Cap4M, surpasses G-DINO(c) by +6.9 AP on both LVIS MiniVal and Val.

ODinW benchmark:

  • The ODinW (Object Detection in the Wild) benchmark is a more rigorous benchmark designed to evaluate model performance in real-world settings.
  • Our MM-G-T(c3) achieves 53.3 mAP on ODinW13 and 28.4 mAP on ODinW35, outperforming G-DINO-T(c).

RefCOCO/+/g and gRefCOCO benchmarks:

  • We also evaluated the zero-shot ability of MM-G on the REC task. RefCOCO, RefCOCO+, and RefCOCOg are established benchmarks for REC evaluation, and the results are shown in Table 5.
  • Our model can outperform the baseline on all zero-shot evaluation metrics on RefCOCO.

3.2 Analysis of GRIT

GRIT is a large-scale dataset used as a substitute for Cap4M, which is not open source. As the results above show, however, the performance obtained with GRIT falls short of expectations. After inspecting the images and annotations in GRIT, we can list the main reasons as follows:

  • GRIT's text annotations come from phrases or sentences extracted with spaCy from the captions of COYO-700M and LAION-2B. They include many abstract phrases, such as human names, events, facilities, and geopolitical entities, which can mislead the model.
  • Most images in GRIT come with a single annotation. That annotation consists of a long sentence, which is in fact the full caption of the image, and a noisy box that spans nearly the entire image.

3.3 Validation through Fine-tuning

The default fine-tuning in this report is based on the MM-G-T(c3) pre-trained model.

Fine-tuning on COCO/LVIS: To thoroughly evaluate the capabilities of MM-Grounding-DINO, we implemented three main fine-tuning approaches: close-set fine-tuning, open-set continuing pretraining fine-tuning, and open-vocabulary fine-tuning.

As shown in Table 10, MM-G-T improves substantially on the COCO dataset with both close-set fine-tuning and open-set continuing pretraining fine-tuning.

In particular, after 12 epochs of close-set fine-tuning, MM-G-T gains 7.8 mAP and reaches 58.2 mAP.

Fine-tuning on downstream tasks: To comprehensively demonstrate the generalizability of MM-Grounding-DINO, we extended the evaluation to a variety of downstream tasks.

  • Object detection in fog: We used the Real-world Task-driven Testing Set (RTTS), on which MM-Grounding-DINO reaches 69.1 AP after 12 epochs of fine-tuning.
  • Underwater object detection: We evaluated on the Real-world Underwater Object Detection dataset (RUOD), where 12 epochs of fine-tuning yield a 35.7 mAP improvement and set a new benchmark.
  • Brain tumor object detection: We evaluated on the Brain tumor dataset, which uses a distinctive labeling approach that relies solely on numerical identifiers without descriptive label information.
  • Cityscapes object detection: Cityscapes is an extensive collection of urban street scenes that contains stereo video sequences captured on the streets of 50 cities.

  4. Conclusion

In this paper, we proposed MM-Grounding-DINO, a comprehensive and open-source grounding baseline that is based on Grounding-DINO and pre-trained on abundant vision datasets. It comprehensively addresses the OVD, PG, and REC tasks. We extended all available benchmarks for OVD, PG, and REC evaluation, and all evaluation metrics are readily available in MMDetection.

Extensive experiments on the benchmarks mentioned show that our MM-Grounding-DINO outperforms (or is on par with) the Grounding-DINO baseline. We hope that our pipeline will serve as a valuable resource for further research on grounding and detection tasks.


