[Paper Review] OmDet-Turbo: Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head

Posted by Euisuk's Dev Log on September 14, 2025

์›๋ณธ ๊ฒŒ์‹œ๊ธ€: https://velog.io/@euisuk-chung/Paper-Review-OmDetTurbo-Real-time-Transformer-๊ธฐ๋ฐ˜-Efficient-Fusion-Head๋ฅผ-ํ™œ์šฉํ•œ-Open-Vocabulary-Detection

https://arxiv.org/pdf/2403.06892

Abstract

End-to-end transformer-based detectors (DETRs) have shown strong performance on both closed-set and open-vocabulary object detection (OVD) tasks by integrating the language modality. However, their high computational requirements have limited their practical use in real-time object detection (OD) scenarios.

This paper closely analyzes the limitations of OmDet and Grounding-DINO, the two leading models on the OVDEval benchmark, and introduces OmDet-Turbo.

  • OmDet-Turbo is a new transformer-based real-time OVD model built around the Efficient Fusion Head (EFH) module, which is designed to relieve the bottlenecks observed in OmDet and Grounding-DINO.

Notably, OmDet-Turbo-Base reaches 100.2 frames per second (FPS) with TensorRT and the language cache technique applied.

In zero-shot scenarios on the COCO and LVIS datasets, OmDet-Turbo performs nearly on par with current state-of-the-art supervised models. It also sets new state-of-the-art results on ODinW and OVDEval, with 30.1 AP and 26.86 NMS-AP respectively.

Its strong benchmark performance and fast inference underline OmDet-Turbo's practicality for industrial applications and make it an attractive choice for real-time object detection.

Keywords:
Multi-Dataset Pre-Training, Zero/Few-Shot Detection, Task-Conditioned Detection, Deep Fusion Mechanism, language-aware object detection, Continual Learning

  1. Introduction

Object Detection (OD) is a fundamental computer vision task that has advanced considerably through the integration of various deep neural networks. Traditional closed-set OD methods have gradually stabilized after extensive research, which has focused mainly on two directions: improving detector architectures for higher accuracy, and building real-time detectors with faster inference.

In terms of stages, the best-known methods are two-stage approaches such as Faster R-CNN and one-stage approaches such as the YOLO series.

Faster R-CNN - two-stage

YOLO series - one-stage

In terms of anchors, research has moved from anchor-based methods to anchor-free methods. In terms of model structure, CNN-based detectors and Transformer-based detectors (DETRs) are the two main OD architectures.

(Note) Anchor-based vs. anchor-free methods

  • Anchor-based methods: object locations are predicted relative to predefined "anchor boxes" of various sizes and aspect ratios. Representative models:
    • Faster R-CNN
    • YOLOv2~YOLOv4
  • Anchor-free methods: the object's center point or bounding box is predicted directly, without anchors. These methods have drawn attention in recent work for their accuracy and efficiency. Representative models:
    • CornerNet
    • DETR

For reference:

Category          | Anchor-based                                       | Anchor-free
CNN-based         | Faster R-CNN, SSD, YOLOv3                          | FCOS, CenterNet, CornerNet
Transformer-based | Some, e.g. Deformable DETR (anchor-like structure) | DETR, DINO-DETR

While most existing methods and real-time detectors are CNN-based, interest in DETRs has surged recently because of their concise pipeline and end-to-end approach. Several newer DETR methods such as DINO and Co-DETR have achieved impressive results, reaching state-of-the-art performance on the COCO dataset with 63.2 and 65.9 AP respectively. In addition, RT-DETR uses a hybrid encoder to achieve real-time object detection that surpasses CNN-based detectors.

However, traditional OD models share a common limitation: they cannot generalize to objects outside their training vocabulary. This limitation makes it hard to meet the needs of real-world applications and industrial settings.

A newer research direction in object detection is open-vocabulary object detection (OVD), which aims to detect objects beyond the scope of the training data by integrating language information to guide the detector.

Most current OVD models are built by integrating the language modality into a closed-set detector.

Among the top-performing OVD models on the OVDEval benchmark, OmDet adopts the Sparse-RCNN structure and uses a Multimodal Detection Network (MDN) to fuse latent queries in a recursive head.

OmDet

(Note) Sparse R-CNN is an end-to-end detector that detects objects using a fixed number of learnable object queries.

(Note) MDN (Multimodal Detection Network) is a module that fuses text and image information.

  • Grounding-DINO, by contrast, adopts the DETR structure and integrates fusion mechanisms at the neck, head, and query initialization stages to strengthen its multimodal capability.

Grounding-DINO

Despite these advances, the high computational complexity and long inference time of existing OVD models hinder their real-world deployment in commercial applications.

  2. Related Work

2.1 Transformer-based Detection

Transformer-based object detection methods have received considerable attention in recent years. These approaches use the power of transformers to capture long-range dependencies and contextual information in visual data. DETR is one of the representative methods of this kind.

DETR uses a transformer encoder-decoder architecture together with a set-based global loss that enforces unique predictions through bipartite matching. Since DETR was released, researchers have spent a long time optimizing DETR-like model structures from many angles.
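To make the bipartite matching idea concrete, the following is a minimal sketch that assigns each ground-truth box to a distinct prediction so that the total cost is smallest. The helper names `l1_cost` and `match`, the pure L1 box cost, and the brute-force search over permutations are assumptions for illustration only; DETR itself solves this assignment with the Hungarian algorithm and a cost that also includes a classification term.

```python
from itertools import permutations

def l1_cost(pred, gt):
    # L1 distance between two boxes given as equal-length coordinate tuples.
    return sum(abs(p - g) for p, g in zip(pred, gt))

def match(preds, gts):
    """Brute-force one-to-one matching (assumes len(preds) >= len(gts)).

    Returns (assignment, cost) where assignment[j] is the index of the
    prediction matched to ground-truth box j.
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(l1_cost(preds[i], gts[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost

# Three predictions, two ground-truth boxes (toy values).
preds = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1), (0.8, 0.8, 0.3, 0.3)]
gts = [(0.12, 0.1, 0.1, 0.1), (0.5, 0.52, 0.2, 0.2)]
assignment, cost = match(preds, gts)
print(assignment, round(cost, 4))
```

Because every ground-truth box is paired with a different prediction, duplicate predictions for the same object are penalized, which is why this family of detectors does not need a separate NMS step. The unmatched prediction (index 2 here) would be trained as "no object".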

Recent advances in DETR-like models such as Sparse-RCNN, DN-DETR, DINO, and RT-DETR have shown promise in improving object detection and transferability to downstream detection tasks. However, these DETR-like models are traditional closed-set object detection algorithms, which greatly limits both their application scenarios and the potential that transformers could otherwise offer.

2.2 Open-Vocabulary Object Detection

Open-vocabulary object detection (OVD) differs from traditional object detection algorithms in that users are not restricted to predefined target categories and can identify objects using target categories defined in natural language.

Traditional object detection systems are trained on datasets with a fixed set of object classes, which limits their ability to generalize. Early OVD algorithms such as OVR-CNN were trained on datasets that combine bounding box annotations for a fixed set of categories with image-caption pairs covering a wider range of categories.

To address this limitation, recent work has explored vision-and-language approaches such as CLIP and ALIGN, which perform cross-modal contrastive learning on large-scale image-text pairs. Inspired by CLIP and ALIGN, ViLD built a two-stage object detection model with strong zero-shot object recognition ability.

GLIP offers another solution by reformulating object detection as a phrase grounding problem, which makes it possible to use grounding data and large-scale image-text paired data. OmDet, inspired by Sparse-RCNN, treats natural language as a unified way of representing knowledge.

Grounding DINO combines DINO, a transformer-based detection algorithm, with grounding task pre-training, and greatly extends the practicality of OVD models by supporting user input of target expressions with attributes.

Recent work such as CORA and BARON has tried to tackle persistent challenges in current OVD methods. However, OVD models remain limited by their scale and complexity, which make real-time inference difficult. This hinders broad adoption in real-world applications and is a problem that urgently needs solving.

2.3 Real-time Object Detection

Object detection models such as the YOLO series, EfficientDet, and RT-DETR achieve fast inference by using a one-stage architecture, which greatly reduces computational complexity at inference time.

YOLO-World is an attempt at real-time open-vocabulary object detection. It inherits the representative one-stage structure of the YOLO algorithms, embeds the input text with the CLIP text encoder, and then fuses text features with image features through a specially designed re-parameterizable Vision-Language Path Aggregation Network. YOLO-World achieves the goal of real-time recognition, but it is a CNN-based method and its generalization accuracy lags behind transformer-based models.

This paper presents the first real-time, transformer-based, end-to-end OVD method that achieves both strong performance and efficiency on open-vocabulary object detection.

  3. Proposed Method: OmDet-Turbo

3.1 Model Architecture

OmDet-Turbo consists of a text backbone ($T$), an image backbone ($I$), and the Efficient Fusion Head (EFH) module.

The model takes three inputs: a prompt describing the task, a set of object labels, and the image to run detection on.

To improve flexibility and performance, OmDet-Turbo, unlike GLIP, separates the prompt and label encoders so that they do not share the same text embedding.

  • A label usually names an object to detect, whereas a prompt can take many forms: a combination of object names, a question from a Visual Question Answering (VQA) task, or a generic instruction such as "detect all objects" in large-vocabulary detection scenarios.

(Note) GLIP uses a text encoder (e.g., from the BERT family) to merge the prompt and labels into a single sequence and encode them together.

In OmDet-Turbo, the text backbone $T$ is a CLIP-like transformer language model that encodes the text inputs, a prompt $p$ and $K$ labels $L = [l_1, \dots, l_K]$, into label embeddings and a prompt embedding.

  • For each label, the [cls] token of the text backbone output is used as the sentence-level label embedding $e(l)$.
  • For the prompt embedding $e(p)$, the token-level embeddings output by the text backbone $T$ are used instead of a sentence-level embedding, in order to preserve fine-grained information about the prompt.
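The difference between the two embeddings is mostly one of shape, which a small sketch can show. Everything here is hypothetical: `label_embedding`, `prompt_embedding`, the `cls_index` parameter, and the toy 2-dimensional token vectors stand in for the real text backbone output, and where the summary token sits depends on the actual encoder.

```python
def label_embedding(token_embeddings, cls_index=0):
    # Sentence-level embedding: a single vector per label, taken from the
    # summary ([cls]) token. The position of that token is an assumption.
    return list(token_embeddings[cls_index])

def prompt_embedding(token_embeddings):
    # Token-level embedding: keep one vector per prompt token so that
    # fine-grained information about the prompt is preserved.
    return [list(tok) for tok in token_embeddings]

# Toy text-backbone output: 3 tokens, each a 2-dimensional vector.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

e_l = label_embedding(tokens)    # shape (D,)   -> one vector
e_p = prompt_embedding(tokens)   # shape (T, D) -> one vector per token
print(len(e_l), len(e_p), len(e_p[0]))
```

With $K$ labels the label embeddings stack into a $K \times D$ matrix, while the prompt embedding keeps its sequence dimension and can be attended over token by token in the decoder.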

The image backbone $I$ takes the pixel data of the input image and extracts a multi-scale image feature pyramid $\{P_3, P_4, P_5\}$.

The paper then introduces the EFH, which has two core components: the Efficient Language-Aware Encoder (ELA-Encoder) and the Efficient Language-Aware Decoder (ELA-Decoder).

ELA-Encoder

Efficient Language-Aware Encoder (ELA-Encoder): this encoder uses the multi-scale feature maps from the visual backbone to efficiently predict query proposals relevant to the prompt.

Following the efficient hybrid encoder introduced in RT-DETR, $P_5$, the last level of the multi-scale image backbone features, is encoded by a multi-layer Multi-head Self-Attention (MHSA) module to obtain $F_5$.

Next, the Cross-scale Feature-fusion Module (CCFM) uses convolutional layers, in a manner similar to PANet, to fuse the top-level features from $F_5$ into $P_4$ and $P_3$.

This approach improves speed by 35% while maintaining accuracy. Given the output $O$ of the encoder above, the top-K encoder features are selected as the decoder's initial object position queries. Label embeddings are introduced to make this selection process language-aware.
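One way to picture language-aware selection is to score every encoder feature against the label embeddings and keep the best-scoring ones. The sketch below assumes the score of a feature is its maximum dot-product similarity over all labels; that scoring rule, the function name `select_queries`, and the toy 2-dimensional vectors are illustrative assumptions rather than the paper's exact formulation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_queries(encoder_features, label_embeddings, k):
    """Return (indices, scores): the k encoder features most similar to any label."""
    # Score each feature by its best match among the label embeddings.
    scores = [max(dot(f, e) for e in label_embeddings) for f in encoder_features]
    # Rank feature indices from highest to lowest score and keep the top k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores

# Toy encoder output O (4 positions) and 2 label embeddings.
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
labels = [[1.0, 0.0], [0.0, 0.2]]
top, scores = select_queries(features, labels, k=2)
print(top, scores)
```

Because the ranking depends on the label embeddings, changing the label set changes which image positions become initial queries, which is what makes the selection language-aware.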

ELA-Decoder

Efficient Language-Aware Decoder (ELA-Decoder): this module simplifies the fusion of visual and language features, and supports multi-task learning and open-vocabulary detection (OVD).

Unlike Grounding DINO, which fuses image and text features by applying image cross-attention and text cross-attention in sequence, the ELA-Decoder simplifies the vision-language fusion process.

In each decoder layer, the query features and prompt features are first concatenated, and multi-head self-attention is applied so that they can interact. The query features then attend to the visual features through deformable attention.
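The concatenate-then-attend step can be sketched in a few lines. This is a deliberately simplified illustration: it uses a single attention head with identity projections (no learned weights), omits the subsequent deformable attention over image features, and the names `self_attention` and `fuse` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(logits)
        # Each output vector is a weighted average of all token vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def fuse(query_feats, prompt_feats):
    # Concatenate queries and prompt tokens into one sequence so that a single
    # self-attention pass lets them interact, then split the result back.
    joint = query_feats + prompt_feats
    attended = self_attention(joint)
    n = len(query_feats)
    return attended[:n], attended[n:]

queries = [[1.0, 0.0], [0.0, 1.0]]             # 2 object queries
prompt = [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]]  # 3 prompt tokens
fused_queries, fused_prompt = fuse(queries, prompt)
print(len(fused_queries), len(fused_prompt))
```

The point of the design is that one self-attention pass over the joint sequence replaces the separate image and text cross-attention stages used for fusion in Grounding DINO.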

Decoupled Prompt and Label Encoders

Unlike previous grounding-based OVD methods, OmDet-Turbo does not concatenate all classes directly into the prompt for object detection. Instead, the prompt of the detection task and the classes are encoded separately. Separating the prompt from the labels makes the detection prompt more flexible.

A more flexible prompt makes techniques such as the language cache and multi-task learning easy to apply. It also makes it possible to train on datasets with larger vocabularies.

Language Cache

Because no multi-modal feature interaction takes place during visual and text feature extraction, the image backbone and text backbone are completely independent. As a result, the text embeddings of the target labels and the target detection prompt can be extracted in advance at test and deployment time and kept in host memory or GPU memory.
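Since the text side does not depend on the image, a language cache can be as simple as memoizing the text encoder. The sketch below is an assumption-laden illustration: `LanguageCache` and `toy_text_encoder` are hypothetical names, and the toy encoder merely stands in for the real text backbone.

```python
class LanguageCache:
    """Memoize text embeddings so each distinct string is encoded only once."""

    def __init__(self, encode_fn):
        self._encode = encode_fn
        self._store = {}
        self.misses = 0  # number of real encoder calls

    def get(self, text):
        if text not in self._store:
            self.misses += 1
            self._store[text] = self._encode(text)
        return self._store[text]

def toy_text_encoder(text):
    # Stand-in for the text backbone: returns a fake 2-dimensional embedding.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]

cache = LanguageCache(toy_text_encoder)
labels = ["person", "dog"]
prompt = "detect person, dog"

for _frame in range(100):  # e.g. 100 video frames with the same targets
    label_embs = [cache.get(label) for label in labels]
    prompt_emb = cache.get(prompt)

print(cache.misses)  # encoder calls actually made
```

In a stream where the target labels and prompt stay fixed, the text backbone runs once per distinct string instead of once per frame, so per-frame cost is dominated by the image backbone and the fusion head.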

3.2 Model Training

Multi-task Learning

Previous work on open-vocabulary detection was largely limited to the detection task because it did not break out of the fixed mindset of object detection.

Decoupling classes from tasks makes it convenient to perform other tasks such as grounding, object detection (OD), visual question answering (VQA), and human-object interaction (HOI), and to use the datasets of those tasks for pre-training.

  • Grounding:
    Links text to a specific region of the image (e.g., "the red car" → its location in the image)
  • Object Detection (OD):
    Predicts object locations and classes in an image
  • Visual Question Answering (VQA):
    Answers questions about an image
  • Human-Object Interaction (HOI):
    Recognizes relationships between people and objects (e.g., "a person is holding a cup")

Scaling to Large Vocabularies

Because the task is decoupled from and independent of the classes, the approach can be applied to large-vocabulary datasets. When there are many target classes to detect, concatenating them all into a single task produces a very long task string. Encoding that string in a transformer-based language model then takes time that grows quadratically with its length.

With the decoupled-task approach, a flexible expression such as "detect all objects in the image" serves as the instruction to the detection model. This avoids both the excessively long task and the resulting quadratic growth in language-model encoding time.
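A back-of-the-envelope count shows why concatenation scales badly. The sketch assumes self-attention cost is proportional to the square of the sequence length and ignores padding, special tokens, and any fixed context limit of a real tokenizer; the function names and the figures of 1000 labels at 4 tokens each are hypothetical.

```python
def attention_pairs(seq_len):
    # Self-attention compares every token with every other token.
    return seq_len * seq_len

def concatenated_cost(num_labels, tokens_per_label):
    # All labels joined into one long sequence and encoded together.
    return attention_pairs(num_labels * tokens_per_label)

def separate_cost(num_labels, tokens_per_label):
    # Each label encoded on its own as a short sequence.
    return num_labels * attention_pairs(tokens_per_label)

K, t = 1000, 4  # hypothetical vocabulary size and tokens per label
joint = concatenated_cost(K, t)
split = separate_cost(K, t)
print(joint, split, joint // split)
```

Under these assumptions the concatenated sequence costs $(Kt)^2$ token pairs against $K t^2$ for separate encoding, a factor of $K$, and the separately encoded labels can additionally be cached.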

Training Method

Throughout training, consistent with DINO, multiple denoising groups are used to speed up model convergence and improve model precision. As in other DETR-based models, L1 loss and GIoU loss are used as the main loss functions for the detection task in both the reconstruction and prediction stages.
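For readers unfamiliar with the box regression terms, here is a minimal sketch of an L1 plus GIoU loss for a single matched pair of boxes. The corner format `(x1, y1, x2, y2)`, the helper names, and the weights 5.0 and 2.0 are assumptions for illustration (they mirror commonly used DETR defaults, not values stated in this review).

```python
def giou(a, b):
    """Generalized IoU of two boxes in (x1, y1, x2, y2) format.

    Assumes both boxes have positive area.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (hull - union) / hull

def box_loss(pred, target, l1_weight=5.0, giou_weight=2.0):
    # Weighted sum of coordinate-wise L1 distance and (1 - GIoU).
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return l1_weight * l1 + giou_weight * (1.0 - giou(pred, target))

same = box_loss((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0))
apart = giou((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
print(same, apart)
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes: it becomes more negative as the boxes move apart, so the loss still provides a useful training signal when the prediction misses the target entirely.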

For classification, however, focal loss is not applied directly to the dot product of each query embedding and the text features. Instead, to keep the classification and localization of positive samples consistent, the model adopts IoU-aware Query Selection, which has proven effective in RT-DETR.

  4. Experiments

4.1 Experimental Setup

Pre-training Data

Pre-training of OmDet-Turbo-Base uses datasets from a variety of computer vision tasks through a multi-task learning approach. To improve localization ability, the O365 object detection dataset is included. For the grounding task, the GoldG dataset is chosen, which earlier work has shown to be effective for improving generalization.

To improve the model's human-object interaction (HOI) ability, the HAKE dataset and a refined version of HOI-A are used. In the refined HOI-A, incorrect annotations that use unreasonable triplet combinations were removed. In addition, the PhraseCut dataset, which is dedicated to phrase grounding, is included to improve the alignment between regions and text inside the model.

Implementation Details

OmDet-Turbo-Base uses ConvNext Base as the image backbone and the text encoder of CLIP ViT-B/16 as the text backbone. To keep training stable, the first 6 layers of the text backbone are frozen and only the last 4 layers are fine-tuned.

During training the base learning rate is set to 0.0001, and a decay of 0.1 is applied at 70% and 90% of the total training steps. The model is trained on 16 NVIDIA A100 GPUs with a batch size of 64.
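The step schedule described above can be written as a small function. This is a sketch of the stated rule only: the function name `learning_rate` and the choice to express milestones as integer percentages are assumptions, and warm-up or optimizer details, which are not described here, are left out.

```python
def learning_rate(step, total_steps, base_lr=1e-4, milestones_pct=(70, 90), gamma=0.1):
    """Step-decay schedule: multiply the rate by `gamma` at each milestone."""
    lr = base_lr
    for pct in milestones_pct:
        # Integer comparison avoids floating-point rounding at the boundary.
        if step * 100 >= pct * total_steps:
            lr *= gamma
    return lr

total = 1000  # hypothetical number of training steps
for step in (0, 699, 700, 899, 900):
    print(step, learning_rate(step, total))
```

With these settings the rate stays at 1e-4 for the first 70% of training, drops to about 1e-5 until 90%, and finishes at about 1e-6.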

4.2 Main Results

Zero-shot Performance on the Common OD Benchmarks COCO and LVIS

COCO and LVIS are widely recognized benchmarks for evaluating OD models. On COCO, OmDet-Turbo-Base achieves 53.4 AP and outperforms GLIP-L despite its smaller model size.

For zero-shot performance on LVIS, OmDet-Turbo-Base outperforms two large-sized models, GLIP-L and Grounding-DINO-L. This result highlights the model's strength in detecting objects from a large vocabulary and shows its ability to handle diverse detection scenarios.

Zero-shot Performance on Complex OD Benchmarks

On the ODinW benchmark, OmDet-Turbo-Base achieves a zero-shot AP of 30.1, surpassing Grounding-DINO-L and OmDet-B. This result highlights the model's transferability and adaptability to a variety of real-world tasks.

On OVDEval, a benchmark carefully designed for OVD, OmDet-Turbo-Base sets a new state-of-the-art (SOTA) score with an NMS-AP of 26.86.

Inference Speed

To evaluate inference speed at a practical scale, inference was run on the 5000 images of the COCO val2017 dataset, which covers 80 categories. Compared with other models, OmDet-Turbo-Base reaches 18.6 FPS in PyTorch and 100.2 FPS with TensorRT. This is about 20 times faster than competing models and underlines the model's efficiency on object detection tasks.
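The reported throughput figures are easier to relate to deployment budgets when converted to latency. The helpers below are simple arithmetic on the numbers quoted above; the function names are hypothetical and the calculation assumes images are processed one at a time at a constant rate.

```python
def latency_ms(fps):
    # Average time per image in milliseconds at a given throughput.
    return 1000.0 / fps

def total_seconds(num_images, fps):
    # Wall-clock time to process a dataset at a constant throughput.
    return num_images / fps

pytorch_fps, tensorrt_fps = 18.6, 100.2  # figures quoted in this section
num_images = 5000                        # COCO val2017

print(round(latency_ms(pytorch_fps), 2), round(latency_ms(tensorrt_fps), 2))
print(round(total_seconds(num_images, pytorch_fps), 1),
      round(total_seconds(num_images, tensorrt_fps), 1))
print(round(tensorrt_fps / pytorch_fps, 2))  # TensorRT speed-up over PyTorch
```

At 100.2 FPS the per-image latency is just under 10 ms, which leaves headroom within the roughly 33 ms budget of a 30 FPS video stream.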

4.3 Ablation Study

Performance Analysis

This comparison shows the performance of OmDet-Turbo-Tiny against other models in a zero-shot setting where all models use the same pre-training datasets and image backbone.

  • OmDet-Turbo-Tiny achieves a competitive zero-shot score on COCO and outperforms the other models on LVIS minival with the best score of 30.3 AP.

Detailed Inference Speed Analysis

To explain the improvements in the model and the removal of the bottlenecks present in OmDet and Grounding-DINO, the structure of the three models was broken down into four main components: text backbone, image backbone, encoder/FPN, and decoder/head. The time spent in each component was then measured carefully.

OmDet-Turbo consistently outperforms the other two models in every component, with especially notable gains in the encoder/FPN and decoder/head sections. In the encoder/FPN component, Grounding-DINO slows down considerably because of the heavy multi-modality computation in its feature enhancer layers.

OmDet-Turbo resolves this bottleneck with its hybrid encoder and achieves roughly a 10x speed-up over Grounding-DINO in the encoder/FPN stage. In the decoder/head component, OmDet's original MDN relies on the time-consuming ROIAlign operation for feature extraction. By introducing the ELA-Decoder, OmDet-Turbo removes the need for ROI operations and significantly accelerates the decoder/head component.

  5. Conclusion

In conclusion, this paper introduces OmDet-Turbo, a real-time transformer-based open-vocabulary object detection model that excels in both efficiency and performance. By addressing the challenges of open-vocabulary scenarios while maintaining high detection accuracy, OmDet-Turbo stands out as an attractive solution for real-world object detection tasks.

The Efficient Fusion Head module plays a key role: it reduces the computational complexity of the encoder and head components, enabling faster inference without hurting detection performance. Trained on large-scale datasets, OmDet-Turbo-Base achieves state-of-the-art performance on challenging datasets such as ODinW and OVDEval and shows strong zero-shot detection ability.

With its focus on real-world deployment and real-time applications, OmDet-Turbo offers a balance between robust detection ability and efficient inference speed. Its high accuracy in open-vocabulary scenarios, combined with its strong performance metrics, makes it a promising choice for industrial object detection tasks.

Through a combination of innovative design choices and careful optimization, OmDet-Turbo represents a significant advance in real-time transformer-based object detection.


