[Paper Review] RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer

Posted by Euisuk's Dev Log on September 14, 2025

Original post: https://velog.io/@euisuk-chung/Paper-Review-RT-DETRv2-Improved-Baseline-with-Bag-of-Freebies-forReal-Time-Detection-Transformer

https://arxiv.org/pdf/2407.17140

This review is a near-literal translation of the original paper. Note that "we" refers to the paper's authors.

Abstract

In this report, we present RT-DETRv2, an improved real-time Detection Transformer. RT-DETRv2 builds on RT-DETR, the previous state-of-the-art real-time detector. It introduces a set of bag-of-freebies for flexibility and practicality and optimizes the training strategy to achieve enhanced performance.

To improve flexibility, we propose setting a distinct number of sampling points for features at different scales in deformable attention, so that the decoder can perform selective multi-scale feature extraction. To enhance practicality, we propose an optional discrete sampling operator that replaces the grid_sample operator, which is specific to RT-DETR when compared with YOLOs. This removes the deployment constraints typically associated with DETRs.

For the training strategy, we propose dynamic data augmentation and scale-adaptive hyperparameter customization to improve performance without loss of speed. Source code and pre-trained models will be available at https://github.com/lyuwenyu/RT-DETR.

  1. Introduction

Object detection is a fundamental computer vision task that identifies and localizes objects in an image. Within it, real-time object detection is an important area with a wide range of applications, such as autonomous driving. Over the past few years, YOLO detectors have undoubtedly become the leading framework in this field, because they strike a reasonable balance between speed and accuracy.

The arrival of RT-DETR v1 opened up a new technical direction for real-time object detection and broke the field's reliance on YOLO.

  • RT-DETR proposed an efficient hybrid encoder to replace the vanilla Transformer encoder in DETR. By decoupling the intra-scale interaction and cross-scale fusion of multi-scale features, it significantly improves inference speed.

To further improve performance, RT-DETR proposed uncertainty-minimal query selection.

  • This provides high-quality initial queries to the decoder by explicitly optimizing the uncertainty.
  • RT-DETR also offers a wide range of detector sizes and supports flexible speed tuning to fit various real-time scenarios without retraining.

In this report, we present RT-DETRv2, an improved real-time detection Transformer.

  • This work builds on the recent RT-DETR, provides a set of bag-of-freebies for flexibility and practicality within the DETR family, and optimizes the training strategy to achieve enhanced performance.

Specifically, RT-DETRv2 proposes setting a distinct number of sampling points for features at different scales within the deformable attention module, so that the decoder achieves selective multi-scale feature extraction.

  • On the practicality side, RT-DETRv2 provides an optional discrete sampling operator that replaces the original grid_sample operator specific to DETRs, removing the deployment constraints typically associated with detection Transformers.

In addition, RT-DETRv2 optimizes the training strategy, including dynamic data augmentation and scale-adaptive hyperparameter customization, with the goal of improving performance without loss of speed.

  • The results show that RT-DETRv2 provides an improved baseline with bag-of-freebies for RT-DETR, increases flexibility and practicality, and that the proposed training strategies optimize both performance and training cost.

  2. Methodology

The framework of RT-DETRv2 remains the same as RT-DETR; only the deformable attention module of the decoder is modified.

2.1 Framework

Distinct number of sampling points for different scales

Current DETRs use the deformable attention module to alleviate the high computational overhead caused by the long input sequences built from multi-scale features.

Reference: DAT: Vision Transformer with Deformable Attention

  • The RT-DETR decoder retains this module, which defines the same number of sampling points at every scale.

We argue that this constraint ignores the inherent differences between features at different scales and limits the feature extraction capability of the deformable attention module.

  • We therefore propose setting a distinct number of sampling points for each scale to achieve more flexible and efficient feature extraction.
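
As a rough illustration of what "a distinct number of sampling points per scale" means in practice, the sketch below splits one query's flat list of predicted offsets into per-scale groups. The function name `split_offsets_by_level` and the point counts are hypothetical and are not taken from the RT-DETRv2 code base.

```python
from typing import List, Sequence, Tuple

Offset = Tuple[float, float]  # (dx, dy) offset of one sampling point


def split_offsets_by_level(
    offsets: Sequence[Offset], num_points: Sequence[int]
) -> List[List[Offset]]:
    """Split one query's flat offset list into one chunk per feature scale.

    num_points[i] is the number of sampling points assigned to scale i.
    """
    if len(offsets) != sum(num_points):
        raise ValueError("number of offsets must equal sum(num_points)")
    chunks: List[List[Offset]] = []
    start = 0
    for n in num_points:
        chunks.append(list(offsets[start:start + n]))
        start += n
    return chunks


# Same count at every scale (the RT-DETR setting) vs. a distinct count per scale.
uniform_points = [4, 4, 4]   # assumed example values
distinct_points = [2, 4, 6]  # hypothetical per-scale counts, for illustration only

offsets = [(0.1 * i, -0.1 * i) for i in range(sum(distinct_points))]
per_level = split_offsets_by_level(offsets, distinct_points)
print([len(chunk) for chunk in per_level])
```

Both settings use the same total budget here; only the distribution across scales changes, which is the degree of freedom the proposal adds.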

Discrete Sampling

To improve the practicality of RT-DETR and make it usable everywhere, we focus on comparing the deployment requirements of YOLOs and RT-DETR.

  • The grid_sample operator, which is specific to RT-DETR, limits its widespread applicability.
  • We therefore propose an optional discrete_sample operator to replace grid_sample, removing the deployment constraints of RT-DETR.

(Note) The grid_sample operator is a built-in PyTorch function (torch.nn.functional.grid_sample) that plays a key role in deformable attention. When sampling features at continuous coordinates, it uses bilinear interpolation to compute accurate values.

Specifically, we apply a rounding operation to the predicted sampling offsets, which skips the time-consuming bilinear interpolation. However, the rounding operation is non-differentiable, so the gradient of the parameters used to predict the sampling offsets is turned off.

  • (Note) In practice, training first uses the grid_sample operator and then fine-tunes with the discrete_sample operator. For inference and deployment, the model uses the discrete_sample operator.
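
To make the difference between the two operators concrete, here is a minimal pure-Python sketch that samples a tiny feature map once with bilinear interpolation and once with rounding. It is a simplification under stated assumptions: coordinates are in pixel units (the real grid_sample takes normalized coordinates), out-of-range points are clamped to the border, and the helper names `bilinear_sample` and `discrete_sample` are hypothetical.

```python
import math
from typing import List

FeatureMap = List[List[float]]  # feat[y][x], single channel


def bilinear_sample(feat: FeatureMap, x: float, y: float) -> float:
    """Sample at continuous pixel coordinates using the four neighbouring pixels."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the border (assumption)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def discrete_sample(feat: FeatureMap, x: float, y: float) -> float:
    """Round to the nearest pixel and read it directly, with no interpolation."""
    h, w = len(feat), len(feat[0])
    xi = min(max(int(math.floor(x + 0.5)), 0), w - 1)  # round half up
    yi = min(max(int(math.floor(y + 0.5)), 0), h - 1)
    return feat[yi][xi]


feat = [[0.0, 1.0],
        [2.0, 3.0]]
print(bilinear_sample(feat, 0.25, 0.75))  # blends all four pixels
print(discrete_sample(feat, 0.25, 0.75))  # reads the single nearest pixel
```

The discrete version is a plain index lookup, which is why it is easier to deploy; the price is that a small change in the offset usually does not change the output, which matches the non-differentiability noted above.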
# Same framework as RT-DETR; only the decoder's deformable attention module is modified

Input Image (640×640)
↓
┌────────────────────────────────────┐
│       CNN Backbone (ResNet)        │  ← feature extraction with a CNN
│ C3(80×80) → C4(40×40) → C5(20×20)  │
└────────────────────────────────────┘
↓
┌────────────────────────────────────┐
│           Hybrid Encoder           │  ← Transformer encoder
│  Intra-scale + cross-scale fusion  │
└────────────────────────────────────┘
↓
┌────────────────────────────────────┐
│        Transformer Decoder         │  ← deformable attention
│ - Improved deformable attention    │
│ - Distinct sampling points         │
│ - Optional discrete sampling       │
└────────────────────────────────────┘
↓
Detection Heads

2.2 Training Scheme

Dynamic Data Augmentation

To give the model robust detection performance, we propose a dynamic data augmentation strategy.

  • Because a detector generalizes poorly in the early training period, we apply stronger data augmentation there; in the later training period we reduce its strength so that the detector adapts to detection in the target domain.

Specifically, we keep the RT-DETR data augmentation during the early period and turn off RandomPhotometricDistort, RandomZoomOut, RandomIoUCrop, and MultiScaleInput in the last two epochs.
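
The schedule described above can be sketched as a small epoch-dependent switch. This is only an illustration: the transforms are represented by their names as plain strings, `BASE_TRANSFORMS` and the function `active_transforms` are hypothetical, and the 72-epoch total is an assumed example value.

```python
from typing import List

# Augmentations named in the text; strings stand in for real transform objects.
STRONG_AUGMENTATIONS: List[str] = [
    "RandomPhotometricDistort",
    "RandomZoomOut",
    "RandomIoUCrop",
    "MultiScaleInput",
]
# Hypothetical steps that stay on for the whole run.
BASE_TRANSFORMS: List[str] = ["Resize", "ToTensor"]


def active_transforms(epoch: int, total_epochs: int, stop_last: int = 2) -> List[str]:
    """Return the transform names for a 0-indexed epoch.

    The strong augmentations are dropped during the final `stop_last` epochs.
    """
    if epoch >= total_epochs - stop_last:
        return list(BASE_TRANSFORMS)
    return STRONG_AUGMENTATIONS + BASE_TRANSFORMS


total_epochs = 72  # assumed example value
for epoch in (0, 69, 70, 71):
    print(epoch, active_transforms(epoch, total_epochs))
```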

Scale-adaptive Hyperparameter Customization

We also observe that scaled RT-DETRs of different sizes are all trained with the same optimizer hyperparameters, which leads to suboptimal performance. We therefore propose scale-adaptive hyperparameter customization for scaled RT-DETRs.

  • The pre-trained backbone of a light detector (e.g., ResNet18) has lower feature quality, so we increase its learning rate.
  • Conversely, the pre-trained backbone of a large detector (e.g., ResNet101) has higher feature quality, so we decrease its learning rate.
  3. Experiments

3.1 Implementation Details

As with RT-DETR, we use a ResNet pre-trained on ImageNet as the backbone and train RT-DETRv2 with the AdamW optimizer and a batch size of 16.

  • We apply an exponential moving average (EMA) with ema_decay = 0.9999.
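
For readers unfamiliar with EMA, the update it refers to is `ema = decay * ema + (1 - decay) * weights`, applied after each training step. The sketch below shows that rule on plain floats with a constant decay; the function name `ema_update` is hypothetical, and real implementations often add details such as a decay warm-up that are not shown here.

```python
from typing import List


def ema_update(ema: List[float], weights: List[float], decay: float = 0.9999) -> List[float]:
    """Return the EMA after one step: decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


# The EMA starts as a copy of the initial weights, then trails the trained weights.
ema = [0.0, 0.0]
weights = [1.0, -2.0]  # pretend the model weights moved here and stayed
for _ in range(10):
    ema = ema_update(ema, weights)
print(ema)
```

With a decay this close to 1, the averaged weights move very slowly, which smooths out step-to-step noise in the trained weights.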

For the optional discrete sampling, we first pre-train for 6× with the grid_sample operator and then fine-tune for 1× with the discrete_sample operator.

  • The hyperparameters for scale-adaptive hyperparameter customization are listed in Table 1, where lr denotes the learning rate.

Table 1: Hyperparameters of RT-DETRv2

3.2 Evaluation

RT-DETRv2 is trained on COCO train2017 and validated on the COCO val2017 dataset. We report the standard AP metric, averaged over IoU thresholds sampled uniformly from 0.50 to 0.95 in steps of 0.05, as well as AP^val_50, which is commonly used in real-world scenarios.
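
The threshold grid behind the standard AP metric can be written out explicitly. The sketch below builds the ten IoU thresholds and checks at which of them a single predicted box would count as a match for a ground-truth box. It covers only the thresholding step, not the full precision-recall computation; the helper names and the example boxes are made up for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou_thresholds() -> List[float]:
    """IoU thresholds from 0.50 to 0.95 in steps of 0.05 (ten values)."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


thresholds = iou_thresholds()
ground_truth: Box = (0.0, 0.0, 10.0, 10.0)  # example box
prediction: Box = (0.0, 0.0, 9.0, 8.0)      # example box
iou = box_iou(ground_truth, prediction)
matched_at = [t for t in thresholds if iou >= t]
print(iou, matched_at)
```

AP^val_50 corresponds to using only the first threshold (0.50), which is why it is more forgiving of loose localization than the averaged metric.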

3.3 Results

The comparison with RT-DETR is shown in Table 2. RT-DETRv2 outperforms RT-DETR across detectors of different scales without loss of speed.

Table 2: Comparison of RT-DETR and RT-DETRv2

FPS is reported on a T4 GPU with TensorRT FP16. For evaluation, all input sizes are fixed at 640×640.

3.4 Ablation Studies

Ablation on Sampling Points

We conduct an ablation study on the total number of sampling points of the grid_sample operator.

The total number of sampling points is computed as num_head × num_point × num_query × num_decoder, where num_point is the sum of the sampling points over each scale feature in each grid.
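
As a worked example of this formula, the sketch below plugs in one configuration. The numbers (8 heads, 4 points on each of 3 scales, 300 queries, 6 decoder layers) are assumed example values chosen for illustration, not figures quoted from Table 3, and `total_sampling_points` is a hypothetical helper.

```python
from typing import Sequence


def total_sampling_points(
    num_head: int, points_per_scale: Sequence[int], num_query: int, num_decoder: int
) -> int:
    """num_head x num_point x num_query x num_decoder, with num_point summed over scales."""
    num_point = sum(points_per_scale)
    return num_head * num_point * num_query * num_decoder


# Assumed example configuration.
full = total_sampling_points(num_head=8, points_per_scale=[4, 4, 4], num_query=300, num_decoder=6)
# Halving the points on every scale halves the total.
reduced = total_sampling_points(num_head=8, points_per_scale=[2, 2, 2], num_query=300, num_decoder=6)
print(full, reduced)
```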

  • The results show that reducing the number of sampling points does not cause a significant drop in performance (see Table 3). This means that practical applications in most industrial scenarios should not be affected.

Table 3: Ablation on Sampling Points

Ablation on Discrete Sampling

We then remove grid_sample and replace it with discrete_sample for the ablation. The results show that this change does not cause a noticeable reduction in AP^val_50 while removing the deployment constraints of DETRs (see Table 4).

Table 4: Ablation on Discrete Sampling

  4. Conclusion

In this report, we proposed RT-DETRv2, an improved real-time detection Transformer. RT-DETRv2 provides a set of bag-of-freebies to increase the flexibility and practicality of RT-DETR and optimizes the training strategy to achieve enhanced performance without loss of speed. We hope this report offers insights for the DETR family and broadens the range of RT-DETR applications.



-->