[CV Notes] Lecture 18 - Videos

Posted by Euisuk's Dev Log on August 16, 2024

Original post: https://velog.io/@euisuk-chung/CV-Notes-Lecture-18-Videos

These are my summary and notes on “Lecture 18. Videos”. If you spot a mistake, please leave a comment 🙌

Lecture 18 takes an in-depth look at deep learning models for processing and understanding video data. The main topics are as follows:

1. Extending from 2D images to 3D and video

  • Review of previous lectures:
    • Earlier lectures covered how to handle tasks on 2D images such as object classification and segmentation, with a focus on image classification and 2D shape prediction.
    • We then moved on to 3D shape prediction and discussed how to extend CNNs to 3D, either to predict a 3D shape from a 2D input image or to process 3D data directly.
  • Topic of this lecture:
    • This lecture adds a time axis to CNNs so that they can handle video. A video is a sequence of images along a time axis, so it is represented as a 4D tensor.

2. Structure of video and its challenges

  • Video Tensor:

    • A video is represented as a 4D tensor with two spatial axes (H, W), a channel axis (RGB), and a time axis (T).
      • Video Tensor = 2D Images + Time (4D Tensor)

        (Time x RGB Channel(3) x Height x Width)

    • Handling video therefore means processing the three per-frame dimensions (channel, height, width) together with one temporal dimension.
      • Depending on the task, the layout is either Time x RGB Channel(3) x Height x Width or RGB Channel(3) x Time x Height x Width.
  • Image vs Video:

  • Image classification task:

    • The focus is on recognizing objects. What we want to recognize in an image is mostly nouns: things with their own spatial extent or identity.
      • Examples include animals such as dogs and cats, and inanimate objects such as bottles and cars.
      • The goal of image classification is to recognize and classify such objects in a given image.
  • Video classification task:

    • The core is recognizing actions or activities. What we want to recognize in a video is mostly verbs: actions or activities that unfold along the time axis.
      • Examples include swimming, running, jumping, eating, and standing.
      • The goal of video classification is to recognize and classify actions that change over time.
  • Computational cost and memory:

    • Video data is very large, so it is hard to load into GPU memory and process.

      • For example, processing a video stream at high resolution while keeping it at 30 fps involves an enormous amount of data.

    • To cope with this, preprocessing such as lowering the frame rate or the resolution is needed.

      • For example, models are trained on short clips (3 to 5 seconds) with reduced resolution and frame rate to cut the amount of computation.
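To make the size argument concrete, here is a rough back-of-the-envelope sketch. The specific numbers (1920x1080 at 30 fps, and a 3.2 second clip at 5 fps and 112x112) are illustrative assumptions of mine, not values taken from the notes above, and `raw_video_bytes` is a hypothetical helper.

```python
def raw_video_bytes(seconds, fps, height, width, channels=3, bytes_per_value=1):
    """Uncompressed size of a clip stored as T x C x H x W values."""
    frames = round(seconds * fps)
    return frames * channels * height * width * bytes_per_value


# One minute of 1920x1080 video at 30 fps, 8 bits per channel (assumed numbers).
hd_minute = raw_video_bytes(seconds=60, fps=30, height=1080, width=1920)
print(f"HD, 1 minute: {hd_minute / 1e9:.1f} GB")

# A short, downsampled training clip: 3.2 s at 5 fps, 112 x 112 (assumed numbers).
clip = raw_video_bytes(seconds=3.2, fps=5, height=112, width=112)
print(f"Training clip: {clip / 1e3:.0f} KB")
```

The gap between the two figures (gigabytes versus a few hundred kilobytes) is why short, low-resolution, low-frame-rate clips are the usual training input.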

3. Video classification models

  • Single-frame CNN classifier (Single Frame CNN):

    • The most basic approach: classify each frame of the video independently and average the results to get the final prediction.
    • It ignores temporal information entirely, so it looks naive, but in practice it performs very well.
    • It does well even on complex video tasks, so it is commonly used as the baseline against which more elaborate models are compared.
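The averaging step of the single-frame baseline can be sketched in a few lines. The per-frame probabilities below are made-up values standing in for the softmax outputs of a 2D CNN, and `average_frame_predictions` is a hypothetical helper name.

```python
def average_frame_predictions(frame_probs):
    """Average per-frame class probabilities into one video-level prediction."""
    num_frames = len(frame_probs)
    num_classes = len(frame_probs[0])
    return [sum(p[c] for p in frame_probs) / num_frames for c in range(num_classes)]


# Hypothetical outputs of a 2D CNN run independently on 3 frames
# (class 0: running, class 1: swimming).
frame_probs = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
video_probs = average_frame_predictions(frame_probs)

# Index of the class with the highest averaged probability.
predicted = max(range(len(video_probs)), key=video_probs.__getitem__)
print(video_probs, predicted)
```

Note that shuffling the frames would not change the result, which is exactly what "ignores temporal information" means here.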

  • Late Fusion:

    • Similar to the single-frame classifier, but the per-frame results are combined inside the network before classification. Each frame is first processed independently by a CNN, and temporal information is merged only afterwards.
    • Because the combination over time happens inside the network, the model can learn more refined temporal patterns than plain averaging of predictions.

  • Early Fusion:

    • Reinterprets the time axis of the input video as part of the channel axis and merges temporal information in the first CNN layer.
      • Reshape: (T x 3 x H x W) ▶ (3T x H x W)
    • This lets the earliest layer of the CNN handle the time axis, so low-level temporal interactions can be learned.
    • However, all temporal information is collapsed in a single step, which can lose information.
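The early-fusion reshape is just a flattening of the first two axes. A minimal sketch with nested lists (a real model would use a tensor library's reshape; `fuse_time_into_channels` is a hypothetical name):

```python
def fuse_time_into_channels(video):
    """Reshape a T x C x H x W nested list into (T*C) x H x W."""
    return [channel for frame in video for channel in frame]


# Tiny example: T=4 frames, C=3 channels, H=W=2.
# Every pixel of frame t, channel c holds the value t*10 + c so it can be traced.
T, C, H, W = 4, 3, 2, 2
video = [[[[t * 10 + c for _ in range(W)] for _ in range(H)]
          for c in range(C)] for t in range(T)]

fused = fuse_time_into_channels(video)
print(len(fused), len(fused[0]), len(fused[0][0]))  # (T*C), H, W
```

After this step an ordinary 2D convolution with 3T input channels sees every frame at once, which is why temporal information is fully merged after the first layer.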

  • 3D CNN (Slow Fusion):

    • Uses a 3D CNN to fuse spatial and temporal information gradually over many layers.

    • Every layer uses 3D convolution and 3D pooling, so spatial and temporal information are processed together. Temporal information is merged slowly but continuously through the network, which gives strong performance, particularly on videos where motion matters.


  • Summary:

    The lecture's summary table compares the structure of Late Fusion, Early Fusion, and 3D CNN. For each layer it shows the tensor size, the receptive field, and how the model combines temporal and spatial information.

  1. Late Fusion

    • Structure:
      • The input is a 3 x 20 x 64 x 64 tensor, where 3 is the number of channels (RGB), 20 is the number of frames (Time), and 64 x 64 is the spatial resolution (Image Size).
      • The first layer is a 2D Conv (3x3 filter, 12 output channels) that processes each frame independently. It ignores temporal information and extracts spatial features only.
      • Pooling then shrinks the spatial size, and another 2D Conv layer enlarges the spatial receptive field.
      • Finally, a Global Average Pooling layer combines all spatial and temporal information into the output. (Late Fusion)
    • Characteristics:
      • Temporal fusion: across the whole network, temporal information is merged once, at the final Global Average Pooling.
      • Spatial fusion: spatial information is merged slowly over several layers.
  2. Early Fusion

    • Structure:
      • The input is a 3 x 20 x 64 x 64 tensor.
      • Before the first 2D Conv layer, the time axis (T) is folded into the channel axis (C), giving 3x20=60 channels instead of 3, and the layer processes that. Temporal information is thus merged from the very start. (Early Fusion)
      • Pooling and 2D Conv layers then process the spatial information, and Global Average Pooling produces the final output.
    • Characteristics:
      • Temporal fusion: all temporal information is merged in the first layer.
      • Spatial fusion: after the temporal merge, spatial information is merged slowly over several layers.
      • Drawback: because temporal information is merged at the very beginning, information can be lost early on.
  3. 3D CNN (Slow Fusion)

    • Structure:
      • The input is again a 3 x 20 x 64 x 64 tensor.
      • The first layer is a 3D Conv (3x3x3 filter, 12 output channels) that processes temporal and spatial information together.
      • The following pooling and 3D Conv layers likewise operate on the three-dimensional volume (2D space + time), slowly merging temporal and spatial information.
    • Characteristics:
      • Temporal and spatial fusion: both are merged slowly across the whole network. This is why it is called “Slow Fusion”.
      • Advantage: merging space and time together lets the network extract more accurate features, which helps especially with complex temporal patterns.
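The receptive-field growth that separates the three designs can be computed with the standard recurrence (receptive field grows by (kernel - 1) times the accumulated stride). This is a sketch under an assumed layer stack: conv 3, pool 4 with stride 4, conv 3. The pooling size is my assumption, since the notes above do not state it.

```python
def receptive_field(layers):
    """Receptive field along one axis after each (kernel, stride) layer."""
    rf, jump, sizes = 1, 1, []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the view by (k-1) input steps
        jump *= stride              # stride spaces later layers further apart
        sizes.append(rf)
    return sizes


# Assumed stack: conv 3 (stride 1), pool 4 (stride 4), conv 3 (stride 1).
stack = [(3, 1), (4, 4), (3, 1)]

spatial = receptive_field(stack)       # same for all three models along H and W
temporal_3d = receptive_field(stack)   # 3D CNN applies the same stack along T
temporal_late = [1, 1, 1]              # 2D layers on single frames never mix time
temporal_early = [20, 20, 20]          # first layer already sees all 20 frames

print(spatial)  # [3, 6, 14]
```

Only the 3D CNN grows its temporal receptive field step by step, in the same way as the spatial one; Late Fusion stays at one frame until the final pooling, and Early Fusion jumps to all 20 frames immediately.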

✅ Early Fusion vs 3D CNN

1. Conv2D(3x3, 3*20->12):

  • Here 3x3 is the filter size, 3*20 is the number of fused input channels (3 RGB channels x 20 frames = 60), and the 12 in ->12 is the number of output channels (filters).
  • Twelve output channels means the layer learns 12 different filters, each covering a 3x3 spatial window across all 60 input channels.
  • Because every frame is fused and processed at once, the time axis is completely collapsed in this single layer.

2. Conv3D(3x3x3, 3 -> 12):

  • In this case a 3x3x3 filter moves through the 3D volume (including time) as it processes the input.
  • There are 3 input channels, and each of the 12 filters spans all 3 of them. The number of output channels is again 12, a value chosen by the designer.
  • The filter slides along the time axis, so it processes the input with its temporal structure taken into account.
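Counting weights makes the contrast between the two first layers explicit. This is a small sketch; `conv_weights` is a hypothetical helper and biases are left out.

```python
def conv_weights(in_channels, out_channels, kernel_shape):
    """Number of weights (without biases) in a convolution layer."""
    count = in_channels * out_channels
    for k in kernel_shape:
        count *= k
    return count


early_fusion = conv_weights(3 * 20, 12, (3, 3))   # Conv2D(3x3, 3*20 -> 12)
conv3d = conv_weights(3, 12, (3, 3, 3))           # Conv3D(3x3x3, 3 -> 12)

print(early_fusion)  # 6480
print(conv3d)        # 972
```

The Conv2D filter has a separate weight for every one of the 20 frames, so a motion pattern has to be learned again for each point in time where it can occur. The Conv3D filter only covers 3 frames and is reused as it slides along the time axis.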

4. Further approaches

  • Example video dataset: Sports-1M:

    • Sports-1M, built by Google, is a dataset of one million YouTube sports videos labeled with a wide range of sports categories.
    • It illustrates the difficulty of video classification well, and it allows the single-frame classifier, Late Fusion, Early Fusion, and 3D CNN approaches to be compared.

    • The single-frame classifier reached over 77% accuracy, showing how strong the simple approach is, while Late Fusion and 3D CNN did slightly better.

  • C3D: the VGGNet of 3D CNNs:

    • Like the VGG network, C3D is a simple 3D CNN architecture built from 3x3x3 convolutions and 2x2x2 pooling.

    • It performed well on Sports-1M and was used in many video recognition tasks.

    • Its drawback is a very high computational cost, which makes it hard to run. (3x3x3 conv is very expensive!)

      • AlexNet: 0.7 GFLOP
      • VGG-16: 13.6 GFLOP
      • C3D: 39.5 GFLOP (about 2.9x VGG!!)

🔎 Is there a way to measure motion? => Optical Flow

✍️ Optical flow is a technique that estimates the motion of each pixel between consecutive image frames. It detects motion in the image and represents its direction and speed as a vector field, which can be visualized. It is used in video processing, action recognition, video stabilization, and other areas.

  • Basic idea:
    • Optical flow computes a vector field describing how each pixel moved between the image $I_t$ at time $t$ and the image $I_{t+1}$ at the next time $t+1$.
    • For each pixel it produces a displacement vector $F(x, y) = (dx, dy)$.
    • This vector shows how far the pixel moved going from frame $t$ to frame $t+1$.
  • Main uses:
    1. Motion detection: tracking how a particular object moves in a video.
    2. Action analysis: analyzing human motion and recognizing specific actions.
    3. Video stabilization: compensating for differences between frames to stabilize a video.
  • Computation:
    1. Frame selection: pick two consecutive frames.
    2. Brightness constancy: assuming a pixel's brightness does not change, compute how pixel positions change between the frames. The equation commonly used is:
      • $I(x, y, t) = I(x + dx, y + dy, t + 1)$
    3. Vector field: build the motion vector field for the whole frame from the computed displacement vectors.
  • Visualization:

    The horizontal displacement $dx$ and the vertical displacement $dy$ of the optical flow can each be visualized, which is very useful for understanding how objects move in a video.
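The brightness-constancy equation can be tried out on a toy example. The sketch below does a brute-force search for one global shift (dx, dy) that best satisfies I(x, y, t) = I(x + dx, y + dy, t + 1). It is only an illustration of the assumption: real optical flow methods estimate a separate vector for every pixel, and the function names and the synthetic frames are my own.

```python
def estimate_global_shift(frame_a, frame_b, max_shift=2):
    """Find the single (dx, dy) minimizing the mean squared brightness difference."""
    h, w = len(frame_a), len(frame_a[0])
    best, best_err = (0, 0), float("inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            err, n = 0.0, 0
            for y in range(h):
                for x in range(w):
                    y2, x2 = y + dy, x + dx
                    if 0 <= y2 < h and 0 <= x2 < w:   # compare only overlapping pixels
                        err += (frame_a[y][x] - frame_b[y2][x2]) ** 2
                        n += 1
            if n and err / n < best_err:
                best, best_err = (dx, dy), err / n
    return best


def make_frame(top, left, size=6):
    """Synthetic frame: a bright 2x2 patch on a dark background."""
    return [[1.0 if top <= y < top + 2 and left <= x < left + 2 else 0.0
             for x in range(size)] for y in range(size)]


frame_t = make_frame(top=1, left=1)
frame_t1 = make_frame(top=3, left=2)   # patch moved 1 pixel right, 2 pixels down
dx, dy = estimate_global_shift(frame_t, frame_t1)
print(dx, dy)  # 1 2
```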

  • Separating Motion and Appearance: Two-Stream Networks:

    • Two-Stream Networks are an architecture for video analysis tasks such as action recognition that processes “Motion” and “Appearance” separately.
    • Optical flow plays the key role here: it supplies the motion information to the Temporal Stream.

    • Two-Stream Network structure:

      The two streams receive different inputs, so each CNN learns patterns suited to its own input. This is what lets the two networks learn complementary information.

      • Spatial Stream: takes a single frame as input and learns static appearance information (object shape, color, background, and so on). It resembles the conventional CNN architectures used for image classification.
      • Temporal Stream: takes the optical flow between consecutive frames as input and learns the motion information in the video. Since optical flow is a motion vector field between frames, the stream can learn the direction and speed of motion effectively.
    • Two-Stream Network computation:

      1. Take several consecutive frames from the video.
      2. Compute the optical flow between each pair of consecutive frames to obtain motion vector fields.
      3. Feed the computed optical flow to the Temporal Stream.
      4. Combine (fuse) the outputs of the Spatial Stream and the Temporal Stream to predict the final class.

🤔 Q. Why does the optical flow input have 2(T-1) channels?

How the optical flow input is computed

  • T is the number of consecutive frames selected from the video.
  • T-1 is the number of frame pairs needed to compute the optical flow fields. With T frames, pairing each frame with the next one gives T-1 optical flow fields.
  • An optical flow field has two components: a horizontal (x) component and a vertical (y) component. Each frame pair therefore yields two channels, and T-1 frame pairs yield 2*(T-1) channels in total.
    • 2: because optical flow consists of two channels (horizontal and vertical components).
    • (T-1): the number of consecutive frame pairs over which optical flow is computed.
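The channel count can be checked with a short sketch. The function names are hypothetical, and the flow fields below are placeholder strings rather than real flow maps.

```python
def flow_stack_channels(num_frames):
    """Channels of the temporal-stream input for T frames: 2 * (T - 1)."""
    if num_frames < 2:
        raise ValueError("need at least two frames to compute optical flow")
    pairs = num_frames - 1      # (frame 0, frame 1), (frame 1, frame 2), ...
    return 2 * pairs            # one dx map and one dy map per pair


def stack_flows(flows):
    """Stack T-1 flow fields, each given as (dx_map, dy_map), into one channel list."""
    return [component for dx_map, dy_map in flows for component in (dx_map, dy_map)]


# Placeholder flow fields for T = 4 frames (3 consecutive pairs).
flows = [("dx01", "dy01"), ("dx12", "dy12"), ("dx23", "dy23")]
stacked = stack_flows(flows)
print(len(stacked), flow_stack_channels(4))  # 6 6
```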


  • Recurrent Neural Network (RNN):

    • Advantages:

      • Good at long sequences: an RNN carries the hidden state from previous time steps, which is useful for long sequences. A single RNN layer is enough for the final state to “see” the whole sequence.
    • Disadvantages:

      • Not parallelizable: the hidden state has to be computed sequentially, one time step after another, so RNNs do not suit parallel processing. This is inefficient for long sequences.
    • Use in video tasks:

      • Video can be processed by combining a CNN with an RNN, or with a Recurrent CNN.

    • Difference from a standard RNN: in a Recurrent Convolutional Network (RCN) the state update works like that of a standard RNN, with one important difference: it uses 2D convolution instead of matrix multiplication (matmul).
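Written out, the difference is a one-symbol change. The notation below ($W_h$, $W_x$, $*$, and the tanh nonlinearity) is my own assumed shorthand for a vanilla recurrence, not copied from the lecture slides:

$$
\text{RNN:}\quad h_t = \tanh\left(W_h\, h_{t-1} + W_x\, x_t\right)
$$

$$
\text{Recurrent conv:}\quad H_t = \tanh\left(W_h * H_{t-1} + W_x * X_t\right)
$$

In the first line $h_t$ and $x_t$ are vectors and the products are matrix multiplications. In the second line $H_t$ and $X_t$ are feature maps of shape $C \times H \times W$ and $*$ denotes 2D convolution, so the hidden state keeps its spatial layout from one time step to the next.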


  • 1D Convolution:

    • Advantages:
      • Highly parallelizable: each output in the sequence can be computed independently, so the work can run in parallel. This helps both training and inference speed.
    • Disadvantages:
      • Weak on long sequences: covering a long sequence requires stacking many convolution layers, which adds complexity. Long-range temporal dependencies can be hard to capture.
    • Use in video tasks:
      • The video counterpart is 3D Convolution, which processes the spatial and temporal information of a video together.

  • Self-Attention:

    • Advantages:

      • Good at long sequences: self-attention computes the relationship between every pair of input vectors, so each output “sees” the whole sequence in a single step. Long-range temporal dependencies are handled well.
      • Highly parallelizable: all outputs can be computed in parallel, so computation is efficient.
    • Disadvantages:

      • Memory hungry: computing every pair in the sequence uses a lot of memory, which becomes a burden for long sequences.
    • Use in video tasks:

      • In video analysis, self-attention can model long-range temporal dependencies and complex spatial relationships effectively.
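Both properties (every output sees every input, and memory grows with the number of pairs) show up in a minimal sketch. For brevity it uses identity projections for queries, keys, and values, which is a simplification of mine; a real layer would learn separate projections. The 8 x 14 x 14 feature-map size is also just an assumed example.

```python
import math


def self_attention(x):
    """Single-head self-attention with identity Q/K/V projections.

    x: list of N feature vectors of length d. Returns N output vectors.
    """
    d = len(x[0])
    out = []
    for q in x:
        # Scaled dot-product score of this query against every position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted sum of all value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


def attention_matrix_entries(T, H, W):
    """Pairwise scores when every space-time position attends to every other."""
    n = T * H * W
    return n * n


outputs = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(outputs)
print(attention_matrix_entries(8, 14, 14))  # 2458624
```

Even a small 8 x 14 x 14 feature map needs about 2.5 million scores per attention map, which is the memory cost mentioned above.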

5. Model optimization and recent techniques

Spatio-Temporal Self-Attention (Nonlocal Block)

  • Nonlocal Block:

    • A Nonlocal Block can be inserted into a 3D CNN. It models the interactions between all positions across the spatial and temporal dimensions, so it handles long-range dependencies effectively and can substantially improve recognition on complex video data.
  • Nonlocal Block structure:

    • Nonlocal Block: the block uses 1x1x1 convolutions to produce queries (Query), keys (Key), and values (Value) from the input tensor. Attention weights are computed from the dot product between queries and keys, and the output is the weighted sum of the values under those weights.
    • Residual Connection: thanks to its residual connection, a Nonlocal Block can be added to an existing 3D CNN architecture without disturbing what the network has already learned. If the weights of the last 1x1x1 convolution are initialized to 0, the block starts out as the identity function. The existing model's performance is preserved at first, and the Nonlocal Block is integrated gradually as training (fine-tuning) proceeds.
  • Role of the Nonlocal Block:

    • Understanding global context: the non-local block models long-range spatio-temporal dependencies across the whole video.

      • It learns globally how a change in one part of the video affects other parts, so it can capture interactions between frames that are far apart.
      • For example, it can learn how an action early in the video influences what happens later.
    • Learning global features: the non-local block computes the relationship between all positions and so learns global features of the entire video.

      • This helps overcome the locality limitation of CNNs.
      • It mainly learns global patterns within a frame or across frames (for example, an overall motion trajectory or a long-running action).
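The zero-initialization trick can be illustrated with a toy residual block. Everything here is a stand-in of my own: `mean_attend` replaces the real attention computation, and the scalar `out_weight` replaces the final 1x1x1 convolution. The point is only that a zero output weight makes the block an identity mapping.

```python
def nonlocal_block(x, attend, out_weight):
    """y = x + out_weight * attend(x), applied per position and per feature.

    `attend` stands in for the attention part of the block and `out_weight`
    is a scalar stand-in for the final 1x1x1 convolution.
    """
    attended = attend(x)
    return [[xi + out_weight * ai for xi, ai in zip(xv, av)]
            for xv, av in zip(x, attended)]


def mean_attend(x):
    """Toy attention: every position receives the mean of all positions."""
    n, d = len(x), len(x[0])
    mean = [sum(v[j] for v in x) / n for j in range(d)]
    return [mean[:] for _ in x]


features = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]

at_init = nonlocal_block(features, mean_attend, out_weight=0.0)        # identity
after_training = nonlocal_block(features, mean_attend, out_weight=0.5)

print(at_init == features)  # True
```

With a nonzero weight the global term starts to contribute, which is how the block is phased in during fine-tuning.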

Why is there a 3D CNN?

The 3D CNN in the lecture's architecture figure plays the following roles:

  • Extracting local spatio-temporal features:
    • The 3D CNN learns short-term motion and spatial patterns from consecutive video frames.
    • For example, it captures motion that happens over a short time, such as a waving hand or a small movement of an object.
    • CNN filters detect local features, and convolution and pooling gradually turn these local features into higher-level, more abstract ones.
  • Hierarchical information processing:
    • The 3D CNN extracts features from low level to high level through its layers.
    • It learns everything from low-level image features (such as edges and corners) up to high-level semantic information (such as objects and scenes).

In this way, Spatio-Temporal Self-Attention (Nonlocal Block) processes video hierarchically by combining a 3D CNN with non-local blocks. Using both local and global information improves the accuracy of video understanding and makes it possible to analyze the varied spatio-temporal patterns in a video effectively.


Inflated 2D Networks, I3D (inflating 2D networks to 3D)

  • 2D -> 3D inflation: the core idea of I3D is to extend the spatial filters of a 2D CNN along the time axis to build a 3D CNN. The inflated filters are not just fixed copies: they are trained further on video, so they learn to take the time axis into account and can capture temporal change.
    • A 2D CNN uses filters of size Cin × Kh × Kw, where Cin is the number of input channels and Kh and Kw are the filter height and width.
    • To process video, the filter is extended along the time axis: adding a temporal filter size Kt gives a 3D convolution filter of size Cin × Kt × Kh × Kw.

  • Transfer Learning: I3D initializes its 3D filters from the weights of a 2D CNN trained on a large image dataset such as ImageNet. This converges faster and performs better than training a 3D CNN from scratch on a video dataset.
    • Copying weights: the weights (filters) learned by the 2D CNN are copied along the time axis to become the 3D CNN's weights. For example, a Kh × Kw filter of the 2D CNN is copied Kt times to form a Kt × Kh × Kw 3D filter.
    • Dividing weights: the weights of the inflated filter are divided by Kt. On a video made of one frame repeated over time, the 3D filter then produces the same output as the 2D filter did on that frame, while still being a filter that spans the time axis.

  • Performance comparison: according to the performance comparison chart in the lecture, I3D performs far better on video than a conventional 2D CNN, with a large gain in Top-1 accuracy in particular. This is because the I3D architecture captures the temporal features of video well.
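The copy-then-divide rule can be verified on a toy case: for a video made of one repeated frame, the inflated 3D filter gives the same response as the original 2D filter. This is a single-channel, single-position sketch with hypothetical helper names, and the kernel and frame values are arbitrary.

```python
def inflate_kernel(kernel_2d, kt):
    """Copy a Kh x Kw kernel Kt times along time and divide by Kt."""
    return [[[w / kt for w in row] for row in kernel_2d] for _ in range(kt)]


def conv2d_at(frame, kernel, y, x):
    """Response of a 2D kernel at one position (no padding)."""
    return sum(kernel[i][j] * frame[y + i][x + j]
               for i in range(len(kernel)) for j in range(len(kernel[0])))


def conv3d_at(video, kernel, t, y, x):
    """Response of a 3D kernel at one position (no padding)."""
    return sum(conv2d_at(video[t + k], kernel[k], y, x) for k in range(len(kernel)))


kernel_2d = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
frame = [[float(3 * r + c * c) for c in range(4)] for r in range(4)]
static_video = [frame, frame, frame]        # the same frame repeated 3 times
kernel_3d = inflate_kernel(kernel_2d, kt=3)

out_2d = conv2d_at(frame, kernel_2d, 0, 0)
out_3d = conv3d_at(static_video, kernel_3d, 0, 0, 0)
print(out_2d, out_3d)
```

Without the division by Kt the 3D response would be Kt times larger, and the pretrained later layers would receive activations on a scale they were never trained on.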

SlowFast Network

  • Temporal resolution: the defining feature of the SlowFast Network is that it analyzes video through two pathways with different temporal resolutions (the Slow pathway and the Fast pathway).

    • Slow Pathway: low FPS (e.g. 30 FPS) - learns the overall context of the video over longer time spans.

      • This pathway runs at a low frame rate and focuses on learning the overall context of the video.

        🔎 Low FPS: the Slow Pathway processes the video using relatively few frames. Processing only 30 frames per second, for example, means the video is sampled less often.

      • It is mainly used to understand the overall structure and flow of the video, and it uses many channels to learn strong representations.
      • The frame-rate ratio α is typically set to 8, so this pathway processes the video at a rate 8 times lower than the Fast Pathway.
    • Fast Pathway: high FPS (e.g. 240 FPS) - captures fine-grained motion and changes over short time spans.

      • This pathway runs at a high frame rate and concentrates on capturing detailed motion.

        🔎 High FPS: the Fast Pathway processes the video at a very high FPS to catch fine changes. Processing 240 frames per second, for example, means even small changes within short intervals can be captured.

      • It is designed to pick up fine motion patterns quickly and is kept lightweight by using relatively few channels.
      • The channel ratio β is set to 1/8, so it runs with far fewer channels than the Slow Pathway.

  • Lateral Connections: the lateral connections between the Slow pathway and the Fast pathway exchange information between the two and help integrate the temporal dynamics, which makes complex temporal patterns easier to capture.

  • Reading the table:
    • Speed ratio (α) and channel ratio (β):
      • Speed ratio (α): the frame-rate difference between the Slow and Fast pathways. α = 8 means the Slow Pathway processes 8 times fewer frames than the Fast Pathway.
      • Channel ratio (β): the ratio of channel counts between the Slow Pathway and the Fast Pathway. β = 1/8 means the Fast Pathway uses 8 times fewer channels than the Slow Pathway.
    • ResNet-50 Backbone
      • The network is built on a ResNet-50 backbone, extended so that it learns temporal and spatial information in video together.
      • The table shows the stride and output size at each stage, along with how features are extracted in each pathway.
    • No temporal pooling:
      • The temporal dimension is preserved as far as possible at every stage, and the prediction is made after global average pooling at the final layer.
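How α and β translate into the two inputs can be sketched as follows. The concrete numbers (a 64-frame raw clip, a Slow-pathway stride of 16, and 64 Slow channels) are illustrative assumptions, and `slowfast_inputs` is a hypothetical helper rather than part of any library.

```python
def slowfast_inputs(num_raw_frames, slow_stride, alpha, slow_channels, beta):
    """Frame indices and channel widths for the two pathways.

    The Fast pathway samples alpha times more densely than the Slow pathway
    and uses beta times as many channels.
    """
    fast_stride = slow_stride // alpha
    slow_idx = list(range(0, num_raw_frames, slow_stride))
    fast_idx = list(range(0, num_raw_frames, fast_stride))
    fast_channels = int(slow_channels * beta)
    return slow_idx, fast_idx, fast_channels


slow_idx, fast_idx, fast_channels = slowfast_inputs(
    num_raw_frames=64, slow_stride=16, alpha=8, slow_channels=64, beta=1 / 8)

print(len(slow_idx), len(fast_idx), fast_channels)  # 4 32 8
```

The Fast pathway sees 8 times as many frames but with one eighth of the channels, which is what keeps it lightweight despite the higher frame rate.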

This lecture compares a range of deep learning models for processing and understanding video data and explains in detail the situations in which each one is effective. It is a very useful resource for understanding how deep learning is applied to video analysis and where its limits lie.


