[Book Review] A Genealogy of Gradient Descent Methods (ํ˜ํŽœํ•˜์ž„'s ใ€ŽEasy! ๋”ฅ๋Ÿฌ๋‹ใ€)

Posted by Euisuk's Dev Log on February 8, 2025

Original post: https://velog.io/@euisuk-chung/๋ฆฌ๋ทฐ-์‹ ํ˜ธ์ฒ˜๋ฆฌ-๊ธฐ์ดˆ-์ •๋ฆฌํ˜ํŽœํ•˜์ž„์˜-ใ€ŽEasy-๋”ฅ๋Ÿฌ๋‹ใ€-yzyniith

Hello! Following my earlier introduction to ใ€ŽEasy! ๋”ฅ๋Ÿฌ๋‹ใ€ (a book recommendation for deep learning beginners, ํ˜ํŽœํ•˜์ž„'s ใ€ŽEasy! ๋”ฅ๋Ÿฌ๋‹ใ€), today I will analyze and dig deeper into one of its core chapters.

📸 (Note) The book images were photographed and attached by me for review purposes.

์ด๋ฒˆ ๊ฒŒ์‹œ๊ธ€์—์„œ๋Š” โ€œChapter 2 โ€“ ์ธ๊ณต ์‹ ๊ฒฝ๋ง๊ณผ ์„ ํ˜• ํšŒ๊ท€, ๊ทธ๋ฆฌ๊ณ  ์ตœ์ ํ™” ๊ธฐ๋ฒ•๋“คโ€์„ ๋‹ค๋ฃฐ ์˜ˆ์ •์ž…๋‹ˆ๋‹ค.

  • ์ด ์žฅ์—์„œ๋Š” ์ธ๊ณต ์‹ ๊ฒฝ๋ง, ์„ ํ˜• ํšŒ๊ท€, ๊ทธ๋ฆฌ๊ณ  ์ตœ์ ํ™” ๊ธฐ๋ฒ• ๋“ฑ ๋”ฅ๋Ÿฌ๋‹์„ ์ดํ•ดํ•˜๋Š” ๋ฐ ํ•„์ˆ˜์ ์ธ ํ•ต์‹ฌ ๊ฐœ๋…๋“ค์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ’ฌ Table of Contents for Chapter 2

✅ Chapter 2 – Artificial Neural Networks, Linear Regression, and Optimization Techniques

2.1 Artificial Neurons: An Intuitive Understanding of Weights and Biases
2.2 Artificial Neural Networks and the MLP
2.3 An Artificial Neural Network Is a Function!
2.4 Linear Regression, Step by Step from Concept to Algorithm
2.5 Gradient Descent
2.5.1 Two Problems with Gradient Descent
2.6 Weight Initialization
2.7 Stochastic Gradient Descent
2.8 Mini-Batch Gradient Descent
2.8.1 Adjusting Batch Size and Learning Rate
2.9 Momentum
2.10 RMSProp
2.11 Adam
2.12 Validation Data
2.12.1 K-fold Cross-Validation

Having been selected as a reviewer for this book, I revisited the material, and thanks to its friendly examples and easy explanations I was able to organize the deep learning concepts on a more solid footing.

  • If textbook-style explanations feel heavy, this book is one I can recommend to beginners without hesitation!

Gradients and Optimization Techniques

1. What Is a Gradient?

The gradient of a multivariable function is the vector that gives the direction and magnitude of the function's steepest increase.

Ex. For a function f(x, y), the gradient at a given point is obtained from the partial derivatives with respect to each variable.

\nabla f(x, y) = \left[ \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right]

This gradient vector points in the direction in which the function increases fastest at that point.

(Left) ํ˜ํŽœํ•˜์ž„ ใ€ŽEasy! ๋”ฅ๋Ÿฌ๋‹ใ€, (Top) Gradient Descent, (Bottom) GD analogy - descending a mountain

Let's work through an example we can actually compute using the figure above.

Ex. Consider the function f(x, y) = x^2 + y^2.

  • This function attains its minimum at the origin (0, 0).
  • In other words, to reach the minimum we are after, we have to move toward the origin.

1. Computing the gradient

  • As defined above, the direction of steepest increase at the current position (x, y) is given by the partial derivatives of f(x, y) = x^2 + y^2, namely [2x, 2y].

\nabla f(x, y) = \left[ \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right] = [2x, 2y]

  • โˆ‡f(x,y)=[2x,2y]\nabla f(x, y) = [2x, 2y]โˆ‡f(x,y)=[2x,2y]๋Š” (x,y)(x,y)(x,y)์—์„œ ํ•จ์ˆ˜๊ฐ€ ์ฆ๊ฐ€ํ•˜๋Š” ๋ฐฉํ–ฅ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

๐Ÿ“– (์ฐธ๊ณ ) Gradient Descent์˜ ๋ชฉํ‘œ๋Š” Loss Function์„ ์ตœ์†Œํ™”ํ•˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

  • ์šฐ๋ฆฌ๊ฐ€ ์–ด๋–ค ์ตœ์ ํ™” ๋ฌธ์ œ๋ฅผ ํ’€ ๋•Œ, ์˜ˆ๋ฅผ ๋“ค์–ด ๋จธ์‹ ๋Ÿฌ๋‹์—์„œ๋Š” ์†์‹ค ํ•จ์ˆ˜(Loss Function, L(x,y)) ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๋จธ์‹ ๋Ÿฌ๋‹์—์„œ๋Š” f(x,y)f(x,y)f(x,y)๊ฐ€ ์‹ค์ œ๋กœ ์†์‹ค ํ•จ์ˆ˜ L(x,y)L(x,y)L(x,y) ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค.
  • ๊ฒฐ๊ตญ, ์šฐ๋ฆฌ๋Š” ์ด(f(x,y)f(x,y)f(x,y), L(x,y)L(x,y)L(x,y))๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์„ ๋ชฉ์ ์œผ๋กœ ํ•ด์•ผํ•ฉ๋‹ˆ๋‹ค.

2. Gradient Descent ์ˆ˜ํ–‰

  • ์šฐ๋ฆฌ๋Š” ์†์‹ค์„ ์ค„์ด๊ณ  ์‹ถ์œผ๋ฏ€๋กœ ๊ทธ๋ž˜๋””์–ธํŠธ์˜ ๋ฐ˜๋Œ€ ๋ฐฉํ–ฅ์œผ๋กœ ์ด๋™ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

(x,y)โ†(x,y)โˆ’ฮฑโ‹…โˆ‡f(x,y)(x, y) \leftarrow (x, y) - \alpha \cdot \nabla f(x, y)(x,y)โ†(x,y)โˆ’ฮฑโ‹…โˆ‡f(x,y)

  • ์—ฌ๊ธฐ์„œ ฮฑ\alphaฮฑ๋Š” ํ•™์Šต๋ฅ (learning rate)๋กœ, ์–ผ๋งˆ๋‚˜ ํฌ๊ฒŒ ์ด๋™ํ• ์ง€๋ฅผ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค.

3. ์—…๋ฐ์ดํŠธ ๊ณผ์ •

  • ์˜ˆ๋ฅผ ๋“ค์–ด, ์ดˆ๊ธฐ๊ฐ’์ด (1,1)์ด๊ณ  ํ•™์Šต๋ฅ ์ด 0.1์ด๋ฉด, ์—…๋ฐ์ดํŠธ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค.

x = 1 - 0.1 \times 2(1) = 0.8

y = 1 - 0.1 \times 2(1) = 0.8

→ That is, we move from (1, 1) to (0.8, 0.8), and the loss decreases.

💡 (Summary) The gradient points in the direction of steepest increase, so moving in the opposite direction is the direction that minimizes the loss.
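
To double-check the arithmetic above, here is a minimal Python sketch (my own illustration, not code from the book) that runs a few gradient descent steps on f(x, y) = x^2 + y^2 starting from (1, 1) with learning rate 0.1.

```python
import numpy as np

def grad_f(p):
    # gradient of f(x, y) = x^2 + y^2 is [2x, 2y]
    return 2 * p

p = np.array([1.0, 1.0])  # initial point (1, 1)
alpha = 0.1               # learning rate

for step in range(1, 4):
    p = p - alpha * grad_f(p)  # move against the gradient
    print(step, p)
# step 1 prints [0.8 0.8], matching the hand calculation above
```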

2. ์ตœ์ ํ™” ๊ธฐ๋ฒ•์ด๋ž€?

์ตœ์ ํ™”(Optimization)๋Š” ๋จธ์‹ ๋Ÿฌ๋‹ ๋ฐ ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ์ด ์†์‹ค ํ•จ์ˆ˜(Loss Function)๋ฅผ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๊ฐ€์ค‘์น˜(Weight)์™€ ํŽธํ–ฅ(Bias)์„ ์กฐ์ •ํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

โœ” ์™œ ์ตœ์ ํ™”๊ฐ€ ์ค‘์š”ํ•œ๊ฐ€?

  • ๋ชจ๋ธ์ด ๋” ๋‚˜์€ ์˜ˆ์ธก์„ ์ˆ˜ํ–‰ํ•˜๋„๋ก ๊ฐ€์ค‘์น˜๋ฅผ ์กฐ์ •
  • ์‹ ๊ฒฝ๋ง์˜ ํ•™์Šต ์†๋„๋ฅผ ๊ฐœ์„ 
  • ๋ฉ”๋ชจ๋ฆฌ์™€ ์—ฐ์‚ฐ ๋น„์šฉ์„ ์ค„์ด๊ณ  ํšจ์œจ์ ์œผ๋กœ ํ•™์Šต ๊ฐ€๋Šฅ

๊ธฐ๋ณธ์ ์ธ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•์œผ๋กœ๋Š” ์œ„์—์„œ ์†Œ๊ฐœํ•œ ๊ฐ€์žฅ ๊ธฐ๋ณธ์ ์ธ ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•(Gradient Descent, GD)์ด ์žˆ์ง€๋งŒ, ๋‹ค์–‘ํ•œ ๋ฌธ์ œ์ ์„ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๋ณ€ํ˜•๋œ ๋ฐฉ๋ฒ•๋“ค์ด ๋“ฑ์žฅํ–ˆ์Šต๋‹ˆ๋‹ค.

์ตœ์ ํ™” ๊ธฐ๋ฒ•๋“ค์€ ์•„๋ž˜์™€ ๊ฐ™์€ ๊ณตํ†ต์ ์ธ ๋ชฉํ‘œ๋ฅผ ๊ฐ€์ง€์˜ค ์žˆ์Šต๋‹ˆ๋‹ค.

๐Ÿ’Œ ์ตœ์ ํ™” ๊ธฐ๋ฒ•๋“ค์˜ ๊ณตํ†ต ๋ชฉํ‘œ

  1. ํ•™์Šต ์†๋„๋ฅผ ๋†’์ด๊ณ  ๋ถˆํ•„์š”ํ•œ ์—ฐ์‚ฐ์„ ์ค„์ด๋Š” ๊ฒƒ
  2. ๋” ๋น ๋ฅด๊ณ  ์•ˆ์ •์ ์œผ๋กœ ์ตœ์ ํ•ด์— ๋„๋‹ฌํ•˜๋Š” ๊ฒƒ
  3. ์ง„๋™์„ ์ค„์ด๊ณ  ํšจ์œจ์ ์ธ ๋ฐฉํ–ฅ์œผ๋กœ ์ด๋™ํ•˜๋Š” ๊ฒƒ

๐Ÿฆพ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์˜ ๋ฐœ์ „ ๊ณผ์ •

์ดˆ๊ธฐ Gradient Descent ๋ฐฉ์‹์—์„œ ์ถœ๋ฐœํ•˜์—ฌ, ๋” ํšจ์œจ์ ์ธ ํ•™์Šต๊ณผ ์•ˆ์ •์ ์ธ ์ˆ˜๋ ด์„ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ๋ฐฉ๋ฒ•๋“ค์ด ๊ฐœ๋ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

  • ์ตœ์ ํ™” ๊ธฐ๋ฒ•์€ ๊ฐ๊ฐ์˜ ๋ฌธ์ œ์ ์„ ๋ณด์™„ํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ์ ์ง„์ ์œผ๋กœ ๋ฐœ์ „ํ•ด์™”์Šต๋‹ˆ๋‹ค.
์ตœ์ ํ™” ๊ธฐ๋ฒ• ๋“ฑ์žฅ ๋ฐฐ๊ฒฝ ์ฃผ์š” ํŠน์ง•
Gradient Descent (GD) 1847๋…„ Cauchy ์ „์ฒด ๋ฐ์ดํ„ฐ์…‹์„ ์ด์šฉํ•ด ์•ˆ์ •์ ์œผ๋กœ ํ•™์Šตํ•˜์ง€๋งŒ ์—ฐ์‚ฐ๋Ÿ‰์ด ๋งŽ์Œ
Stochastic Gradient Descent (SGD) 1951๋…„ Robbins & Monro ํ•˜๋‚˜์˜ ์ƒ˜ํ”Œ๋งŒ ์‚ฌ์šฉํ•˜์—ฌ ๋น ๋ฅด๊ฒŒ ํ•™์Šตํ•˜์ง€๋งŒ ์ง„๋™์ด ํผ
Mini-Batch Gradient Descent 1980๋…„๋Œ€ ์ „์ฒด ๋ฐ์ดํ„ฐ์™€ ์ƒ˜ํ”Œ์˜ ์ ˆ์ถฉ์•ˆ์œผ๋กœ ํšจ์œจ์ 
Momentum 1964๋…„ Polyak ์ง„๋™์„ ์ค„์ด๊ณ  ๋” ๋น ๋ฅด๊ฒŒ ์ˆ˜๋ ด
Nesterov Accelerated Gradient (NAG) 1983๋…„ Nesterov Momentum์˜ ๊ฐœ์„  ๋ฒ„์ „, ๋” ๋น ๋ฅด๊ฒŒ ์ˆ˜๋ ด
AdaGrad 2011๋…„ Duchi et al. ํฌ์†Œํ•œ ๋ฐ์ดํ„ฐ์— ๊ฐ•ํ•˜์ง€๋งŒ ํ•™์Šต๋ฅ  ๊ฐ์†Œ ๋ฌธ์ œ
RMSProp 2012๋…„ Hinton AdaGrad์˜ ๋ฌธ์ œ ํ•ด๊ฒฐ, ํ•™์Šต๋ฅ  ์กฐ์ ˆ ๊ฐ€๋Šฅ
AdaDelta 2012๋…„ Zeiler ํ•™์Šต๋ฅ  ๊ฐ์†Œ ๋ฌธ์ œ ํ•ด๊ฒฐ, ํ•™์Šต๋ฅ ์„ ๋™์ ์œผ๋กœ ์กฐ์ •
Adam 2014๋…„ Kingma & Ba Momentum๊ณผ RMSProp ๊ฒฐํ•ฉ, ๊ฐ€์žฅ ๋„๋ฆฌ ์‚ฌ์šฉ๋จ
NAdam 2016๋…„ Dozat Adam์— Nesterov Momentum ์ถ”๊ฐ€
  • ์•„๋ž˜ ๊ทธ๋ฆผ์€ ๋ณ„๋„์˜ ์ฐธ๊ณ  ์ž๋ฃŒ๋กœ ํ•˜์šฉํ˜ธ ๋‹˜์˜ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•๋ก  ๊ณ„๋ณด ์‹œ๊ฐํ™” ์ž๋ฃŒ ๊ณต์œ ๋“œ๋ฆฝ๋‹ˆ๋‹ค.
    • ํ•œ๋ˆˆ์— ์‚ดํŽด๋ณด๊ธฐ ์ข‹๊ฒŒ ์ •๋ฆฌ๊ฐ€ ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. (์•„๋ž˜ ๊ทธ๋ฆผ ์ฐธ๊ณ ) ๐Ÿ™Œ

์ถœ์ฒ˜: ํ•˜์šฉํ˜ธ ๋‹˜ SlideShare ์ž๋ฃŒ (https://www.slideshare.net/slideshow/ss-79607172/79607172#49)

์ด๋ฒˆ ํฌ์ŠคํŒ…์—์„œ 2.5 ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•, 2.7 ํ™•๋ฅ ์  ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•, 2.8 Mini-Batch Gradient Descent, 2.8 Mini-Batch Gradient Descent, 2.9 Momentum, 2.10 RMSProp, ๊ทธ๋ฆฌ๊ณ  2.11 Adam๊นŒ์ง€ Optimzier ์‹œ๋ฆฌ์ฆˆ๋ฅผ ๋ฌถ์–ด์„œ ์ •๋ฆฌ ๋ฐ ์‚ดํŽด๋ณด๋ ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. (+ NAG, AdaGrad, AdaDelta, NAdam ์ž์ฒด ์ถ”๊ฐ€)


์ตœ์ ํ™” ๊ธฐ๋ฒ• ์ •๋ฆฌ

1. Gradient Descent(GD, ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•) - (1847, Cauchy)

๋ฐฐ๊ฒฝ

Gradient Descent๋Š” ๋ฏธ๋ถ„ ๊ฐ€๋Šฅํ•œ ์—ฐ์† ํ•จ์ˆ˜์˜ ์ตœ์ ํ™” ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด 19์„ธ๊ธฐ ์ˆ˜ํ•™์ž Augustin-Louis Cauchy์— ์˜ํ•ด ์ฒ˜์Œ ์ œ์•ˆ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

  • ์ดํ›„ ์ปดํ“จํ„ฐ ๊ณผํ•™๊ณผ ๋จธ์‹ ๋Ÿฌ๋‹์—์„œ ์ตœ์ ํ™” ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ํ•ต์‹ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

๊ธฐ์—ฌ (Contribution)

  • ๋ฏธ๋ถ„ ๊ฐ€๋Šฅํ•œ ํ•จ์ˆ˜์˜ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•์œผ๋กœ์„œ ์†์‹ค ํ•จ์ˆ˜์˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ตœ์†Œ์ ์„ ์ฐพ๋Š” ๋ฐฉ๋ฒ•๋ก ์„ ํ™•๋ฆฝ.
  • ์„ ํ˜• ํšŒ๊ท€ ๋ฐ ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€์—์„œ ์†์‹ค ํ•จ์ˆ˜๋ฅผ ์ตœ์ ํ™”ํ•˜๋Š” ๊ฐ€์žฅ ๊ธฐ๋ณธ์ ์ธ ๋ฐฉ๋ฒ•.

์ˆ˜์‹

\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)

  • ฮธt\theta_tฮธtโ€‹: ttt๋ฒˆ์งธ ์—…๋ฐ์ดํŠธ ์‹œ์ ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ
  • ฮฑ\alphaฮฑ: ํ•™์Šต๋ฅ  (Learning Rate)
  • โˆ‡L(ฮธt)\nabla L(\theta_t)โˆ‡L(ฮธtโ€‹): ์†์‹ค ํ•จ์ˆ˜์˜ ๊ธฐ์šธ๊ธฐ

ํ•œ๊ณ„

  • ๋ชจ๋“  ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ•œ ๋ฒˆ์˜ ์—…๋ฐ์ดํŠธ๋ฅผ ์ˆ˜ํ–‰ํ•˜๋ฏ€๋กœ ์—ฐ์‚ฐ๋Ÿ‰์ด ํฌ๊ณ  ์†๋„๊ฐ€ ๋А๋ฆผ.
  • ๋ฐ์ดํ„ฐ์…‹์ด ์ปค์งˆ์ˆ˜๋ก ํ•™์Šต ์†๋„๊ฐ€ ๊ธ‰๊ฒฉํžˆ ๊ฐ์†Œ.

2. Stochastic Gradient Descent (SGD) - (1951, Robbins & Monro)

Background

The main problem with plain Gradient Descent (GD) is that it has to use the entire dataset, so the computation is heavy and updates are slow.

  • Stochastic Gradient Descent (SGD) was introduced to solve this.

Contribution

  • Updates using a single randomly chosen sample, drastically reducing the amount of computation.
  • Its stochastic nature increases the chance of escaping a local minimum and moving toward the global minimum.

Formula

\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t; x_i)

  • xix_ixiโ€‹: ๋žœ๋คํ•˜๊ฒŒ ์„ ํƒ๋œ ์ƒ˜ํ”Œ

ํ•œ๊ณ„

  • ๋งค ์—…๋ฐ์ดํŠธ๊ฐ€ ๋‹จ์ผ ์ƒ˜ํ”Œ์— ์˜ํ•ด ๊ฒฐ์ •๋˜๋ฏ€๋กœ ์ง„๋™(Oscillation)์ด ์‹ฌํ•  ์ˆ˜ ์žˆ์Œ.
  • ์†์‹ค ํ•จ์ˆ˜๊ฐ€ ๋ถˆ์•ˆ์ •ํ•˜๊ฒŒ ์›€์ง์ด๋ฉฐ ์ˆ˜๋ ด ์†๋„๊ฐ€ ์ผ์ •ํ•˜์ง€ ์•Š์Œ.

3. Mini-Batch Gradient Descent (1980s)

Background
Emerged as a compromise between the strengths and weaknesses of GD and SGD.

  • GD: stable convergence, but computationally expensive.
  • SGD: fast updates, but heavy oscillation.

Contribution

  • Secures the speed of SGD and the stability of GD at the same time.
  • Computes the gradient on a small mini-batch instead of the full dataset, balancing computation and stability.

Formula

\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t; X_{\text{mini-batch}})

  • Xmini-batchX_{\text{mini-batch}}Xmini-batchโ€‹: ๋žœ๋คํ•˜๊ฒŒ ์„ ํƒ๋œ ๋ฏธ๋‹ˆ๋ฐฐ์น˜ ๋ฐ์ดํ„ฐ์…‹

ํ•œ๊ณ„

  • ๋ฏธ๋‹ˆ๋ฐฐ์น˜ ํฌ๊ธฐ๋ฅผ ์ ์ ˆํžˆ ์กฐ์ •ํ•˜์ง€ ์•Š์œผ๋ฉด SGD์˜ ์ง„๋™ ๋ฌธ์ œ๋‚˜ GD์˜ ๋А๋ฆฐ ์†๋„ ๋ฌธ์ œ๊ฐ€ ์—ฌ์ „ํžˆ ๋ฐœ์ƒ.

4. Momentum (1964, Polyak)

Background
With SGD, updates are unstable: when the curvature of the loss changes sharply, convergence slows down or oscillation grows. Momentum improves on this by borrowing the concept of inertia from physics.

Contribution

  • Takes past gradient directions into account, reducing oscillation and speeding up convergence.
  • Moves faster on steep slopes and moderates its speed on flat stretches.

Formula

v_t = \beta v_{t-1} + (1 - \beta) \nabla L(\theta_t)

\theta_{t+1} = \theta_t - \alpha v_t

  • vtv_tvtโ€‹: ๊ธฐ์šธ๊ธฐ์˜ ์ด๋™ ํ‰๊ท  (์†๋„)
  • ฮฒ\betaฮฒ: ๋ชจ๋ฉ˜ํ…€ ๊ณ„์ˆ˜ (๋ณดํ†ต 0.9)

ํ•œ๊ณ„

  • ๋„ˆ๋ฌด ํฐ ๋ชจ๋ฉ˜ํ…€ ๊ฐ’์€ ์˜ค๋ฒ„์ŠˆํŒ…(Overshooting, ์ง€๋‚˜์นœ ์—…๋ฐ์ดํŠธ) ๋ฌธ์ œ๋ฅผ ์œ ๋ฐœํ•  ์ˆ˜ ์žˆ์Œ.

5. Nesterov Accelerated Gradient (NAG) (1983, Nesterov)

Background
The Momentum method can make inaccurate updates in regions where the gradient changes rapidly. NAG improves on this by computing the gradient at a more accurate, look-ahead position.

Contribution

  • Computes the gradient at the position reached after a preliminary momentum step, improving accuracy.
  • Makes convergence even faster.

Formula

v_t = \gamma v_{t-1} + \alpha \nabla L(\theta_t - \gamma v_{t-1})

\theta_{t+1} = \theta_t - v_t

Limitations

  • The updates become more precise, but the computation increases.

6. Adagrad (2011, Duchi, Hazan, Singer)

Background

  • With a fixed learning rate, some parameters get updated too much while others are not updated enough.
  • To address this, a method was introduced that adjusts the learning rate per parameter.

Contribution

  • Adapting the learning rate individually gives good performance on sparse data.

Formula

\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} \nabla L(\theta_t)

  • GtG_tGtโ€‹: ๊ณผ๊ฑฐ ๊ธฐ์šธ๊ธฐ์˜ ๋ˆ„์ ํ•ฉ

ํ•œ๊ณ„

  • ํ•™์Šต๋ฅ ์ด ๊ณ„์† ๊ฐ์†Œํ•˜์—ฌ, ๋‚˜์ค‘์—๋Š” ์—…๋ฐ์ดํŠธ๊ฐ€ ๊ฑฐ์˜ ์ด๋ฃจ์–ด์ง€์ง€ ์•Š๋Š” ๋ฌธ์ œ ๋ฐœ์ƒ.

7. RMSProp (2012, Hinton)

Background
Improves on Adagrad's decaying-learning-rate problem by giving larger weight to recent gradients.

Contribution

  • Weights recent gradients more heavily, solving Adagrad's training stagnation.

Formula

E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) g_t^2

\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} g_t

Limitations

  • On certain datasets the training speed can still slow down.

8. Adadelta (2012, Zeiler)

Background
Adagrad's main limitation is that the accumulated sum of squared gradients G_t keeps growing, so the learning rate keeps shrinking and training stalls.
Adadelta improves on this by preventing the learning rate from shrinking excessively.

Contribution

  1. Solves Adagrad's decaying-learning-rate problem
  • Instead of using the entire gradient history, only recent gradients are used to adjust the learning rate.
  2. Removes the learning-rate hyperparameter \alpha
    • The learning rate is adjusted automatically, so the user does not need to set one separately.

Formula
1) Maintain a moving average of the squared gradients

E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2

2) ์—…๋ฐ์ดํŠธ ํฌ๊ธฐ์— ๋Œ€ํ•œ ์ด๋™ ํ‰๊ท  ์œ ์ง€

E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1 - \gamma) \Delta\theta_t^2

3) ํŒŒ๋ผ๋ฏธํ„ฐ ์—…๋ฐ์ดํŠธ

\theta_{t+1} = \theta_t - \frac{\sqrt{E[\Delta\theta^2]_t + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} g_t

  • ฮณ\gammaฮณ (๋ณดํ†ต 0.9) : ๊ณผ๊ฑฐ ๊ธฐ์šธ๊ธฐ์˜ ์˜ํ–ฅ๋„๋ฅผ ์กฐ์ ˆํ•˜๋Š” ๊ฐ์‡  ๊ณ„์ˆ˜ (decay factor).
  • ฯต\epsilonฯต : ์ˆ˜์‹ ์•ˆ์ •์„ฑ์„ ์œ„ํ•œ ์ž‘์€ ๊ฐ’.

ํ•œ๊ณ„

  • ๊ณผ๊ฑฐ์˜ ๋ณ€ํ™”๋Ÿ‰์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•™์Šต๋ฅ ์„ ์กฐ์ •ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ๋น ๋ฅธ ๋ณ€ํ™”๊ฐ€ ํ•„์š”ํ•œ ๊ฒฝ์šฐ์—๋Š” ์ตœ์ ํ™” ์†๋„๊ฐ€ ๋‹ค์†Œ ๋А๋ ค์งˆ ์ˆ˜ ์žˆ์Œ.

9. Adam (2014, Kingma & Ba)

Background
Adam (Adaptive Moment Estimation) combines Momentum and RMSProp to adjust the learning rate, improving on the weaknesses of the earlier optimization algorithms.

  • Momentum: keeps a moving average of the gradients to reduce oscillation and converge quickly.
  • RMSProp: adapts the learning rate to avoid unnecessary updates.

Adam combines these two ideas into an algorithm that trains fast while remaining stable.

Contribution

  1. Combines the strengths of Momentum and RMSProp
  • Adjusts the learning rate dynamically for fast and stable training.
  2. Considers both the 1st moment and the 2nd moment of the gradients
    • Reflects not only the gradient itself but also its variance when adjusting the learning rate.

Formula
1) 1st moment (moving average of the gradients)

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t

2) 2nd moment (moving average of the squared gradients)

v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

3) Bias Correction
Adam applies correction factors to address the fact that m_t and v_t are biased toward 0 in the early steps.

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

4) Final update

\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ

  • ฮฒ1=0.9\beta_1 = 0.9ฮฒ1โ€‹=0.9 (๋ณดํ†ต) โ†’ 1์ฐจ ๋ชจ๋ฉ˜ํ…€์˜ ๊ฐ์‡  ๊ณ„์ˆ˜
  • ฮฒ2=0.999\beta_2 = 0.999ฮฒ2โ€‹=0.999 (๋ณดํ†ต) โ†’ 2์ฐจ ๋ชจ๋ฉ˜ํ…€์˜ ๊ฐ์‡  ๊ณ„์ˆ˜
  • ฯต=10โˆ’8\epsilon = 10^{-8}ฯต=10โˆ’8 (๋ณดํ†ต) โ†’ ์ˆ˜์‹ ์•ˆ์ •์„ฑ์„ ์œ„ํ•œ ์ž‘์€ ๊ฐ’

์žฅ์ 

  • ๋น ๋ฅด๊ณ  ์•ˆ์ •์ ์ธ ํ•™์Šต์ด ๊ฐ€๋Šฅ.
  • ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •์— ๋œ ๋ฏผ๊ฐํ•˜๋ฉฐ ๊ธฐ๋ณธ ์„ค์ •์œผ๋กœ๋„ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ.
  • ๋Œ€๋ถ€๋ถ„์˜ ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ์—์„œ ๊ธฐ๋ณธ ์˜ตํ‹ฐ๋งˆ์ด์ €๋กœ ์‚ฌ์šฉ๋จ.

ํ•œ๊ณ„

  • ์ผ๋ฐ˜ SGD๋ณด๋‹ค ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ(Generalization)์ด ๋–จ์–ด์งˆ ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ์Œ.
  • ์ผ๋ถ€ ๋ฌธ์ œ์—์„œ๋Š” ํ•™์Šต๋ฅ ์ด ์ง€๋‚˜์น˜๊ฒŒ ์ ์‘์ ์œผ๋กœ ์กฐ์ •๋˜์–ด ์ตœ์ ํ•ด ๊ทผ์ฒ˜์—์„œ ๊ณผ๋„ํ•œ ์Šคํ…์„ ๊ฐ€์งˆ ์ˆ˜ ์žˆ์Œ.

10. Nadam (2016, Dozat)

Background
Adam combines Momentum and RMSProp, but its momentum update computes the gradient at the current position before moving, which makes it harder to anticipate the optimal path.
Nadam (Nesterov-accelerated Adaptive Moment Estimation) addresses this by adding the idea of Nesterov Accelerated Gradient (NAG) to Adam.

Contribution

  1. Combines Adam's learning-rate adaptation with NAG's look-ahead, allowing even faster convergence.
  2. Moves first before computing the gradient, increasing the speed at which the optimum is reached.

Formula
1) Comparison with the Adam update
Adam's update:

\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

Nadam์˜ ์—…๋ฐ์ดํŠธ:

\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} (\beta_1 \hat{m}_t + (1 - \beta_1) g_t)

์ฆ‰, ๊ธฐ์กด์˜ Adam์—์„œ ๋ชจ๋ฉ˜ํ…€์„ ์กฐ๊ธˆ ๋” ์•ž๋‹น๊ฒจ ๋ฐ˜์˜ํ•˜์—ฌ ์ด๋™ํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

์žฅ์ 

  • Adam๋ณด๋‹ค ๋” ๋น ๋ฅด๊ณ  ์•ˆ์ •์ ์œผ๋กœ ์ˆ˜๋ ด.
  • ํŠนํžˆ, ๊ณก๋ฅ ์ด ํฐ ์˜์—ญ์—์„œ ๋”์šฑ ํšจ๊ณผ์ .

ํ•œ๊ณ„

  • Adam๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ, ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์ด ๋–จ์–ด์งˆ ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ์Œ.
  • ์ผ๋ถ€ ๋ฌธ์ œ์—์„œ๋Š” Adam๊ณผ ํฐ ์ฐจ์ด๊ฐ€ ๋‚˜์ง€ ์•Š์„ ์ˆ˜๋„ ์žˆ์Œ.

Conclusion

Adam is currently the most widely used optimizer in deep learning, with RMSProp, Nadam, and others chosen depending on the problem.
Each optimization algorithm has evolved by addressing the limitations of its predecessors, and new approaches will continue to be studied.


💡 This post was written as part of reviewer activities for ํ˜ํŽœํ•˜์ž„'s <Easy! ๋”ฅ๋Ÿฌ๋‹>.


