(Explanation Added) Q-Learning: Understanding the Core Concepts of Reinforcement Learning

Posted by Euisuk's Dev Log on January 31, 2025


Original post: https://velog.io/@euisuk-chung/์„ค๋ช…์ถ”๊ฐ€-Q-Learning-๊ฐ•ํ™”ํ•™์Šต์˜-ํ•ต์‹ฌ-๊ฐœ๋…๊ณผ-์ดํ•ด

ํ˜ํŽœํ•˜์ž„๋‹˜์˜ใ€ŽEasy! ๋”ฅ๋Ÿฌ๋‹ใ€์ฑ…์„ ๋ณด๋‹ค๋ณด๋ฉด ์—„์ฒญ ์ค‘์š”ํ•œ ๋‚ด์šฉ๋“ค์ด ์‰ฝ๊ฒŒ ํ’€์ด๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ฒ˜์Œ ๋ฐฐ์šฐ์‹œ๋Š” ๋ถ„๋“ค๋„ ์‰ฝ๊ฒŒ ๋”ฐ๋ผ์˜ค์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • ์•„๋ž˜ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์ด ์‰ฌ์šด ์˜ˆ์ œ์™€ ์šฉ์–ด ์„ค๋ช…์ด ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

์‚ฌ์ง„ ์ถœ์ฒ˜ : ์ฑ… ๋‚ด์šฉ ์ผ๋ถ€ ์‚ฌ์ง„ ์ง์ ‘ ์ดฌ์˜

๊ฐ•ํ™”ํ•™์Šต์— ๋ฌธ์™ธํ•œ์ด์—ˆ๋˜ ์ €๋„ ๊ฐœ๋…์— ๋Œ€ํ•ด์„œ ์‰ฝ๊ฒŒ ๋งฅ๋ฝ์„ ์žก์„ ์ˆ˜ ์žˆ์–ด์„œ ๊ฐœ์ธ์ ์œผ๋กœ ๋„ˆ๋ฌด ์œ ์ตํ•œ ์‹œ๊ฐ„์ด์—ˆ์Šต๋‹ˆ๋‹ค. (pg 29 - 37)

ํ•˜์ง€๋งŒ! ์ €๋Š” ์—ฌ๊ธฐ์„œ ๊ทธ์น˜๋ฉด ์•„์‰ฝ๊ธฐ ๋•Œ๋ฌธ์— ์ข€ ๋” ๊นŠ๊ฒŒ ์‚ฝ์งˆ(?) ์ข€ ๋” ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค ใ…Žใ…Ž

์ด๋ฏธ์ง€ ์ถœ์ฒ˜: ์ด๋ง๋…„ ์‹œ๋ฆฌ์ฆˆ (์ง์ ‘ ํŽธ์ง‘)

์ด๋ฒˆ ํฌ์ŠคํŒ…์—์„œ๋Š” ๊ฐ•ํ™”ํ•™์Šต ๋‚ด์šฉ์„ ์ •๋ฆฌํ•ด๋ณด๊ณ , Q-learning์— ๋Œ€ํ•ด์„œ ๋ณด๋‹ค ๋” ์ƒ์„ธํ•˜๊ฒŒ ๊ณต๋ถ€ํ•ด๋ณด๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

  1. Basic Concepts of Reinforcement Learning and Q-Learning

Reinforcement Learning (RL) is a family of algorithms in which an artificial intelligence (AI) agent learns optimal behavior while interacting with an environment.

Image source: https://minsuksung-ai.tistory.com/13

Among them, Q-Learning is a representative algorithm that can learn without knowing in advance how the environment works (its transition probabilities, reward function, and so on).

  • In other words, it operates by exploring the environment directly and learning the optimal behavior.
  • Through experience, it predicts the reward obtainable by taking a particular action in a particular state, and uses this to build progressively better policies.

๊ฐ•ํ™”ํ•™์Šต์ด๋ž€?

๊ฐ•ํ™”ํ•™์Šต์€ ๋ณด์ƒ์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ํ–‰๋™์„ ํ•™์Šตํ•˜๋Š” ๊ณผ์ •์ž…๋‹ˆ๋‹ค.

  • ์—์ด์ „ํŠธ(agent)๋Š” ํ™˜๊ฒฝ(environment)๊ณผ ์ƒํ˜ธ์ž‘์šฉํ•˜๋ฉด์„œ ์ƒํƒœ(state)๋ฅผ ๊ด€์ฐฐํ•˜๊ณ , ๊ฐ€๋Šฅํ•œ ํ–‰๋™(action) ์ค‘ ํ•˜๋‚˜๋ฅผ ์„ ํƒํ•˜์—ฌ ๋ณด์ƒ์„ ๋ฐ›์Šต๋‹ˆ๋‹ค.

์œ„์™€ ๊ฐ™์€ ๊ณผ์ •์„ ๋ฐ˜๋ณตํ•˜๋ฉด์„œ ์ตœ์ ์˜ ์ •์ฑ…(policy)์„ ํ•™์Šตํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

(๋ณด๋‹ค ์‰ฌ์šด ์ดํ•ด๋ฅผ ์œ„ํ•ด์„œ๋Š” ์ฑ…์„ ์ฐธ๊ณ ํ•ด์ฃผ์„ธ์š” )

์ด๋ฏธ์ง€ ์ถœ์ฒ˜ : R๋กœ ์‰ฝ๊ฒŒ ๋ฐฐ์šฐ๋Š” ๊ฐ•ํ™”ํ•™์Šต

๊ฐ•ํ™”ํ•™์Šต์˜ ํ•ต์‹ฌ ๊ฐœ๋…์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

  1. State (S): the state the agent currently occupies in the environment.
  2. Action (A): the choices the agent can perform in a given state.
  3. Reward (R): the reward received from the environment after performing a particular action.
  4. Episode: the sequence of steps from a start state to a goal state.
  5. Policy (ฯ€): the strategy for choosing an action in each state.
  6. State-Action Value Function (Q(S, A)): the expected long-term reward of performing a particular action in a particular state.
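
To ground these terms, here is a minimal sketch that maps each concept onto a 4x4 grid world; the layout (16 cells, four movement actions, goal at cell 15) is an illustrative assumption that happens to match the implementation example later in this post.

import numpy as np

# A minimal mapping of the concepts above onto a 4x4 grid world (illustrative assumption)
n_states = 16                              # State S: cells 0..15 of the grid
actions = ["up", "down", "left", "right"]  # Action A: the choices available in each state
goal_state = 15                            # an Episode ends once this cell is reached
Q = np.zeros((n_states, len(actions)))     # Q(S, A): one value per state-action pair

def reward(next_state):
    """Reward R: +1 only when the goal cell is reached."""
    return 1 if next_state == goal_state else 0

def greedy_policy(state):
    """Policy pi: a rule that maps a state to an action (here, greedy w.r.t. Q)."""
    return int(np.argmax(Q[state]))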

Q-Learning์€ ์ฑ…์—์„œ ์†Œ๊ฐœํ•˜๋Š” ํ•ต์‹ฌ ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฒ• ์ค‘ ํ•˜๋‚˜๋กœ, Q-Table์ด๋ผ๋Š” ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ์ƒํƒœ์—์„œ์˜ ํ–‰๋™ ๊ฐ€์น˜๋ฅผ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.

(For more. Q-Table์€ โ€œ6. Q-Table์ด๋ž€?โ€์„ ์ฐธ๊ณ ํ•ด์ฃผ์„ธ์š”)


  2. What is the Action-Value Function (Q-Value Function)?

Q-Learning์˜ ํ•ต์‹ฌ์€ ํ–‰๋™ ๊ฐ€์น˜ ํ•จ์ˆ˜(Q-Value Function)๋ฅผ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

  • ํ–‰๋™ ๊ฐ€์น˜ ํ•จ์ˆ˜ Q(S,A)Q(S, A)Q(S,A)๋Š” ํŠน์ • ์ƒํƒœ SSS (StateStateState)์—์„œ ํŠน์ • ํ–‰๋™ AAA (ActionActionAction)๋ฅผ ์„ ํƒํ–ˆ์„ ๋•Œ ๊ธฐ๋Œ€๋˜๋Š” ๋ˆ„์  ๋ณด์ƒ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

Q-Learning์—์„œ๋Š” Q-Table์ด๋ผ๋Š” ํ˜•ํƒœ๋กœ ๊ฐ ์ƒํƒœ์™€ ํ–‰๋™์— ๋Œ€ํ•œ ๊ฐ€์น˜๋ฅผ, Q-๊ฐ’์œผ๋กœ ์ €์žฅํ•˜๋ฉฐ, ํ•™์Šต์„ ํ†ตํ•ด ์ด ๊ฐ’์„ ์ง€์†์ ์œผ๋กœ ์—…๋ฐ์ดํŠธํ•ฉ๋‹ˆ๋‹ค.

  • Q-๊ฐ’์„ ๋ฐ˜๋ณต์ ์œผ๋กœ ์—…๋ฐ์ดํŠธํ•จ์œผ๋กœ์จ ์—์ด์ „ํŠธ๋Š” ๋” ๋†’์€ ๋ณด์ƒ์„ ๊ธฐ๋Œ€ํ•  ์ˆ˜ ์žˆ๋Š” ํ–‰๋™์„ ์„ ํƒํ•˜๋Š” ๊ฒฝํ–ฅ์„ ๊ฐ€์ง€๊ฒŒ ๋˜๋ฉฐ, ์ตœ์ ์˜ ์ •์ฑ…์„ ํ˜•์„ฑํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

  1. Q-Learning์˜ ํ•™์Šต ๊ณผ์ •

Q-Learning์˜ ํ•™์Šต์€ Q-Table์„ ๋ฐ˜๋ณต์ ์œผ๋กœ ์—…๋ฐ์ดํŠธํ•˜๋Š” ๊ณผ์ •์œผ๋กœ ์ง„ํ–‰๋ฉ๋‹ˆ๋‹ค.

ํ•™์Šต ๋‹จ๊ณ„๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

โœ”๏ธ Step 1: Q-Table ์ดˆ๊ธฐํ™”

  • ๋ชจ๋“  Q-๊ฐ’์„ 0์œผ๋กœ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.

$$Q(S, A) = 0 \quad \forall S, A$$

โœ”๏ธ Step 2: ํ–‰๋™ ์„ ํƒ (Exploration vs. Exploitation)

  • ์—์ด์ „ํŠธ๋Š” ํƒ์ƒ‰(Exploration)๊ณผ ํ™œ์šฉ(Exploitation)์„ ์กฐํ•ฉํ•˜์—ฌ ํ–‰๋™์„ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค.
  • ฯต\epsilonฯต-ํƒ์š•(ฯต\epsilonฯต-greedy) ์ •์ฑ…์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

$$\begin{cases} \text{with probability } (1 - \epsilon) & \Rightarrow \text{select the best known action (Exploitation)} \\ \text{with probability } \epsilon & \Rightarrow \text{select a random action (Exploration)} \end{cases}$$

โœ”๏ธ Step 3: ๋ณด์ƒ ๊ด€์ฐฐ ๋ฐ Q-๊ฐ’ ์—…๋ฐ์ดํŠธ

  • Q-๊ฐ’์€ ์•„๋ž˜ ๋ฒจ๋งŒ ๋ฐฉ์ •์‹(Bellman Equation)์„ ์ด์šฉํ•ด ๊ฐฑ์‹ ๋ฉ๋‹ˆ๋‹ค.

    Q(S,A)โ†Q(S,A)+ฮฑ[R+ฮณmaxโกaโ€ฒQ(Sโ€ฒ,aโ€ฒ)โˆ’Q(S,A)]Q(S, A) \leftarrow Q(S, A) + \alpha \Big[ R + \gamma \max_{aโ€™} Q(Sโ€™, aโ€™) - Q(S, A) \Big]Q(S,A)โ†Q(S,A)+ฮฑ[R+ฮณaโ€ฒmaxโ€‹Q(Sโ€ฒ,aโ€ฒ)โˆ’Q(S,A)]

    • Q(S, A): the Q-value of taking action A in the current state S
    • ฮฑ: the learning rate (between 0 and 1), which controls how strongly each update is applied
    • ฮณ: the discount factor (between 0 and 1)
    • R: the reward received after performing the current action
    • max_{a'} Q(S', a'): the Q-value of the best action in the next state S'

์ด ๊ณผ์ •์„ ์ถฉ๋ถ„ํžˆ ๋ฐ˜๋ณตํ•˜๋ฉด Q-๊ฐ’์ด ์ˆ˜๋ ดํ•˜๊ณ , ์ตœ์  ์ •์ฑ…์„ ์ฐพ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

โ“ (์ฐธ๊ณ ) ๋ฒจ๋งŒ ๋ฐฉ์ •์‹ (Bellman Equation) ์„ค๋ช…

  • ๋ฒจ๋งŒ ๋ฐฉ์ •์‹(Bellman Equation)์€ ์ตœ์  ์ •์ฑ…(Optimal Policy)์„ ์ฐพ๊ธฐ ์œ„ํ•ด ์ƒํƒœ(State)์™€ ํ–‰๋™(Action) ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ์ˆ˜์‹ํ™”ํ•œ ์‹์ž…๋‹ˆ๋‹ค.
    • ์ด๋Š” Q-Learning๊ณผ ๊ฐ™์€ ๊ฐ•ํ™”ํ•™์Šต์—์„œ Q-๊ฐ’์„ ์—…๋ฐ์ดํŠธํ•˜๋Š” ํ•ต์‹ฌ ์›๋ฆฌ๋กœ ์ž‘์šฉํ•ฉ๋‹ˆ๋‹ค.
    • ๋ฒจ๋งŒ ๋ฐฉ์ •์‹์€ ํ˜„์žฌ ์ƒํƒœ์—์„œ์˜ ์ตœ์  Q-๊ฐ’์„ ๋ฏธ๋ž˜์˜ ๊ธฐ๋Œ€ ๋ณด์ƒ(Discounted Future Reward)์œผ๋กœ ํ‘œํ˜„ํ•œ ์žฌ๊ท€ ๋ฐฉ์ •์‹์ž…๋‹ˆ๋‹ค.

  4. The ฮต-greedy Technique and the Exploration & Exploitation Strategy

In Q-Learning, the balance between exploration and exploitation matters.

  • With too little exploration the agent may fail to find the optimal actions; with too much, convergence to the optimal policy can slow down.

To address this, the ฮต-greedy strategy is used.

  • ํƒ์ƒ‰(Exploration):

    • ์ƒˆ๋กœ์šด ํ–‰๋™์„ ์‹œ๋„ํ•˜์—ฌ ๋” ๋‚˜์€ ๋ณด์ƒ์„ ์ฐพ์Œ.
    • ๋ฏธ์ง€์˜ ์˜์—ญ์„ ํƒํ—˜ํ•˜๋ฉฐ, ์ƒˆ๋กœ์šด ์ „๋žต์„ ๋ฐœ๊ฒฌํ•  ๊ฐ€๋Šฅ์„ฑ์„ ๋†’์ž„.
  • ํ™œ์šฉ(Exploitation):

    • ํ˜„์žฌ๊นŒ์ง€ ํ•™์Šตํ•œ Q-๊ฐ’์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์ตœ์ ์˜ ํ–‰๋™์„ ์„ ํƒํ•จ.
    • ์ง€๊ธˆ๊นŒ์ง€์˜ ๊ฒฝํ—˜์„ ๋ฐ”ํƒ•์œผ๋กœ ๊ฐ€์žฅ ๋†’์€ ๋ณด์ƒ์„ ์ฃผ๋Š” ์„ ํƒ์ง€๋ฅผ ์ทจํ•จ.

Image Source: https://rogermartin.medium.com/balancing-exploration-and-exploitation

The ฮต-greedy technique

The ฮต-greedy technique is designed to explore with a certain probability (ฮต) and to select the currently best action with the remaining probability (1 โˆ’ ฮต).

  • It is common to explore more in the early stages and gradually increase the exploitation ratio as learning progresses, as in the sketch below.

(Note) Is the ฮต-greedy technique on-policy?

=> To give the conclusion first: NO!! ฮต-greedy as used in Q-Learning belongs to off-policy learning.

  • In RL, a policy (ฯ€) is the rule by which the agent selects a particular action (A) in a particular state (S).
    • In other words, it is the function that decides "which action should I take in the current state?"
  • ฮต-greedy is a methodology for selecting actions in a particular way; it is not itself a way of learning the policy.
    • That is, its role is to control "how much should I explore?"; it does not learn which action is optimal.
๋น„๊ต ํ•ญ๋ชฉ On-Policy Learning Off-Policy Learning
ํ•™์Šตํ•˜๋Š” ์ •์ฑ…๊ณผ ์‹คํ–‰ํ•˜๋Š” ์ •์ฑ… ๋™์ผํ•จ (๊ฐ™์€ ์ •์ฑ…์„ ํ•™์Šต) ๋‹ค๋ฆ„ (์ตœ์  ์ •์ฑ…์„ ํ•™์Šต)
ํƒ์ƒ‰๊ณผ ํ™œ์šฉ์˜ ๊ด€๊ณ„ ํƒ์ƒ‰๊ณผ ํ™œ์šฉ์ด ํ•จ๊ป˜ ์ด๋ฃจ์–ด์ง ํƒ์ƒ‰์„ ํ†ตํ•œ ๊ฒฝํ—˜์„ ํ™œ์šฉํ•˜์—ฌ ์ตœ์  ํ–‰๋™ ํ•™์Šต
๊ฒฝํ—˜ ์žฌ์‚ฌ์šฉ(Experience Replay) ์‚ฌ์šฉ ์–ด๋ ค์›€ ์‚ฌ์šฉ ๊ฐ€๋Šฅ (๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ ๊ฐ€๋Šฅ)
์˜ˆ์ œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ SARSA, PPO Q-Learning, DQN

  5. The Role of the Discount Factor

In Q-Learning, the discount factor (ฮณ) is applied when weighing the value of future rewards against current rewards.

  • The higher the discount factor, the more distant future rewards are taken into account; the lower it is, the more weight is placed on nearby rewards.

    • The closer ฮณ is to 1 → long-term rewards are emphasized (e.g., planning strategy for a marathon)
    • The closer ฮณ is to 0 → immediate rewards are emphasized (e.g., a gambling strategy chasing instant gains)

Typically, ฮณ is set somewhere between 0.9 and 0.99.

  • A high discount factor takes long-term goals into account, while a low one focuses on immediate results.

This setting can vary with the nature of the problem and its goals.

  1. Q-Table์ด๋ž€?

Q-Table์€ ๊ฐ ์ƒํƒœ(state)์—์„œ ๊ฐ€๋Šฅํ•œ ํ–‰๋™(action)์— ๋Œ€ํ•œ ์˜ˆ์ƒ ๋ณด์ƒ ๊ฐ’(Q-Value)์„ ์ €์žฅํ•˜๋Š” ํ…Œ์ด๋ธ”์ž…๋‹ˆ๋‹ค.

  • Q-Learning ์•Œ๊ณ ๋ฆฌ์ฆ˜์—์„œ๋Š” ์ด ํ…Œ์ด๋ธ”์„ ์ ์ง„์ ์œผ๋กœ ์—…๋ฐ์ดํŠธํ•˜๋ฉด์„œ ์ตœ์ ์˜ ํ–‰๋™์„ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.

Q-Table์˜ ๊ตฌ์กฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

์ƒํƒœ(State) ํ–‰๋™ 1 ํ–‰๋™ 2 ํ–‰๋™ 3 ํ–‰๋™ 4
S1 0.5 0.2 -0.1 0.0
S2 0.0 0.8 0.3 -0.5
S3 -0.3 0.7 0.5 0.2

์—ฌ๊ธฐ์„œ ๊ฐ ์…€์˜ ๊ฐ’(Q-Value)์€ ํ•ด๋‹น ์ƒํƒœ์—์„œ ํŠน์ • ํ–‰๋™์„ ์ˆ˜ํ–‰ํ–ˆ์„ ๋•Œ ๊ธฐ๋Œ€ํ•  ์ˆ˜ ์žˆ๋Š” ๋ณด์ƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

  • ํ•™์Šต์ด ์ง„ํ–‰๋จ์— ๋”ฐ๋ผ Q-Table์˜ ๊ฐ’์ด ์ ์  ๋” ์ •ํ™•ํ•œ ๋ณด์ƒ ์˜ˆ์ธก๊ฐ’์œผ๋กœ ์ˆ˜๋ ดํ•ฉ๋‹ˆ๋‹ค.

  1. Q-Learning์˜ ๊ตฌํ˜„ ์˜ˆ์ œ

(์ฐธ๊ณ ์šฉ) ์•„๋ž˜๋Š” ๊ฐ„๋‹จํ•œ Q-Learning ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•˜๋Š” Python ์ฝ”๋“œ์ž…๋‹ˆ๋‹ค.

import numpy as np

# Environment setup
n_states = 16  # number of states
n_actions = 4  # number of actions (up, down, left, right)
goal_state = 15  # goal state

# Initialize the Q-Table
Q_table = np.zeros((n_states, n_actions))

# Hyperparameters
learning_rate = 0.8
discount_factor = 0.95
exploration_prob = 0.2
epochs = 1000

# Q-Learning algorithm
for epoch in range(epochs):
    current_state = np.random.randint(0, n_states)  # start from a random state

    while current_state != goal_state:
        # Select an action (epsilon-greedy strategy)
        if np.random.rand() < exploration_prob:
            action = np.random.randint(0, n_actions)  # exploration
        else:
            action = np.argmax(Q_table[current_state])  # exploitation

        # Move to the next state in the environment (simplified transition)
        next_state = (current_state + 1) % n_states

        # Define the reward (granted when the goal state is reached)
        reward = 1 if next_state == goal_state else 0

        # Update the Q-value (applying the Bellman equation)
        Q_table[current_state, action] += learning_rate * \
            (reward + discount_factor * np.max(Q_table[next_state]) - Q_table[current_state, action])

        current_state = next_state  # move to the next state

# Print the learned Q-Table
print("Learned Q-Table:")
print(Q_table)

Running the code above, you can watch the Q-Table being learned and converging toward the optimal actions.


  8. The Difference Between Model-Based RL and Q-Learning

Q-Learning is a Model-Free RL technique.

  • This means it learns the optimal behavior directly from experience, without needing to know the environment's transition probabilities or reward function in advance.
  • In contrast, Model-Based RL learns (or is given) a model of how the environment behaves and uses that model to derive the optimal policy.
๋น„๊ต ํ•ญ๋ชฉ Model-Free RL (Q-Learning) Model-Based RL
ํ™˜๊ฒฝ ๋ชจ๋ธ ์ €์žฅ ์—ฌ๋ถ€ โŒ (์ €์žฅ ์•ˆ ํ•จ) โœ… (ํ™˜๊ฒฝ ๋ชจ๋ธ ์ €์žฅ)
ํ•™์Šต ๋ฐฉ์‹ ์ง์ ‘ ๊ฒฝํ—˜์„ ํ†ตํ•ด ํ•™์Šต ํ™˜๊ฒฝ ๋ชจ๋ธ์„ ํ™œ์šฉํ•œ ์˜ˆ์ธก
์˜ˆ์ธก ๊ฐ€๋Šฅ ์—ฌ๋ถ€ โŒ (๋ฏธ๋ž˜ ์ƒํƒœ ์˜ˆ์ธก ๋ถˆ๊ฐ€) โœ… (ํ™˜๊ฒฝ ๋ชจ๋ธ์„ ํ†ตํ•ด ์˜ˆ์ธก ๊ฐ€๋Šฅ)
ํ•™์Šต ๋น„์šฉ ์ƒ๋Œ€์ ์œผ๋กœ ์ ์Œ ๋ชจ๋ธ ํ•™์Šต๊ณผ ์˜ˆ์ธก ๋น„์šฉ์ด ํผ
์˜ˆ์ œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ Q-Learning, SARSA ๋‹ค์ด๋‚ด๋ฏน ํ”„๋กœ๊ทธ๋ž˜๋ฐ(DP), ๋ชฌํ…Œ์นด๋ฅผ๋กœ ํŠธ๋ฆฌ ํƒ์ƒ‰(MCTS)

Q-Learning์ด Model-Free์ธ ์ด์œ 

  • Q-Learning์€ ํ™˜๊ฒฝ์˜ ๋™์ž‘ ๋ชจ๋ธ(์ „์ด ํ™•๋ฅ  ๋ฐ ๋ณด์ƒ ํ•จ์ˆ˜)์„ ์ง์ ‘ ํ•™์Šตํ•˜์ง€ ์•Š๊ณ , ํ™˜๊ฒฝ๊ณผ์˜ ์ƒํ˜ธ์ž‘์šฉ์„ ํ†ตํ•ด ์ตœ์ ์˜ ํ–‰๋™์„ ์ฐพ์Šต๋‹ˆ๋‹ค.

    • (1) ๊ฒฝํ—˜์„ ํ†ตํ•ด Q-๊ฐ’์„ ์—…๋ฐ์ดํŠธํ•˜๋Š” ๋ฐฉ์‹์„ ์‚ฌ์šฉ.
    • (2) ํ™˜๊ฒฝ ๋ชจ๋ธ์„ ๋”ฐ๋กœ ๊ตฌ์ถ•ํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ ํ•™์Šต์ด ๋” ์œ ์—ฐํ•˜๊ณ  ๋‹จ์ˆœ.
  • ๋ฐ˜๋ฉด, Model-Based RL์—์„œ๋Š” ํ™˜๊ฒฝ์˜ ์ „์ด ํ™•๋ฅ ๊ณผ ๋ณด์ƒ์„ ๋ชจ๋ธ๋งํ•˜์—ฌ ๋ฏธ๋ฆฌ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ, ์‹ค์ œ ํ™˜๊ฒฝ๊ณผ์˜ ์ƒํ˜ธ์ž‘์šฉ ์—†์ด๋„ ํ•™์Šต์„ ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

(์ฐธ๊ณ ) Q-Learning์€ ํ™˜๊ฒฝ ๋ชจ๋ธ ์—†์ด๋„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๊ฒŒ์ž„ AI, ๋กœ๋ด‡ ์ œ์–ด, ๋„คํŠธ์›Œํฌ ์ตœ์ ํ™” ๋“ฑ ๋‹ค์–‘ํ•œ ์‹ค์ œ ๋ฌธ์ œ์—์„œ ๋„๋ฆฌ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.


  1. Q-Learning์˜ ํ•œ๊ณ„์ (?)

๋ฌด์ž‘์ • ๋ชจ๋“  ์ƒํƒœ(state)์— ๋Œ€ํ•ด Q-Table์„ ์ƒ์„ฑํ•˜๋Š” ๊ฒƒ์€ ๋‚ญ๋น„์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • ํŠนํžˆ ์ƒํƒœ ๊ณต๊ฐ„(state space)์ด ๋งค์šฐ ํฌ๊ฑฐ๋‚˜ ์—ฐ์†์ ์ธ ๊ฒฝ์šฐ, ํ…Œ์ด๋ธ” ๊ธฐ๋ฐ˜ ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋ฉ”๋ชจ๋ฆฌ์™€ ๊ณ„์‚ฐ ๋น„์šฉ์ด ๊ณผ๋„ํ•˜๊ฒŒ ์ฆ๊ฐ€ํ•˜๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด๋Ÿฌํ•œ ๋‹จ์ ์„ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ์—ฐ๊ตฌ๋“ค์ด ์ง„ํ–‰๋˜์—ˆ์œผ๋ฉฐ, ํ•œ๊ฐ€์ง€๋กœ ํ•จ์ˆ˜ ๊ทผ์‚ฌ(Function Approximation) ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์„ ์˜ˆ๋กœ ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๐Ÿ’ก What is function approximation?

  • Instead of storing the Q-Table directly, a neural network or linear regression model is used to approximate it as a function (a small sketch follows the figure credit below).
    • Representative algorithm: Deep Q-Network (DQN)
      • Advantage: keeps memory usage down and generalizes even when the state space is large
      • Disadvantage: training the neural network takes time, and learning can be less stable

์ด๋ฏธ์ง€ ์ถœ์ฒ˜ : ์ด๊ฒƒ์ €๊ฒƒ ํ…Œํฌ ๋ธ”๋กœ๊ทธ - DQN

(์ฐธ๊ณ ) Q-Learning์€ ๊ณ ์ „ ๊ฐ•ํ™”ํ•™์Šต(classic reinforcement learning) ๊ธฐ๋ฒ• ์ค‘ ํ•˜๋‚˜๋กœ ๊ฐ„์ฃผ๋ฉ๋‹ˆ๋‹ค.

  • (WHY?) ์™œ๋ƒํ•˜๋ฉด Q-learning์€ ๊ฐ•ํ™”ํ•™์Šต์ด ๋ณธ๊ฒฉ์ ์œผ๋กœ ์—ฐ๊ตฌ๋˜๊ธฐ ์‹œ์ž‘ํ•œ 1980~1990๋…„๋Œ€์— ๋“ฑ์žฅํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋ฉฐ, ๋น„๊ต์  ๋‹จ์ˆœํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ ์ƒํƒœ-ํ–‰๋™(State-Action) ๊ฐ’(Q-Value)์„ ํ•™์Šตํ•˜๋Š” ๋ฐฉ์‹์ด๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

๐Ÿ’ก The Difference Between Q-Learning and Deep Q-Networks

๋น„๊ต ํ•ญ๋ชฉ Q-Learning Deep Q-Network (DQN)
ํ‘œํ˜„ ๋ฐฉ์‹ Q-Table(ํ…Œ์ด๋ธ” ์ €์žฅ) ์‹ ๊ฒฝ๋ง(Deep Neural Network)
์ƒํƒœ ๊ณต๊ฐ„ ์ž‘๊ฑฐ๋‚˜ ์ด์‚ฐ์ (Discrete) ์—ฐ์†์ ์ธ ๊ณต๊ฐ„๋„ ํ•™์Šต ๊ฐ€๋Šฅ
ํ•™์Šต ๋ฐฉ๋ฒ• ๊ฒฝํ—˜์„ ํ†ตํ•œ Q-๊ฐ’ ์—…๋ฐ์ดํŠธ ์‹ ๊ฒฝ๋ง์„ ํ†ตํ•ด Q-๊ฐ’ ๊ทผ์‚ฌ
ํ•™์Šต ํšจ์œจ์„ฑ ์ž‘์€ ํ™˜๊ฒฝ์—์„œ๋Š” ํšจ์œจ์  ๋ณต์žกํ•œ ํ™˜๊ฒฝ์—์„œ๋„ ํ™œ์šฉ ๊ฐ€๋Šฅ
๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰ ์ƒํƒœ๊ฐ€ ๋งŽ์•„์งˆ์ˆ˜๋ก ์ฆ๊ฐ€ ์ƒํƒœ ๊ณต๊ฐ„์ด ์ปค๋„ ํ•™์Šต ๊ฐ€๋Šฅ

โœ… Summary

  • Q-Learning is a classic reinforcement learning algorithm that uses a relatively simple table-based approach.
  • With the arrival of neural-network-based algorithms such as Deep Q-Network (DQN), it has evolved into a more scalable form.

  10. Conclusion

In this post, we went over the reinforcement learning material I wanted to explore further while reading ํ˜ํŽœํ•˜์ž„'s ใ€ŽEasy! ๋”ฅ๋Ÿฌ๋‹ใ€, and took a closer look at Q-Learning.

Q-Learning is one of the most fundamental and powerful algorithms in reinforcement learning, capable of learning the optimal behavior that maximizes reward.

Summary of Q-Learning's characteristics

  • Learns using a Q-Table
  • Updates Q-values based on the Bellman equation
  • Uses the ฮต-greedy policy to keep exploration and exploitation in balance

The Q in Q-Learning is the symbol for the "Action-Value", and it is generally considered to originate from the Bellman equation.

Once you have a solid grasp of Q-Learning, you can extend it to Deep Q-Networks (DQN), the centerpiece of modern deep reinforcement learning, and tackle far more complex problems. (Though this is as far as I go for now.)

I hope this serves as a basis for even deeper study!

Thank you for reading ๐Ÿ˜Ž


