[๊ฐœ๋…์ •๋ฆฌ] ์‹œ๊ณ„์—ด ์ธ๊ณผ๊ด€๊ณ„ ๋ถ„์„: Granger Causality

Posted by Euisuk's Dev Log on July 6, 2024

[๊ฐœ๋…์ •๋ฆฌ] ์‹œ๊ณ„์—ด ์ธ๊ณผ๊ด€๊ณ„ ๋ถ„์„: Granger Causality

์›๋ณธ ๊ฒŒ์‹œ๊ธ€: https://velog.io/@euisuk-chung/๊ฐœ๋…์ •๋ฆฌ-์‹œ๊ณ„์—ด-๋ถ„์„-์ธ๊ณผ๊ด€๊ณ„-๋ถ„์„

์•ˆ๋…•ํ•˜์„ธ์š”๐Ÿ˜Š ์ง€๋‚œ ๋ฒˆ ํฌ์ŠคํŠธ์—์„œ โ€œ์ธ๊ณผ๊ด€๊ณ„์™€ ์ƒ๊ด€๊ด€๊ณ„์˜ ๊ฐœ๋…โ€(๋งํฌ)์— ๋Œ€ํ•ด์„œ ์‚ดํŽด๋ณด์•˜๋Š”๋ฐ์š”.

์˜ค๋Š˜์€ ์‹œ๊ณ„์—ด ๋ถ„์„์˜ ํ•ต์‹ฌ ๊ฐœ๋… ์ค‘ ํ•˜๋‚˜์ธ โ€˜๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„(Granger causality)โ€™์— ๋Œ€ํ•ด ์•Œ์•„๋ณด๊ณ , ์ด๋ฅผ ์‹ค์ œ ๋ฐ์ดํ„ฐ์— ์ ์šฉํ•ด๋ณด๋Š” ์‹œ๊ฐ„์„ ๊ฐ€์ ธ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

์‹ค์Šต์— ์‚ฌ์šฉํ•  ๋ฐ์ดํ„ฐ๋Š” Kaggle ๋ฐ์ดํ„ฐ ์ค‘ ํ•˜๋‚˜์ธ Advertising Sales Dataset(๋งํฌ)๋กœ, ํ•ด๋‹น ๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•ด ๊ด‘๊ณ  ์˜ˆ์‚ฐ์ด ๋งค์ถœ์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ํ™•์ธํ•ด๋ณด์‹ค ์ˆ˜ ์žˆ์„ ๊ฒ๋‹ˆ๋‹คโญ

  1. ์ธ๊ณผ๊ด€๊ณ„ vs ์ƒ๊ด€๊ด€๊ณ„

๋จผ์ € ๋ณต์Šต ์ฐจ์›์—์„œ ๊ฐ„๋‹จํ•˜๊ฒŒ ์ธ๊ณผ๊ด€๊ณ„์™€ ์ƒ๊ด€๊ด€๊ณ„์˜ ์ฐจ์ด๋ถ€ํ„ฐ ์•Œ์•„๋ด…์‹œ๋‹ค.

์ธ๊ณผ๊ด€๊ณ„ (Causal Relationship)

  • ์ •์˜: ํ•œ ์‚ฌ๊ฑด์ด ๋‹ค๋ฅธ ์‚ฌ๊ฑด์„ ์ง์ ‘์ ์œผ๋กœ ์•ผ๊ธฐํ•˜๋Š” ๊ด€๊ณ„
  • ์˜ˆ: ํก์—ฐ์ด ํ์•” ๋ฐœ๋ณ‘๋ฅ ์„ ์ฆ๊ฐ€์‹œํ‚ด

์ƒ๊ด€๊ด€๊ณ„ (Correlation)

  • ์ •์˜: ๋‘ ๋ณ€์ˆ˜ ๊ฐ„์˜ ์„ ํ˜•์  ๊ด€๋ จ์„ฑ์˜ ์ •๋„
  • ์˜ˆ: ์•„์ด์Šคํฌ๋ฆผ ํŒ๋งค๋Ÿ‰๊ณผ ๋ฒ”์ฃ„์œจ์ด ํ•จ๊ป˜ ์ฆ๊ฐ€ํ•˜๋Š” ๊ฒฝํ–ฅ (์‹ค์ œ ์›์ธ์€ ๋”์šด ๋‚ ์”จ์ผ ์ˆ˜ ์žˆ์Œ)

์ฐจ์ด์ 

  • ๋ฐฉํ–ฅ์„ฑ: ์ธ๊ณผ๊ด€๊ณ„๋Š” ๋ฐฉํ–ฅ์ด ์žˆ์ง€๋งŒ, ์ƒ๊ด€๊ด€๊ณ„๋Š” ๋ฐฉํ–ฅ์„ฑ์ด ์—†์Šต๋‹ˆ๋‹ค.
  • ์›์ธ๊ณผ ๊ฒฐ๊ณผ: ์ธ๊ณผ๊ด€๊ณ„๋Š” ์›์ธ๊ณผ ๊ฒฐ๊ณผ๋ฅผ ๋ช…ํ™•ํžˆ ๊ตฌ๋ถ„ํ•˜์ง€๋งŒ, ์ƒ๊ด€๊ด€๊ณ„๋Š” ๊ทธ๋ ‡์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
  1. ๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„

๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„(Granger causality)๋Š” 1969๋…„ ํด๋ผ์ด๋ธŒ ๊ทธ๋žœ์ €๊ฐ€ ์ œ์•ˆํ•œ ๊ฐœ๋…์œผ๋กœ, ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ์—์„œ์˜ ์˜ˆ์ธก ๊ฐ€๋Šฅ์„ฑ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ ์ธ๊ณผ๊ด€๊ณ„๋ฅผ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„์˜ ์ •์˜

(Def) X์˜ ๊ณผ๊ฑฐ ์ •๋ณด๊ฐ€ Y์˜ ๋ฏธ๋ž˜ ๊ฐ’์„ ์˜ˆ์ธกํ•˜๋Š” ๋ฐ ํ†ต๊ณ„์ ์œผ๋กœ ์œ ์˜๋ฏธํ•œ ๋„์›€์„ ์ค€๋‹ค๋ฉด, โ€œX๋Š” Y๋ฅผ ๊ทธ๋žœ์ € ์ธ๊ณผํ•œ๋‹คโ€๊ณ  ๋งํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ’ฏ ์ด๋Š” ์ˆ˜ํ•™์ ์œผ๋กœ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

P(Y(t+1)โˆฃY(t),Y(tโˆ’1),โ€ฆ,X(t),X(tโˆ’1),โ€ฆ)โ‰ P(Y(t+1)โˆฃY(t),Y(tโˆ’1),โ€ฆ)P(Y(t+1) Y(t), Y(t-1), โ€ฆ, X(t), X(t-1), โ€ฆ) โ‰  P(Y(t+1) Y(t), Y(t-1), โ€ฆ)P(Y(t+1)โˆฃY(t),Y(tโˆ’1),โ€ฆ,X(t),X(tโˆ’1),โ€ฆ)๎€ โ€‹=P(Y(t+1)โˆฃY(t),Y(tโˆ’1),โ€ฆ)

์ด๋•Œ PPP๋Š” ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

  1. ์ขŒ๋ณ€ P(Y(t+1) Y(t), Y(t-1), ..., X(t), X(t-1), ...):
    • Y์˜ ๋ฏธ๋ž˜ ๊ฐ’ Y(t+1)์„ ์˜ˆ์ธกํ•  ๋•Œ, Y์˜ ๊ณผ๊ฑฐ ๊ฐ’๋“ค(Y(t), Y(t-1), ...)๊ณผ X์˜ ๊ณผ๊ฑฐ ๊ฐ’๋“ค(X(t), X(t-1), ...)์„ ๋ชจ๋‘ ์‚ฌ์šฉํ•œ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
  2. ์šฐ๋ณ€ P(Y(t+1) Y(t), Y(t-1), ...):
    • Y์˜ ๋ฏธ๋ž˜ ๊ฐ’ Y(t+1)์„ ์˜ˆ์ธกํ•  ๋•Œ, Y์˜ ๊ณผ๊ฑฐ ๊ฐ’๋“ค(Y(t), Y(t-1), ...)๋งŒ์„ ์‚ฌ์šฉํ•œ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
  3. ๋ถ€๋“ฑํ˜ธ (โ‰ ):

    • ๋‘ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ์ด ์„œ๋กœ ๋‹ค๋ฅด๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ’ก ์œ„ ์ˆ˜์‹์˜ ์˜๋ฏธ:

  • ๋งŒ์•ฝ X๊ฐ€ Y๋ฅผ ๊ทธ๋žœ์ € ์ธ๊ณผํ•œ๋‹ค๋ฉด, X์˜ ๊ณผ๊ฑฐ ๊ฐ’๋“ค์„ ํฌํ•จํ•˜์—ฌ ์˜ˆ์ธกํ•œ Y์˜ ๋ฏธ๋ž˜ ๊ฐ’(์ขŒ๋ณ€)์ด Y์˜ ๊ณผ๊ฑฐ ๊ฐ’๋“ค๋งŒ์œผ๋กœ ์˜ˆ์ธกํ•œ ๊ฒฝ์šฐ(์šฐ๋ณ€)์™€ ๋‹ค๋ฅผ ๊ฒƒ์ž…๋‹ˆ๋‹ค.
  • ์ฆ‰, X์˜ ์ •๋ณด๊ฐ€ Y์˜ ์˜ˆ์ธก์— ์œ ์˜๋ฏธํ•œ ์˜ํ–ฅ์„ ์ค€๋‹ค๋Š” ๊ฒƒ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

๐Ÿ’ก ์‹ค์ œ ์ ์šฉ:

  • ์ด ๊ฐœ๋…์„ ํ†ต๊ณ„์ ์œผ๋กœ ๊ฒ€์ •ํ•˜๊ธฐ ์œ„ํ•ด, ๋ณดํ†ต ์„ ํ˜• ํšŒ๊ท€ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜์—ฌ ๋‘ ๊ฒฝ์šฐ(X ํฌํ•จ vs X ๋ฏธํฌํ•จ)์˜ ์˜ˆ์ธก ์˜ค์ฐจ๋ฅผ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค.
  • F-๊ฒ€์ •์ด๋‚˜ ๋‹ค๋ฅธ ํ†ต๊ณ„์  ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ๋‘ ๋ชจ๋ธ ๊ฐ„์˜ ์ฐจ์ด๊ฐ€ ์œ ์˜๋ฏธํ•œ์ง€ ํŒ๋‹จํ•ฉ๋‹ˆ๋‹ค.

โš ๏ธ ์ฃผ์˜์ :

  • ์ด๋Š” ์˜ˆ์ธก ๊ฐ€๋Šฅ์„ฑ์— ๊ธฐ๋ฐ˜ํ•œ ์ธ๊ณผ๊ด€๊ณ„์ด๋ฉฐ, ์ง„์ •ํ•œ ์ธ๊ณผ๊ด€๊ณ„๋ฅผ ์™„๋ฒฝํžˆ ์ฆ๋ช…ํ•˜์ง€๋Š” ์•Š์Šต๋‹ˆ๋‹ค.
  • ์ˆจ๊ฒจ์ง„ ๋ณ€์ˆ˜๋‚˜ ๊ณตํ†ต ์›์ธ ๋“ฑ์˜ ๊ฐ€๋Šฅ์„ฑ์„ ํ•ญ์ƒ ๊ณ ๋ คํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์ด์ฒ˜๋Ÿผ ์œ„ ์ˆ˜์‹์€ ๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„์˜ ๊ธฐ๋ณธ ์•„์ด๋””์–ด๋ฅผ ๊ฐ„๊ฒฐํ•˜๊ฒŒ ํ‘œํ˜„ํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ ๋ถ„์„์—์„œ ๋ณ€์ˆ˜ ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ์ดํ•ดํ•˜๋Š” ๋ฐ ์ค‘์š”ํ•œ ๋„๊ตฌ๋กœ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

  1. ๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„ vs ์ธ๊ณผ๊ด€๊ณ„

๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„์™€ ์ผ๋ฐ˜์ ์ธ ์ธ๊ณผ๊ด€๊ณ„๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ฐจ์ด๊ฐ€ ์กด์žฌํ•ฉ๋‹ˆ๋‹ค:

  1. ์ •์˜:

    • ์ธ๊ณผ๊ด€๊ณ„: A๊ฐ€ B์˜ ์ง์ ‘์ ์ธ ์›์ธ์ด ๋จ
    • ๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„: A์˜ ๊ณผ๊ฑฐ ์ •๋ณด๊ฐ€ B์˜ ๋ฏธ๋ž˜๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋จ
  2. ๊ธฐ๋ฐ˜:

    • ์ธ๊ณผ๊ด€๊ณ„: ๋…ผ๋ฆฌ์ , ์‹คํ—˜์  ์ฆ๊ฑฐ์— ๊ธฐ๋ฐ˜
    • ๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„: ํ†ต๊ณ„์  ์˜ˆ์ธก๋ ฅ์— ๊ธฐ๋ฐ˜
  3. ํ•ด์„:

    • ์ธ๊ณผ๊ด€๊ณ„: โ€œA๊ฐ€ B๋ฅผ ์•ผ๊ธฐํ•œ๋‹คโ€
    • ๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„: โ€œA๊ฐ€ B๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋œ๋‹คโ€
  4. ๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„ ๋ชจํ˜•

๋ชจํ˜•์˜ ์ „์ œ

  1. ์‹œ๊ฐ„ ์„ ํ›„๊ด€๊ณ„: ๊ณผ๊ฑฐ์˜ ์‚ฌ๊ฑด์€ ํ˜„์žฌ์˜ ์‚ฌ๊ฑด์„ ์œ ๋ฐœํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ๋ฏธ๋ž˜์˜ ์‚ฌ๊ฑด์€ ํ˜„์žฌ์˜ ์‚ฌ๊ฑด์„ ์œ ๋ฐœํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.
  2. ์ •์ƒ์„ฑ: ๋ฐ์ดํ„ฐ๋Š” ์ •์ƒ์„ฑ(stationary)์„ ๊ฐ€์ ธ์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์ด ์‹œ๊ฐ„์— ๋”ฐ๋ผ ์ผ์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  3. ์ž…๋ ฅ์‹œ์ฐจ: ์˜ˆ์ƒ๋˜๋Š” ์‹œ์ฐจ๋งŒํผ์˜ ๊ณผ๊ฑฐ ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ๋‘ ์ž…๋ ฅ๋ณ€์ˆ˜๋กœ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  4. ์ตœ์ข…์‹œ์ฐจ: ์ ์ ˆํ•œ ์‹œ์ฐจ๋ฅผ ์„ ํƒํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋ณดํ†ต ์—ฐ ํ™˜์‚ฐ ๋นˆ๋„์˜ 2~3๋ฐฐ๊นŒ์ง€ ๊ณ ๋ คํ•ฉ๋‹ˆ๋‹ค.

๋ชจํ˜•์˜ ๊ฒ€์ •

๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„ ๊ฒ€์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋‹จ๊ณ„๋กœ ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค:

  1. ๊ท€๋ฌด๊ฐ€์„ค ์„ค์ •: โ€œX๋Š” Y๋ฅผ ๊ทธ๋žœ์ € ์ธ๊ณผํ•˜์ง€ ์•Š๋Š”๋‹คโ€
  2. ์ œํ•œ ๋ชจ๋ธ๊ณผ ๋น„์ œํ•œ ๋ชจ๋ธ ๊ตฌ์ถ•:
    • ์ œํ•œ ๋ชจ๋ธ: Y(t) = ฮฑ + ฮฒ1Y(t-1) + ฮฒ2Y(t-2) + โ€ฆ + ฮฒp*Y(t-p) + ฮต
    • ๋น„์ œํ•œ ๋ชจ๋ธ: Y(t) = ฮฑ + ฮฒ1Y(t-1) + โ€ฆ + ฮฒpY(t-p) + ฮณ1X(t-1) + โ€ฆ + ฮณpX(t-p) + ฮต
  3. F-ํ†ต๊ณ„๋Ÿ‰ ๊ณ„์‚ฐ
  4. p-value ํ™•์ธ ๋ฐ ๊ฒฐ๋ก  ๋„์ถœ

ํŒŒ์ด์ฌ ์ฝ”๋“œ ์˜ˆ์‹œ

statsmodels ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„ ๊ฒ€์ •์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

# ์˜ˆ์‹œ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ
np.random.seed(1)
x = np.random.randn(1000)
y = np.random.randn(1000)
y[1:] += 0.5 * x[:-1]  # X๊ฐ€ Y๋ฅผ ๊ทธ๋žœ์ € ์ธ๊ณผํ•˜๋„๋ก ์„ค์ •

data = pd.DataFrame({'X': x, 'Y': y})

# ๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„ ๊ฒ€์ •
result = grangercausalitytests(data[['Y', 'X']], maxlag=5)

# ๊ฒฐ๊ณผ ์ถœ๋ ฅ
for lag, values in result.items():
    print(f"Lag {lag}:")
    print(f"  F-statistic: {values[0]['ssr_ftest'][0]}")
    print(f"  p-value: {values[0]['ssr_ftest'][1]}")
  1. ๋ชจํ˜•์˜ ์‘์šฉ

VAR์„ ํ™œ์šฉํ•œ ์‘์šฉ

VAR์ด๋ž€?

  • VAR(Vector Autoregression)์€ ์—ฌ๋Ÿฌ ๋ณ€์ˆ˜์˜ ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๋ฅผ ๋™์‹œ์— ๋ชจ๋ธ๋งํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.
  • ๊ฐ ๋ณ€์ˆ˜๊ฐ€ ์ž์‹ ์˜ ๊ณผ๊ฑฐ ๊ฐ’๊ณผ ๋‹ค๋ฅธ ๋ณ€์ˆ˜๋“ค์˜ ๊ณผ๊ฑฐ ๊ฐ’์— ์˜ํ•ด ์˜ํ–ฅ์„ ๋ฐ›๋Š”๋‹ค๊ณ  ๊ฐ€์ •ํ•ฉ๋‹ˆ๋‹ค.

VAR(p) ๋ชจ๋ธ (p์ฐจ ๋ฒกํ„ฐ ์ž๊ธฐํšŒ๊ท€ ๋ชจ๋ธ):

Yt=c+A1Ytโˆ’1+A2Ytโˆ’2+โ€ฆ+ApYtโˆ’p+ฯตtY_t = c + A_1Y_{t-1} + A_2Y_{t-2} + โ€ฆ + A_pY_{t-p} + \epsilon_tYtโ€‹=c+A1โ€‹Ytโˆ’1โ€‹+A2โ€‹Ytโˆ’2โ€‹+โ€ฆ+Apโ€‹Ytโˆ’pโ€‹+ฯตtโ€‹

์—ฌ๊ธฐ์„œ:

  • YtY_tYtโ€‹ ๋Š” ์‹œ์  t์—์„œ์˜ k์ฐจ์› ์‹œ๊ณ„์—ด ๋ฒกํ„ฐ
  • ccc ๋Š” k์ฐจ์› ์ƒ์ˆ˜ ๋ฒกํ„ฐ
  • AiA_iAiโ€‹ ๋Š” k ร— k ๊ณ„์ˆ˜ ํ–‰๋ ฌ
  • ฯตt\epsilon_tฯตtโ€‹ ๋Š” k์ฐจ์› ๋ฐฑ์ƒ‰ ์žก์Œ ๋ฒกํ„ฐ

VAR์„ ์–ด๋–ป๊ฒŒ ํ™œ์šฉํ•ด์„œ ์‘์šฉํ•˜๋Š”๊ฐ€?

  1. ๋ณ€์ˆ˜ ๊ฐ„ ๋™์  ๊ด€๊ณ„ ๋ถ„์„: VAR ๋ชจ๋ธ์€ ์—ฌ๋Ÿฌ ๋ณ€์ˆ˜ ๊ฐ„์˜ ์ƒํ˜ธ์ž‘์šฉ์„ ๋™์‹œ์— ๊ณ ๋ คํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  2. ์ถฉ๊ฒฉ๋ฐ˜์‘ํ•จ์ˆ˜ ๋ถ„์„: ํ•œ ๋ณ€์ˆ˜์˜ ๋ณ€ํ™”๊ฐ€ ๋‹ค๋ฅธ ๋ณ€์ˆ˜์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ์„ ์‹œ๊ฐ„์— ๋”ฐ๋ผ ์ถ”์ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  3. ์˜ˆ์ธก: ์—ฌ๋Ÿฌ ๋ณ€์ˆ˜์˜ ๋ฏธ๋ž˜ ๊ฐ’์„ ๋™์‹œ์— ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํŒŒ์ด์ฌ ์ฝ”๋“œ ์˜ˆ์‹œ:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import adfuller

# ๋ฐ์ดํ„ฐ ์ •์ƒํ™” (ํ•„์š”ํ•œ ๊ฒฝ์šฐ)
def difference(data):
    return data.diff().dropna()

# ADF ํ…Œ์ŠคํŠธ๋กœ ์ •์ƒ์„ฑ ํ™•์ธ
def adf_test(series):
    result = adfuller(series)
    print(f'ADF Statistic: {result[0]}')
    print(f'p-value: {result[1]}')

# VAR ๋ชจ๋ธ ์ ์šฉ
model = VAR(data)
results = model.fit(maxlags=15, ic='aic')

# ๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„ ๊ฒ€์ •
granger_results = results.test_causality('Y', 'X', kind='f')
print(granger_results.summary())

ARIMA๋ฅผ ํ™œ์šฉํ•œ ์‘์šฉ

ARIMA๋ž€?

  • ARIMA(AutoRegressive Integrated Moving Average)๋Š” ๋‹จ์ผ ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ๋ธ๋งํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.
  • ์ž๊ธฐํšŒ๊ท€(AR), ์ฐจ๋ถ„(I), ์ด๋™ํ‰๊ท (MA) ์š”์†Œ๋ฅผ ๊ฒฐํ•ฉํ•˜์—ฌ ๋ณต์žกํ•œ ์‹œ๊ณ„์—ด ํŒจํ„ด์„ ํฌ์ฐฉํ•ฉ๋‹ˆ๋‹ค.

ARIMA(p,d,q) ๋ชจ๋ธ:

ฯ•(B)(1โˆ’B)dXt=ฮธ(B)ฯตt\phi(B)(1-B)^d X_t = \theta(B)\epsilon_tฯ•(B)(1โˆ’B)dXtโ€‹=ฮธ(B)ฯตtโ€‹

์—ฌ๊ธฐ์„œ:

  • BBB ๋Š” ํ›„ํ–‰ ์—ฐ์‚ฐ์ž (BXt=Xtโˆ’1BX_t = X_{t-1}BXtโ€‹=Xtโˆ’1โ€‹)
  • ddd ๋Š” ์ฐจ๋ถ„ ์ฐจ์ˆ˜
  • ฯ•(B)=1โˆ’ฯ•1Bโˆ’ฯ•2B2โˆ’โ€ฆโˆ’ฯ•pBp\phi(B) = 1 - \phi_1B - \phi_2B^2 - โ€ฆ - \phi_pB^pฯ•(B)=1โˆ’ฯ•1โ€‹Bโˆ’ฯ•2โ€‹B2โˆ’โ€ฆโˆ’ฯ•pโ€‹Bp (AR ๋‹คํ•ญ์‹)
  • ฮธ(B)=1+ฮธ1B+ฮธ2B2+โ€ฆ+ฮธqBq\theta(B) = 1 + \theta_1B + \theta_2B^2 + โ€ฆ + \theta_qB^qฮธ(B)=1+ฮธ1โ€‹B+ฮธ2โ€‹B2+โ€ฆ+ฮธqโ€‹Bq (MA ๋‹คํ•ญ์‹)
  • ฯตt\epsilon_tฯตtโ€‹ ๋Š” ๋ฐฑ์ƒ‰ ์žก์Œ ๊ณผ์ •

ARIMA๋ฅผ ์–ด๋–ป๊ฒŒ ํ™œ์šฉํ•ด์„œ ์‘์šฉํ•˜๋Š”๊ฐ€?

  1. ๋‹จ๋ณ€๋Ÿ‰ ์‹œ๊ณ„์—ด ๋ถ„์„: ๊ฐ ๋ณ€์ˆ˜์˜ ์ž์ฒด์ ์ธ ์‹œ๊ณ„์—ด ํŠน์„ฑ์„ ์ž์„ธํžˆ ๋ถ„์„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  2. ์˜ˆ์ธก: ๋‹จ์ผ ๋ณ€์ˆ˜์˜ ๋ฏธ๋ž˜ ๊ฐ’์„ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  3. ์ด์ƒ์น˜ ํƒ์ง€: ๋ชจ๋ธ์˜ ์˜ˆ์ธก๊ฐ’๊ณผ ์‹ค์ œ๊ฐ’์˜ ์ฐจ์ด๋ฅผ ํ†ตํ•ด ์ด์ƒ์น˜๋ฅผ ํƒ์ง€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํŒŒ์ด์ฌ ์ฝ”๋“œ ์˜ˆ์‹œ:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
from statsmodels.tsa.arima.model import ARIMA
import matplotlib.pyplot as plt

# ARIMA ๋ชจ๋ธ ์ ์šฉ
model = ARIMA(data['Y'], order=(1,1,1))
results = model.fit()

# ์˜ˆ์ธก
forecast = results.forecast(steps=10)

# ๊ฒฐ๊ณผ ์‹œ๊ฐํ™”
plt.figure(figsize=(12,6))
plt.plot(data['Y'], label='Observed')
plt.plot(forecast, label='Forecast')
plt.legend()
plt.show()

๋ชจํ˜•์˜ ํ•ด์„

๊ฒ€์ • ๊ฒฐ๊ณผ ํ•ด์„:

  1. X๊ฐ€ Y์— ์ธ๊ณผ์˜ํ–ฅ์„ ์ฃผ๊ณ , Y๋Š” X์— ์ธ๊ณผ์˜ํ–ฅ์„ ์ฃผ์ง€ ์•Š๋Š” ๊ฒฝ์šฐ: X๊ฐ€ Y์˜ ์ธ๊ณผ์š”์ธ์ผ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์Šต๋‹ˆ๋‹ค.

  2. Y๊ฐ€ X์— ์ธ๊ณผ์˜ํ–ฅ์„ ์ฃผ๊ณ , X๋Š” Y์— ์ธ๊ณผ์˜ํ–ฅ์„ ์ฃผ์ง€ ์•Š๋Š” ๊ฒฝ์šฐ: Y๊ฐ€ X์˜ ์ธ๊ณผ์š”์ธ์ผ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์Šต๋‹ˆ๋‹ค.

  3. X์™€ Y๊ฐ€ ์„œ๋กœ ์ธ๊ณผ์˜ํ–ฅ์„ ์ฃผ๋Š” ๊ฒฝ์šฐ: ์ œ3์˜ ์™ธ๋ถ€๋ณ€์ˆ˜๊ฐ€ ์˜ํ–ฅ์„ ์คฌ์„ ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

  4. X์™€ Y๊ฐ€ ์„œ๋กœ ์ธ๊ณผ์˜ํ–ฅ์„ ์ฃผ์ง€ ์•Š๋Š” ๊ฒฝ์šฐ: ๋‘ ๋ณ€์ˆ˜ ๊ฐ„ ์ธ๊ณผ๊ด€๊ณ„๊ฐ€ ์—†๊ฑฐ๋‚˜, ๋‹ค๋ฅธ ํ˜•ํƒœ์˜ ๊ด€๊ณ„์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  5. ์‹ค์ œ ๋ฐ์ดํ„ฐ ๋ถ„์„์˜ˆ์ œ

์ด์ œ ์‹ค์ œ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„, VAR, ARIMA ๋ชจ๋ธ์„ ์ ์šฉํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

๋ฐ์ดํ„ฐ ์†Œ๊ฐœ

์šฐ๋ฆฌ๊ฐ€ ์‚ฌ์šฉํ•  ๋ฐ์ดํ„ฐ์…‹์€ TV, ๋ผ๋””์˜ค, ์‹ ๋ฌธ ๊ด‘๊ณ  ์˜ˆ์‚ฐ๊ณผ ๊ทธ์— ๋”ฐ๋ฅธ ๋งค์ถœ ๋ฐ์ดํ„ฐ๋ฅผ ํฌํ•จํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด 200๊ฐœ์˜ ๊ด€์ธก์น˜๊ฐ€ ์žˆ์œผ๋ฉฐ, ๊ฐ ๊ด€์ธก์น˜๋Š” ๋‹ค์Œ ๋ณ€์ˆ˜๋“ค๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

  1. TV Ad Budget ($): TV ๊ด‘๊ณ  ์˜ˆ์‚ฐ
  2. Radio Ad Budget ($): ๋ผ๋””์˜ค ๊ด‘๊ณ  ์˜ˆ์‚ฐ
  3. Newspaper Ad Budget ($): ์‹ ๋ฌธ ๊ด‘๊ณ  ์˜ˆ์‚ฐ
  4. Sales ($): ๋งค์ถœ

๋ฐ์ดํ„ฐ ์ค€๋น„ ๋ฐ ๋ถ„์„

๋จผ์ €, ํ•„์š”ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์ž„ํฌํŠธํ•˜๊ณ  ๋ฐ์ดํ„ฐ๋ฅผ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค.

ํ•„์š” ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋กœ๋“œ

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# ํ•„์š” ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋กœ๋“œ
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

from statsmodels.tsa.api import VAR
import statsmodels.tsa.api as smt
from statsmodels.tsa.stattools import grangercausalitytests
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

๋ฐ์ดํ„ฐ ๋กœ๋“œ

1
2
3
4
5
6
# ๋ฐ์ดํ„ฐ ๋กœ๋“œ (๋ฐ์ดํ„ฐ๋Š” ์ง์ ‘ ๋‹ค์šด ๋ฐ›์•„์ฃผ์„ธ์š”)
# https://www.kaggle.com/datasets/yasserh/advertising-sales-dataset
df = pd.read_csv("./Data/Advertising Budget and Sales.csv")

print(df.shape)
df.head()

์ •๊ทœํ™” ๋ฐ ์‹œ๊ฐํ™”

1
2
3
4
5
6
7
variables = ['TV Ad Budget ($)', 'Radio Ad Budget ($)', 'Newspaper Ad Budget ($)', 'Sales ($)']

# ๊ฐ ๊ฐ’์˜ ์ฒซ ๊ฐ’์œผ๋กœ ์ •๊ทœํ™”
df['Sales ($)'] = df['Sales ($)'] / df['Sales ($)'].iloc[0]
df['TV Ad Budget ($)'] = df['TV Ad Budget ($)'] / df['TV Ad Budget ($)'].iloc[0]
df['Radio Ad Budget ($)'] = df['Radio Ad Budget ($)'] / df['Radio Ad Budget ($)'].iloc[0]
df['Newspaper Ad Budget ($)'] = df['Newspaper Ad Budget ($)'] / df['Newspaper Ad Budget ($)'].iloc[0]

โ“ ์–ด? ์ œ๊ฐ€ ์•„๋Š” ์ •๊ทœํ™”๋ž‘ ์กฐ๊ธˆ ๋‹ค๋ฅธ๋ฐ์š”? => ๋งž์Šต๋‹ˆ๋‹ค

  • ์ด๋Š” ์ƒ๋Œ€์  ๋ณ€ํ™” ์ •๊ทœํ™”(๋˜๋Š” ๊ธฐ์ค€์  ์ •๊ทœํ™”)๋ผ๊ณ  ๋ถˆ๋ฆฌ๋ฉฐ, ์ผ๋ฐ˜์ ์ธ ์ •๊ทœํ™” ๋ฐฉ๋ฒ•๊ณผ๋Š” ๋‹ค๋ฅธ ๋ชฉ์ ๊ณผ ํŠน์ง•์„ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ด ๋ฐฉ๋ฒ•์˜ ์ฃผ์š” ํŠน์ง•๊ณผ ์ผ๋ฐ˜ ์ •๊ทœํ™”์™€์˜ ์ฐจ์ด์ ์„ ์„ค๋ช…๋“œ๋ฆฌ๊ฒ ์Šต๋‹ˆ๋‹ค:

๐Ÿ”ถ ์ƒ๋Œ€์  ๋ณ€ํ™” ์ •๊ทœํ™”์˜ ๋ชฉ์ 

  • ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ ๋ถ„์„: ์‹œ๊ฐ„์— ๋”ฐ๋ฅธ ๋ณ€ํ™”๋ฅผ ๋” ์‰ฝ๊ฒŒ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ์ดˆ๊ธฐ ๊ฐ’ ๋Œ€๋น„ ์ฆ๊ฐ€ ๋˜๋Š” ๊ฐ์†Œ๋ฅผ ์ง๊ด€์ ์œผ๋กœ ์ดํ•ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๋ณ€์ˆ˜ ๊ฐ„ ์ƒ๋Œ€์  ๋ณ€ํ™” ๋น„๊ต: ์„œ๋กœ ๋‹ค๋ฅธ ๋‹จ์œ„๋‚˜ ์Šค์ผ€์ผ์„ ๊ฐ€์ง„ ๋ณ€์ˆ˜๋“ค์˜ ๋ณ€ํ™”๋ฅผ ๋น„๊ตํ•˜๊ธฐ ์‰ฝ๊ฒŒ ๋งŒ๋“ญ๋‹ˆ๋‹ค.
  • ๊ทธ๋žœ์ € ์ธ๊ณผ์„ฑ ๋ถ„์„์— ์œ ์šฉ: ๋ณ€์ˆ˜๋“ค ๊ฐ„์˜ ์ƒ๋Œ€์ ์ธ ๋ณ€ํ™”๋ฅผ ๋น„๊ตํ•˜๊ธฐ ์‰ฝ๊ฒŒ ๋งŒ๋“ค์–ด ์ธ๊ณผ๊ด€๊ณ„ ๋ถ„์„์— ๋„์›€์„ ์ค๋‹ˆ๋‹ค.

๐Ÿ”ท ์ผ๋ฐ˜ ์ •๊ทœํ™”์™€์˜ ์ฐจ์ด์ 

  • ๋ฐ์ดํ„ฐ ๋ณ€ํ™˜ ๊ด€์ :

    • ์ƒ๋Œ€์  ๋ณ€ํ™” ์ •๊ทœํ™”: ์ฒซ ๋ฒˆ์งธ ๊ฐ’์„ ๊ธฐ์ค€์œผ๋กœ ๋ชจ๋“  ๊ฐ’์„ ๋‚˜๋ˆ•๋‹ˆ๋‹ค.

    • ์ผ๋ฐ˜ ์ •๊ทœํ™”(์˜ˆ: Min-Max): ์ „์ฒด ๋ฐ์ดํ„ฐ ๋ฒ”์œ„๋ฅผ ํŠน์ • ๋ฒ”์œ„(์˜ˆ: 0-1)๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค.

  • ๊ฒฐ๊ณผ ํ•ด์„ ๊ด€์ :

    • ์ƒ๋Œ€์  ๋ณ€ํ™” ์ •๊ทœํ™”: ์ดˆ๊ธฐ ๊ฐ’ ๋Œ€๋น„ ๋ณ€ํ™”์œจ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

    • ์ผ๋ฐ˜ ์ •๊ทœํ™”: ์ „์ฒด ๋ฐ์ดํ„ฐ ๋ฒ”์œ„ ๋‚ด์—์„œ์˜ ์ƒ๋Œ€์  ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

  • ๋ฐ์ดํ„ฐ ๋ถ„ํฌ ๊ด€์ :

    • ์ƒ๋Œ€์  ๋ณ€ํ™” ์ •๊ทœํ™”: ์›๋ž˜ ๋ฐ์ดํ„ฐ์˜ ์ƒ๋Œ€์  ๋ณ€ํ™” ํŒจํ„ด์„ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค.

    • ์ผ๋ฐ˜ ์ •๊ทœํ™”: ๋ฐ์ดํ„ฐ์˜ ์ „์ฒด์ ์ธ ๋ถ„ํฌ๋ฅผ ๋ณ€๊ฒฝํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

1
2
df.plot(figsize=(15,3))
plt.show()

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# matplotlib์˜ ๊ธฐ๋ณธ ์ƒ‰์ƒ ์ˆœ์„œ ์‚ฌ์šฉ
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

# ๊ฐ ๋ณ€์ˆ˜๋ฅผ ๋ณ„๋„์˜ ์„œ๋ธŒํ”Œ๋กฏ์œผ๋กœ ๊ทธ๋ฆฌ๊ธฐ
fig, axes = plt.subplots(len(variables), 1, figsize=(15, 3*len(variables)))
fig.suptitle('Normalized Values Over Time (Subplots)', fontsize=16)

for i, var in enumerate(variables):
    axes[i].plot(df.index, df[var], label=var, color=colors[i])
    axes[i].set_title(var)
    axes[i].set_xlabel('Index')
    axes[i].set_ylabel('Normalized Value')
    axes[i].legend()
    axes[i].grid(True)

    # y์ถ• ๋ฒ”์œ„ ์„ค์ • (0๋ถ€ํ„ฐ ๋ฐ์ดํ„ฐ์˜ ์ตœ๋Œ€๊ฐ’๊นŒ์ง€)
    axes[i].set_ylim(0, df[var].max() * 1.1)  # ์ตœ๋Œ€๊ฐ’์— 10% ์—ฌ์œ  ์ถ”๊ฐ€

plt.tight_layout()
plt.show()

์ •์ƒ์„ฑ ํ™•์ธ

๋ณธ๊ฒฉ์ ์œผ๋กœ ๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„ ๊ฒ€์ •์„ ์ˆ˜ํ–‰ํ•˜๊ธฐ์— ์•ž์„œ ๋ฐ์ดํ„ฐ๊ฐ€ ์ •์ƒ์„ฑ(stationary)์„ ๊ฐ€์ง€๋Š”์ง€ ํ™•์ธ์„ ํ•ด์•ผํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋ ‡์ง€ ์•Š๋‹ค๋ฉด ๋ณ„๋„๋กœ ์ฐจ๋ถ„์„ ์ˆ˜ํ–‰ํ•ด์ฃผ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

(์•ž์— ๊ฐœ๋… ๋‚ด์šฉ ์ฐธ๊ณ )

  1. ์ •์ƒ์„ฑ: ๋ฐ์ดํ„ฐ๋Š” ์ •์ƒ์„ฑ(stationary)์„ ๊ฐ€์ ธ์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์ด ์‹œ๊ฐ„์— ๋”ฐ๋ผ ์ผ์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
from statsmodels.tsa.stattools import adfuller, grangercausalitytests

# ADF ํ…Œ์ŠคํŠธ ํ•จ์ˆ˜ ์ •์˜
def adf_test(timeseries):
    result = adfuller(timeseries, autolag='AIC')
    return pd.Series({'Test Statistic': result[0], 'p-value': result[1], 'Critical Values': result[4]})

adf_results = {var: adf_test(df[var]) for var in variables}

# ADF ํ…Œ์ŠคํŠธ ๊ฒฐ๊ณผ ์ถœ๋ ฅ ๋ฐ ํ•ด์„
for var, result in adf_results.items():
    print(f"\nADF Test Results for {var}:")
    print(result)
    
    if result['p-value'] <= 0.05:
        print(f"ํ•ด์„: {var}๋Š” 5% ์œ ์˜์ˆ˜์ค€์—์„œ ์ •์ƒ ์‹œ๊ณ„์—ด์ž…๋‹ˆ๋‹ค.")
    else:
        print(f"ํ•ด์„: {var}๋Š” 5% ์œ ์˜์ˆ˜์ค€์—์„œ ๋น„์ •์ƒ ์‹œ๊ณ„์—ด์ž…๋‹ˆ๋‹ค.")

์•„๋ž˜ ADF ํ…Œ์ŠคํŠธ ๊ฒฐ๊ณผ๋ฅผ ๋ณด๋ฉด, ๋ชจ๋“  ๋ณ€์ˆ˜๋“ค์ด 5% ์œ ์˜์ˆ˜์ค€์—์„œ ์ •์ƒ ์‹œ๊ณ„์—ด๋กœ ๋‚˜ํƒ€๋‚ฌ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„ ๋ถ„์„์„ ์œ„ํ•ด ๋ณ„๋„์˜ ์ฐจ๋ถ„์ด ํ•„์š”ํ•˜์ง€ ์•Š๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
# ๊ฒฐ๊ณผ
ADF Test Results for TV Ad Budget ($):
Test Statistic                                            -14.158141
p-value                                                          0.0
Critical Values    {'1%': -3.4636447617687436, '5%': -2.876176117...
dtype: object
ํ•ด์„: TV Ad Budget ($)๋Š” 5% ์œ ์˜์ˆ˜์ค€์—์„œ ์ •์ƒ ์‹œ๊ณ„์—ด์ž…๋‹ˆ๋‹ค.

ADF Test Results for Radio Ad Budget ($):
Test Statistic                                            -14.129897
p-value                                                          0.0
Critical Values    {'1%': -3.4636447617687436, '5%': -2.876176117...
dtype: object
ํ•ด์„: Radio Ad Budget ($)๋Š” 5% ์œ ์˜์ˆ˜์ค€์—์„œ ์ •์ƒ ์‹œ๊ณ„์—ด์ž…๋‹ˆ๋‹ค.

ADF Test Results for Newspaper Ad Budget ($):
Test Statistic                                            -13.344195
p-value                                                          0.0
Critical Values    {'1%': -3.4636447617687436, '5%': -2.876176117...
dtype: object
ํ•ด์„: Newspaper Ad Budget ($)๋Š” 5% ์œ ์˜์ˆ˜์ค€์—์„œ ์ •์ƒ ์‹œ๊ณ„์—ด์ž…๋‹ˆ๋‹ค.

ADF Test Results for Sales ($):
Test Statistic                                            -13.990162
p-value                                                          0.0
Critical Values    {'1%': -3.4636447617687436, '5%': -2.876176117...
dtype: object
ํ•ด์„: Sales ($)๋Š” 5% ์œ ์˜์ˆ˜์ค€์—์„œ ์ •์ƒ ์‹œ๊ณ„์—ด์ž…๋‹ˆ๋‹ค.

๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„ ๊ฒ€์ •

๊ทธ๋ ‡๋‹ค๋ฉด ์ด์ œ ๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„ ๊ฒ€์ •์„ ํ†ตํ•ด ๊ฐ ๊ด‘๊ณ  ์ฑ„๋„์˜ ์˜ˆ์‚ฐ์ด ๋งค์ถœ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ํ™•์ธํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import grangercausalitytests
from statsmodels.tsa.api import VAR
from statsmodels.tools.eval_measures import aic, bic

def optimal_lag(data, max_lag):
    model = VAR(data)
    results = {}
    for lag in range(1, max_lag + 1):
        result = model.fit(lag)
        results[lag] = result.aic
    return min(results, key=results.get)

def granger_causality(data, variables, max_lag=12, alpha=0.05):
    results = {}
    for i in range(len(variables)):
        for j in range(len(variables)):
            if i != j:
                pair = (variables[i], variables[j])
                opt_lag = optimal_lag(data[[variables[i], variables[j]]], max_lag)
                granger_result = grangercausalitytests(data[[variables[i], variables[j]]], maxlag=opt_lag, verbose=False)
                p_values = [granger_result[lag][0]['ssr_ftest'][1] for lag in range(1, opt_lag+1)]
                
                results[pair] = {
                    'optimal_lag': opt_lag,
                    'p_values': p_values,
                    'significant': any(p < alpha for p in p_values)
                }
                
                plt.figure(figsize=(10, 6))
                plt.plot(range(1, opt_lag+1), p_values, marker='o')
                plt.axhline(y=alpha, color='r', linestyle='--')
                plt.title(f'Granger Causality: {variables[i]} -> {variables[j]}')
                plt.xlabel('Lag')
                plt.ylabel('p-value')
                plt.show()
    
    return results

# ๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„ ๋ถ„์„ ์ˆ˜ํ–‰
results = granger_causality(df, variables)

# ๊ฒฐ๊ณผ ์ถœ๋ ฅ
for pair, result in results.items():
    print(f"\nGranger Causality: {pair[0]} -> {pair[1]}")
    print(f"Optimal lag: {result['optimal_lag']}")
    print(f"Significant: {result['significant']}")
    print(f"p-values: {result['p_values']}")

๊ฒฐ๊ณผํ•ด์„

  1. ์œ ์˜๋ฏธํ•œ ๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„:
  • TV Ad Budget(X) -> Radio Ad Budget (Y): 2 lag์—์„œ ์œ ์˜๋ฏธํ•จ.
  • TV Ad Budget(X) -> Sales (Y): 2 lag์—์„œ ์œ ์˜๋ฏธํ•จ.
  1. ๋‹ค๋ฅธ ๊ด€๊ณ„๋“ค:
  • ๋Œ€๋ถ€๋ถ„์˜ ๊ด€๊ณ„๊ฐ€ ํ†ต๊ณ„์ ์œผ๋กœ ๋ฌด์˜๋ฏธํ•จ.
  1. ์ฃผ์š” ์ธ์‚ฌ์ดํŠธ:
  • TV ๊ด‘๊ณ  ์˜ˆ์‚ฐ์ด ๋ผ๋””์˜ค ๊ด‘๊ณ  ์˜ˆ์‚ฐ๊ณผ ๋งค์ถœ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์ž„.

์ด์ œ ์ด๋ฅผ ํ™œ์šฉํ•ด์„œ Sales Forecasting์„ ๊ตฌ์ถ•ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

  1. VAR ๋ชจ๋ธ(๋˜๋Š” ARIMA ๋ชจ๋ธ) ์‚ฌ์šฉ:
  • TV Ad Budget๊ณผ Sales๋ฅผ ํฌํ•จํ•œ VAR ๋ชจ๋ธ(๋˜๋Š” ARIMA ๋ชจ๋ธ) ๊ตฌ์ถ•
  • ์ตœ์  lag๋Š” 2๋กœ ์„ค์ •
  1. ํŠน์„ฑ ์„ ํƒ:
  • TV Ad Budget(X)์„ ๋…๋ฆฝ๋ณ€์ˆ˜, Sales(Y)๋ฅผ ์ข…์†๋ณ€์ˆ˜๋กœ ๋ชจ๋ธ๋งํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

Train/Test Split

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.api import VAR
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
from math import sqrt

# ๋ฐ์ดํ„ฐ ๋กœ๋“œ
df = pd.read_csv("./Data/Advertising Budget and Sales.csv")
df = df[['TV Ad Budget ($)', 'Sales ($)']]

df['Sales ($)'] = df['Sales ($)'] / df['Sales ($)'].iloc[0]
df['TV Ad Budget ($)'] = df['TV Ad Budget ($)'] / df['TV Ad Budget ($)'].iloc[0]

df.columns = ['tv', 'sales'] # ๊ฐ„๋‹จํ•˜๊ฒŒ ์ด๋ฆ„ ๋ณ€๊ฒฝ

# ๋ฐ์ดํ„ฐ ๋ถ„ํ•  (80% ํ›ˆ๋ จ, 20% ํ…Œ์ŠคํŠธ)
train_size = int(len(df) * 0.8)
train, test = df[:train_size], df[train_size:]

# ๊ทธ๋ž˜ํ”„ ์„ค์ •
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 10), sharex=True)
fig.suptitle('Train and Test Data Visualization')

# TV Ad Budget ๊ทธ๋ž˜ํ”„
ax1.plot(df.index[:train_size], train['tv'], label='Train')
ax1.plot(df.index[train_size:], test['tv'], label='Test')
ax1.axvline(x=train_size, color='r', linestyle='--', label='Train/Test Split')
ax1.set_ylabel('TV Ad Budget ($)')
ax1.legend()

# Sales ๊ทธ๋ž˜ํ”„
ax2.plot(df.index[:train_size], train['sales'], label='Train')
ax2.plot(df.index[train_size:], test['sales'], label='Test')
ax2.axvline(x=train_size, color='r', linestyle='--', label='Train/Test Split')
ax2.set_ylabel('Sales ($)')
ax2.legend()

# x์ถ• ๋ ˆ์ด๋ธ” ์„ค์ •
ax2.set_xlabel('Time')

plt.tight_layout()
plt.show()

VAR (Vector Autoregression) ๋ชจ๋ธ

์ด์ œ VAR ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ณ€์ˆ˜๋“ค ๊ฐ„์˜ ๋™์  ๊ด€๊ณ„๋ฅผ ๋ถ„์„ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# USING VAR
lag = 2

# VAR ๋ชจ๋ธ
model_var = VAR(train)
results_var = model_var.fit(lag)

# VAR ๋™์  ์˜ˆ์ธก
forecast_var = []
forecast_input_var = train.values[-lag:]

for i in range(len(test)):
    fc = results_var.forecast(forecast_input_var, steps=1)
    forecast_var.append(fc[0])
    
    # ์‹ค์ œ test ๋ฐ์ดํ„ฐ์˜ ์ด์ „ lag ๊ฐ’์œผ๋กœ forecast_input_var ์—…๋ฐ์ดํŠธ
    if i + 1 < len(test):
        forecast_input_var = np.vstack([forecast_input_var[1:], test.iloc[i].values])
    
forecast_var = np.array(forecast_var)
rmse_var_sales = sqrt(mean_squared_error(test['sales'], forecast_var[:, 1]))

๋น„๊ต๋ฅผ ์œ„ํ•ด SALES๋งŒ ๊ฐ€์ง€๊ณ  ํ•™์Šต์„ ํ•œ ARIMA ๋ชจ๋ธ์„ ํ•™์Šต์‹œ์ผœ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค!

ARIMA (AutoRegressive Integrated Moving Average) ๋ชจ๋ธ

๋งˆ์ง€๋ง‰์œผ๋กœ, ๋งค์ถœ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ARIMA ๋ชจ๋ธ์„ ์ ์šฉํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
train_sales = train['sales']
test_sales = test['sales']

# ARIMA ๋ชจ๋ธ
model_arima = ARIMA(train_sales, order=(lag, 0, 0))
results_arima = model_arima.fit()

# ๋™์  ์˜ˆ์ธก์„ ์œ„ํ•œ ARIMA
forecast_arima_dynamic = []
history = train_sales.tolist()

for t in range(len(test_sales)):
    model_arima_dynamic = ARIMA(history, order=(lag, 0, 0))
    model_fit_dynamic = model_arima_dynamic.fit()
    yhat = model_fit_dynamic.forecast(steps=1)[0]
    forecast_arima_dynamic.append(yhat)
    history.append(test_sales.iloc[t])

rmse_arima_sales_dynamic = sqrt(mean_squared_error(test_sales, forecast_arima_dynamic))

๋ชจ๋ธ ๋น„๊ต

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
print("\nModel Comparison:")
print(f"VAR Model RMSE: {rmse_var_sales:.5f}")
print(f"VAR Model MSE: {mse_var_sales:.5f}")
print(f"VAR Model MAE: {mae_var_sales:.5f}")
print("-"*50)
print(f"ARIMA Model RMSE: {rmse_arima_sales_dynamic:.5f}")
print(f"ARIMA Model MSE: {mse_arima_sales_dynamic:.5f}")
print(f"ARIMA Model MAE: {mae_arima_sales_dynamic:.5f}")

# Model Comparison:
# VAR Model RMSE: 0.23431
# VAR Model MSE: 0.05490
# VAR Model MAE: 0.19481
# --------------------------------------------------
# ARIMA Model RMSE: 0.23493
# ARIMA Model MSE: 0.05519
# ARIMA Model MAE: 0.18524
1
2
3
4
5
6
7
8
9
10
11
12
# ์˜ˆ์ธก ๊ฒฐ๊ณผ ์‹œ๊ฐํ™”
plt.figure(figsize=(12, 6))
plt.plot(test.index, test.sales, label='Actual Sales', color='black')
plt.plot(test.index, test.tv, label='Actual TV', color='orange', linestyle = '--')

plt.plot(test.index, forecast_var[:, 1], label='VAR Sales Forecast', color='blue')
plt.plot(test.index, forecast_arima_dynamic, label='Dynamic ARIMA Sales Forecast', color='red')
plt.legend()
plt.xlabel('Time')
plt.ylabel('Sales ($)')
plt.title('Sales Forecast Comparison: VAR vs Dynamic ARIMA')
plt.show()

๋ถ„์„ ํ•ด์„

์ œ๊ณต๋œ ์„ฑ๋Šฅ ์ง€ํ‘œ์™€ ์‹œ๊ฐ์  ๊ฒ€ํ† ๋ฅผ ์ข…ํ•ฉํ•ด ๋ณผ ๋•Œ, ํŠน๋ณ„ํžˆ ์–ด๋А ํ•œ ๋ชจ๋ธ์ด ๋” ์šฐ์ˆ˜ํ•˜๋‹ค๊ณ  ๊ฒฐ๋ก  ๋‚ด๋ฆฌ๊ธฐ๋Š” ์–ด๋ ต์Šต๋‹ˆ๋‹ค. ๋‘ ๋ชจ๋ธ ๋ชจ๋‘ ๋น„์Šทํ•œ ์ˆ˜์ค€์˜ ์˜ˆ์ธก ์„ฑ๋Šฅ์„ ๋ณด์ด๊ธฐ ๋•Œ๋ฌธ์—, ์ตœ์ข… ๋ชจ๋ธ ์„ ํƒ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ถ”๊ฐ€์ ์ธ ์š”์†Œ๋ฅผ ๊ณ ๋ คํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

  1. ๋ชจ๋ธ ํ•ด์„์„ฑ: TV Ad Budget ๋ณ€์ˆ˜๊ฐ€ Sales์— ์ค‘์š”ํ•œ ์˜ํ–ฅ์„ ๋ฏธ์นœ๋‹ค๊ณ  ํŒ๋‹จ๋œ๋‹ค๋ฉด VAR ๋ชจ๋ธ์ด ๋” ์ ํ•ฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  2. ๋ชจ๋ธ์˜ ๋ชฉ์ : ์˜ˆ์ธก์˜ ๋ชฉ์ ์ด ๋ฌด์—‡์ธ์ง€์— ๋”ฐ๋ผ ๋ชจ๋ธ์„ ์„ ํƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ๋‹จ์ˆœํžˆ ๊ณผ๊ฑฐ Sales ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์˜ˆ์ธกํ•˜๊ณ ์ž ํ•œ๋‹ค๋ฉด ๋™์  ARIMA ๋ชจ๋ธ์ด ๋” ์ ํ•ฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  3. ๋ณต์žก๋„: VAR ๋ชจ๋ธ์€ ๋‹ค๋ณ€๋Ÿ‰ ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค๋ฃจ๊ธฐ ๋•Œ๋ฌธ์— ๋ชจ๋ธ์ด ๋” ๋ณต์žกํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๋™์  ARIMA ๋ชจ๋ธ์€ ๋‹จ๋ณ€๋Ÿ‰ ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค๋ฃจ๊ธฐ ๋•Œ๋ฌธ์— ์ƒ๋Œ€์ ์œผ๋กœ ๋‹จ์ˆœํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  4. ๋ฐ์ดํ„ฐ ๊ฐ€์šฉ์„ฑ: ์ถ”๊ฐ€์ ์ธ ์™ธ์ƒ ๋ณ€์ˆ˜๊ฐ€ ์žˆ์„ ๊ฒฝ์šฐ VAR ๋ชจ๋ธ์ด ์œ ๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋”ฐ๋ผ์„œ, ์‹ค์ œ ์‚ฌ์šฉ ๋ชฉ์ ๊ณผ ์ƒํ™ฉ์— ๋งž์ถ”์–ด ๋‘ ๋ชจ๋ธ ์ค‘ ์ ํ•ฉํ•œ ๊ฒƒ์„ ์„ ํƒํ•˜๋Š” ๊ฒƒ์ด ๋ฐ”๋žŒ์งํ•ฉ๋‹ˆ๋‹ค.

๊ทธ๋žœ์ € ์ธ๊ณผ๊ด€๊ณ„, VAR, ARIMA ๋ชจ๋ธ์€ ๊ฐ๊ฐ ์žฅ๋‹จ์ ์ด ์žˆ์œผ๋ฉฐ, ๋ฐ์ดํ„ฐ์˜ ํŠน์„ฑ๊ณผ ๋ถ„์„ ๋ชฉ์ ์— ๋”ฐ๋ผ ์ ์ ˆํ•œ ๋ชจ๋ธ์„ ์„ ํƒํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ธฐ๋ฒ•๋“ค์„ ์ž˜ ํ™œ์šฉํ•œ๋‹ค๋ฉด, ๋ฐ์ดํ„ฐ์— ์ˆจ๊ฒจ์ง„ ์ธ์‚ฌ์ดํŠธ๋ฅผ ๋ฐœ๊ฒฌํ•˜๊ณ  ๋” ๋‚˜์€ ๋น„์ฆˆ๋‹ˆ์Šค ์˜์‚ฌ๊ฒฐ์ •์„ ๋‚ด๋ฆด ์ˆ˜ ์žˆ์„ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์˜ค๋Š˜๋„ ๋„์›€์ด ๋˜์…จ๊ธธ ๋ฐ”๋ผ๋ฉฐ ์ €๋„ ์ด๋งŒ ํ”„๋กœ์ ํŠธ์— ์ ์šฉํ•˜๋Ÿฌ ๊ฐ€๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค! โญ



-->