[Machine Learning][Time Series] Everything About AR, MA, ARMA, and ARIMA - Hands-On Edition

Posted by Euisuk's Dev Log on October 9, 2021


์›๋ณธ ๊ฒŒ์‹œ๊ธ€: https://velog.io/@euisuk-chung/๋จธ์‹ ๋Ÿฌ๋‹์‹œ๊ณ„์—ด-AR-MA-ARMA-ARIMA์˜-๋ชจ๋“ -๊ฒƒ-์‹ค์ŠตํŽธ

This post walks through the overall process of a time series analysis on real data. Analysis tips and feedback are always welcome!! 🙇‍♂️

The analysis proceeds in the order below. It builds on the time series concepts covered earlier, so if you want to look up a concept, you can find it in my previous post.

  1. Reference
  2. Data selection and preprocessing
  3. Determining whether the data is stationary or non-stationary
  4. Differencing and re-checking stationarity
  5. ARIMA model identification and estimation
  6. Model evaluation

Reference (libraries, evaluation metrics)

Libraries

import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
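# statsmodels.tsa.arima_model.ARIMA was deprecated in statsmodels 0.12 and removed in 0.13.
# The replacement below does not add a constant by default when d > 0, so its summary can differ from the output shown in this post.
from statsmodels.tsa.arima.model import ARIMA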
from statsmodels.tsa.statespace.sarimax import SARIMAX 
from pmdarima.arima import auto_arima
import math

import itertools

import matplotlib.pyplot as plt
import seaborn as sns
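# Jupyter/IPython magic; remove this line when running as a plain script
%matplotlib inline
# This style was named 'seaborn-whitegrid' before Matplotlib 3.6
plt.style.use('seaborn-v0_8-whitegrid')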

Evaluation metrics

from sklearn import metrics

def mae(y_true, y_pred):
    return metrics.mean_absolute_error(y_true,y_pred) #MAE
def mse(y_true, y_pred):
    return metrics.mean_squared_error(y_true,y_pred) # MSE
def rmse(y_true, y_pred):    
    return np.sqrt(metrics.mean_squared_error(y_true,y_pred))  # RMSE
def r2(y_true, y_pred):    
    return metrics.r2_score(y_true,y_pred) # R2
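def mape(y_true, y_pred):
    # Convert to arrays so that lists, or Series with different indexes, are compared element-wise
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs((y_pred - y_true) / y_true)) * 100 # MAPE (undefined when y_true contains 0)

def get_score(model, y_true, y_pred):
    # `model` is a label (for example the model name) stored next to the scores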
    mae_val = mae(y_true, y_pred)
    mse_val = mse(y_true, y_pred)
    rmse_val = rmse(y_true, y_pred)
    r2_val = r2(y_true, y_pred)
    mape_val = mape(y_true, y_pred)
    
    score_dict = {"model": model,
                  "mae" :  mae_val,
                  "mse" :  mse_val,
                  "rmse" : rmse_val,
                  "r2":    r2_val, 
                  "mape" : mape_val
                 }
    return score_dict

  2. Data selection and preprocessing

[Figure: Korea Airports Corporation]

Source: Korea Airports Corporation, https://www.airport.co.kr/www/extra/stats/timeSeriesStats/layOut.do?menuId=399

The data used in this analysis is the per-airport passenger count time series provided by the Korea Airports Corporation: the total number of airline passengers, collected monthly from January 1, 2002 to September 1, 2020. Because this is air travel data, it fluctuates severely from February 2020 onward, when the COVID-19 outbreak became serious. A plain ARIMA model with no exogenous variables would have difficulty learning the effect of COVID-19, so during preprocessing I trimmed the analysis period and used only the data up to January 2020, before the outbreak.

[Figure: Impact of COVID-19]

In addition, to evaluate the forecasts after building the model, I split the data into a Train Dataset and a Test Dataset at a ratio of 8:2. The split is chronological: the training set covers January 2002 to December 2016 and the test set covers January 2017 to January 2020.
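The post does not show the splitting code, so the following is only a minimal sketch of a chronological split, assuming a monthly `DataFrame` with a `customer` column. The synthetic values and the `split_date` variable are placeholders for illustration; the cutoff date matches the training and test periods stated in the evaluation section.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series standing in for the passenger data (assumption).
idx = pd.date_range("2002-01-01", "2020-01-01", freq="MS")
data = pd.DataFrame({"customer": np.arange(len(idx), dtype=float)}, index=idx)

# Split by date, never by random shuffling, so that the test set
# lies strictly after the training set in time.
split_date = pd.Timestamp("2017-01-01")  # hypothetical name for the cutoff
train_data = data.loc[data.index < split_date]
test_data = data.loc[data.index >= split_date]

print(len(train_data), len(test_data))
print(round(len(train_data) / len(data), 3))
```

With this cutoff the training share is about 83%, so "8:2" should be read as approximate.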

  3. Determining whether the data is stationary or non-stationary

Judging by visualization

First, without transforming the data in any way, I used the seasonal decomposition function in the time series analysis part of Python's statsmodels package to separate the data into trend, seasonal, and residual components and plotted them. From top to bottom, the decomposition figure shows the original data, the trend, the seasonal component, and the residual. The trend panel shows that the data is increasing, the seasonal panel shows that the data has a repeating cycle, and the residual panel has a mean of about 0 and a roughly constant variance. Because the series has both an upward trend and a seasonal cycle, its mean is not constant over time, so these plots show visually that the data is not stationary.

# Needs a DatetimeIndex with a known frequency; otherwise pass period=12 for monthly data
decomposition = sm.tsa.seasonal_decompose(data['customer'],  model='additive')
fig = decomposition.plot()
fig.set_size_inches(10,10)
plt.show()

[Figure: Decomposition]
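To make the additive model (data = trend + seasonal + residual) more concrete, here is a small NumPy sketch of the classical moving-average idea behind this kind of decomposition. It is not the statsmodels implementation; the synthetic series and all variable names are assumptions for illustration.

```python
import numpy as np

n, period = 48, 12
t = np.arange(n)
trend = 10 + 0.5 * t                            # linear upward trend
seasonal = 3 * np.sin(2 * np.pi * t / period)   # cycle that repeats every 12 steps
y = trend + seasonal                            # additive series without noise

# Centered 2x12 moving average: half weight on both end points,
# full weight on the 11 points in between, so the weights sum to 1.
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
trend_hat = np.convolve(y, weights, mode="valid")   # loses period/2 points on each side

# Removing the estimated trend leaves the seasonal pattern.
half = period // 2
detrended = y[half:n - half] - trend_hat

print(np.allclose(trend_hat, trend[half:n - half]))
print(np.allclose(detrended, seasonal[half:n - half]))
```

Averaging over exactly one full period cancels the seasonal term, which is why the moving average isolates the trend here.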

For a closer look, I plotted the ACF and PACF of the data. The shape of the ACF and PACF gives an indication of whether a series is stationary. The ACF of this data shows positive autocorrelation that decays only slowly, which suggests that the series is non-stationary.

fig, ax = plt.subplots(1,2,figsize=(10,5))
fig.suptitle('Raw Data')
sm.graphics.tsa.plot_acf(train_data.values.squeeze(), lags=30, ax=ax[0])
sm.graphics.tsa.plot_pacf(train_data.values.squeeze(), lags=30, ax=ax[1]);

[Figure: ACF, PACF]
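The "slowly decaying ACF" pattern can be reproduced with a few lines of NumPy. This is a minimal sketch on synthetic data, not the passenger series; `sample_acf` is a hypothetical helper written for this example.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation of x for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(nlags + 1)])

rng = np.random.default_rng(0)
noise = rng.normal(size=500)                   # stationary white noise
trend_series = 0.5 * np.arange(500) + noise    # upward trend plus noise

acf_trend = sample_acf(trend_series, 10)
acf_noise = sample_acf(noise, 10)

print(np.round(acf_trend, 2))   # stays close to 1 across all lags
print(np.round(acf_noise, 2))   # drops to roughly 0 right after lag 0
```

A trending series keeps a high autocorrelation across many lags, while a stationary one loses it quickly. That contrast is what the ACF plot is used for here.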

Judging by statistical tests

The previous part checked the non-stationarity of the data visually. This part verifies it statistically using the Durbin-Watson test and the Dickey-Fuller test.

Durbin Watson Test

The Durbin-Watson test checks whether the errors of a regression model are autocorrelated. Autocorrelation means that the errors of adjacent observations are correlated. The null hypothesis (H0) of the Durbin-Watson test is "the error terms are not autocorrelated" and the alternative hypothesis (H1) is "the error terms are positively autocorrelated". Running the test on this data gave a statistic of 1.10697. The statistic lies between 0 and 4 and is close to 2 when there is no autocorrelation. It is approximately 2(1 − ρ), where ρ is the lag-1 autocorrelation of the errors, so 1.10697 corresponds to ρ ≈ 0.45, which is greater than 0. The null hypothesis is therefore rejected, and the data can be said to have (positive) autocorrelation.

[Figure: Durbin-Watson test result]
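For reference, the Durbin-Watson statistic is the sum of squared successive differences of the errors divided by their sum of squares. The sketch below computes it by hand on synthetic errors. `durbin_watson_stat` is a hypothetical helper, and the AR(1) coefficient 0.45 is chosen only for illustration.

```python
import numpy as np

def durbin_watson_stat(resid):
    """Sum of squared successive differences over the sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(1)
white = rng.normal(size=2000)        # errors with no autocorrelation

# AR(1) errors with lag-1 autocorrelation 0.45 (positive autocorrelation)
ar1 = np.zeros(2000)
for t in range(1, 2000):
    ar1[t] = 0.45 * ar1[t - 1] + white[t]

dw_white = durbin_watson_stat(white)
dw_ar1 = durbin_watson_stat(ar1)
rho_hat = 1 - dw_ar1 / 2             # uses the approximation DW ≈ 2(1 - rho)

print(round(dw_white, 3), round(dw_ar1, 3), round(rho_hat, 3))
```

Values near 2 indicate no autocorrelation, values well below 2 indicate positive autocorrelation, and values well above 2 indicate negative autocorrelation.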

Dickey Fuller Test

The Dickey-Fuller test is used to check whether a time series is stationary. Its null hypothesis (H0) is "the data is not stationary" and its alternative hypothesis (H1) is "the data is stationary". Running the Dickey-Fuller test gives a very high p-value of 0.9884. The null hypothesis therefore cannot be rejected, and we cannot say that the data is stationary.

[Figure: Dickey-Fuller test result]

  4. Differencing and re-checking stationarity

Differencing

To apply AR, MA, or ARMA models, the data must be stationary. As seen above, the original data is not stationary, so differencing is required. In this analysis I applied first-order differencing (d=1) and second-order differencing (d=2) and plotted the results. Differencing turned the previously non-stationary data into stationary data. I then ran the Dickey-Fuller test on both the first-differenced and the second-differenced series. In both cases the p-value is below 0.05, so the null hypothesis is rejected and the differenced data can be considered stationary.

First-order differencing

diff_train_data = train_data.copy()
diff_train_data = diff_train_data['customer'].diff()
print(diff_train_data.head())
diff_train_data = diff_train_data.dropna() # drop the NaN that diff() leaves in the first row
print(diff_train_data.head())

Second-order differencing

diff2_train_data = diff_train_data.diff() # difference the first-differenced series once more
print(diff2_train_data.head())
diff2_train_data = diff2_train_data.dropna() # drop the NaN that diff() leaves in the first row
print(diff2_train_data.head())

[Figure: Dickey-Fuller test results after differencing]
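As a small worked example of what differencing does, the sketch below applies first and second differences to five made-up values (not the passenger data) and then undoes the first difference.

```python
import numpy as np

y = np.array([100.0, 104.0, 110.0, 113.0, 121.0])

d1 = np.diff(y, n=1)   # y_t - y_{t-1}
d2 = np.diff(y, n=2)   # difference of the first differences

# First differencing can be undone with a cumulative sum plus the first value.
restored = np.concatenate(([y[0]], y[0] + np.cumsum(d1)))

print(d1)        # [4. 6. 3. 8.]
print(d2)        # [ 2. -3.  5.]
print(restored)  # [100. 104. 110. 113. 121.]
```

Each round of differencing shortens the series by one observation. In pandas that lost observation shows up as a `NaN` in the first row, which is why `dropna()` is called after `diff()`.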

  5. ARIMA model identification and estimation

An ARIMA analysis proceeds as follows: ① data preprocessing → ② building a tentative model → ③ parameter estimation (search) → ④ validation → ⑤ defining the final model. Both the first and the second difference had p-values below 0.05 and rejected the null hypothesis, so I used the first-differenced data, the lowest order of differencing that achieves stationarity. Step ①, data preprocessing, is complete at this point. For step ②, building a tentative model, I plotted the ACF and PACF of the first-differenced data.

[Figure: ACF/PACF of the first-differenced data]

Both the ACF and the PACF decay exponentially, and no clear cut-off is visible in either. Referring to the table below, I tentatively built an ARIMA model with (p,d,q) = (0,1,1).

[Table: Graphical method]

The results of the tentative ARIMA model with (p,d,q) = (0,1,1) are as follows. The fitted model has an AIC of 4767.825, and the p-values of the constant and of the coefficient are both below 0.05, so they are statistically significant.

# ARIMA model fitting
# p = 0 (AR order), d = 1 (differencing), q = 1 (MA order)

model = ARIMA(train_data.values, order=(0,1,1))
model_fit = model.fit()
model_fit.summary()

[Figure: Validation (model summary)]
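Written out, an ARIMA(0,1,1) model says that the first difference of the series follows an MA(1) process. The notation below ($B$ for the backshift operator, $c$ for the constant, $\theta_1$ for the MA coefficient, $\varepsilon_t$ for white noise) is my own and does not appear in the original post. Whether the constant $c$ is included depends on the library and its settings.

```latex
(1 - B)\, y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1}
\quad\Longleftrightarrow\quad
y_t = y_{t-1} + c + \varepsilon_t + \theta_1 \varepsilon_{t-1}
```

The second form shows why `d = 1` lets the model handle a trending series: each value is the previous value plus a stationary increment.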

To search for the best parameters in earnest, I fixed d at 1 (first-order differencing) and ran a grid search over p and q. Each of p and q takes one of the 6 values from 0 to 5, giving 36 candidate combinations in total, and I searched for the best one as follows. (Part of the ARIMA output is omitted.)

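# Parameter search (d fixed at 1, non-seasonal, exhaustive instead of stepwise)
# Note: auto_arima's max_order defaults to 5; with stepwise=False, orders whose p + q
# exceeds that limit may be skipped unless max_order is raised (or set to None).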
auto_arima_model = auto_arima(train_data, 
                              start_p=0, max_p=5, 
                              start_q=0, max_q=5, 
                              seasonal=False,
                              d=1,
                              trace=True,
                              error_action='ignore',  
                              suppress_warnings=True, 
                              stepwise=False)

[Figure: ARIMA search results]

The search showed that the AIC is smallest at p = 4, d = 1, q = 0. The (4,1,0) model has an AIC of 4749.253, which is lower than that of the tentative model built in the previous step. The p-values of the constant and of the coefficients are below 0.05, so they are statistically significant.

[Figure: Best ARIMA result]
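For context, the AIC trades goodness of fit against model size. In the standard definition below, $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood. These symbols are my notation, not the post's. The second line plugs in the two AIC values reported above.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}

\mathrm{AIC}_{(0,1,1)} - \mathrm{AIC}_{(4,1,0)} = 4767.825 - 4749.253 = 18.572
```

A lower AIC is preferred, so the (4,1,0) model is favored even though it estimates more parameters than the (0,1,1) model.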

  6. Model evaluation

As mentioned in the preprocessing part, the data was split 8:2 into train and test sets. The train data was used for model fitting and hyperparameter tuning, and the test data was held out to evaluate the model after forecasting. Next, I forecast the test period with the model built above and evaluated the result.

from pandas import Timestamp
%matplotlib inline

# set index
true_index = list(data.index)
predict_index = list(test_data.index)

# make array of true value
true_value = np.array(list(data['customer']))

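# plot (pred, predicted_lb and predicted_ub are the forecast mean and the 95% interval bounds for the test period, computed beforehand)
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(true_index, true_value, label = 'True')
ax.plot(predict_index, pred, label = 'Prediction')
# axvline spans the full height of the axes, so no hard-coded y-range is needed
ax.axvline(Timestamp('2017-01-01 00:00:00'), linestyle='--', color='r', label='Start of Forecast')
ax.fill_between(predict_index, predicted_lb, predicted_ub, color = 'k', alpha = 0.1, label='0.95 Prediction Interval')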
ax.legend(loc='upper left')
plt.show()

[Figure: Forecast result]

Using the ARIMA(4,1,0) model trained on the data from January 2002 to December 2016, I forecast the number of airline passengers from January 2017 to January 2020 and drew the graph above. The ARIMA model captures the trend well, but because seasonality is not modeled, it fails to predict the seasonal cycle. Even so, most of the actual values fall inside the 95% prediction interval. I then evaluated the model's performance with the performance metrics below. The residuals between the predicted and the actual values also look approximately normal and show no autocorrelation.

[Figure: Result 1]

[Figure: Result 2]
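To show how the performance metrics are read, here is a small worked example on three made-up values (not the results of this analysis), computed with plain NumPy so that each formula is visible.

```python
import numpy as np

y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 400.0])

err = y_pred - y_true                                 # [10, -10, 0]
mae_val = np.mean(np.abs(err))                        # mean absolute error
mse_val = np.mean(err ** 2)                           # mean squared error
rmse_val = np.sqrt(mse_val)                           # root mean squared error
mape_val = np.mean(np.abs(err / y_true)) * 100        # mean absolute percentage error
r2_val = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(round(mae_val, 3), round(rmse_val, 3), round(mape_val, 3), round(r2_val, 4))
```

MAE and RMSE are in the units of the data (passengers), while MAPE is a percentage, which makes it easier to compare across series of different scale.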

Thank you for reading this long post ^~^


