How to forecast seasonal time series data?

Time: 2019-07-17 14:58:14

Tags: python time-series statsmodels arima

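This is my first post on this very helpful platform. I am a beginner in time series modeling. I am trying to develop a SARIMAX model for univariate time series forecasting. I have two years of daily operating-hours data for a device, which I resampled to weekly data. I want to forecast the future operating hours of this device (for the next 16 weeks).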
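For readers following along, the daily-to-weekly step can be sketched with pandas as below. The series here is synthetic, and aggregating with `mean()` is an assumption (the post does not say whether the weekly value is a mean or a sum).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the daily operating-hours log (NOT the asker's data).
rng = np.random.default_rng(0)
days = pd.date_range("2016-08-01", periods=28, freq="D")  # starts on a Monday
daily = pd.Series(rng.uniform(10, 18, size=len(days)), index=days, name="duration")

# "W" groups days into weeks that end on Sunday and labels each bin with that Sunday.
# mean() is an assumption; sum() would give total hours per week instead.
weekly = daily.resample("W").mean()
print(weekly)
```

With this rule the first weekly label is Sunday 2016-08-07, which matches the first date in the data posted further down.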

I tried the grid search algorithm described in this article: https://www.digitalocean.com/community/tutorials/a-guide-to-time-series-forecasting-with-arima-in-python-3 to identify the hyperparameters of the model.

The Dickey-Fuller test shows that the data is stationary. Below is the printed output (weekly resampled data). Results of Dickey-Fuller Test:

Test Statistic                -6.651852e+00
p-value                        5.097401e-09
#Lags Used                     0.000000e+00
Number of Observations Used    7.300000e+01
Critical Value (1%)           -3.523284e+00
Critical Value (5%)           -2.902031e+00
Critical Value (10%)          -2.588371e+00
dtype: float64

My model summary looks like this:

  Statespace Model Results                                 
==========================================================================================
Dep. Variable:                           duration   No. Observations:                   74
Model:             SARIMAX(1, 0, 0)x(1, 1, 0, 26)   Log Likelihood                 -53.441
Date:                            Wed, 17 Jul 2019   AIC                            112.881
Time:                                    16:43:37   BIC                            116.015
Sample:                                         0   HQIC                           113.561
                                             - 74                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.2311      0.221      1.044      0.296      -0.203       0.665
ar.S.L26      -0.3097      0.252     -1.228      0.220      -0.804       0.185
sigma2         9.5039      2.397      3.965      0.000       4.806      14.202
===================================================================================
Ljung-Box (Q):                       13.44   Jarque-Bera (JB):                 7.02
Prob(Q):                              0.86   Prob(JB):                         0.03
Heteroskedasticity (H):               3.84   Skew:                            -0.60
Prob(H) (two-sided):                  0.10   Kurtosis:                         5.56
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Here is the modeling code:

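import statsmodels.api as sm

# AR(1) plus a seasonal AR(1) with one seasonal difference and a 26-week period
mod = sm.tsa.statespace.SARIMAX(df_train,
                                order=(1, 0, 0),
                                seasonal_order=(1, 1, 0, 26),
                                enforce_stationarity=False,
                                enforce_invertibility=False)
results = mod.fit(disp=False)

# Forecast as many steps as there are rows in the held-out test set
pred = results.get_forecast(steps=len(df_test))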

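The forecast seems to be delayed by 2 weeks. I have attached the results to this post. This is the result of shifting the forecast values: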

dh = dh.shift(-2).dropna()

Shifted forecast values in red

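Image showing forecast and actual values of operating hours. Red is forecasted, blue is actual data

Can someone tell me whether my approach is correct, and explain why the forecast is offset by two weeks (the seasonal component, however, is offset by one week)?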
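One way to check whether an apparent lag is real is to compute the error for several candidate shifts and see how clearly one of them wins. The helper `rmse_by_shift` below is hypothetical (it is not part of pandas or statsmodels), and the toy data is constructed so that the forecast lags the actual values by exactly two weeks.

```python
import numpy as np
import pandas as pd


def rmse_by_shift(actual, forecast, max_shift=4):
    """Return {k: RMSE between actual and forecast.shift(k)} for k in [-max_shift, max_shift]."""
    scores = {}
    for k in range(-max_shift, max_shift + 1):
        error = (actual - forecast.shift(k)).dropna()
        scores[k] = float(np.sqrt((error ** 2).mean()))
    return scores


# Toy data (NOT the asker's): the forecast at week t equals the actual value at week t-2.
rng = np.random.default_rng(5)
index = pd.date_range("2018-01-07", periods=16, freq="W")
actual = pd.Series(15 + rng.normal(0, 2, len(index)), index=index)
forecast = actual.shift(2)

scores = rmse_by_shift(actual, forecast)
best_shift = min(scores, key=scores.get)
print(best_shift, scores[best_shift])
```

On real data with only 16 test points, a shift that lowers the error slightly may be a coincidence, so a small improvement on its own is weak evidence of a systematic lag.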

PS: I chose 26 as the seasonal period after studying the seasonal decomposition plot. Here is the test data for clarification:

date    duration
8/7/2016    14.75865079
8/14/2016   15.72940476
8/21/2016   16.12214286
8/28/2016   14.3756746
9/4/2016    14.90861111
9/11/2016   15.34690476
9/18/2016   16.15107143
9/25/2016   15.98257937
10/2/2016   8.374642857
10/9/2016   15.12717593
10/16/2016  15.91464286
10/23/2016  15.8356746
10/30/2016  16.75575397
11/6/2016   14.32138889
11/13/2016  15.60551587
11/20/2016  16.24988095
11/27/2016  15.95936508
12/4/2016   14.61742063
12/11/2016  13.545
12/18/2016  17.02488095
12/25/2016  9.159555556
1/8/2017    12.81242063
1/15/2017   16.20285714
1/22/2017   17.0834127
1/29/2017   18.40464286
2/5/2017    13.39559524
2/12/2017   16.36452381
2/19/2017   16.67698413
2/26/2017   15.62789683
3/5/2017    17.31428571
3/12/2017   17.40829365
3/19/2017   15.82539683
3/26/2017   15.21595238
4/2/2017    16.4109127
4/9/2017    11.38543651
4/16/2017   11.46966667
4/23/2017   13.79509259
4/30/2017   16.13079365
5/7/2017    14.43949074
5/14/2017   14.25813492
5/21/2017   15.21011905
5/28/2017   15.13231481
6/4/2017    13.35690476
6/11/2017   11.24513889
6/18/2017   16.33047619
6/25/2017   15.20654762
7/2/2017    13.08047619
7/9/2017    15.07047619
7/16/2017   16.03702381
7/23/2017   14.91428571
7/30/2017   13.3331746
8/6/2017    13.09619048
8/13/2017   14.51670635
8/20/2017   15.48579365
8/27/2017   10.42162698
9/3/2017    14.43809524
9/10/2017   15.2334127
9/17/2017   14.91301587
9/24/2017   14.6190873
10/1/2017   15.05559524
10/8/2017   16.16888889
10/15/2017  10.23011905
10/22/2017  14.50650794
10/29/2017  16.0815873
11/5/2017   13.52162037
11/12/2017  13.93670635
11/19/2017  14.02361111
11/26/2017  14.46198413
12/3/2017   14.57138889
12/10/2017  15.00194444
12/17/2017  6.562777778
12/24/2017  9.812314815
12/31/2017  9.812314815
1/7/2018    12.87944444
1/14/2018   15.5634127
1/21/2018   16.02464286
1/28/2018   14.96492063
2/4/2018    16.66015873
2/11/2018   11.89059524
2/18/2018   14.45646825
2/25/2018   14.84785714
3/4/2018    15.39595238
3/11/2018   14.02646825
3/18/2018   16.09496032
3/25/2018   14.69738095
4/1/2018    9.777777778
4/8/2018    13.21705556
4/15/2018   15.90865079
4/22/2018   16.01595238
4/29/2018   16.88354167

Thanks!

1 answer:

Answer 0: (score: 0)

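I think your code is fine and your model is forecasting correctly. It is not really 2 weeks off; it only looks that way, because the result is somewhat "random" after hacking the parameters in... ;-)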

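But I think your model itself is the problem. How exactly did you choose the parameters (p,d,q)(P,D,Q)s? Your data seems to be seasonal with a monthly cycle, so you should probably keep the s parameter at 12 (as suggested in the docs).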

I ran a grid search over the other parameters and evaluated the candidates by root mean squared error (from statsmodels.tools.eval_measures import rmse).

The best result was:

mod = sm.tsa.statespace.SARIMAX(df_train,
                                order=(2, 1, 0),
                                seasonal_order=(1,1,1,12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)

However, this model is not optimal; you will probably need more data to get a better model (or try a different algorithm).