I have some time series data which contains some seasonal trends and I want to use an ARIMA model to predict how this series will behave in the future.
In order to predict how my variable of interest (log_var) will behave, I have taken a weekly, monthly and annual difference and then used these as the input to an ARIMA model.
Below is an example.
import numpy as np
from statsmodels.tsa.arima_model import ARIMA  # older statsmodels API; newer releases use statsmodels.tsa.arima.model.ARIMA

# The three seasonal difference series are passed in as exogenous regressors
exog = np.column_stack([df_arima['log_var_diff_wk'],
                        df_arima['log_var_diff_mth'],
                        df_arima['log_var_diff_yr']])

model = ARIMA(df_arima['log_var'], exog=exog, order=(1, 0, 1))
results_ARIMA = model.fit()
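(For reference, the three difference columns could be built along the lines below; the 7/30/365-observation lags are an assumption and depend on the sampling frequency of the data.)

# Illustrative only: seasonal difference columns, assuming daily observations
df_arima['log_var_diff_wk'] = df_arima['log_var'].diff(7)     # weekly difference
df_arima['log_var_diff_mth'] = df_arima['log_var'].diff(30)   # monthly difference
df_arima['log_var_diff_yr'] = df_arima['log_var'].diff(365)   # annual difference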
I am doing this for several different data sources and in all of them I see great results, in the sense that if I plot log_var against results_ARIMA.fittedvalues for the training data then it matches very well (I tune p and q for each data source separately, but d is always 0, given that I have already taken the differences myself).
However, I then want to check what the predictions look like, and in order to do this I redefine exog to just be the 'test' dataset. For example, if I train the original ARIMA model on 2014-01-01 to 2016-01-01, the 'test' set would just be 2016-01-01 onwards.
My approach has worked well for some data sources (in the sense that I plot the forecast against the known values and the trends look sensible) but badly for others, although they are all the same 'kind' of data and they have just been taken from different geographical locations. In some of the locations it completely fails to catch obvious seasonal trends that occur again and again in the training data on the same dates each year. The ARIMA model always fits the training data well, it just seems that in some cases the predictions are completely useless.
I am now wondering if I am actually following the correct procedure to predict values from the ARIMA model. My approach is basically:
# Exogenous regressors restricted to the forecast ('test') period
exog = np.column_stack([df_arima_predict['log_val_diff_wk'],
                        df_arima_predict['log_val_diff_mth'],
                        df_arima_predict['log_val_diff_yr']])

arima_predict = results_ARIMA.predict(start=training_cut_date, end='2017-01-01',
                                      dynamic=False, exog=exog)
Is this the correct way to go about making predictions with ARIMA?
If so, is there a way I can try to understand why the predictions look very good in some datasets and terrible in others, when the ARIMA model seems to fit the training data just as well in both cases?
Answer 0 (score: 0)
I have a similar problem atm, and I haven't fully figured it out yet. It seems that including multiple seasonality terms in python is still a bit tricky. R does seem to have that capability, see here. So one suggestion I can give you is to try the more sophisticated functionality that R now offers (although that may take a substantial time investment if you are not already familiar with R).
Considering your approach to modelling the seasonal patterns: taking nth-order difference scores does not give you a seasonal constant, but rather represents the difference between the time points that you specified as seasonally related. If those differences are small, correcting for them may not have much of an effect on your modelling results; in that case the model forecasts may turn out quite well. Conversely, if the differences are large, including them can easily distort the forecasts. This may explain the variation you are seeing across your modelling results. Conceptually, what you want to do is model a constant that recurs over time.
In the blog post referenced above, the author advocates using Fourier series to model the variance within each time period. Both the NumPy and SciPy packages provide routines for computing the fast Fourier transform. As a non-mathematician, however, I found it difficult to establish whether the fast Fourier transform was producing the appropriate numbers.
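To make that concrete, a raw FFT inspection could look roughly like the sketch below (my sketch only; it assumes a regularly sampled daily log_var series and is not something I have validated for this use case):

import numpy as np

# Sketch: inspect the raw spectrum of the (mean-removed) series
y = df_arima['log_var'].values
y = y - y.mean()
amplitudes = np.abs(np.fft.rfft(y)) / len(y)
freqs = np.fft.rfftfreq(len(y), d=1.0)          # cycles per observation (per day)

# The largest amplitudes point at the dominant periodicities (e.g. ~1/7 for weekly)
top = np.argsort(amplitudes)[::-1][:5]
print(list(zip(freqs[top], amplitudes[top])))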
In the end I opted to use the Welch signal decomposition from SciPy's signal module. This returns a spectral density analysis of the time series, from which you can deduce the signal strength of the various frequencies present in the time series.
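In code, that step is roughly the following (again a sketch: fs=1.0 assumes one observation per day, and nperseg is just a reasonable default):

import numpy as np
from scipy import signal

# Welch's method: estimate the power spectral density of the series
y = df_arima['log_var'].values
f, Pxx = signal.welch(y, fs=1.0, nperseg=min(512, len(y)))

# Peaks in Pxx mark the strongest periodicities; 1/f converts frequency to period
peaks, _ = signal.find_peaks(Pxx)
strongest = peaks[np.argsort(Pxx[peaks])[::-1][:3]]
print([(1.0 / f[i], Pxx[i]) for i in strongest])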
If you identify peaks in the spectral density analysis that correspond to the seasonal frequencies you are trying to account for in your time series, you can use their frequencies and amplitudes to construct sine waves that represent the seasonal variation. You can then include these as exogenous variables in the ARIMA, much like the Fourier terms in the blog post.
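A sketch of that construction (the weekly and yearly frequencies below are placeholders; you would substitute the peak frequencies, and scale by the amplitudes, found in your own spectral analysis):

import numpy as np

# Sketch: sine/cosine regressors for the identified seasonal frequencies
t = np.arange(len(df_arima))
seasonal_freqs = [1.0 / 7.0, 1.0 / 365.25]      # placeholder weekly/yearly cycles

waves = []
for freq in seasonal_freqs:
    waves.append(np.sin(2 * np.pi * freq * t))
    waves.append(np.cos(2 * np.pi * freq * t))  # cosine term lets the fit absorb the phase

exog_seasonal = np.column_stack(waves)
# exog_seasonal can then be passed as exog= to ARIMA, in place of (or alongside)
# the difference-score columns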
This is about as far as I've gotten right now - I'm currently trying to figure out whether I can get the statsmodels ARIMA procedure to use these sine waves, which specify a seasonal trend, as exogenous variables in my model (the documentation states that they should not represent a trend, but hey, one can dream, right?). edit: This blog post by Rob Hyneman is also highly relevant, and explains some of the rationale behind including Fourier terms.
Sorry I can't offer you a solution that is verified to work in Python, but I hope this gives you some new ideas for getting a grip on this tricky seasonal differencing.
TL;DR:
Right now python does not look well suited to handling multiple seasonality terms; R may be the better solution (see reference);
Using difference scores to account for seasonal trends does not seem to capture the constant differences associated with the recurrence of seasons;
One way to do this in python could be to use Fourier series representing the seasonal trends (also see reference), which can be obtained using, e.g., Welch signal decomposition. How to use these as exogenous variables in ARIMA to good effect is still an open question, though.
Good luck,
Evert
p.s.: I'll update if I find a way to get this working in Python