Including multiple seasonal terms in Python statsmodels.tsa ARIMA

Date: 2016-09-19 17:25:52

Tags: python time-series statsmodels

I am trying to model a time series in Python using python 2.7.11 and the excellent statsmodels.tsa package. My data consists of hourly measurements of traffic intensity over several weeks, so it contains multiple seasonal components: a day forms a 24-hour period and a week forms a 168-hour period.

At this point, the modeling options in statsmodels.tsa are not set up to handle multiple seasonalities, as they only allow a single seasonal factor to be specified. However, I came across Rob Hyndman's work on multiple seasonality in R. He advocates modeling the seasonal components of a time series with Fourier series, including in the model Fourier terms for the frequencies corresponding to each seasonal period.
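
To make the idea concrete, here is a minimal sketch (not my actual code, which follows further down) of what Hyndman-style Fourier regressors for the 24- and 168-hour periods could look like; the helper name fourier_terms, the column names, and the choice of three harmonics per period are only illustrative:

import numpy as np
import pandas as pd


def fourier_terms(n_obs, period, k):
    # Build k sine/cosine pairs for one seasonal period (period given in hours)
    t = np.arange(n_obs)
    terms = {}
    for j in range(1, k + 1):
        terms['sin_%d_%d' % (period, j)] = np.sin(2. * np.pi * j * t / period)
        terms['cos_%d_%d' % (period, j)] = np.cos(2. * np.pi * j * t / period)
    return pd.DataFrame(terms)


# 1008 hourly observations, as in my data selection below; k = 3 is chosen arbitrarily here
fourier_exog = pd.concat([fourier_terms(1008, 24, 3), fourier_terms(1008, 168, 3)], axis=1)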

I used Welch's method to obtain the power spectral density of my observed time series, extracted the peaks in the spectrum that correspond to the frequencies at which I expect my seasonal effects, and used those frequencies and amplitudes to generate sine-wave patterns corresponding to the seasonal trends I expect in the data. As an aside, I believe this lets me skip Hyndman's step of selecting the value of k based on the AIC, because I am using the signal inherent in the observed data.
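
In condensed form, the peak-extraction step looks roughly like the sketch below (the full version is in the code further down); series stands for my hourly series (freqset in the full code), fs is 1/3600 Hz because of the hourly sampling, and the number of peaks is fixed at two here:

import numpy as np
from scipy.signal import welch

fs = 1.0 / 3600
f, Pxx = welch(series.values, fs=fs, nperseg=len(series) // 2, scaling='spectrum')
peak_idx = np.argsort(Pxx)[-2:]        # indices of the two strongest spectral peaks
peak_freqs = f[peak_idx]               # expected near 1/(24*3600) and 1/(168*3600) Hz
peak_amps = np.sqrt(Pxx[peak_idx])     # amplitude estimates from the power spectrum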

To make sure the sine waves line up with where the seasonal pattern occurs in the data, I matched the peaks of both sine-wave patterns to the peaks in the observed data: I visually selected one of the 24-hour peaks and matched the hour at which it occurs to the highest value of the variable representing the sine wave. Before doing this, I had checked that the daily peak always occurs at the same time.

So far, so good: a plot of the sine waves built from the obtained frequencies and amplitudes corresponds roughly to the observed data. I then fitted an ARIMA(2,0,0) model, including both decomposition-based variables as exogenous variables. At that point I wanted to test the predictive utility of the model. However, this is where things get complicated.
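
For reference, the fitting step boils down to the sketch below (assuming freqset and exog_sel are the hourly series and the two-column exogenous DataFrame defined in the full code further down); it uses the same old-style sm.tsa.ARIMA call with a positional exog argument that my script uses:

import statsmodels.api as sm

arima_200 = sm.tsa.ARIMA(freqset, (2, 0, 0), exog_sel).fit()
print arima_200.summary()
# In-sample, dynamic predictions used for the comparison against the observations
predicted = arima_200.predict(start=3, end=len(freqset) - 1, exog=exog_sel, dynamic=True)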

When I use ARIMA from the statsmodels package, the estimates I get from the fitted model form a pattern that merely replicates the sine waves, though scaled to the range of values of my observations. There is still a lot of variance in the observations that the seasonal trend does not account for, which leads me to believe that something in the model-fitting procedure is not working as intended.

Unfortunately, I am not well-versed enough in the art of time-series modeling to know whether my unexpected results are due to the nature of the exogenous variables I am including, to some statsmodels functionality I should be using but am omitting, or to a mistaken assumption about the concept of seasonal trends.

Some specific questions I have are:

  • Is it possible to include multiple seasonal trends (i.e., Fourier-based or decomposition-based) in an ARIMA model using statsmodels in Python?

  • Could reconstructing the seasonal trends with sine waves and including those sine waves as exogenous variables in the model, as in the code above and below, cause difficulties?

  • Why does the model specified in the code below not produce predictions that match the observed data more closely?

Any help is much appreciated!

Best wishes and thanks in advance,

Evert

P.S. Apologies if my code example and data file are overly long; since I am not sure what is causing the unexpected results, I figured I would post the whole thing. Also, apologies for the code not always following PEP8; I am still learning :)

Code example:

import os
import re
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.signal import welch
import operator


# Function which plots rolling mean of data set in order to estimate stationarity
# 'timeseries' = Data to be used for ARIMA modeling
#


def plotmean(timeseries, show=0, path=''):
    rolmean = pd.rolling_mean(timeseries, window=12)
    rolstd = pd.rolling_std(timeseries, window=12)
    fig = plt.figure(figsize=(12, 8))
    orig = plt.plot(timeseries, color='blue', label='Observed scores')
    mean = plt.plot(rolmean, color='red', label='Rolling mean')
    std = plt.plot(rolstd, color='black', label='Rolling SD')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()


#
# Function to decompose a function over time f(t) into a spectrum of signal amplitude and frequency
# 'dta' = The dataset used
# 'show' = Whether or not to show plot
# 'path' = Where to store plot, if desirable
#
# Output:
# frequency range and spectral density range
#


def runwelch(dta, show, path):
    nps = (len(dta) / 2) + 8
    nov = nps / 2
    fft = nps
    fs_temp = .0002778
    # Set to 1/3600 because of hourly sampling
    f, Pxx_den = welch(dta, fs=fs_temp, nperseg=nps, noverlap=nov, nfft=fft, scaling="spectrum")
    plt.plot(f, Pxx_den)
    plt.ylim([0.5e-7, 10])
    plt.xlabel('frequency [Hz]')
    plt.ylabel('PSD [V**2/Hz]')
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
    return f, Pxx_den


#
# Function which gets the amplitude and frequency of the n most important periodic cycles, and provides a plot
# to visually inspect whether they correspond to the expected seasonal components.
# 'n_obs' = Number of observations in the original series
# 'freq' = Frequency output of the Welch decomposition
# 'density' = Spectral density output of the Welch decomposition
# 'n' = Desired number of peaks to extract
# 'show' = Whether to show plots of the corresponding sine functions


def getsines(n_obs, freq, density, n, show):
    ftemp = freq
    dtemp = density
    fstore = []
    dstore = []
    astore = []
    fs_temp = .0002778
    # Set to 1/3600 because of hourly sampling
    samplespace = n_obs * 3600
    for a in range(0, n, 1):
        max_index, max_value = max(enumerate(dtemp), key=operator.itemgetter(1))
        dstore.append(max_value)
        fstore.append(ftemp[max_index])
        astore.append(np.sqrt(max_value))
        dtemp[max_index] = 0
    if show == 1:
        for b in range(0, len(fstore), 1):
            sound_sine = sine(fstore[b], samplespace, fs_temp, astore[b])
            plt.plot(sound_sine)
            plt.show()
            plt.clf()
    return fstore, astore


def sine(freq, time_interval, rate, amp):
    w = 2. * np.pi * freq
    # Number of samples = duration (s) * sampling rate (Hz); linspace expects an integer count
    t = np.linspace(0, time_interval, int(time_interval * rate))
    y = amp * np.sin(w * t)
    return y


#
# Function which builds the exogenous regressor matrix from the extracted sine waves
# 'dta' = Data set
# 'fstore' = Frequencies of the extracted spectral peaks
# 'astore' = Amplitudes of the extracted spectral peaks


def buildFterms(dta, fstore, astore):
    n = len(fstore)
    n_obs = len(dta)
    fs_temp = .0002778
    # Set to 1/3600 because of hourly sampling
    samplespace = n_obs * 3600 + (24 * 3600)
    # Add one excess day for later fitting of sine waves to peaks
    store = []
    for i in range(0, n, 1):
        tmp = sine(fstore[i], samplespace, 0.0002778, astore[i])
        store.append(tmp)
    k_168_store = store[0]
    k_24_store = store[1]
    k_24 = np.transpose(k_24_store)
    k_168 = np.transpose(k_168_store)
    k_24 = pd.Series(k_24)
    k_168 = pd.Series(k_168)
    dta_ind, dta_val = max(enumerate(dta.iloc[120:143]), key=operator.itemgetter(1))
    # Visually inspect mean plot, select interval which has clear and representative peak, use to determine index.
    k_24_ind, k_24_val = max(enumerate(k_24.iloc[0:23]), key=operator.itemgetter(1))
    # peak in sound level at index 1 is matched by peak in sine wave at index 7. Thus, sound level[0] corresponds to\
    # sine waves[6]
    # print dta_ind, dta_val, k_24_ind, k_24_val
    k_24_sel = k_24[6:1014]
    k_168_sel = k_168[6:1014]
    exog = pd.concat([k_24_sel, k_168_sel], axis=1)
    return exog


#
# Function which takes data, makes a plot of the ACF and PACF, and saves the plot, if needed
# 'x' = Time series data, time indexed, over which to plot the ACF and PACF.
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
# Use output plot to visually interpret necessary parameters p, d, q, and seasonal component for SARIMAX procedure
#


def plotpacf(x, show=0, path=''):
    dflength = len(x)
    nlags = int(dflength * .80)
    fig = plt.figure(figsize=(12, 8))
    ax1 = fig.add_subplot(211)
    fig = sm.graphics.tsa.plot_acf(x.squeeze(), lags=nlags, ax=ax1)
    ax2 = fig.add_subplot(212)
    fig = sm.graphics.tsa.plot_pacf(x, lags=nlags, ax=ax2)
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()


#
# Function to calculate the Dickey-Fuller test of stationarity
# 'dta' = Time series data, time indexed, over which to test for stationarity using the Dickey-Fuller test.
#

def dftest(dta):
    print 'Results of Dickey-Fuller Test:'
    dftest = sm.tsa.stattools.adfuller(dta, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    if dfoutput[0] < dfoutput[4]:
        dfoutput['Stationary'] = 'True'
    else:
        dfoutput['Stationary'] = 'False'
    print dfoutput


#
# Function to difference the time series, in order to determine optimal value of d for ACF and PACF
# 'dta' = Data, time series indexed, to be differenced
# 'd' = Order of differencing to be applied
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#


def diffit(dta, d, show, path=''):
    templist = []
    for i in range(0, (len(dta) - d), 1):
        tempval = dta[i] - dta[i + d]
        templist.append(tempval)
    y = templist[d:len(templist)]
    y = pd.Series(y)
    plotpacf(y, show, path)
    return y


#
# Function to fit the ARIMA model based on parameters obtained from the ACF / PACF plot
# 'dta' = Time series data, time indexed, over which to fit a SARIMAX model.
# 'exog' = Exogenous variables used in ARIMA model
# 'p' = Number of AutoRegressive lags, initially based on the cutoff point of the PACF
# 'd' = Order of differencing based on visual examination of the ACF and PACF plots
# 'q' = Number of Moving Average lags, initially based on the cutoff point of the ACF
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#


def runARIMA(dta, exogvar, p, d, q, show=0, path=''):
    mod = sm.tsa.ARIMA(dta, (p, d, q), exogvar)
    results = mod.fit()
    resids = results.resid.values
    summarised = results.summary()
    print summarised
    plotpacf(resids, show, path)
    return results


#
# Function to use fitted ARIMA for prediction of observed data, compare predicted to observed
# 'dta' = Data used in ARIMA prediction
# 'exog' = Exogenous variables fitted in the model
# 'arima' = Result from correctly fitted ARIMA model, likely on the residuals of a decomposed time series
# 'datrng' = Range of dates used for original time series definition, used for specifying predictions
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#


def ARIMAcompare(dta, exogvar, arima, datrng, show=0, path=''):
    dflength = len(datrng) - 1
    observation = dta
    prediction = arima.predict(start=3, end=dflength, exog=exogvar, dynamic=True)
    df = pd.concat([prediction, observation], axis=1)
    df.columns = ['predicted', 'observed']
    plt.plot(prediction)
    plt.plot(observation)
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
    return df


#
# Function use fitted ARIMA model for predictions
# 'pred_hours' = number of hours we want to predict scores for
# 'firsttime' = last timestamp in observations
# 'df' = data frame containing data on which the ARIMA model was previously fitted
# 'results' = output of the modeling procedure
# 'freq' = Frequency of seasonal cycle that was used in decomposition
# 'decomp' = Output of the time series decomposition step
# 'mark' = Amount of hours included in the graph prior to prediction. Set at as close to 2 weeks as possible.
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#
# Output: A dataframe with observed and predicted values. Note that predictions > 5 time units are considered unreliable
# by modeling standards.
#


def pred(pred_hours, k, df, arima, show=0, path=''):
    n_obs = len(df.index)
    lastdt = df.index[n_obs - 1]
    lastdt = lastdt.to_datetime()
    datrng = pd.date_range(lastdt, periods=(pred_hours + 1), freq='H')
    future = pd.DataFrame(index=datrng, columns=df.columns)
    df = pd.concat([df, future])
    lendf = len(df.index)
    df['predicted'] = arima.predict(start=n_obs, end=lendf, exog=k, dynamic=True)
    print df
    marked = 2 * pred_hours
    df[['predicted', 'observed']].ix[-marked:].plot(figsize=(12, 8))
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
    return df[['predicted', 'observed']].ix[-marked:]


dirnow = os.getcwd()
fpath = dirnow + '/sounds_full2.csv'
fhand = open(fpath)
dta = pd.read_csv(fhand, sep=',')
dta_sel = dta.iloc[1248:2256, 2]
#
#
#
# Extract start and end date of measurements from sound data, adding one hour because
# the last hour of the last day is not counted
#
sound_start = dta.iloc[1248, 0]
# The above .iloc value needs to be changed depending on the length of the sound data set being read in.
#
# Establish start date
sound_start = re.sub('-', '/', sound_start)
sound_start = re.sub('_', ' ', sound_start)
sound_start = sound_start + ':00'
sound_start = pd.to_datetime(sound_start, format='%d/%m/%Y %H:%M:%S')
#
# Establish end date
indexer = len(dta.index) - 1
sound_end = dta.iloc[indexer, 0]
sound_end = re.sub('-', '/', sound_end)
sound_end = re.sub('_', ' ', sound_end)
sound_end = sound_end + ':00'
sound_end = pd.to_datetime(sound_end, format='%d/%m/%Y %H:%M:%S')
sound_diff = sound_end - sound_start
#
# Derive number of periods and create data set
num_observed = (sound_diff.days * 24) + ((sound_diff.seconds + 3600) / 3600)
usedates3 = pd.date_range(sound_start, periods=num_observed, freq='H')
usedates3 = pd.Series(usedates3)
usedates3.index = dta_sel.index
timedfreq = pd.concat([usedates3, dta_sel], axis=1)
timedfreq.index = timedfreq.iloc[:, 0]
freqset = pd.Series(timedfreq.iloc[:, 1])
filepath = dirnow + '/Sound_RollingMean.png'
plotmean(freqset, 0, filepath)
# Plotted mean shows recurring (seasonal) trends at periods of 24 hours and 168 hours.
# This means a seasonal model is needed that accounts for both of these influences
# To do so, Fourier series representing the 24- and 168 hour seasonal trends can be added to the ARIMA-model
#
#
#
#
# Check for stationarity of data
#
dftest(freqset)
# Time series can be considered stationary
#
#
#
# Establish frequencies and amplitudes with which to fit ARIMA model
#
# Decompose signal into frequency and amplitude
#
filepath = dirnow + "/Welch.png"
f, Pxx_den = runwelch(freqset, 0, filepath)
#
# Obtain sine wave parameters, optionally view test plots to check periodicity
freqs, amplitudes = getsines(len(freqset), f, Pxx_den, 2, 0)
#
# Use parameters to build Fourier series for observed data with varying values for k
exog_sel = buildFterms(freqset, freqs, amplitudes)
exog_sel.index = freqset.index
#
# fit ARIMA model, plot ACF and PACF for fitted model, check for effects orders of differencing on residuals
#
filepath = dirnow + '/Sound_resid_ACFPACF.png'
Sound_ARIMA = runARIMA(freqset, exog_sel, 1, 0, 0, show=0, path=filepath)
sound_residuals = Sound_ARIMA.resid
#
# Plot various acf / pacf plots of differencing given model residuals
filepath = dirnow + '/Sound_resid_ACFPACF_d1.png'
tempdta_d1 = diffit(sound_residuals, 1, 0, filepath)
filepath = dirnow + '/Sound_resid_ACFPACF_d2.png'
tempdta_d2 = diffit(sound_residuals, 2, 0, filepath)
# Of the two differenced models, one order of differencing seems to yield the best results
# Visual inspection of plots and model output suggests model with p = 2, d = 0 or p = 1, d = 1 to be optimal.
#
#
#
# Find optimal form of model
filepath = dirnow + '/Sound_resid_ACFPACF_200.png'
Sound_ARIMA_200 = runARIMA(freqset, exog_sel, 2, 0, 0, show=0, path=filepath)
filepath = dirnow + '/Sound_resid_ACFPACF_110.png'
Sound_ARIMA_110 = runARIMA(freqset, exog_sel, 1, 1, 0, show=0, path=filepath)
# Based on model output and ACF / PACF plot comparison for 'Sound_resid_ACFPACF_110.png' and \
# 'Sound_resid_ACFPACF_200.png', the model parameters for p = 2, d = 0, q = 0 are closer to optimal.
#
# Use selected model to predict observed values
filepath = dirnow + '/Sound_PredictObserved.png'
sound_comparison = ARIMAcompare(freqset, exog_sel, Sound_ARIMA_200, usedates3, 0, filepath)
#
# Predict values and store for Sound dataset
filepath = dirnow + '/Sound_PredictFuture.png'
sound_storepred = pred(168, exog_sel.iloc[0:170, :], sound_comparison, Sound_ARIMA_200, 0, filepath)

Data file
