将ARIMA与外生回归变量一起用于R

时间:2018-10-17 00:35:59

标签: r time-series outliers arima

我想检测每小时汇总的实时数据中的异常值。在此示例中,我选择了澳大利亚墨尔本的每小时行人数据 (Pedestrian volume (updated monthly)Pedestrian Counting System

I understand there are a large number of existing detection algorithms,我会及时学习和使用。 在短期内,我想使用最简单的方法。 @Aksakal在以下stackexchange帖子中概述了一种这样的方法:

What algorithm should I use to detect anomalies on time-series?

  

我认为关键是图表中的“意外”限定符。为了   检测意外,您需要了解预期是什么。

     

我将从一个简单的时间序列模型开始,例如AR(p)或   ARMA(p,q)。使它适合数据,并适当添加季节性。对于   例如,您的SAR(1)(24)模型可能是:$ y_ {t} = c + \ phi   y_ {t-1} + \ Phi_ {24} y_ {t-24} + \ Phi_ {25} y_ {t-25} + \ varepsilon_t $,其中$ t $   以小时为单位。因此,您将在接下来的一个小时内预测图表。   每当预测错误$ e_t = y_t- \ hat y_t $“太大”时,您   发出警报。

     

当您估计模型时,您将得到方差   错误$ \ varepsilon_t $的$ \ sigma_ \ varepsilon $。取决于你   分布假设(例如正态分布)可以设置阈值   基于概率,例如$ | e_t | <3 \ sigma_ \ varepsilon $   99.7%或单面$ e_t> 3 \ sigma_ \ varepsilon $。

     

访问者的数量可能非常持久,但是超级   季节性的。尝试使用季节性假人而不是   乘性季节性,那么您可以尝试使用ARMAX,其中X代表   外生变量,可能是假日假人,小时   假人,周末假人等。

不幸的是,该帖子没有详细介绍,因此我有几个问题:

Q.1)如何从auto.arima(data,xreg = xreg)产生的拟合模型中计算/提取ARIMA错误项$ \ epsilon $的方差$ \ sigma_ \ varepsilon $? 以下是一个完整的R示例,该示例使用多个季节来捕获每日,每周和每年的季节。 进行了优化,仅作为示例实施方式提供,以帮助回答问题2。

我希望预测全年(或至少30天)的阈值,这意味着h = 24小时* 30 = 720。 本质上,我想预测的不是h每小时行人数量的平均值,而是h≥1(例如h = 720小时(30天)或什至h =)的每小时期望行人数量上限(例如3σ_ε) 24 * 365 = 8760小时(1年))。

Q.2)如何使用上述方法实现此目标?

示例代码可帮助解决上述问题。

library(rwalkr)
library(forecast)
library(tidyverse)
library(tsibble)
library(xts)
library(dygraphs)

pedestrian <- as_tibble(rwalkr::run_melb( year = c(2015:2018) )) 

pedestrian_statelibrary <- pedestrian %>% 
  filter(Sensor == "State Library") %>% 
  left_join(tsibble::holiday_aus(2015:2018, state='VIC'), by=c( 'Date' = 'date' )) %>%
  mutate(holiday = replace_na(holiday, ''),
         Count = ifelse(Count == 0, NA, Count))

# Replace all counts of zero with NA so Box-Cox transform lambda = 0 and constrain output to +ve.
pedestrian_statelibrary_train <- pedestrian_statelibrary %>% filter(Date >= as.Date('2015-05-13'), Date < as.Date('2017-01-01') )
pedestrian_statelibrary_test <- pedestrian_statelibrary %>% filter(Date >= as.Date('2017-01-01') )

# tsbox functions to convert tsibble to tz indirectly. Must be a better way of doing this...
pedestrian_statelibrary_train_zoo <- tsbox::ts_zoo( pedestrian_statelibrary_train %>% select(Date_Time, Count) )
pedestrian_statelibrary_train_ts <-    tsbox::ts_ts(pedestrian_statelibrary_train_zoo)

pedestrian_statelibrary_test_zoo <- tsbox::ts_zoo(    pedestrian_statelibrary_test %>% select(Date_Time, Count) )
pedestrian_statelibrary_test_ts <- tsbox::ts_ts(pedestrian_statelibrary_test_zoo)


## Create external regressors.
xreg_holidays_train <- model.matrix(~as.factor(pedestrian_statelibrary_train$holiday))
xreg_holidays_train <- xreg_holidays_train[,-1]  # remove intercept.
# Remove 1st level from levels()
colnames(xreg_holidays_train) <- levels(as.factor(pedestrian_statelibrary_train$holiday))[-1]

xreg_holidays_test <- model.matrix(~as.factor(pedestrian_statelibrary_test$holiday))
xreg_holidays_test <- xreg_holidays_test[,-1]  # remove intercept.
colnames(xreg_holidays_test) <- levels(as.factor(pedestrian_statelibrary_test$holiday))[-1]

# periods (intervals(samples) per period) for hourly data.
period_day <- 24
period_week <- 24*7
period_year <- 24*365.25

seasonal_periods = c(period_day, period_week, period_year)

pedestrian_statelibrary_train_msts <- msts(pedestrian_statelibrary_train_ts,
                                     start = start(pedestrian_statelibrary_train_ts), 
                                             seasonal.periods = seasonal_periods) 

pedestrian_statelibrary_test_msts <- msts(pedestrian_statelibrary_test_ts, 
                                      start = start(pedestrian_statelibrary_test_ts), 
                                       seasonal.periods = seasonal_periods) 

# set number of Fourier terms per season. Not optimal.
Ks = c(12, 10, 2)

xreg_train <- cbind( seasonality = fourier(pedestrian_statelibrary_train_msts, K = Ks), 
                 holidays = xreg_holidays_train ) 

######################################
## Fit model of exogenous factors and ARIMA as error
######################################
fit <- pedestrian_statelibrary_train_msts %>% 
  auto.arima( xreg = xreg_train,
              seasonal=FALSE,
              stepwise = FALSE,
              parallel = TRUE,
              num.cores = NULL,
              lambda = 0
              ) 

######################################
## Forecast
######################################

fc <- forecast( fit, 
            xreg=cbind( seasonality = fourier(pedestrian_statelibrary_test_msts, K = Ks), 
                        holidays = xreg_holidays_test) 
) 

######################################
## Check residuals and accuracy.
######################################

checkresiduals(fit)

checkresiduals(fc)

accuracy(fc, pedestrian_statelibrary_test_msts)


######################################
## Display fitted model and forecast using interactive dygraph.
######################################

# Plotting `forecast` prediction using `dygraphs`
# https://stackoverflow.com/questions/43624634/plotting-forecast-prediction-using-dygraphs#43668603
as.forecast.ts <- function(forecast_obj){

  training <- forecast_obj$x
  lower <- forecast_obj$lower[,2]
  upper <- forecast_obj$upper[,2]
  point_forecast <- forecast_obj$mean

  cbind(training, lower, upper, point_forecast)
}

fc_ts <- as.forecast.ts(fc)

# Add the time stamps back to ts object.
idx_train <- pedestrian_statelibrary_train %>% ungroup() %>%    select(Date_Time) %>% as.data.frame()
idx_test <- pedestrian_statelibrary_test %>% ungroup() %>% select(Date_Time) %>% as.data.frame()
idx_all <- rbind(idx_train, idx_test)

# Append testing values to fc_ts object, by left joining two xts objects.
test_xts <- as.xts(x = pedestrian_statelibrary_test %>% 
                     dplyr::ungroup() %>%
                     as.data.frame() %>% 
                     dplyr::select( Count ) %>%
                     dplyr::rename( 'testing' = 'Count'), 
                   pedestrian_statelibrary_test$Date_Time)


fc_xts <- as.xts(x = fc_ts %>% 
                   as.data.frame(),
                 idx_all$Date_Time )

fc_xts <- fc_xts %>% xts::merge.xts(test_xts, join='left')

dygraph(data = fc_xts, main = "Pedestrian traffic Forecasting for State Library.") %>% 
  dyRangeSelector %>%
  dySeries(name = "training", label = "Train") %>%
  dySeries(name = 'testing', label = "Test") %>%
  dySeries(name = "point_forecast", label = "Predicted") %>%
  dyLegend(show = "always", hideOnMouseOut = FALSE) %>%
  dyOptions(axisLineColor = "navy", gridLineColor = "grey")

0 个答案:

没有答案