我想检测每小时汇总的实时数据中的异常值。在此示例中,我选择了澳大利亚墨尔本的每小时行人数据 (Pedestrian volume (updated monthly),Pedestrian Counting System)
I understand there are a large number of existing detection algorithms,我会及时学习和使用。 在短期内,我想使用最简单的方法。 @Aksakal在以下stackexchange帖子中概述了一种这样的方法:
What algorithm should I use to detect anomalies on time-series?
我认为关键是图表中的“意外”限定符。为了 检测意外,您需要了解预期是什么。
我将从一个简单的时间序列模型开始,例如AR(p)或 ARMA(p,q)。使它适合数据,并适当添加季节性。对于 例如,您的SAR(1)(24)模型可能是:$ y_ {t} = c + \ phi y_ {t-1} + \ Phi_ {24} y_ {t-24} + \ Phi_ {25} y_ {t-25} + \ varepsilon_t $,其中$ t $ 以小时为单位。因此,您将在接下来的一个小时内预测图表。 每当预测错误$ e_t = y_t- \ hat y_t $“太大”时,您 发出警报。
当您估计模型时,您将得到方差 错误$ \ varepsilon_t $的$ \ sigma_ \ varepsilon $。取决于你 分布假设(例如正态分布)可以设置阈值 基于概率,例如$ | e_t | <3 \ sigma_ \ varepsilon $ 99.7%或单面$ e_t> 3 \ sigma_ \ varepsilon $。
访问者的数量可能非常持久,但是超级 季节性的。尝试使用季节性假人而不是 乘性季节性,那么您可以尝试使用ARMAX,其中X代表 外生变量,可能是假日假人,小时 假人,周末假人等。
不幸的是,该帖子没有详细介绍,因此我有几个问题:
Q.1)如何从auto.arima(data,xreg = xreg)产生的拟合模型中计算/提取ARIMA错误项$ \ epsilon $的方差$ \ sigma_ \ varepsilon $? 以下是一个完整的R示例,该示例使用多个季节来捕获每日,每周和每年的季节。 未进行了优化,仅作为示例实施方式提供,以帮助回答问题2。
我希望预测全年(或至少30天)的阈值,这意味着h = 24小时* 30 = 720。 本质上,我想预测的不是h每小时行人数量的平均值,而是h≥1(例如h = 720小时(30天)或什至h =)的每小时期望行人数量上限(例如3σ_ε) 24 * 365 = 8760小时(1年))。
Q.2)如何使用上述方法实现此目标?
示例代码可帮助解决上述问题。
library(rwalkr)
library(forecast)
library(tidyverse)
library(tsibble)
library(xts)
library(dygraphs)
pedestrian <- as_tibble(rwalkr::run_melb( year = c(2015:2018) ))
pedestrian_statelibrary <- pedestrian %>%
filter(Sensor == "State Library") %>%
left_join(tsibble::holiday_aus(2015:2018, state='VIC'), by=c( 'Date' = 'date' )) %>%
mutate(holiday = replace_na(holiday, ''),
Count = ifelse(Count == 0, NA, Count))
# Replace all counts of zero with NA so Box-Cox transform lambda = 0 and constrain output to +ve.
pedestrian_statelibrary_train <- pedestrian_statelibrary %>% filter(Date >= as.Date('2015-05-13'), Date < as.Date('2017-01-01') )
pedestrian_statelibrary_test <- pedestrian_statelibrary %>% filter(Date >= as.Date('2017-01-01') )
# tsbox functions to convert tsibble to tz indirectly. Must be a better way of doing this...
pedestrian_statelibrary_train_zoo <- tsbox::ts_zoo( pedestrian_statelibrary_train %>% select(Date_Time, Count) )
pedestrian_statelibrary_train_ts <- tsbox::ts_ts(pedestrian_statelibrary_train_zoo)
pedestrian_statelibrary_test_zoo <- tsbox::ts_zoo( pedestrian_statelibrary_test %>% select(Date_Time, Count) )
pedestrian_statelibrary_test_ts <- tsbox::ts_ts(pedestrian_statelibrary_test_zoo)
## Create external regressors.
xreg_holidays_train <- model.matrix(~as.factor(pedestrian_statelibrary_train$holiday))
xreg_holidays_train <- xreg_holidays_train[,-1] # remove intercept.
# Remove 1st level from levels()
colnames(xreg_holidays_train) <- levels(as.factor(pedestrian_statelibrary_train$holiday))[-1]
xreg_holidays_test <- model.matrix(~as.factor(pedestrian_statelibrary_test$holiday))
xreg_holidays_test <- xreg_holidays_test[,-1] # remove intercept.
colnames(xreg_holidays_test) <- levels(as.factor(pedestrian_statelibrary_test$holiday))[-1]
# periods (intervals(samples) per period) for hourly data.
period_day <- 24
period_week <- 24*7
period_year <- 24*365.25
seasonal_periods = c(period_day, period_week, period_year)
pedestrian_statelibrary_train_msts <- msts(pedestrian_statelibrary_train_ts,
start = start(pedestrian_statelibrary_train_ts),
seasonal.periods = seasonal_periods)
pedestrian_statelibrary_test_msts <- msts(pedestrian_statelibrary_test_ts,
start = start(pedestrian_statelibrary_test_ts),
seasonal.periods = seasonal_periods)
# set number of Fourier terms per season. Not optimal.
Ks = c(12, 10, 2)
xreg_train <- cbind( seasonality = fourier(pedestrian_statelibrary_train_msts, K = Ks),
holidays = xreg_holidays_train )
######################################
## Fit model of exogenous factors and ARIMA as error
######################################
fit <- pedestrian_statelibrary_train_msts %>%
auto.arima( xreg = xreg_train,
seasonal=FALSE,
stepwise = FALSE,
parallel = TRUE,
num.cores = NULL,
lambda = 0
)
######################################
## Forecast
######################################
fc <- forecast( fit,
xreg=cbind( seasonality = fourier(pedestrian_statelibrary_test_msts, K = Ks),
holidays = xreg_holidays_test)
)
######################################
## Check residuals and accuracy.
######################################
checkresiduals(fit)
checkresiduals(fc)
accuracy(fc, pedestrian_statelibrary_test_msts)
######################################
## Display fitted model and forecast using interactive dygraph.
######################################
# Plotting `forecast` prediction using `dygraphs`
# https://stackoverflow.com/questions/43624634/plotting-forecast-prediction-using-dygraphs#43668603
as.forecast.ts <- function(forecast_obj){
training <- forecast_obj$x
lower <- forecast_obj$lower[,2]
upper <- forecast_obj$upper[,2]
point_forecast <- forecast_obj$mean
cbind(training, lower, upper, point_forecast)
}
fc_ts <- as.forecast.ts(fc)
# Add the time stamps back to ts object.
idx_train <- pedestrian_statelibrary_train %>% ungroup() %>% select(Date_Time) %>% as.data.frame()
idx_test <- pedestrian_statelibrary_test %>% ungroup() %>% select(Date_Time) %>% as.data.frame()
idx_all <- rbind(idx_train, idx_test)
# Append testing values to fc_ts object, by left joining two xts objects.
test_xts <- as.xts(x = pedestrian_statelibrary_test %>%
dplyr::ungroup() %>%
as.data.frame() %>%
dplyr::select( Count ) %>%
dplyr::rename( 'testing' = 'Count'),
pedestrian_statelibrary_test$Date_Time)
fc_xts <- as.xts(x = fc_ts %>%
as.data.frame(),
idx_all$Date_Time )
fc_xts <- fc_xts %>% xts::merge.xts(test_xts, join='left')
dygraph(data = fc_xts, main = "Pedestrian traffic Forecasting for State Library.") %>%
dyRangeSelector %>%
dySeries(name = "training", label = "Train") %>%
dySeries(name = 'testing', label = "Test") %>%
dySeries(name = "point_forecast", label = "Predicted") %>%
dyLegend(show = "always", hideOnMouseOut = FALSE) %>%
dyOptions(axisLineColor = "navy", gridLineColor = "grey")