Time series forecasting: handling known large orders

Time: 2015-04-13 11:59:18

Tags: r time-series forecasting outliers

I have a number of data sets with known outliers (large orders):

data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3","14Q4","15Q1", 155782698, 159463653.4, 172741125.6, 204547180, 126049319.8, 138648461.5, 135678842.1, 242568446.1, 177019289.3, 200397120.6, 182516217.1, 306143365.6, 222890269.2, 239062450.2, 229124263.2, 370575384.7, 257757410.5, 256125841.6, 231879306.6, 419580274, 268211059, 276378232.1, 261739468.7, 429127062.8, 254776725.6, 329429882.8, 264012891.6, 496745973.9, 284484362.55),ncol=2,byrow=FALSE)

The top 11 outliers for this particular series are:

outliers <- matrix(c("14Q4","14Q2","12Q1","13Q1","14Q2","11Q1","11Q4","14Q2","13Q4","14Q4","13Q1",20193525.68, 18319234.7, 12896323.62, 12718744.01, 12353002.09, 11936190.13, 11356476.28, 11351192.31, 10101527.85, 9723641.25, 9643214.018),ncol=2,byrow=FALSE)

What methods are there for forecasting the time series while taking these outliers into account?

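I have already tried replacing the outliers one at a time, starting with the biggest (that is, running the data set 10 times, each time replacing the next-biggest outlier, until the 10th data set has all the outliers replaced). I have also tried simply removing the outliers (again running the data set 10 times, removing one more outlier each time, until all 10 outliers are removed in the 10th data set).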

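I just want to point out that removing these large orders does not delete the data point completely, because other deals also take place in that quarter.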
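The removal approach described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper name `remove_top_k` and the variables `periods`, `orders`, `big` and `adjusted_sets` are made up for this sketch, and "removing" an outlier is taken to mean subtracting the known order amount from the total of the quarter it falls in, so the quarter keeps its remaining deals.

```r
# Quarter labels 08Q1 .. 15Q1 and the quarterly totals of the first series.
periods <- paste0(rep(sprintf("%02d", 8:15), each = 4), "Q", 1:4)[1:29]
orders <- c(155782698, 159463653.4, 172741125.6, 204547180,
            126049319.8, 138648461.5, 135678842.1, 242568446.1,
            177019289.3, 200397120.6, 182516217.1, 306143365.6,
            222890269.2, 239062450.2, 229124263.2, 370575384.7,
            257757410.5, 256125841.6, 231879306.6, 419580274,
            268211059, 276378232.1, 261739468.7, 429127062.8,
            254776725.6, 329429882.8, 264012891.6, 496745973.9,
            284484362.55)

# Known large orders, already sorted from largest to smallest.
big <- data.frame(
  period = c("14Q4", "14Q2", "12Q1", "13Q1", "14Q2", "11Q1",
             "11Q4", "14Q2", "13Q4", "14Q4", "13Q1"),
  amount = c(20193525.68, 18319234.7, 12896323.62, 12718744.01,
             12353002.09, 11936190.13, 11356476.28, 11351192.31,
             10101527.85, 9723641.25, 9643214.018),
  stringsAsFactors = FALSE
)

# Hypothetical helper: subtract the k largest known orders from the
# quarters they belong to, leaving the rest of each quarter intact.
remove_top_k <- function(k) {
  top <- big[seq_len(k), ]
  per_qtr <- tapply(top$amount, top$period, sum)  # several orders may share a quarter
  adjusted <- orders
  idx <- match(names(per_qtr), periods)
  adjusted[idx] <- adjusted[idx] - as.numeric(per_qtr)
  adjusted
}

# One adjusted series per number of removed orders (1, 2, ..., all of them).
adjusted_sets <- lapply(seq_len(nrow(big)), remove_top_k)
```

Each element can then be turned into a quarterly series with `ts(adjusted_sets[[k]], start = c(2008, 1), frequency = 4)` and passed to whichever model is being tested.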

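My code tests the data against multiple forecasting models (ARIMA weighted on the out-of-sample period, ARIMA weighted on the in-sample period, ARIMA weighted, ARIMA, additive Holt-Winters weighted and multiplicative Holt-Winters weighted), so the solution needs to be something that can be adapted to all of these models.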

Here are a couple more data sets that I use, although I do not have the outliers for these series:

data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3", 26393.99306, 13820.5037, 23115.82432,    25894.41036,    14926.12574,    15855.8857, 21565.19002,    49373.89675,    27629.10141,    43248.9778, 34231.73851,    83379.26027,    54883.33752,    62863.47728,    47215.92508,    107819.9903,    53239.10602,    71853.5,    59912.7624, 168416.2995,    64565.6211, 94698.38748,    80229.9716, 169205.0023,    70485.55409,    133196.032, 78106.02227), ncol=2,byrow=FALSE)

data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3",3311.5124,    3459.15634, 2721.486863,    3286.51708, 3087.234059,    2873.810071,    2803.969394,    4336.4792,  4722.894582,    4382.349583,    3668.105825,    4410.45429, 4249.507839,    3861.148928,    3842.57616, 5223.671347,    5969.066896,    4814.551389,    3907.677816,    4944.283864,    4750.734617,    4440.221993,    3580.866991,    3942.253996,    3409.597269,    3615.729974,    3174.395507),ncol=2,byrow=FALSE)

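If this is too complicated, then an explanation of how, in R, the data is treated for forecasting once outliers have been detected by some command (for example smoothing), and of how I could approach writing that code myself (without using the commands that detect outliers), would also help.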

3 answers:

Answer 0: (score: 6)

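Your outliers appear to be seasonal variations, with the largest orders occurring in the fourth quarter. Many of the forecasting models you mention can handle seasonal adjustments. For example, the simplest model could have a linear dependence on the year with a correction for each season. The code would look like: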

df <- data.frame(period= c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3",
                       "10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2",
                       "13Q3","13Q4","14Q1","14Q2","14Q3","14Q4","15Q1"),
                 order= c(155782698, 159463653.4, 172741125.6, 204547180, 126049319.8, 138648461.5,
                        135678842.1, 242568446.1, 177019289.3, 200397120.6, 182516217.1, 306143365.6,
                        222890269.2, 239062450.2, 229124263.2, 370575384.7, 257757410.5, 256125841.6,
                        231879306.6, 419580274, 268211059, 276378232.1, 261739468.7, 429127062.8, 254776725.6,
                        329429882.8, 264012891.6, 496745973.9, 42748656.73))

# Split each period label into a numeric year and a quarter label
seasonal <- data.frame(year=as.numeric(substr(df$period, 1,2)), qtr=substr(df$period, 3,4), data=df$order)
# Linear trend in year plus a fixed offset for each quarter
ord_model <- lm(data ~ year + qtr, data=seasonal)
seasonal <- cbind(seasonal, fitted=fitted(ord_model))
library(reshape2)
library(ggplot2)
# Reshape to long format so observed and fitted values share one plot
plot_fit <- melt(seasonal,id.vars=c("year", "qtr"), variable.name = "Source", value.name="Order" )
ggplot(plot_fit, aes(x=year, y = Order, colour = qtr, shape=Source)) + geom_point(size=3)

This gives the results shown in the chart below: [figure: Linear fit with seasonal adjustments]

A model with a seasonal adjustment but a nonlinear dependence on the year may give a better fit.
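As a rough illustration of that last remark, the sketch below compares the linear-in-year model with one that adds a quadratic year term. The quadratic form is only an assumption chosen for illustration (the answer does not say which nonlinear form it has in mind), the names `fit_lin`, `fit_quad`, `new_qtrs` and `pred_2015` are made up here, and only the complete years 08Q1 to 14Q4 are used.

```r
# Quarterly orders for 08Q1 .. 14Q4 (complete years only).
orders <- c(155782698, 159463653.4, 172741125.6, 204547180,
            126049319.8, 138648461.5, 135678842.1, 242568446.1,
            177019289.3, 200397120.6, 182516217.1, 306143365.6,
            222890269.2, 239062450.2, 229124263.2, 370575384.7,
            257757410.5, 256125841.6, 231879306.6, 419580274,
            268211059, 276378232.1, 261739468.7, 429127062.8,
            254776725.6, 329429882.8, 264012891.6, 496745973.9)

seasonal <- data.frame(
  year = rep(8:14, each = 4),
  qtr  = factor(rep(paste0("Q", 1:4), times = 7)),
  data = orders
)

# Linear trend plus quarter offsets (the model from the answer).
fit_lin <- lm(data ~ year + qtr, data = seasonal)

# Assumed nonlinear alternative: add a quadratic term in year.
fit_quad <- lm(data ~ year + I(year^2) + qtr, data = seasonal)

# Compare the nested fits; a small p-value or lower AIC favours the quadratic term.
print(anova(fit_lin, fit_quad))
print(AIC(fit_lin, fit_quad))

# Point predictions for the four quarters of 2015 from the quadratic model.
new_qtrs <- data.frame(year = 15, qtr = factor(paste0("Q", 1:4)))
pred_2015 <- predict(fit_quad, newdata = new_qtrs)
```

Polynomial trends can behave badly when extrapolated, so predictions more than a year or so ahead should be treated with caution.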

Answer 1: (score: 4)

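The methods you tried for cleaning the outliers out of your data are not robust enough to identify them. I should add that there is a free outlier package for R called tsoutliers, but it won't do the things I am about to show you.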

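You have an interesting time series here. The trend changes over time, with the upward slope weakening somewhat. If you bring in two time-trend variables, the first beginning at period 1 and the second beginning at period 14 and running forward, you will capture this change. As for the seasonality, you can capture the high fourth quarter with a dummy variable. The model is parsimonious: the other 3 quarters do not differ from the average, so there is no need for an AR12, seasonal differencing, or 3 seasonal dummies. You can also capture the impact of the last two observations, which are outliers, with two dummy variables. Ignore the 49 above the word "trend" in the plot, as that is just the name of the series being modeled.

[figure: Actual, Fit, Forecasts with Confidence limits]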
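The regression described above can be approximated with plain `lm()`. This is a minimal sketch under stated assumptions, not the software or exact specification used for the plot: the column names `t1`, `t2`, `q4`, `pulse28` and `pulse29` are made up, the second trend is coded as 0 before period 14 and 1, 2, 3, ... from period 14 onward, and the series is the first one from the question.

```r
# First series from the question, 08Q1 .. 15Q1 (29 quarters).
orders <- c(155782698, 159463653.4, 172741125.6, 204547180,
            126049319.8, 138648461.5, 135678842.1, 242568446.1,
            177019289.3, 200397120.6, 182516217.1, 306143365.6,
            222890269.2, 239062450.2, 229124263.2, 370575384.7,
            257757410.5, 256125841.6, 231879306.6, 419580274,
            268211059, 276378232.1, 261739468.7, 429127062.8,
            254776725.6, 329429882.8, 264012891.6, 496745973.9,
            284484362.55)
n <- length(orders)
t <- seq_len(n)

d <- data.frame(
  orders  = orders,
  t1      = t,                       # trend starting at period 1
  t2      = pmax(0, t - 13),         # assumed coding: second trend from period 14
  q4      = as.integer(t %% 4 == 0), # dummy for the fourth quarter
  pulse28 = as.integer(t == 28),     # pulse dummy for the second-to-last observation
  pulse29 = as.integer(t == 29)      # pulse dummy for the last observation
)

fit <- lm(orders ~ t1 + t2 + q4 + pulse28 + pulse29, data = d)
print(summary(fit))

# Forecast the next four quarters (15Q2 .. 16Q1); pulse dummies are 0 in the future.
h <- 30:33
future <- data.frame(
  t1 = h, t2 = h - 13, q4 = as.integer(h %% 4 == 0),
  pulse28 = 0, pulse29 = 0
)
fc <- predict(fit, newdata = future, interval = "prediction", level = 0.95)
```

Because each pulse dummy is 1 for exactly one observation, the fit reproduces those two points exactly, so they no longer influence the trend and seasonal coefficients.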

Answer 2: (score: 4)

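You already said that you tried different ARIMA models, but as WaltS mentioned, your series does not seem to contain big outliers but rather a seasonal component, which is nicely captured by auto.arima() from the forecast package: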

library(forecast)
# Column 2 of the question's matrix holds the values as character strings
myTs <- ts(as.numeric(data[,2]), start=c(2008, 1), frequency=4)
myArima <- auto.arima(myTs, lambda=0)
myForecast <- forecast(myArima)
plot(myForecast)


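The argument lambda=0 forces auto.arima() to work on Box-Cox-transformed data, which for lambda=0 is the natural log (or you could take the log yourself), to account for the increasing amplitude of the seasonal component.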
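To make the "take the log yourself" option concrete, here is a minimal base-R sketch that needs no extra packages. The model order is an assumption: a seasonal ARIMA(0,1,1)(0,1,1)[4] is hard-coded purely for illustration and is not necessarily what auto.arima() would select. The names `log_ts`, `fit`, `log_fc`, `point_fc`, `lower` and `upper` are made up for this example.

```r
# First series from the question, 08Q1 .. 15Q1.
orders <- c(155782698, 159463653.4, 172741125.6, 204547180,
            126049319.8, 138648461.5, 135678842.1, 242568446.1,
            177019289.3, 200397120.6, 182516217.1, 306143365.6,
            222890269.2, 239062450.2, 229124263.2, 370575384.7,
            257757410.5, 256125841.6, 231879306.6, 419580274,
            268211059, 276378232.1, 261739468.7, 429127062.8,
            254776725.6, 329429882.8, 264012891.6, 496745973.9,
            284484362.55)
myTs <- ts(orders, start = c(2008, 1), frequency = 4)

# Box-Cox with lambda = 0 is the natural log: the growing seasonal swings
# become roughly constant in size on the log scale.
log_ts <- log(myTs)

# Assumed order for illustration only (not chosen by auto.arima()).
fit <- arima(log_ts, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 4))

# Forecast 8 quarters ahead on the log scale, then back-transform with exp().
log_fc   <- predict(fit, n.ahead = 8)
point_fc <- exp(log_fc$pred)
lower    <- exp(log_fc$pred - 1.96 * log_fc$se)  # approximate 95% limits
upper    <- exp(log_fc$pred + 1.96 * log_fc$se)
```

Back-transforming with `exp()` gives the median of the forecast distribution on the original scale rather than the mean, and the resulting limits are asymmetric around the point forecast.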