I am using R for my analysis. My data look like this:
id timestamp cumsum
1284381 21/01/2015 33
1284381 21/01/2015 57
1284381 2/3/2015 79
1284381 4/3/2015 203
1284381 25/03/2015 475
1284381 11/4/2015 578
1284381 17/04/2015 856
1284381 21/04/2015 1189
1284381 5/5/2015 1214
1284381 10/5/2015 1321
1284381 12/5/2015 1340
1284381 15/05/2015 1529
1284381 18/05/2015 1649
1284381 19/05/2015 1977
1284381 21/05/2015 2385
1284381 23/05/2015 2528
1284381 26/05/2015 2556
1284381 29/05/2015 2705
1284381 1/6/2015 2898
1284381 4/6/2015 2913
1284381 7/6/2015 2921
1284381 13/06/2015 2922
1284381 13/06/2015 3622
1284381 16/06/2015 3834
1284381 19/06/2015 3913
1284895 27/01/2015 6
1284895 27/01/2015 49
1284895 18/03/2015 57
1284895 20/03/2015 58
1284895 23/03/2015 59
1284895 23/03/2015 60
1284895 24/03/2015 62
1284895 29/03/2015 67
1284895 31/03/2015 75
1284895 1/4/2015 76
1284895 2/4/2015 77
1284895 8/4/2015 78
1284895 16/04/2015 80
1284895 21/04/2015 103
1284895 23/04/2015 275
1284895 26/04/2015 293
1284895 27/04/2015 386
1284895 30/04/2015 539
1284895 3/5/2015 807
1284895 8/5/2015 851
1284895 11/5/2015 988
1284895 14/05/2015 1056
1284895 18/05/2015 1157
1284895 21/05/2015 1226
1284895 23/05/2015 1383
1284895 26/05/2015 1501
1284895 30/05/2015 1518
1284895 2/6/2015 1694
1284895 4/6/2015 1695
1284895 8/6/2015 1858
1284895 11/6/2015 1909
1284895 14/06/2015 1917
1284895 17/06/2015 1957
1284895 20/06/2015 1973
The first column is an ID, the second is a date, and the third is a cumulative sum of values. I want to build a forecasting model for these data that will tell me, for a given ID, what the cumsum will be on some future date (say, 08/08/2015). I have tried forecasting models with two variables before, but since this involves three variables, and the data are recorded by day rather than being continuous, I am having trouble setting up the model.
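(A side note on preparation: the day/month fields above mix single and double digits, but a single format string parses them all. A minimal sketch, assuming the table has been read into a data frame called df like the sample data in the answer below:)

library(dplyr)

# "%d/%m/%Y" accepts both padded and unpadded days/months, e.g. "2/3/2015" and "21/01/2015"
df$date <- as.Date(as.character(df$timestamp), format = "%d/%m/%Y")

# order each id chronologically before any modelling
df <- arrange(df, id, date)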
Answer 0 (score: 1)
This should be fairly straightforward, but I'm sure you will want to tweak it for more detail. Take a look at the forecast package for more information. It's a great tool.
Sample data:
df <- structure(list(id = c(1284381L, 1284381L, 1284381L, 1284381L,
1284381L, 1284381L, 1284381L, 1284381L, 1284381L, 1284381L, 1284381L,
1284381L, 1284381L, 1284381L, 1284381L, 1284381L, 1284381L, 1284381L,
1284381L, 1284381L, 1284381L, 1284381L, 1284381L, 1284381L, 1284381L,
1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 1284895L,
1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 1284895L,
1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 1284895L,
1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 1284895L,
1284895L, 1284895L, 1284895L, 1284895L, 1284895L, 1284895L),
timestamp = structure(c(25L, 25L, 20L, 43L, 32L, 4L, 14L,
26L, 45L, 3L, 7L, 11L, 17L, 18L, 27L, 30L, 34L, 38L, 2L,
44L, 46L, 8L, 8L, 13L, 19L, 35L, 35L, 16L, 23L, 28L, 28L,
31L, 37L, 42L, 1L, 21L, 47L, 12L, 26L, 29L, 33L, 36L, 40L,
39L, 48L, 5L, 9L, 17L, 27L, 30L, 34L, 41L, 22L, 44L, 49L,
6L, 10L, 15L, 24L), .Label = c("1/4/2015", "1/6/2015", "10/5/2015",
"11/4/2015", "11/5/2015", "11/6/2015", "12/5/2015", "13/06/2015",
"14/05/2015", "14/06/2015", "15/05/2015", "16/04/2015", "16/06/2015",
"17/04/2015", "17/06/2015", "18/03/2015", "18/05/2015", "19/05/2015",
"19/06/2015", "2/3/2015", "2/4/2015", "2/6/2015", "20/03/2015",
"20/06/2015", "21/01/2015", "21/04/2015", "21/05/2015", "23/03/2015",
"23/04/2015", "23/05/2015", "24/03/2015", "25/03/2015", "26/04/2015",
"26/05/2015", "27/01/2015", "27/04/2015", "29/03/2015", "29/05/2015",
"3/5/2015", "30/04/2015", "30/05/2015", "31/03/2015", "4/3/2015",
"4/6/2015", "5/5/2015", "7/6/2015", "8/4/2015", "8/5/2015",
"8/6/2015"), class = "factor"), cumsum = c(33L, 57L, 79L,
203L, 475L, 578L, 856L, 1189L, 1214L, 1321L, 1340L, 1529L,
1649L, 1977L, 2385L, 2528L, 2556L, 2705L, 2898L, 2913L, 2921L,
2922L, 3622L, 3834L, 3913L, 6L, 49L, 57L, 58L, 59L, 60L,
62L, 67L, 75L, 76L, 77L, 78L, 80L, 103L, 275L, 293L, 386L,
539L, 807L, 851L, 988L, 1056L, 1157L, 1226L, 1383L, 1501L,
1518L, 1694L, 1695L, 1858L, 1909L, 1917L, 1957L, 1973L)), .Names = c("id",
"timestamp", "cumsum"), class = "data.frame", row.names = c(NA,
-59L))
Code:
# Load libraries
library(forecast)
library(dplyr)
# Function for forecasting
fc <- function(id, forecast_out){
  # compare the id column against the value passed in via .env;
  # a bare `id == id` would compare the column with itself and keep every row
  fcast <- filter(df, id == .env$id)
  fit <- forecast(fcast$cumsum, h = forecast_out)
  plot(fit)
  fit
}
# Forecast id 20 periods out
fc(id = 1284381, 20)
Output:
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
60 1972.852 919.796925 3025.907 362.3432 3583.361
61 1972.852 427.927523 3517.777 -389.9061 4335.610
62 1972.852 2.484834 3943.219 -1040.5645 4986.269
63 1972.852 -400.473573 4346.178 -1656.8363 5602.540
64 1972.852 -798.641352 4744.345 -2265.7813 6211.485
65 1972.852 -1201.836164 5147.540 -2882.4146 6828.119
66 1972.852 -1616.650850 5562.355 -3516.8189 7462.523
67 1972.852 -2048.143856 5993.848 -4176.7305 8122.435
68 1972.852 -2500.592208 6446.296 -4868.6905 8814.395
69 1972.852 -2977.876792 6923.581 -5598.6343 9544.338
70 1972.852 -3483.701691 7429.406 -6372.2267 10317.931
71 1972.852 -4021.730885 7967.435 -7195.0714 11140.775
72 1972.852 -4595.680763 8541.385 -8072.8520 12018.556
73 1972.852 -5209.387818 9155.092 -9011.4359 12957.140
74 1972.852 -5866.861969 9812.566 -10016.9559 13962.660
75 1972.852 -6572.331508 10518.036 -11095.8784 15041.582
76 1972.852 -7330.283276 11275.987 -12255.0656 16200.770
77 1972.852 -8145.500375 12091.204 -13501.8326 17447.537
78 1972.852 -9023.098946 12968.803 -14844.0038 18789.708
79 1972.852 -9968.565065 13914.269 -16289.9694 20235.673
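The forecast above is indexed by observation number, not by calendar date, because the observations are irregularly spaced. To get an answer for a specific date such as 08/08/2015, one option (an assumption added here, not something this answer shows) is to interpolate one id's series onto a daily grid and forecast the corresponding number of days ahead. A rough sketch:

library(forecast)
library(dplyr)

one_id <- filter(df, id == 1284381)
one_id$date <- as.Date(as.character(one_id$timestamp), format = "%d/%m/%Y")

# collapse duplicate dates (13/06/2015 appears twice) to their last cumulative value
one_id <- summarise(group_by(one_id, date), cumsum = max(cumsum))

# linear interpolation onto a daily grid, so one forecast step = one calendar day
grid  <- seq(min(one_id$date), max(one_id$date), by = "day")
daily <- approx(as.numeric(one_id$date), one_id$cumsum, xout = as.numeric(grid))$y

# number of days from the last observation to the target date
h <- as.numeric(as.Date("2015-08-08") - max(one_id$date))

fc_daily <- forecast(daily, h = h)
tail(fc_daily$mean, 1)   # point forecast of cumsum on 08/08/2015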
Answer 1 (score: 0)
I agree with Amstell's recommendation to check out Rob Hyndman's forecast package in R. If I remember correctly, it tries to select the model with the lowest MSE of one-step-ahead in-sample forecasts, using AICc as the selection criterion. More recently he added support for a time-series cross-validation procedure, which can also be used to select which model performs best when forecasting out of sample. You can read more about it here: Time-Series Cross-Validation
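The cross-validation procedure mentioned here is exposed by the forecast package as tsCV(). A minimal sketch of one-step-ahead cross-validation on the first id's series (my own illustration using the df data frame from the answer above, not code from this answer):

library(forecast)
library(dplyr)

y <- filter(df, id == 1284381)$cumsum

# forecast function handed to tsCV(); auto.arima() is refitted on every training slice
f_arima <- function(x, h) forecast(auto.arima(x), h = h)

# one-step-ahead cross-validation errors, summarised as an RMSE
e <- tsCV(y, f_arima, h = 1)
sqrt(mean(e^2, na.rm = TRUE))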
I essentially carried out the latter two steps in Python to select from a subset of ARIMAX models. Below is the one-step-ahead performance on the validation dataset of the model I selected as optimal.
The resulting model was: [6,2,2], i.e. AR = 6, Diff = 2, MA = 2
The X variable is the date, converted to a day count. I applied a lognormal transformation to the Y variable <- chosen completely arbitrarily.
One-step errors on the validation dataset: MAE = 88.9, MAPE = 2.8%, rsqr = 97.3%
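The model above was fitted in Python, but the same structure can be sketched with the forecast package in R. This is my own rough translation of the description, not the answerer's code, and an order as high as (6,2,2) may not be estimable on a series this short, so auto.arima() is left to pick the order here:

library(forecast)
library(dplyr)

one_id <- filter(df, id == 1284381)
one_id$date <- as.Date(as.character(one_id$timestamp), format = "%d/%m/%Y")

# X: the date converted to a day count; Y: log-transformed cumsum
day_count <- as.numeric(one_id$date - min(one_id$date))
fit <- auto.arima(log(one_id$cumsum), xreg = day_count)
# to impose the order reported above instead:
# fit <- Arima(log(one_id$cumsum), order = c(6, 2, 2), xreg = day_count)

# forecast out to 08/08/2015 by supplying the future day counts, then back-transform
future <- as.numeric(seq(max(one_id$date) + 1, as.Date("2015-08-08"), by = "day") - min(one_id$date))
fc_x <- forecast(fit, xreg = future)
exp(tail(fc_x$mean, 1))   # point forecast of cumsum on 08/08/2015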
If your goal is to estimate more than one day ahead, then I would suggest fitting the model so that it minimizes the cross-validation error at the number of days ahead that you need. Again, in this case I assumed one day ahead.