Question

I have a data set similar to the following:

library(data.table)
library(caret)

set.seed(503)
foo <- data.table(group = rep(LETTERS[1:6], 150),
                  y  = rnorm(n = 6 * 150, mean = 5, sd = 2),
                  x1 = rnorm(n = 6 * 150, mean = 5, sd = 10),
                  x2 = rnorm(n = 6 * 150, mean = 25, sd = 10),
                  x3 = rnorm(n = 6 * 150, mean = 50, sd = 10),
                  x4 = rnorm(n = 6 * 150, mean = 0.5, sd = 10),
                  x5 = sample(c(1, 0), size = 6 * 150, replace = T))

foo[, period := 1:.N, by = group]

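Problem: I want to forecast y one step ahead for each group, using the variables x1, ..., x5.

I want to run several models in caret to decide which one to use.

So far, I am running them in a loop using time slices: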

window.length <- 115
timecontrol   <- trainControl(method          = 'timeslice',
                            initialWindow     = window.length,
                            horizon           = 1, 
                            selectionFunction = "best",
                            fixedWindow       = TRUE, 
                            savePredictions   = 'final')

model_list <- list()
for(g in unique(foo$group)){
  for(model in c("xgbTree", "earth", "cubist")){
    dat <- foo[group == g][, c('group', 'period') := NULL]
    model_list[[g]][[model]] <- train(y ~ . - 1,
                                      data = dat,
                                      method = model, 
                                      trControl = timecontrol)

  }
}

However, I would like to run all groups at once, using dummy variables to identify each group, for example:

dat <- cbind(foo,  model.matrix(~ group- 1, foo))
            y         x1       x2       x3            x4 x5 period groupA groupB groupC groupD groupE groupF
  1: 5.710250 11.9615460 22.62916 31.04790 -4.821331e-04  1      1      1      0      0      0      0      0
  2: 3.442213  8.6558983 32.41881 45.70801  3.255423e-01  1      1      0      1      0      0      0      0
  3: 3.485286  7.7295448 21.99022 56.42133  8.668391e+00  1      1      0      0      1      0      0      0
  4: 9.659601  0.9166456 30.34609 55.72661 -7.666063e+00  1      1      0      0      0      1      0      0
  5: 5.567950  3.0306864 22.07813 52.21099  5.377153e-01  1      1      0      0      0      0      1      0

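while still using timeslice so that the time series is processed in the correct time order.

Is there a way to declare the time variable in trainControl so that, in this case, each round of my one-step-ahead forecast uses six more observations and drops the first six?

I can do this by sorting the data and adjusting the horizon argument (given n groups, sort by the time variable and set horizon = n), but that has to change whenever the number of groups changes, and initialWindow must be time * n_groups: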

timecontrol   <- trainControl(method          = 'timeslice',
                            initialWindow     = window.length * length(unique(foo$group)),
                            horizon           = length(unique(foo$group)), 
                            selectionFunction = "best",
                            fixedWindow       = TRUE, 
                            savePredictions   = 'final')

Is there another way?

Answer 1

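I think the answer you are looking for is actually quite simple. You can use the skip argument of trainControl() to thin the resamples so that the window advances by one whole period at a time. That way each group-period is predicted only once, no time period is split between the training and test sets, and no information leaks.

Using the example you provided, with the rows ordered by period, set initialWindow = 115 * 6, horizon = 6 (the number of groups), and skip = 5 (the number of groups minus one, because caret moves the window forward by skip + 1 rows). The first test set then contains all groups for period 116, the next test set contains all groups for period 117, and so on.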
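One way to check how the slices line up with the periods is to call caret::createTimeSlices() directly, since that is the function trainControl(method = 'timeslice') relies on. The sketch below assumes a balanced panel (every group observed in every period) with rows ordered by period and then by group, as in the question's data. The names n_groups, n_periods and window_periods are only illustrative. It uses skip = n_groups - 1 because createTimeSlices() steps the window by skip + 1 rows.

```r
library(caret)

n_groups       <- 6    # groups A..F
n_periods      <- 150  # periods per group
window_periods <- 115  # training window, measured in periods

# Period of each row when rows are ordered by period, then group (assumption).
period <- rep(seq_len(n_periods), each = n_groups)

slices <- createTimeSlices(
  y             = seq_along(period),
  initialWindow = window_periods * n_groups,
  horizon       = n_groups,
  fixedWindow   = TRUE,
  skip          = n_groups - 1
)

# Periods covered by each test slice; each should be a single period.
test_periods <- lapply(slices$test, function(idx) unique(period[idx]))

length(slices$test)   # number of resampling splits
test_periods[[1]]     # period held out in the first split
test_periods[[2]]     # period held out in the second split
```

If every element of test_periods has length one, the same initialWindow, horizon and skip values can be passed to trainControl() for the pooled model.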

Answer 2

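I would use tidyr::nest() to nest the groups and then purrr::map() to iterate over the data. This approach is more flexible because it accommodates different group sizes, different numbers of groups, and varying models or other arguments passed to caret::train(). You can also run everything in parallel easily with furrr.

Load packages and create data

I use a tibble instead of a data.table. I also reduce the size of the data.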

library(caret)
library(tidyverse)

set.seed(503)

foo <- tibble(
  group = rep(LETTERS[1:6], 10),
  y  = rnorm(n = 6 * 10, mean = 5, sd = 2),
  x1 = rnorm(n = 6 * 10, mean = 5, sd = 10),
  x2 = rnorm(n = 6 * 10, mean = 25, sd = 10),
  x3 = rnorm(n = 6 * 10, mean = 50, sd = 10),
  x4 = rnorm(n = 6 * 10, mean = 0.5, sd = 10),
  x5 = sample(c(1, 0), size = 6 * 10, replace = T)
) %>%
  group_by(group) %>%
  mutate(period = row_number()) %>%
  ungroup()

Reduce the size of `initialWindow`

window.length <- 9
timecontrol   <- trainControl(
  method          = 'timeslice',
  initialWindow     = window.length,
  horizon           = 1,
  selectionFunction = "best",
  fixedWindow       = TRUE,
  savePredictions   = 'final'
)
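To see what this control object implies for each group, the resampling indices can be inspected with caret::createTimeSlices(). This is only a sketch: it assumes each nested group has exactly ten rows, as in the reduced data above.

```r
library(caret)

# Ten observations per group, matching the reduced data (assumption).
slices <- createTimeSlices(
  y             = 1:10,
  initialWindow = 9,
  horizon       = 1,
  fixedWindow   = TRUE
)

length(slices$train)   # number of resampling splits
slices$train[[1]]      # rows used for fitting
slices$test[[1]]       # row held out for the one-step-ahead forecast
```

With a window of 9 out of 10 rows there is a single split per group, so model selection rests on one held-out observation. A smaller initialWindow gives more splits.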

Create a function that returns a list of fitted model objects

# To fit each model in model_list to data and return model fits as a list.
fit_models <- function(data, model_list, timecontrol) {
  map(model_list,
      ~ train(
        y ~ . - 1,
        data = data,
        method = .x,
        trControl = timecontrol
      )) %>%
    set_names(model_list)
}

Fit models

model_list <- c("xgbTree", "earth", "cubist")
mods <- foo %>% 
  nest(data = -group) 

mods <- mods %>%
  mutate(fits = map(
    data,
    ~ fit_models(
      data = .x,
      model_list = model_list,
      timecontrol = timecontrol
    )
  ))

If you want to view the results for a specific group/model, you can do the following:

mods[which(mods$group == "A"), ]$fits[[1]]$xgbTree
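To compare groups side by side rather than one at a time, the resampled error can be pulled out of each fit into a table. The sketch below is not the answer's code: to stay quick and self-contained it uses a hypothetical toy tibble, a single predictor, and method = "lm" in place of the three models. The names toy, ctrl and rmse_by_group are illustrative only.

```r
library(caret)
library(tidyverse)

set.seed(503)

# Toy data: 3 groups, 10 periods each (illustrative only).
toy <- tibble(
  group = rep(LETTERS[1:3], 10),
  y     = rnorm(30, mean = 5, sd = 2),
  x1    = rnorm(30, mean = 5, sd = 10)
)

ctrl <- trainControl(
  method        = "timeslice",
  initialWindow = 8,
  horizon       = 1,
  fixedWindow   = TRUE
)

# One fit per group; "lm" stands in for the heavier models.
fits <- toy %>%
  nest(data = -group) %>%
  mutate(fit = map(data, ~ train(y ~ x1, data = .x, method = "lm", trControl = ctrl)))

# Best resampled RMSE for each group.
rmse_by_group <- fits %>%
  mutate(rmse = map_dbl(fit, ~ min(.x$results$RMSE))) %>%
  select(group, rmse)

rmse_by_group
```

With horizon = 1 each test set holds one observation, so caret warns that R-squared is missing for the resamples. RMSE is still computed.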

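Parallel processing with `furrr`

Simply initialize the workers with plan(multisession) and change map to future_map. Note that if your computer has fewer than 6 processing cores, you may want to set the number of workers below 6.

library(furrr)
plan(multisession, workers = 6)

mods <- foo %>% 
  nest(data = -group) 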

mods <- mods %>%
  mutate(fits = future_map(
    data,
    ~ fit_models(
      data = .x,
      model_list = model_list,
      timecontrol = timecontrol
    )
  ))
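Two details may be worth handling when running in parallel: capping the worker count at the number of available cores, and making random number generation reproducible across workers. The sketch below shows both on a trivial task rather than on train(). The name n_workers and the seed value are arbitrary choices for illustration.

```r
library(furrr)

# Never request more workers than the machine has cores.
n_workers <- min(6, future::availableCores())
plan(multisession, workers = n_workers)

# furrr_options(seed = ...) gives each element its own reproducible RNG stream.
res <- future_map(
  1:6,
  ~ mean(rnorm(100)),
  .options = furrr_options(seed = 123)
)

# Return to sequential processing and release the workers.
plan(sequential)
```

The same .options argument can be added to the future_map() call that fits the models, which matters because several caret methods use random numbers during fitting.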

Training time series models by group in caret
