Question

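I have recently been digging into the R package caret, and I have a question about the reproducibility and comparability of models during training that I have not been able to settle.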

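My intention is that each train call (and thus each resulting model) uses the same cross-validation splits, so that the stored cross-validation results are comparable out-of-sample estimates for the models computed during building.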

One method I have seen is to set the seed before each train call, like so:

set.seed(1)
model <- train(..., trControl = trainControl(...))
set.seed(1)
model2 <- train(..., trControl = trainControl(...))
set.seed(1)
model3 <- train(..., trControl = trainControl(...))

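However, does sharing a trainControl object between train calls mean that they use the same resamples and indexes in general, or do I have to pass the seeds argument to the function explicitly? Is the randomness of the trainControl object resolved when the object is used, or when it is declared?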

My current method has been:

set.seed(1)
train_control <- trainControl(method="cv", ...)
model1 <- train(..., trControl = train_control)
model2 <- train(..., trControl = train_control)
model3 <- train(..., trControl = train_control)

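Will these train calls use the same splits and be comparable, or do I have to take further steps to ensure that? That is, should I specify seeds when the trainControl object is created, call set.seed before each train, or both?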

Hopefully this makes some sense and is not complete rubbish. Any help would be appreciated.

The code project I am asking about can be found here. It may be easier to follow by reading it.

Answer 1

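The CV folds are not created when trainControl is defined unless they are explicitly supplied through the index argument, which is what I recommend. The folds can be created with one of caret's specialized functions: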

createFolds
createMultiFolds
createTimeSlices
groupKFold

That being said, setting a specific seed before the trainControl definition will not result in the same CV folds.
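A quick way to see this for yourself is to inspect the control object before and after a train call. The sketch below is only illustrative; the object names ctrl and fit are mine, and it assumes caret's usual behavior of storing the resolved control settings in the fitted object's control element.

```r
library(caret)

set.seed(1)
ctrl <- trainControl(method = "cv", number = 10)

# No resampling indexes exist yet: the seed above had nothing to act on
is.null(ctrl$index)

# The folds are only drawn inside train(), at call time
fit <- train(iris[, 1:4], iris[, 5], method = "knn", trControl = ctrl)

# The fitted object carries the folds that were actually used
length(fit$control$index)
```

Because the folds are drawn inside train, the random number state at the moment of each train call is what determines them, not the state when trainControl was called.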

Example:

library(caret)
library(tidyverse)

set.seed(1)
trControl = trainControl(method = "cv",
                         returnResamp = "final",
                         savePredictions = "final")

Create two models:

knnFit1 <- train(iris[,1:4], iris[,5],
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,
                 trControl = trControl)

ldaFit2 <- train(iris[,1:4], iris[,5],
                 method = "lda",
                 tuneLength = 10,
                 trControl = trControl)

Check whether the same indexes are in the same folds:

knnFit1$pred %>%
  left_join(ldaFit2$pred, by = "rowIndex") %>%
  mutate(same = Resample.x == Resample.y) %>%
  {all(.$same)}
#FALSE

If you set the same seed before each train call:

set.seed(1)
knnFit1 <- train(iris[,1:4], iris[,5],
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,
                 trControl = trControl)

set.seed(1)
ldaFit2 <- train(iris[,1:4], iris[,5],
                 method = "lda",
                 tuneLength = 10,
                 trControl = trControl)


set.seed(1)
rangerFit3 <- train(iris[,1:4], iris[,5],
                 method = "ranger",
                 tuneLength = 10,
                 trControl = trControl)


knnFit1$pred %>%
  left_join(ldaFit2$pred, by = "rowIndex") %>%
  mutate(same = Resample.x == Resample.y) %>%
  {all(.$same)}

knnFit1$pred %>%
  left_join(rangerFit3$pred, by = "rowIndex") %>%
  mutate(same = Resample.x == Resample.y) %>%
  {all(.$same)}

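the same indexes will be used in the folds. However, I would not rely on this method when using parallel computation. To ensure that the same data splits are used, it is best to define them manually through the index / indexOut arguments of trainControl.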
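A minimal sketch of defining the splits manually, assuming 10-fold CV on iris. The names folds, ctrl, knn_fit, lda_fit, knn_folds and lda_folds are illustrative and not from the question; the lda method needs the MASS package to be installed.

```r
library(caret)

set.seed(1)
# returnTrain = TRUE makes each list element hold the training-row indexes,
# which is the form the index argument of trainControl expects
folds <- createFolds(iris$Species, k = 10, returnTrain = TRUE)

ctrl <- trainControl(method = "cv",
                     index = folds,
                     savePredictions = "final")

# No set.seed() between the calls: the folds come from `folds`, not the RNG
knn_fit <- train(iris[, 1:4], iris[, 5], method = "knn", trControl = ctrl)
lda_fit <- train(iris[, 1:4], iris[, 5], method = "lda", trControl = ctrl)

# For every row, compare which fold held it out in each model
knn_folds <- knn_fit$pred$Resample[order(knn_fit$pred$rowIndex)]
lda_folds <- lda_fit$pred$Resample[order(lda_fit$pred$rowIndex)]
identical(knn_folds, lda_folds)
```

With the folds stored in the control object, every model trained with ctrl is evaluated on the same held-out rows regardless of the random number state or the parallel backend.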

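When you set the index argument manually the folds will be the same, but this does not guarantee that two models built with the same method will be identical, because most methods include some kind of stochastic process. To be fully reproducible, it is advisable to also set the seed before each train call. To get fully reproducible models when running in parallel, the seeds argument of trainControl needs to be set.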
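A sketch of what setting seeds can look like, following the structure described in the trainControl documentation: a list with one integer vector per resample (one seed per candidate model) plus a single integer for the final fit. The names n_resamples, n_tune, seeds, ctrl, fit_a and fit_b are illustrative, and n_tune = 10 is an assumption chosen to match tuneLength = 10 for knn.

```r
library(caret)

n_resamples <- 10  # number of CV folds
n_tune      <- 10  # candidate models per resample (assumed, matches tuneLength)

set.seed(1)
folds <- createFolds(iris$Species, k = n_resamples, returnTrain = TRUE)

# One vector of seeds per resample, plus one seed for the final model
seeds <- vector(mode = "list", length = n_resamples + 1)
for (i in seq_len(n_resamples)) {
  seeds[[i]] <- sample.int(1000, n_tune)
}
seeds[[n_resamples + 1]] <- sample.int(1000, 1)

ctrl <- trainControl(method = "cv", index = folds, seeds = seeds)

# Two fits with no set.seed() in between
fit_a <- train(iris[, 1:4], iris[, 5], method = "knn",
               tuneLength = n_tune, trControl = ctrl)
fit_b <- train(iris[, 1:4], iris[, 5], method = "knn",
               tuneLength = n_tune, trControl = ctrl)

# The resampled performance tables should agree
all.equal(fit_a$results, fit_b$results)
```

Combining index (fixed splits) with seeds (fixed per-fit randomness) covers both sources of variation, so the same setup can be reused when a parallel backend is registered.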
