Question

I am trying to find the best mtry and ntree with a grid search, but I ran into some problems. First, I tried to find them like this:

train_control <- trainControl(method="cv", number=5)
grid <- expand.grid(.mtry=1:7, ntree = seq(100,1000,100)) # my dataset has 7 features
model_rf <- train(train_x, 
                  train_y,
                  method = "rf", 
                  tuneGrid = grid,
                  trControl = train_control)
model_rf$bestTune

However, I get an error:

"The tuning parameter grid should have columns mtry"

So I had to find them in two steps:

# find best mtry
grid <- expand.grid(.mtry=1:7)

model_rf <- train(train_x, 
                  train_y,
                  method = "rf", 
                  tuneGrid = grid,
                  trControl = train_control)
model_rf$bestTune

# find best ntree
ntree <- seq(100,1000,100)
accuracy <- sapply(ntree, function(ntr){
  model_rf <- train(train_x, factor(train_y), 
                    method = "rf", ntree = ntr, 
                    trControl = train_control)
  # share of correct predictions on the held-out test set
  accuracy <- mean(predict(model_rf, test_x) == test_y)
  return(accuracy)
})
plot(ntree, accuracy)

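In the process I ran into some new questions:

[1] I found that the best mtry is not constant. In my case it came out as 2, 4, 6, or 7 across runs. So which "best mtry" is actually the best? Should I run this code 1000 times and take the average?

[2] Generally, the best mtry should be at or near the square root of the number of features. So should I just use sqrt(7)?

[3] Can I get the best mtry and ntree in a single call to train? I have to say this process is very time-consuming.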

Answer 1

I think it is better to put the parameter grid inside the sapply.

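ntree <- seq(100,1000,100)
accuracy <- sapply(ntree, function(ntr){
  # mtry is the only parameter caret tunes for method = "rf"
  grid <- expand.grid(mtry=2:7)
  # ntree is passed through to randomForest(); note the argument is ntree, not ntrees
  model_rf <- train(train_x, factor(train_y), 
                    method = "rf", ntree = ntr, 
                    trControl = train_control,
                    tuneGrid = grid)
  # share of correct predictions on the held-out test set
  accuracy <- mean(predict(model_rf, test_x) == test_y)
  return(accuracy)
})
plot(ntree, accuracy)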

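This way mtry is tuned for each value of ntree. [1] The best combination of mtry and ntree is the one that maximizes accuracy (or minimizes RMSE in the regression case), and that is the model you should choose.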
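To make "choose the combination that maximizes accuracy" concrete, here is a minimal sketch that records the cross-validated accuracy for every (ntree, mtry) pair and picks the best row. It is only an illustration under stated assumptions: the built-in iris data stands in for the asker's train_x and train_y, the ntree values are shortened, and it compares models by resampled accuracy from train() rather than by test-set accuracy as in the code above.

```r
library(caret)         # train(), trainControl()
library(randomForest)  # backend used by method = "rf"

# Assumption: iris stands in for the asker's data (4 features, 3 classes)
train_x <- iris[, 1:4]
train_y <- iris$Species

train_control <- trainControl(method = "cv", number = 5)
mtry_grid     <- expand.grid(mtry = 1:4)  # only mtry belongs in tuneGrid
ntree_values  <- c(100, 300, 500)         # shortened list for illustration

# One train() call per ntree; each call tunes mtry by cross-validation
results <- do.call(rbind, lapply(ntree_values, function(ntr) {
  set.seed(42)  # same seed so every ntree value is scored on the same CV folds
  fit <- train(train_x, train_y,
               method = "rf",
               ntree = ntr,            # forwarded to randomForest()
               tuneGrid = mtry_grid,
               trControl = train_control)
  data.frame(ntree = ntr, fit$results[, c("mtry", "Accuracy")])
}))

# Row with the highest cross-validated accuracy
best <- results[which.max(results$Accuracy), ]
print(best)
```

Resetting the seed before each call is one way to reduce the run-to-run variation in the selected mtry that the question describes, since all candidates are then compared on identical folds.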

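[2] The square root of the number of features is the default mtry value (for classification), but it is not necessarily the best one. That is exactly why you use resampling to find the best value.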
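For reference, the defaults documented for randomForest() can be written out as a small helper. The function name default_mtry is hypothetical and only mirrors the documented rule: floor(sqrt(p)) for classification and max(floor(p/3), 1) for regression, where p is the number of features.

```r
# Hypothetical helper mirroring the documented randomForest() defaults
default_mtry <- function(p, type = c("classification", "regression")) {
  type <- match.arg(type)
  if (type == "classification") {
    floor(sqrt(p))          # classification default
  } else {
    max(floor(p / 3), 1)    # regression default
  }
}

default_mtry(7, "classification")  # 2, since sqrt(7) is about 2.65
default_mtry(7, "regression")      # 2, since 7 / 3 is about 2.33
```

With 7 features the default is therefore 2 in both cases, which is a reasonable starting point for a grid rather than a replacement for tuning.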

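[3] Model tuning across multiple parameters is inherently slow because of the number of model fits involved. You can try running the mtry search inside each iteration over ntree, as in my example code.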
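As for why a single train() call does not search both parameters: caret's built-in "rf" model registers mtry as its only tuning parameter, so a tuneGrid with an extra ntree column is rejected with the "should have columns mtry" error, while ntree is simply forwarded to randomForest() through train()'s ... argument. A quick check, assuming caret is installed:

```r
library(caret)

# List the tuning parameters caret knows about for method = "rf"
rf_params <- modelLookup("rf")
print(rf_params)
```

The parameter column of the printed table should contain a single entry, mtry.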

caret: How to find the best mtry and ntree with a grid search
