Question

If I understand catboost correctly, we need to tune nrounds using CV, just as with xgboost. In the official tutorial, in cell In [8], I see the following code:

params_with_od <- list(iterations = 500,
                       loss_function = 'Logloss',
                       train_dir = 'train_dir',
                       od_type = 'Iter',
                       od_wait = 30)
model_with_od <- catboost.train(train_pool, test_pool, params_with_od)

which gives the best result at iterations = 211.

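My questions are:

Is it correct that this command uses test_pool to choose the best iterations, rather than using cross-validation?
If so, does catboost provide a command to choose the best iterations from CV, or do I need to do it manually?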

Answer 1

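Catboost is doing cross-validation to determine the optimal number of iterations. Both train_pool and test_pool are datasets that include the target variable. Earlier in the tutorial they write: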

train_path = '../R-package/inst/extdata/adult_train.1000'
test_path = '../R-package/inst/extdata/adult_test.1000'

column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
    column_description_vector[i] <- 'factor'

train <- read.table(train_path, head=F, sep="\t", colClasses=column_description_vector)
test <- read.table(test_path, head=F, sep="\t", colClasses=column_description_vector)
target <- c(1)
train_pool <- catboost.from_data_frame(data=train[,-target], target=train[,target])
test_pool <- catboost.from_data_frame(data=test[,-target], target=test[,target])

When you execute catboost.train(train_pool, test_pool, params_with_od), train_pool is used for training and test_pool is used to determine the optimal number of iterations via cross-validation.

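Now you are right to be confused, since later in the tutorial they use test_pool again, together with the fitted model, to make a prediction (model_best is similar to model_with_od, but uses a different overfitting detector, IncToDec):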

prediction_best <- catboost.predict(model_best, test_pool, type = 'Probability')

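This may be bad practice. They might get away with it with their IncToDec overfitting detector (I am not familiar with the mathematics behind it), but for the Iter type overfitting detector you need separate training, validation and test data sets (and, if you want to be on the safe side, do the same for the IncToDec overfitting detector). However, it is only a tutorial showing the functionality, so I wouldn't be too pedantic about which data they have already used for what.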

Here is a link with more detail on the overfitting detector: https://tech.yandex.com/catboost/doc/dg/concepts/overfitting-detector-docpage/
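
The separate training, validation and test sets mentioned above can be sketched in base R. This is only an illustration, not the tutorial's code: the 60/20/20 proportions, the seed, the toy data frame and the helper name split_three_way are all assumptions.

```r
# Hypothetical helper: split row indices 1..n into train / validation / test.
split_three_way <- function(n, p_train = 0.6, p_valid = 0.2, seed = 42) {
  set.seed(seed)                        # reproducible shuffle
  idx <- sample.int(n)                  # random permutation of 1..n
  n_train <- round(p_train * n)
  n_valid <- round(p_valid * n)
  list(
    train = idx[seq_len(n_train)],
    valid = idx[n_train + seq_len(n_valid)],
    test  = idx[-seq_len(n_train + n_valid)]  # everything that is left
  )
}

# Toy data standing in for a real data set (assumption)
df <- data.frame(y = rnorm(100), x = rnorm(100))
parts <- split_three_way(nrow(df))

train_df <- df[parts$train, ]  # used to fit the model
valid_df <- df[parts$valid, ]  # used by the overfitting detector to pick iterations
test_df  <- df[parts$test, ]   # kept untouched for the final evaluation
```

With such a split, the validation part would play the role of test_pool in catboost.train, and the test part would only be used for the final catboost.predict.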

Answer 2

Use cross-validation with caret. Note In [12] of the tutorial.

Answer 3

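Basing the number of iterations on a single test_pool and the best iteration reported by catboost.train() is a very poor decision. By doing so, you are tuning your parameters to one specific test set, and your model will not work well with new data. You are therefore correct in presuming that, as with XGBoost, you need to apply CV to find the optimal number of iterations.
There is indeed a CV function in catboost. What you should do is specify a large number of iterations and use the parameter early_stopping_rounds to stop training after a certain number of rounds without improvement. Unfortunately, unlike LightGBM, catboost does not seem to have an option to automatically apply the optimal number of boosting rounds found by CV to catboost.train(), so it requires a bit of a workaround. Here is an example that should work: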

    library(catboost)
    library(data.table)

    # n_cores was undefined in the original; using all detected cores is an assumption
    n_cores = parallel::detectCores()

    parameter = list(
      thread_count = n_cores,
      loss_function = "RMSE",
      eval_metric = c("RMSE","MAE","R2"),
      iterations = 10^5, # Train up to 10^5 rounds
      early_stopping_rounds = 100 # Stop after 100 rounds of no improvement
    )

    # Apply 6-fold CV
    model = catboost.cv(
      pool = train_pool,
      fold_count = 6,
      params = parameter
    )

    # Transform output to DT
    setDT(model)
    # One row per boosting round, so the row number is the iteration count
    model[, iterations := .I]
    # Order from lowest to highest RMSE
    setorder(model, test.RMSE.mean)
    # Select iterations with lowest RMSE
    parameter$iterations = model[1, iterations]

    # Train model with optimal iterations
    model = catboost.train(
      learn_pool = train_pool,
      test_pool = test_pool,
      params = parameter
    )
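
The selection step can be checked in isolation, without running catboost at all. The sketch below uses a made-up data frame that only mimics the test.RMSE.mean column used above; the values and the helper name best_iteration are illustrative assumptions, not real catboost.cv output.

```r
# Hypothetical helper: row index (= iteration) with the lowest mean test metric.
best_iteration <- function(cv_result, metric_col = "test.RMSE.mean") {
  which.min(cv_result[[metric_col]])
}

# Made-up CV curve: the error falls, bottoms out, then rises again (overfitting)
cv_result <- data.frame(
  test.RMSE.mean = c(1.00, 0.80, 0.65, 0.60, 0.62, 0.70)
)

best_iter <- best_iteration(cv_result)
print(best_iter)
```

Using which.min instead of sorting leaves the table in its original order, so the row number keeps its meaning as the iteration count.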

Answer 4

I think this is a general question for both xgboost and catboost. The choice of nround goes hand in hand with the choice of learning rate. I therefore recommend a high number of rounds (1000+) and a low learning rate. Once you have found the best hyperparameters, retry with a lower learning rate to check that the hyperparameters you chose are stable.
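
Why the number of rounds and the learning rate go together can be seen in a deliberately simplified toy model. It is an assumption made purely for illustration, not how catboost or xgboost actually fit trees: suppose every round removes a fraction lr of the remaining residual, so that a fraction (1 - lr)^k remains after k rounds. The helper name rounds_needed and the tolerance of 0.01 are hypothetical.

```r
# Toy model: each boosting round shrinks the remaining residual by a factor (1 - lr).
# rounds_needed() returns the smallest k such that (1 - lr)^k <= tol.
rounds_needed <- function(lr, tol = 0.01) {
  ceiling(log(tol) / log(1 - lr))
}

# Lower learning rates need many more rounds to reach the same tolerance
for (lr in c(0.3, 0.1, 0.03, 0.01)) {
  cat(sprintf("learning_rate = %.2f -> about %d rounds\n", lr, rounds_needed(lr)))
}
```

Even in this crude model, cutting the learning rate by a factor of ten raises the required number of rounds roughly tenfold, which is why the two are usually tuned together.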

I find @nikitxskv's answer misleading.

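In the R tutorial, In [12] only chooses learning_rate = 0.1, with no alternatives. So it gives no hint about tuning nround.
In fact, In [12] only uses the function expand.grid to find the best hyperparameters. It works on the choices of depth, gamma and so on.
In practice, we do not use this approach to find a suitable nround (it takes too long).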

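Now to answer the two questions.

Is it correct that this command uses test_pool to choose the best iterations, rather than using cross-validation?

Yes, but you can use CV.

If so, does catboost provide a command to choose the best iterations from CV, or do I need to do it manually?

That is up to you. If you have a strong aversion to boosting overfitting, I recommend you try it. There are many packages that address this problem, and I recommend using one of them.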
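
For the manual route, the fold bookkeeping is the only part that does not depend on the modelling library. The following is a minimal base R sketch under stated assumptions: the helper name make_folds, the seed, and the sizes n = 20 and k = 5 are all hypothetical, and the model fitting itself is left as a comment.

```r
# Hypothetical helper: assign each of n rows to one of k folds, as evenly as possible.
make_folds <- function(n, k = 5, seed = 1) {
  set.seed(seed)
  sample(rep(seq_len(k), length.out = n))  # shuffled fold labels 1..k
}

n <- 20
k <- 5
folds <- make_folds(n, k)

for (f in seq_len(k)) {
  valid_idx <- which(folds == f)  # rows held out in this fold
  train_idx <- which(folds != f)  # rows used for fitting
  # Fit on train_idx and record the validation metric per iteration here,
  # then average that metric across folds and take the iteration at its minimum.
  cat(sprintf("fold %d: %d train rows, %d validation rows\n",
              f, length(train_idx), length(valid_idx)))
}
```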

How to choose nrounds with `catboost`?
