I am trying to run xgboost on a problem with very noisy features, and I am interested in stopping the number of rounds based on a custom eval_metric that I define.
From domain knowledge I know that xgboost overfits once the eval_metric (evaluated on the training data) goes beyond a certain value, and I would like to take the model fitted at that particular round rather than keep training.
What would be the best way to achieve this?
It is somewhat in line with the early stopping criterion, but not exactly.
Alternatively, is it possible to get the model from an intermediate round?
Here is an example to explain the problem better (using the toy example that ships with the xgboost help docs, and the default eval_metric):
library(xgboost)
data(agaricus.train, package='xgboost')
train <- agaricus.train
bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nthread = 2, nround = 5, objective = "binary:logistic")
Here is the output:
[0] train-error:0.046522
[1] train-error:0.022263
[2] train-error:0.007063
[3] train-error:0.015200
[4] train-error:0.007063
Now let's say that from domain knowledge I know that once the train error goes below 0.015 (the third round in this case), any further rounds only lead to overfitting. How do I stop the training process after the third round and get hold of the trained model, so that I can use it for prediction on a different dataset?
I need to run the training process on many different datasets, and I have no idea how many rounds it might take to bring the error below the fixed number, so I cannot set the nrounds argument to a predetermined value. My only intuition is that, once the training error goes below that number, I need to stop training further rounds.
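To make the requirement concrete, the behaviour I am after would look roughly like the sketch below. This is only a sketch: it assumes the xgb_model argument of xgboost() (which the R package documents as continuing training from a previously built model) can be used to grow the booster one round at a time, and it reuses the 0.015 cutoff from the example above.

library(xgboost)
data(agaricus.train, package = 'xgboost')
train <- agaricus.train

err_threshold <- 0.015  # domain-knowledge cutoff from the example above
max_rounds <- 50        # safety cap so the loop always terminates
bst <- NULL
for (i in seq_len(max_rounds)) {
  # grow the booster by one more round; xgb_model continues from the
  # previous fit (NULL on the first pass starts a fresh model)
  bst <- xgboost(data = train$data, label = train$label,
                 max.depth = 2, eta = 1, nthread = 2, nrounds = 1,
                 objective = "binary:logistic", verbose = 0,
                 xgb_model = bst)
  # evaluate train error by hand: the share of misclassified examples
  pred <- predict(bst, train$data)
  train_err <- mean(as.numeric(pred > 0.5) != train$label)
  cat(sprintf("[%d] train-error: %f\n", i - 1, train_err))
  if (train_err < err_threshold) break  # stop as soon as the cutoff is met
}
# bst now holds the model fitted up to (and including) the stopping round

Is something like this the right way to do it, or is there a built-in mechanism for it?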
Answer 0 (score: 0)
# In the absence of any code you have tried or any data you are using,
# try something like this:
require(xgboost)
library(Metrics)  # for rmse() to calculate errors

# Assume you have a training set db.train, some feature indices of
# interest, and a test set db.test
predz <- c(2, 4, 6, 8, 10, 12)
predictors <- names(db.train[, predz])
# ... and some response you are interested in
outcomeName <- "myLabel"

# You may also like to test some other parameters such as eta, gamma,
# colsample_bytree and min_child_weight. Here we look at depths from 1
# to 4 and rounds from 1 to 100, but set your own values.
smallestError <- 100  # set to some sensible value for your eval metric

for (depth in seq(1, 4, 1)) {
  for (rounds in seq(1, 100, 1)) {
    # train
    bst <- xgboost(data = as.matrix(db.train[, predictors]),
                   label = db.train[, outcomeName],
                   max.depth = depth, nround = rounds,
                   eval_metric = "logloss",
                   objective = "binary:logistic", verbose = TRUE)
    gc()
    # predict probabilities, so they are on the same scale as the
    # 0/1 labels (raw margins would not be)
    predictions <- as.numeric(predict(bst, as.matrix(db.test[, predictors])))
    err <- rmse(as.numeric(db.test[, outcomeName]), predictions)
    # keep track of the best (depth, rounds) combination seen so far
    if (err < smallestError) {
      smallestError <- err
      print(paste(depth, rounds, err))
    }
  }
}
# You could adapt this code to your particular evaluation metric and
# print whatever suits your situation. Similarly, you could introduce
# a break in the loop once some specified number of rounds satisfies
# the condition you seek to achieve.
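On your other question of getting the model from an intermediate round: predict() for a fitted booster accepts an ntreelimit argument, so another option is to train the full model once and restrict prediction to the first k trees. A sketch on the toy data from your question (with binary:logistic one tree is grown per round, so ntreelimit = 3 corresponds to the first three rounds):

library(xgboost)
data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
# fit the full 5-round model exactly as in the question
bst_full <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
                    max.depth = 2, eta = 1, nthread = 2, nrounds = 5,
                    objective = "binary:logistic")
# score the test set using only the first 3 trees, i.e. the round-3 model
preds_round3 <- predict(bst_full, agaricus.test$data, ntreelimit = 3)

That way you pay for training only once and can choose the round after the fact.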