I have the following XGBoost CV model.
xgboostModelCV <- xgb.cv(data = dtrain,
nrounds = 20,
nfold = 3,
metrics = "auc",
verbose = TRUE,
"eval_metric" = "auc",
"objective" = "binary:logistic",
"max.depth" = 6,
"eta" = 0.01,
"subsample" = 0.5,
"colsample_bytree" = 1,
print_every_n = 1,
"min_child_weight" = 1,
booster = "gbtree",
early_stopping_rounds = 10,
watchlist = watchlist,
seed = 1234)
My question is about the model's output and nfold. I set nfold to 3.
The evaluation log output looks like this:
iter train_auc_mean train_auc_std test_auc_mean test_auc_std
1 1 0.8852290 0.0023585703 0.8598630 0.005515424
2 2 0.9015413 0.0018569007 0.8792137 0.003765109
3 3 0.9081027 0.0014307577 0.8859040 0.005053600
4 4 0.9108463 0.0011838160 0.8883130 0.004324113
5 5 0.9130350 0.0008863908 0.8904100 0.004173123
6 6 0.9143187 0.0009514359 0.8910723 0.004372844
7 7 0.9151723 0.0010543653 0.8917300 0.003905284
8 8 0.9162787 0.0010344935 0.8929013 0.003582747
9 9 0.9173673 0.0010539116 0.8935753 0.003431949
10 10 0.9178743 0.0011498505 0.8942567 0.002955511
11 11 0.9182133 0.0010825702 0.8944377 0.003051411
12 12 0.9185767 0.0011846632 0.8946267 0.003026969
13 13 0.9186653 0.0013352629 0.8948340 0.002526793
14 14 0.9190500 0.0012537195 0.8954053 0.002636388
15 15 0.9192453 0.0010967155 0.8954127 0.002841402
16 16 0.9194953 0.0009818501 0.8956447 0.002783787
17 17 0.9198503 0.0009541517 0.8956400 0.002590862
18 18 0.9200363 0.0009890185 0.8957223 0.002580398
19 19 0.9201687 0.0010323405 0.8958790 0.002508695
20 20 0.9204030 0.0009725742 0.8960677 0.002581329
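I set nrounds = 20 and the cross-validation nfold = 3, so shouldn't I get 60 results instead of 20?
Or is the output above, as the column names suggest, the mean AUC score for each round?
So for the training set at round 1, is the train_auc_mean of 0.8852290 the mean over the 3 cross-validation folds?
And if I plot these AUC scores, am I plotting the mean AUC of the 3-fold cross-validation?
I just want to make sure everything is clear.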
Answer 0 (score: 4)
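You are correct that the output is the average of the fold AUCs: each row of the evaluation log is one boosting round, with the mean and standard deviation taken over the 3 folds. However, if you wish to extract the individual fold AUC for the best/last iteration, you can proceed as follows.
An example using the Sonar data set from mlbench: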
library(xgboost)
library(tidyverse)
library(mlbench)
library(pROC)  # provides roc() and auc() used below
data(Sonar)
xgb.train.data <- xgb.DMatrix(as.matrix(Sonar[,1:60]), label = as.numeric(Sonar$Class)-1)
param <- list(objective = "binary:logistic")
In xgb.cv, set prediction = TRUE so the out-of-fold predictions are kept:
model.cv <- xgb.cv(params = param,
data = xgb.train.data,
nrounds = 50,
early_stopping_rounds = 10,
nfold = 3,
prediction = TRUE,
eval_metric = "auc")
Now go over the folds and connect the predictions with the true labels and the corresponding indexes:
z <- lapply(model.cv$folds, function(x){
pred <- model.cv$pred[x]
true <- (as.numeric(Sonar$Class)-1)[x]
index <- x
out <- data.frame(pred, true, index)
out
})
Give the folds names and compute the AUC for each fold:
names(z) <- paste("folds", 1:3, sep = "_")
z %>%
bind_rows(.id = "id") %>%
group_by(id) %>%
summarise(auroc = roc(true, pred) %>%
auc())
#output
# A tibble: 3 x 2
id auroc
<chr> <dbl>
1 folds_1 0.944
2 folds_2 0.900
3 folds_3 0.899
The mean of these values is the same as the mean AUC at the best iteration:
z %>%
bind_rows(.id = "id") %>%
group_by(id) %>%
summarise(auroc = roc(true, pred) %>%
auc()) %>%
pull(auroc) %>%
mean
#output
[1] 0.9143798
model.cv$evaluation_log[model.cv$best_iteration,]
#output
iter train_auc_mean train_auc_std test_auc_mean test_auc_std
1: 48 1 0 0.91438 0.02092817
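To tie this back to the question: the evaluation log has one row per boosting round, not one per fold and round, because the per-fold values are collapsed into a mean and a standard deviation. The sketch below redoes that arithmetic in base R from the three rounded fold AUCs printed above. It assumes the reported test_auc_std is a population standard deviation (dividing by the number of folds); that is an inference from these numbers, not something stated in the output itself.

```r
# Per-fold test AUC at the best iteration, copied (rounded) from the output above.
fold_auc <- c(folds_1 = 0.944, folds_2 = 0.900, folds_3 = 0.899)

# Mean over the folds: this is what test_auc_mean reports for that round.
fold_mean <- mean(fold_auc)

# Population standard deviation (divide by the number of folds).
# Assumption: this is the convention behind test_auc_std.
fold_sd_pop <- sqrt(mean((fold_auc - fold_mean)^2))

# For comparison, sd() divides by n - 1 and gives a larger value.
fold_sd_sample <- sd(fold_auc)

print(round(c(mean = fold_mean,
              sd_pop = fold_sd_pop,
              sd_sample = fold_sd_sample), 4))
```

With these rounded inputs, the population version comes out close to the reported 0.0209, while sd() gives a noticeably larger value; the small remaining gap comes from the fold AUCs being rounded to three digits.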
You can of course do much more, such as plotting the ROC curve for each fold.
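As one possible way to do that, here is a minimal sketch in base R that draws one ROC curve per fold. The fold data is simulated: make_fold(), roc_points() and auc_rank() are hypothetical helpers written for this illustration, not part of xgboost or pROC. With real results you would pass in the per-fold data frames from z (columns pred and true) instead of the simulated list.

```r
# Simulated stand-in for the per-fold data frames built above:
# `pred` is a predicted probability, `true` is the 0/1 label.
set.seed(1)
make_fold <- function(n) {
  true <- rbinom(n, 1, 0.5)
  pred <- plogis(rnorm(n, mean = 2 * true - 1))
  data.frame(pred, true)
}
folds <- list(folds_1 = make_fold(70),
              folds_2 = make_fold(70),
              folds_3 = make_fold(70))

# ROC points: sort by decreasing score and accumulate the
# false positive rate and true positive rate.
roc_points <- function(pred, true) {
  true <- true[order(pred, decreasing = TRUE)]
  data.frame(fpr = c(0, cumsum(1 - true) / sum(1 - true)),
             tpr = c(0, cumsum(true) / sum(true)))
}

# AUC via the rank (Mann-Whitney) formula.
auc_rank <- function(pred, true) {
  r <- rank(pred)
  n_pos <- sum(true == 1)
  n_neg <- sum(true == 0)
  (sum(r[true == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

fold_auc <- sapply(folds, function(f) auc_rank(f$pred, f$true))

# Draw the chance diagonal, then one curve per fold.
plot(c(0, 1), c(0, 1), type = "l", lty = 2,
     xlab = "False positive rate", ylab = "True positive rate")
for (i in seq_along(folds)) {
  pts <- roc_points(folds[[i]]$pred, folds[[i]]$true)
  lines(pts$fpr, pts$tpr, col = i + 1)
}
legend("bottomright",
       legend = sprintf("%s (AUC %.3f)", names(folds), fold_auc),
       col = seq_along(folds) + 1, lty = 1)
```

The rank formula is used here only to keep the sketch free of extra packages; pROC's roc() and auc(), as in the answer's code, do the same job.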