I have the following R code, which runs a simple xgboost model on a set of training and test data with the aim of predicting a binary outcome.
We start with:
1) Reading in the relevant libraries.
library(xgboost)
library(readr)
library(caret)
2) Cleaning the training and test data
train.raw = read.csv("train_data", header = TRUE, sep = ",")
drop = c('column')
train.df = train.raw[, !(names(train.raw) %in% drop)]
train.df[,'outcome'] = as.factor(train.df[,'outcome'])
test.raw = read.csv("test_data", header = TRUE, sep = ",")
drop = c('column')
test.df = test.raw[, !(names(test.raw) %in% drop)]
test.df[,'outcome'] = as.factor(test.df[,'outcome'])
train.c1 = subset(train.df , outcome == 1)
train.c0 = subset(train.df , outcome == 0)
3) Putting the data into the DMatrix format XGBoost expects
train_xgb = xgb.DMatrix(data.matrix(train.df[,1:124]), label = train.raw[, "outcome"])
test_xgb = xgb.DMatrix(data.matrix(test.df[,1:124]))
4) Running the model
model_xgb = xgboost(data = train_xgb, nrounds = 8, max_depth = 5, eta = .1, eval_metric = "logloss", objective = "binary:logistic", verbose = 5)
5) Making predictions
pred_xgb <- predict(model_xgb, newdata = test_xgb)
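Note that with objective = "binary:logistic", predict() returns probabilities rather than class labels, so a cutoff is needed. A minimal sketch, assuming a 0.5 threshold (not part of the original code):
pred_class = as.numeric(pred_xgb > 0.5)  # probabilities -> 0/1 labels
table(pred_class, test.df$outcome)       # confusion counts on the test set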
My question is: how do I adapt this process so that I pull in and tune on just a single 'training' dataset, and get predictions on the held-out folds from cross-validation?
Answer 0 (score: 1)
To specify k-fold CV in the xgboost call you need to use xgb.cv with the argument nfold = some integer, and to save the predictions for each resample you pass the prediction = TRUE argument. For example (dtrain here is an xgb.DMatrix, such as the train_xgb built above):
xgboostModelCV <- xgb.cv(data = dtrain,
nrounds = 1688,
nfold = 5,
objective = "binary:logistic",
eval_metric= "auc",
metrics = "auc",
verbose = 1,
print_every_n = 50,
stratified = T,
scale_pos_weight = 2,
max_depth = 6,
eta = 0.01,
gamma=0,
colsample_bytree = 1 ,
min_child_weight = 1,
subsample= 0.5 ,
prediction = T)
xgboostModelCV$pred #contains predictions in the same order as in dtrain.
xgboostModelCV$folds #contains k-fold samples
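That $pred vector is exactly the held-out prediction asked about in the question: each row of dtrain is scored by the model whose training folds excluded it. A minimal evaluation sketch, assuming dtrain carries a 0/1 label as in the question:
oof_pred = xgboostModelCV$pred       # out-of-fold probabilities
labels = getinfo(dtrain, "label")    # true 0/1 labels stored in dtrain
mean(as.numeric(oof_pred > 0.5) == labels)  # held-out accuracy at an assumed 0.5 cutoff
# Held-out AUC, if the pROC package is installed:
# library(pROC); auc(labels, oof_pred)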
Here is a function that works well for picking hyperparams (named xgbGridSearch here so it can be called):
xgbGridSearch <- function(train, seed){
require(xgboost)
ntrees=2000
searchGridSubCol <- expand.grid(subsample = c(0.5, 0.75, 1),
colsample_bytree = c(0.6, 0.8, 1),
gamma=c(0, 1, 2),
eta=c(0.01, 0.03),
max_depth=c(4,6,8,10))
aucErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
#Extract Parameters to test
currentSubsampleRate <- parameterList[["subsample"]]
currentColsampleRate <- parameterList[["colsample_bytree"]]
currentGamma <- parameterList[["gamma"]]
currentEta =parameterList[["eta"]]
currentMaxDepth =parameterList[["max_depth"]]
set.seed(seed)
xgboostModelCV <- xgb.cv(data = train,
nrounds = ntrees,
nfold = 5,
objective = "binary:logistic",
eval_metric= "auc",
metrics = "auc",
verbose = 1,
print_every_n = 50,
early_stopping_rounds = 200,
stratified = T,
scale_pos_weight=sum(all_data_nobad[index_no_bad,1]==0)/sum(all_data_nobad[index_no_bad,1]==1), # negatives/positives ratio; all_data_nobad/index_no_bad come from my own data, substitute yours
max_depth = currentMaxDepth,
eta = currentEta,
gamma=currentGamma,
colsample_bytree = currentColsampleRate,
min_child_weight = 1,
subsample= currentSubsampleRate)
xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
#Save the test AUC (mean and std) at the best iteration found by early stopping
auc=xvalidationScores[xvalidationScores$iter==xgboostModelCV$best_iteration,c(1,4,5)]
auc=cbind(auc, currentSubsampleRate, currentColsampleRate, currentGamma, currentEta, currentMaxDepth)
names(auc)=c("iter", "test.auc.mean", "test.auc.std", "subsample", "colsample", "gamma", "eta", "max.depth")
print(auc)
return(auc)
})
return(aucErrorsHyperparameters)
}
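A usage sketch (xgbGridSearch is the name given to the function above; train_xgb is the DMatrix built in the question):
results = xgbGridSearch(train_xgb, seed = 42)
# apply() may simplify the one-row data frames into a matrix with one
# column per grid point; coerce either form into a single data frame.
results.df = if (is.list(results)) do.call(rbind, results) else as.data.frame(t(results))
results.df[order(-results.df$test.auc.mean), ]  # best mean held-out AUC first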
You can change the grid values and the parameters in the grid, as well as the loss/evaluation metric. It is similar to the grid search caret provides, but caret does not let you put alpha, lambda, colsample_bylevel, num_parallel_tree, etc. into its hyperparameter grid without defining a custom function, which I find tedious. caret does have the advantage of automatic preprocessing, automatic up/down-sampling within CV, and so on.
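For comparison, a minimal caret sketch of the same idea (this assumes the train.df from the question; the outcome factor needs valid R level names for classProbs = TRUE, and savePredictions = "final" keeps the held-out predictions for the chosen tune):
library(caret)
train.df$outcome = factor(train.df$outcome, levels = c(0, 1), labels = c("no", "yes"))
ctrl = trainControl(method = "cv", number = 5,
                    classProbs = TRUE,
                    summaryFunction = twoClassSummary,
                    savePredictions = "final")
grid = expand.grid(nrounds = c(100, 200),
                   max_depth = c(4, 6),
                   eta = c(0.01, 0.03),
                   gamma = 0,
                   colsample_bytree = 1,
                   min_child_weight = 1,
                   subsample = 0.5)
caretModel = train(outcome ~ ., data = train.df,
                   method = "xgbTree",
                   trControl = ctrl,
                   tuneGrid = grid,
                   metric = "ROC")
head(caretModel$pred)  # out-of-fold predictions for the best tune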
Setting the seed outside the xgb.cv call will select the same folds for CV, but not the same trees in each round, so you will end up with a different model each run. Even if you set the seed inside the xgb.cv call there is no guarantee you will get the same model, though the chances are much higher (it depends on threads, the type of model, etc.). Personally I don't mind the uncertainty, and I have found it has little effect on the results.
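In code, that looks like the sketch below; nthread = 1 is my addition, since single-threaded training removes one source of nondeterminism (at a speed cost):
set.seed(42)  # same seed -> same fold assignment
cv1 = xgb.cv(data = train_xgb, nrounds = 8, nfold = 5,
             objective = "binary:logistic", nthread = 1,
             prediction = TRUE, verbose = 0)
set.seed(42)
cv2 = xgb.cv(data = train_xgb, nrounds = 8, nfold = 5,
             objective = "binary:logistic", nthread = 1,
             prediction = TRUE, verbose = 0)
all.equal(cv1$pred, cv2$pred)  # identical models are likely, not guaranteed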
Answer 1 (score: 0)
You can use xgb.cv and set prediction = TRUE.
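Applied to the train_xgb matrix from the question (nrounds = 8 just mirrors the question's call and would normally be tuned):
cv = xgb.cv(data = train_xgb, nrounds = 8, nfold = 5,
            objective = "binary:logistic", eval_metric = "logloss",
            prediction = TRUE)
head(cv$pred)  # held-out predictions, one per row of train_xgb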