Combining train + test data and running cross-validation in R

Date: 2017-09-19 21:42:49

Tags: r machine-learning

I have the following R code, which runs a simple xgboost model on a set of training and test data with the goal of predicting a binary outcome.

We start with:

1) Reading in the relevant libraries.

library(xgboost)
library(readr)
library(caret)

2) Cleaning the training and test data

# read the raw data and drop an unneeded column
train.raw = read.csv("train_data", header = TRUE, sep = ",")
drop = c('column')
train.df = train.raw[, !(names(train.raw) %in% drop)]
train.df[,'outcome'] = as.factor(train.df[,'outcome'])

test.raw = read.csv("test_data", header = TRUE, sep = ",")
drop = c('column')
test.df = test.raw[, !(names(test.raw) %in% drop)]
test.df[,'outcome'] = as.factor(test.df[,'outcome'])

# split the training data by class
train.c1 = subset(train.df, outcome == 1)
train.c0 = subset(train.df, outcome == 0)

3) Converting the correctly formatted data into XGBoost's DMatrix format.

# xgboost needs a numeric label, so it is taken from train.raw rather than from the factor column
train_xgb = xgb.DMatrix(data.matrix(train.df[,1:124]), label = train.raw[, "outcome"])
test_xgb = xgb.DMatrix(data.matrix(test.df[,1:124]))

4) Running the model

model_xgb = xgboost(data = train_xgb,
                    nrounds = 8,
                    max_depth = 5,
                    eta = .1,
                    eval_metric = "logloss",
                    objective = "binary:logistic",
                    verbose = 5)

5) Making predictions

pred_xgb <- predict(model_xgb, newdata = test_xgb)

My question is: how do I adjust this process so that I only pull in and tune a single 'training' data set, and get predictions on the hold-out sets of the cross-validation folds?

2 answers:

Answer 0 (score: 1)

To run k-fold CV in the xgboost call, you need to call xgb.cv with the nfold = some integer argument, and to save the predictions from each resample, use the prediction = TRUE argument. For example:

# dtrain is an xgb.DMatrix built from the training data
xgboostModelCV <- xgb.cv(data = dtrain,
                         nrounds = 1688,
                         nfold = 5,
                         objective = "binary:logistic",
                         eval_metric = "auc",
                         metrics = "auc",
                         verbose = 1,
                         print_every_n = 50,
                         stratified = TRUE,
                         scale_pos_weight = 2,
                         max_depth = 6,
                         eta = 0.01,
                         gamma = 0,
                         colsample_bytree = 1,
                         min_child_weight = 1,
                         subsample = 0.5,
                         prediction = TRUE)

xgboostModelCV$pred #contains predictions in the same order as in dtrain.
xgboostModelCV$folds #contains k-fold samples
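
As a quick sanity check, the out-of-fold predictions in $pred can be scored directly against the training labels. A minimal sketch, assuming dtrain is the xgb.DMatrix used above and the pROC package is installed:

library(pROC)

# labels come back in the same row order as dtrain
labels <- getinfo(dtrain, "label")

# AUC of the held-out (out-of-fold) predictions
auc(labels, xgboostModelCV$pred)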

Here is a function that works well for choosing hyperparameters:

tune_xgb <- function(train, seed) {
  require(xgboost)
  ntrees = 2000
  searchGridSubCol <- expand.grid(subsample = c(0.5, 0.75, 1), 
                                  colsample_bytree = c(0.6, 0.8, 1),
                                  gamma=c(0, 1, 2),
                                  eta=c(0.01, 0.03),
                                  max_depth=c(4,6,8,10))
  aucErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){

    #Extract Parameters to test
    currentSubsampleRate <- parameterList[["subsample"]]
    currentColsampleRate <- parameterList[["colsample_bytree"]]
    currentGamma <- parameterList[["gamma"]]
    currentEta =parameterList[["eta"]]
    currentMaxDepth =parameterList[["max_depth"]]
    set.seed(seed)

    xgboostModelCV <- xgb.cv(data = train, 
                             nrounds = ntrees,
                             nfold = 5,
                             objective = "binary:logistic",
                             eval_metric= "auc",
                             metrics = "auc",
                             verbose = 1,
                             print_every_n = 50,
                             early_stopping_rounds = 200,
                             stratified = T,
                             # negatives/positives ratio, computed here from the answerer's
                             # own data objects; replace with the ratio from your own labels
                             scale_pos_weight = sum(all_data_nobad[index_no_bad, 1] == 0) /
                                                sum(all_data_nobad[index_no_bad, 1] == 1),
                             max_depth = currentMaxDepth, 
                             eta = currentEta, 
                             gamma=currentGamma,
                             colsample_bytree = currentColsampleRate,
                             min_child_weight = 1,
                             subsample=  currentSubsampleRate) 


    xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
    # keep the test AUC (mean and std) at the best early-stopping iteration
    auc = xvalidationScores[xvalidationScores$iter == xgboostModelCV$best_iteration, c(1, 4, 5)]
    auc = cbind(auc, currentSubsampleRate, currentColsampleRate, currentGamma, currentEta, currentMaxDepth)
    names(auc) = c("iter", "test.auc.mean", "test.auc.std", "subsample", "colsample", "gamma", "eta", "max.depth")
    print(auc)
    return(auc)
  })
  return(aucErrorsHyperparameters)
}
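
A sketch of how this function might be called; the name tune_xgb is the one assigned above, and train_xgb is the xgb.DMatrix from the question (note that the scale_pos_weight line inside still references the answerer's own objects and must be adapted first). apply returns one data frame per grid row, so the results are stacked and sorted by mean test AUC:

results <- tune_xgb(train_xgb, seed = 123)

# stack the per-combination rows and show the best setting
results_df <- do.call(rbind, results)
results_df[order(-results_df$test.auc.mean), ][1, ]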

You can change the grid values and the parameters in the grid, as well as the loss/evaluation metric. It is similar to the grid search that caret provides, but caret cannot include alpha, lambda, colsample_bylevel, num_parallel_tree, ... in its tuning grid; searching over those hyperparameters requires writing a custom function, which I find tedious. Caret has the advantage of automatic preprocessing, automatic up/down-sampling within CV, and so on.
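
For comparison, a minimal caret sketch (the grid values here are illustrative): the tuneGrid below is limited to the hyperparameters caret's xgbTree method exposes, which is exactly the limitation described above:

library(caret)

# with classProbs = TRUE the outcome levels must be valid R names
train.df$outcome <- factor(train.df$outcome,
                           levels = c(0, 1), labels = c("no", "yes"))

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, savePredictions = "final",
                     summaryFunction = twoClassSummary)

# only the hyperparameters exposed by method = "xgbTree"
grid <- expand.grid(nrounds = c(100, 200),
                    max_depth = c(4, 6),
                    eta = c(0.01, 0.1),
                    gamma = 0,
                    colsample_bytree = 1,
                    min_child_weight = 1,
                    subsample = 0.75)

model_caret <- train(outcome ~ ., data = train.df,
                     method = "xgbTree", trControl = ctrl,
                     tuneGrid = grid, metric = "ROC")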

Setting the seed outside the xgb.cv call will select the same folds for CV, but not the same trees in each round, so you will end up with different models. Even if you set the seed inside the xgb.cv function call, there is no guarantee you will end up with the same model, though the chances are higher (it depends on threads, the type of model, ...). I like the uncertainty, and I find it has little effect on the results.
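
If you want the folds themselves to be identical across runs, one option (a workflow assumption, not something from the answer above) is to build the fold indices once and pass them via xgb.cv's folds argument, which overrides nfold:

set.seed(123)

# caret::createFolds returns a list of hold-out index vectors
fold_list <- caret::createFolds(getinfo(dtrain, "label"), k = 5)

cv_fixed <- xgb.cv(data = dtrain,
                   nrounds = 100,
                   folds = fold_list,
                   objective = "binary:logistic",
                   eval_metric = "auc",
                   prediction = TRUE)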

Answer 1 (score: 0)

You can use xgb.cv and set prediction = TRUE.
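
A minimal sketch of that approach, applied to the train_xgb matrix built in the question (parameter values are just the ones used there):

cv <- xgb.cv(data = train_xgb,
             nrounds = 8,
             nfold = 5,
             max_depth = 5,
             eta = .1,
             objective = "binary:logistic",
             eval_metric = "logloss",
             prediction = TRUE)

# out-of-fold predictions, one per training row
head(cv$pred)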