I am using xgboost to build a model. The dataset has only 200 rows and 10,000 columns.
I tried using chi-squared to select 100 columns, but my confusion matrix looks like this:
       1     0
1    190     0
0     10     0
I have tried using all 10,000 attributes, 100 randomly selected attributes, and the 100 attributes selected by chi-squared, but I never get any predictions for class 0. Is it because of the dataset, or because of the way I am using xgboost?
My factor(pred.cv) always shows only 1 level, while factor(y + 1) has levels 1 and 2.
param <- list("objective" = "binary:logistic",
"eval_metric" = "error",
"nthread" = 2,
"max_depth" = 5,
"eta" = 0.3,
"gamma" = 0,
"subsample" = 0.8,
"colsample_bytree" = 0.8,
"min_child_weight" = 1,
"max_delta_step"= 5,
"learning_rate" =0.1,
"n_estimators" = 1000,
"seed"=27,
"scale_pos_weight" = 1
)
nfold=3
nrounds=200
pred.cv = matrix(bst.cv$pred, nrow=length(bst.cv$pred)/1, ncol=1)
pred.cv = max.col(pred.cv, "last")
factor(y+1) # this is the target in train, level 1 and 2
factor(pred.cv) # this is the issue, it is always only 1 level
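(The post does not show how bst.cv was created; presumably it came from an xgb.cv call with prediction = TRUE, roughly like the sketch below. This is an assumed reconstruction, not part of the original code, and dtrain is an assumed xgb.DMatrix built from the selected columns and the 0/1 labels y.)

library(xgboost)
# Assumed call that would produce bst.cv with out-of-fold predictions in bst.cv$pred
bst.cv <- xgb.cv(params = param, data = dtrain,
                 nfold = nfold, nrounds = nrounds,
                 prediction = TRUE)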
Answer 0 (score: 0)
I have found caret to be slow, and it cannot tune all of an xgboost model's parameters without building a custom model, which is far more complicated than doing the evaluation with a custom function.
However, if you are doing some up/down-sampling or SMOTE/ROSE, caret is the way to go, because it incorporates them correctly during the model evaluation phase (i.e. inside resampling). See: https://topepo.github.io/caret/subsampling-for-class-imbalances.html
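For illustration, a minimal caret sketch of that subsampling setup (my own example, not from the linked page; it assumes a feature matrix X, a two-level factor outcome y_factor, and that the package backing "smote" is installed):

library(caret)
# trainControl's sampling argument applies up/down/SMOTE/ROSE inside each resample,
# so the resampled performance estimates stay honest.
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     sampling = "smote")            # or "up", "down", "rose"
fit <- train(x = X, y = y_factor,                   # X, y_factor are assumed names
             method = "xgbTree",
             metric = "ROC",
             trControl = ctrl)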
That said, I have found these techniques to have very little effect on the results, and often they make things worse, at least in the models I have trained.
scale_pos_weight gives a higher weight to one of the classes; if the minority class is at about 10% abundance, playing with scale_pos_weight in the 5 - 10 range should be beneficial.
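A common heuristic (my own sketch, assuming y is the 0/1 label vector and class 1 is the positive class) is to set it to the ratio of negative to positive cases:

# sum of negatives over sum of positives; around 5-10 when the positive class is ~10-20% of the data
param$scale_pos_weight <- sum(y == 0) / sum(y == 1)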
Tuning the regularization parameters can be very useful for xgboost: here you have alpha, lambda and gamma - I have found values of 0 - 3 to work well. Other useful parameters that add direct regularization (by adding randomness) are subsample, colsample_bytree and colsample_bylevel. I have found that playing with colsample_bylevel can also have a positive effect on the model. You are already using subsample and colsample_bytree.
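As a rough sketch (illustrative starting values only, not tuned results), those knobs would look like this in a parameter list:

param_reg <- list(objective = "binary:logistic",
                  eta = 0.1,
                  alpha = 1,                # L1 regularization
                  lambda = 1,               # L2 regularization
                  gamma = 1,                # minimum loss reduction required to split
                  subsample = 0.8,          # row subsampling per tree
                  colsample_bytree = 0.8,   # column subsampling per tree
                  colsample_bylevel = 0.8)  # column subsampling per tree level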
I would also test a smaller eta with more trees to see whether the model benefits. In that case, early_stopping_rounds can speed up the process.
A different eval_metric may also be more beneficial than accuracy. Try logloss or auc, or even map and ndcg.
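Putting the last two points together, a sketch (my own, assuming dtrain is an xgb.DMatrix) with a small eta, many rounds, early stopping and auc instead of error could look like this:

cv <- xgb.cv(params = list(objective = "binary:logistic",
                           eval_metric = "auc",   # or "logloss"
                           eta = 0.01,
                           max_depth = 5),
             data = dtrain,
             nrounds = 5000,
             nfold = 5,
             early_stopping_rounds = 50,
             verbose = 0)
cv$best_iteration   # number of rounds actually needed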
Here is a function for a hyperparameter grid search. It uses auc as the evaluation metric, but that can easily be changed.
xgb.par.opt = function(train, seed){
  require(xgboost)
  ntrees = 2000
  searchGridSubCol <- expand.grid(subsample = c(0.5, 0.75, 1),
                                  colsample_bytree = c(0.6, 0.8, 1),
                                  gamma = c(0, 1, 2),
                                  eta = c(0.01, 0.03),
                                  max_depth = c(4, 6, 8, 10))
  aucErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
    # Extract the parameters to test
    currentSubsampleRate <- parameterList[["subsample"]]
    currentColsampleRate <- parameterList[["colsample_bytree"]]
    currentGamma <- parameterList[["gamma"]]
    currentEta <- parameterList[["eta"]]
    currentMaxDepth <- parameterList[["max_depth"]]
    # Class-imbalance weight computed from the labels stored in the DMatrix
    labels <- getinfo(train, "label")
    set.seed(seed)
    xgboostModelCV <- xgb.cv(data = train,
                             nrounds = ntrees,
                             nfold = 5,
                             objective = "binary:logistic",
                             eval_metric = "auc",
                             metrics = "auc",
                             verbose = 1,
                             print_every_n = 50,
                             early_stopping_rounds = 200,
                             stratified = TRUE,
                             scale_pos_weight = sum(labels == 0) / sum(labels == 1),
                             max_depth = currentMaxDepth,
                             eta = currentEta,
                             gamma = currentGamma,
                             colsample_bytree = currentColsampleRate,
                             min_child_weight = 1,
                             subsample = currentSubsampleRate,
                             seed = seed)
    # Keep the test auc (mean and sd) at the best iteration, together with the tested settings
    xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
    auc <- xvalidationScores[xvalidationScores$iter == xgboostModelCV$best_iteration, c(1, 4, 5)]
    auc <- cbind(auc, currentSubsampleRate, currentColsampleRate, currentGamma, currentEta, currentMaxDepth)
    names(auc) <- c("iter", "test.auc.mean", "test.auc.std", "subsample", "colsample", "gamma", "eta", "max.depth")
    print(auc)
    return(auc)
  })
  return(aucErrorsHyperparameters)
}
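A hedged usage sketch (the object names are my own assumptions): build an xgb.DMatrix from the training matrix and labels and pass it in.

dtrain  <- xgb.DMatrix(data = as.matrix(X_train), label = y)  # X_train, y are assumed names
results <- xgb.par.opt(dtrain, seed = 27)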
Other parameters can be added to the expand.grid call.
I usually tune the hyperparameters on one CV repetition and evaluate them on additional repetitions with other seeds, or on a validation set (but a validation set should be used with care to avoid overfitting).
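For instance (a sketch of that idea, seed values arbitrary), the search can be repeated with a few seeds and the selected settings compared:

res_by_seed <- lapply(c(27, 101, 2027), function(s) xgb.par.opt(dtrain, seed = s))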