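For a binary classification problem, XGBoost gives me 100% prediction accuracy. This seems too good to be true. How can I fix it?
I am using a normalized dataset (min-max or z-score) that I have already split into training and validation sets, and I am using the model fitted on the training set to predict the validation set. The data in the two subsets are very similar, but there is nothing I can do about that. I have also avoided look-ahead bias. What else could cause 100% accuracy, and how can I fix it? Thank you very much!
My code is: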
train_x=data.matrix(tmp[,-40])
train_y=tmp[,40]
test_x=data.matrix(tmp2[,-40])
test_y=tmp2[,40]
test_y=as.factor(test_y)
xgb_train = xgb.DMatrix(data=train_x, label=train_y)
xgb_test = xgb.DMatrix(data=test_x, label=test_y)
set.seed(12345)
xgbc=xgboost(data=xgb_train, max.depth=4, nrounds=200)
print(xgbc)
preds=predict(xgbc,test_x)
preds[preds>0.5] = "1"
pred_y = as.factor(test_y)
print(pred_y)
cm = confusionMatrix(test_y, pred_y)
print(cm)
The output of the code is:
> xgbc=xgboost(data=xgb_train,max.depth=4, nrounds=200, nthread=2, eta=1,
objective="binary:logistic")
[1] train-error:0.415888
[2] train-error:0.390654
[3] train-error:0.368692
[4] train-error:0.323832
[5] train-error:0.307944
[6] train-error:0.278037
[7] train-error:0.259346
[8] train-error:0.240187
[9] train-error:0.232710
[10] train-error:0.224766
[11] train-error:0.208879
[12] train-error:0.192523
[13] train-error:0.185981
[14] train-error:0.177103
[15] train-error:0.168224
[16] train-error:0.157944
[17] train-error:0.141121
[18] train-error:0.132243
[19] train-error:0.132243
[20] train-error:0.121495
[21] train-error:0.109346
[22] train-error:0.101869
[23] train-error:0.100000
[24] train-error:0.090654
[25] train-error:0.080374
[26] train-error:0.078505
[27] train-error:0.069626
[28] train-error:0.063084
[29] train-error:0.066822
[30] train-error:0.056542
[31] train-error:0.044860
[32] train-error:0.042991
[33] train-error:0.039252
[34] train-error:0.037383
[35] train-error:0.029439
[36] train-error:0.023832
[37] train-error:0.018692
[38] train-error:0.011682
[39] train-error:0.011215
[40] train-error:0.010748
[41] train-error:0.009346
[42] train-error:0.007477
[43] train-error:0.005140
[44] train-error:0.005140
[45] train-error:0.006075
[46] train-error:0.003271
[47] train-error:0.002804
[48] train-error:0.003271
[49] train-error:0.002804
[50] train-error:0.002804
[51] train-error:0.002336
[52] train-error:0.002336
[53] train-error:0.002336
[54] train-error:0.002336
[55] train-error:0.000935
[56] train-error:0.000467
[57] train-error:0.000000
[58] train-error:0.000000
[59] train-error:0.000000
[60] train-error:0.000935
[61] train-error:0.000467
[62] train-error:0.000000
[63] train-error:0.000000
[64] train-error:0.000000
[65] train-error:0.000000
[66] train-error:0.000000
[67] train-error:0.000000
[68] train-error:0.000000
[69] train-error:0.000000
[70] train-error:0.000000
[71] train-error:0.000000
[72] train-error:0.000000
[73] train-error:0.000000
[74] train-error:0.000000
[75] train-error:0.000000
[76] train-error:0.000000
[77] train-error:0.000000
[78] train-error:0.000000
[79] train-error:0.000000
[80] train-error:0.000000
[81] train-error:0.000000
[82] train-error:0.000000
[83] train-error:0.000000
[84] train-error:0.000000
[85] train-error:0.000000
[86] train-error:0.000000
[87] train-error:0.000000
[88] train-error:0.000000
[89] train-error:0.000000
[90] train-error:0.000000
[91] train-error:0.000000
[92] train-error:0.000000
[93] train-error:0.000000
[94] train-error:0.000000
[95] train-error:0.000000
[96] train-error:0.000000
[97] train-error:0.000000
[98] train-error:0.000000
[99] train-error:0.000000
[100] train-error:0.000000
> print(xgbc)
##### xgb.Booster
raw: 186.6 Kb
call:
xgb.train(params = params, data = dtrain, nrounds = nrounds,
watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
early_stopping_rounds = early_stopping_rounds, maximize = maximize,
save_period = save_period, save_name = save_name, xgb_model = xgb_model,
callbacks = callbacks, max.depth = 4, nthread = 2, eta = 1,
objective = "binary:logistic")
params (as set within xgb.train):
max_depth = "4", nthread = "2", eta = "1", objective = "binary:logistic",
silent = "1"
xgb.attributes:
niter
callbacks:
cb.print.evaluation(period = print_every_n)
cb.evaluation.log()
# of features: 38
niter: 200
nfeatures : 38
evaluation_log:
iter train_error
1 0.415888
2 0.390654
---
199 0.000000
200 0.000000
preds=predict(xgbc,test_x)
> preds
[1] 7.273692e-01 1.643806e-02 3.032141e-04 9.764441e-01 9.691942e-02
5.343258e-01 9.090783e-01
[8] 5.609832e-01 4.061035e-01 1.105066e-01 4.406907e-03 9.946358e-01
7.929156e-01 4.119191e-03
[15] 3.098451e-01 2.945659e-04 3.966548e-03 7.829595e-01 1.698021e-01
9.574184e-01 7.132806e-01
[22] 1.044374e-01 9.024003e-01 5.769060e-01 5.096554e-02 1.751429e-01
9.982671e-01 9.993696e-01
[29] 6.521277e-01 5.780852e-03 4.867651e-01 9.707865e-01 8.398834e-01
1.825542e-01 1.134274e-01
[36] 7.154977e-02 5.450470e-01 1.047506e-01 3.099218e-03 2.268739e-01
9.023346e-01 8.026977e-01
[43] 3.844074e-01 4.463347e-01 8.543612e-01 9.998935e-01 8.699111e-01
6.243381e-02 1.137973e-01
[50] 9.385086e-01 9.994442e-01 8.376440e-01 8.492180e-01 3.362629e-04
4.316351e-02 9.234415e-01
[57] 8.924388e-01 9.977444e-01 6.618840e-02 2.186051e-04 1.647688e-03
8.050095e-03 6.535615e-01
[64] 4.707330e-01 9.138927e-01 5.177013e-02 3.349773e-04 9.392425e-01
4.979803e-02 2.934091e-01
[71] 8.948106e-01 9.854530e-01 9.795361e-02 9.275551e-01 5.865968e-01
9.746857e-01 3.859183e-01
[78] 1.194406e-01 3.267710e-01 6.294726e-01 9.250816e-01 6.118813e-02
3.394562e-01 7.257250e-04
[85] 8.491386e-01 7.081388e-03 3.268852e-01 8.931246e-01 2.204458e-01
8.818560e-01 9.923303e-01
[92] 9.845840e-01 7.688413e-01 9.803721e-01 9.958567e-01 9.500723e-01
7.733757e-01 9.368727e-01
[99] 3.276393e-01 9.952766e-01 2.130413e-01 8.992375e-02 8.594028e-02
8.160641e-01 9.915828e-01
> preds[preds>0.5] = "1"
> preds[preds<=0.5]= "0"
> pred_y = as.factor(test_y)
> print(pred_y)
[1] 1 1 0 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1
1 0 1 1 1 0 1 0 1 1 1 1 0 0
[51] 1 1 0 1 0 1 1 0 1 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 1 1
1 1 0 0 1 0 0 0 1 1 1 1 0 1
> test_y=as.factor(test_y)
> cm = confusionMatrix(test_y, pred_y)
> print(cm)
Confusion Matrix and Statistics
          Reference
Prediction   0   1
         0 421   0
         1   0 497
Accuracy : 1
95% CI : (0.996, 1)
No Information Rate : 0.5414
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.4586
Detection Rate : 0.4586
Detection Prevalence : 0.4586
Balanced Accuracy : 1.0000
'Positive' Class : 0
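One thing worth checking in the evaluation step above: pred_y is created from test_y rather than from preds, so confusionMatrix(test_y, pred_y) compares the true labels with themselves and reports an accuracy of 1 regardless of what the model predicts. The sketch below illustrates this in base R with synthetic data. The names truth and probs are hypothetical stand-ins for test_y and preds, and the random probabilities are an assumption made purely for illustration.

```r
set.seed(1)

# Hypothetical stand-ins: `truth` plays the role of test_y,
# `probs` plays the role of predict(xgbc, test_x).
truth <- factor(sample(c(0, 1), 100, replace = TRUE), levels = c(0, 1))
probs <- runif(100)  # random scores, deliberately unrelated to truth

# Labels derived from the labels themselves (mirrors pred_y = as.factor(test_y))
pred_from_truth <- as.factor(truth)

# Labels derived from the predicted probabilities
pred_from_probs <- factor(ifelse(probs > 0.5, 1, 0), levels = c(0, 1))

acc_self  <- mean(pred_from_truth == truth)  # always 1: labels vs. themselves
acc_model <- mean(pred_from_probs == truth)  # reflects the actual predictions

print(acc_self)
print(acc_model)
print(table(Predicted = pred_from_probs, Actual = truth))
```

Building the predicted labels from the probabilities with ifelse also avoids assigning the strings "1" and "0" into a numeric vector, which silently converts the whole vector to character.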
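Answer 0 (score: 0)
It looks like you are severely overfitting the training data; you should use cross-validation rather than a single train-test split. There are several ways to do this. For example, you can use xgb.cv from the xgboost package in R. I prefer Tidymodels, but that is a different rabbit hole. My guess is that if you tune parameters such as gamma you will end up with a non-zero loss, because gamma > 0 helps prevent overfitting by pruning the trees. You can also help prevent overfitting by growing fewer, shallower trees, subsampling features, and so on. All of these options can be tried out through xgb.cv.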
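As a rough sketch of the cross-validation suggestion, the following runs xgb.cv on synthetic data. The data, the five folds, and the specific parameter values (gamma = 1, max_depth = 3, eta = 0.1, and the subsampling rates) are assumptions chosen for illustration, not values taken from the question.

```r
library(xgboost)

set.seed(12345)

# Synthetic stand-in for the training data (hypothetical; not the asker's data)
n <- 500
x <- matrix(rnorm(n * 10), nrow = n)
y <- as.numeric(x[, 1] + rnorm(n) > 0)  # noisy binary target
dtrain <- xgb.DMatrix(data = x, label = y)

# Regularised settings: shallow trees, small learning rate,
# gamma > 0 for pruning, row and column subsampling
params <- list(
  objective = "binary:logistic",
  eval_metric = "error",
  max_depth = 3,
  eta = 0.1,
  gamma = 1,
  subsample = 0.8,
  colsample_bytree = 0.8
)

# 5-fold cross-validation; each round reports mean train and held-out error
cv <- xgb.cv(params = params, data = dtrain, nrounds = 50, nfold = 5, verbose = 0)

log <- as.data.frame(cv$evaluation_log)
test_col <- grep("^test.*mean$", names(log), value = TRUE)[1]
best_round <- which.min(log[[test_col]])  # round with the lowest held-out error
print(log[best_round, ])
```

Comparing the train and test columns of the evaluation log shows how far the training error runs ahead of the held-out error, which is the gap a single train-error trace cannot reveal.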
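Answer 1 (score: -2)
Try checking the correlation between each predictor and the output. Try removing variables that are highly correlated with the output, because such a variable introduces strong bias (in effect, it can leak the target into the features). This resolved my own problem with 100% accuracy.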
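A minimal sketch of such a correlation check in base R follows. The data frame, the column names (noise_a, noise_b, leaky) and the 0.95 cut-off are all hypothetical and chosen only to illustrate the idea. A suitable threshold depends on the data.

```r
set.seed(42)

# Synthetic example: two uninformative features and one that nearly copies the target
n <- 200
y <- rbinom(n, 1, 0.5)
features <- data.frame(
  noise_a = rnorm(n),
  noise_b = rnorm(n),
  leaky   = y + rnorm(n, sd = 0.01)  # hypothetical leaked feature
)

# Pearson correlation of every predictor with the target
cors <- sapply(features, function(col) cor(col, y))

# Flag predictors whose absolute correlation exceeds the (assumed) cut-off
suspects <- names(cors)[abs(cors) > 0.95]

print(round(cors, 3))
print(suspects)
```

A flagged column is only a candidate for inspection. Whether it should be dropped depends on whether it would really be available at prediction time.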