Final prediction on the test set

Question

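I am working on a machine learning project: a binary classification problem with skewed (imbalanced) classes.

I chose to try XGBoost for the predictions.

X is my dataset without y (the target), and y is my target (binary: mostly 0s, with 20% 1s). I have 17,900 rows and 130 columns.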

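import matplotlib.pyplot as plt
import xgboost
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# 70% train; the remaining 30% is split evenly into validation and test
X_train, X_val, Y_train, Y_val = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, Y_val, Y_test = train_test_split(X_val, Y_val, test_size=0.5, random_state=42)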
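As a quick sanity check on the two-step split above, the arithmetic below works out how many rows land in each subset, assuming the 17,900 rows stated in the question. It is only a sketch of the proportions, not a call into scikit-learn, and the variable names are made up for illustration.

```python
# Sketch: expected subset sizes for a 70/15/15 split of 17,900 rows.
# Integer arithmetic is used to sidestep floating-point rounding.
n_rows = 17900

n_holdout = n_rows * 30 // 100   # first split: test_size=0.3
n_train = n_rows - n_holdout
n_test = n_holdout * 50 // 100   # second split: test_size=0.5
n_val = n_holdout - n_test

print(n_train, n_val, n_test)
```

The resulting test size agrees with the total of the four cells in the confusion matrix reported at the end of the question (1814 + 166 + 60 + 645).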





dtrain = xgboost.DMatrix(X_train, label=Y_train)
dval = xgboost.DMatrix(X_val, label=Y_val)
dtest = xgboost.DMatrix(X_test, label=Y_test)

params = {'objective': 'binary:logistic', 'learning_rate': 0.03, 'max_depth': 8, 'min_child_weight': 1, 'eval_metric': 'auc'}

# Early stopping monitors the last entry in the watchlist (the validation set)
watchlist = [(dtrain, 'train'), (dval, 'validation')]

model = xgboost.train(params, dtrain, evals=watchlist, early_stopping_rounds=10, num_boost_round=999)

y_train_preds = model.predict(dtrain)

cv_results = xgboost.cv(params, dtrain, seed=42, nfold=5, metrics={'auc'}, early_stopping_rounds=10, stratified=True)





# ROC curve on training set
title = "Receiver Operating Characteristic XGBoost on training set"

fpr, tpr, threshold = metrics.roc_curve(Y_train, y_train_preds)
roc_auc = metrics.auc(fpr, tpr)
plt.figure(figsize=(10, 10))
plt.title(title)
plt.plot(fpr, tpr, 'b', label='AUC = %0.5f' % roc_auc, antialiased=True)
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.legend(loc='lower right', prop={'size': 14})
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Final prediction on the test set

y_preds_test = model.predict(dtest)

# ROC curve on test set
title = "Receiver Operating Characteristic XGBoost on test set"
fpr, tpr, threshold = metrics.roc_curve(Y_test, y_preds_test)
roc_auc = metrics.auc(fpr, tpr)
plt.figure(figsize=(10, 10))
plt.title(title)
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc, antialiased=True)
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.legend(loc='lower right', prop={'size': 14})
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
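For intuition about the number that metrics.auc reports, the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting half. The pure-Python sketch below computes it that way on made-up labels and scores; `pairwise_auc` is a hypothetical helper, not part of scikit-learn or XGBoost.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. The cost is O(n_pos * n_neg),
    so this is only meant for small illustrative inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data invented for illustration: 3 negatives, 2 positives
toy_labels = [0, 0, 1, 0, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.90]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Here 4 of the 6 positive/negative pairs are ordered correctly, so the toy AUC is 4/6.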





title = "Final Confusion matrix XGBoost on test set"
# Find_Optimal_Cutoff and plotConfusionMatrix are user-defined helpers not shown in this post
threshold = Find_Optimal_Cutoff(Y_test, y_preds_test)
pred_opt = (y_preds_test > threshold).astype(int)
cnf_matrix = confusion_matrix(Y_test, pred_opt)
plotConfusionMatrix(cnf_matrix, classes=['OK', 'NOK'], title=title)
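Find_Optimal_Cutoff is not defined in the post, so its behavior is unknown. One common way to pick a cutoff is to maximize Youden's J statistic (true positive rate minus false positive rate) over the candidate thresholds. The plain-Python sketch below does that under this assumption; the name `youden_cutoff` and the toy data are hypothetical.

```python
def youden_cutoff(labels, scores):
    """Return the score threshold that maximizes TPR - FPR (Youden's J).

    Assumption: predictions are made with `score > threshold`, as in the
    question's code. This is a guess at what a cutoff helper might do,
    not the asker's actual function.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both classes to compute a cutoff")
    # Candidate thresholds: each distinct score, plus one below the minimum
    candidates = sorted(set(scores))
    candidates = [candidates[0] - 1.0] + candidates
    best_j, best_t = float("-inf"), None
    for t in candidates:
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s > t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s > t)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_j, best_t = j, t
    return best_t


# Toy data invented for illustration
toy_labels = [0, 0, 0, 1, 0, 1, 1]
toy_scores = [0.05, 0.20, 0.30, 0.45, 0.50, 0.70, 0.95]
cutoff = youden_cutoff(toy_labels, toy_scores)
toy_preds = [int(s > cutoff) for s in toy_scores]
print(cutoff, toy_preds)
```

On this toy data the best cutoff is 0.30, which catches all three positives at the cost of one false positive.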

My score on the test set is 0.97 AUC, and the confusion matrix looks good:

[[1814  166]
 [  60  645]]
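To put the matrix above in more familiar terms, the sketch below derives accuracy, precision, recall, and specificity from its four cells. It assumes scikit-learn's layout for confusion_matrix (rows are true classes, columns are predicted classes) and that NOK (label 1) is the positive class; the variable names are illustrative.

```python
# Cells of the reported test-set confusion matrix, assuming
# rows = true class, columns = predicted class, class 1 (NOK) = positive.
tn, fp = 1814, 166
fn, tp = 60, 645

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
positive_rate = (tp + fn) / total  # share of actual positives in the test set

print(f"n={total} acc={accuracy:.3f} prec={precision:.3f} "
      f"rec={recall:.3f} spec={specificity:.3f} pos_rate={positive_rate:.3f}")
```

Under those assumptions, recall and specificity both come out near 0.92 and precision near 0.80, at the threshold chosen by Find_Optimal_Cutoff.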

Thank you very much!

XGBoost: 0.97 AUC. Am I overfitting, or did I reuse one of my sets?


0 Answers: