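I am working on a machine learning project: a binary classification problem with skewed (imbalanced) classes.
I chose to try XGBoost for the predictions.
X is my dataset without the target, and y is my target (binary 0/1, with about 20% of the values equal to 1). I have 17900 rows and 130 columns.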
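Because the classes are skewed, it can help to check that each split keeps roughly the same share of positives. The following is only a sketch on synthetic data: the random features and labels, the `_demo` names, and the use of `stratify` are assumptions for illustration and are not part of the pipeline in this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 17900 rows, about 20% positives.
# Only 5 feature columns are generated here to keep the example light.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(17900, 5))
y_demo = (rng.random(17900) < 0.2).astype(int)

# 70% train, 30% held out; stratify keeps the class ratio in both parts.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Split the held-out 30% evenly into validation and test (15% each).
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold)

for name, part in [("train", y_tr), ("validation", y_va), ("test", y_te)]:
    print(f"{name}: {len(part)} rows, positive rate {part.mean():.3f}")
```

With 17900 rows, a 0.3 split followed by a 0.5 split yields 12530 / 2685 / 2685 rows, which agrees with the 2685 test rows that the confusion matrix at the end of this post sums to.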
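import xgboost
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# 70% train; the remaining 30% is split evenly into validation and test (15% each)
X_train, X_val, Y_train, Y_val = train_test_split(X, y, test_size=0.3, random_state=42)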
X_val, X_test, Y_val, Y_test = train_test_split(X_val, Y_val, test_size=0.5, random_state=42)
dtrain = xgboost.DMatrix(X_train, label=Y_train)
dval = xgboost.DMatrix(X_val, label=Y_val)
dtest = xgboost.DMatrix(X_test, label=Y_test)
params = {'objective': 'binary:logistic', 'learning_rate': 0.03, 'max_depth': 8, 'min_child_weight': 1, 'eval_metric': 'auc'}
watchlist = [ (dtrain,'train'), (dval,'validation')]
model = xgboost.train(params,dtrain, evals=watchlist,early_stopping_rounds=10, num_boost_round=999)
y_train_preds = model.predict(dtrain)
cv_results = xgboost.cv(params,dtrain,seed=42,nfold=5,metrics={'auc'},early_stopping_rounds=10,stratified=True)
#ROC curve on training set
title = "Receiver Operating Characteristic XGBoost on training set"
fpr, tpr, threshold = metrics.roc_curve(Y_train, y_train_preds)
roc_auc = metrics.auc(fpr, tpr)
plt.figure(figsize=(10,10))
plt.title(title)
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.5f' % roc_auc, antialiased=True)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.legend(loc = 'lower right', prop={'size': 14})
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
# ROC curve on test set
y_preds_test = model.predict(dtest)
title = "Receiver Operating Characteristic XGBoost on test set"
fpr, tpr, threshold = metrics.roc_curve(Y_test, y_preds_test)
roc_auc = metrics.auc(fpr, tpr)
plt.figure(figsize=(10,10))
plt.title(title)
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc, antialiased=True)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.legend(loc = 'lower right', prop={'size': 14})
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
# Confusion matrix on the test set
# (Find_Optimal_Cutoff and plotConfusionMatrix are helper functions not shown here)
title = "Final Confusion matrix XGBoost on test set"
threshold = Find_Optimal_Cutoff(Y_test, y_preds_test)
pred_opt = (y_preds_test > threshold).astype(int)
cnf_matrix = confusion_matrix(Y_test, pred_opt)
plotConfusionMatrix(cnf_matrix, classes=['OK','NOK'],
                    title=title)
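`Find_Optimal_Cutoff` is not defined anywhere in this post. One common way to choose such a cutoff is Youden's J statistic, i.e. the ROC threshold that maximizes TPR - FPR. The sketch below assumes that definition; the name `find_optimal_cutoff_youden` and the toy scores are hypothetical, and the helper actually used above may work differently.

```python
import numpy as np
from sklearn.metrics import roc_curve

def find_optimal_cutoff_youden(y_true, y_score):
    """Return the ROC threshold that maximizes Youden's J = TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    best = np.argmax(tpr - fpr)  # first index with the largest J
    return thresholds[best]

# Toy example with perfectly separable scores (illustrative values only).
y_true_demo = np.array([0, 0, 0, 1, 1, 1])
y_score_demo = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])

cutoff = find_optimal_cutoff_youden(y_true_demo, y_score_demo)

# Each ROC point corresponds to predicting positive when score >= threshold,
# so >= (not >) reproduces the TPR/FPR of the chosen point.
pred_demo = (y_score_demo >= cutoff).astype(int)
print(cutoff, pred_demo)
```

If the cutoff is chosen this way, picking it on the validation predictions rather than on `Y_test` would keep the test confusion matrix free of any tuning on the test labels.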
My test score is 0.97 AUC, and I get a very good confusion matrix:
[[1814 166] [60 645]]
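For reference, the rates implied by this matrix can be worked out directly. This sketch assumes scikit-learn's `confusion_matrix` layout (rows are true classes, columns are predicted classes) and that NOK (label 1) is the positive class; both are assumptions about how the matrix above was produced.

```python
# Confusion matrix from the post, assuming rows = true class,
# columns = predicted class, in the order [OK (0), NOK (1)].
cm = [[1814, 166],
      [60, 645]]
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
recall = tp / (tp + fn)          # share of true NOK that were caught
precision = tp / (tp + fp)       # share of predicted NOK that were correct
specificity = tn / (tn + fp)     # share of true OK that were kept as OK
accuracy = (tp + tn) / total

print(f"n={total} recall={recall:.3f} precision={precision:.3f} "
      f"specificity={specificity:.3f} accuracy={accuracy:.3f}")
# n=2685 recall=0.915 precision=0.795 specificity=0.916 accuracy=0.916
```

Under those assumptions, about one in five NOK predictions is a false alarm, which the 0.97 AUC alone does not show.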
Thank you very much!