Why does the best estimator from RandomizedSearchCV on XGBClassifier give different results when run on its own?

Asked: 2017-05-12 18:34:34

Tags: python classification xgboost

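I am trying to find the best hyperparameters for XGBClassifier, which should in turn lead to the most predictive attributes. I am using RandomizedSearchCV with KFold to iterate and validate.

I take the best estimator and run predict on my test subsample. Then I look at my confusion matrix and see that I get perfect results, even though my target column has been dropped.

Then I take the best estimator's parameters and run XGBClassifier directly, and my confusion matrix results change dramatically. I am not sure what I am doing wrong, since I expected the best estimator to behave the same inside and outside RandomizedSearchCV. Why do I keep getting a perfect score?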

import numpy as np
import xgboost as xgb
from scipy import stats
from scipy.stats import randint
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score

# df_comb_clean (a DataFrame) and target (the label column name) are assumed to be defined already.
y = np.asarray(df_comb_clean[target])
df_comb_X = df_comb_clean.drop([target], axis=1)
X = np.asarray(df_comb_X)

clf_xgb = xgb.XGBClassifier(objective='binary:logistic')
# Note: stats.uniform(loc, scale) samples from [loc, loc + scale], not [low, high].
param_dist = {'n_estimators': stats.randint(150, 1000),
              'learning_rate': stats.uniform(0.01, 0.6),
              'subsample': stats.uniform(0.3, 0.9),         # samples from [0.3, 1.2]
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': stats.uniform(0.5, 0.9),  # samples from [0.5, 1.4]
              'min_child_weight': [1, 2, 3, 4]
             }

numFolds = 5
# model_selection.KFold replaces the older cross_validation.KFold(n=..., n_folds=...).
kfold_5 = KFold(n_splits=numFolds, shuffle=True)

clf = RandomizedSearchCV(clf_xgb,
                         param_distributions=param_dist,
                         cv=kfold_5,
                         n_iter=5,  # 5 sampled parameter settings (x 5 folds = 25 fits)
                         scoring='roc_auc',
                         error_score=0,  # a failed fit is recorded as a score of 0
                         verbose=3,
                         n_jobs=-1)

clf.fit(X, y)

Running the following gives the mean train and test scores. I also want to pull out the best estimator:

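# In newer scikit-learn releases, 'mean_train_score' is only present when the
# search is created with return_train_score=True.
print("mean_train_score", clf.cv_results_['mean_train_score'])
print("mean_test_score", clf.cv_results_['mean_test_score'])
print(clf.best_estimator_)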

Output:

mean_train_score [ 0.  0.  1.  1.  1.]
mean_test_score [ 0.          0.          0.76425856  0.77198744  0.74225311]
XGBClassifier(base_score=0.5, colsample_bylevel=1,
       colsample_bytree=0.76920759422068707, gamma=0,
       learning_rate=0.13626591956991532, max_delta_step=0, max_depth=7,
       min_child_weight=1, missing=None, n_estimators=880, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True,
       subsample=0.59412792468572662)
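
The two zero entries in the scores above are consistent with failed fits rather than genuinely bad models. A likely cause is that scipy's `stats.uniform(loc, scale)` samples from `[loc, loc + scale]`, so `stats.uniform(0.3, 0.9)` can return a `subsample` above 1.0, which XGBoost does not accept, and `error_score = 0` then records the failure as a score of 0. The sketch below only inspects the sampling ranges; the helper name `sampling_range` and the bounded alternative `stats.uniform(0.3, 0.7)` are illustrative assumptions, not part of the original post.

```python
from scipy import stats


def sampling_range(dist):
    """Return the (lowest, highest) value a frozen scipy distribution can produce."""
    return dist.ppf(0.0), dist.ppf(1.0)


# As written in the question: loc=0.3, scale=0.9, so the upper end is 0.3 + 0.9.
as_written = stats.uniform(0.3, 0.9)
low, high = sampling_range(as_written)

# Hypothetical bounded alternative: loc=0.3, scale=0.7 keeps values within [0.3, 1.0].
bounded = stats.uniform(0.3, 0.7)
b_low, b_high = sampling_range(bounded)

print("as written:", low, high)
print("bounded:   ", b_low, b_high)
```

The same reasoning applies to `colsample_bytree`, where `stats.uniform(0.5, 0.9)` reaches up to 1.4.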

As a next step, I want to see how the best estimator performs on my subsample populations and output the results as confusion matrices:

import pandas as pd
from sklearn.metrics import confusion_matrix

# df_train and df_test are the train and test subsamples.
y_train = np.asarray(df_train[target])
df_train_X = df_train.drop([target], axis=1)
X_train = np.asarray(df_train_X)

dtrain_predictions = clf.best_estimator_.predict(X_train)
cnf_matrix_train = confusion_matrix(y_train, dtrain_predictions)
print("train: \n", cnf_matrix_train)

y_test = np.asarray(df_test[target])
df_test_X = df_test.drop([target], axis=1)
X_test = np.asarray(df_test_X)

dtest_predictions = clf.best_estimator_.predict(X_test)
xpred = pd.DataFrame(dtest_predictions)
cnf_matrix_test = confusion_matrix(y_test, dtest_predictions)
print("test: \n", cnf_matrix_test)

This gives me a very strange output, and I do not know why (I dropped the target in the section above and even reset the index):

train: 
[[3840    0]
 [   0  354]]
test: 
[[1644    0]
 [   0  150]]

Next, I take my best estimator and rerun fit/predict with it outside of RandomizedSearchCV, and now I get the same results:

# Same parameters as the best estimator printed above.
clf_best = xgb.XGBClassifier(base_score=0.5, colsample_bylevel=1,
       colsample_bytree=0.76920759422068707, gamma=0,
       learning_rate=0.13626591956991532, max_delta_step=0, max_depth=7,
       min_child_weight=1, missing=None, n_estimators=880, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True,
       subsample=0.59412792468572662)

# Fit on the full cleaned data set (this includes the rows in df_test).
df_comb_X = df_comb_clean.drop([target], axis=1)
clf_best.fit(df_comb_X, df_comb_clean[target], eval_metric='auc')

clf_test_best = clf_best.predict(df_test_X)

cnf_best_test = confusion_matrix(y_test, clf_test_best)
print("test: \n", cnf_best_test)

# booster() is the accessor in the xgboost release used here; newer releases name it get_booster().
feat_imp = pd.Series(clf_best.booster().get_fscore()).sort_values(ascending=False)

Test results:

test: 
[[1644    0]
 [   0  150]]

I figured it out: my estimator was fit on the total population, and the subsamples are part of that total population. Silly mistake.
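
One way to catch this kind of overlap early is to count how many evaluation rows also appear in the data used for fitting. The following is a minimal sketch on synthetic data; the frames `df_full` and `df_test_demo` and the helper `count_seen_rows` are hypothetical names introduced for illustration, and the check assumes that overlapping rows match exactly on every column. Note that resetting the index does not hide the overlap, because the comparison is on row contents.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the full data set (an assumption; the original data is not available).
rng = np.random.RandomState(0)
df_full = pd.DataFrame(rng.rand(10, 3), columns=["a", "b", "target"])

# A "test" subsample drawn from the same frame, with its index reset.
df_test_demo = df_full.sample(n=3, random_state=0).reset_index(drop=True)


def count_seen_rows(df_eval, df_fit):
    """Count rows of df_eval whose values also occur as a row of df_fit."""
    # With no `on` argument, merge joins on all columns the two frames share.
    merged = df_eval.merge(df_fit.drop_duplicates(), how="left", indicator=True)
    return int((merged["_merge"] == "both").sum())


seen = count_seen_rows(df_test_demo, df_full)
print("evaluation rows already present in the fitting data:", seen, "of", len(df_test_demo))
```

Any count above zero means the evaluation set is not independent of the fitting data.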

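1 answer:

Answer 0 (score: 0)

I was fitting the XGBClassifier on the entire population and then drawing the random subsamples from that same population, which is why I kept getting the same perfect results.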
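
The effect can be reproduced on synthetic data. With the default `refit=True`, RandomizedSearchCV refits `best_estimator_` on all of the data passed to `fit`, so predicting on a subsample of that data scores rows the model has already seen. The sketch below is an illustration under stated assumptions, not the original setup: it swaps in scikit-learn's `DecisionTreeClassifier` for XGBClassifier, uses `make_classification` data with 10% label noise, and the helper `run_search` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise, so a perfect score on unseen rows is implausible.
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.1, random_state=0)

# Hold out the test rows before any fitting happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)


def run_search(X_fit, y_fit):
    """Run a small randomized search; best_estimator_ is refit on all of X_fit."""
    search = RandomizedSearchCV(
        DecisionTreeClassifier(random_state=0),  # unpruned trees can memorize their training rows
        param_distributions={"criterion": ["gini", "entropy"],
                             "max_features": [None, "sqrt"]},
        n_iter=3, cv=5, scoring="roc_auc", random_state=0)
    return search.fit(X_fit, y_fit)


# Leaky: the search sees every row, including the ones later used as the "test" set.
leaky = run_search(X, y)
leaky_cm = confusion_matrix(y_test, leaky.best_estimator_.predict(X_test))

# Honest: the search only sees the training split.
honest = run_search(X_train, y_train)
honest_cm = confusion_matrix(y_test, honest.best_estimator_.predict(X_test))

print("leaky test confusion matrix:\n", leaky_cm)
print("honest test confusion matrix:\n", honest_cm)
```

The leaky matrix has no off-diagonal entries because the test rows were memorized, while the cross-validated scores from the search (like the roughly 0.77 `mean_test_score` in the question) are the more realistic estimate.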