I am trying to find the best hyperparameters for XGBClassifier, which should lead to the most predictive set of attributes. I am using RandomizedSearchCV to iterate over candidates and validate them with KFold.

I take the best-fit estimator and run the predict function on my test subsample. Then I look at my confusion matrix and see that I get perfect results, even though my target was dropped from the features.

Then I take the best estimator and run XGBClassifier directly, and my confusion matrix results change drastically. I am not sure what I am doing wrong, since I would expect the best-fit estimator to behave consistently whether it runs inside or outside RandomizedSearchCV. Why do I keep getting a perfect score?
import numpy as np
import pandas as pd
import xgboost as xgb
from scipy import stats
from sklearn.model_selection import RandomizedSearchCV, KFold
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score, f1_score, roc_auc_score

y = np.asarray(df_comb_clean[target])
df_comb_X = df_comb_clean.drop([target], axis=1)
X = np.asarray(df_comb_X)

clf_xgb = xgb.XGBClassifier(objective='binary:logistic')
param_dist = {'n_estimators': stats.randint(150, 1000),
              'learning_rate': stats.uniform(0.01, 0.6),
              'subsample': stats.uniform(0.3, 0.9),
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': stats.uniform(0.5, 0.9),
              'min_child_weight': [1, 2, 3, 4]
              }

numFolds = 5
kfold_5 = KFold(n_splits=numFolds, shuffle=True)  # KFold now lives in sklearn.model_selection

clf = RandomizedSearchCV(clf_xgb,
                         param_distributions=param_dist,
                         cv=kfold_5,
                         n_iter=5,  # you want 5 here, not 25, if I understand you correctly
                         scoring='roc_auc',
                         error_score=0,
                         return_train_score=True,  # required for mean_train_score on newer sklearn
                         verbose=3,
                         n_jobs=-1)
clf.fit(X, y)
Running the following gives me the mean train and test scores. I also want to pull out the best estimator:
print "mean_train_score", clf.cv_results_['mean_train_score']
print "mean_test_score", clf.cv_results_['mean_test_score']
print clf.best_estimator_
Output:
mean_train_score [ 0. 0. 1. 1. 1.]
mean_test_score [ 0. 0. 0.76425856 0.77198744 0.74225311]
XGBClassifier(base_score=0.5, colsample_bylevel=1,
colsample_bytree=0.76920759422068707, gamma=0,
learning_rate=0.13626591956991532, max_delta_step=0, max_depth=7,
min_child_weight=1, missing=None, n_estimators=880, nthread=-1,
objective='binary:logistic', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=0, silent=True,
subsample=0.59412792468572662)
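A side note on the two zero scores above: scipy.stats.uniform(loc, scale) samples from the interval [loc, loc + scale], so 'subsample': stats.uniform(0.3, 0.9) can draw values up to 1.2 and 'colsample_bytree': stats.uniform(0.5, 0.9) up to 1.4, both outside XGBoost's valid (0, 1] range. With error_score=0, candidates that fail on such draws are silently scored 0, which would explain those zeros. A minimal sketch of bounded distributions (the exact bounds are my assumption, not from the original search):

param_dist = {'n_estimators': stats.randint(150, 1000),
              'learning_rate': stats.uniform(0.01, 0.6),
              'subsample': stats.uniform(0.3, 0.7),         # samples from [0.3, 1.0]
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': stats.uniform(0.5, 0.5),  # samples from [0.5, 1.0]
              'min_child_weight': [1, 2, 3, 4]
              }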
Next, I want to see how the best estimator performs on my subsamples of the population, and print the results as confusion matrices:
y_train = np.asarray(df_train[target])
df_train_X = df_train.drop([target], axis=1)
X_train = np.asarray(df_train_X)

dtrain_predictions = clf.best_estimator_.predict(X_train)
cnf_matrix_train = confusion_matrix(y_train, dtrain_predictions)
print("train: \n", cnf_matrix_train)

y_test = np.asarray(df_test[target])
df_test_X = df_test.drop([target], axis=1)
X_test = np.asarray(df_test_X)

dtest_predictions = clf.best_estimator_.predict(X_test)
xpred = pd.DataFrame(dtest_predictions)
cnf_matrix_test = confusion_matrix(y_test, dtest_predictions)
print("test: \n", cnf_matrix_test)
This gives me a very strange output that I cannot explain (I dropped the target in the section above and even reset the index):
train:
[[3840 0]
[ 0 354]]
test:
[[1644 0]
[ 0 150]]
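A useful sanity check when a confusion matrix comes out perfect is to compare it against out-of-fold predictions, where each row is predicted by a model that never saw it during training. A minimal sketch, assuming the X, y, and clf defined above:

from sklearn.model_selection import cross_val_predict

# each row is predicted by a fold model that did not train on it
oof_predictions = cross_val_predict(clf.best_estimator_, X, y, cv=5)
print("out-of-fold: \n", confusion_matrix(y, oof_predictions))

If the out-of-fold matrix looks much worse than the ones above, the perfect scores are coming from evaluating on training data.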
Next I take my best estimator and refit/predict with it outside of RandomizedSearchCV, and now I get the same results:
clf_best = xgb.XGBClassifier(base_score=0.5, colsample_bylevel=1,
                             colsample_bytree=0.76920759422068707, gamma=0,
                             learning_rate=0.13626591956991532, max_delta_step=0, max_depth=7,
                             min_child_weight=1, missing=None, n_estimators=880, nthread=-1,
                             objective='binary:logistic', reg_alpha=0, reg_lambda=1,
                             scale_pos_weight=1, seed=0, silent=True,
                             subsample=0.59412792468572662)

df_comb_X = df_comb_clean.drop([target], axis=1)
clf_best.fit(df_comb_X, df_comb_clean[target], eval_metric='auc')  # fit on the full combined data

clf_test_best = clf_best.predict(df_test_X)
cnf_best_test = confusion_matrix(y_test, clf_test_best)
print("test: \n", cnf_best_test)

feat_imp = pd.Series(clf_best.get_booster().get_fscore()).sort_values(ascending=False)  # booster() was renamed get_booster()
Test results:
test:
[[1644 0]
[ 0 150]]
I figured it out: my estimator was fit on the whole population, and my subsamples are part of that population. Silly mistake.
Answer 0 (score: 0):
I fit the XGBClassifier on the entire population and then drew random subsamples from that same population, which is why I got identical results.
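For anyone hitting the same issue: one way to avoid it is to hold out the evaluation set before running the search, so that clf.best_estimator_ (which, with the default refit=True, is refit on everything passed to fit) never sees the rows it is scored on. A minimal sketch, assuming the X, y, and clf from the question (the split size and random_state are arbitrary choices of mine):

from sklearn.model_selection import train_test_split

# hold out an evaluation set BEFORE the hyperparameter search
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

clf.fit(X_tr, y_tr)  # search and refit only on the training split
holdout_predictions = clf.best_estimator_.predict(X_te)
print("holdout: \n", confusion_matrix(y_te, holdout_predictions))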