I have the following data pipeline, but I am confused about how to interpret its output. Any help is much appreciated.
# tune the hyperparameters via a cross-validated randomized search
from time import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

print("[INFO] tuning hyperparameters via grid search")
params = {"max_depth": [3, None],
          "max_features": [1, 2, 3, 4],
          "min_samples_split": [2, 3, 10],
          "min_samples_leaf": [1, 3, 10],
          "bootstrap": [True, False],
          "criterion": ["gini", "entropy"]}
model = RandomForestClassifier(n_estimators=50)
grid = RandomizedSearchCV(model, params, cv=10, scoring='roc_auc')
start = time()
grid.fit(X_train, y_train)
# report the search time, then score the best model found on the training data
print("[INFO] grid search took {:.2f} seconds".format(
    time() - start))
acc = grid.score(X_train, y_train)
print("[INFO] grid search accuracy: {:.2f}%".format(acc * 100))
print("[INFO] grid search best parameters: {}".format(
    grid.best_params_))
Looking at the cross-validated training score:
rf_score_train = grid.score(X_train, y_train)
rf_score_train
0.87845540607872441
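For context, a fitted search object exposes two different numbers here: `best_score_` is the mean score over the held-out cross-validation folds, while `score(X_train, y_train)` re-scores the refit best estimator on the data it was trained on. The sketch below illustrates the distinction on synthetic data. The dataset, the reduced parameter grid, and the names `search`, `cv_auc`, and `train_auc` are assumptions made for illustration and are not part of the pipeline above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (an assumption; the original X and y are not shown).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A deliberately small search so the example runs quickly.
small_params = {"max_depth": [3, None], "min_samples_leaf": [1, 3, 10]}
search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    small_params,
    n_iter=4,
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X_tr, y_tr)

# Mean ROC AUC over the held-out folds for the best parameter setting.
cv_auc = search.best_score_
# ROC AUC of the refit best estimator on data it has already seen.
train_auc = search.score(X_tr, y_tr)

print("cross-validated AUC: {:.3f}".format(cv_auc))
print("training-set AUC:    {:.3f}".format(train_auc))
```

The training-set figure is usually the more optimistic of the two, so `best_score_` is probably the fairer number to set beside a test-set score.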
Now using this trained model to score the test set:
rf_score_test = grid.score(X_test, y_test)
rf_score_test
0.72482993197278911
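However, when I take this model's predictions as an array and compare them with the actual outcomes using the external roc_auc_score metric, I get a score that is completely different from the GridSearchCV 'roc_auc' score on the test set above.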
model_prediction = grid.predict(X_test)
model_prediction
array([0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0])
The actual outcomes:
actual_outcome = np.array(y_test)
actual_outcome
array([0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0,
       0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 1, 0])
Using roc_auc_score outside of GridSearch:
from sklearn.metrics import roc_auc_score
roc_accuracy = roc_auc_score(actual_outcome, model_prediction)*100
roc_accuracy
59.243197278911566
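So with the cross-validated 'roc_auc' score inside GridSearch I get around 72, but when I use 'roc_auc_score' externally I get 59. Which one is correct? I am confused. Am I doing something wrong here? Any help is much appreciated!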
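One likely source of the gap is that the 'roc_auc' scorer ranks samples by a continuous score (from `predict_proba` or `decision_function`), whereas `predict` returns hard 0/1 labels, which collapses the ROC curve to a single threshold. A minimal sketch on made-up numbers follows; the arrays `y_true` and `y_prob` are hypothetical and are not taken from the data above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth and predicted probabilities for the positive class.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.20, 0.30, 0.60, 0.40, 0.45, 0.70, 0.90])

# Hard labels at a 0.5 cutoff: the kind of 0/1 output that predict() returns.
y_label = (y_prob >= 0.5).astype(int)

# AUC from continuous scores uses the full ranking of the samples.
auc_from_scores = roc_auc_score(y_true, y_prob)
# AUC from hard labels only sees one operating point on the ROC curve.
auc_from_labels = roc_auc_score(y_true, y_label)

print(auc_from_scores)  # 0.875
print(auc_from_labels)  # 0.625
```

In the pipeline above, `roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])` would be the call expected to match `grid.score(X_test, y_test)`, assuming a binary target whose positive class sits in the second probability column.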