I'm practicing classification on the MNIST dataset (using Aurélien Géron's "Hands-On" book as a reference). After training several different models, I settled on a RandomForestClassifier fit to a training set that had been expanded with additional images (shifted versions of the originals) and then scaled. When I checked the model's accuracy on the test set, it performed about 4% worse than its 3-fold cross_val_score on the training set. Why is that?
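For context, here is roughly how the expanded, scaled training set was built (a minimal sketch in the spirit of the book's exercise; the shift_image helper and the one-pixel shifts in four directions are illustrative):

# Sketch: expand the training set with shifted copies, then scale it
import numpy as np
from scipy.ndimage import shift
from sklearn.preprocessing import StandardScaler

def shift_image(image, dx, dy):
    # Shift a flattened 28x28 MNIST image by (dx, dy) pixels, padding with zeros
    shifted = shift(image.reshape(28, 28), [dy, dx], cval=0, mode="constant")
    return shifted.reshape(-1)

X_train_expanded = [image for image in X_train]
y_train_expanded = [label for label in y_train]
for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    for image, label in zip(X_train, y_train):
        X_train_expanded.append(shift_image(image, dx, dy))
        y_train_expanded.append(label)
X_train_expanded = np.array(X_train_expanded)
y_train_expanded = np.array(y_train_expanded)

scaler = StandardScaler()
X_train_scaled_expanded = scaler.fit_transform(X_train_expanded.astype(np.float64))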
Since that seemed strange to me, I decided to check the test-set prediction accuracy of some of the other models I had trained, including SGD using OvA, SGD using OvO, and the RandomForestClassifier. For each one, I checked how a model trained on the scaled training set performed on the scaled test set, and how a model trained on the non-scaled training set performed on the non-scaled test set.
The SGD models performed roughly as expected (about the same as their cross_val_score on the training set), including, in each case, the model trained on scaled data performing better than the one trained on non-scaled data. Even the RandomForestClassifier trained on non-scaled data performed as expected on the test set. Only the RandomForestClassifier trained on scaled data showed a significant drop (despite having performed best under cross_val_score).
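For reference, the classifiers referred to below were set up roughly like this (a sketch; random_state and the other hyperparameters are illustrative):

# Sketch: the model variants compared below
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier

# Random forests, one per data variant
forest_clf_scaled_expanded = RandomForestClassifier(random_state=42).fit(
    X_train_scaled_expanded, y_train_expanded)
forest_clf_2 = RandomForestClassifier(random_state=42).fit(
    X_train_expanded, y_train_expanded)

# SGDClassifier handles multiclass as OvA on its own; wrapping it in
# OneVsOneClassifier forces OvO
ovo_clf_scaled_expanded = OneVsOneClassifier(SGDClassifier(random_state=42)).fit(
    X_train_scaled_expanded, y_train_expanded)
ovo_clf_expanded = OneVsOneClassifier(SGDClassifier(random_state=42)).fit(
    X_train_expanded, y_train_expanded)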
# Showing the cross_val_score results of the RandomForestClassifier trained on scaled data
from sklearn.model_selection import cross_val_score

forest_clf_accuracies_scaled_expanded = cross_val_score(
    forest_clf_scaled_expanded, X_train_scaled_expanded, y_train_expanded,
    cv=3, scoring="accuracy")
forest_clf_accuracy_mean_scaled_expanded = forest_clf_accuracies_scaled_expanded.mean()
print(forest_clf_accuracies_scaled_expanded)
print(forest_clf_accuracy_mean_scaled_expanded)
OUTPUT: [0.93771246 0.94149707 0.94169125]
OUTPUT: 0.9403002620167648
# Testing that model on the test set returns a worse-than-expected result
final_model = forest_clf_scaled_expanded
# Scale the test set (note that fit_transform fits the scaler on X_test itself)
prepared_test_set = scaler.fit_transform(X_test.astype(np.float64))
final_predictions = final_model.predict(prepared_test_set)  # not used below; score() re-predicts
final_accuracy = final_model.score(prepared_test_set, y_test)
final_accuracy
OUTPUT: 0.9036
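For reference, the one place the scaled pipeline differs between training and test time is the scaling of X_test above. A minimal sketch of the alternative, transforming X_test with a scaler still fitted on the training data instead of re-fitting it (I have not shown output for this variant):

# Alternative scaling of the test set (sketch): reuse the training-set fit
# (assumes scaler still holds the fit from the training data, i.e. this runs
# before the fit_transform call above re-fits it on X_test)
prepared_test_set_alt = scaler.transform(X_test.astype(np.float64))
final_model.score(prepared_test_set_alt, y_test)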
# Out of curiosity, I checked how the RandomForestClassifier trained on non-scaled data performs; it performs as expected (its cross_val_score had been ~.94)
forest_clf_2.score(X_test, y_test)
OUTPUT: 0.9475
# Now I wanted to see how scaled and non-scaled data performed with the other classifiers; here are the test-set results for SGD using OvO; both perform on par with their cross_val_scores, with the scaled data doing a bit better than the unscaled
ovo_clf_scaled_expanded.score(prepared_test_set, y_test)
OUTPUT: 0.9265
ovo_clf_expanded.score(X_test, y_test)
OUTPUT: 0.9189
I expected the RandomForestClassifier to score around .94 on the scaled test data, but the result was around .90.
Thanks.