Question

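I am using a decision tree in scikit-learn to classify spam email. After reading various posts here and elsewhere, I split my initial dataset into training and test sets and tuned the hyperparameters using cross-validation on the training set. As I understand it, scores should be computed on both the training and test sets to check whether the model is overfitting. Given that the scores on the test set are good, can I rule out that problem and present the scores obtained from the whole dataset? Or should I show the results on the test set? This is the code for the training/test sets: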

from sklearn.model_selection import cross_val_score

# 10-fold cross-validation on the training set
scores = cross_val_score(tree, x_train, y_train, cv=10)
scores_pre = cross_val_score(tree, x_train, y_train, cv=10, scoring="precision")
scores_f1 = cross_val_score(tree, x_train, y_train, cv=10, scoring="f1")
scores_recall = cross_val_score(tree, x_train, y_train, cv=10, scoring="recall")
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print("Precision: %0.2f (+/- %0.2f)" % (scores_pre.mean(), scores_pre.std() * 2))
print("F-Measure: %0.2f (+/- %0.2f)" % (scores_f1.mean(), scores_f1.std() * 2))
print("Recall: %0.2f (+/- %0.2f)" % (scores_recall.mean(), scores_recall.std() * 2))

Accuracy: 0.97 (+/- 0.02)
Precision: 0.98 (+/- 0.02)
F-Measure: 0.98 (+/- 0.01)
Recall: 0.98 (+/- 0.02)

# 10-fold cross-validation on the test set (the tree is refit on folds of x_test)
scores = cross_val_score(tree, x_test, y_test, cv=10)
scores_pre = cross_val_score(tree, x_test, y_test, cv=10, scoring="precision")
scores_f1 = cross_val_score(tree, x_test, y_test, cv=10, scoring="f1")
scores_recall = cross_val_score(tree, x_test, y_test, cv=10, scoring="recall")
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print("Precision: %0.2f (+/- %0.2f)" % (scores_pre.mean(), scores_pre.std() * 2))
print("F-Measure: %0.2f (+/- %0.2f)" % (scores_f1.mean(), scores_f1.std() * 2))
print("Recall: %0.2f (+/- %0.2f)" % (scores_recall.mean(), scores_recall.std() * 2))

Accuracy: 0.95 (+/- 0.03)
Precision: 0.96 (+/- 0.02)
F-Measure: 0.96 (+/- 0.02)
Recall: 0.97 (+/- 0.03)

This is the code for the whole dataset:

scores = cross_val_score(tree, X, y, cv=10)
scores_pre = cross_val_score(tree, X, y, cv=10, scoring="precision")
scores_f1 = cross_val_score(tree, X, y, cv=10, scoring="f1")
scores_recall = cross_val_score(tree, X, y, cv=10, scoring="recall")
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print("Precision: %0.2f (+/- %0.2f)" % (scores_pre.mean(), scores_pre.std() * 2))
print("F-Measure: %0.2f (+/- %0.2f)" % (scores_f1.mean(), scores_f1.std() * 2))
print("Recall: %0.2f (+/- %0.2f)" % (scores_recall.mean(), scores_recall.std() * 2))

Accuracy: 0.97 (+/- 0.04)
Precision: 0.98 (+/- 0.03)
F-Measure: 0.98 (+/- 0.03)
Recall: 0.98 (+/- 0.03)

Answer 1

No. The final score you report should always be computed on the test set, which in practice serves as your validation set.
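One way to do this is sketched below, under some assumptions: the spam data is not shown in the question, so `make_classification` stands in for it, and the `param_grid` values are hypothetical. Note that `cross_val_score` refits a fresh copy of the estimator on every fold, so calling it on `x_test` does not score the tree that was tuned on `x_train`. The sketch instead tunes with `GridSearchCV` on the training set only, then predicts once on the untouched test set and reports those numbers.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the spam dataset (assumption: the real X, y are not shown).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a test set that is never used during tuning.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical search space; replace with the grid actually used.
param_grid = {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 10]}

# Hyperparameter tuning with 10-fold cross-validation on the training set only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=10, scoring="f1"
)
search.fit(x_train, y_train)

# With the default refit=True, best_estimator_ has been retrained on all of x_train.
y_pred = search.best_estimator_.predict(x_test)

# These held-out scores are the ones to report.
test_scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
}

print("Mean CV F1 on training set: %0.2f" % search.best_score_)
for name, value in test_scores.items():
    print("Test %s: %0.2f" % (name, value))
```

Comparing the mean cross-validated score on the training set with the test score gives the overfitting check the question asks about: a test score far below the cross-validated one would suggest the tuned tree does not generalize.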

Scikit-learn: using cross-validation on the whole dataset after hyperparameter tuning
