Question

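I want to use cross-validation on my dataset in Python 3. However, every time I run the code I get different evaluation results. How can I get the same results on every run?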

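import numpy as np
from sklearn import ensemble, linear_model
from sklearn.model_selection import StratifiedKFold, cross_val_score

lr = linear_model.LogisticRegression()
rf = ensemble.RandomForestClassifier(n_estimators = 5, criterion = 'entropy')
folds = StratifiedKFold(n_splits = 10, shuffle = True, random_state=None)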

lr_scoresa = cross_val_score(lr, X, Y, scoring ='accuracy', cv = folds)
rf_scoresa = cross_val_score(rf, X, Y, scoring ='accuracy', cv = folds)
rf_scoresf = cross_val_score(rf, X, Y, scoring ='f1', cv = folds)

print(np.mean(rf_scoresa),np.mean(rf_scoresf))
print(np.mean(lr_scoresa))

Answer 1

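Your problem comes from the randomness in RandomForestClassifier and StratifiedKFold. I suggest changing the last parameter, random_state, to some int (for example 1). The documentation confirms that otherwise the behavior is indeed random: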

If `None`, the random number generator is the `RandomState` instance used by `np.random`. Used when `shuffle == True`.

The key line of code should look like this:

folds = StratifiedKFold(n_splits = 10, shuffle = True, random_state=1)

Answer 2

folds = StratifiedKFold(n_splits = 10, shuffle = True, random_state=1)

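This alone does not produce identical accuracy scores; it only produces identical data folds. I believe the best solution is to also pass a fixed number as the random state of each classifier.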

rf = ensemble.RandomForestClassifier(n_estimators = 5, criterion = 'entropy', random_state = 42)
rf_scoresf = cross_val_score(rf, X, Y, scoring ='f1', cv = folds)
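
Putting both answers together, a minimal end-to-end sketch might look like the following. The synthetic dataset from make_classification stands in for the asker's X and Y, and run_once, the seed values, and max_iter=1000 are illustrative assumptions rather than anything from the original post. Note that the seed is set on the estimator and on the splitter, since cross_val_score itself takes no random-state argument.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the asker's X and Y (an assumption for illustration).
X, Y = make_classification(n_samples=200, n_features=10, random_state=0)

def run_once():
    """Build fresh, fully seeded objects and return the mean CV scores."""
    lr = LogisticRegression(max_iter=1000)
    # Seed the forest itself so every fit draws the same bootstrap samples.
    rf = RandomForestClassifier(n_estimators=5, criterion='entropy', random_state=42)
    # Seed the splitter so every run uses the same shuffled folds.
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    lr_acc = cross_val_score(lr, X, Y, scoring='accuracy', cv=folds)
    rf_acc = cross_val_score(rf, X, Y, scoring='accuracy', cv=folds)
    rf_f1 = cross_val_score(rf, X, Y, scoring='f1', cv=folds)
    return np.mean(lr_acc), np.mean(rf_acc), np.mean(rf_f1)

first = run_once()
second = run_once()
print(first == second)  # the two runs should agree exactly
```

If either random_state is left as None, the two runs would be expected to diverge, most visibly in the random forest scores.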

How do I get the same accuracy score every time when using cross-validation?
