Question

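I am trying to find the best parameter set for an SVR model. I want to use GridSearchCV over different values of C. However, from earlier tests I noticed that the split into training/test sets strongly influences the overall performance (r2 in this case). To address this, I would like to implement a repeated 5-fold cross-validation (10 x 5CV). Is there a built-in way to do this with GridSearchCV?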

Quick solution:

Following the idea presented in the scikit-learn official documentation, a quick solution looks like this:

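# Assumes svr, p_grid, X and y are already defined.
import numpy
from sklearn.model_selection import GridSearchCV, KFold

NUM_TRIALS = 10
scores = []
for i in range(NUM_TRIALS):
    # A different seed per trial gives a different 5-fold split each time.
    cv = KFold(n_splits=5, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
    clf.fit(X, y)  # best_score_ only exists after fitting
    scores.append(clf.best_score_)
print("Average Score: {0} STD: {1}".format(numpy.mean(scores), numpy.std(scores)))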

Answer 1

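This is called nested cross-validation. You can look at the official documentation example to point you in the right direction, and also have a look at my other answer here for a similar approach.

You can adapt the steps to suit your needs: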

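from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# SVC matches the iris classification data used here; swap in SVR for regression
svr = SVC(kernel="rbf")
# The "..." stands for further values; replace it before running
c_grid = {"C": [1, 10, 100, ...  ]}

# CV Technique "LabelKFold", "LeaveOneOut", "LeaveOneLabelOut", etc.

# Seed for the shuffles; vary it in a loop to repeat the whole procedure
i = 0

# To be used within GridSearch (5 in your case)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=i)

# To be used in outer CV (you asked for 10)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=i)

# Non_nested parameter search and scoring
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_score = clf.best_score_

# Pass the gridSearch estimator to cross_val_score
# This will be your required 10 x 5 cvs
# 10 for outer cv and 5 for gridSearch's internal CV
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv).mean()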
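
If the goal is only to repeat the 5-fold split 10 times inside a single search, rather than to nest two loops, one possible sketch is shown below. It assumes a scikit-learn version that provides RepeatedKFold in sklearn.model_selection. The synthetic data from make_regression, the C values, and the name repeated_cv are illustrative assumptions and do not come from the question.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.svm import SVR

# Synthetic regression data, used only for illustration.
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# 5 folds repeated 10 times with different shuffles -> 50 train/test splits.
repeated_cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)

# Illustrative grid; the values are placeholders, not recommendations.
search = GridSearchCV(
    estimator=SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100]},
    cv=repeated_cv,
    scoring="r2",
)
search.fit(X, y)

n_splits = repeated_cv.get_n_splits()
print("splits evaluated per candidate:", n_splits)
print("best C:", search.best_params_["C"])
print("mean r2 over all splits:", search.best_score_)
```

Here best_score_ is the mean r2 of the winning C over all 50 splits, so a single lucky or unlucky split has less influence. This is repeated rather than nested cross-validation: the score is still computed on the same splits that were used to pick C.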

Edit - Description of nested cross-validation with cross_val_score() and GridSearchCV()

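1. clf = GridSearchCV(estimator, param_grid, cv=inner_cv).
2. Pass clf, X, y, outer_cv to cross_val_score.
3. As seen in the source code of cross_val_score, this X will be divided into X_outer_train, X_outer_test using outer_cv. Same for y.
4. X_outer_test will be held back and X_outer_train will be passed on to clf for fit() (GridSearchCV in our case). Assume X_outer_train is called X_inner from here on, since it is passed to the inner estimator, and assume y_outer_train is y_inner.
5. X_inner will now be split into X_inner_train and X_inner_test using inner_cv in the GridSearchCV. Same for y.
6. Now the gridSearch estimator will be trained using X_inner_train and y_inner_train and scored using X_inner_test and y_inner_test.
7. Steps 5 and 6 will be repeated for inner_cv_iters (5 in this case).
8. The hyper-parameters for which the average score over all inner iterations (X_inner_train, X_inner_test) is best are passed on to clf.best_estimator_, which is then fitted on all the data, i.e. X_outer_train.
9. This clf (gridsearch.best_estimator_) will then be scored using X_outer_test and y_outer_test.
10. Steps 3 to 9 will be repeated for outer_cv_iters (10 here) and an array of scores will be returned from cross_val_score.
11. We then use mean() to get back nested_score.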
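
The walkthrough above can be sketched as an explicit loop and compared with what cross_val_score returns. This is a minimal illustration under stated assumptions, not the library's internal code: the iris data, the C values, the fixed seeds, and the names manual_scores and library_scores are chosen here for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
c_grid = {"C": [1, 10, 100]}  # illustrative values only

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)

manual_scores = []
# Step 3: outer_cv splits the data into outer train/test parts.
for train_idx, test_idx in outer_cv.split(X, y):
    # Step 4: the outer training part becomes the inner data set.
    X_inner, y_inner = X[train_idx], y[train_idx]
    X_outer_test, y_outer_test = X[test_idx], y[test_idx]

    # Steps 5-8: the grid search runs inner_cv on X_inner, then refits the
    # best candidate on all of X_inner (refit=True is the default).
    clf = GridSearchCV(estimator=SVC(kernel="rbf"), param_grid=c_grid, cv=inner_cv)
    clf.fit(X_inner, y_inner)

    # Step 9: score the refitted best estimator on the held-out outer fold.
    manual_scores.append(clf.score(X_outer_test, y_outer_test))

# Steps 1-2 and 10: the same procedure delegated to cross_val_score.
clf = GridSearchCV(estimator=SVC(kernel="rbf"), param_grid=c_grid, cv=inner_cv)
library_scores = cross_val_score(clf, X=X, y=y, cv=outer_cv)

# Step 11: average the outer-fold scores.
nested_score = np.mean(manual_scores)
print("manual nested score:", nested_score)
print("cross_val_score nested score:", library_scores.mean())
```

With integer seeds on both KFold objects the splits are reproducible, so the manual loop and cross_val_score should produce the same per-fold scores.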

scikit-learn GridSearchCV with multiple repetitions
