Question

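I have a dataset that was previously split into 3 sets: train, validation, and test. These sets have to be used as given in order to compare performance across different algorithms.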

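I would now like to optimize the parameters of my SVM using the validation set. However, I cannot find how to pass the validation set explicitly to sklearn.grid_search.GridSearchCV(). Below is some code I previously used for K-fold cross-validation on the training set. For this problem, though, I need to use the given validation set. How can I do that?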

from sklearn import svm, cross_validation
from sklearn.grid_search import GridSearchCV

# (some code left out to simplify things)

skf = cross_validation.StratifiedKFold(y_train, n_folds=5, shuffle = True)
clf = GridSearchCV(svm.SVC(tol=0.005, cache_size=6000,
                             class_weight=penalty_weights),
                     param_grid=tuned_parameters,
                     n_jobs=2,
                     pre_dispatch="n_jobs",
                     cv=skf,
                     scoring=scorer)
clf.fit(X_train, y_train)

Answer 1

Use PredefinedSplit:

ps = PredefinedSplit(test_fold=your_test_fold)

Then set cv=ps in GridSearchCV.

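test_fold: "array-like, shape (n_samples,)

test_fold[i] gives the test set fold of sample i. A value of -1 indicates that the corresponding sample is not part of any test set folds, but will instead always be put into the training fold."

See also here

When using a validation set, set test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.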
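The test_fold recipe above can be sketched as follows for the case where the training and validation sets already exist as separate arrays. This is a minimal illustration, not the asker's actual setup: the synthetic data, the 90/30 row counts, and the parameter grid are assumptions made only so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVC

# Synthetic stand-ins for the pre-existing training and validation sets (assumed data).
X_all, y_all = make_classification(n_samples=120, n_features=8, random_state=0)
X_train, y_train = X_all[:90], y_all[:90]
X_val, y_val = X_all[90:], y_all[90:]

# GridSearchCV receives a single array and splits it through `cv`,
# so the two sets are stacked: training rows first, validation rows after.
X_combined = np.concatenate([X_train, X_val])
y_combined = np.concatenate([y_train, y_val])

# -1: row is always in the training fold; 0: row belongs to the single validation fold.
test_fold = np.concatenate([
    np.full(len(X_train), -1),
    np.zeros(len(X_val), dtype=int),
])
ps = PredefinedSplit(test_fold=test_fold)

# Hypothetical grid, chosen only for illustration.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}

# refit=False keeps the search from retraining on training + validation rows.
clf = GridSearchCV(SVC(), param_grid=param_grid, cv=ps, refit=False)
clf.fit(X_combined, y_combined)
print(clf.best_params_)
```

Because there is exactly one fold, every parameter combination is trained once on the training rows and scored once on the validation rows.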

Answer 2

# Import Libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import PredefinedSplit

# Split Data to Train and Validation
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, stratify=y, random_state=2020)

# Create a list where train data indices are -1 and validation data indices are 0
# (this relies on X being a pandas DataFrame/Series with a unique index)
split_index = [-1 if x in X_train.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)

# Use PredefinedSplit in GridSearchCV
clf = GridSearchCV(estimator = estimator,
                   cv=pds,
                   param_grid=param_grid)

# Fit with all data
clf.fit(X, y)
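One thing worth noting about the snippet above: with the default refit=True, GridSearchCV retrains best_estimator_ on everything passed to fit, which here means training and validation rows together. If the comparison protocol requires the final model to be trained on the training set only, one option is to pass refit=False and fit the final model by hand. The sketch below assumes plain NumPy arrays (so it splits row positions instead of using a pandas index), and the data and grid are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split
from sklearn.svm import SVC

# Assumed toy data standing in for X and y.
X, y = make_classification(n_samples=100, n_features=6, random_state=2020)

# Split row positions rather than the data itself, so no pandas index is needed.
idx_train, idx_val = train_test_split(
    np.arange(len(X)), train_size=0.8, stratify=y, random_state=2020
)

# Validation rows get fold 0, training rows get -1.
split_index = np.zeros(len(X), dtype=int)
split_index[idx_train] = -1
pds = PredefinedSplit(test_fold=split_index)

# refit=False: only search, do not retrain on train + validation afterwards.
search = GridSearchCV(SVC(), param_grid={"C": [1, 10, 100]}, cv=pds, refit=False)
search.fit(X, y)

# Train the final model on the training rows only, using the selected parameters.
final_model = SVC(**search.best_params_).fit(X[idx_train], y[idx_train])
print(search.best_params_, final_model.score(X[idx_val], y[idx_val]))
```

Whether to refit on training plus validation data or on training data alone depends on the evaluation protocol being followed; both are common.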

Answer 3

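Consider using the hypopt Python package (pip install hypopt), of which I am the author. It is a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out of the box and can also be used with Tensorflow, PyTorch, Caffe2, etc.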

# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch
from sklearn.svm import SVR  # needed for SVR() below

param_grid = [
  {'C': [1, 10, 100], 'kernel': ['linear']},
  {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model = SVR(), param_grid = param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))

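Edit: I (think I) received a -1 on this response because I am suggesting a package that I authored. That is unfortunate, given that the package was created specifically to solve this type of problem.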
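For readers who prefer not to add a dependency, the same idea (fit each parameter combination on the training set, score it on the validation set, keep the best) can be sketched with scikit-learn's ParameterGrid alone. This is not hypopt's implementation, only a rough equivalent of the workflow; the synthetic regression data and the 90/30/30 split are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVR

# Assumed toy data, cut into train / validation / test by position.
X, y = make_regression(n_samples=150, n_features=5, noise=0.1, random_state=0)
X_train, y_train = X[:90], y[:90]
X_val, y_val = X[90:120], y[90:120]
X_test, y_test = X[120:], y[120:]

param_grid = [
    {"C": [1, 10, 100], "kernel": ["linear"]},
    {"C": [1, 10, 100], "gamma": [0.001, 0.0001], "kernel": ["rbf"]},
]

best_score, best_params, best_model = float("-inf"), None, None
for params in ParameterGrid(param_grid):
    # Fit on the training set only, then score (R^2 for SVR) on the validation set.
    model = SVR(**params).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_params, best_model = score, params, model

# The test set is touched once, after the parameters have been chosen.
print(best_params, best_model.score(X_test, y_test))
```

The loop makes the data flow explicit: the validation set drives parameter selection, and the test set is only used for the final report.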
