I'd like to know the difference between the default cross-validation performed by sklearn's GridSearchCV and explicitly passing it a KFold splitter, as in the code below:
Without KFold:
clf = GridSearchCV(estimator=model, param_grid=parameters, cv=10, scoring='f1_macro')
clf = clf.fit(xOri, yOri)
With KFold:
NUM_TRIALS = 5
for i in range(NUM_TRIALS):
    cv = KFold(n_splits=10, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=model, param_grid=parameters, cv=cv, scoring='f1_macro')
    clf = clf.fit(xOri, yOri)
As I understand from the manual, both split the data into 10 parts, 9 used for training and 1 for validation. But in the KFold example, the whole process is repeated 5 times, shuffling the data before splitting it into 10 parts each time. Am I right?
Answer 0 (score: 1)
Looks like you're right, ish.
GridSearchCV uses either KFold or StratifiedKFold, depending on whether your estimator is a regressor (KFold) or a classifier (StratifiedKFold).
Since I don't know what your data and model are like, I can't be sure which is being used in this situation.
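One way to see which splitter an integer `cv` resolves to is sklearn's `check_cv` helper, which GridSearchCV relies on internally. A small sketch (the toy `y` labels here are made up):

```python
from sklearn.model_selection import check_cv, KFold, StratifiedKFold

# A toy binary classification target (hypothetical data).
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# classifier=True mimics passing a classifier to GridSearchCV:
cv_clf = check_cv(cv=5, y=y, classifier=True)
# classifier=False mimics passing a regressor:
cv_reg = check_cv(cv=5, y=None, classifier=False)

print(type(cv_clf).__name__)  # StratifiedKFold
print(type(cv_reg).__name__)  # KFold
```

Note that the splitters GridSearchCV builds from an integer `cv` do not shuffle (`shuffle=False` by default), unlike the `KFold(shuffle=True, random_state=i)` in the loop above.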
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
But the code you have above will repeat the KFold validation 5 times with different random seeds.
Whether that will produce meaningfully different splits of the data? Not sure.
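To check whether different seeds actually change the folds, you can compare the test-index assignments directly. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(40).reshape(20, 2)  # toy data, 20 samples

# Record the test folds produced under two different seeds.
folds = {}
for seed in (0, 1):
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    folds[seed] = [tuple(test) for _, test in kf.split(X)]

# With shuffle=True, different random_state values permute the rows
# differently, so the fold memberships (almost always) differ:
print(folds[0] != folds[1])
```

Each of the 5 trials therefore grid-searches over a different 10-fold partition of the data, which is why the loop can yield slightly different cross-validation results from trial to trial.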