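I am training several models and validating them with sklearn's cross-validation. I want to use the same folds for all of the models, which will be run from different Python scripts.
How can I do this? Should I save the folds to a file? Should I save the kfold object? Or should I just use the same seed?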
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
Answer 0 (score: 0)
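The simplest way I found to save the folds is to loop over the splits produced by the stratified k-fold object, collect the train and test indices of each split, and store them in a JSON file: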
import json
import numpy as np
from sklearn.model_selection import StratifiedKFold

# n_splits, seed and the one-hot label array y are assumed to be defined earlier
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
folds = {}
# only the length of X matters for splitting, so a dummy array of zeros is enough
for count, (train, test) in enumerate(kfold.split(np.zeros(len(y)), y.argmax(1)), start=1):
    folds['fold_{}'.format(count)] = {
        'train': train.tolist(),  # convert index arrays to lists so json can serialize them
        'test': test.tolist(),
    }
assert len(folds) == n_splits  # we have the expected number of splits

# dump folds to json
with open('folds.json', 'w') as fp:
    json.dump(folds, fp)
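Note 1: argmax is used here because my y values are one-hot encoded, so we need to recover the predicted/true class index for each sample before stratifying.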
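To illustrate the note above, here is a minimal sketch of what `y.argmax(1)` does to one-hot labels. The array `y` below is made-up example data, not the data from my scripts:

```python
import numpy as np

# Hypothetical one-hot labels: 4 samples, 3 classes
y = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
])

# argmax along axis 1 gives the position of the 1 in each row,
# which is the integer class label that StratifiedKFold can stratify on
labels = y.argmax(1)
print(labels)  # [0 2 1 2]
```

If your labels are already integer class ids, you can presumably pass `y` directly and skip the argmax.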
The folds can now be loaded from any other script:
import json

# load the folds into a dict
with open('folds.json') as f:
    kfolds = json.load(f)
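After loading, it may be worth checking that the folds still form a valid partition. The following is only a sketch: `check_folds` is a hypothetical helper I made up for illustration, and the small `kfolds` dict stands in for what `json.load` would return.

```python
# Stand-in for the dict loaded from folds.json (hypothetical: 6 samples, 3 folds)
kfolds = {
    "fold_1": {"train": [2, 3, 4, 5], "test": [0, 1]},
    "fold_2": {"train": [0, 1, 4, 5], "test": [2, 3]},
    "fold_3": {"train": [0, 1, 2, 3], "test": [4, 5]},
}
n_samples = 6


def check_folds(folds, n_samples):
    """Return True if the folds form a valid k-fold partition of range(n_samples)."""
    all_indices = set(range(n_samples))
    seen_test = []
    for val in folds.values():
        train, test = set(val["train"]), set(val["test"])
        if train & test:
            return False  # a sample is in both train and test of the same fold
        if train | test != all_indices:
            return False  # the fold does not cover every sample
        seen_test.extend(val["test"])
    # every sample must appear in exactly one test set
    return sorted(seen_test) == list(range(n_samples))


folds_ok = check_folds(kfolds, n_samples)
```

This catches the case where the data set changed size or order between the script that wrote the file and the one that reads it, at least as far as the index ranges go.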
From here we can easily iterate over the entries of the dictionary:
for key, val in kfolds.items():
    print(key)
    train = val['train']  # list of training indices for this fold
    test = val['test']    # list of test indices for this fold
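The stored indices can then be used to slice the actual data. A minimal sketch, assuming the features and labels are NumPy arrays; `X`, `y` and the small `kfolds` dict here are made-up stand-ins:

```python
import numpy as np

# Hypothetical data: 6 samples with 2 features each
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])

# Stand-in for the dict returned by json.load; indices are plain Python lists
kfolds = {
    "fold_1": {"train": [0, 1, 2, 3], "test": [4, 5]},
    "fold_2": {"train": [2, 3, 4, 5], "test": [0, 1]},
}

shapes = {}
for key, val in kfolds.items():
    # convert the JSON lists back to index arrays
    train_idx = np.asarray(val["train"])
    test_idx = np.asarray(val["test"])
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # fit and evaluate a model on this fold here
    shapes[key] = (X_train.shape, X_test.shape)
```

This only gives the same split in every script if each script loads the samples in the same order as the one that created the folds.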
Our JSON file looks like this:
{"fold_1": {"train": [193, 2405, 2895, 565, 1215, 274, 2839, 1735, 2536, 1196, 40, 2541, 980,...SNIP...830, 1032], "test": [1, 5, 6, 7, 10, 15, 20, 26, 37, 45, 52, 54, 55, 59, 60, 64, 65, 68, 74, 76, 78, 90, 100, 106, 107, 113, 122, 124, 132, 135, 141, 146,...SNIP...]}
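On the "same seed" option from the question: as far as I can tell, `StratifiedKFold` with `shuffle=True` and a fixed integer `random_state` yields the same indices every time for the same labels in the same order. A small sketch to check this, where `make_folds` is a hypothetical helper and `y` is made-up data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical integer labels: 20 samples, 2 balanced classes
y = np.array([0, 1] * 10)
X_placeholder = np.zeros(len(y))  # only its length is used for splitting


def make_folds(seed, n_splits=5):
    """Build the list of (train, test) index lists for a given seed."""
    kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return [(train.tolist(), test.tolist())
            for train, test in kfold.split(X_placeholder, y)]


# Two independent runs with the same seed give identical folds
same = make_folds(42) == make_folds(42)
```

This relies on every script seeing identical labels in identical order, and it is not guaranteed to stay identical across scikit-learn versions, which is why I prefer saving the indices to a file.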