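Given that I am running 5-fold cross-validation on a large DataFrame, how can I store each fold in its own train and test arrays?
See the scikit-learn documentation here: http://scikit-learn.org/stable/modules/cross_validation.html
Here is the example they give: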
>>> import numpy as np
>>> from sklearn.model_selection import KFold
>>> X = ["a", "b", "c", "d"]
>>> kf = KFold(n_splits=2)
>>> for train, test in kf.split(X):
...     print("%s %s" % (train, test))
[2 3] [0 1]
[0 1] [2 3]
Each fold is constituted by two arrays: the first one is related to the training set, and the second one to the test set. Thus, one can create the training/test sets using numpy indexing:
>>> X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
>>> y = np.array([0, 1, 0, 1])
>>> X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
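Note that the loop variables `train` and `test` are overwritten on every iteration, so the indexing line above only produces the sets for the last fold. A minimal sketch of keeping every fold, assuming the same toy `X` and `y` as in the documentation example (the list name `folds` is just an illustrative choice):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
y = np.array([0, 1, 0, 1])

kf = KFold(n_splits=2)

# Collect one (X_train, X_test, y_train, y_test) tuple per fold
# instead of letting each iteration overwrite the previous one.
folds = []
for train, test in kf.split(X):
    folds.append((X[train], X[test], y[train], y[test]))

# Any fold can be unpacked later by position.
X_train, X_test, y_train, y_test = folds[0]
print(X_test)
```

Each entry of `folds` holds actual data arrays rather than indices, so this uses more memory than storing the index arrays alone.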
My DataFrame has thousands of values, but I would like to store the folds like this:
V_train, V_test, W_train, W_test, X_train, X_test, Y_train, Y_test, Z_train, Z_test
Answer 0 (score: 1)
You can do the following:
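import pandas as pd
from sklearn.model_selection import KFold

X = pd.DataFrame()  # placeholder: replace with your own DataFrame (at least 5 rows)
kf = KFold(n_splits=5)
# kf.split yields one (train_indices, test_indices) pair per fold
((V_train_ids, V_test_ids),
 (W_train_ids, W_test_ids),
 (X_train_ids, X_test_ids),
 (Y_train_ids, Y_test_ids),
 (Z_train_ids, Z_test_ids)) = list(kf.split(X))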
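Edit:
After that, you have the indices of the train and test parts of each fold. To get the train and test objects themselves, you can select the rows at those indices: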
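# The indices are integer row positions, so use .iloc;
# plain X[ids] on a DataFrame would look up column labels instead.
((V_train, V_test),
 (W_train, W_test),
 (X_train, X_test),
 (Y_train, Y_test),
 (Z_train, Z_test)) = ((X.iloc[V_train_ids], X.iloc[V_test_ids]),
                       (X.iloc[W_train_ids], X.iloc[W_test_ids]),
                       (X.iloc[X_train_ids], X.iloc[X_test_ids]),
                       (X.iloc[Y_train_ids], X.iloc[Y_test_ids]),
                       (X.iloc[Z_train_ids], X.iloc[Z_test_ids]))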
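If ten separate variable names become unwieldy, one alternative is to keep the folds in a dictionary keyed by the names from the question. This is only a sketch: the DataFrame `df`, its columns `feature` and `target`, and the dictionary `folds` are hypothetical stand-ins, not part of the original question or answer.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical stand-in data: 10 rows, 2 columns.
df = pd.DataFrame({"feature": np.arange(10), "target": np.arange(10) % 2})

kf = KFold(n_splits=5)
names = ["V", "W", "X", "Y", "Z"]  # one label per fold

folds = {}
for name, (train_ids, test_ids) in zip(names, kf.split(df)):
    # KFold.split returns integer row positions, so select rows with .iloc.
    folds[name + "_train"] = df.iloc[train_ids]
    folds[name + "_test"] = df.iloc[test_ids]

print(folds["V_test"])
```

With this layout, `folds["W_train"]` plays the role of `W_train`, and changing the number of splits only requires changing `n_splits` and the list of names.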