Question

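I need to run K-fold CV on some models, but I need to make sure the validation (test) set is clustered together by group and by t number of years. GroupKFold is close, but it still splits up the validation set (see the second fold).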

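For example, say I have data from 2000-2008 and I want to K-fold it into 3 groups. The appropriate sets would be: Validation: 2000-2002, Train: 2003-2008; V: 2003-2005, T: 2000-2002 and 2006-2008; and V: 2006-2008, T: 2000-2005.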

Is there a way to group and cluster the data using K-fold CV so that the validation set is clustered by t years?

from sklearn.model_selection import GroupKFold

X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10, 0.1, 0.2, 2.2]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d", "a", "b", "b"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]

gkf = GroupKFold(n_splits=3)  # three folds, matching the three lines of output below
for train_index, test_index in gkf.split(X, y, groups=groups):
    print("Train:", train_index, "Validation:", test_index)

Output:

Train: [ 0  1  2  3  4  5 10 11 12] Validation: [6 7 8 9]
Train: [3 4 5 6 7 8 9] Validation: [ 0  1  2 10 11 12]
Train: [ 0  1  2  6  7  8  9 10 11 12] Validation: [3 4 5]

Desired output (assuming 2 years per group):

Train: [ 7 8 9 10 11 12 ] Validation: [0 1 2 3 4 5 6]
Train: [0 1 2 10 11 12 ] Validation: [ 3 4 5 6 7 8 9 ]
Train: [ 0  1  2  3 4 5 ] Validation: [6 7 8 9 10 11 12]

That said, the test and train subsets do not have to be sequential, and more years could be chosen per group.

Answer 1

I hope I understood you correctly.

The LeaveOneGroupOut method from scikit-learn's model_selection might help:

Let's say you assign the group label 0 to all data points from 2000-2002, the label 1 to all data points from 2003-2005, and the label 2 to the data from 2006-2008. Then you could use the following method to create training and test splits, where each of the three test splits is built from one of the three groups:

from sklearn.model_selection import LeaveOneGroupOut
import numpy as np

# One label per sample; samples sharing a label are always held out together.
groups = [1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3,3]
X = np.random.random(len(groups))
y = np.random.randint(0, 4, len(groups))

logo = LeaveOneGroupOut()
print("n_splits=", logo.get_n_splits(X, y, groups))
# Each iteration uses exactly one group as the test set.
for train_index, test_index in logo.split(X, y, groups):
    print("train_idx:", train_index, "test_idx:", test_index)

Output:

n_splits= 3
train_idx: [ 4  5  6  7  8  9 10 11 12 13 14 15 16 17] test_idx: [0 1 2 3]
train_idx: [ 0  1  2  3 10 11 12 13 14 15 16 17] test_idx: [4 5 6 7 8 9]
train_idx: [0 1 2 3 4 5 6 7 8 9] test_idx: [10 11 12 13 14 15 16 17]

Edit

I think I now finally understand what you want. Sorry it took me so long.

I don't think your desired split method is already implemented in sklearn, but we can easily extend the BaseCrossValidator class.

import numpy as np
from sklearn.model_selection import BaseCrossValidator
from sklearn.utils.validation import check_array

class GroupOfGroups(BaseCrossValidator):
    def __init__(self, group_of_groups):
        """
        :param group_of_groups: list with length n_splits. Each entry in the list is a list with group ids from set(groups). In each of the n_splits splits, the groups given in the current group_of_groups sublist are used for validation.
        """
        self.group_of_groups = group_of_groups

    def get_n_splits(self, X=None, y=None, groups=None):
        return len(self.group_of_groups)

    def _iter_test_masks(self, X=None, y=None, groups=None):
        if groups is None:
            raise ValueError("The 'groups' parameter should not be None.")
        groups = check_array(groups, copy=True, ensure_2d=False, dtype=None)
        for g in self.group_of_groups:
            # Builtin bool is used here; the np.bool alias is not available in every NumPy release.
            test_index = np.zeros(len(groups), dtype=bool)
            for g_id in g:
                test_index[groups == g_id] = True
            yield test_index

Usage is pretty simple. As before, we define X, y and groups. In addition, we define a list of lists (a group of groups) that specifies which groups are used together in which test fold. So g_of_g=[[1,2],[2,3],[3,4]] means that groups 1 and 2 are used as the test set in the first fold, while the remaining groups 3 and 4 are used for training. In the second fold, the data from groups 2 and 3 is used as the test set, and so on.

I'm not quite happy with the name "GroupOfGroups", so maybe you'll find something better.

Now we can test this cross-validator:

X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10, 0.1, 0.2, 2.2]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d", "a", "b", "b"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]
g_of_g = [[1,2],[2,3],[3,4]]
gg = GroupOfGroups(g_of_g)
print("n_splits=", gg.get_n_splits(X,y,groups))
for train_index, test_index in gg.split(X, y, groups):
    print("train_idx:", train_index, "test_idx:", test_index)
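
The list of lists above is written by hand. For a longer span of years it could also be generated. The following is only a minimal sketch: the helper name sliding_group_windows and its width and step parameters are made up for illustration and are not part of sklearn.

```python
def sliding_group_windows(group_ids, width, step=1):
    """Hypothetical helper: windows of `width` consecutive group ids.

    Consecutive windows start `step` ids apart, so they overlap
    whenever step < width.
    """
    ordered = sorted(set(group_ids))
    return [ordered[i:i + width]
            for i in range(0, len(ordered) - width + 1, step)]


groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]
g_of_g = sliding_group_windows(groups, width=2)
print(g_of_g)
```

With step equal to width the windows no longer overlap, which gives ordinary disjoint validation folds instead.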

Please keep in mind that I did not include many checks and did not test this thoroughly, so verify carefully that it works for you.
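
The first part of this answer assumes one group label per three-year block. If the data carries a year for each sample, those labels could be derived with integer division. This is a sketch under assumptions: the years array, the block_size value and the placeholder X and y are invented for illustration, and only LeaveOneGroupOut comes from sklearn.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Assumed example data: one sample per year from 2000 to 2008.
years = np.array([2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008])
block_size = 3  # years per validation block (assumption)

# Integer division maps 2000-2002 to 0, 2003-2005 to 1, 2006-2008 to 2.
groups = (years - years.min()) // block_size

# Placeholder features and targets, only needed to drive the splitter.
X = np.arange(len(years)).reshape(-1, 1)
y = np.zeros(len(years))

splits = list(LeaveOneGroupOut().split(X, y, groups))
for train_idx, test_idx in splits:
    print("train years:", years[train_idx], "validation years:", years[test_idx])
```

Each validation fold then holds exactly one contiguous block of years, and the remaining blocks form the training set.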

Group/cluster K-fold CV with Sklearn
