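I am trying to use StratifiedKFold to create train / test / val splits for use in a non-sklearn machine learning workflow. So the DataFrame needs to be split once and then stay that way.
I am trying to do this with .values, since I am passing pandas DataFrames: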
skf = StratifiedKFold(n_splits=3, shuffle=False)
skf.get_n_splits(X, y)
for train_index, test_index, valid_index in skf.split(X.values, y.values):
    print("TRAIN:", train_index, "TEST:", test_index, "VALID:", valid_index)
    X_train, X_test, X_valid = X.values[train_index], X.values[test_index], X.values[valid_index]
    y_train, y_test, y_valid = y.values[train_index], y.values[test_index], y.values[valid_index]
This fails with:
ValueError: not enough values to unpack (expected 3, got 2).
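I have read through the sklearn documentation and run the example code, but I still do not understand how to use stratified k-fold splits outside of sklearn's cross-validation schemes.
Edit:
I also tried this: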
# Create train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)
# Create validation split from train split
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.05)
This seems to work, although I suspect that doing it this way messes up the stratification.
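Answer 0 (score: 2)
StratifiedKFold can only split a dataset into two parts per fold. You get the error because the split() method only yields tuples of train_index and test_index (see https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/model_selection/_split.py#L94).
For this use case, you should first split the data into validation and rest, and then split the rest into test and train, like so: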
# stratify expects the array of class labels, not a column name
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, train_size=0.8, stratify=y)
# 0.25 of the remaining 80% is 20% of the full data
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, train_size=0.75, stratify=y_rest)
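To make the two points in this answer concrete, here is a minimal sketch on synthetic data. The arrays, the 80/20 class balance, and the `random_state` values are assumptions chosen for illustration, and `stratify` is given the label array for each stage. It checks how many index arrays each fold contains, then runs the two-stage split and reports the size and minority-class share of each part.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic, imbalanced data (illustrative only): 80 samples of class 0, 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# Each item yielded by split() is a pair of index arrays (train, test).
skf = StratifiedKFold(n_splits=5)
first_fold = next(iter(skf.split(X, y)))
print(len(first_fold))  # number of index arrays per fold

# Stage 1: hold out 20% for validation, stratified on the full label array.
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# Stage 2: take 25% of the remaining 80% as test, stratified on the remaining labels.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

split_sizes = (len(y_train), len(y_val), len(y_test))
minority_share = tuple(float(part.mean()) for part in (y_train, y_val, y_test))
print(split_sizes, minority_share)
```

Because both stages stratify on the labels they are actually splitting, the class ratio should carry through to all three parts, which is what the unstratified second split in the question does not guarantee.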
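Answer 1 (score: 0)
In the stratify parameter, pass the target to stratify on. For the first split, pass the full target array (y in my case). Then, for the next split, pass the target that has already been split (y_train in my case):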
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)
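Since the question asks for the DataFrame itself to be split and kept that way, the same idea can be sketched without dropping to `.values`. `train_test_split` accepts a DataFrame and returns DataFrames. The toy frame and the column names `feature` and `label` below are hypothetical stand-ins, not taken from the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame; "label" is an assumed name for the target column.
df = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 70 + [1] * 30,
})

# First split off the test set, stratifying on the full target column.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["label"])
# Then split validation from what remains, stratifying on the remaining targets.
train_df, val_df = train_test_split(
    train_df, test_size=0.2, random_state=42, stratify=train_df["label"])

# Report size and positive-class share of each part.
for name, part in [("train", train_df), ("val", val_df), ("test", test_df)]:
    print(name, len(part), round(part["label"].mean(), 3))
```

Each part keeps its original index, so the three frames can be saved or passed to a non-sklearn pipeline as they are.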
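Answer 2 (score: 0)
I am not sure whether this question is about KFold or just about stratified splitting, but I wrote this quick wrapper around StratifiedKFold that also yields a cross-validation set.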
from sklearn.model_selection import StratifiedKFold, train_test_split
class StratifiedKFold3(StratifiedKFold):
    def split(self, X, y, groups=None):
        s = super().split(X, y, groups)
        for train_indxs, test_indxs in s:
            y_train = y[train_indxs]
            train_indxs, cv_indxs = train_test_split(train_indxs,stratify=y_train, test_size=(1 / (self.n_splits - 1)))
            yield train_indxs, cv_indxs, test_indxs
It can be used like this:
import numpy as np
X = np.random.rand(100)
y = np.random.choice([0, 1], 100)
g = StratifiedKFold3(10).split(X, y)
train, cv, test = next(g)
train.shape, cv.shape, test.shape
>> ((80,), (10,), (10,))
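If X and y are pandas objects, as in the question, the indices from a wrapper like this are positional and should be applied with `.iloc`. Below is a rough sketch of the same idea as a plain generator function. `three_way_folds`, its `seed` argument, and the toy DataFrame are all hypothetical names introduced here, not part of sklearn or of the answer above. It assumes `n_splits` is at least 3, since the validation fraction is `1 / (n_splits - 1)`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

def three_way_folds(X, y, n_splits=5, seed=0):
    """Yield (train, val, test) positional index arrays for each stratified fold."""
    y_arr = np.asarray(y)  # index by position, regardless of the pandas index
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Only the number of samples and the labels matter for the split itself.
    for rest_idx, test_idx in skf.split(np.zeros(len(y_arr)), y_arr):
        # Carve a validation set of roughly one fold's size out of the rest.
        train_idx, val_idx = train_test_split(
            rest_idx, test_size=1 / (n_splits - 1),
            stratify=y_arr[rest_idx], random_state=seed)
        yield train_idx, val_idx, test_idx

# Hypothetical data with a non-default index, to show why .iloc is used.
df = pd.DataFrame({"feature": np.arange(100.0),
                   "label": [0] * 60 + [1] * 40},
                  index=np.arange(1000, 1100))
X, y = df[["feature"]], df["label"]

folds = list(three_way_folds(X, y, n_splits=5))
train_idx, val_idx, test_idx = folds[0]
X_train, X_val, X_test = X.iloc[train_idx], X.iloc[val_idx], X.iloc[test_idx]
print(len(X_train), len(X_val), len(X_test))
```

Converting y with `np.asarray` avoids label-based lookups on a Series whose index is not 0..n-1, which is where indexing `y` directly with the fold indices could go wrong.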