difference between StratifiedKFold and StratifiedShuffleSplit in sklearn

时间:2017-08-30 20:43:29

标签: python scikit-learn cross-validation

As from the title I am wondering what is the difference between

StratifiedKFold with the parameter shuffle = True

foreach ($request->type as $value) {
    $country = Country::create([
        // don't know where you should put id, but
        'id_key' => $value['id'],
        'country' => $value['name'],
    ]);
}

and

StratifiedShuffleSplit

.song{/*DivButtonTitle*/
margin-bottom:3px;
color:#551A8B;
font-size:35px;
font-family:'flat';
height:80px;
border-top:solid 1px #551A8B;
border-bottom:solid 1px #551A8B;
width:100%;
line-height: 80px;
padding:0px;
}

and what is the advantage of using StratifiedShuffleSplit

3 个答案:

答案 0 :(得分:39)

在KFolds中,即使是随机播放,每个测试集也不应重叠。使用KFolds和shuffle,数据在开始时被洗牌一次,然后被分成所需分割的数量。测试数据总是其中一个分裂,列车数据是其余部分。

在ShuffleSplit中,每次都会对数据进行混洗,然后进行拆分。这意味着测试集可能在分裂之间重叠。

请参阅此块以获取差异示例。注意ShuffleSplit的测试集中元素的重叠。

splits = 5

tx = range(10)
ty = [0] * 5 + [1] * 5

from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold
from sklearn import datasets

kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42)
shufflesplit = StratifiedShuffleSplit(n_splits=splits, random_state=42, test_size=2)

print("KFold")
for train_index, test_index in kfold.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)

print("Shuffle Split")
for train_index, test_index in shufflesplit.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)

输出:

KFold
TRAIN: [0 2 3 4 5 6 7 9] TEST: [1 8]
TRAIN: [0 1 2 3 5 7 8 9] TEST: [4 6]
TRAIN: [0 1 3 4 5 6 8 9] TEST: [2 7]
TRAIN: [1 2 3 4 6 7 8 9] TEST: [0 5]
TRAIN: [0 1 2 4 5 6 7 8] TEST: [3 9]
Shuffle Split
TRAIN: [8 4 1 0 6 5 7 2] TEST: [3 9]
TRAIN: [7 0 3 9 4 5 1 6] TEST: [8 2]
TRAIN: [1 2 5 6 4 8 9 0] TEST: [3 7]
TRAIN: [4 6 7 8 3 5 1 2] TEST: [9 0]
TRAIN: [7 2 6 5 4 3 0 9] TEST: [1 8]

至于何时使用它们,我倾向于使用KFolds进行任何交叉验证,并且我使用ShuffleSplit以2分为我的火车/测试集拆分。但我确信两者都有其他用例。

答案 1 :(得分:23)

@Ken Syme已经有了一个很好的答案。我只想添加一些东西。

  • StratifiedKFold KFold的变体。首先,StratifiedKFold将您的数据混洗,然后将数据分为n_splits部分并完成。 现在,它将把每个零件用作测试集。请注意,仅它一次并且总是将数据随机洗一次,然后再进行拆分。

使用shuffle = True,数据将被random_state混排。除此以外, 数据将被np.random(默认)改组。 例如,对于n_splits = 4,您的数据具有y(因变量)的3个类(标签)。 4个测试集覆盖了所有数据,没有任何重叠。

enter image description here

  • 另一方面, StratifiedShuffleSplit ShuffleSplit的变体。 首先,StratifiedShuffleSplit将您的数据混洗,然后还将数据拆分为n_splits个部分。但是,还没有完成。完成此步骤后,StratifiedShuffleSplit选取一个零件用作测试集。 然后它会重复n_splits - 1相同的过程,以获得n_splits - 1其他测试集。看看下面的图片,它们具有相同的数据,但是这次,这4个测试集没有覆盖所有数据,即测试集之间存在重叠。

enter image description here

因此,这里的区别在于StratifiedKFold 只是随机播放和拆分一次,因此测试集不会重叠,而StratifiedShuffleSplit 每次随机播放都在拆分之前,并且拆分n_splits次,测试集可以重叠

  • 注意:这两种方法都使用“分层折叠”(这就是为什么在两个名称中都出现“分层”的原因)。这意味着每个部分保留与原始数据相同百分比的每个类别(标签)的样本。您可以在cross_validation documents
  • 阅读更多内容

答案 2 :(得分:0)

图片表示形式: output examples of KFold, StratifiedKFold, StratifiedShuffleSplit (如何在此窗口中显示此图片?)

以上图片表示基于Ken Syme的代码:

from sklearn.model_selection import KFold, StratifiedKFold, StratifiedShuffleSplit
SEED = 43
SPLIT = 3

X_train = [0,1,2,3,4,5,6,7,8]
y_train = [0,0,0,0,0,0,1,1,1]   # note 6,7,8 are labelled class '1'

print("KFold, shuffle=False (default)")
kf = KFold(n_splits=SPLIT, random_state=SEED)
for train_index, test_index in kf.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)

print("KFold, shuffle=True")
kf = KFold(n_splits=SPLIT, shuffle=True, random_state=SEED)
for train_index, test_index in kf.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)

print("\nStratifiedKFold, shuffle=False (default)")
skf = StratifiedKFold(n_splits=SPLIT, random_state=SEED)
for train_index, test_index in skf.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)

print("StratifiedKFold, shuffle=True")
skf = StratifiedKFold(n_splits=SPLIT, shuffle=True, random_state=SEED)
for train_index, test_index in skf.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)

print("\nStratifiedShuffleSplit")
sss = StratifiedShuffleSplit(n_splits=SPLIT, random_state=SEED, test_size=3)
for train_index, test_index in sss.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)

print("\nStratifiedShuffleSplit (can customise test_size)")
sss = StratifiedShuffleSplit(n_splits=SPLIT, random_state=SEED, test_size=2)
for train_index, test_index in sss.split(X_train, y_train):
    print("TRAIN:", train_index, "TEST:", test_index)