Question

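I'm trying out the `stratify` parameter of sklearn's train_test_split function. My dataset is imbalanced; here are the class counts:

Class 0: 8,902
Class 1: 1,605

Class 1 makes up about 15% of the dataset.

This is the default split, without stratify: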

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df['image'], df['class'], test_size=0.2, random_state=5)

Training set balance:
0    7,116
1    1,289

Test set balance:
0    1,786
1     316

Below, I use stratify:

x_train, x_test, y_train, y_test = train_test_split(df['image'], df['class'], test_size=0.2, random_state=5, stratify=df['class'])

Training set balance:
0    7121
1    1284

Test set balance:
0    1781
1     321

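Both splits come out roughly the same: about 18% as many class 1 samples as class 0 samples (that is, about 15% of each split). Adding stratify doesn't seem to do anything.

So I'm a bit confused. Am I doing something wrong?

Thanks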

Answer 1

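Adding stratify ensures that the proportion of class 1 in each split matches the original data.

Computing the proportion of class 1:

Original: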

Total:  print(1605/(1605+8902)) = 0.1527553059864852

Without stratify:

Train:  print(1289/(1289+7116)) = 0.1533610945865556
Test:   print(316/(316+1786)) = 0.15033301617507136

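As you can see, the proportion of class 1 differs from the original data, and it may come out differently each time you resample! (It stays close to the original only because the split is a random sample.)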
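To make that variation visible, here is a minimal sketch that repeats the unstratified split with several seeds. It assumes synthetic labels with the same class counts as in the question (standing in for df['class']) and a placeholder feature column; the helper name class1_share_in_test is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: synthetic labels with the question's class counts,
# used in place of df['class'].
y = np.array([0] * 8902 + [1] * 1605)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

def class1_share_in_test(seed):
    """Share of class 1 in the test set of an unstratified split."""
    _, _, _, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
    return float(y_test.mean())

unstratified_shares = [class1_share_in_test(seed) for seed in range(10)]

print("overall share of class 1:", y.mean())
print("test-set shares by seed:", [round(s, 4) for s in unstratified_shares])
```

The shares should hover around 0.1528 but differ from seed to seed, which is the sampling noise described above.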

With stratify:

Train:  print(1284/(1284+7121)) = 0.15276621058893516
Test:   print(321/(321+1781)) = 0.1527117031398668

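This matches the original data (as closely as whole sample counts allow), and the proportion will not change even if you resample. So stratify is doing its job, isn't it?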
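The same check can be repeated with stratify. This is a sketch under the same assumptions as the question's numbers: synthetic labels with 8,902 zeros and 1,605 ones standing in for df['class'], and a placeholder feature column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: synthetic labels with the question's class counts,
# used in place of df['class'].
y = np.array([0] * 8902 + [1] * 1605)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# (class 1 count in train, class 1 count in test) for each seed
stratified_counts = []
for seed in range(10):
    _, _, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    stratified_counts.append((int(y_train.sum()), int(y_test.sum())))

# Which rows land in each split changes with the seed,
# but the per-class counts do not.
print(set(stratified_counts))
```

Only the membership of each split depends on the seed; the class counts stay at 1,284 and 321, the same figures the question reports for random_state=5. On a dataset this large the unstratified split is already close, so the benefit of stratify shows up mainly with small datasets or very rare classes.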

