Question

I am trying to run the following code:

from sklearn.model_selection import StratifiedKFold 
X = ["hey", "join now", "hello", "join today", "join us now", "not today", "join this trial", " hey hey", " no", "hola", "bye", "join today", "no","join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "r", "n", "n", "n", "r"]

skf = StratifiedKFold(n_splits=10)

for train, test in skf.split(X,y):  
    print("%s %s" % (train,test))

But I get the following error:

ValueError: n_splits=10 cannot be greater than the number of members in each class.

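I have looked at "scikit-learn error: The least populated class in y has only 1 member", but I am still not sure what is wrong with my code.

Both of my lists have length 14, as confirmed by print(len(X)) and print(len(y)).

Part of my confusion is that I am not sure what "members" means, or what a "class" is, in this context.

Questions: How do I fix the error? What is a member? What is a class? (in this context)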

Answer 1

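Stratification means preserving the proportion of each class in every fold. So if your original dataset has 3 classes in the proportions 60%, 20% and 20%, stratification will try to keep that ratio in each fold.

In your case,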

X = ["hey", "join now", "hello", "join today", "join us now", "not today",
     "join this trial", " hey hey", " no", "hola", "bye", "join today", 
     "no","join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]

You have 14 samples (members) in total, distributed as follows:

class    number of members         percentage (rounded)
 'n'        9                        64
 'r'        3                        21
 'y'        2                        14
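
The counts above can be reproduced with a quick check. This is only a minimal sketch using the standard library; the variable names counts and smallest are illustrative and not part of scikit-learn.

```python
from collections import Counter

y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]

# Each distinct label is a class; the samples carrying that label are its members.
counts = Counter(y)
for label, n_members in counts.most_common():
    print(label, n_members, "%.0f%%" % (100 * n_members / len(y)))

# Size of the least populated class.
smallest = min(counts.values())
print("smallest class size:", smallest)
```

The size of the least populated class is the number to compare against n_splits.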

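So StratifiedKFold will try to keep this ratio in each fold. Now, you have specified 10 folds (n_splits). That means that, to keep the ratio, a single fold would have to contain 2/10 = 0.2 members of class 'y'. But a fold cannot be given fewer than 1 member (sample) of a class, which is why the error is thrown.

If you set n_splits=2 instead of n_splits=10, it would work, because the number of 'y' members per fold would then be 2/2 = 1. For n_splits=10 to work properly, you need at least 10 samples in each class.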
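
As a rough illustration of the n_splits=2 case, the sketch below caps n_splits at the size of the least populated class. That cap is an assumption made for this example, not something scikit-learn does for you, and the names n_splits and folds are illustrative.

```python
from collections import Counter
from sklearn.model_selection import StratifiedKFold

X = ["hey", "join now", "hello", "join today", "join us now", "not today",
     "join this trial", " hey hey", " no", "hola", "bye", "join today",
     "no", "join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]

# Assumption for this sketch: use the smallest class size (2, for 'y') as n_splits.
n_splits = min(Counter(y).values())
skf = StratifiedKFold(n_splits=n_splits)

folds = list(skf.split(X, y))
for train, test in folds:
    print("train:", train, "test:", test)
    # Show which classes ended up in this test fold.
    print("test labels:", [y[i] for i in test])
```

With only two folds each model is trained on about half of the data, so this is a way to get the split running rather than a recommendation. Collecting more samples for the small classes is the other option.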
