I am trying to run the following code:
from sklearn.model_selection import StratifiedKFold
X = ["hey", "join now", "hello", "join today", "join us now", "not today", "join this trial", " hey hey", " no", "hola", "bye", "join today", "no","join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "r", "n", "n", "n", "r"]
skf = StratifiedKFold(n_splits=10)
for train, test in skf.split(X,y):
    print("%s %s" % (train, test))
But I get the following error:
ValueError: n_splits=10 cannot be greater than the number of members in each class.
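I have looked at "scikit-learn error: The least populated class in y has only 1 member", but I am still not sure what is wrong with my code.
Both of my lists have length 14 (print(len(X)) and print(len(y)) both print 14).
Part of my confusion is that I am not sure how "members" is defined, or what a "class" is in this context.
Questions: How do I fix the error? What is a member? What is a class? (in this context)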
Answer 0 (score: 7)
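Stratification means keeping the ratio of each class in each fold. So if your original dataset has 3 classes in the ratio 60%, 20% and 20%, stratification will try to keep that ratio in every fold.
In your case,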
X = ["hey", "join now", "hello", "join today", "join us now", "not today",
"join this trial", " hey hey", " no", "hola", "bye", "join today",
"no","join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]
You have a total of 14 samples (members), distributed as follows:
class    number of members    percentage (%)
'n'      9                    64.3
'r'      3                    21.4
'y'      2                    14.3
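The distribution above can be reproduced with a short sketch. It assumes the label list y from this answer and uses only the standard library; the variable names counts and total are illustrative, not part of scikit-learn.

```python
from collections import Counter

# Labels as given in this answer (an assumption: copy in your own y here).
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]

counts = Counter(y)  # class label -> number of members (samples)
total = len(y)

# Print one row per class, largest class first.
for label, count in counts.most_common():
    print(f"{label!r:<6} {count:>3} {100 * count / total:5.1f}%")
```

Each distinct value in y is a class, and each sample carrying that value is a member of the class.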
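So StratifiedKFold will try to keep that ratio in each fold. Now you have specified 10 folds (n_splits). That means a single fold would need 2/10 = 0.2 members of class 'y' to keep the ratio. But a fold cannot be given less than 1 member (sample), which is why the error is thrown there.
If you set n_splits=2 instead of n_splits=10, it will work, because the number of members of 'y' per fold will then be 2/2 = 1. For n_splits=10 to work properly, you need at least 10 samples of each class.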
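As a rough sketch of that rule, the size of the least populated class can be treated as an upper bound on n_splits. The helper max_stratified_splits below is hypothetical (it is not a scikit-learn function), and it applies the conservative rule from this answer; the exact condition under which scikit-learn raises an error rather than a warning may differ between versions.

```python
from collections import Counter


def max_stratified_splits(labels):
    """Hypothetical helper: return the size of the least populated class.

    Following the rule in this answer, this is the largest n_splits for which
    every fold can still receive at least one member of every class.
    """
    return min(Counter(labels).values())


# Labels as given in this answer.
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]

requested = 10
limit = max_stratified_splits(y)

# Members of the smallest class that each fold would get with the requested value.
per_fold = limit / requested

# Cap the requested value. K-fold needs at least 2 splits, so a limit of 1
# would mean stratified splitting is not possible for these labels.
n_splits = min(requested, limit)

print("smallest class size:", limit)
print("share per fold with n_splits=%d: %.1f" % (requested, per_fold))
print("usable n_splits:", n_splits)
```

The resulting value could then be passed as StratifiedKFold(n_splits=n_splits) in place of the hard-coded 10.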