Question

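There are plenty of existing questions about this error, where the problem turns out to be the dimensions of an array or the way a DataFrame is read. However, I am only using plain Python lists for X and y.

I am trying to split my data into train and test sets using train_test_split.

My code is: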

X, y = file2vector(corpus_dir)
assert len(X) == len(y) # both lists same length
print(type(X))
print(type(y))
seed = 123
labels = list(set(y))
print(len(labels))
print(labels)
cont = {}
for l in y:
    if not l in cont:
        cont[l] = 1
    else:
        cont[l] += 1

print(cont)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=seed, stratify=labels)

The output is:

<class 'list'> # type(X)
<class 'list'> # type(y)
2 # len(labels)
['I', 'Z'] # labels
{'I': 18867, 'Z': 13009} # cont

X and y are just Python lists of Python strings that I read from files with file2vector. I am running on Python 3, and the traceback is as follows:

Traceback (most recent call last):
  File "/home/rodrigo/idatha/no_version/imm/classifier.py", line 28, in <module>
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=seed, stratify=labels)
  File "/home/rodrigo/idatha/no_version/imm/.env/lib/python3.5/site-packages/sklearn/model_selection/_split.py", line 2056, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))
  File "/home/rodrigo/idatha/no_version/imm/.env/lib/python3.5/site-packages/sklearn/model_selection/_split.py", line 1203, in split
    X, y, groups = indexable(X, y, groups)
  File "/home/rodrigo/idatha/no_version/imm/.env/lib/python3.5/site-packages/sklearn/utils/validation.py", line 229, in indexable
    check_consistent_length(*result)
  File "/home/rodrigo/idatha/no_version/imm/.env/lib/python3.5/site-packages/sklearn/utils/validation.py", line 204, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [31876, 2]

Answer 1

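The problem is your labels list. When you supply stratify to train_test_split, that value is passed internally as the y argument to the split method of a StratifiedShuffleSplit instance. As you can see in the documentation for the split method, y should be the same length as X (in this case, the array you want to split). Your labels list holds only the 2 unique classes, while X has 31876 samples, which is exactly the [31876, 2] mismatch reported in the error. So to fix your problem, instead of passing stratify=labels, use stratify=y.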
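
The fix can be sketched on toy data as follows. The lists below are hypothetical stand-ins for whatever file2vector returns; the only point is that stratify receives one label per sample rather than the list of unique labels.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the lists returned by file2vector.
X = ["doc %d" % i for i in range(100)]
y = ["I"] * 60 + ["Z"] * 40

# stratify needs one label per sample, so it must be as long as X.
# Passing list(set(y)) here would have length 2 and raise the ValueError.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=123, stratify=y
)

print(len(X_train), len(X_test))
print(Counter(y_test))  # class proportions follow those of y
```

With stratify=y, the test set keeps roughly the same class ratio as the full label list.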

Answer 2

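Working with the Figure Eight imbalanced disaster messages dataset on Python 3.7 (scikit-learn 0.21.2), I ran into the same problem with train_test_split even when using stratify=y. For me, the solution was to set stratify=y.iloc[:,1], where y had previously been defined as y = df[df.columns[4:]]. Maybe this helps someone else too...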
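
A minimal sketch of what this answer seems to describe, assuming y is a DataFrame with several label columns and the split is stratified on just one of them. The toy frame and its column names are made up for illustration and are not the actual dataset; the column slice is adjusted to fit the smaller frame.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: one text column followed by several label columns.
df = pd.DataFrame({
    "message": ["msg %d" % i for i in range(40)],
    "related": [1] * 40,
    "request": [0, 1] * 20,
    "offer": [0, 0, 0, 1] * 10,
})

X = df["message"]
y = df[df.columns[1:]]  # multi-column label frame

# Stratify on a single label column (a 1-D Series with one value per row)
# instead of on the whole multi-column frame.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y.iloc[:, 1]
)

print(y_test["request"].value_counts())
```

Only the chosen column is balanced across the split; the other label columns are not guaranteed to keep their proportions.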

ValueError: Found input variables with inconsistent numbers of samples
