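I am using the StratifiedShuffleSplit cross-validator to predict house prices in the Boston dataset. When I run the sample code below, I get the error shown in the traceback further down. The same code works with ShuffleSplit. This suggests that StratifiedShuffleSplit cannot be used with numeric labels.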
from sklearn.model_selection import StratifiedShuffleSplit

def fit_model_S(labels, features, step, clf, parameters):
    cv = StratifiedShuffleSplit(n_splits=2, test_size=0.10, random_state=42)
    print(cv)
    # cv.split(...) is where the ValueError in the traceback is raised
    for train_index, test_index in cv.split(features, labels):
        labels_train, labels_test = labels[train_index], labels[test_index]
        features_train, features_test = features[train_index], features[test_index]
A sample of the dataset is shown below.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-141-b290147edcbf> in <module>()
33 dt_steps = [('decision', clf)]
34
---> 35 fit_model_S(labels, features,dt_steps,clf,parameters4)
36
37
<ipython-input-141-b290147edcbf> in fit_model_S(labels, features, step, clf, parameters)
8 cv = StratifiedShuffleSplit(n_splits=2,test_size=0.10, random_state = 42)
9 print (cv)
---> 10 for train_index, test_index in cv.split(features,labels):
11
12 labels_train, labels_test = labels[train_index], labels[test_index]
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in split(self, X, y, groups)
1194 """
1195 X, y, groups = indexable(X, y, groups)
-> 1196 for train, test in self._iter_indices(X, y, groups):
1197 yield train, test
1198
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in _iter_indices(self, X, y, groups)
1535 class_counts = np.bincount(y_indices)
1536 if np.min(class_counts) < 2:
-> 1537 raise ValueError("The least populated class in y has only 1"
1538 " member, which is too few. The minimum"
1539 " number of groups for any class cannot"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
In this case, MEDV is the label.
Answer 0 (score: 1)
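The Boston Housing data is a dataset for a regression problem, and you are using StratifiedShuffleSplit to divide it into train and test sets. As mentioned in the docs, StratifiedShuffleSplit is described as follows:
This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.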
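Note the last sentence: "preserving the percentage of samples for each class". StratifiedShuffleSplit therefore treats each distinct value of y as its own class. That cannot work here because your y is a regression target (continuous numerical data): nearly every value occurs only once, so each "class" has a single member, which is exactly what the error message reports.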
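To make this concrete, here is a minimal sketch that reproduces the failure. It assumes synthetic random data in place of the Boston dataset (the arrays X and y are hypothetical stand-ins, not the asker's data) and only checks that a ValueError is raised, since the exact message can differ between scikit-learn versions.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.utils.multiclass import type_of_target

# Hypothetical stand-in data: 50 samples with a continuous target.
rng = np.random.RandomState(0)
X = rng.rand(50, 3)
y = rng.rand(50) * 50

# scikit-learn classifies this target as continuous, not as class labels.
target_kind = type_of_target(y)

# Count how often each distinct y value occurs; stratification would
# treat every distinct value as a separate class.
_, counts = np.unique(y, return_counts=True)
max_count = counts.max()

cv = StratifiedShuffleSplit(n_splits=2, test_size=0.10, random_state=42)
try:
    next(cv.split(X, y))
    error_message = None
except ValueError as exc:
    # Stratifying on a continuous target fails.
    error_message = str(exc)

print(target_kind, max_count)
print(error_message)
```

Because no value of y repeats, there is no way to keep the same proportion of each "class" in both the train and the test part.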
Use ShuffleSplit or train_test_split to split the data instead. For details on cross-validation, see: http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
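As a rough sketch of both suggestions, the snippet below splits a continuous target with ShuffleSplit and with train_test_split. The arrays are synthetic stand-ins (assumed here, not the real Boston features or MEDV column), and the variable names mirror those in the question.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, train_test_split

# Hypothetical stand-ins for the Boston features and the MEDV label.
rng = np.random.RandomState(0)
features = rng.rand(100, 4)
labels = rng.rand(100) * 50

# ShuffleSplit ignores the label values, so continuous targets are fine.
cv = ShuffleSplit(n_splits=2, test_size=0.10, random_state=42)
split_sizes = []
for train_index, test_index in cv.split(features, labels):
    features_train, features_test = features[train_index], features[test_index]
    labels_train, labels_test = labels[train_index], labels[test_index]
    split_sizes.append((len(train_index), len(test_index)))

# train_test_split gives a single random train/test partition.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.10, random_state=42
)

print(split_sizes)
print(X_train.shape, X_test.shape)
```

ShuffleSplit suits repeated random splits inside a cross-validation loop, while train_test_split is the shorter route when a single hold-out set is enough.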