I want to run GridSearchCV on a RandomForestClassifier, but my data is imbalanced, so I am using StratifiedKFold:
from sklearn.model_selection import StratifiedKFold
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {'n_estimators':[10, 30, 100, 300], "max_depth": [3, None],
"max_features": [1, 5, 10], "min_samples_leaf": [1, 10, 25, 50], "criterion": ["gini", "entropy"]}
rfc = RandomForestClassifier()
clf = GridSearchCV(rfc, param_grid=param_grid, cv=StratifiedKFold()).fit(X_train, y_train)
But I get an error:
TypeError Traceback (most recent call last)
<ipython-input-597-b08e92c33165> in <module>()
9 rfc = RandomForestClassifier()
10
---> 11 clf = GridSearchCV(rfc, param_grid=param_grid, cv=StratifiedKFold()).fit(X_train, y_train)
c:\python34\lib\site-packages\sklearn\grid_search.py in fit(self, X, y)
811
812 """
--> 813 return self._fit(X, y, ParameterGrid(self.param_grid))
c:\python34\lib\site-packages\sklearn\grid_search.py in _fit(self, X, y, parameter_iterable)
559 self.fit_params, return_parameters=True,
560 error_score=self.error_score)
--> 561 for parameters in parameter_iterable
562 for train, test in cv)
c:\python34\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self, iterable)
756 # was dispatched. In particular this covers the edge
757 # case of Parallel used with an exhausted iterator.
--> 758 while self.dispatch_one_batch(iterator):
759 self._iterating = True
760 else:
c:\python34\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator)
601
602 with self._lock:
--> 603 tasks = BatchedCalls(itertools.islice(iterator, batch_size))
604 if len(tasks) == 0:
605 # No more tasks available in the iterator: tell caller to stop.
c:\python34\lib\site-packages\sklearn\externals\joblib\parallel.py in __init__(self, iterator_slice)
125
126 def __init__(self, iterator_slice):
--> 127 self.items = list(iterator_slice)
128 self._size = len(self.items)
c:\python34\lib\site-packages\sklearn\grid_search.py in <genexpr>(.0)
560 error_score=self.error_score)
561 for parameters in parameter_iterable
--> 562 for train, test in cv)
563
564 # Out is a list of triplet: score, estimator, n_test_samples
TypeError: 'StratifiedKFold' object is not iterable
When I write cv=StratifiedKFold(y_train)
I get ValueError: The number of folds must be of Integral type.
But when I write cv=5, it works.
I don't understand what is wrong with StratifiedKFold.
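Answer 0 (score: 5)
Replace
from sklearn.grid_search import GridSearchCV
with
from sklearn.model_selection import GridSearchCV
and then it should work. As the traceback shows, the old sklearn.grid_search module iterates over cv directly (for train, test in cv), but the StratifiedKFold from sklearn.model_selection is not iterable; it produces splits only through its split() method.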
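A minimal sketch of the corrected setup, with both GridSearchCV and StratifiedKFold imported from sklearn.model_selection. The synthetic data from make_classification and the reduced parameter grid are assumptions made only so that the example is self-contained and quick to run; they are not part of the original question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Assumption: synthetic imbalanced data (about 90% / 10%) standing in
# for the question's X_train and y_train.
X_train, y_train = make_classification(
    n_samples=300, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Assumption: a deliberately small grid so the search finishes quickly.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

rfc = RandomForestClassifier(random_state=0)

# The splitter object is passed as-is; GridSearchCV from
# sklearn.model_selection calls its split() method internally.
cv = StratifiedKFold(n_splits=3)
clf = GridSearchCV(rfc, param_grid=param_grid, cv=cv).fit(X_train, y_train)

print(clf.best_params_)
```

With matching imports, no explicit call to split() is needed in user code.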
Answer 1 (score: 2)
It seems that cv=StratifiedKFold()).fit(X_train, y_train)
should be changed to cv=StratifiedKFold().split(X_train, y_train)).fit(X_train, y_train)
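Answer 2 (score: 0)
The API changed in the latest version. You used to pass y when creating the StratifiedKFold object; now you pass only the number of folds, and you pass y later, when calling split().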
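The API difference can be sketched on a toy data set. The arrays below are made up for illustration; only StratifiedKFold and its split() method come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Assumption: toy imbalanced labels, 12 negatives and 6 positives.
X = np.arange(36).reshape(18, 2)
y = np.array([0] * 12 + [1] * 6)

# New API: the constructor takes only the number of folds ...
skf = StratifiedKFold(n_splits=3)

# ... and the labels are supplied later, through split().
folds = list(skf.split(X, y))

for train_idx, test_idx in folds:
    # Per-class counts in each test fold; stratification keeps the
    # class ratio of the full data set in every fold.
    print(np.bincount(y[test_idx]))
```

Passing y_train to the constructor, as in the question, makes scikit-learn interpret the label array as the number of folds, which is why the ValueError about an Integral type is raised.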
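Answer 3 (score: 0)
The problem here is the API change mentioned in the other answers, but the answers could be more explicit.
The documentation for the cv parameter states:
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible inputs for cv are:
None, to use the default 3-fold cross-validation,
integer, to specify the number of folds.
An object to be used as a cross-validation generator.
An iterable yielding train/test splits.
For integer/None inputs, if y is binary or multiclass, StratifiedKFold used. If the estimator is a classifier or if y is neither binary nor multiclass, KFold is used.
So, whatever cross-validation strategy is used, all that is needed is to provide the generator by calling split, as follows: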
kfolds = StratifiedKFold(5)
clf = GridSearchCV(estimator, parameters, scoring=qwk, cv=kfolds.split(xtrain,ytrain))
clf.fit(xtrain, ytrain)
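The snippet above leaves estimator, parameters, qwk, xtrain and ytrain undefined. A runnable sketch follows, under these assumptions: qwk is taken to mean a quadratic weighted kappa scorer built with make_scorer and cohen_kappa_score, the estimator is a RandomForestClassifier as in the question, and the data is synthetic. None of these definitions come from the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Assumption: synthetic imbalanced data standing in for xtrain, ytrain.
xtrain, ytrain = make_classification(
    n_samples=200, weights=[0.8, 0.2], random_state=0
)

# Assumptions: a random forest, a tiny grid, and "qwk" read as
# quadratic weighted kappa.
estimator = RandomForestClassifier(random_state=0)
parameters = {"n_estimators": [10, 30]}
qwk = make_scorer(cohen_kappa_score, weights="quadratic")

kfolds = StratifiedKFold(5)

# Variant from the answer: pass the generator returned by split().
clf = GridSearchCV(estimator, parameters, scoring=qwk,
                   cv=kfolds.split(xtrain, ytrain))
clf.fit(xtrain, ytrain)

# Alternative: GridSearchCV from sklearn.model_selection also accepts
# the splitter object itself.
clf_direct = GridSearchCV(estimator, parameters, scoring=qwk, cv=kfolds)
clf_direct.fit(xtrain, ytrain)

print(clf.best_params_, clf_direct.best_params_)
```

Passing kfolds.split(xtrain, ytrain) ties the search to those exact arrays, whereas passing kfolds lets GridSearchCV generate the splits from whatever data is given to fit().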