Question

When I try to fit my training data using GridSearchCV or RandomizedSearchCV, I keep getting the following error:

TypeError: no supported conversion for types: (dtype('O'), dtype('O'))

Here is a sample of the relevant code:

from xgboost.sklearn import XGBRegressor as XGR
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

xgbRegModel = XGR()
params = {'max_depth':[3, 6, 9], 'learning_rate':[.05, .1, .5], 'n_estimators': [50, 100, 200]}

rscv = RandomizedSearchCV(xgbRegModel, params)  
rscv.fit(X, y)  
rscv.best_model_

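where X is a (39942, 11257) scipy.sparse.csr.csr_matrix and y is a (39942,) numpy.ndarray.

The dtypes are all int64 or float64. I tried running it with the np.nan values in place and again after filling the np.nan values with 0 (I thought that might be the problem, but it wasn't).

Can anyone tell me what is going on here? When I train the model without GridSearchCV or RandomizedSearchCV, it works fine.

Any ideas would be greatly appreciated. Thanks!

P.S. The traceback for the error is really long, but here it is in case it helps: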

TypeError                                 Traceback (most recent call last)
<ipython-input-54-63d54d4cd03e> in <module>()
      3 xgbRegModel = XGR()
      4 rscv = RandomizedSearchCV(xgbRegModel, params)
----> 5 rscv.fit(X, y)
      6 rscv.best_model_

~\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    636                                   error_score=self.error_score)
    637           for parameters, (train, test) in product(candidate_params,
--> 638                                                    cv.split(X, y, groups)))
    639 
    640         # if one choose to see train score, "out" will contain train  score info

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self, iterable)
    777             # was dispatched. In particular this covers the edge
    778             # case of Parallel used with an exhausted iterator.
--> 779             while self.dispatch_one_batch(iterator):
    780                 self._iterating = True
    781             else:

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator)
    623                 return False
    624             else:
--> 625                 self._dispatch(tasks)
    626                 return True
    627 

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in _dispatch(self, batch)
    586         dispatch_timestamp = time.time()
    587         cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588         job = self._backend.apply_async(batch, callback=cb)
    589         self._jobs.append(job)
    590 

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in apply_async(self, func, callback)
    109     def apply_async(self, func, callback=None):
    110         """Schedule a func to be run"""
--> 111         result = ImmediateResult(func)
    112         if callback:
    113             callback(result)

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in __init__(self, batch)
    330         # Don't delay the application, to avoid keeping the input
    331         # arguments in memory
--> 332         self.results = batch()
    333 
    334     def get(self):

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132 
    133     def __len__(self):

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in <listcomp>(.0)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132 
    133     def __len__(self):

~\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, error_score)
    425     start_time = time.time()
    426 
--> 427     X_train, y_train = _safe_split(estimator, X, y, train)
    428     X_test, y_test = _safe_split(estimator, X, y, test, train)
    429 

~\Anaconda3\lib\site-packages\sklearn\utils\metaestimators.py in _safe_split(estimator, X, y, indices, train_indices)
    198             X_subset = X[np.ix_(indices, train_indices)]
    199     else:
--> 200         X_subset = safe_indexing(X, indices)
    201 
    202     if y is not None:

~\Anaconda3\lib\site-packages\sklearn\utils\__init__.py in safe_indexing(X, indices)
    160             return X.take(indices, axis=0)
    161         else:
--> 162             return X[indices]
    163     else:
    164         return [X[idx] for idx in indices]

~\Anaconda3\lib\site-packages\scipy\sparse\csr.py in __getitem__(self, key)
    315             if isintlike(col) or isinstance(col,slice):
    316                 P = extractor(row, self.shape[0])     # [[1,2],j] or [[1,2],1:2]
--> 317                 extracted = P * self
    318                 if col == slice(None, None, None):
    319                     return extracted

~\Anaconda3\lib\site-packages\scipy\sparse\base.py in __mul__(self, other)
    367             if self.shape[1] != other.shape[0]:
    368                 raise ValueError('dimension mismatch')
--> 369             return self._mul_sparse_matrix(other)
    370 
    371         # If it's a list or whatever, treat it like a matrix

~\Anaconda3\lib\site-packages\scipy\sparse\compressed.py in _mul_sparse_matrix(self, other)
    539         indptr = np.asarray(indptr, dtype=idx_dtype)
    540         indices = np.empty(nnz, dtype=idx_dtype)
--> 541         data = np.empty(nnz, dtype=upcast(self.dtype, other.dtype))
    542 
    543         fn = getattr(_sparsetools, self.format + '_matmat_pass2')

~\Anaconda3\lib\site-packages\scipy\sparse\sputils.py in upcast(*args)
     49             return t
     50 
---> 51     raise TypeError('no supported conversion for types: %r' % (args,))
     52 
     53 

TypeError: no supported conversion for types: (dtype('O'), dtype('O'))

Answer 1

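This is because GridSearchCV does not support sparse matrices in its fit() method. Take a look at the signature of the fit method:

Parameters:

X : array-like, shape = [n_samples, n_features]

As you can see, only array-like input is supported.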
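
If array-like is taken to mean a dense array, one option would be to call X.toarray() before running the search. Whether that is practical depends on memory. The following is only a rough sketch for estimating the dense size from the shape given in the question; the helper name dense_nbytes is made up for this illustration, and float64 storage is an assumption.

```python
import numpy as np

def dense_nbytes(shape, dtype=np.float64):
    """Bytes needed for a dense array of this shape and dtype (hypothetical helper)."""
    n_rows, n_cols = shape
    return n_rows * n_cols * np.dtype(dtype).itemsize

# Shape taken from the question; float64 is assumed.
size = dense_nbytes((39942, 11257))
print(f"{size / 1e9:.1f} GB")
```

For the matrix in the question this comes to roughly 3.6 GB, so densifying may or may not be acceptable depending on the machine.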

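As for why it works without the grid search, that is because XGBRegressor supports sparse matrices.

The actual error occurs during cross-validation, when X is split into train and test sets, which does not work for a sparse matrix the way it does for a regular array.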
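
The last frames of the traceback in the question show scipy's upcast being called with dtype('O') for both operands while rows of X are being selected, so it may be worth checking X.dtype directly before the search. Below is a minimal sketch, assuming the stored values really are numeric. The name to_float_csr is a hypothetical helper, and the small matrix and index array are made-up stand-ins for X and for the train indices a CV splitter would produce.

```python
import numpy as np
from scipy import sparse

def to_float_csr(X):
    """Rebuild a CSR matrix with float64 data (hypothetical helper).

    Only the data array is cast; the sparsity structure is reused as-is.
    """
    X = X.tocsr()
    data = np.asarray(X.data).astype(np.float64)
    return sparse.csr_matrix((data, X.indices, X.indptr), shape=X.shape)

# Made-up stand-in for the real feature matrix.
X_small = sparse.csr_matrix(np.array([[1, 0, 2],
                                      [0, 0, 3],
                                      [4, 5, 0],
                                      [0, 6, 0]]))
print(X_small.dtype)           # inspect the dtype before running the search

X_float = to_float_csr(X_small)
train_idx = np.array([0, 2, 3])
X_train = X_float[train_idx]   # same kind of row selection the CV split performs
print(X_train.shape)
```

If X.dtype turns out to be object on the real data, a cast like this could be tried before calling rscv.fit(X, y); it will raise if any stored value is not actually numeric.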

Also, make sure that for XGBRegressor the sparse matrix is of type CSC rather than CSR, because CSR can give you wrong results. This is described here: https://github.com/dmlc/xgboost/issues/1238
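
Converting between the two formats is a single call in SciPy. This is a minimal sketch with a made-up 3x2 matrix; whether the conversion is needed for a given xgboost version is not something it verifies.

```python
import numpy as np
from scipy import sparse

# Made-up stand-in for the real feature matrix.
X_csr = sparse.csr_matrix(np.array([[1.0, 0.0],
                                    [0.0, 2.0],
                                    [3.0, 0.0]]))

X_csc = X_csr.tocsc()   # same values, column-compressed storage
print(X_csc.format)
```

The values and shape are unchanged by the conversion; only the internal storage layout differs.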
