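I have a sparse.csr_matrix. It is built by concatenating three matrices: one was originally CSR, and the other two were converted from dense matrices.
When I run sklearn.ensemble.RandomForestClassifier on a SUBSET of the data (but not on all of it), I get this error: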
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
However, when I check, I find:
np.isnan(matrix.data).any() # => False (there are no NaNs)
np.isfinite(matrix.data).all() # => True (There are no infinite values)
np.max(matrix.data) # => 10499 (certainly not too big for floats)
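for both the full data and the subset. That suggests the error message is misleading and the problem lies somewhere else, but where and why I still cannot tell. Has anyone seen this before?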
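For reference, here is a minimal, self-contained sketch of the checks above, applied to a small stand-in matrix and to a row subset of it. The three blocks, the use of sparse.hstack for the concatenation, and the helper name sparse_is_finite are all assumptions made for illustration; they are not the real data or the real construction. Only the stored values need checking, because the implicit zeros of a sparse matrix cannot be NaN or infinite.

```python
import numpy as np
from scipy import sparse


def sparse_is_finite(m):
    """Return True if every explicitly stored value of a sparse matrix is finite."""
    return bool(np.isfinite(m.data).all())


# Hypothetical stand-ins: one block that starts out as CSR,
# and two blocks converted from dense arrays.
a = sparse.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]))
b = sparse.csr_matrix(np.array([[0.0, 4.0], [5.0, 0.0], [0.0, 6.0]]))
c = sparse.csr_matrix(np.array([[7.0], [0.0], [8.0]]))

# Assumption: the blocks are joined column-wise.
matrix = sparse.hstack([a, b, c], format="csr")

# Row subset, analogous to everything[train, :] in the failing call.
train = np.array([0, 2])
subset = matrix[train, :]

print(sparse_is_finite(matrix), sparse_is_finite(subset))
```

Running the same helper on the exact slice passed to fit, rather than on the full matrix only, rules out anything introduced by the row indexing itself.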
Exhibit 1: repr(matrix) = "<12785x190428 sparse matrix of type '<type 'numpy.float64'>'\n\twith 2825051 stored elements in Compressed Sparse Row format>"
Exhibit 2: the error stack trace
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-70-649153b97fe0> in <module>()
8 lower = upper
9
---> 10 m = rf.fit(everything[train,:], data.label[train])
11 yhat = m.predict(everything[test,:])
12 print(np.mean(yhat==data.label[test]))
/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.pyc in fit(self, X, y, sample_weight)
246 # Validate or convert input data
247 X = check_array(X, accept_sparse="csc", dtype=DTYPE)
--> 248 y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
249 if issparse(X):
250 # Pre-sort indices to avoid that each individual tree of the
/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
405 % (array.ndim, estimator_name))
406 if force_all_finite:
--> 407 _assert_all_finite(array)
408
409 shape_repr = _shape_repr(array.shape)
/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
56 and not np.isfinite(X).all()):
57 raise ValueError("Input contains NaN, infinity"
---> 58 " or a value too large for %r." % X.dtype)
59
60
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
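One thing the stack trace above does show: the arrow inside forest.py sits on the check_array call for y, not the one for X, so the array that failed validation may be the labels rather than the sparse matrix. A minimal sketch of the same finiteness checks applied to a label array follows. The helper name count_non_finite and the sample values are hypothetical, and it assumes the labels can be converted to float64, which matches the dtype reported in the error.

```python
import numpy as np


def count_non_finite(values):
    """Return (number of NaNs, number of infinities) in an array-like of numbers."""
    arr = np.asarray(values, dtype=np.float64)
    return int(np.isnan(arr).sum()), int(np.isinf(arr).sum())


# Hypothetical labels: one missing value that only some subsets will include.
labels = np.array([0.0, 1.0, np.nan, 1.0])

train_ok = np.array([0, 1, 3])   # skips the missing label
train_bad = np.array([1, 2, 3])  # includes the missing label

print(count_non_finite(labels[train_ok]))
print(count_non_finite(labels[train_bad]))
```

In the failing call this would be applied to data.label[train]. A missing label that falls inside only some subsets would be consistent with the error appearing for one subset but not for others, though that is a guess and not something the trace confirms.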