Why does cross_val_score work fine, but cv.split raise an error?

Date: 2019-08-26 03:01:52

Tags: python machine-learning scikit-learn

I tried cross_val_score and it raised no error.

But if I use cv.split, I get an error:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit

cv = TimeSeriesSplit(n_splits=5)

clf = RandomForestClassifier(n_estimators=500, max_depth=10, random_state=100, n_jobs=-1)
for train, val in cv.split(X, y):
    clf.fit(X.iloc[train], y[train])
FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self.loc[key]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-143-9c8fe6b057e9> in <module>
      1 for train, val in cv.split(X, y):
----> 2     clf.fit(X.iloc[train], y[train])

~\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py in fit(self, X, y, sample_weight)
    248         # Validate or convert input data
    249         X = check_array(X, accept_sparse="csc", dtype=DTYPE)
--> 250         y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
    251         if sample_weight is not None:
    252             sample_weight = check_array(sample_weight, ensure_2d=False)

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    540         if force_all_finite:
    541             _assert_all_finite(array,
--> 542                                allow_nan=force_all_finite == 'allow-nan')
    543 
    544     if ensure_min_samples > 0:

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan)
     58     elif X.dtype == np.dtype('object') and not allow_nan:
     59         if _object_dtype_isnan(X).any():
---> 60             raise ValueError("Input contains NaN")
     61 
     62 

ValueError: Input contains NaN

I checked for NaN with np.sum(X.isnull()), but the data has no NaN.

However, the following works fine:

for train, val in cv.split(X.iloc[:200000, ], y[:200000]):
    clf.fit(X.iloc[train, ], y[train])

I changed the indices, but it showed the same error as before:

for train, val in cv.split(X.iloc[:400000, ], y[:400000]):
    clf.fit(X.iloc[train, ], y[train])
FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
.
.
.
ValueError: Input contains NaN

I changed the indices again, and it works fine:

for train, val in cv.split(X.iloc[200000:400000, ], y[200000:400000]):
    clf.fit(X.iloc[train, ], y[train])

What should I do?

1 Answer:

Answer 0 (score: 0)

Short answer: there is a NaN in y somewhere between rows 200,000 and 400,000, probably not very close to 400,000.

Long answer: you should check np.sum(y.isnull()) rather than np.sum(X.isnull()), because the traceback shows that the NaN is in y.
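As a quick check, you can both count and locate the NaNs in y directly. A minimal sketch, using a small hypothetical Series in place of your real y:

```python
import numpy as np
import pandas as pd

# Hypothetical target vector with one NaN in the middle (for illustration).
y = pd.Series([0.0, 1.0, np.nan, 1.0, 0.0])

n_missing = int(np.sum(y.isnull()))        # total number of NaNs in y
nan_positions = np.where(y.isnull())[0]    # positional indices of the NaNs

print(n_missing)       # 1
print(nan_positions)   # [2]
```

The positional indices tell you exactly which slices of the data will trigger the error.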

The cross-validation runs you showed do not guarantee that y is fine, because there seems to be a misunderstanding of the last piece: cv.split() returns indices into the array you pass to it. In

  • cv.split(X.iloc[:200000, ], y[:200000])
  • cv.split(X.iloc[200000:400000, ], y[200000:400000])

the arrays have the same number of rows, so the same index arrays are returned.
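A minimal sketch of that behaviour: cv.split() always returns positional indices starting from 0, relative to whatever array it receives, so two slices of equal length produce identical splits:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(10, 2)   # 10 rows of toy data
cv = TimeSeriesSplit(n_splits=3)

# Splitting the first 8 rows and splitting rows 2..10 (the same length)
# yields exactly the same positional index arrays.
splits_a = list(cv.split(X[:8]))
splits_b = list(cv.split(X[2:10]))

for (tr_a, va_a), (tr_b, va_b) in zip(splits_a, splits_b):
    assert np.array_equal(tr_a, tr_b) and np.array_equal(va_a, va_b)

print(splits_a[0][0])  # first training fold starts at index 0: [0 1]
```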

for train, val in cv.split(X.iloc[200000:400000, ], y[200000:400000]):
    clf.fit(X.iloc[train, ], y[train])

you are actually accessing rows 0–200,000 of X. To access rows 200,000–400,000, you could do

for train, val in cv.split(X.iloc[200000:400000], y.iloc[200000:400000]):
    # .iloc applies the positional indices from cv.split to the slice itself
    clf.fit(X.iloc[200000:400000].iloc[train], y.iloc[200000:400000].iloc[train])

If you do that, I suspect the error will appear again. By the way, TimeSeriesSplit does not use all of the data for training, see here. Since you only show fitting the classifier, with no predictions, the error would not appear when the NaN is very close to the end of the y vector: the last observations of a time series are only used for testing, never for training.
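That last point can be sketched as follows: with TimeSeriesSplit the final test fold never enters any training set, so a NaN in the last observations of y would go unnoticed if you only ever call fit():

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

y = np.arange(12)                  # 12 toy observations
cv = TimeSeriesSplit(n_splits=3)

train_union = set()
for train, val in cv.split(y.reshape(-1, 1)):
    train_union.update(train)

# The highest index ever used for training is below the last test fold.
print(max(train_union), len(y) - 1)  # 8 11
```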