Question

I am trying to use TimeSeriesSplit together with GridSearchCV in scikit-learn 0.18.1.

The relevant code looks like this:

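from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipeline = Pipeline([
        ('MMS', MinMaxScaler()),
        ('VT', VarianceThreshold(threshold=0.005)),
        ('SKB', SelectKBest(chi2, k=90)),
        ('rf', RandomForestClassifier(class_weight='balanced', random_state=1))])

# n is the number of splits; dict is the parameter grid (its contents are not shown here)
tscv = TimeSeriesSplit(n_splits=n)
gridsearch = GridSearchCV(pipeline, dict, cv=tscv, n_jobs=1, scoring="roc_auc")
gridsearch.fit(X, y)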

The shapes of X and y are:

X.shape == (99942, 2867)
y.shape == (99918,)

This works fine for n=2. With n=3, however, I get the following error:

IndexError: index 1 is out of bounds for axis 1 with size 1

The relevant part of the stack trace is this code from sklearn/metrics/scorer.py:

y_type = type_of_target(y)
y_pred = clf.predict_proba(X)
if y_type == "binary":
    y_pred = y_pred[:, 1]

What is going on, and how can I fix it?

Answer 1

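Most likely one of your splits contains only one class. The failure at y_pred[:, 1] means predict_proba returned a single column, which happens when the classifier was fit on a training fold that contained only one class.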

This prints the number of samples of each class in each of your splits:

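import numpy as np
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(3)
for i, (train, test) in enumerate(tscv.split(X, y)):
    # np.unique(..., return_counts=True) returns (class labels, counts per label)
    print("Class occurrences in train split #%d: %s" %
          (i, np.unique(y[train], return_counts=True)))
    print("Class occurrences in test split #%d: %s" %
          (i, np.unique(y[test], return_counts=True)))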
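To see why n=2 can pass while n=3 fails, here is a minimal standard-library sketch. It assumes the split boundaries follow the expanding-window arithmetic of TimeSeriesSplit with default settings (training data is everything before each test block); the helper names and the label sequence are hypothetical and only for illustration.

```python
from collections import Counter


def time_series_split_bounds(n_samples, n_splits):
    """Yield (test_start, test_end) for each expanding-window split.

    Assumption: this mirrors the default fold arithmetic of scikit-learn's
    TimeSeriesSplit; the training fold is always indices [0, test_start).
    """
    n_folds = n_splits + 1
    test_size = n_samples // n_folds
    first_test_start = test_size + n_samples % n_folds
    for test_start in range(first_test_start, n_samples, test_size):
        yield test_start, test_start + test_size


def class_counts_per_split(y, n_splits):
    """Return a list of (train_counts, test_counts) dicts, one per split."""
    report = []
    for test_start, test_end in time_series_split_bounds(len(y), n_splits):
        train_counts = dict(Counter(y[:test_start]))
        test_counts = dict(Counter(y[test_start:test_end]))
        report.append((train_counts, test_counts))
    return report


# Hypothetical labels: the positive class first appears at position 30 of 100.
y = [0] * 30 + [1, 0] * 35

for n_splits in (2, 3):
    for i, (train, test) in enumerate(class_counts_per_split(y, n_splits)):
        # A training fold with one class makes predict_proba return one column.
        note = "" if len(train) > 1 else "  <-- single-class training fold"
        print("n_splits=%d split #%d: train=%s test=%s%s"
              % (n_splits, i, train, test, note))
```

With more splits the first training fold gets shorter, so it is more likely to end before the first sample of the rarer class. If that is what the diagnostic above reports for your data, possible ways out are to use fewer splits or to drop the leading stretch of data that contains only one class.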

TimeSeriesSplit with GridSearchCV fails for n_splits > 2
