Question

I am using the bootstrap technique to evaluate an MLPClassifier, and I am using sklearn.utils.resample to draw the different random samples, but X_test and y_test come back empty:

seeds = [50,51,52,53,54]
for i in range(5): # number of bootstrap samples
    X_train, y_train = resample(X, y, n_samples=len(X), random_state=seeds[i], stratify=y)
    X_test = [x for x in X if x not in X_train] # test = samples that weren't selected for train
    y_test = [y for y in y if y not in y_train] # test = samples that weren't selected for train

    X_test
    # []

What am I doing wrong? Is there a better way to do this? I find it hard to believe that sklearn does not offer a better way.

Answer 1

Your first list comprehension will not work here, because the in operator does not test row membership on 2D numpy arrays.

Let us first reproduce your problem with dummy data:

from sklearn.utils import resample
import numpy as np

X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])

X_train, y_train = resample(X, y, random_state=0)
X_train
# result
array([[ 1.,  0.],
       [ 2.,  1.],
       [ 1.,  0.]])

So far so good; but, as I said, the list comprehension will not work, as you have already discovered yourself:

X_test = [x for x in X if x not in X_train]
X_test
# []

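The reason is that the in operator does not check whether a whole row is present in a 2D numpy array: it compares element by element and returns True as soon as any single element matches.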
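To make this concrete, here is a minimal sketch using the X_train from the dummy example above. The variable names elementwise_hit and row_hit are only illustrative; the row-wise check is one possible way to express "this row is present", not the only one.

```python
import numpy as np

X_train = np.array([[1., 0.], [2., 1.], [1., 0.]])
x = np.array([0., 0.])  # a row that does NOT occur in X_train

# `in` on an ndarray behaves like (X_train == x).any():
# a single matching element anywhere is enough
elementwise_hit = x in X_train

# Row-wise membership: all columns of at least one row must match
row_hit = bool((X_train == x).all(axis=1).any())

print(elementwise_hit, row_hit)
# True False
```

Because every row of X shares at least one value with some row of X_train, `x not in X_train` is False for all of them, which is why X_test ends up empty.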

Converting your initial X to a list resolves the issue:

X = X.tolist()

X_train, y_train = resample(X, y, random_state=0)
X_train
# [[1.0, 0.0], [2.0, 1.0], [1.0, 0.0]] # as previous result
X_test = [x for x in X if x not in X_train]
X_test
# [[0.0, 0.0]]

As expected, X_test now holds the only element of the initial X that is not present in X_train, i.e. [[0.0, 0.0]].

In contrast, y is a 1D numpy array, so the in operator in the list comprehension does work for it:

y_test = [y for y in y if y not in y_train]
y_test
# [2]
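One caveat with comparing by value: it breaks down as soon as rows or labels repeat. With real class labels, for instance, any label that appears in y_train is dropped from y_test entirely. A possible alternative, offered here as a sketch rather than as the one sklearn-sanctioned method, is to resample the row indices and take the indices that were never drawn as the test set. The names indices, train_idx and test_idx are my own.

```python
import numpy as np
from sklearn.utils import resample

X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])

indices = np.arange(len(X))

# Bootstrap the row indices instead of the rows themselves
train_idx = resample(indices, n_samples=len(X), random_state=0)

# Out-of-bag indices: rows that were never drawn into the bootstrap sample
test_idx = np.setdiff1d(indices, train_idx)

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```

Since X and y are indexed with the same arrays, they stay aligned, and duplicate rows or labels no longer matter. If you want stratified sampling as in your original loop, stratify=y can be passed to resample in the same call.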
