Question

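Suppose I have a dataset with 1000 rows and I want to split it into a training set and a test set. I want the first 800 rows to go to the training set and the remaining 200 rows to the test set. Is that possible?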

The Python code I use for the train/test split is:

from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.20)

Answer 1

There are several ways to do this; I'll show a few of them.

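Slicing is a powerful feature in Python and takes its arguments as data[start:stop:step]. In your case, if you just want the first 800 rows, and your data frame is named train for the input features and Y for the output features, you can use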

X_train = train[0:800]
X_test = train[800:]
y_train = Y[0:800]
y_test = Y[800:]
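
To make the start:stop:step notation concrete, here is a minimal sketch on a plain Python list. The name rows and its contents are made up for illustration; they stand in for a 1000-row dataset.

```python
# A plain list standing in for a 1000-row dataset (hypothetical data).
rows = list(range(1000))

train_rows = rows[0:800]   # start=0, stop=800 (stop is exclusive)
test_rows = rows[800:]     # start=800, through the end

print(len(train_rows), len(test_rows))   # 800 200
print(train_rows[-1], test_rows[0])      # 799 800

# The optional third field is the step: every 5th row of the list.
every_fifth = rows[::5]
print(len(every_fifth))                  # 200
```

Because stop is exclusive, rows[0:800] and rows[800:] never overlap and together cover the whole list.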

The iloc indexer belongs to a DataFrame and selects rows by integer position, so it works whatever the index labels are. You can use

X_train = train.iloc[0:800]
X_test = train.iloc[800:]
y_train = Y.iloc[0:800]
y_test = Y.iloc[800:]

If you only need to split the data into two parts, you can even do it with df.head() and df.tail():

X_train = train.head(800)
X_test = train.tail(200)
y_train = Y.head(800)
y_test = Y.tail(200)
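
As a quick sanity check, the sketch below runs the three approaches on a toy DataFrame and confirms that they select the same rows. The column name feature and the generated values are assumptions for illustration only; train and Y mirror the names used above. It assumes pandas and numpy are installed.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the `train` (features) and `Y` (target) objects above.
train = pd.DataFrame({"feature": np.arange(1000)})
Y = pd.Series(np.arange(1000) % 2, name="target")

# The same 800/200 split of the features, three ways.
by_slice = (train[0:800], train[800:])
by_iloc = (train.iloc[0:800], train.iloc[800:])
by_head_tail = (train.head(800), train.tail(200))

for X_train, X_test in (by_slice, by_iloc, by_head_tail):
    assert X_train.shape[0] == 800 and X_test.shape[0] == 200

# All three select exactly the same rows.
assert by_slice[0].equals(by_iloc[0]) and by_iloc[0].equals(by_head_tail[0])
assert by_slice[1].equals(by_iloc[1]) and by_iloc[1].equals(by_head_tail[1])

# The target is split at the same position so rows stay aligned.
y_train, y_test = Y.iloc[0:800], Y.iloc[800:]
```

Note that none of these approaches shuffles the data, so the test set is always the last 200 rows in their original order.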

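There are other ways to do this as well. I'd recommend the first method, because it is general across data types and also works if you are using numpy arrays. To learn more about slicing, I suggest you check out Understanding slice notation. It is explained there for lists, but it applies to almost all sequence types.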