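How to split/partition a dataset into training and test datasets for, e.g., cross validation?

Asked: 2010-09-09 06:57:34

Tags: python arrays optimization numpy

What is a good way to split a NumPy array randomly into training and testing/validation datasets? Something similar to the cvpartition or crossvalind functions in Matlab.

12 answers: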

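Answer 0 (score: 95)

If you want to split the data set once into two parts, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices. The two variants are shown below, one after the other: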

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]

There are many ways to repeatedly partition the same data set for cross validation. One strategy is to resample from the dataset, with repetition:

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
training_idx = numpy.random.randint(x.shape[0], size=80)
test_idx = numpy.random.randint(x.shape[0], size=20)
training, test = x[training_idx,:], x[test_idx,:]
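
The resampling above draws indices with replacement, so the same row can land in both sets. As an alternative, here is a minimal k-fold sketch in plain NumPy. It is not part of the original answer: the fold count k = 5 and the seeded generator are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only so the sketch is repeatable
x = rng.random((100, 5))         # stand-in dataset, same shape as above
k = 5                            # number of folds (assumed value)

# Shuffle the row indices once, then cut them into k nearly equal folds.
folds = np.array_split(rng.permutation(x.shape[0]), k)

for i, test_idx in enumerate(folds):
    # Each fold is the test set exactly once; the other folds form the training set.
    training_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    training, test = x[training_idx, :], x[test_idx, :]
    print(i, training.shape, test.shape)
```

Because the folds come from a single permutation, every row is used for testing exactly once and never appears in the training set of its own fold.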

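Finally, sklearn contains several cross validation methods (k-fold, leave-n-out, ...). It also includes more advanced "stratified sampling" methods that create a partition of the data that is balanced with respect to some features, for example to make sure that the training and test sets contain the same proportion of positive and negative examples.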
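
As a rough illustration of stratified sampling, the sketch below uses the stratify argument of train_test_split from sklearn.model_selection. The toy data and the 75%/25% class imbalance are assumptions, not something taken from the answer.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape((20, 2))   # 20 toy samples (assumed data)
y = np.array([0] * 15 + [1] * 5)     # imbalanced labels: 75% / 25%

# stratify=y asks for the class proportions of y to be kept in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(np.bincount(y_train))  # class counts in the training part
print(np.bincount(y_test))   # class counts in the test part
```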

Answer 1 (score: 41)

There is another option that just requires scikit-learn. As scikit's wiki describes, you can simply use the following instructions:

import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)

This way you can keep the labels in sync with the data you are splitting into training and test sets.
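
To make the "in sync" point concrete without scikit-learn, here is a small NumPy sketch that applies one shared permutation to both arrays. The variable names mirror the snippet above; the seed and the 1-row test size are assumptions for illustration.

```python
import numpy as np

data = np.arange(10).reshape((5, 2))   # row i is [2*i, 2*i + 1]
labels = np.arange(5)                  # label i belongs to row i

rng = np.random.default_rng(42)        # seeded only for repeatability
perm = rng.permutation(len(data))      # one permutation shared by both arrays
n_test = 1                             # 20% of 5 samples

test_idx, train_idx = perm[:n_test], perm[n_test:]
data_train, data_test = data[train_idx], data[test_idx]
labels_train, labels_test = labels[train_idx], labels[test_idx]

# Each row still sits next to its own label: first column // 2 recovers the label.
print(np.array_equal(data_train[:, 0] // 2, labels_train))
```

Indexing both arrays with the same index array is what keeps rows and labels aligned. Shuffling each array separately would break the pairing.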

Answer 2 (score: 32)

Just a note: in case you want train, test, AND validation sets, you can do this:

from sklearn.model_selection import train_test_split

X = get_my_X()
y = get_my_y()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

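These parameters give 70% to training and 15% each to the test and val sets (the second call splits the 30% remainder in half). Hope this helps.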

Answer 3 (score: 9)

Since the sklearn.cross_validation module has been deprecated, you can use:

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)

X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=42)

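Answer 4 (score: 4)

You may also consider a stratified division into training and testing sets. A stratified division also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.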

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

y = np.array([1,1,2,2,3,3])
train_inds,test_inds = get_train_test_inds(y,train_proportion=0.5)
print(y[train_inds])
print(y[test_inds])

This code outputs:

[1 2 3]
[1 2 3]

Answer 5 (score: 0)

I wrote a function for my own project to do this (it doesn't use numpy, though):

def partition(seq, chunks):
    """Splits the sequence into equal sized chunks and returns them as a list"""
    result = []
    for i in range(chunks):
        chunk = []
        for element in seq[i:len(seq):chunks]:
            chunk.append(element)
        result.append(chunk)
    return result

If you want the chunks to be randomized, just shuffle the list before passing it in.
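
A short usage sketch of the shuffle-then-partition idea, using only the standard library. The function is restated compactly so the block runs on its own; the 10-element list, the seed and the chunk count of 3 are assumptions.

```python
import random

def partition(seq, chunks):
    """Compact restatement of the function above: every chunks-th element per chunk."""
    return [list(seq[i::chunks]) for i in range(chunks)]

data = list(range(10))   # toy data (assumed)
random.seed(0)           # seeded only so the sketch is repeatable
random.shuffle(data)     # shuffle in place before partitioning

folds = partition(data, 3)
print([len(f) for f in folds])
```

When the length is not a multiple of the chunk count, the chunk sizes differ by at most one element.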

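Answer 6 (score: 0)

Here is code to split the data into n=5 folds in a stratified manner

# X = data array
# y = class labels
# sklearn.cross_validation has been removed from scikit-learn; model_selection replaces it
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):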
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

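Answer 7 (score: 0)

Thanks pberkes for your answer. I just modified it to avoid (1) sampling with replacement and (2) duplicated instances occurring in both training and testing:

training_idx = np.random.choice(X.shape[0], int(np.round(X.shape[0] * 0.8)), replace=False)
# or, as an alternative way to get the same kind of index set:
training_idx = np.random.permutation(np.arange(X.shape[0]))[:int(np.round(X.shape[0] * 0.8))]
test_idx = np.setdiff1d(np.arange(0, X.shape[0]), training_idx)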
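
The two properties claimed above can be checked directly. This is a minimal sketch assuming a random 100 x 5 array for X, which the answer does not specify.

```python
import numpy as np

X = np.random.rand(100, 5)                   # stand-in dataset (assumed shape)
n_train = int(np.round(X.shape[0] * 0.8))    # slice and size arguments must be ints

training_idx = np.random.choice(X.shape[0], n_train, replace=False)
test_idx = np.setdiff1d(np.arange(X.shape[0]), training_idx)

print(len(np.unique(training_idx)) == n_train)            # no row sampled twice
print(np.intersect1d(training_idx, test_idx).size == 0)   # no row in both sets
print(len(training_idx) + len(test_idx) == X.shape[0])    # every row used once
```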

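Answer 8 (score: 0)

After reading up on and considering the (many..) different ways of splitting data for training and testing, I decided to time them!

I used 4 different methods (none of them uses the sklearn library, which I am sure would give the best results, since it is well-designed and tested code):

  1. shuffle the whole matrix arr, then split the data into train and test
  2. shuffle the indices, then apply them to x and y to split the data
  3. same as method 2, but done in a more efficient way
  4. use a pandas dataframe to split

Method 3 came in well behind method 1, while methods 2 and 4 were the least efficient.

The code for the 4 different methods I timed: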

import numpy as np
arr = np.random.rand(100, 3)
X = arr[:,:2]
Y = arr[:,2]
spl = 0.7
N = len(arr)
sample = int(spl*N)

#%% Method 1:  shuffle the whole matrix arr and then split
np.random.shuffle(arr)
x_train, x_test, y_train, y_test = X[:sample,:], X[sample:, :], Y[:sample, ], Y[sample:,]

#%% Method 2: pick random training indices, then apply them to X and Y
train_idx = np.random.choice(N, sample)
Xtrain = X[train_idx]
Ytrain = Y[train_idx]

test_idx = [idx for idx in range(N) if idx not in train_idx]
Xtest = X[test_idx]
Ytest = Y[test_idx]

#%% Method 3: shuffle indices without a for loop
idx = np.random.permutation(arr.shape[0])  # can also use random.shuffle
train_idx, test_idx = idx[:sample], idx[sample:]
x_train, x_test, y_train, y_test = X[train_idx,:], X[test_idx,:], Y[train_idx,], Y[test_idx,]

#%% Method 4: using pandas dataframe to split
import pandas as pd
df = pd.read_csv(file_path, header=None) # Some csv file (I used some file with 3 columns)

train = df.sample(frac=0.7, random_state=200)
test = df.drop(train.index)

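As for the timings, the minimum time out of 3 repetitions of 1000 loops was:

  • Method 1: 0.35883826200006297 seconds
  • Method 2: 1.7157016959999964 seconds
  • Method 3: 1.7876616719995582 seconds
  • Method 4: 0.07562861499991413 seconds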
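
The answer does not show the timing harness itself. One way such numbers could be collected is timeit.repeat with repeat=3 and number=1000, sketched here for the permutation-based split only. The function name split_by_permutation is hypothetical, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

arr = np.random.rand(100, 3)
X, Y = arr[:, :2], arr[:, 2]
sample = int(0.7 * len(arr))

def split_by_permutation():
    """Hypothetical wrapper around the permutation-based split (method 3)."""
    idx = np.random.permutation(arr.shape[0])
    train_idx, test_idx = idx[:sample], idx[sample:]
    return X[train_idx, :], X[test_idx, :], Y[train_idx], Y[test_idx]

# 3 repetitions of 1000 calls each; report the fastest repetition
times = timeit.repeat(split_by_permutation, repeat=3, number=1000)
print(min(times))
```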

I hope that helps!

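Answer 9 (score: 0)

Most likely you will need not only to split into train and test, but also to cross validate to make sure your model generalizes. Here I am assuming 70% training data, 20% validation data and 10% holdout/test data.

Check out np.split:

If indices_or_sections is a 1-D array of sorted integers, the entries indicate where along axis the array is split. For example, [2, 3] would, for axis=0, result in

ary[:2]
ary[2:3]
ary[3:]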

t, v, h = np.split(df.sample(frac=1, random_state=1), [int(0.7*len(df)), int(0.9*len(df))]) 
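
The one-liner above assumes an existing pandas DataFrame df. The same 70/20/10 cut can be sketched on a plain NumPy array. The 100 x 2 toy array and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)                            # seeded for repeatability
data = rng.permutation(np.arange(200).reshape(100, 2))    # shuffles the rows
n = len(data)

# Cut points at 70% and 90% of the rows give 70% / 20% / 10% pieces.
train, valid, holdout = np.split(data, [int(0.7 * n), int(0.9 * n)])
print(train.shape, valid.shape, holdout.shape)
```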

Answer 10 (score: 0)

Split into train, test and valid

import numpy as np

x = np.expand_dims(np.arange(100), -1)
print(x)

indices = np.random.permutation(x.shape[0])

# 90% train, 5% test, 5% validation; the three slices must not overlap
n_train, n_test = int(x.shape[0]*.9), int(x.shape[0]*.95)
training_idx, test_idx, val_idx = indices[:n_train], indices[n_train:n_test], indices[n_test:]

training, test, val = x[training_idx,:], x[test_idx,:], x[val_idx,:]

print(training, test, val)

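Answer 11 (score: 0)

I am aware that my solution is not the best, but it comes in handy when you want to split data in a simple way, especially when teaching data science to newcomers!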

def simple_split(descriptors, targets):
    testX_indices = [i for i in range(descriptors.shape[0]) if i % 4 == 0]
    validX_indices = [i for i in range(descriptors.shape[0]) if i % 4 == 1]
    trainX_indices = [i for i in range(descriptors.shape[0]) if i % 4 >= 2]

    TrainX = descriptors[trainX_indices, :]
    ValidX = descriptors[validX_indices, :]
    TestX = descriptors[testX_indices, :]

    TrainY = targets[trainX_indices]
    ValidY = targets[validX_indices]
    TestY = targets[testX_indices]

    return TrainX, ValidX, TestX, TrainY, ValidY, TestY

With this code, the data is split into three parts: 1/4 for the test part, 1/4 for the validation part, and 2/4 for the training set.
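
The same modulo rule can be written with boolean masks instead of list comprehensions. This is only a sketch of an equivalent formulation, with a 20-row toy array assumed as input.

```python
import numpy as np

descriptors = np.arange(40).reshape(20, 2)   # 20 toy rows (assumed data)
targets = np.arange(20)

# Row position modulo 4 decides the part: 0 -> test, 1 -> validation, 2 or 3 -> train
position = np.arange(descriptors.shape[0]) % 4
test_mask, valid_mask, train_mask = position == 0, position == 1, position >= 2

TrainX, ValidX, TestX = descriptors[train_mask], descriptors[valid_mask], descriptors[test_mask]
TrainY, ValidY, TestY = targets[train_mask], targets[valid_mask], targets[test_mask]

print(len(TrainX), len(ValidX), len(TestX))
```

This split is deterministic and depends on row order, so if the rows are sorted in some meaningful way it is worth shuffling them first.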