Splitting an array into training and test sets with Python

Time: 2017-06-23 15:36:27

Tags: python numpy matrix random

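I tried a method for splitting data into train and test sets, but it seems to fill the train set with zeros and leave all the data in the test set...

In theory, it works:

When I apply the following function, which randomly selects some columns of each row of a given array, it works with the DataLens numpy matrix but not with other data.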

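import numpy as np

def train_test_split(array):
    # Start with an empty test set and a full copy of the data as the train set
    test = np.zeros(array.shape)
    train = array.copy()
    for user in range(array.shape[0]):
        # Pick 10 nonzero entries of this row, without replacement
        test_ratings = np.random.choice(array[user, :].nonzero()[0],
                                        size=10,
                                        replace=False)
        # Move the picked entries from train to test
        train[user, test_ratings] = 0.
        test[user, test_ratings] = array[user, test_ratings]

    # Test and training are truly disjoint
    assert(np.all((train * test) == 0))
    return train, test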

train, test = train_test_split(ratings)

With simple data, it does not work:

ratings :
[[ 1.  1.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 1.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  1.]]

Even though train starts out as a copy of ratings, it ends up filled entirely with zeros:

train :  
 [[ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]]

0 answers:

No answers yet