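I tried a method for splitting data into train and test sets, but it seems to fill train with zeros and leave all the data in test...
When I apply the following function, which randomly selects some columns of each row of a given array, it works with DataLens loaded as a numpy matrix but does not work with other data.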
def train_test_split(array):
    test = np.zeros(array.shape)
    train = array.copy()
    for user in xrange(array.shape[0]):
        test_ratings = np.random.choice(array[user, :].nonzero()[0],
                                        size=10,
                                        replace=False)
        train[user, test_ratings] = 0.
        test[user, test_ratings] = ratings[user, test_ratings]

    # Test and training are truly disjoint
    assert(np.all((train * test) == 0))
    return train, test

train, test = train_test_split(ratings)
With simple data:
ratings :
[[ 1. 1. 0. 0. 0.]
[ 1. 0. 0. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 1. 0. 0. 0. 0.]
[ 0. 0. 0. 1. 1.]]
Even though train starts out as a copy of ratings, it ends up as an array filled with zeros:
train :
[[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]]
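One possible explanation, offered as a sketch rather than a confirmed diagnosis: the function moves a fixed number of nonzero entries per row into test, so any row with no more rated entries than that number loses all of them from train. In the simple data every row has only one or two nonzero entries. (With `size=10` and `replace=False`, `np.random.choice` would actually raise a `ValueError` on such rows, so the all-zero train presumably appears once `size` is lowered.) The variant below is hypothetical: the name `train_test_split_capped` and the parameters `test_fraction` and `seed` are not from the original code. It holds out a fraction of each row's nonzero entries and always leaves at least one in train.

```python
import numpy as np

def train_test_split_capped(array, test_fraction=0.2, seed=0):
    """Hypothetical variant: hold out a fraction of each row's nonzero
    entries, always leaving at least one of them in train."""
    rng = np.random.default_rng(seed)
    array = np.asarray(array, dtype=float)
    train = array.copy()
    test = np.zeros(array.shape)
    for user in range(array.shape[0]):
        rated = array[user, :].nonzero()[0]
        # Cap the held-out count at len(rated) - 1 so train keeps a rating.
        n_test = min(int(round(test_fraction * len(rated))), len(rated) - 1)
        if n_test <= 0:
            continue  # too few ratings in this row to hold any out
        held_out = rng.choice(rated, size=n_test, replace=False)
        train[user, held_out] = 0.
        test[user, held_out] = array[user, held_out]
    # Test and training are truly disjoint
    assert np.all((train * test) == 0)
    return train, test

ratings = np.array([[1., 1., 0., 0., 0.],
                    [1., 0., 0., 0., 0.],
                    [0., 0., 1., 0., 0.],
                    [1., 0., 0., 0., 0.],
                    [0., 0., 0., 1., 1.]])
train, test = train_test_split_capped(ratings, test_fraction=0.5)
```

With this data, only the two rows that have two ratings contribute one entry each to test; rows with a single rating stay entirely in train. Whether that trade-off is acceptable depends on how the split is used downstream.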