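I have been using LensKit's partition_users function to generate the k folds for cross-validation. It takes three arguments: the data, partitions (the number of partitions to generate), and a SampleFrac (the fraction to hold out for CV). I varied the number of partitions and then looked at the sizes of the resulting test and training splits: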
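    import pandas as pd
    import lenskit.crossfold as xf

    # ratings is a DataFrame with 'user', 'item' and 'rating' columns, loaded earlier
    for i in range(1, 6):
        training = []
        testing = []
        # collect the train/test frames of every partition for this value of i
        for train, test in xf.partition_users(ratings[['user', 'item', 'rating']], i, xf.SampleFrac(0.2)):
            training.append(train)
            testing.append(test)
        # stack all partitions together and report the total number of rows
        testing = pd.concat(testing, ignore_index=True)
        training = pd.concat(training, ignore_index=True)
        print("Shape of testing:", testing.shape)
        print("Shape of training:", training.shape)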
Output:
Shape of testing: (20000, 3)
Shape of training: (80000, 3)
Shape of testing: (20000, 3)
Shape of training: (180000, 3)
Shape of testing: (20000, 3)
Shape of training: (280000, 3)
Shape of testing: (20000, 3)
Shape of training: (380000, 3)
Shape of testing: (20000, 3)
Shape of training: (480000, 3)
I am trying to understand why the chosen number of partitions and SampleFrac lead to this output. I expected the following instead:
Shape of testing: (20000, 3)
Shape of training: (80000, 3)
Shape of testing: (40000, 3)
Shape of training: (160000, 3)
Shape of testing: (60000, 3)
Shape of training: (240000, 3)
Shape of testing: (80000, 3)
Shape of training: (320000, 3)
Shape of testing: (100000, 3)
Shape of training: (400000, 3)
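To make the gap concrete, here is a rough sketch of the arithmetic behind both sets of numbers. It does not call LensKit at all. It assumes the ratings frame has 100,000 rows (inferred from 20,000 + 80,000 in the first iteration), and the names N_ROWS, TEST_FRAC, expected_shapes and observed_shapes are just mine for illustration.

```python
# Assumption: 100,000 ratings in total, inferred from 20,000 + 80,000 at i = 1.
N_ROWS = 100_000
TEST_FRAC = 0.2


def expected_shapes(i, n_rows=N_ROWS, frac=TEST_FRAC):
    """What I expected: each of the i partitions adds a full 20/80 split."""
    test_rows = round(i * n_rows * frac)
    train_rows = i * n_rows - test_rows
    return test_rows, train_rows


def observed_shapes(i, n_rows=N_ROWS, frac=TEST_FRAC):
    """Pattern in the printed output: test rows stay fixed, train rows grow."""
    test_rows = round(n_rows * frac)
    train_rows = i * n_rows - test_rows
    return test_rows, train_rows


for i in range(1, 6):
    print(i, "expected:", expected_shapes(i), "observed:", observed_shapes(i))
```

In other words, I expected both totals to scale with i, but in the actual output the concatenated test rows always add up to 20,000 while the concatenated training rows add up to 100,000 * i - 20,000.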
Can someone explain where I am going wrong?