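I have a dataset of 31 samples in Python. I want to randomly split it into 30 training samples and 1 test sample, and do that 30 times. How can I do this?
Right now I just use the first 30 samples for training and the last one for testing: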
training_this_round = training[0:30]
testing_this_round = training[30:31]
How can I pick rows of the matrix at random? training is the variable that holds my entire initial dataset.
Answer 0 (score: 2)
I like random.shuffle for this kind of thing.
Let's create a dummy dataset of 31 samples (we'll say they are integers):
# list() is needed in Python 3, where range() is not a list and cannot be shuffled
training = list(range(31))
Now we can use shuffle to split this set into two random subsets:
import random
# copy training to preserve the order of the original dataset
this_round = training[:]
# permute the elements
random.shuffle(this_round)
# separate into training and test
training_this_round = this_round[:30]
testing_this_round = this_round[30:31]
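Basically, this puts the samples in random order (like shuffling a deck of cards), then sets one card aside for testing and uses the rest for training. What I like about this approach is that it extends to other kinds of splits (for example, deal the first 3 cards into the test set, the next 5 into a validation set, and use the rest for training).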
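The three-way split mentioned above could be sketched as follows. This is only an illustration: the names samples, deck, test_set, validation_set and training_set are made up here, and the integers stand in for real data.

```python
import random

# Hypothetical dataset: 31 integers standing in for real samples
samples = list(range(31))

# Shuffle a copy so the original order is preserved
deck = samples[:]
random.shuffle(deck)

# Deal from the top: 3 cards for testing, the next 5 for validation,
# and everything that is left for training
test_set = deck[:3]
validation_set = deck[3:8]
training_set = deck[8:]
```

Because the three slices do not overlap, every sample ends up in exactly one of the sets.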
Because you are only using one sample for testing, it is also easy to do it another way: pick one card (sample) at random and remove it from the deck:
# pick an index into training at random
select = random.randint(0, len(training) - 1)
# test set is a single sample (not a list)
testing_this_round = training[select]
# training set is all elements except the one chosen for testing
training_this_round = [x for (i, x) in enumerate(training) if i != select]
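The question asks for 30 random splits, so either approach would need to sit inside a loop. A minimal sketch, assuming an independent reshuffle on every round is acceptable; the helper random_splits and its parameters are hypothetical names, not part of any library:

```python
import random

def random_splits(data, n_rounds=30, n_test=1):
    """Yield (training, testing) pairs, reshuffling independently each round."""
    for _ in range(n_rounds):
        # random.sample returns a new shuffled list; data itself is untouched
        shuffled = random.sample(data, len(data))
        yield shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset: 31 integers standing in for real samples
dataset = list(range(31))
splits = list(random_splits(dataset))

for training_this_round, testing_this_round in splits:
    pass  # train and evaluate here
```

Note that with independent reshuffles the same sample can be picked for testing in more than one round. If each sample should be tested at most once, shuffle once and walk through the shuffled order instead.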
Answer 1 (score: 2)
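Third-party array toolkits (such as numpy) can make the following easier to manage without bugs, and third-party machine-learning packages (such as scikit-learn) already offer higher-level solutions for cross-validation. But assuming we take your question at face value and do everything by hand, here is an approach that should work: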
import random
indices = list(range(len(dataset)))
random.shuffle(indices) # shuffle just once before folding: this ensures we don't re-use any test fold indices
validation_results = []
leave_n_out = 1
for test_start in range(0, len(indices), leave_n_out): # work through the different folds of the cross-validation
    test_stop = test_start + leave_n_out
    testing_this_round = [dataset[i] for i in indices[test_start:test_stop]]
    training_this_round = [dataset[i] for i in indices[:test_start] + indices[test_stop:]]
    model = train(training_this_round) # whatever that involves
    validation_results.append( test(model, testing_this_round) ) # whatever that involves