Randomly selecting sample indices from a dataset

Time: 2016-12-19 17:07:51

Tags: python random

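I have a dataset of 31 samples in Python. I want to randomly split it into 30 training samples and 1 test sample, and do that 30 times. How can I do it?

Right now I just use the first 30 for training and the last one for testing: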

  

training_this_round = training[0:30]
testing_this_round = training[30:31]

How can I pick rows of the matrix at random? training is the variable that holds my whole initial dataset.

2 answers:

Answer 0: (score: 2)

I like random.shuffle for this kind of thing.

Let's create a dummy dataset of 31 samples (we'll say they are integers):

training = list(range(31))  # list() is needed in Python 3, where range() is not a list and cannot be shuffled

Now we can use shuffle to divide this set into two random subgroups:

import random
# copy training to preserve the order of the original dataset
this_round = training[:]
# permute the elements
random.shuffle(this_round)
# separate into training and test
training_this_round = this_round[:30]
testing_this_round = this_round[30:31]

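Basically, this puts the samples in random order (like shuffling a deck of cards), then sets one card aside for testing and uses the rest for training. What I like about it is that it extends to other kinds of splits (for example, deal the first 3 cards into the test set, the next 5 into a validation set, and use the rest for training).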
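The three-way split mentioned above could be sketched as follows. This is only an illustration: the helper name three_way_split and its parameters n_test, n_validation and seed are made up for this example and are not part of the original answer.

```python
import random

def three_way_split(samples, n_test=3, n_validation=5, seed=None):
    """Shuffle a copy of samples and deal it into train, validation and test lists.

    Hypothetical helper for illustration; the sizes match the example in the text.
    """
    rng = random.Random(seed)        # private generator, so a seed makes the split repeatable
    shuffled = list(samples)         # copy, so the caller's data keeps its order
    rng.shuffle(shuffled)
    test = shuffled[:n_test]                             # first 3 cards
    validation = shuffled[n_test:n_test + n_validation]  # next 5 cards
    train = shuffled[n_test + n_validation:]             # everything that is left
    return train, validation, test

training = list(range(31))
train_set, validation_set, test_set = three_way_split(training, seed=0)
print(len(train_set), len(validation_set), len(test_set))  # 23 5 3
```

Because the three lists are slices of one shuffled copy, no sample can end up in two sets at once.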

Because you only use one sample for testing, you can also simply pick one card (sample) at random and remove it from the deck:

# pick an index into training at random
select = random.randint(0, len(training) - 1)
# test set is a single sample (not a list)
testing_this_round = training[select]
# training set is all elements except the one chosen for testing
training_this_round = [x for (i, x) in enumerate(training) if i != select]
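To get the 30 rounds the question asks for, the single-sample pick can be repeated in a loop. The sketch below assumes the rounds are drawn independently, so the same sample may be chosen for testing more than once; the function name random_holdout_rounds and the seed parameter are invented for this example.

```python
import random

def random_holdout_rounds(samples, n_rounds=30, seed=None):
    """Return n_rounds (training_list, test_sample) pairs, each drawn independently.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    samples = list(samples)
    rounds = []
    for _ in range(n_rounds):
        select = rng.randrange(len(samples))                   # index of the held-out sample
        test_sample = samples[select]
        train_samples = samples[:select] + samples[select + 1:]  # everything except that index
        rounds.append((train_samples, test_sample))
    return rounds

training = list(range(31))
rounds = random_holdout_rounds(training, n_rounds=30, seed=0)
for train_samples, test_sample in rounds[:3]:
    print(len(train_samples), test_sample)
```

If every round must hold out a different sample, shuffling the indices once and walking through them avoids repeats.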

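Answer 1: (score: 2)

Third-party array toolkits such as numpy can make the following easier to manage without mistakes, and third-party machine-learning packages such as scikit-learn already offer higher-level solutions for cross-validation. But assuming we take your question at face value and do everything by hand and on foot, here is an approach that should work: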

import random

indices = list(range(len(dataset)))
random.shuffle(indices)  # shuffle just once before folding: this ensures we don't re-use any test fold indices

validation_results = []
leave_n_out = 1
for test_start in range(0, len(indices), leave_n_out):  # work through the different folds of the cross-validation
    test_stop = test_start + leave_n_out

    testing_this_round  = [dataset[i] for i in indices[test_start:test_stop]]
    training_this_round = [dataset[i] for i in indices[:test_start] + indices[test_stop:]]

    model = train(training_this_round)  # whatever that involves
    validation_results.append( test(model, testing_this_round) ) # whatever that involves