Question

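I have a dataset of 31 samples in Python. I want to randomly split it into 30 training samples and 1 test sample, and do this 30 times. How can I do that?

Right now I just use the first 30 for training and the last one for testing: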

training_this_round = training[0:30]

testing_this_round = training[30:31]

How do I pick rows of a matrix at random? training is the variable that holds my entire initial dataset.

Answer 1

I like random.shuffle for this kind of thing.

Let's create a dummy dataset of 31 samples (we'll say they are integers):

# list() is needed on Python 3, where range() is immutable and cannot be shuffled in place
training = list(range(31))

Now we can use shuffle to divide this set into two random subgroups:

import random
# copy training to preserve the order of the original dataset
this_round = training[:]
# permute the elements
random.shuffle(this_round)
# separate into training and test
training_this_round = this_round[:30]
testing_this_round = this_round[30:31]

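Essentially, this puts the samples in a random order (like shuffling a deck of cards), then takes one card for testing and uses the rest for training. What I like about this is that it extends to other kinds of splits (for example, dealing the first 3 cards into a test set, the next 5 into a validation set, and using the rest for training).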
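As a rough sketch of that extension, the same shuffle can feed a three-way split. The sizes 3 and 5 are the ones from the example above; the name validation_this_round is only illustrative.

```python
import random

# dummy dataset of 31 integer samples
training = list(range(31))

# shuffle a copy so the original order is preserved
this_round = training[:]
random.shuffle(this_round)

# deal the first 3 cards to the test set, the next 5 to the
# validation set, and keep the remaining 23 for training
testing_this_round = this_round[:3]
validation_this_round = this_round[3:8]
training_this_round = this_round[8:]
```

Because the three slices come from one shuffled copy, no sample can land in more than one subset.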

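Because you are only using one sample for testing, it is also easy to do this another way: pick one card (sample) at random and remove it from the deck: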

# pick an index into training at random
select = random.randint(0, len(training) - 1)
# test set is a single sample (not a list)
testing_this_round = training[select]
# training set is all elements except the one chosen for testing
training_this_round = [x for (i, x) in enumerate(training) if i != select]
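The question asks for 30 such random splits. One possible way, sketched here under the assumption that the rounds are drawn independently, is to wrap the single-sample selection in a loop. The names n_rounds and splits are illustrative, and independent draws can pick the same test sample in more than one round.

```python
import random

# dummy dataset of 31 integer samples
training = list(range(31))
n_rounds = 30  # number of random splits requested in the question

splits = []
for _ in range(n_rounds):
    # pick an index into training at random
    select = random.randrange(len(training))
    # test set is a single sample (not a list)
    testing_this_round = training[select]
    # training set is every element except the one chosen for testing
    training_this_round = training[:select] + training[select + 1:]
    splits.append((training_this_round, testing_this_round))
```

If every round must use a different test sample, shuffling the indices once and walking through them avoids repeats.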

Answer 2

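Third-party array toolkits such as numpy can make the following easier to manage without bugs, and third-party machine-learning packages such as scikit-learn already have higher-level solutions to the problem of cross-validation. But supposing we take your question at face value and do everything by hand, here is an approach that should work: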

import random

indices = list(range(len(dataset)))
random.shuffle(indices)  # shuffle just once before folding: this ensures we don't re-use any test fold indices

validation_results = []
leave_n_out = 1
for test_start in range(0, len(indices), leave_n_out):  # work through the different folds of the cross-validation
    test_stop = test_start + leave_n_out

    testing_this_round  = [dataset[i] for i in indices[test_start:test_stop]]
    training_this_round = [dataset[i] for i in indices[:test_start] + indices[test_stop:]]

    model = train(training_this_round)  # whatever that involves
    validation_results.append( test(model, testing_this_round) ) # whatever that involves

Randomly selecting sample indices from a dataset
