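I have a dataset of about 25,000 instances (call it Data) that I want to split into a training set, a development set, and a test set, in these proportions: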
train set = 0.7*Data
development set = 0.1*Data
test set = 0.2*Data
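The instances should be sampled at random when splitting, and no instance should appear in more than one of the three sets. That is why I can't use something like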
train_set = Data.sample(frac=0.7)
dev_set = Data.sample(frac=0.1)
test_set = Data.sample(frac=0.2)
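because the same instance can end up in more than one set. Is there a built-in function I'm missing, or could you help me write a function that does this?
Here is an example with an array that shows what I'm looking for.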
A = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
splits = [0.7, 0.1, 0.2]
def splitFunction(data, array_of_splits):
    pass  # I need your help here
splits = splitFunction(A, splits)
#output
[[1, 3, 8, 9, 6, 7, 2], [4], [5, 0]]
Thanks in advance!
Answer 0 (score: 0)
from random import shuffle

def splitFunction(data, array_of_splits):
    data_copy = data[:]  # copy so the original list is not modified
    shuffle(data_copy)  # shuffle in place so each split is a random sample
    splits = []
    startIndex = 0
    for val in array_of_splits:
        # slice bounds must be integers, and val * len(data) is a float
        endIndex = startIndex + int(round(val * len(data)))
        splits.append(data_copy[startIndex:endIndex])
        startIndex = endIndex
    return splits
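
As a variation on the same shuffle-then-slice idea, here is a minimal sketch that splits positions rather than the data itself. That way it can be used with a list or with a pandas DataFrame such as the one in the question. The name `split_indices` and the `seed` parameter are my own choices for illustration, not part of any library. The slice boundaries come from cumulative fractions, which should keep the pieces from drifting in size when the individual products do not round cleanly.

```python
import random

def split_indices(n, fractions, seed=None):
    """Return disjoint, shuffled lists of positions sized according to `fractions`."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n))
    # A dedicated Random instance keeps the split reproducible for a given seed
    random.Random(seed).shuffle(indices)
    parts = []
    start = 0
    cumulative = 0.0
    for frac in fractions:
        cumulative += frac
        # Boundary computed from the running total, then rounded to an integer
        end = round(cumulative * n)
        parts.append(indices[start:end])
        start = end
    return parts

train_idx, dev_idx, test_idx = split_indices(25000, [0.7, 0.1, 0.2], seed=0)
print(len(train_idx), len(dev_idx), len(test_idx))  # 17500 2500 5000
```

Assuming Data is a pandas DataFrame, the three sets could then be taken by position, for example `train_set = Data.iloc[train_idx]`, and likewise for `dev_idx` and `test_idx`. Because every position appears in exactly one list, no row is repeated across the sets.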