Question

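I want to subsample a numpy array (shape = (0, n)) so that the distribution of elements in train and test stays roughly the same, or so that train and test each contain at least one element from every column. For example: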

a = [1,2,3,1,3,3,2,1,2,1]
train = [1,1,2,2,3,3]
test = [1,1,2,3]

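I want to subsample my parameters and outputs based on the outputs. Right now I use np.random.choice to get random indices. Is there a way to check the distribution in Python?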

Answer 1

You can use Counter from the collections module in Python's standard library.

>>> from collections import Counter
>>> a = [1,2,3,1,3,3,2,1,2,1]
>>> count_a = Counter(a)
>>> count_a
Counter({1: 4, 2: 3, 3: 3})

A Counter object works like a dictionary. From there, you can take whatever fraction of each element you want, e.g.:

>>> from itertools import chain
>>> train_fraction = 0.7
>>> train = list(chain.from_iterable([[i]*int(max(count_a[i]*train_fraction, 1)) for i in count_a.keys()]))
>>> train
[1, 1, 2, 2, 3, 3]
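
The snippet above builds the training values, but it does not give you the test set or the indices you would need to subsample the parameters alongside the outputs. As a rough sketch (not part of the original answer), the same per-value idea can be done on indices with numpy. The names stratified_split, train_fraction and seed are made up for illustration, and the rounding and clamping rule is just one possible choice:

```python
import numpy as np

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each distinct value separately."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for value in np.unique(labels):
        # Shuffle the positions where this value occurs.
        idx = rng.permutation(np.flatnonzero(labels == value))
        # Round to the nearest count, then clamp so that train and test both
        # get at least one element whenever the value occurs twice or more.
        n_train = int(round(len(idx) * train_fraction))
        n_train = min(max(n_train, 1), max(len(idx) - 1, 1))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return (np.sort(np.array(train_idx, dtype=int)),
            np.sort(np.array(test_idx, dtype=int)))

a = np.array([1, 2, 3, 1, 3, 3, 2, 1, 2, 1])
train_idx, test_idx = stratified_split(a, train_fraction=0.7)

# Check the distribution on each side: distinct values and their counts.
train_values, train_counts = np.unique(a[train_idx], return_counts=True)
test_values, test_counts = np.unique(a[test_idx], return_counts=True)
print(dict(zip(train_values.tolist(), train_counts.tolist())))  # {1: 3, 2: 2, 3: 2}
print(dict(zip(test_values.tolist(), test_counts.tolist())))    # {1: 1, 2: 1, 3: 1}
```

Because the function returns indices, the same train_idx and test_idx can be used to slice the parameter array as well. A value that occurs only once ends up in train only. If scikit-learn is available, train_test_split with its stratify argument covers the same use case.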

Subsampling a numpy array; distribution remains the same in test and train
