Extending a list with numbers that match its initial bias

Time: 2017-10-17 17:16:01

Tags: python python-2.7 list pandas

I have a list that looks essentially like this:

Dalc = [1,2,1,1,1,1,1,1,1,1,1,1,5,1,1,3,1,2,1,1,1,1,2.......]

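It currently contains 395 elements, and I am trying to extend it so that it keeps the same percentages of 1's, 2's, 3's, 4's and 5's. Min = 1, Max = 5. I initially tried the following to extend the list to 10000 elements: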

from random import randint

....

Dalc_add = []
dalc_max = max(Dalc)
dalc_min = min(Dalc)
i = 0

while i < 10000:
    Dalc_add.append(randint(dalc_min, dalc_max))
    i = i + 1

Dalc.append(Dalc_add)

This gives a list that keeps the initial bias for the first 395 elements, but the rest of the list then looks like:

[1,5,3,2,3,1,4,2,4,5,2,5,3,2,1,3,4,2,1,3,3,4,1........]

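There are now far more 3's, 4's and 5's, which completely ruins any statistical analysis I could perform.

How can I extend the list above while preserving the weights and bias of its values (in terms of how often each one occurs)?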

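2 answers:

Answer 0 (score: 2)

You can use numpy.random.choice, which samples randomly from your original list. If you pass it the original list itself, there is no need to specify weights: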

import numpy as np

Dalc = [1,2,1,1,1,1,1,1,1,1,1,1,5,1,1,3,1,2,1,1,1,1,2]
new_choices = np.random.choice(Dalc, size=10000)
Dalc += list(new_choices)
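Since the question is tagged python-2.7, where NumPy may not be available, the same resampling idea can be sketched with only the standard library. This is a minimal sketch and not part of the original answer; the helper name extend_by_resampling and its seed parameter are made up for illustration.

```python
import random

def extend_by_resampling(values, n_extra, seed=None):
    """Return values plus n_extra items drawn from values with replacement."""
    rng = random.Random(seed)  # private generator, so a seed makes runs repeatable
    # Each draw picks one existing element uniformly at random, so a value
    # that makes up most of the input is also drawn most of the time.
    extra = [rng.choice(values) for _ in range(n_extra)]
    return list(values) + extra

Dalc = [1,2,1,1,1,1,1,1,1,1,1,1,5,1,1,3,1,2,1,1,1,1,2]
extended = extend_by_resampling(Dalc, 10000, seed=0)
print(len(extended))  # 23 original elements + 10000 new ones = 10023
```

The original elements stay at the front of the result, and the proportions of the new elements match the original ones only approximately, because they are random draws.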

Answer 1 (score: 1)

You have two options:

# Option 1: standard library (random.choices requires Python 3.6+)
from random import choices
Dalc.extend(choices(Dalc, k=numTimes))

# Option 2: NumPy
from numpy.random import choice
Dalc.extend(choice(Dalc, size=numTimes))

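Either one picks an element from Dalc at random numTimes times, so each value is drawn in proportion to how often it already appears and the weights stay the same on average.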
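To check this rather than take it on faith, the relative frequencies can be compared before and after extending. The following is only a sketch: the proportions helper is hypothetical, the seed is there just to make the illustration repeatable, and random.choices assumes Python 3.6 or later.

```python
from collections import Counter
from random import Random

def proportions(values):
    """Map each distinct value to its share of the list."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

rng = Random(42)  # seeded only so the illustration is repeatable
Dalc = [1,2,1,1,1,1,1,1,1,1,1,1,5,1,1,3,1,2,1,1,1,1,2]

before = proportions(Dalc)
Dalc.extend(rng.choices(Dalc, k=10000))  # sample first, then extend in place
after = proportions(Dalc)

# Print value, share before, share after; the two shares should be close.
for value in sorted(before):
    print(value, round(before[value], 3), round(after[value], 3))
```

The shares will not be identical, since the new elements are random draws, but with 10000 samples they should differ from the originals by only a small amount.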

Which method you should use depends on two things: whether numTimes is large and whether Dalc is large. Using timeit:

import timeit

print('Standard | Numpy')

print(timeit.timeit('choices([1,2,3,4,5], k=10000)', setup='from random import choices', number=10000), end=' | ')
print(timeit.timeit('choice([1,2,3,4,5], size=10000)', setup='from numpy.random import choice', number=10000))

print(timeit.timeit('choices([1,2,3,4,5], k=1000)', setup='from random import choices', number=10000), end=' | ')
print(timeit.timeit('choice([1,2,3,4,5], size=1000)', setup='from numpy.random import choice', number=10000))

print(timeit.timeit('choices([1,2,3,4,5], k=100)', setup='from random import choices', number=10000), end=' | ')
print(timeit.timeit('choice([1,2,3,4,5], size=100)', setup='from numpy.random import choice', number=10000))

print(timeit.timeit('choices([1,2,3,4,5], k=10)', setup='from random import choices', number=10000), end=' | ')
print(timeit.timeit('choice([1,2,3,4,5], size=10)', setup='from numpy.random import choice', number=10000))

print(timeit.timeit('choices([1,2,3,4,5], k=5)', setup='from random import choices', number=10000), end=' | ')
print(timeit.timeit('choice([1,2,3,4,5], size=5)', setup='from numpy.random import choice', number=10000))

print()

print(timeit.timeit('choices([1,2,3,4,5]*10000, k=60)', setup='from random import choices', number=10000), end=' | ')
print(timeit.timeit('choice([1,2,3,4,5]*10000, size=60)', setup='from numpy.random import choice', number=10000))

print(timeit.timeit('choices([1,2,3,4,5]*1000, k=60)', setup='from random import choices', number=10000), end=' | ')
print(timeit.timeit('choice([1,2,3,4,5]*1000, size=60)', setup='from numpy.random import choice', number=10000))

print(timeit.timeit('choices([1,2,3,4,5]*100, k=60)', setup='from random import choices', number=10000), end=' | ')
print(timeit.timeit('choice([1,2,3,4,5]*100, size=60)', setup='from numpy.random import choice', number=10000))

print(timeit.timeit('choices([1,2,3,4,5]*10, k=60)', setup='from random import choices', number=10000), end=' | ')
print(timeit.timeit('choice([1,2,3,4,5]*10, size=60)', setup='from numpy.random import choice', number=10000))

print(timeit.timeit('choices([1,2,3,4,5], k=60)', setup='from random import choices', number=10000), end=' | ')
print(timeit.timeit('choice([1,2,3,4,5], size=60)', setup='from numpy.random import choice', number=10000))

This gives the output:

Standard | Numpy
25.372834796129872 | 1.8409739351390613
2.5144703081718696 | 0.316072358469512
0.2527455696737988 | 0.15912525398981003
0.03453532081119093 | 0.13720956183202304
0.021838018317897223 | 0.1544090297115197

1.2724984282899072 | 26.585005448108767
0.29600333450513006 | 2.7196871458182343
0.16926004909861803 | 0.4086584816186516
0.14861485298857957 | 0.16870138091688602
0.15621485532244606 | 0.1448146694886887

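So if numTimes is very large, NumPy is the clear winner, but if Dalc itself is very large, plain Python appears to be the better choice. Note that the second set of timings also includes the cost of building the list, because the list expression is part of the timed statement.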
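When Dalc is long but holds only a few distinct values, one possible workaround is to count the values once and then sample the distinct values by weight, so the sampler never has to handle the full list. This is a sketch under assumptions, not something the timings above measured: the helper extend_by_counts and the example data in big are made up, and random.choices assumes Python 3.6 or later.

```python
from collections import Counter
from random import Random

def extend_by_counts(values, n_extra, seed=None):
    """Extend values by sampling its distinct values, weighted by their counts."""
    rng = Random(seed)
    counts = Counter(values)                 # one pass over the (possibly long) list
    distinct = list(counts)                  # the few distinct values
    weights = [counts[v] for v in distinct]  # how often each one occurs
    # Weighted sampling over the distinct values follows the same distribution
    # as sampling uniformly from the full list.
    extra = rng.choices(distinct, weights=weights, k=n_extra)
    return list(values) + extra

# Made-up example data: 10000 elements with a strong bias toward 1.
big = [1] * 7000 + [2] * 2000 + [3] * 500 + [4] * 300 + [5] * 200
extended = extend_by_counts(big, 10000, seed=1)
print(Counter(extended).most_common())
```

Whether this is actually faster than the two options above would need to be measured with timeit on the real data.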