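I need to simulate the hypergeometric distribution (a fancy term for sampling elements without replacement) in Python.
Setup: there is a bag filled with population marbles. There are two types of marbles, red and green (in the implementations below, marbles are represented as True and False). The number of marbles drawn from the bag is sample.
Below are the two implementations I have come up with for this problem, but both start to slow down at population > 10^8.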
    import random

    def pull_marbles(sample, population=100):
        assert population % 2 == 0
        marbles = [x < population / 2 for x in range(0, population)]
        chosen = []
        for i in range(0, sample):
            choice = random.randint(0, population - i - 1)
            chosen.append(marbles[choice])
            del marbles[choice]
        return marbles  # original mistake, kept on purpose (see the edit note in the question)
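This implementation is very readable and clearly follows the setup of the problem. However, it has to create a list of size population, which seems to be the bottleneck.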
    def pull_marbles2(sample, population=100):
        assert population % 2 == 0
        return random.sample([x < population / 2 for x in range(0, population)], sample)
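This implementation uses the random.sample function in the hope of speeding things up. Unfortunately, it does not address the likely bottleneck of generating a list of length population.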
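Edit: By mistake, the first code example returns marbles, which made this question ambiguous. To be clear, I want the code to return the number of red marbles and green marbles that were "pulled". Sorry for the confusion. I will leave the original, incorrect version of pull_marbles in place so that the existing answers do not look invalid.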
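One possible way around the list-building bottleneck described above, sketched under the assumption of Python 3: random.sample accepts a range object, so marbles can be identified by index without materialising the whole population. The name pull_marbles_counts and the convention that indices below population // 2 are red are illustrative assumptions, not part of the original question.

```python
import random

def pull_marbles_counts(sample, population=100):
    """Return (red_pulled, green_pulled) for a draw without replacement."""
    assert population % 2 == 0
    assert 0 <= sample <= population
    # range() is lazy in Python 3, so no list of length `population` is built.
    indices = random.sample(range(population), sample)
    # Assumption: indices below population // 2 stand for red marbles.
    red = sum(1 for i in indices if i < population // 2)
    return red, sample - red

print(pull_marbles_counts(10, population=10**8))
```

Memory use here grows with sample rather than with population, and the function returns the pulled counts that the edit note asks for.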
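Answer 0 (score: 5)
Don't represent your bag as a list; just use two integers that count the red and green marbles. Each pull is then done by checking whether a random number in the range (0..red+green) is smaller than red. If it is, a red marble was pulled, so decrement red; otherwise a green one was pulled, so decrement green.
This way you have to do everything iteratively, but I guess that should not be a problem. There may be optimisations I cannot think of right now that would let you pull a large number of marbles without iterating.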
    import random

    def pull_marbles(sample, population=100):
        # // keeps the counts as integers on Python 3 as well as Python 2
        red = population // 2
        green = (population + 1) // 2  # round up just to ensure red+green == population
        for i in range(sample):
            choice = random.randint(1, red + green)
            if choice <= red:  # red pulled
                red -= 1
            else:
                green -= 1
        return (red, green)  # marbles left in the bag after the pulls
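Regarding the non-iterative optimisation this answer wonders about: NumPy ships a hypergeometric sampler, so a single call may be enough. The sketch below assumes NumPy is installed and that the bag is split evenly as in the question; pull_marbles_numpy and its seed parameter are hypothetical names introduced for illustration. NumPy documents upper bounds on the argument sizes, so it is worth checking the documentation of the installed version for very large populations.

```python
import numpy as np

def pull_marbles_numpy(sample, population=100, seed=None):
    """Return (red_pulled, green_pulled) in a single call, no Python loop."""
    red = population // 2
    green = population - red
    rng = np.random.default_rng(seed)
    # hypergeometric(ngood, nbad, nsample) returns how many "good" (here: red)
    # items appear in a sample of size nsample drawn without replacement.
    red_pulled = int(rng.hypergeometric(red, green, sample))
    return red_pulled, sample - red_pulled

print(pull_marbles_numpy(1000, population=10**8))
```

Unlike the loop above, this returns the pulled counts rather than what is left in the bag.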
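Answer 1 (score: 2)
This takes time proportional to sample (rather than population). Although you did not say so, your code seems to assume that the bag holds the same number of marbles of each color. The code here does the same, but it could easily be fiddled with to use other assumptions: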
    def pull_marbles(sample, population=100):
        from random import random
        assert population % 2 == 0
        chosen = []
        nTrue = population / 2.0
        nTotal = float(population)
        for _ in range(sample):  # xrange in the original Python 2 code
            # probability of True is the share of True marbles still in the bag
            if random() < nTrue / nTotal:
                chosen.append(True)
                nTrue -= 1.0
            else:
                chosen.append(False)
            nTotal -= 1.0  # one marble leaves the bag on every draw
        return chosen
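As an illustration of the "other assumptions" this answer mentions, here is a minimal sketch that takes the two counts separately instead of assuming an even split. The function name pull_marbles_general and its parameters n_true and n_false are hypothetical; integer arithmetic via random.randrange is a substitution of mine, not the answer's float-based approach.

```python
import random

def pull_marbles_general(sample, n_true, n_false):
    """Draw `sample` marbles without replacement from n_true + n_false."""
    assert 0 <= sample <= n_true + n_false
    chosen = []
    for _ in range(sample):
        # randrange keeps the arithmetic in integers, avoiding float rounding.
        is_true = random.randrange(n_true + n_false) < n_true
        chosen.append(is_true)
        if is_true:
            n_true -= 1
        else:
            n_false -= 1
    return chosen

print(pull_marbles_general(5, n_true=3, n_false=7))
```

The running time still depends only on sample, as in the answer's version.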
Answer 2 (score: 1)
    import numpy

    def get_sample(sample_size, population_size):
        reds = population_size // 2
        greens = population_size // 2
        sample = []
        for i in range(sample_size):
            red_prob = 1.0 * reds / (reds + greens)
            grn_prob = 1.0 * greens / (reds + greens)
            # the second argument is the probability of picking one color or the other
            choice = numpy.random.choice([0, 1], p=[red_prob, grn_prob])
            sample.append(choice)
            if choice == 0:
                reds -= 1
            else:
                greens -= 1
        return sample
You don't need the whole list...
Just choose randomly between your two counters, with probabilities that match the theoretical list.
Side note:
    marbles = [x < population // 2 for x in range(population)]  # SLOW
    # takes 69 us with a population of 1k
    # raises MemoryError with a population of 10^8 (2.5 seconds for 1/8th of the 10^8 population)
    marbles = [False] * (population // 2) + [True] * (population // 2)  # much FASTER!!!
    # takes 8.6 us for a population of 1k
    # takes 272 ms for half the list, so about 544 ms total
    marbles = [True, False] * (population // 2)  # fastest ...
    # 2.19 us with a population of 1k
    # 329 ms with a population of 10^8
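The timings in the side note could be reproduced with something like the following sketch, which uses the standard timeit module. The dictionary builders and its labels are hypothetical names, and the population is kept small on purpose; actual numbers will differ by machine and Python version.

```python
import timeit

population = 1000  # small on purpose; scale up with care for memory

builders = {
    "comprehension": lambda: [x < population // 2 for x in range(population)],
    "concatenation": lambda: [False] * (population // 2) + [True] * (population // 2),
    "repetition": lambda: [True, False] * (population // 2),
}

for name, build in builders.items():
    # timeit returns total seconds for 1000 builds; multiplying by 1e3
    # converts that total into microseconds per single build.
    seconds = timeit.timeit(build, number=1000)
    print(f"{name}: {seconds * 1e3:.3f} us per build")
```

All three constructions hold the same number of True and False values; only the ordering differs, which does not matter for random sampling.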
Answer 3 (score: 0)
It seems the list is unnecessary. Try something like this:
    import random

    def pull_marbles(sample, population=100):
        assert population % 2 == 0
        marbles = [x < population / 2 for x in range(0, population)]
        total_chosen = 0  # number of draws made; always equals sample, included for clarity
        true_chosen = 0  # number of draws that were True
        for i in range(0, sample):
            choice = random.randint(0, population - i - 1)
            if marbles[choice]:
                true_chosen += 1
            total_chosen += 1
            del marbles[choice]
        return true_chosen, total_chosen
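This returns two integers whose ratio is the fraction of draws that came up True.
Answer 4 (score: 0)
My two bits, similar to the others. Compute the probability of picking each color, then compare those probabilities to a random number, accumulating the choices.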
    import random
    from operator import itemgetter

    least_probable = color = itemgetter(0)
    most_probable = probability = itemgetter(1)

    def select(pop, samp):
        assert pop % 2 == 0 and samp < pop
        choices = (random.random() for _ in range(samp))
        ## choices = (random.uniform(0.0, 1.0) for _ in range(samp))
        ## choices = (random.triangular() for _ in range(samp))
        num_red = num_green = 0
        total_red = total_green = pop / 2.0
        for choice in choices:
            p_red = total_red / pop
            p_green = total_green / pop
            marbles = [('RED', p_red), ('GREEN', p_green)]
            marbles.sort(key=probability)  # least probable color first
            if choice <= probability(least_probable(marbles)):
                marble = color(least_probable(marbles))
            else:
                marble = color(most_probable(marbles))
            if marble == 'RED':  # compare by value; `is` on strings is unreliable
                num_red += 1
                total_red -= 1
            else:
                num_green += 1
                total_green -= 1
            pop -= 1
            ## print(marbles, choice, marble)
        return ('RED', num_red), ('GREEN', num_green)

    for thing in (select(100000000, 1000) for _ in range(20)):
        print(thing)
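As a sanity check for any of the simulations in this thread, the simulated counts could be compared against the exact hypergeometric probabilities, P(X = k) = C(K, k) * C(N - K, n - k) / C(N, n), where N is the population, K the number of red marbles and n the sample size. The sketch below assumes Python 3.8+ (for math.comb); hypergeom_pmf is a hypothetical helper name, and the small parameter values are chosen only for illustration.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(k, population, n_red, sample):
    """Probability of exactly k red marbles in `sample` draws without replacement."""
    return Fraction(comb(n_red, k) * comb(population - n_red, sample - k),
                    comb(population, sample))

population, n_red, sample = 100, 50, 10
# Exact probabilities for every possible red count 0..sample.
pmf = [hypergeom_pmf(k, population, n_red, sample) for k in range(sample + 1)]
mean = sum(k * p for k, p in enumerate(pmf))
print(float(mean))
```

The mean of this distribution is sample * n_red / population, so the average red count over many simulated draws should land close to that value.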