Question

I need to simulate the hypergeometric distribution (a fancy term for sampling elements without replacement) in Python.

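Setup: there is a bag filled with population many marbles. There are two types of marbles, red and green (in the implementations below, marbles are represented as True and False). The number of marbles drawn from the bag is sample.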
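
For reference, the number of red marbles in such a sample follows the hypergeometric distribution. Below is a minimal sketch of its probability mass function, assuming Python 3.8+ for math.comb. The name hypergeom_pmf and its parameters are illustrative and not part of the question.

```python
from math import comb

def hypergeom_pmf(k, population, reds, sample):
    """Probability of exactly k red marbles in `sample` draws without replacement."""
    # Ways to pick k reds and (sample - k) greens, divided by all ways to pick the sample.
    return comb(reds, k) * comb(population - reds, sample - k) / comb(population, sample)
```

Summing hypergeom_pmf over k from 0 to sample should give 1, which is a quick sanity check for any simulation of this setup.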

Below are the two implementations I came up with for this problem, but both start to slow down once population > 10^8.

import random

def pull_marbles(sample, population=100):
    assert population % 2 == 0
    marbles = [x < population / 2 for x in range(0,population)]
    chosen = []
    for i in range(0,sample):
        choice = random.randint(0, population - i - 1)
        chosen.append(marbles[choice])
        del marbles[choice]
    return marbles  # returns what is left in the bag, not `chosen` (see the edit below)

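This implementation is very readable and clearly follows the setup of the problem. However, it has to create a list of size population, which seems to be the bottleneck.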

def pull_marbles2(sample, population=100):
    assert population % 2 == 0
    return random.sample([x < population / 2 for x in range(0, population)], sample)

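This implementation uses the random.sample function, which hopefully speeds things up a bit. Unfortunately, it does not address the likely bottleneck of generating a list of length population.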
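
One possible way around building the list, offered as a sketch rather than a measured benchmark: random.sample accepts a range object, so marble indices can be drawn without materializing all population of them. The name pull_marbles_range, and the convention that indices below population // 2 stand for red marbles, are assumptions made for illustration.

```python
import random

def pull_marbles_range(sample, population=100):
    assert population % 2 == 0
    # range() is a lazy sequence, so no list of length `population` is built.
    indices = random.sample(range(population), sample)
    # Assumed convention: indices below population // 2 are red marbles.
    red = sum(1 for i in indices if i < population // 2)
    return red, sample - red
```

Memory use here grows with sample rather than with population, since only the drawn indices are stored.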

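Edit: By mistake, the first code example returns marbles, which makes this question ambiguous. To be clear, I want the code to return how many red marbles and how many green marbles were pulled. Sorry for the confusion. I will leave the original, incorrect version of pull_marbles in place so that the existing answers do not look invalid.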

Answer 1

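Don't represent your bag as a list; just use two integers that count the red and green marbles. Each pull is done by checking whether a random number in the range (0..red+green) is less than red. If it is, a red marble was pulled, so decrement red; otherwise a green one was pulled, so decrement green.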

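This way you have to do everything iteratively, but I guess that should not be a problem. There may be an optimization I can't think of right now that draws a large number of marbles without iterating.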

import random

def pull_marbles(sample, population=100):
  red = population // 2
  green = (population + 1) // 2  # round up just to ensure red+green == population
  for i in range(sample):
    choice = random.randint(1, red + green)  # 1..red+green, both ends inclusive
    if choice <= red:  # red pulled
      red -= 1
    else:  # green pulled
      green -= 1
  return (red, green)  # marbles left in the bag, not the number pulled
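
As one possible way to skip the per-pull loop, and assuming NumPy is available, its random generator can draw the number of red marbles directly from the hypergeometric distribution. This is a sketch, not part of the answer above; the wrapper name pull_marbles_numpy is hypothetical, and unlike the function above it returns the counts pulled rather than the counts remaining.

```python
import numpy as np

def pull_marbles_numpy(sample, population=100, rng=None):
    assert population % 2 == 0
    if rng is None:
        rng = np.random.default_rng()
    red_total = population // 2
    green_total = population - red_total
    # A single draw: how many reds appear among `sample` pulls without replacement.
    red_pulled = int(rng.hypergeometric(red_total, green_total, sample))
    return red_pulled, sample - red_pulled
```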

Answer 2

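This takes time proportional to sample (not population). Although you didn't say so, your code seems to assume that the bag holds the same number of marbles of each color. The code here does the same, but can easily be fiddled to use other assumptions: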

def pull_marbles(sample, population=100):
    from random import random
    assert population % 2 == 0
    chosen = []
    nTrue = population / 2.0
    nTotal = float(population)
    for _ in range(sample):
        if random() < nTrue / nTotal:
            chosen.append(True)
            nTrue -= 1.0
        else:
            chosen.append(False)
        nTotal -= 1.0
    return chosen
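
As a sketch of how this might be adjusted for other assumptions, the variant below takes the two color counts separately and returns only how many marbles of each color were pulled, which is what the question's edit asks for. The name pull_marble_counts and its parameters are illustrative.

```python
import random

def pull_marble_counts(sample, n_red, n_green):
    assert 0 <= sample <= n_red + n_green
    red_pulled = 0
    for _ in range(sample):
        # Pick one remaining marble uniformly; the first n_red slots count as red.
        if random.randrange(n_red + n_green) < n_red:
            n_red -= 1
            red_pulled += 1
        else:
            n_green -= 1
    return red_pulled, sample - red_pulled
```

Using integer counts with random.randrange avoids the floating-point bookkeeping, and the running time still depends only on sample.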

Answer 3

import numpy

def get_sample(sample_size, population_size):
    reds = population_size // 2
    greens = population_size // 2
    sample = []
    for i in range(sample_size):
        red_prob = 1.0 * reds / (reds + greens)
        grn_prob = 1.0 * greens / (reds + greens)
        # the second argument is the probability of picking one color or the other
        choice = numpy.random.choice([0, 1], p=[red_prob, grn_prob])
        sample.append(choice)  # 0 means red, 1 means green
        if choice == 0:
            reds -= 1
        else:
            greens -= 1
    return sample

You don't need the whole list... just pick randomly between your two variables, with probabilities that match the theoretical list.

Side note

marbles = [x < population/2 for x in range(population)]  # SLOW
# takes 69 us with a population of 1k
# raises MemoryError with a population of 10^8 (2.5 seconds for 1/8th of the 10^8 population)
marbles = [False]*(population//2) + [True]*(population//2)  # much FASTER!!!
# takes 8.6 us with a population of 1k
# takes 272 ms for half the list, so about 544 ms total
marbles = [True,False]*(population//2)  # fastest ...
# 2.19 us with a population of 1k
# 329 ms with a population of 10^8
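
Timings like these depend on the machine and the interpreter. Here is a rough sketch of how one might measure the three constructions with timeit. The helper name time_constructions is made up for illustration, and the numbers it prints should not be expected to match those quoted above.

```python
import timeit

def time_constructions(population=1000, number=1000):
    half = population // 2
    builders = {
        "comprehension": lambda: [x < half for x in range(population)],
        "concatenation": lambda: [False] * half + [True] * half,
        "repetition": lambda: [True, False] * half,
    }
    # Average seconds per call for each way of building the list.
    return {name: timeit.timeit(build, number=number) / number
            for name, build in builders.items()}

for name, seconds in time_constructions().items():
    print(name, seconds)
```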

Answer 4

It seems the list is unnecessary. Try something like this:

import random

def pull_marbles(sample, population=100):
    assert population % 2 == 0
    marbles = [x < population / 2 for x in range(0,population)]
    total_chosen = 0 # number of marbles pulled; always ends up == sample, included for clarity
    true_chosen = 0 # number of pulled marbles that were True
    for i in range(0,sample):
        choice = random.randint(0, population - i - 1)
        if marbles[choice]: true_chosen += 1
        total_chosen += 1
        del marbles[choice]
    return true_chosen, total_chosen

This returns two integers; their ratio is the fraction of pulled marbles that were True.

Answer 5

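My two cents, similar to the others. Calculate the probability of picking each color, then compare it against a random number, accumulating the selections.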

import random
from operator import itemgetter

least_probable = color = itemgetter(0)
most_probable = probability = itemgetter(1)

def select(pop, samp):
    assert pop % 2 == 0 and samp < pop
    choices = (random.random() for _ in range(samp))
##    choices = (random.uniform(0.0, 1.0) for _ in range(samp))
##    choices = (random.triangular() for _ in range(samp))
    num_red = num_green = 0
    total_red = total_green = pop / 2.0
    for choice in choices:
        p_red = total_red / pop
        p_green = total_green / pop
        marbles = [('RED', p_red), ('GREEN', p_green)]
        marbles.sort(key = probability)
        if choice <= probability(least_probable(marbles)):
            marble = color(least_probable(marbles))
        else:
            marble = color(most_probable(marbles))
        if marble == 'RED':  # compare by value; `is` on strings relies on interning
            num_red += 1
            total_red -= 1
        else:
            num_green += 1
            total_green -= 1
        pop -= 1
##        print(marbles, choice, marble)
    return ('RED', num_red), ('GREEN', num_green)

for thing in (select(100000000, 1000) for _ in range(20)):
    print(thing)

Simulating pulling marbles from a bag without replacement (efficiently)
