Generating random (equal-probability) combinations with replacement

Date: 2017-09-19 12:52:40

Tags: python performance random combinations

I want to generate one random combination out of all possible combinations_with_replacement. The tricky part is that I want each possible outcome to have the same probability, without needing to generate (not even implicitly) all of the possible outcomes.

For example:

import itertools
import random

random.choice(list(itertools.combinations_with_replacement(range(4), 2)))

This approach is too slow (and too memory-hungry) because it has to build every possible combination, while I only want one.

It would not be so bad if I could work out how many results combinations_with_replacement will produce, draw an index with random.randrange, and then pull out that single element by using next with itertools.islice over itertools.combinations_with_replacement. That does not require generating all possible combinations (except in the worst case). But it is still too slow.
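For reference, here is a minimal sketch of that counting-plus-islice idea (the function name is mine, and math.comb needs Python 3.8+): there are C(n+k-1, k) combinations with replacement of k items drawn from n, so one can pick a uniform index and skip ahead to it:

import itertools
import math
import random

def random_comb_by_index(n, k):
    # There are C(n+k-1, k) combinations_with_replacement of k items from n
    total = math.comb(n + k - 1, k)
    idx = random.randrange(total)
    # Skip ahead to the chosen combination without building the whole list;
    # this is uniform, but still O(idx) iterator steps, hence "still too slow"
    return next(itertools.islice(
        itertools.combinations_with_replacement(range(n), k), idx, None))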

On the other hand, the recipe mentioned in the itertools documentation is fast, but not every combination comes out with the same probability.
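For context, the documentation recipe in question is essentially the following. It is fast, but sorting collapses several equally likely index tuples onto one combination: for range(2) and r=2, for instance, (0, 1) arises from two tuples while (0, 0) arises from only one, so the former is twice as likely.

import random

def random_combination_with_replacement(iterable, r):
    "Random selection from itertools.combinations_with_replacement(iterable, r)"
    pool = tuple(iterable)
    n = len(pool)
    indices = sorted(random.randrange(n) for i in range(r))
    return tuple(pool[i] for i in indices)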

2 Answers:

Answer 0: (score: 3)

Well, I am going somewhat out on a limb here, because I found an algorithm that works but I do not know why. So maybe some mathematician in the room can work out the probabilities, but it does work. The idea is to pick one element at a time, increasing the probability of the elements already picked. I suspect the reasoning must be similar to that of reservoir sampling, but I have not worked it out.

from random import choice
from itertools import combinations_with_replacement

population = ["A", "B", "C", "D"]
k = 3

def random_comb(population, k):
    idx = []
    indices = list(range(len(population)))
    for _ in range(k):
        idx.append(choice(indices))
        # Put the chosen index back in again, so the elements picked so far
        # become more likely on the following draws
        indices.append(idx[-1])
    return tuple(population[i] for i in sorted(idx))

combs = list(combinations_with_replacement(population, k))
counts = {c: 0 for c in combs}

for _ in range(100000):
    counts[random_comb(population, k)] += 1

for comb, count in sorted(counts.items()):
    print("".join(comb), count)

The output is the number of times each possibility came up over 100,000 runs:

AAA 4913
AAB 4917
AAC 5132
AAD 4966
ABB 5027
ABC 4956
ABD 4959
ACC 5022
ACD 5088
ADD 4985
BBB 5060
BBC 5070
BBD 5056
BCC 4897
BCD 5049
BDD 5059
CCC 5024
CCD 5032
CDD 4859
DDD 4929
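One way to make the reservoir-sampling hunch precise (a sketch, not the original answer's argument): the procedure is exactly a Pólya urn that starts with one ball per element and returns every drawn ball together with a copy of it. A fixed draw sequence in which element i occurs m_i times (with m_1 + ... + m_n = k) then has probability

P(\text{sequence}) = \frac{\prod_i m_i!}{n(n+1)\cdots(n+k-1)},
\qquad
\#\{\text{orderings of the multiset}\} = \frac{k!}{\prod_i m_i!},

so every multiset has probability

P(\text{multiset}) = \frac{k!}{n(n+1)\cdots(n+k-1)} = \binom{n+k-1}{k}^{-1},

independent of the multiplicities, which matches the uniform counts above.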

Answer 1: (score: 3)

Since you do not give any estimates of the parameters of your task: here is an approach for small k.

Basic idea: acceptance-rejection sampling with a complete restart whenever a partial solution becomes infeasible (judged by the sortedness characterization). Of course, the probability of getting through without a restart decreases with k! (compare with bogosort). No extra memory is used.
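To quantify that (a sketch based on standard counting, not part of the original answer): each specific k-tuple of draws has probability n^{-k}, the accepted tuples are exactly the non-decreasing ones, and those are in one-to-one correspondence with the combinations, so the output is uniform and

P(\text{accept}) = \binom{n+k-1}{k}\,n^{-k} = \frac{n(n+1)\cdots(n+k-1)}{k!\,n^k} \ge \frac{1}{k!},

which bounds the expected number of attempts by k!.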

The following code compares this approach with the original approach, a wrong naive approach, and the wrong approach of another (now deleted) answer (which had upvotes). The code is pretty much throwaway, just for demonstration purposes:

Code:

import itertools
import random
from time import perf_counter
from collections import deque
n = 30
k = 4
its = 100000  # monte-carlo analysis -> will take some time with these values!

sample_space = itertools.combinations_with_replacement(range(n), k)
flat_map = {}  # for easier counting / analysis
for ind, i in enumerate(sample_space):
    flat_map[i] = ind

def a(n, k):
    """ Original slow approach """
    return random.choice(list(itertools.combinations_with_replacement(range(n), k)))

def b(n, k):
    """ Naive attempt -> non-uniform """
    chosen = [random.choice(list(range(n))) for i in range(k)]
    return tuple(sorted(chosen))

def c(population, k):
    """ jdehesa solution (hopefully not broken by my modifications) """
    choices = [i for i in range(population) for _ in range(k)]
    return tuple(sorted(random.sample(choices, k)))

def d(n, k):
    """ Acceptance-rejection sampling with restart using python's list """
    chosen = []
    while True:
        if len(chosen) == k:
            return tuple(chosen)
        else:
            new_element = random.randint(0, n-1)
            if len(chosen) > 0:
                if new_element >= chosen[-1]:
                    chosen.append(new_element)
                else:
                    # Partial solution is no longer sorted -> reject & restart
                    chosen = []
            else:
                chosen.append(new_element)

def d2(n, k):
    """ Acceptance-rejection sampling with restart using deque """

    chosen = deque()
    while True:
        if len(chosen) == k:
            return tuple(chosen)
        else:
            new_element = random.randint(0, n-1)
            if len(chosen) > 0:
                if new_element >= chosen[-1]:
                    chosen.append(new_element)
                else:
                    # Partial solution is no longer sorted -> reject & restart
                    chosen.clear()
            else:
                chosen.append(new_element)

start = perf_counter()
a_result = [flat_map[a(n, k)] for i in range(its)]
print('s: ', perf_counter() - start)

start = perf_counter()
b_result = [flat_map[b(n, k)] for i in range(its)]
print('s: ', perf_counter() - start)

start = perf_counter()
c_result = [flat_map[c(n, k)] for i in range(its)]
print('s: ', perf_counter() - start)

start = perf_counter()
d_result = [flat_map[d(n, k)] for i in range(its)]
print('s: ', perf_counter() - start)

start = perf_counter()
d2_result = [flat_map[d2(n, k)] for i in range(its)]
print('s: ', perf_counter() - start)

import matplotlib.pyplot as plt

f, arr = plt.subplots(5, sharex=True, sharey=True)
arr[0].hist(a_result, label='original')
arr[1].hist(b_result, label='naive (non-uniform)')
arr[2].hist(c_result, label='jdehesa (non-uniform)')
arr[3].hist(d_result, label='Acceptance-rejection restart -> list')
arr[4].hist(d2_result, label='Acceptance-rejection restart  -> deque')

for i in range(5):
    arr[i].legend()

plt.show()

Output:

s:  546.1523445801055
s:  1.272424016672062
s:  3.058098026099742
s:  12.665841491509354
s:  13.14264200539003

[Figure: five stacked histograms of the sampled combination indices, one per method, using the legend labels from the code above]

Yes, I put those legend labels in a somewhat suboptimal position.

Alternative timings:

Comparing only the original approach against the deque-based AR sampling. Again, only the relative times matter here.

n=100, k=3

s:  22.6498539618067
s:  0.038274503506364965

n=100, k=4

s:  7.047153613584993
s:  0.0009363589822841689

Remark: one could argue that the original approach ought to reuse its sample space if memory allows that storage (which would change these benchmarks).
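A minimal sketch of that remark (the names are mine): enumerate the sample space once, after which every draw is a constant-time random.choice, at the price of O(C(n+k-1, k)) memory:

import itertools
import random

n, k = 30, 4

# Pay the enumeration cost once ...
space = list(itertools.combinations_with_replacement(range(n), k))

def a_cached():
    # ... then each draw is a single constant-time choice
    return random.choice(space)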