Question

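I have a list of sets, and I want to draw n different samples, each containing one item from each set. What I don't want is to get them in order, so that, for example, all samples would share the same item from the first set. I also don't want to create the full Cartesian product, as that might not be feasible in terms of efficiency... Any idea how to do this? Or even something that approximates this behavior?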

An example that does not work:

(prod for i, prod in zip(range(n), itertools.product(*list_of_sets)))

Answer 1

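All of the other solutions waste a lot of resources filtering out repeated results toward the end of the iteration. That is why I came up with an approach that has (almost) linear speed from start to end.

The idea is: give (only in your head) each result of the standard-order Cartesian product an index. For example, for A x B x C with 2000 x 1 x 2 = 4000 elements: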

0: (A[0], B[0], C[0])
1: (A[1], B[0], C[0])
...
1999: (A[1999], B[0], C[0])
2000: (A[0], B[0], C[1])
...
3999: (A[1999], B[0], C[1])
done.

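So there are still a few questions to solve:

How do I get the list of possible indices? Answer: just multiply 2000*1*2=4000; every non-negative number below that is a valid index.
How do I generate random indices sequentially without repetition? There are two answers: if you want a sample with a known size n, just use random.sample(range(number_of_indices), n) (xrange in Python 2). But if you don't know the sample size yet (the more general case), you have to generate the indices on the fly so as not to waste memory. In that case, you can generate index = random.randint(0, k - 1) with k = number_of_indices to get the first index, and with k = number_of_indices - n once n indices have already been drawn. Just check the code below (note that I use a singly linked list there to store the indices already used. It makes each insertion an O(1) operation, and we need a lot of insertions here).
How do I generate the output from the index? Answer: well, say our index is i. Then i % 2000 is the index into A for the result. Now i // 2000 can be treated recursively as the index into the Cartesian product of the remaining factors.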
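
For the known-sample-size case, the first and third points can be sketched together as follows. This is only an illustration under some assumptions: the helper names index_to_items and sample_product are made up here, the factors must be indexable sequences (lists or tuples rather than sets), and math.prod requires Python 3.8 or later.

```python
import math
import random

def index_to_items(index, factors):
    """Decode a flat index into one item per factor (first factor varies fastest)."""
    items = []
    for factor in factors:
        items.append(factor[index % len(factor)])  # position inside this factor
        index //= len(factor)                      # remaining index for later factors
    return tuple(items)

def sample_product(factors, n):
    """Return n distinct tuples of the Cartesian product, in random order."""
    total = math.prod(len(factor) for factor in factors)  # number of valid indices
    # random.sample on a range does not materialize all indices in memory
    return [index_to_items(i, factors) for i in random.sample(range(total), n)]
```

random.sample raises ValueError when n is larger than the number of valid indices, so a caller may want to check n first.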

Here is the code I came up with:

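import functools
import random

def random_order_cartesian_product(*factors):
    # total number of tuples in the Cartesian product
    amount = functools.reduce(lambda prod, factor: prod * len(factor), factors, 1)
    # singly linked list of the indices already used, in ascending order; each node is [index, next]
    index_linked_list = [None, None]
    for max_index in reversed(range(amount)):
        # draw among the indices that are still unused
        index = random.randint(0, max_index)
        index_link = index_linked_list
        # shift the candidate past every used index that is <= it
        while index_link[1] is not None and index_link[1][0] <= index:
            index += 1
            index_link = index_link[1]
        index_link[1] = [index, index_link[1]]
        # decode the index into one item per factor (factors must be indexable)
        items = []
        for factor in factors:
            items.append(factor[index % len(factor)])
            index //= len(factor)
        yield items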
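
The linked list above is walked once per draw to skip the indices already used. One possible alternative for producing the indices on the fly, which is not part of this answer, is a sparse Fisher-Yates shuffle that records only the swapped slots in a dict. The sketch below is a swapped-in technique for illustration, and the name lazy_shuffled_indices is hypothetical.

```python
import random

def lazy_shuffled_indices(total):
    """Yield 0..total-1 in random order, storing only the slots that were swapped."""
    swapped = {}  # slot -> value currently held there (a slot holds itself if absent)
    for remaining in range(total, 0, -1):
        pos = random.randrange(remaining)  # pick one of the slots still in play
        last = remaining - 1
        value = swapped.get(pos, pos)
        # move the value of the last slot in play into the picked slot, then drop the last slot
        swapped[pos] = swapped.get(last, last)
        swapped.pop(last, None)
        yield value
```

Memory use grows with the number of indices actually drawn, not with the size of the product, so the generator can be abandoned early without having paid for the full range.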

Answer 2

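You can use sample from the random lib:

import random
# list(x): random.sample requires a sequence (sets are rejected since Python 3.11)
[[random.sample(list(x), 1)[0] for x in list_of_sets] for _ in range(n)]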

For example:

list_of_sets = [{1,2,3}, {4,5,6}, {1,4,7}]
n = 3

A possible output is:

[[2, 4, 7], [1, 4, 7], [1, 6, 1]]

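Edit:

If we want to avoid repetitions, we can use a while loop and collect the results in a set. In addition, you can check whether n is valid, and return the whole Cartesian product for invalid values of n:

import itertools
import random
from functools import reduce

chosen = set()
if 0 < n < reduce(lambda a,b: a*b,[len(x) for x in list_of_sets]):
    while len(chosen) < n:
        # list(x): random.sample requires a sequence (sets are rejected since Python 3.11)
        chosen.add(tuple([random.sample(list(x),1)[0] for x in list_of_sets]))
else:
    chosen = itertools.product(*list_of_sets)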

Answer 3

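The following generator function generates non-repeating samples. It will only perform well if the number of samples generated is much smaller than the number of possible samples. It also requires the elements of the sets to be hashable:

import random

def samples(list_of_sets):
    list_of_lists = list(map(list, list_of_sets))  # choice only works on sequences
    seen = set()  # keep track of seen samples
    while True:
        x = tuple(map(random.choice, list_of_lists))  # tuple is hashable
        if x not in seen:
            seen.add(x)
            yield x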

>>> lst = [{'b', 'a'}, {'c', 'd'}, {'f', 'e'}, {'g', 'h'}]
>>> gen = samples(lst)
>>> next(gen)
('b', 'c', 'f', 'g')
>>> next(gen)
('a', 'c', 'e', 'g')
>>> next(gen)
('b', 'd', 'f', 'h')
>>> next(gen)
('a', 'c', 'f', 'g')
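
Note that the generator above never terminates on its own: once every possible sample has been seen, the while loop keeps drawing forever. A minimal sketch of a bounded variant, assuming Python 3.8 or later for math.prod, could look like this. The name bounded_samples is made up for this illustration.

```python
import itertools
import math
import random

def bounded_samples(list_of_sets):
    """Like samples(), but stops once every possible combination was yielded."""
    pools = [list(s) for s in list_of_sets]      # random.choice needs sequences
    total = math.prod(len(pool) for pool in pools)
    seen = set()
    while len(seen) < total:                     # stop when the product is exhausted
        x = tuple(random.choice(pool) for pool in pools)
        if x not in seen:
            seen.add(x)
            yield x

# take the first three distinct samples without exhausting the generator
first_three = list(itertools.islice(bounded_samples([{1, 2}, {3, 4}, {5, 6}]), 3))
```

The caveat from the answer still applies: as the number of samples requested approaches the size of the product, more and more draws are rejected.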

Answer 4

Matmarbon's answer is correct. Here is a complete version with an example, with a few modifications to make it easier to understand and use:

import functools
import random

def random_order_cartesian_product(factors):
    # total number of combinations in the Cartesian product
    amount = functools.reduce(lambda prod, factor: prod * len(factor), factors, 1)
    index_linked_list = [None, None]
    for max_index in reversed(range(amount)):
        index = random.randint(0, max_index)
        index_link = index_linked_list
        while index_link[1] is not None and index_link[1][0] <= index:
            index += 1
            index_link = index_link[1]
        index_link[1] = [index, index_link[1]]
        items = []
        for factor in factors:
            items.append(factor[index % len(factor)])
            index //= len(factor)
        yield items


factors=[
    [1,2,3],
    [4,5,6],
    [7,8,9]
]

n = 5

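# lazy generator: nothing is computed until we iterate
products = random_order_cartesian_product(factors)

count = 0

# print the first n combinations, then stop
for comb in products:
    print(comb)
    count += 1
    if count == n:
        break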

Answer 5

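Since I want no repetition, and that is sometimes not possible, the code is not that short. But as @andreyF said, random.sample does the job. Perhaps there is a better way that avoids resampling with repetition until enough non-repeated samples exist, but this is the best I have so far.

import itertools
import operator
import random
from functools import reduce

def get_cart_product(list_of_sets, n=None):
    max_products_num = reduce(operator.mul, [len(cluster) for cluster in list_of_sets], 1)
    if n is not None and n < max_products_num:
        refs = set()
        while len(refs) < n:
            # tuple(cluster): random.sample requires a sequence
            refs.add(tuple(random.sample(tuple(cluster), 1)[0] for cluster in list_of_sets))
        return refs
    # n is missing or too large: fall back to the full product
    return itertools.product(*list_of_sets)

Note that the code assumes a list of frozen sets. Each cluster is converted to a tuple because random.sample requires a sequence (sets are no longer accepted as of Python 3.11).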
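
To gauge how much resampling the retry loop may need, one can compute the expected number of draws. Assuming each draw is uniform and independent over the N possible tuples (which holds when every item of every cluster is picked uniformly), a draw is new with probability (N - k) / N once k distinct tuples have been found, so reaching n distinct tuples takes N/N + N/(N-1) + ... + N/(N-n+1) draws on average. The helper name expected_draws below is hypothetical.

```python
def expected_draws(total, n):
    """Expected number of uniform draws needed to collect n distinct tuples out of total."""
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and total")
    # with k distinct tuples already found, a new one takes total / (total - k) draws on average
    return sum(total / (total - k) for k in range(n))
```

For n much smaller than the size of the product the result stays close to n, while collecting the whole product costs noticeably more draws than there are tuples.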

How to sample from a Cartesian product without repetition in Python?

5 answers: