Question

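I use random.sample to sample from a very large range, depending on the input load. Sometimes the sample itself is very large, and because it is a list it takes up a lot of memory.

The application does not necessarily use all the values in the list. It would be great if random.sample could return a generator instead of the list itself.

Right now I have a wrapper that divides the large input range into equal-sized buckets of length n / sample_size and uses randint to pick one random number from each bucket.

Edit: In my case the input is contiguous, and I have this wrapper function to simulate random.sample as a generator, but it does not truly replicate the functionality because it skips some elements at the end.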

import random

def samplegen(start, end, sample_size):
    # Integer bucket length; // keeps this an int on Python 3 as well
    bktlen = (end - start) // sample_size
    for i in range(sample_size):  # this skips the last (end - start) % sample_size elements
        st = start + (i * bktlen)
        yield random.randrange(st, st + bktlen)
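
To make the skipped elements concrete, here is a small sketch (Python 3, with the range 0..10 and sample size 3 chosen purely for illustration). The bucket length is 3, so the buckets cover 0..8 and the value 9 can never be produced:

```python
import random

def samplegen(start, end, sample_size):
    # Same bucket wrapper as in the question, written for Python 3
    bktlen = (end - start) // sample_size
    for i in range(sample_size):
        st = start + i * bktlen
        yield random.randrange(st, st + bktlen)

# Collect every value the generator produces over many runs
seen = set()
for _ in range(2000):
    seen.update(samplegen(0, 10, 3))

print(sorted(seen))  # 9 is never among the values
```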

Answer 1

Since you commented that the order doesn't matter (I had asked whether it has to be random or can be sorted), this might be an option:

import random

def sample(n, k):
    """Generate random sorted k-sample of range(n)."""
    for i in range(n):
        if random.randrange(n - i) < k:
            yield i
            k -= 1

It goes through the numbers and includes each one in the sample with probability numberOfNumbersStillNeeded / numberOfNumbersStillLeft.

Demo:

>>> for _ in range(5):
        print(list(sample(100, 10)))

[7, 16, 41, 50, 55, 56, 61, 76, 89, 96]
[5, 13, 24, 28, 34, 35, 40, 64, 80, 95]
[9, 18, 19, 36, 38, 39, 61, 73, 84, 85]
[23, 24, 26, 28, 40, 53, 62, 76, 77, 91]
[2, 12, 21, 41, 60, 68, 70, 72, 90, 91]
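
As a rough sanity check of the inclusion rule, the following sketch counts how often each number shows up over many independent samples. The helper name inclusion_counts and the parameters (n=10, k=3, 20000 trials) are illustrative assumptions, not part of the original answer. If the rule is uniform, every number should appear in about k / n of the samples:

```python
import random
from collections import Counter

def sample(n, k):
    """Generate a random sorted k-sample of range(n) (same logic as above)."""
    for i in range(n):
        # n - i numbers are still left, k are still needed
        if random.randrange(n - i) < k:
            yield i
            k -= 1

def inclusion_counts(n, k, trials):
    """Count how often each number appears across many independent samples."""
    counts = Counter()
    for _ in range(trials):
        counts.update(sample(n, k))
    return counts

counts = inclusion_counts(n=10, k=3, trials=20000)
for value in range(10):
    # Each printed frequency should be close to k / n = 0.3
    print(value, counts[value] / 20000)
```

Note that the generator still walks all n numbers, so it saves memory but not time when n is very large.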

Answer 2

Why not something like the following? The set seen only grows as a function of k, not necessarily the size of population:

import random

def sample(population, k):
    seen = set()

    for _ in range(k):
        element = random.randrange(population)
        while element in seen:
            element = random.randrange(population)

        yield element
        seen.add(element)

for n in sample(1000000, 10):
    print(n)
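
Since the question mentions that not every sampled value gets used, here is a sketch of consuming only part of such a generator with itertools.islice. The guard for k > population is an addition of this sketch (without it the retry loop would never finish once all values are taken), and the sizes are arbitrary illustrative choices:

```python
import itertools
import random

def sample(population, k):
    """Lazily yield k distinct values from range(population)."""
    if k > population:
        # Added guard, not in the answer above; raised on the first next() call
        raise ValueError("k must not exceed population")
    seen = set()
    for _ in range(k):
        element = random.randrange(population)
        while element in seen:
            element = random.randrange(population)
        yield element
        seen.add(element)

# Ask for a large sample but consume only the first five values;
# the generator does no work for values that are never requested.
big = sample(10**12, 10**6)
first_five = list(itertools.islice(big, 5))
print(first_five)
```

The retry loop gets slower as k approaches population, because most random draws will then already be in seen.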

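Another approach might be to use your original bucket design, but with non-uniform buckets whose indices are themselves randomly sampled: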

import random

def samplegen(start, end, sample_size):
    random_bucket_indices = random.sample(range(start, end), sample_size)
    sorted_bucket_indices = sorted(random_bucket_indices) + [end]  # end is exclusive, as in range(start, end)
    for index in random_bucket_indices:
        yield random.randrange(index, sorted_bucket_indices[sorted_bucket_indices.index(index) + 1])

Does Python have a built-in way to return a generator instead of a list from random.sample?
