Reservoir sampling

Time: 2010-04-10 07:45:03

Tags: algorithm random

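To retrieve k random numbers from an array of undetermined size, we use a technique called reservoir sampling. Can anyone briefly explain how it works, with sample code?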

4 answers:

Answer 0: (score: 31)

I didn't actually realize this had a name, so I proved and implemented it from scratch:

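import random

def random_subset( iterator, K ):
    result = []
    N = 0  # number of items seen so far

    for item in iterator:
        N += 1
        if len( result ) < K:
            # fill the reservoir with the first K items
            result.append( item )
        else:
            # s is uniform over 0 .. N-1, so s < K with probability K/N
            s = int(random.random() * N)
            if s < K:
                result[ s ] = item

    return result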

From: http://web.archive.org/web/20141026071430/http://propersubset.com:80/2010/04/choosing-random-elements.html

The proof is near the end.
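
For readers who don't want to follow the link, here is a sketch of the usual induction argument. It is not necessarily the same proof as on the linked page. It assumes the loop in the code above, with K the reservoir size and N the number of items seen so far, and shows that after N items (N >= K) every item is in the reservoir with probability K/N.

```latex
\textbf{Claim.} After $N \ge K$ items have been processed, each of them is in the
reservoir with probability $K/N$.

\textbf{Base case.} For $N = K$ every item was appended, so the probability is
$1 = K/K$.

\textbf{Inductive step.} Assume the claim holds for $N$. Item $N+1$ draws $s$
uniformly from $\{0, \dots, N\}$ and enters the reservoir iff $s < K$, so
\[
  P(\text{item } N+1 \text{ is kept}) = \frac{K}{N+1}.
\]
An earlier item that currently sits in slot $j$ is evicted only when $s = j$,
which happens with probability $\frac{1}{N+1}$, so it survives with probability
$\frac{N}{N+1}$. Hence
\[
  P(\text{earlier item is in the reservoir after } N+1 \text{ items})
  = \frac{K}{N} \cdot \frac{N}{N+1} = \frac{K}{N+1}.
\]
```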

Answer 1: (score: 8)

Following the description in Knuth (1981), reservoir sampling (Algorithm R) can be implemented as follows:

import random

def sample(iterable, n):
    """
    Returns @param n random items from @param iterable.
    """
    reservoir = []
    for t, item in enumerate(iterable):
        if t < n:
            reservoir.append(item)
        else:
            m = random.randint(0,t)
            if m < n:
                reservoir[m] = item
    return reservoir
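
If you want to convince yourself that the sample is uniform, one option is a quick empirical check. The sketch below repeats the `sample` function from this answer so that it runs on its own; `inclusion_frequencies` and its parameters (`stream_size`, `trials`, `seed`) are names made up for this illustration. It feeds a generator (so the length is not known up front) through `sample` many times and records how often each item ends up in the reservoir. Each frequency should come out close to n / stream_size.

```python
import random
from collections import Counter


def sample(iterable, n):
    """Algorithm R, as in the answer above: keep n items from iterable."""
    reservoir = []
    for t, item in enumerate(iterable):
        if t < n:
            reservoir.append(item)
        else:
            # randint is inclusive on both ends: m is uniform over 0 .. t
            m = random.randint(0, t)
            if m < n:
                reservoir[m] = item
    return reservoir


def inclusion_frequencies(stream_size, n, trials, seed=0):
    """Hypothetical helper: fraction of trials in which each item was kept."""
    random.seed(seed)  # fixed seed only so that repeated runs agree
    counts = Counter()
    for _ in range(trials):
        # a generator has no len(), so sample() cannot know the size in advance
        stream = (i for i in range(stream_size))
        counts.update(sample(stream, n))
    return {i: counts[i] / trials for i in range(stream_size)}


freqs = inclusion_frequencies(stream_size=10, n=3, trials=20000)
for item, freq in sorted(freqs.items()):
    print(item, round(freq, 3))  # each value should be near 3 / 10
```

Note that when the input has fewer than n items, `sample` simply returns all of them.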

Answer 2: (score: 1)

Java

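import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Random;

// The class name is arbitrary; it is only here so that the method compiles.
public class Reservoir
{
    // Fills list with up to list.length lines picked uniformly at random from the file.
    public static void reservoir(String filename, String[] list) throws IOException
    {
        String l;
        int c = 0, r;   // c counts the lines read so far
        Random g = new Random();

        // try-with-resources closes the reader once, after the whole file has been read
        try (BufferedReader b = new BufferedReader(new FileReader(filename)))
        {
            while ((l = b.readLine()) != null)
            {
                if (c < list.length)
                    r = c++;               // still filling the reservoir
                else
                    r = g.nextInt(++c);    // uniform over 0 .. c-1

                if (r < list.length)
                    list[r] = l;
            }
        }
    }
}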

Answer 3: (score: 0)

Python solution

import random

class RESERVOIR_SAMPLING():
    def __init__(self, k=1000):
        self.reservoir = [] 
        self.k = k
        self.nb_processed = 0

    def add_to_reservoir(self, sample):
        self.nb_processed +=1
        if(self.k >= self.nb_processed):
            self.reservoir.append(sample)
        else:
            #randint(a,b) gives a<=int<=b
            j = random.randint(0,self.nb_processed-1)
            # compare against self.k, not the module-level k
            if(j < self.k):
                self.reservoir[j] = sample

k = 10
samples = [i for i in range(10)] * k
res = RESERVOIR_SAMPLING(k)
for sample in samples:
    res.add_to_reservoir(sample)

print(res.reservoir)

out[1]: [9, 8, 4, 8, 3, 5, 1, 7, 0, 9]