Question

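I am trying to test how likely it is that a particular clustering of data occurred by chance. A robust way to do this is Monte Carlo simulation, in which the associations between data and groups are randomly reassigned a large number of times (e.g. 10,000), and a clustering metric is used to compare the actual data with the simulations to determine a p-value.

I have most of this working, with pointers mapping the grouping to the data elements, so I plan to randomly reassign the pointers to the data. THE QUESTION: what is a fast way to sample without replacement, so that every pointer is randomly reassigned in each replicate data set?

For example (these data are just a simplified example):

Data (n = 12 values) - Group A: 0.1, 0.2, 0.4 / Group B: 0.5, 0.6, 0.8 / Group C: 0.4, 0.5 / Group D: 0.2, 0.2, 0.3, 0.5

For each replicate data set, I would keep the same cluster sizes (A = 3, B = 3, C = 2, D = 4) and the same data values, but reassign the values to the clusters.

To do this, I could generate a random number in the range 1-12 and assign the first element of group A, then generate a random number in the range 1-11 and assign the second element of group A, and so on. The pointer reassignment is fast, and I will have pre-allocated all data structures, but sampling without replacement seems like a problem that has probably been solved many times before.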
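
The draw-from-a-shrinking-range procedure above is equivalent to shuffling the 12 values and cutting the shuffled array into consecutive slices of the cluster sizes. A minimal sketch of one replicate, assuming the values live in a std::vector<double> and using std::shuffle; the name MakeReplicate is hypothetical and not part of any existing code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: shuffles `values` in place and returns, for each
// group, the slice of shuffled values assigned to it.
std::vector<std::vector<double>> MakeReplicate(
    std::vector<double>& values,
    const std::vector<std::size_t>& groupSizes,
    std::mt19937& rng)
{
    // The group sizes must account for every value exactly once.
    assert(std::accumulate(groupSizes.begin(), groupSizes.end(),
                           std::size_t{0}) == values.size());

    // Uniform random permutation of the values (sampling without replacement).
    std::shuffle(values.begin(), values.end(), rng);

    // Cut the permutation into consecutive slices, one per group.
    std::vector<std::vector<double>> groups;
    std::size_t offset = 0;
    for (std::size_t size : groupSizes) {
        groups.emplace_back(values.begin() + offset,
                            values.begin() + offset + size);
        offset += size;
    }
    return groups;
}
```

Calling this 10,000 times with the same vector and group sizes {3, 3, 2, 4} would give the replicate data sets; a real implementation could shuffle pointers or indices instead of copying values.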

Logic or pseudocode preferred.

Answer 1

Here is some code for sampling without replacement, based on Algorithm 3.4.2S from Knuth's book Seminumerical Algorithms.

void SampleWithoutReplacement
(
    int populationSize,    // size of set sampling from
    int sampleSize,        // size of each sample
    vector<int> & samples  // output, zero-offset indices to selected items
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;

    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far
    double u;

    while (m < n)
    {
        u = GetUniform(); // call a uniform(0,1) random number generator

        if ( (N - t)*u >= n - m )
        {
            t++;
        }
        else
        {
            samples[m] = t;
            t++; m++;
        }
    }
}
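
As a check on why the comparison in the loop yields a uniform sample, here is a short worked illustration (my own arithmetic, not quoted from Knuth), using N = 12 and n = 3 as in the question and assuming GetUniform() returns values in [0, 1):

```latex
% Selection rule: with t records examined and m already selected,
% record t+1 is taken exactly when (N - t) u < n - m, u uniform on [0, 1):
\[
  \Pr(\text{select record } t+1 \mid m \text{ selected so far}) = \frac{n - m}{N - t}.
\]
% Example with N = 12, n = 3: the second record is selected with probability
\[
  \frac{3}{12}\cdot\frac{2}{11} + \frac{9}{12}\cdot\frac{3}{11}
  = \frac{6 + 27}{132} = \frac{3}{12},
\]
% the same inclusion probability n/N as the first record.
```

Once the number of records left equals the number still needed, the ratio is 1 and every remaining record is taken, so for n <= N the loop ends with exactly n indices, in increasing order.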

There is a more efficient but more complex method by Jeffrey Scott Vitter in "An Efficient Algorithm for Sequential Random Sampling," ACM Transactions on Mathematical Software, 13(1), March 1987, 58-67.

Answer 2

Working C++ code based on the answer by John D. Cook.

#include <random>
#include <vector>

double GetUniform()
{
    static std::default_random_engine re;
    static std::uniform_real_distribution<double> Dist(0,1);
    return Dist(re);
}

// John D. Cook, https://stackoverflow.com/a/311716/15485
void SampleWithoutReplacement
(
    int populationSize,    // size of set sampling from
    int sampleSize,        // size of each sample
    std::vector<int> & samples  // output, zero-offset indices to selected items
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;

    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far
    double u;

    while (m < n)
    {
        u = GetUniform(); // call a uniform(0,1) random number generator

        if ( (N - t)*u >= n - m )
        {
            t++;
        }
        else
        {
            samples[m] = t;
            t++; m++;
        }
    }
}

#include <iostream>
int main(int,char**)
{
  const size_t sz = 10;
  std::vector< int > samples(sz);
  SampleWithoutReplacement(10*sz,sz,samples);
  for (size_t i = 0; i < sz; i++ ) {
    std::cout << samples[i] << "\t";
  }

  return 0;
}

Answer 3

See my answer to this question: Unique (non-repeating) random numbers in O(1)?. The same logic should accomplish what you are trying to do.
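
The linked answer is not reproduced on this page. Assuming it refers to the common swap-with-last technique (a partial Fisher-Yates shuffle), a minimal sketch might look like the following; the class name IndexPool and its methods are hypothetical:

```cpp
#include <cassert>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper class: hands out each index in [0, populationSize)
// exactly once, in random order, with O(1) work per draw.
class IndexPool {
public:
    explicit IndexPool(int populationSize)
        : pool_(populationSize), remaining_(populationSize) {
        std::iota(pool_.begin(), pool_.end(), 0);  // pool_ = 0, 1, ..., N-1
    }

    bool Empty() const { return remaining_ == 0; }

    // Draws one not-yet-used index uniformly at random.
    int Draw(std::mt19937& rng) {
        assert(remaining_ > 0);
        std::uniform_int_distribution<int> pick(0, remaining_ - 1);
        int r = pick(rng);
        // Move the chosen index to the end of the live range, then shrink it.
        std::swap(pool_[r], pool_[remaining_ - 1]);
        return pool_[--remaining_];
    }

    // Makes all indices available again without reallocating.
    void Reset() { remaining_ = static_cast<int>(pool_.size()); }

private:
    std::vector<int> pool_;
    int remaining_;
};
```

Because Reset() only restores the counter, the same pool can be reused for each of the 10,000 replicates without any allocation.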

Answer 4

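Inspired by @John D. Cook's answer, I wrote an implementation in Nim. At first I had difficulty understanding how it works, so I commented it extensively, including an example. Maybe it helps to understand the idea. I also changed the variable names slightly.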


Answer 5

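When the population size is much greater than the sample size, the above algorithms become inefficient, since they have complexity O(n), where n is the population size.

When I was a student I wrote some algorithms for uniform sampling without replacement that have average complexity O(s log s), where s is the sample size. Here is the code for the binary tree algorithm, with average complexity O(s log s), in R: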

# The Tree growing algorithm for uniform sampling without replacement
# by Pavel Ruzankin 
quicksample = function (n,size)
# n - the number of items to choose from
# size - the sample size
{
  s=as.integer(size)
  if (s>n) {
    stop("Sample size is greater than the number of items to choose from")
  }
  # upv=integer(s) #level up edge is pointing to
  leftv=integer(s) #left edge is pointing to; must be filled with zeros
  rightv=integer(s) #right edge is pointing to; must be filled with zeros
  samp=integer(s) #the sample
  ordn=integer(s) #relative ordinal number

  ordn[1L]=1L #initial value for the root vertex
  samp[1L]=sample(n,1L) 
  if (s > 1L) for (j in 2L:s) {
    curn=sample(n-j+1L,1L) #current number sampled
    curordn=0L #current ordinal number
    v=1L #current vertex
    from=1L #how we arrived here: 0 - by left edge, 1 - by right edge
    repeat {
      curordn=curordn+ordn[v]
      if (curn+curordn>samp[v]) { #going down by the right edge
        if (from == 0L) {
          ordn[v]=ordn[v]-1L
        }
        if (rightv[v]!=0L) {
          v=rightv[v]
          from=1L
        } else { #creating a new vertex
          samp[j]=curn+curordn
          ordn[j]=1L
          # upv[j]=v
          rightv[v]=j
          break
        }
      } else { #going down by the left edge
        if (from==1L) {
          ordn[v]=ordn[v]+1L
        }
        if (leftv[v]!=0L) {
          v=leftv[v]
          from=0L
        } else { #creating a new vertex
          samp[j]=curn+curordn-1L
          ordn[j]=-1L
          # upv[j]=v
          leftv[v]=j
          break
        }
      }
    }
  }
  return(samp)  
}

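The complexity of this algorithm is discussed in: Rouzankin, P. S.; Voytishek, A. V. On the cost of algorithms for random selection. Monte Carlo Methods Appl. 5 (1999), no. 1, 39-54. http://dx.doi.org/10.1515/mcma.1999.5.1.39

If you find the algorithm useful, please cite it.

See also: P. Gupta, G. P. Bhattacharjee. (1984) An efficient algorithm for random sampling without replacement. International Journal of Computer Mathematics 16:4, pages 201-209. DOI: 10.1080/00207168408803438

Teuhola, J. and Nevalainen, O. 1982. Two efficient algorithms for random sampling without replacement. IJCM, 11(2): 127-140. DOI: 10.1080/00207168208803304

In the last paper the authors use hash tables and claim that their algorithms have O(s) complexity. There is one more fast hash table algorithm, which will soon be implemented in pqR (pretty quick R): https://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html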
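
To illustrate what a hash-table approach with expected O(s) cost can look like, here is a minimal C++ sketch of Robert Floyd's algorithm. This is a different, well-known technique, not the algorithm from the papers cited above or the one planned for pqR; the function name FloydSample is hypothetical:

```cpp
#include <cassert>
#include <random>
#include <unordered_set>
#include <vector>

// Robert Floyd's algorithm: returns sampleSize distinct integers from
// [0, populationSize), using a hash set and exactly sampleSize random draws.
std::vector<int> FloydSample(int populationSize, int sampleSize, std::mt19937& rng)
{
    assert(0 <= sampleSize && sampleSize <= populationSize);

    std::unordered_set<int> chosen;
    std::vector<int> samples;
    samples.reserve(sampleSize);

    for (int j = populationSize - sampleSize; j < populationSize; ++j) {
        std::uniform_int_distribution<int> pick(0, j);
        int t = pick(rng);
        // If t was already taken, j itself cannot have been (it was out of
        // range in every earlier iteration), so take j instead.
        if (!chosen.insert(t).second) {
            t = j;
            chosen.insert(j);
        }
        samples.push_back(t);
    }
    return samples;
}
```

The returned set is a uniform random subset, but the order of the elements is not a uniform random permutation, so shuffle the result if the order matters.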

Answer 6

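Another algorithm for sampling without replacement is described here.

It is similar to the one described by John D. Cook in his answer, which also comes from Knuth, but it makes a different assumption: the population size is unknown, but the sample fits in memory. This one is called "Knuth's algorithm S".

Quoting the rosettacode article:

1. Select the first n items as the sample as they become available;

2. For the i-th item where i > n, have a random chance of n/i of keeping it. If failing this chance, the sample remains the same. If not, have it randomly (1/n) replace one of the previously selected n items of the sample.

3. Repeat #2 for any subsequent items.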
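
The quoted steps can be sketched in C++ roughly as follows. This is a minimal illustration, assuming integer items and a std::mt19937 generator; the class name Reservoir and its methods are hypothetical. A single draw from [0, i) covers both parts of step 2: it lands below n with probability n/i, and when it does it is uniform over the n slots.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper class: keeps a uniform random sample of up to
// `capacity` items from a stream whose length is not known in advance.
class Reservoir {
public:
    explicit Reservoir(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    void Offer(int item, std::mt19937& rng) {
        ++seen_;
        if (sample_.size() < capacity_) {
            sample_.push_back(item);  // step 1: keep the first n items
            return;
        }
        // Step 2: keep item i with probability n/i ...
        std::uniform_int_distribution<std::size_t> pick(0, seen_ - 1);
        std::size_t slot = pick(rng);
        if (slot < capacity_) {
            sample_[slot] = item;     // ... replacing a uniformly chosen slot
        }
    }

    const std::vector<int>& Sample() const { return sample_; }

private:
    std::size_t capacity_;
    std::size_t seen_ = 0;
    std::vector<int> sample_;
};
```

Calling Offer() once per incoming item corresponds to step 3; Sample() can be read at any point and holds min(n, items seen) elements.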

Answer 7

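I wrote a survey of algorithms for sampling without replacement. I may be biased, but I recommend my own algorithm, implemented in C++ below, as providing the best performance for many values of k and n and acceptable performance for others. randbelow(i) is assumed to return a fairly chosen random non-negative integer less than i.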

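#include <algorithm>  // std::sort
#include <cstdint>    // uint32_t

// Writes k distinct indices from [0, n), in increasing order, into result.
void cardchoose(uint32_t n, uint32_t k, uint32_t* result) {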
    auto t = n - k + 1;
    for (uint32_t i = 0; i < k; i++) {
        uint32_t r = randbelow(t + i);
        if (r < t) {
            result[i] = r;
        } else {
            result[i] = result[r - t];
        }
    }
    std::sort(result, result + k);
    for (uint32_t i = 0; i < k; i++) {
        result[i] += i;
    }
}
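
The answer leaves randbelow undefined. One possible implementation, given here only as an assumption about what such a helper could look like, uses the standard <random> facilities:

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical implementation of the randbelow(i) helper the answer assumes:
// a uniformly distributed integer in [0, i).
std::uint32_t randbelow(std::uint32_t i)
{
    assert(i > 0);
    // One engine for the whole program, seeded once from the system source.
    static std::mt19937 engine{std::random_device{}()};
    std::uniform_int_distribution<std::uint32_t> dist(0, i - 1);
    return dist(engine);
}
```

With this declared before cardchoose, a call such as cardchoose(12, 3, out) fills out with three distinct indices in [0, 12) in increasing order: the values are at most n - k before the final loop, and adding i to the i-th sorted value makes them strictly increasing.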

Algorithm for sampling without replacement?

7 answers: