Question

Suppose I have a sorted array of values:

int n = 4; // always less than or equal to the number of unique values in the array
int i[256] = {0};
int v[] = {1, 1, 2, 4, 5, 5, 5, 5, 5, 7, 7, 9, 9, 11, 11, 13};
// EX 1             ^              ^           ^          ^
// EX 2       ^                          ^            ^   ^
// EX 3       ^  ^                 ^                      ^

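I want to generate n random index values i[0] ... i[n-1] such that:

v[i[0]] ... v[i[n-1]] each point to a unique number (i.e. no two of them may point to 5)
Each index must point to the rightmost number of its kind (i.e. it must point to the last 5)
An index pointing to the final number (13 in this case) must always be included.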
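
The three requirements can be written down as a small checker, which is handy for testing any candidate solution. This is only a sketch: the function name is_valid_selection() and its signature are made up for illustration, and it assumes that v is sorted ascending and that the indices are passed in ascending order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of the question): returns true if
 * idx[0..n-1] satisfies the three requirements for the sorted array
 * v[0..len-1]. Assumes idx is given in ascending order. */
bool is_valid_selection(const int *v, size_t len, const size_t *idx, size_t n) {
    assert(v != NULL && idx != NULL);
    if (n == 0 || len == 0) return false;
    for (size_t k = 0; k < n; ++k) {
        size_t j = idx[k];
        if (j >= len) return false; /* index out of bounds */
        /* Rightmost of its kind: the next element, if any, must differ. */
        if (j + 1 < len && v[j + 1] == v[j]) return false;
        /* Unique values: every index is already a last occurrence, so
         * strictly increasing indices imply distinct values. */
        if (k > 0 && idx[k] <= idx[k - 1]) return false;
    }
    /* The final number must always be included. */
    return idx[n - 1] == len - 1;
}
```

For the array in the question, the EX 1 selection corresponds to the indices {3, 8, 12, 15}, which this checker accepts. Replacing 8 with 7 is rejected, because index 7 is not the last 5.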

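So far I have tried:

Getting the last index of each unique value
Shuffling the indices
Picking the first n indices

I am implementing this in C, so the more standard C functions I can rely on and the shorter the code, the better. (For example, shuffle is not a standard C function, but if I must, I must.)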

Answer 1

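Create an array holding the last index of each unique value, leaving out the final number (its index is added later):

int last[] = { 1, 2, 3, 8, 10, 12, 14 };

Fisher-Yates shuffle the array.

Take the first n-1 elements from the shuffled array.

Add the index of the final number.

Sort the resulting array, if desired.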
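
The steps above can be sketched in C roughly as follows. This is a minimal sketch, not the answerer's code: the names pick_indices(), rand_below() and cmp_size() are made up for illustration, rand() is used only because it is standard C (its quality varies by platform), and the array is assumed to be sorted and to hold no more than RAND_MAX elements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: uniform integer in [0, bound), using rejection
 * to avoid modulo bias. Assumes 0 < bound <= RAND_MAX. */
static size_t rand_below(size_t bound) {
    assert(bound > 0 && bound <= (size_t)RAND_MAX);
    size_t limit = (size_t)RAND_MAX - (size_t)RAND_MAX % bound;
    size_t r;
    do {
        r = (size_t)rand();
    } while (r >= limit);
    return r % bound;
}

/* Comparison callback for qsort() on size_t values. */
static int cmp_size(const void *a, const void *b) {
    size_t x = *(const size_t *)a, y = *(const size_t *)b;
    return (x > y) - (x < y);
}

/* Writes n sorted indices into out and returns n. Returns 0 if v has
 * fewer than n unique values, if n or len is 0, or if malloc fails. */
size_t pick_indices(const int *v, size_t len, size_t *out, size_t n) {
    if (len == 0 || n == 0) return 0;
    size_t *last = malloc(len * sizeof *last);
    if (last == NULL) return 0;

    /* Collect the last index of every value except the final one. */
    size_t count = 0;
    for (size_t k = 0; k + 1 < len; ++k)
        if (v[k] != v[k + 1]) last[count++] = k;
    if (count + 1 < n) { free(last); return 0; }

    /* Fisher-Yates shuffle of the collected indices. */
    for (size_t k = count; k > 1; --k) {
        size_t j = rand_below(k);
        size_t tmp = last[k - 1];
        last[k - 1] = last[j];
        last[j] = tmp;
    }

    /* Take the first n-1 shuffled indices, then add the final index. */
    for (size_t k = 0; k + 1 < n; ++k) out[k] = last[k];
    out[n - 1] = len - 1;
    free(last);

    /* Sort the result. */
    qsort(out, n, sizeof *out, cmp_size);
    return n;
}
```

With the array from the question and n = 4, one possible result is {3, 8, 12, 15}, which matches EX 1. Call srand() once beforehand if a different sequence is wanted on each run.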

Answer 2

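The algorithm is called reservoir sampling, and it can be used whenever you know how big a sample you need but not how many elements you are sampling from. (The name comes from the idea that you always maintain a reservoir holding the correct number of samples. When a new value comes in, you mix it into the reservoir, remove a random element, and continue.)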

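1. Create an array sample of size n for the return value.
2. Start scanning the input array. Each time you find a new value, add its index to the end of sample, until you have n sampled elements.
3. Continue scanning the array, but now when you find a new value:

   a. Choose a random number r in the range [0, i), where i is the number of unique values seen so far.

   b. If r is less than n, overwrite element r with the new element.
4. When you reach the end, sort sample if you need it sorted.

To make sure the last element is always in the sample, run the algorithm above to select a sample of size n-1, and consider an element only once a larger element has been found after it. The last element is then never a candidate, so it can be added to the sample separately.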

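The algorithm is linear in the size of v (plus an n log n term for the sort in the last step). If you already have the list of last indices of each value, there are faster algorithms (but then you would know the size of the universe before you started sampling; reservoir sampling is mainly useful when you do not know it).

In fact, it is not conceptually different from collecting all the indices and then taking the prefix of a Fisher-Yates shuffle. But it uses O(n) temporary memory instead of enough memory to store the entire index list, which may be considered a plus.

Here is an untested sample C implementation (which requires you to write the function randrange()):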

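#include <stddef.h>

/* You must write this yourself. Assumed signature: returns a uniformly
 * distributed value in the half-open range [lo, hi). */
size_t randrange(size_t lo, size_t hi);

/* Produces (in `out`) a uniformly distributed sample of maximum size
 * `outlen` of the indices of the last occurrences of each unique
 * element in `in`, with the requirement that the last element must
 * be in the sample.
 * Requires: `in` must be sorted.
 * Returns: the size of the generated sample, which will be `outlen`
 *          unless there were not enough unique elements.
 * Note: `out` is not sorted, except that the last element in the
 *       generated sample is the last valid index in `in`.
 */
size_t sample(const int* in, size_t inlen, size_t* out, size_t outlen) {
  if (inlen == 0 || outlen == 0) return 0;
  size_t found = 0;
  // The last output is fixed so we need outlen-1 random indices
  --outlen;
  int prev = in[0];
  for (size_t curr = 1; curr < inlen; ++curr) {
    if (in[curr] == prev) continue;
    // in[curr - 1] was the last occurrence of prev, so curr - 1 is a candidate
    ++found;
    if (found <= outlen) {
      // Reservoir not yet full: append the candidate
      out[found - 1] = curr - 1;
    } else {
      // Reservoir full: the candidate replaces a random slot
      // with probability outlen / found
      size_t r = randrange(0, found);
      if (r < outlen) out[r] = curr - 1;
    }
    prev = in[curr];
  }
  // Add the last index to the output
  if (found > outlen) found = outlen;
  out[found] = inlen - 1;
  // `found` sampled indices plus the fixed last index
  return found + 1;
}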
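
The implementation above leaves randrange() to the reader. One possible version is sketched below, assuming that randrange(lo, hi) is meant to return a uniformly distributed value in the half-open range [lo, hi). The signature is a guess based on how the function is called, and rand() is used only because it is standard C; a better generator may be preferable in practice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed contract: uniform value in [lo, hi).
 * Requires lo < hi and hi - lo <= RAND_MAX. */
size_t randrange(size_t lo, size_t hi) {
    assert(lo < hi && hi - lo <= (size_t)RAND_MAX);
    size_t span = hi - lo;
    /* Largest multiple of span that does not exceed RAND_MAX. Values
     * of rand() at or above it are rejected to avoid modulo bias. */
    size_t limit = (size_t)RAND_MAX - (size_t)RAND_MAX % span;
    size_t r;
    do {
        r = (size_t)rand();
    } while (r >= limit);
    return lo + r % span;
}
```

Seed the generator once with srand(), for example srand((unsigned)time(NULL)) with <time.h> included, before the first call if the sample should differ between runs.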

Selecting random indices into a sorted array
