Question

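How can I sample without replacement in TensorFlow? I want something like numpy.random.choice(n, size=k, replace=False) for some very large integer n (e.g. 100k-100M) and a smaller k (e.g. 100-10k). I also want it to be efficient on the GPU, so other solutions such as this one or tf.py_func are not really an option for me. Anything that uses tf.range(n) or similar is not an option either, because n can be very large.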

Answer 1

Here is one way:

n = ...
sample_size = ...
idx = tf.random_shuffle(tf.range(n))[:sample_size]

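EDIT:

I had posted the answer above but then read the last line of your post. If you absolutely cannot produce a tensor of size O(n), I don't think there is a good way to do it (numpy.random.choice with replace=False is also implemented as a slice of a permutation). You could always run a tf.while_loop until you have unique indices: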

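n = ...
sample_size = ...
idx = tf.zeros(sample_size, dtype=tf.int64)
idx = tf.while_loop(
    # Keep looping while the current draw still contains duplicates
    lambda idx: tf.size(tf.unique(idx)[0]) < tf.size(idx),
    # Redraw the whole vector of candidate indices
    lambda idx: tf.random_uniform([sample_size], maxval=n, dtype=tf.int64),
    [idx])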
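
The redraw-until-unique idea can also be sketched in plain NumPy, which makes the loop condition easier to see. This is only an illustrative sketch: the function name sample_by_rejection and the max_tries safety limit are assumptions added here, not part of the TensorFlow snippet.

```python
import numpy as np

def sample_by_rejection(n, k, seed=None, max_tries=1000):
    """Draw k indices from range(n), redrawing until all are distinct."""
    rng = np.random.default_rng(seed)
    for _ in range(max_tries):
        # Sample with replacement; memory use is O(k), not O(n)
        idx = rng.integers(0, n, size=k)
        # Accept the draw only if it contains no duplicates
        if np.unique(idx).size == k:
            return idx
    raise RuntimeError("no duplicate-free draw found; k may be too large for n")

sample = sample_by_rejection(10**8, 100, seed=0)
```

This only works well when k is much smaller than n, since otherwise almost every draw contains a duplicate.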

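EDIT 2:

About the average number of iterations in the previous method. If we call n the number of possible values and k the length of the desired vector (with k ≤ n), the probability that an iteration succeeds is:

$$p = \prod_{i=1}^{k} \frac{n - (i - 1)}{n}$$

Since each iteration can be regarded as a Bernoulli trial, the average number of trials until the first success is 1/p (proof here). Here is a Python function that computes the average number of trials for given values of k and n: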

def avg_iter(k, n):
    if k > n or n <= 0 or k < 0:
        raise ValueError()
    avg_it = 1.0
    for p in (float(n) / (n - i) for i in range(k)):
        avg_it *= p
    return avg_it
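
As a quick hand check of the formula, take n = 10 and k = 5:

$$p = \frac{10 \cdot 9 \cdot 8 \cdot 7 \cdot 6}{10^5} = \frac{30240}{100000} = 0.3024, \qquad \frac{1}{p} \approx 3.3$$

So about 3.3 draws are needed on average, which agrees with avg_iter(5, 10).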

Here are some results:

+-------+------+----------+
|   n   |  k   | Avg iter |
+-------+------+----------+
|    10 |    5 | 3.3      |
|   100 |   10 | 1.6      |
|  1000 |   10 | 1.0      |
|  1000 |  100 | 167.8    |
| 10000 |   10 | 1.0      |
| 10000 |  100 | 1.6      |
| 10000 | 1000 | 2.9e+22  |
+-------+------+----------+

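As you can see, it varies a lot depending on the parameters.

It is possible, however, to construct the vector in a fixed number of steps, although the only algorithm I can think of is O(k²). In pure Python, it looks like this: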

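import random

def sample_wo_replacement(n, k):
    # Draw the i-th value from the n - i values that are still available
    sample = [0] * k
    for i in range(k):
        sample[i] = random.randint(0, n - 1 - i)
    # Going backwards, shift each value past the earlier draws so that
    # all entries end up distinct
    for i, v in reversed(list(enumerate(sample))):
        for p in reversed(sample[:i]):
            if v >= p:
                v += 1
        sample[i] = v
    return sample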

random.seed(100)
print(sample_wo_replacement(10, 5))
# [2, 8, 9, 7, 1]
print(sample_wo_replacement(10, 10))
# [6, 5, 8, 4, 0, 9, 1, 2, 7, 3]

Here is a possible way to do it in TensorFlow (not sure if it is the best one):

import tensorflow as tf

def sample_wo_replacement_tf(n, k):
    # First loop
    sample = tf.constant([], dtype=tf.int64)
    i = 0
    sample, _ = tf.while_loop(
        lambda sample, i: i < k,
        # This is ugly but I did not want to define more functions
        lambda sample, i: (tf.concat([sample,
                                      tf.random_uniform([1], maxval=tf.cast(n - tf.shape(sample)[0], tf.int64), dtype=tf.int64)],
                                     axis=0),
                           i + 1),
        [sample, i], shape_invariants=[tf.TensorShape((None,)), tf.TensorShape(())])
    # Second loop
    def inner_loop(sample, i):
        sample_size = tf.shape(sample)[0]
        v = sample[i]
        j = i - 1
        v, _ = tf.while_loop(
            lambda v, j: j >= 0,
            lambda v, j: (tf.cond(v >= sample[j], lambda: v + 1, lambda: v), j - 1),
            [v, j])
        return (tf.where(tf.equal(tf.range(sample_size), i), tf.tile([v], (sample_size,)), sample), i - 1)
    i = tf.shape(sample)[0] - 1
    sample, _ = tf.while_loop(lambda sample, i: i >= 0, inner_loop, [sample, i])
    return sample

An example:

with tf.Graph().as_default(), tf.Session() as sess:
    tf.set_random_seed(100)
    sample = sample_wo_replacement_tf(10, 5)
    for i in range(10):
        print(sess.run(sample))
# [3 0 6 8 4]
# [5 4 8 9 3]
# [1 4 0 6 8]
# [8 9 5 6 7]
# [7 5 0 2 4]
# [8 4 5 3 7]
# [0 5 7 4 3]
# [2 0 3 8 6]
# [3 4 8 5 1]
# [5 7 0 2 9]

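This relies heavily on tf.while_loop, though, which is known not to be particularly fast in TensorFlow, so I cannot say how fast this method can really get without some kind of benchmark.

EDIT 4:

One last possible method. You can divide the range of possible values (0 to n) into "chunks" of size c, pick a random number of values from each chunk, and then shuffle everything. The amount of memory you use is bounded by c, and you do not need nested loops. If n is divisible by c, you should get a perfectly random distribution; otherwise the values in the last "short" chunk get some extra probability (which may be negligible, depending on the case). Here is a NumPy implementation. It is somewhat long because it accounts for several corner cases and pitfalls, but if c ≥ k and n mod c = 0 several parts get simpler.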

import numpy as np

def sample_chunked(n, k, chunk=None):
    chunk = chunk or n
    last_chunk = chunk
    parts = n // chunk
    # Distribute k among chunks
    max_p = min(float(chunk) / k, 1.0)
    max_p_last = max_p
    if n % chunk != 0:
        parts += 1
        last_chunk = n % chunk
        max_p_last = min(float(last_chunk) / k, 1.0)
    p = np.full(parts, 2)
    # Iterate until a valid distribution is found
    while not np.isclose(np.sum(p), 1) or np.any(p > max_p) or p[-1] > max_p_last:
        p = np.random.uniform(size=parts)
        p /= np.sum(p)
    dist = (k * p).astype(np.int64)
    sample_size = np.sum(dist)
    # Account for rounding errors
    while sample_size < k:
        i = np.random.randint(len(dist))
        while (dist[i] >= chunk) or (i == parts - 1 and dist[i] >= last_chunk):
            i = np.random.randint(len(dist))
        dist[i] += 1
        sample_size += 1
    while sample_size > k:
        i = np.random.randint(len(dist))
        while dist[i] == 0:
            i = np.random.randint(len(dist))
        dist[i] -= 1
        sample_size -= 1
    assert sample_size == k
    # Generate sample parts
    sample_parts = []
    for i, v in enumerate(np.nditer(dist)):
        if v <= 0:
            continue
        c = chunk if i < parts - 1 else last_chunk
        base = chunk * i
        sample_parts.append(base + np.random.choice(c, v, replace=False))
    sample = np.concatenate(sample_parts, axis=0)
    np.random.shuffle(sample)
    return sample

np.random.seed(100)
print(sample_chunked(15, 5, 4))
# [ 8  9 12 13  3]

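A quick benchmark of sample_chunked(100000000, 100000, 100000) takes about 3.1 seconds on my computer, whereas I could not get the previous algorithm (the sample_wo_replacement function above) to finish with the same parameters. It should be possible to implement it in TensorFlow, perhaps using tf.TensorArray, although it would take considerable effort to get it exactly right.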
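
For comparison, here is a shorter sketch of the same chunking idea. It differs from sample_chunked in one respect: it decides how many values to take from each chunk with NumPy's Generator.multivariate_hypergeometric instead of the normalize-and-retry loop. That is a swapped-in technique, not the method above, and the name sample_chunked_exact is hypothetical. It assumes NumPy 1.18 or later.

```python
import numpy as np

def sample_chunked_exact(n, k, chunk, seed=None):
    """Sample k distinct values from range(n), working one chunk at a time."""
    rng = np.random.default_rng(seed)
    # Size of every chunk; the last one may be shorter
    sizes = np.full(n // chunk, chunk, dtype=np.int64)
    if n % chunk:
        sizes = np.append(sizes, n % chunk)
    # Number of sampled values that fall into each chunk
    counts = rng.multivariate_hypergeometric(sizes, k)
    parts = [
        i * chunk + rng.choice(int(size), int(count), replace=False)
        for i, (size, count) in enumerate(zip(sizes, counts))
        if count > 0
    ]
    sample = np.concatenate(parts) if parts else np.empty(0, dtype=np.int64)
    # Chunks were visited in order, so shuffle the final result
    rng.shuffle(sample)
    return sample

sample = sample_chunked_exact(15, 5, 4, seed=100)
```

Memory use is bounded by the chunk size plus one count per chunk, as in the version above.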

Answer 2

Use the Gumbel-max trick described here: https://github.com/tensorflow/tensorflow/issues/9260

z = -tf.log(-tf.log(tf.random_uniform(tf.shape(logits),0,1))) 
_, indices = tf.nn.top_k(logits + z,K)

The indices are what you want. This trick is really easy!
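
The snippet does not define logits or K. A minimal NumPy sketch of the same trick is given here, assuming logits is a length-n vector of unnormalized log-weights (all zeros for uniform sampling) and k is the sample size. The function name gumbel_top_k is hypothetical.

```python
import numpy as np

def gumbel_top_k(logits, k, seed=None):
    """Pick k distinct indices by adding Gumbel noise and taking the top k."""
    rng = np.random.default_rng(seed)
    # One independent Gumbel(0, 1) noise value per candidate index
    z = rng.gumbel(size=logits.shape)
    # Indices of the k largest perturbed scores, largest first
    return np.argsort(logits + z)[-k:][::-1]

# Uniform sampling of 10 distinct indices out of 1000
indices = gumbel_top_k(np.zeros(1000), 10, seed=0)
```

Note that this draws one noise value per candidate, so it needs O(n) memory just like the logits tensor in the TensorFlow version.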
