Question

To give some context, I am trying to implement a negative sampling scheme in TensorFlow, similar to the ones used in Bayesian Personalized Ranking and word2vec.

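In short, it boils down to drawing two random samples from a huge sparse matrix: some samples from the positive entries (i.e. the nonzero entries) and some from the negative entries (i.e. the zero entries). I already have an implementation in raw numpy / scipy (see below).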

import numpy as np


def subsampler(data, num_pos=10, num_neg=10):
    """Yield random batches made up of positive and negative samples.

    Parameters
    ----------
    data : scipy.sparse.coo_matrix
       Sparse matrix to obtain random samples from
    num_pos : int
       Number of positive samples
    num_neg : int
       Number of negative samples

    Returns
    -------
    positive_row : np.array
       Row ids of the positive samples
    positive_col : np.array
       Column ids of the positive samples
    positive_data : np.array
       Data values in the positive samples
    negative_row : np.array
       Row ids of the negative samples
    negative_col : np.array
       Column ids of the negative samples

    Note
    ----
    Negative data values are not returned, since they are
    always zero.
    """
    N, D = data.shape
    y_data = data.data
    y_row = data.row
    y_col = data.col

    # store all of the positive (i, j) coords
    idx = np.vstack((y_row, y_col)).T
    idx = set(map(tuple, idx.tolist()))
    while True:
        # get positive sample
        positive_idx = np.random.choice(len(y_data), num_pos)
        positive_row = y_row[positive_idx].astype(np.int32)
        positive_col = y_col[positive_idx].astype(np.int32)
        positive_data = y_data[positive_idx].astype(np.float32)

        # get negative sample
        negative_row = np.zeros(num_neg, dtype=np.int32)
        negative_col = np.zeros(num_neg, dtype=np.int32)
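        for k in range(num_neg):
            i, j = np.random.randint(N), np.random.randint(D)
            # resample until (i, j) is a zero entry of the matrix
            while (i, j) in idx:
                i, j = np.random.randint(N), np.random.randint(D)
            # store the accepted pair (must happen outside the retry loop)
            negative_row[k] = i
            negative_col[k] = j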

        yield (positive_row, positive_col, positive_data,
               negative_row, negative_col)

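This actually works quite well, but it turned out to be a bottleneck when I tried to scale to more cores (according to the documentation, passing these values through feed_dict does not scale well).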
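Part of the per-batch cost in the generator above is the Python-level loop over negative samples. The following is a minimal sketch of a batched rejection step in NumPy, assuming the matrix is sparse enough that most random draws land on zero entries. The helper name `sample_negatives` and the linear-key encoding `i * n_cols + j` are illustrative choices, not part of the original code, and this does not address the feed_dict overhead itself.

```python
import numpy as np


def sample_negatives(pos_row, pos_col, shape, num_neg, rng=None):
    """Draw num_neg (row, col) pairs that are not positive entries.

    Hypothetical helper: assumes the matrix has at least one zero entry,
    otherwise the loop below never terminates.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_rows, n_cols = shape

    # Encode each positive (i, j) as a single integer key so that membership
    # can be tested for a whole batch at once with np.isin.
    pos_keys = np.unique(
        np.asarray(pos_row, dtype=np.int64) * n_cols
        + np.asarray(pos_col, dtype=np.int64)
    )

    neg_keys = np.empty(0, dtype=np.int64)
    while neg_keys.size < num_neg:
        # Oversample candidates, then keep only those that are zero entries.
        cand = rng.integers(0, n_rows * n_cols, size=2 * num_neg, dtype=np.int64)
        cand = cand[~np.isin(cand, pos_keys)]
        neg_keys = np.concatenate([neg_keys, cand])

    # Decode the accepted keys back into row and column ids.
    neg_keys = neg_keys[:num_neg]
    neg_row = (neg_keys // n_cols).astype(np.int32)
    neg_col = (neg_keys % n_cols).astype(np.int32)
    return neg_row, neg_col
```

With a `coo_matrix` named `data`, a call might look like `sample_negatives(data.row, data.col, data.shape, 10)`. Drawing a uniform key in `[0, n_rows * n_cols)` is equivalent to drawing the row and the column independently, as the loop version does.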

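Now, I realize that TensorFlow has very similar premade samplers, such as tf.nn.uniform_candidate_sampler and tf.nn.fixed_unigram_candidate_sampler. However, I am a little fuzzy on the documentation, specifically for tf.nn.uniform_candidate_sampler. At first glance, it is not clear to me whether this function will explicitly yield negative samples (without any of the positive samples). Is this even the right function to use? Or is it necessary to write a new C++ op for this task?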

Similar questions may have been asked here and here.

Answer 1

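You can try implementing this almost exactly the way you did in numpy, using the low-level TensorFlow distributions. Most of what you need is already implemented: since a uniform distribution is available, you can build whatever distribution you want from it, and I do this all the time. For example, create a uniform sampling node and draw a vector of length N from it: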

import tensorflow as tf

N = 100

uni_pdf = tf.distributions.Uniform()
z = uni_pdf.sample(N)

if __name__ == '__main__':
    with tf.Session() as sess:
        print(sess.run(z))
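To illustrate how uniform draws can be turned into another distribution, here is a minimal sketch of inverse-CDF sampling. It is written in plain NumPy rather than TensorFlow so that it runs on its own, and `sample_from_weights` is a hypothetical helper name. It assumes non-negative weights (for example item or word frequencies) with a positive sum.

```python
import numpy as np


def sample_from_weights(weights, size, rng=None):
    """Draw indices with probability proportional to `weights`,
    using only uniform samples on [0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    weights = np.asarray(weights, dtype=np.float64)

    # Normalised cumulative distribution over the indices.
    cdf = np.cumsum(weights)
    cdf /= cdf[-1]

    # Uniform draws, analogous to uni_pdf.sample(N) above.
    u = rng.random(size)

    # Inverse-CDF step: pick the first index whose cumulative
    # probability is strictly greater than u.
    idx = np.searchsorted(cdf, u, side="right")

    # Guard against floating-point rounding at the upper edge.
    return np.minimum(idx, len(weights) - 1)
```

Indices with zero weight are never selected, and equal weights reduce this to plain uniform index sampling.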

I don't know anything about the tf.nn samplers, since I don't use them.

How to generate negative samples in TensorFlow
