Question

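I would like to ask whether the current Dataset API allows implementing an oversampling algorithm. I am dealing with a highly imbalanced class problem. I was thinking it would be nice to oversample specific classes during dataset parsing, i.e. online generation. I have seen the implementation of the rejection_resample function, but it removes samples instead of duplicating them, and it slows down batch generation (when the target distribution is very different from the initial one). What I would like to achieve is: take an example, look at its class probability, and decide whether to duplicate it. Then call dataset.shuffle(...) and dataset.batch(...) and get an iterator. The best approach (in my opinion) would be to oversample the low-probability classes and subsample the most probable ones. I would like to do it online because it is more flexible.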

Answer 1

This problem has been solved in issue #14451. I am just posting the answer here to make it more visible to other developers.

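The sample code oversamples low-frequency classes and undersamples high-frequency ones, where class_target_prob is just the uniform distribution in my case. I wanted to check some conclusions from the recent manuscript "A systematic study of the class imbalance problem in convolutional neural networks".

Oversampling of specific classes is done by calling: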

dataset = dataset.flat_map(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(oversample_classes(x))
)

Here is the complete snippet that does everything:

import tensorflow as tf  # TF 1.x API (tf.random_uniform, make_one_shot_iterator)

# sampling parameters
oversampling_coef = 0.9  # if equal to 0 then oversample_classes() always returns 1
undersampling_coef = 0.5  # if equal to 0 then undersampling_filter() always returns True

def oversample_classes(example):
    """
    Returns the number of copies of given example
    """
    class_prob = example['class_prob']
    class_target_prob = example['class_target_prob']
    prob_ratio = tf.cast(class_target_prob/class_prob, dtype=tf.float32)
    # soften the ratio: if oversampling_coef == 0 we recover the original distribution
    prob_ratio = prob_ratio ** oversampling_coef
    # for classes with probability higher than class_target_prob we
    # want to return 1
    prob_ratio = tf.maximum(prob_ratio, 1.0)
    # for low probability classes this number will be very large
    repeat_count = tf.floor(prob_ratio)
    # prob_ratio can be e.g. 1.9, which means there is still a 90%
    # chance that we should return 2 instead of 1
    repeat_residual = prob_ratio - repeat_count  # a number in [0, 1)
    residual_acceptance = tf.less_equal(
        tf.random_uniform([], dtype=tf.float32), repeat_residual
    )

    residual_acceptance = tf.cast(residual_acceptance, tf.int64)
    repeat_count = tf.cast(repeat_count, dtype=tf.int64)

    return repeat_count + residual_acceptance


def undersampling_filter(example):
    """
    Returns True if the given example is kept, False if it is rejected.
    """
    class_prob = example['class_prob']
    class_target_prob = example['class_target_prob']
    prob_ratio = tf.cast(class_target_prob/class_prob, dtype=tf.float32)
    prob_ratio = prob_ratio ** undersampling_coef
    prob_ratio = tf.minimum(prob_ratio, 1.0)

    acceptance = tf.less_equal(tf.random_uniform([], dtype=tf.float32), prob_ratio)

    return acceptance


dataset = dataset.flat_map(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(oversample_classes(x))
)

dataset = dataset.filter(undersampling_filter)

dataset = dataset.repeat(-1)
dataset = dataset.shuffle(2048)
dataset = dataset.batch(32)

# Assumes a TF 1.x session named `sess` already exists, e.g. sess = tf.Session()
sess.run(tf.global_variables_initializer())

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
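
To see what the two coefficients do to the class distribution, here is a minimal plain-Python sketch that mirrors the arithmetic of oversample_classes() and undersampling_filter() without TensorFlow. The helper names (expected_copies, keep_probability, sample_copies, resampled_distribution) and the three-class example distribution are made up for illustration; the target is assumed to be uniform, as in the answer.

```python
import random


def expected_copies(class_prob, class_target_prob, oversampling_coef):
    """Mean number of copies per example (same formula as oversample_classes)."""
    ratio = (class_target_prob / class_prob) ** oversampling_coef
    return max(ratio, 1.0)


def keep_probability(class_prob, class_target_prob, undersampling_coef):
    """Probability that an example survives (same formula as undersampling_filter)."""
    ratio = (class_target_prob / class_prob) ** undersampling_coef
    return min(ratio, 1.0)


def sample_copies(ratio, rng):
    """Stochastic rounding: floor(ratio) copies, plus one more with
    probability equal to the fractional part, so the mean equals ratio."""
    base = int(ratio)  # ratio is >= 1 here, so int() acts as floor
    return base + (1 if rng.random() < ratio - base else 0)


def resampled_distribution(class_probs, oversampling_coef, undersampling_coef):
    """Expected class distribution after oversampling and then filtering,
    assuming a uniform target distribution."""
    target = 1.0 / len(class_probs)
    weights = [
        p
        * expected_copies(p, target, oversampling_coef)
        * keep_probability(p, target, undersampling_coef)
        for p in class_probs
    ]
    total = sum(weights)
    return [w / total for w in weights]


original = [0.9, 0.09, 0.01]  # hypothetical imbalanced class frequencies
print(resampled_distribution(original, 0.9, 0.5))  # coefficients used above
print(resampled_distribution(original, 1.0, 1.0))  # fully rebalanced
print(resampled_distribution(original, 0.0, 0.0))  # original distribution
```

With both coefficients set to 1 the result is exactly the uniform target, and with both set to 0 the original distribution is recovered; intermediate values give a softened rebalancing.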

Update #1

Here is a simple jupyter notebook that implements the above oversampling/undersampling on a toy model.

Answer 2

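tf.data.experimental.rejection_resample seems to be a better way, since it does not require the "class_prob" and "class_target_prob" features.
Although it undersamples rather than oversamples, with the same target distribution and the same number of training steps the effect is the same.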
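
The trade-off can be made concrete with a small plain-Python sketch of textbook rejection sampling. This is a simplified model for intuition, not TensorFlow's actual implementation of rejection_resample; the function name rejection_acceptance and the example distributions are assumptions chosen for illustration.

```python
def rejection_acceptance(initial_dist, target_dist):
    """Per-class acceptance probabilities that turn initial_dist into
    target_dist by only dropping examples, plus the overall fraction
    of examples that survive."""
    ratios = [t / i for t, i in zip(target_dist, initial_dist)]
    scale = max(ratios)  # the rarest class (relative to target) is always kept
    accept = [r / scale for r in ratios]
    overall = sum(i * a for i, a in zip(initial_dist, accept))
    return accept, overall


accept, overall = rejection_acceptance([0.9, 0.1], [0.5, 0.5])
print(accept)   # per-class acceptance probabilities
print(overall)  # fraction of the input stream that is kept
```

For a 90/10 split rebalanced to 50/50, only about one majority example in nine is accepted and roughly 20% of the stream survives overall. The further the target is from the initial distribution, the more input must be read per batch, which is the slowdown mentioned in the question.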

Answer 3

This QnA was very helpful for me, so I wrote a blog post about my related experience.

https://vallum.github.io/Optimizing_parallel_performance_of_resampling_with_tensorflow.html

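I hope that anyone interested in optimizing a Tensorflow input pipeline with resampling can get some ideas from it.

Some ops are probably needlessly redundant, but in my case they did not degrade performance much.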

dataset = dataset.map(undersample_filter_fn, num_parallel_calls=num_parallel_calls)
dataset = dataset.flat_map(lambda x: x)

The flat_map with the identity lambda function only serves to merge the surviving (and empty) records.

# Pseudo-code illustrating flat_map after map
# parallel calls of map('A'), map('B'), and map('C')
map('A') = 'AAAAA' # A is replicated 5 times
map('B') = ''      # B is dropped
map('C') = 'CC'    # C is replicated twice
# merging all map results
flat_map('AAAAA,,CC') = 'AAAAACC'
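
The same data flow can be run as a plain-Python analogy, with a thread pool standing in for the parallel map and itertools.chain standing in for flat_map. This is only a sketch of the pattern, not TensorFlow code; the COPIES table and the function names are hypothetical stand-ins for the per-record resampling decision.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

# Hypothetical copy counts: A is replicated 5 times, B is dropped, C twice.
COPIES = {"A": 5, "B": 0, "C": 2}


def resample_record(record):
    """Map step: turn one record into a (possibly empty) list of copies."""
    return [record] * COPIES[record]


def resample(records, num_parallel_calls=4):
    """Run the map step in parallel, then flatten the per-record lists."""
    with ThreadPoolExecutor(max_workers=num_parallel_calls) as pool:
        groups = list(pool.map(resample_record, records))  # input order is kept
    return list(chain.from_iterable(groups))  # empty lists simply vanish


print("".join(resample(["A", "B", "C"])))  # prints AAAAACC
```

Because the expensive per-record work happens in the map step, it can run in parallel, while the flattening step stays cheap and sequential.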
