Question

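I have two implementations of a function that computes the Frobenius norm and subtracts the trace. The function is applied to every vector along dimension 3 of a 4D tensor x, and the results are then summed. I use it as part of a regularizer. The TensorFlow version is 0.9.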
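For reference, here is a minimal NumPy sketch of the quantity described above, written independently of TensorFlow. The function names are hypothetical and chosen only for illustration. The last function rests on an algebraic observation that is not part of the original question: for a vector v, the sum of squared entries of v v^T equals (sum of v_i^2)^2 and the trace of the squared matrix equals the sum of v_i^4, so the outer product does not have to be formed explicitly.

```python
import numpy as np

def frob_minus_trace(v):
    """Sum of the squared entries of v v^T minus the trace of that squared matrix."""
    c = np.outer(v, v)        # [channels, channels] outer product
    c2 = np.square(c)         # element-wise square
    return c2.sum() - np.trace(c2)

def frob_minus_trace_all(x):
    """Apply frob_minus_trace to every channel vector of a 4D array and sum the results."""
    vectors = x.reshape(-1, x.shape[3])
    return sum(frob_minus_trace(v) for v in vectors)

def closed_form_all(x):
    """Same value without the outer product (an assumption-free identity, but not from the question)."""
    sq = np.square(x.reshape(-1, x.shape[3]))
    # (sum_i v_i^2)^2 - sum_i v_i^4, summed over all vectors
    return np.sum(np.square(sq.sum(axis=1)) - np.square(sq).sum(axis=1))
```

Both variants should agree up to floating-point error, which makes the NumPy version usable as a cross-check for either TensorFlow implementation.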

My first implementation uses the tf.batch_* functions.

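import tensorflow as tf  # TensorFlow 0.9 API

def test1(x):
    """x: [batch, height, width, channels]"""
    s = x.get_shape().as_list()
    # One column vector per spatial position: [batch*height*width, channels, 1]
    a = tf.reshape(x, [-1, s[3], 1])
    # Batched outer products: [batch*height*width, channels, channels]
    c = tf.batch_matmul(a, a, adj_y=True)
    c2 = tf.square(c)
    diag = tf.batch_matrix_diag_part(c2)
    # Sum of all squared entries minus the sum of the diagonals (the traces)
    return tf.reduce_sum(c2) - tf.reduce_sum(diag)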

This works, but the intermediate tensor c is channels times larger than the tensor x, which limits my batch size. So I tried an approach based on map_fn:

def fn(x):
    # x is a single vector of length channels
    x1 = tf.reshape(x, [-1, 1])
    # Outer product: [channels, channels]
    c1 = tf.matmul(x1, x1, transpose_b=True)
    c2 = tf.square(c1)
    t1 = tf.trace(c2)
    return tf.reduce_sum(c2) - t1

def test2(x):
    """x: [batch, height, width, channels]"""
    s = x.get_shape().as_list()
    a = tf.reshape(x, [-1, s[3]])
    return tf.reduce_sum(tf.map_fn(fn, a))

When I run the second function, I get many (50+) messages like:

I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] 
PoolAllocator: After 16084 get requests, put_count=20101
evicted_count=4000 eviction_rate=0.198995 and
unsatisfied allocation rate=0
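As a quick sanity check on the message above, the reported eviction_rate appears to be the ratio evicted_count / put_count. That reading is an assumption based on the numbers in this one message, not something stated in the question. The short sketch below parses the counters and recomputes the ratio; the helper name parse_counts is hypothetical.

```python
import re

# The log message from above, joined into a single string.
LOG = ("PoolAllocator: After 16084 get requests, put_count=20101 "
       "evicted_count=4000 eviction_rate=0.198995 and "
       "unsatisfied allocation rate=0")

def parse_counts(message):
    """Extract put_count, evicted_count and the reported eviction_rate."""
    put = int(re.search(r"put_count=(\d+)", message).group(1))
    evicted = int(re.search(r"evicted_count=(\d+)", message).group(1))
    reported = float(re.search(r"eviction_rate=([0-9.]+)", message).group(1))
    return put, evicted, reported

put, evicted, reported = parse_counts(LOG)
recomputed = evicted / put  # assumed definition of eviction_rate
```

Here the recomputed value matches the reported rate to the printed precision, so roughly one in five buffers returned to the pool was evicted.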

The runtime of test2 is about 45 times that of test1.

With parallel_iterations = 10, the memory usage of map_fn should be on the order of 10 * channels * channels, which is far lower than for test1.
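The expectation above can be made concrete by counting elements. The sketch below compares the size of the intermediate tensor c in test1 with the estimate of 10 * channels * channels for map_fn. The function names and the example shape are hypothetical, and the map_fn figure is only the estimate stated above, not a measurement of what TensorFlow actually allocates.

```python
def input_elements(batch, height, width, channels):
    """Number of elements in x."""
    return batch * height * width * channels

def batched_intermediate_elements(batch, height, width, channels):
    """Elements in c from test1: [batch*height*width, channels, channels]."""
    return batch * height * width * channels * channels

def map_fn_intermediate_elements(channels, parallel_iterations=10):
    """Expected live elements if only parallel_iterations outer products exist at once."""
    return parallel_iterations * channels * channels

# Hypothetical example shape, chosen only for illustration.
shape = dict(batch=32, height=16, width=16, channels=64)
ratio = batched_intermediate_elements(**shape) // input_elements(**shape)
expected_map_fn = map_fn_intermediate_elements(shape["channels"])
```

For any shape, ratio equals channels, which is the "channels times larger" factor that limits the batch size in test1.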

So the question now is: why does the map_fn approach take so much longer, and why does it seem to use more memory rather than less?

TensorFlow map_fn performance and memory usage

0 answers: