Question

For some unknown reason, the following code runs about two times slower on the GPU than on the CPU. Can anyone explain why?


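Note: typically the GPU run takes 5 seconds, the CPU run takes 3 seconds, and a CPU version using Numpy takes only 1.5 seconds. Hardware: the TensorFlow code runs on Google Colab; the Numpy code runs locally on an Intel Core i5-7267U.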

TensorFlow version:

import time
import tensorflow as tf

with tf.device('/device:GPU:0'):  # gpu takes: 5.132448434829712 seconds
    # with tf.device('/cpu:0'): # cpu takes: 3.440524101257324 seconds
    i = tf.constant(0)
    while_condition = lambda i: tf.less(i, 2 ** 20)
    a = tf.fill([16, 16], 1.1)
    b = tf.fill([16, 16], 2.2)
    def body(i):
        res = tf.matmul(a, b)
        # increment i
        add = tf.add(i, 1)

        return (add,)


    ini_matmul = tf.matmul(a, b)

    # do the loop:
    loop = tf.while_loop(while_condition, body, [i])

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(ini_matmul)  # force the GPU to initialise anything it needs

    t0 = time.time()
    sess.run(loop)

    t1 = time.time()
    print(t1 - t0)
# no explicit sess.close() needed: the with-block closes the session

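Update

This bugs me more and more, since scaling up the matrices does not really help. The updated runs used a Titan XP card and an Intel i7 CPU. Essentially, TensorFlow runs much slower than the Numpy version.

Numpy version:

import numpy as np
import time

i = 0
a = np.full([16, 16], 1.1)
b = np.full([16, 16], 2.2)
t0 = time.time()
while i < 2 ** 20:
    a.dot(b)
    i += 1
t1 = time.time()
print(t1 - t0)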

tensorflow

Answer 1

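This is an interesting question.

The relative slowdown you see between GPU and CPU execution in the TensorFlow snippet is almost certainly due to GPU memory allocation overhead. In short, cudaMalloc is slower than malloc. This slower allocation is offset by the speedup of the requested operation (matmul in this case) if and only if that speedup exceeds the difference in allocation time. For matmul this is invariably true when the matrices are large. It is not true when the matrices are small, which is the case in your example. To verify this hypothesis, repeatedly increase the size of the multiplicands and record the run times on both CPU and GPU. If memory allocation is indeed the issue, the two curves should converge and then cross over.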
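The break-even argument above can be made concrete with a toy cost model. This is only a minimal sketch: every constant below is a hypothetical placeholder rather than a measurement of any real CPU or GPU, and the function names (`matmul_flops`, `op_time`, `crossover_size`) are made up for illustration.

```python
# Toy cost model: time per matmul = fixed overhead + flops / throughput.
# All constants are HYPOTHETICAL placeholders, not measured values.

CPU_OVERHEAD_S = 1e-6   # assumed fixed per-op cost on the CPU
GPU_OVERHEAD_S = 5e-6   # assumed fixed per-op cost on the GPU (allocation, launch)
CPU_FLOPS = 5e10        # assumed sustained CPU throughput in FLOP/s
GPU_FLOPS = 5e12        # assumed sustained GPU throughput in FLOP/s


def matmul_flops(n):
    """Approximate FLOP count for multiplying two n x n matrices."""
    return 2 * n ** 3


def op_time(n, overhead_s, flops_per_s):
    """Estimated time for one n x n matmul under the toy model."""
    return overhead_s + matmul_flops(n) / flops_per_s


def crossover_size(sizes):
    """Return the first size at which the modeled GPU time beats the CPU time."""
    for n in sizes:
        gpu = op_time(n, GPU_OVERHEAD_S, GPU_FLOPS)
        cpu = op_time(n, CPU_OVERHEAD_S, CPU_FLOPS)
        if gpu < cpu:
            return n
    return None


if __name__ == "__main__":
    for n in [16, 32, 64, 128, 256]:
        cpu = op_time(n, CPU_OVERHEAD_S, CPU_FLOPS)
        gpu = op_time(n, GPU_OVERHEAD_S, GPU_FLOPS)
        print(n, cpu, gpu, "GPU faster" if gpu < cpu else "CPU faster")
```

With these made-up numbers the fixed overhead dominates at 16 x 16, so the modeled CPU wins there, and the GPU only pulls ahead once the cubic compute term outgrows the overhead gap. Real hardware will have a different crossover point, which is why the measurement suggested above is still needed.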

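The difference between the Numpy run time and the CPU-only TensorFlow run time is most likely due to a subtle difference between the Numpy and TensorFlow code. Note that in the Numpy code you instantiate a and b only once. It looks as if you do the same in the TensorFlow code, because you call the initialization only once, but the tensors are still filled on every iteration! To see why, note that tf.fill returns a Tensor. By definition, Tensor objects are filled every time sess.run is called on a graph that contains them. The two snippets are therefore doing slightly different things. A more direct comparison would be to make both a and b a tf.constant in the TensorFlow snippet.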

Answer 2

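In the end, I found that the matmul operation is not executed by TensorFlow at all, because it is an isolated node in the graph: res is computed in body but never returned or consumed, so nothing that sess.run fetches depends on it.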
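The idea that an unconsumed node is never run can be illustrated with a toy model of fetch-driven execution. This is a sketch under stated assumptions, not TensorFlow's actual implementation: the `nodes_to_run` helper and the node names are hypothetical and only mirror the loop body from the question.

```python
# Toy model of fetch-driven graph execution (NOT TensorFlow's implementation).
# A node runs only if a fetched node depends on it, directly or transitively.

def nodes_to_run(graph, fetches):
    """Return the set of nodes reachable from `fetches` by following inputs.

    `graph` maps each node name to the list of node names it takes as input.
    """
    needed = set()
    stack = list(fetches)
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed.add(node)
        stack.extend(graph[node])
    return needed


# Hypothetical node names mirroring the loop body in the question:
# `res` consumes a and b, but nothing consumes `res`.
graph = {
    "a": [],
    "b": [],
    "i": [],
    "res": ["a", "b"],  # stands in for tf.matmul(a, b)
    "add": ["i"],       # stands in for tf.add(i, 1), the only value body returns
}

print(sorted(nodes_to_run(graph, ["add"])))         # ['add', 'i']
print(sorted(nodes_to_run(graph, ["add", "res"])))  # ['a', 'add', 'b', 'i', 'res']
```

In this model, fetching only `add` never touches `res`, which matches the observation that the timed loop measured the counter increment rather than the matrix multiplication. Making the loop output depend on `res`, for instance by carrying it as a loop variable, would presumably be one way to force the matmul to run.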

Is TensorFlow's while loop slow on the GPU?
