Question

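I have a codebase here in which I am trying to replicate a GAN paper. I recently bought a second GPU and am trying to update the code to take advantage of the additional hardware. I followed the approach outlined in the TensorFlow cifar10 multi-GPU example. However, when I run my code with 2 GPUs there is no speedup at all; in fact, it runs about 10% slower than on a single GPU. According to the resource monitor, both GPUs are running at roughly 50% utilization.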

I am running on Windows 10 with Python 3.7 and TF 1.13. I am using two 2080 Tis and a 2950 CPU.

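My first thought was that the input pipeline was the problem, so I tried several variations, such as using multiple data iterators, using tf.data.experimental.prefetch_to_device(), and not feeding the latent vectors. None of it made any difference, and since my CPU utilization sits at about 5%, I am fairly sure the input pipeline is not the bottleneck.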

I also tried a few ways of setting up the variable scopes for the towers, but that did not help.

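I also tried doubling the batch size, in case I simply was not pushing enough data through the GPUs, but that only made each batch take twice as long to compute, with GPU usage still at 50%.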
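
Expressed as throughput, that observation means no gain at all. A rough sketch of the arithmetic follows; the batch sizes and step times are made-up placeholder numbers for illustration, not measurements from this codebase.

```python
def images_per_second(batch_size, seconds_per_step):
    """Throughput of a training loop: samples processed per second."""
    return batch_size / seconds_per_step


# Hypothetical numbers, for illustration only.
base = images_per_second(batch_size=64, seconds_per_step=0.5)
doubled = images_per_second(batch_size=128, seconds_per_step=1.0)

# Doubling the batch while each step takes twice as long
# leaves the throughput unchanged.
print(base, doubled)
```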

My code is here; the relevant part is:

        d_grads = []
        g_grads = []
        for i in range(FLAGS.num_gpus):
            with tf.device('/gpu:{:d}'.format(i)):
                with tf.variable_scope('D', reuse=tf.AUTO_REUSE):
                    Dx, Dx_logits = self.discriminator(xs[i], yxs[i])
                with tf.variable_scope('G', reuse=tf.AUTO_REUSE):
                    G = self.generator(z[i], labels[i])
                with tf.variable_scope('D', reuse=tf.AUTO_REUSE):
                    Dg, Dg_logits = self.discriminator(G, labels[i])

                loss_d, loss_g = self.losses(Dx_logits, Dg_logits, Dx, Dg)

                vars = tf.trainable_variables()
                for v in vars:
                    print(v.name)
                d_params = [v for v in vars if v.name.startswith('D/')]
                g_params = [v for v in vars if v.name.startswith('G/')]

                d_grads.append(d_adam.compute_gradients(loss_d, var_list=d_params))
                g_grads.append(g_adam.compute_gradients(loss_g, var_list=g_params))

        d_opt = d_adam.apply_gradients(average_gradients(d_grads))
        g_opt = g_adam.apply_gradients(average_gradients(g_grads))
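
The average_gradients helper is not shown in the excerpt. In the cifar10 multi-GPU example it takes one list of (gradient, variable) pairs per tower and returns a single list in which each gradient is averaged across towers. Below is a minimal pure-Python sketch of that logic, with flat lists of floats standing in for tensors and strings standing in for variables; it illustrates the idea under those assumptions and is not the code from the repository.

```python
def average_gradients(tower_grads):
    """Average gradients across towers.

    tower_grads: one list per tower, each a list of (gradient, variable)
    pairs in the same variable order. Gradients are flat lists of floats
    here (a stand-in for tensors).
    """
    averaged = []
    # zip(*tower_grads) groups the pairs that belong to the same variable.
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars]
        # Element-wise mean over the towers.
        mean = [sum(column) / len(column) for column in zip(*grads)]
        # The variable is shared between towers, so take it from the first.
        averaged.append((mean, grads_and_vars[0][1]))
    return averaged


# Two towers, two variables each (illustrative values only).
tower_0 = [([1.0, 2.0], 'D/w'), ([4.0], 'D/b')]
tower_1 = [([3.0, 6.0], 'D/w'), ([0.0], 'D/b')]
result = average_gradients([tower_0, tower_1])
print(result)
```

The averaging only works if every tower lists its variables in the same order, which is why the towers must share variables (as the reuse=tf.AUTO_REUSE scopes above are meant to ensure).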

Answer 1

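In your gan.py file, note that num_gpus is set to 1 on line 17. Second, check the section "Allowing GPU memory growth" at this link. By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. In some cases it is desirable for the process to allocate only a subset of the available memory, or to grow its memory usage only as needed. TensorFlow provides two Config options on the Session to control this.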

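The first is the allow_growth option, which attempts to allocate only as much GPU memory as runtime allocations require: it starts out allocating very little memory, and as Sessions are run and more GPU memory is needed, the GPU memory region used by the TensorFlow process is extended.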

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)  # plus any other Session arguments

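The second is the per_process_gpu_memory_fraction option, which determines the fraction of the total memory that should be allocated on each visible GPU. For example, you can tell TensorFlow to allocate only 40% of the total memory of each GPU with: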

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config)  # plus any other Session arguments

This is useful if you want to put a hard upper bound on the amount of GPU memory available to the TensorFlow process.
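
As a rough worked example, assuming a card with 11 GiB of memory such as the 2080 Ti mentioned in the question (the usable amount reported by the driver will differ somewhat), a fraction of 0.4 caps TensorFlow at roughly 4.4 GiB per GPU. The helper name memory_cap_mib is made up for this illustration.

```python
def memory_cap_mib(total_mib, fraction):
    """Approximate per-GPU cap implied by per_process_gpu_memory_fraction."""
    return total_mib * fraction


TOTAL_MIB = 11 * 1024  # assumption: 11 GiB of memory per GPU
cap = memory_cap_mib(TOTAL_MIB, 0.4)
print('{:.0f} MiB (~{:.1f} GiB)'.format(cap, cap / 1024))
```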

Using a single GPU on a multi-GPU system

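If you have more than one GPU in your system, the GPU with the lowest ID is selected by default. If you would like to run on a different GPU, you need to specify the preference explicitly: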

import tensorflow as tf

# Creates a graph.
with tf.device('/device:GPU:2'):
  a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
  c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

If the device you have specified does not exist, you will get an InvalidArgumentError:

InvalidArgumentError: Invalid argument: Cannot assign a device to node 'b':
Could not satisfy explicit device specification '/device:GPU:2'
   [[{{node b}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [3,2]
   values: 1 2 3...>, _device="/device:GPU:2"]()]]

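If you would like TensorFlow to automatically choose an existing and supported device to run the operations when the specified one does not exist, you can set allow_soft_placement to True in the configuration options when creating the session.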

import tensorflow as tf

# Creates a graph.
with tf.device('/device:GPU:2'):
  a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
  c = tf.matmul(a, b)
# Creates a session with allow_soft_placement and log_device_placement set
# to True.
sess = tf.Session(config=tf.ConfigProto(
      allow_soft_placement=True, log_device_placement=True))
# Runs the op.
print(sess.run(c))

Using multiple GPUs

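If you would like to run TensorFlow on multiple GPUs, you can construct your model in a multi-tower fashion, where each tower is assigned to a different GPU. For example: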

import tensorflow as tf

# Creates a graph.
c = []
for d in ['/device:GPU:2', '/device:GPU:3']:
  with tf.device(d):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
    c.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
  sum = tf.add_n(c)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(sum))
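
Note that the example above hard-codes GPU:2 and GPU:3, which only exist on a machine with at least four GPUs. On a two-GPU machine like the one in the question, the towers would go on GPU:0 and GPU:1. Here is a small sketch of generating the device strings and splitting a global batch evenly across towers; tower_devices and split_batch are hypothetical helper names, not TensorFlow functions.

```python
def tower_devices(num_gpus):
    """Device strings for towers on GPUs 0 .. num_gpus - 1."""
    return ['/device:GPU:{:d}'.format(i) for i in range(num_gpus)]


def split_batch(batch, num_towers):
    """Split a list of samples into equal, contiguous shards, one per tower."""
    if len(batch) % num_towers != 0:
        raise ValueError('batch size must be divisible by the number of towers')
    shard = len(batch) // num_towers
    return [batch[i * shard:(i + 1) * shard] for i in range(num_towers)]


devices = tower_devices(2)
shards = split_batch(list(range(8)), len(devices))
for device, shard in zip(devices, shards):
    print(device, shard)
```

Each tower then sees only its own shard, so the global batch has to grow with the number of GPUs if the per-GPU batch size is to stay the same.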
