Question

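I have two GPUs (Titan X (Pascal) and GTX 1080). I am trying to run a single-threaded graph computation. The graph consists of two independent matrix multiplication chains, each assigned to its own GPU.

Here is the code I am using: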

import tensorflow as tf
import numpy as np
import random
import time
import logging

from tensorflow.python.ops import init_ops
from tensorflow.python.client import timeline


def test():
    n = 5000

    with tf.Graph().as_default():
        A1 = tf.placeholder(tf.float32, shape=[n, n], name='A')
        A2 = tf.placeholder(tf.float32, shape=[n, n], name='A')
        with tf.device('/gpu:0'):
            B1 = A1
            for _ in range(10):
                B1 = tf.matmul(B1, A1)

        with tf.device('/gpu:1'):
            B2 = A2
            for _ in range(10):
                B2 = tf.matmul(B2, A2)
            C = tf.matmul(B1, B2)

        run_metadata = tf.RunMetadata()
        with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
            start = time.time()
            logging.info('started')
            A1_ = np.random.rand(n, n)
            A2_ = np.random.rand(n, n)
            sess.run([C],
                     feed_dict={A1: A1_, A2: A2_},
                     options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
                     run_metadata=run_metadata)
            logging.info('writing trace')
            trace = timeline.Timeline(step_stats=run_metadata.step_stats)
            # Use a context manager so the trace file is flushed and closed.
            with open('timeline.ctf.json', 'w') as trace_file:
                trace_file.write(trace.generate_chrome_trace_format())
            logging.info('trace written')
            end = time.time()
            logging.info('computed')
            logging.info(end - start)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
    test()

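It takes 20.4 seconds to complete.
If I place all operations on gpu0 (Titan X), it takes 14 seconds.
If I place all operations on gpu1 (GTX 1080), it takes 19.8 seconds.

I can see that TensorFlow found both GPUs and set all devices correctly. Why is there no speedup from using two GPUs instead of one? Could the problem be that the GPUs are different models (AFAIK CUDA allows this)?

Thanks.

EDIT: I updated the code to use different initial matrices for the two chains, since otherwise TensorFlow seems to do some optimization.

Here is a link to the timeline profile JSON file: https://api.myjson.com/bins/23csi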

Screenshot

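This timeline raises more questions than it answers:

Why does pid 7 (gpu0) have two lines of execution?
What are the long MatMuls in pid 3 and pid 5? (input0 "_recv_A_0/_3", input1 "_recv_A_0/_3", name "MatMul", op "MatMul")
It seems that every op is executed on gpu0, except for pid 5.
Right after the long MatMul ops in pid 3 and pid 5 there are a lot of small MatMul ops (not visible in the screenshot). What are these?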
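One way to approach these questions without squinting at the screenshot is to load the timeline JSON and total the op durations per row. The sketch below assumes the file follows the Chrome trace event layout (a top-level "traceEvents" list, "ph": "X" for complete events with a "dur" in microseconds, and "ph": "M" metadata events carrying the process name). The helper name summarize_trace and the sample events are made up for illustration and are not taken from the real trace.

```python
import json
from collections import defaultdict


def summarize_trace(trace):
    """Total the durations of complete events per (pid, tid) row.

    Assumes Chrome trace event format: "ph" == "X" marks a complete event
    with "dur" in microseconds, and "ph" == "M" with name "process_name"
    maps a pid to a readable device name.
    """
    names = {}
    totals = defaultdict(lambda: {"count": 0, "total_us": 0, "longest": ("", 0)})
    for ev in trace.get("traceEvents", []):
        if ev.get("ph") == "M" and ev.get("name") == "process_name":
            names[ev["pid"]] = ev.get("args", {}).get("name", "")
        elif ev.get("ph") == "X":
            row = totals[(ev["pid"], ev.get("tid", 0))]
            dur = ev.get("dur", 0)
            row["count"] += 1
            row["total_us"] += dur
            if dur > row["longest"][1]:
                row["longest"] = (ev.get("name", ""), dur)
    return names, dict(totals)


# Hypothetical sample events, only to show the shape of the input.
sample = {"traceEvents": [
    {"ph": "M", "name": "process_name", "pid": 7, "args": {"name": "/gpu:0 Compute"}},
    {"ph": "X", "pid": 7, "tid": 0, "ts": 0, "dur": 1200, "name": "MatMul"},
    {"ph": "X", "pid": 7, "tid": 1, "ts": 100, "dur": 300, "name": "MatMul"},
    {"ph": "X", "pid": 5, "tid": 0, "ts": 0, "dur": 5000, "name": "MatMul"},
    {"ph": "X", "pid": 5, "tid": 0, "ts": 5000, "dur": 40, "name": "MatMul"},
]}

names, totals = summarize_trace(sample)
for (pid, tid), row in sorted(totals.items()):
    print(pid, tid, names.get(pid, "?"), row)

# For the real file: summarize_trace(json.load(open('timeline.ctf.json')))
```

Rows that share a pid but differ in tid would show up as separate entries, which is one way to see what the "two lines of execution" for a single device contain.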

Answer 1

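There can be a significant delay the first time a kernel is launched on a GPU, possibly caused by PTXAS compilation. This delay can be on the order of seconds, and it accumulates when you use more than one GPU, so in your case the run is slower because the time is dominated by the extra "initial kernel launch". One way to benchmark pure computation time is to "warm up" by executing each CUDA op at least once on each GPU. Running your benchmark on 2 Titan X cards, I observed the same slowness, but the delay disappeared once I "pre-warmed" the kernels.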
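A rough operation count makes it plausible that the measured seconds are not mostly arithmetic. The sketch below counts the floating-point operations in the graph (21 products of 5000 x 5000 matrices) and divides by an assumed peak rate of about 10 TFLOP/s in FP32. That rate is only a ballpark assumption for a Pascal-class card, and real sustained throughput will be lower.

```python
n = 5000
matmuls = 21  # 10 per chain, plus the final C = B1 * B2

# An n x n by n x n product needs n**3 multiply-add pairs, i.e. 2 * n**3 FLOPs.
flops_per_matmul = 2 * n ** 3
total_flops = matmuls * flops_per_matmul

# Assumption: ~10 TFLOP/s FP32 peak; not a measured figure for either card.
ASSUMED_PEAK_FLOPS = 10e12
ideal_seconds = total_flops / ASSUMED_PEAK_FLOPS

print("total FLOPs: %.3e" % total_flops)
print("ideal time at assumed peak: %.3f s" % ideal_seconds)
```

Even allowing a generous margin below peak, this estimate is well under the 14 to 20 seconds reported, which is consistent with one-time startup costs dominating the measurement.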

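Here is the timeline before warm-up:

Here is the timeline after warm-up:

Below is your code modified to do the warm-up, and also to remove any TensorFlow <-> Python transfers.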

import tensorflow as tf

from tensorflow.python.ops import init_ops
from tensorflow.python.client import timeline
import logging, time
import numpy as np

def test():
    n = 5000

    with tf.device('/gpu:0'):
        A1 = tf.Variable(tf.ones_initializer(shape=[n, n]), name='A1')
        B1 = A1
        for _ in range(10):
            B1 = tf.matmul(A1, B1, name="chain1")

    with tf.device('/gpu:1'):
        A2 = tf.Variable(tf.ones_initializer(shape=[n, n]), name='A2')
        B2 = A2
        for _ in range(10):
            B2 = tf.matmul(A2, B2, name="chain2")
        C = tf.matmul(B1, B2)

    run_metadata = tf.RunMetadata()
    start = time.time()
    logging.info('started')
    sess = tf.InteractiveSession(config=tf.ConfigProto(allow_soft_placement=False, log_device_placement=True))
    sess.run(tf.initialize_all_variables())
    # do warm-run
    sess.run([C.op],
             options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
             run_metadata=run_metadata)
    run_metadata = tf.RunMetadata()
    sess.run([C.op],
             options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
             run_metadata=run_metadata)
    logging.info('writing trace')
    trace = timeline.Timeline(step_stats=run_metadata.step_stats)
    # Use a context manager so the trace file is flushed and closed.
    with open('timeline.ctf.json', 'w') as trace_file:
        trace_file.write(trace.generate_chrome_trace_format(show_memory=True))
    logging.info('trace written')
    end = time.time()
    logging.info('computed')
    logging.info(end - start)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
    test()
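The warm-up idea can also be factored into a small timing helper that works for any callable. This is a minimal sketch, not part of TensorFlow: the name time_after_warmup and its parameters are hypothetical, and fake_step merely stands in for something like `lambda: sess.run(C.op)`.

```python
import time


def time_after_warmup(fn, warmup=1, repeats=3):
    """Call fn `warmup` times untimed, then return `repeats` wall-clock timings."""
    for _ in range(warmup):
        fn()  # pays any one-time cost (e.g. first kernel launch) outside the timing
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in workload that records how often it was called.
calls = []


def fake_step():
    calls.append(len(calls))


timings = time_after_warmup(fake_step, warmup=2, repeats=3)
print("timed runs:", len(timings), "total calls:", len(calls))
```

Reporting the minimum or median of the returned timings, rather than a single run, reduces the influence of any remaining one-off delays.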

Answer 2

Could it be because you need to transfer data between the GPUs when computing C? Can you try placing C on the CPU?

with tf.device('/cpu:0'):
  C = tf.matmul(B1, B2)
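A back-of-the-envelope estimate can help judge how large such a transfer could be. The sketch below assumes an effective bandwidth of about 12 GB/s, roughly what a PCIe 3.0 x16 link tends to deliver in practice. This figure is an assumption, not a measurement from this machine, and the actual path (peer-to-peer or via host memory) may differ.

```python
n = 5000
bytes_per_float32 = 4
matrix_bytes = n * n * bytes_per_float32  # size of one n x n float32 tensor

# Assumption: ~12 GB/s effective transfer bandwidth (not measured here).
ASSUMED_BANDWIDTH_BYTES_PER_S = 12e9

# C on gpu1 needs B1 copied over; C on the CPU needs both B1 and B2 copied.
one_matrix_seconds = matrix_bytes / ASSUMED_BANDWIDTH_BYTES_PER_S
two_matrices_seconds = 2 * one_matrix_seconds

print("matrix size: %d MB" % (matrix_bytes // 10**6))
print("one copy: %.1f ms, two copies: %.1f ms"
      % (one_matrix_seconds * 1e3, two_matrices_seconds * 1e3))
```

Under these assumptions the copies take milliseconds, so comparing the timeline with C on each device is a useful check on whether the transfer matters at all.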

TensorFlow: 2 GPUs slower than a single GPU