Question

I am trying to increase GPU utilization in TensorFlow, but I found that subgraph execution is not parallelized. Here is a working example (TensorFlow version r0.12):

import tensorflow as tf
import numpy as np
from tensorflow.python.client import timeline

#initialize graph
tf.reset_default_graph()
sess = tf.Session()

# some parameters
input_dim = 10000
output_dim = 100
num_hidden = 10000
batch_size = 256

First we create two networks:

#specify two networks with random inputs as data
with tf.device('/gpu:0'):
    # first network
    with tf.variable_scope('net1'):
        tf_data1 = tf.random_normal(shape=[batch_size, input_dim])
        w1 = tf.get_variable('w1', shape=[input_dim, num_hidden], dtype=tf.float32)
        b1 = tf.get_variable('b1', shape=[num_hidden], dtype=tf.float32)
        l1 = tf.add(tf.matmul(tf_data1, w1), b1)
        w2 = tf.get_variable('w2', shape=[num_hidden, output_dim], dtype=tf.float32)
        result1 = tf.matmul(l1, w2)

    # second network
    with tf.variable_scope('net2'):
        tf_data2 = tf.random_normal(shape=[batch_size, input_dim])
        w1 = tf.get_variable('w1', shape=[input_dim, num_hidden], dtype=tf.float32)
        b1 = tf.get_variable('b1', shape=[num_hidden], dtype=tf.float32)
        l1 = tf.add(tf.matmul(tf_data2, w1), b1)  # use net2's own input, not tf_data1
        w2 = tf.get_variable('w2', shape=[num_hidden, output_dim], dtype=tf.float32)
        result2 = tf.matmul(l1, w2)

This is the result we are interested in:

    # the result we are interested in
    out = tf.add(result1, result2)

Now we initialize the variables and run the session:

sess.run(tf.global_variables_initializer()) #initialize variables

# run out operation with trace
run_metadata = tf.RunMetadata() 
sess.run(out,
        options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
        run_metadata=run_metadata )

# write trace to file (the with-block closes the file so the JSON is fully flushed)
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('trace.ctf.json', 'w') as trace_file:
    trace_file.write(trace.generate_chrome_trace_format())

In the trace, we can see the following (trace screenshot omitted):

The first MatMul is for net1, and the second MatMul is for net2.
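Without the screenshot, one way to check whether the two MatMul kernels overlapped is to read the trace file directly. The following is a minimal sketch, not part of the original question. It assumes the file follows the Chrome trace format, with a top-level "traceEvents" list whose complete events have "ph" set to "X" and carry "ts" and "dur" in microseconds, and it assumes the node name is stored under args["name"]. The helper names load_op_events and overlapping_pairs are hypothetical.

```python
import json
from collections import defaultdict


def load_op_events(trace_path):
    """Return the complete ("ph" == "X") events from a Chrome trace file."""
    with open(trace_path) as f:
        trace = json.load(f)
    return [e for e in trace.get("traceEvents", []) if e.get("ph") == "X"]


def _label(event):
    # Assumption: the graph node name lives in args["name"]; fall back to the op name.
    return event.get("args", {}).get("name", event.get("name"))


def overlapping_pairs(events, op_name="MatMul"):
    """List pairs of op_name events on the same pid whose time spans overlap."""
    # Group by pid so the same kernel reported under several device rows
    # is not compared against itself.
    by_pid = defaultdict(list)
    for e in events:
        if e.get("name") == op_name:
            by_pid[e.get("pid")].append(e)

    pairs = []
    for group in by_pid.values():
        group.sort(key=lambda e: e["ts"])
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                # b starts no earlier than a, so they overlap if b starts before a ends.
                if b["ts"] < a["ts"] + a["dur"]:
                    pairs.append((_label(a), _label(b)))
    return pairs
```

Calling overlapping_pairs(load_op_events('trace.ctf.json')) on the file written above returns an empty list when the MatMul kernels ran strictly one after another, which is the behavior described in this question.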

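Questions:

1 - Since result1 does not depend on result2, why are these operations not processed in parallel when the parent operation "out" is called?

2 - Am I doing something wrong when defining the graph? From the documentation I understood that TensorFlow handles concurrency automatically.

3 - Is there any way to achieve concurrency at this level?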

Thanks

Answer 1

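Re (1): TensorFlow uses a single GPU compute stream by default. If you run your code on the CPU, you will see parallelism. To get better GPU utilization, it is better to increase the batch size / kernel size.

Re (2): Your graph seems to be defined correctly. The automatic parallelism applies mainly to the CPU.

Re (3): As of 1.0, there is no way to run multi-compute-stream code on the GPU in TensorFlow.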
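To illustrate the point in Re (1), here is a toy scheduling model. It is not TensorFlow code and does not reflect TensorFlow's actual executor. It only shows why two independent ops run back to back when a single stream is available and side by side when there are two workers. The function name makespan, the op names, and the durations are all made up for illustration.

```python
def makespan(durations, deps, num_streams):
    """Toy list scheduler: return the finish time of the last op.

    durations maps op name -> run time; deps maps op name -> prerequisite ops.
    Each stream (or worker thread) executes one op at a time.
    """
    finish = {}
    stream_free = [0] * num_streams
    remaining = dict(durations)
    while remaining:
        # Ops whose prerequisites have all been scheduled already.
        ready = [n for n in remaining
                 if all(d in finish for d in deps.get(n, []))]
        if not ready:
            raise ValueError("dependency cycle or unknown prerequisite")
        for name in sorted(ready):
            earliest = max((finish[d] for d in deps.get(name, [])), default=0)
            # Pick the stream that becomes free first.
            i = min(range(num_streams), key=lambda k: stream_free[k])
            start = max(earliest, stream_free[i])
            finish[name] = start + remaining.pop(name)
            stream_free[i] = finish[name]
    return max(finish.values(), default=0)


# Made-up durations in arbitrary time units, mirroring the graph in the question.
durations = {"net1/MatMul": 4, "net2/MatMul": 4, "Add": 1}
deps = {"Add": ["net1/MatMul", "net2/MatMul"]}

one_stream = makespan(durations, deps, num_streams=1)   # single GPU stream
two_workers = makespan(durations, deps, num_streams=2)  # e.g. two CPU threads
print(one_stream, two_workers)
```

In this model the independence of result1 and result2 only helps when there is more than one stream or worker to place them on, which matches the behavior the answer describes for GPU versus CPU execution.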

How to parallelize subgraph execution in TensorFlow?
