Tensorflow:多GPU培训无法使所有GPU同时运行

时间:2017-11-10 04:32:07

标签: parallel-processing deep-learning tensorflow-gpu multi-gpu parallelism-amdahl

我有一台拥有3x 1080 GPU的机器。以下是培训代码:

    dynamic_learning_rate = tf.placeholder(tf.float32, shape=[])
    model_version = tf.constant(1, tf.int32)

    with tf.device('/cpu:0'):
        with tf.name_scope('Input'):
            # Input images and labels.
            batch_images,\
                batch_input_vectors,\
                batch_one_hot_labels,\
                batch_file_paths,\
                batch_labels = self.get_batch()

    grads = []
    pred = []
    cost = []

    # Define optimizer
    optimizer = tf.train.MomentumOptimizer(learning_rate=dynamic_learning_rate / self.batch_size,
                                           momentum=0.9,
                                           use_nesterov=True)

    split_input_image = tf.split(batch_images, self.num_gpus)
    split_input_vector = tf.split(batch_input_vectors, self.num_gpus)
    split_input_one_hot_label = tf.split(batch_one_hot_labels, self.num_gpus)
    for i in range(self.num_gpus):
        with tf.device(tf.DeviceSpec(device_type="GPU", device_index=i)):
            with tf.variable_scope(tf.get_variable_scope(), reuse=i > 0):
                with tf.name_scope('Model'):
                    # Construct model
                    with tf.variable_scope("inference"):
                        tower_pred = self.model(split_input_image[i], split_input_vector[i], is_training=True)

                    pred.append(tower_pred)

                with tf.name_scope('Loss'):
                    # Define loss and optimizer
                    softmax_cross_entropy_cost = tf.reduce_mean(
                        tf.nn.softmax_cross_entropy_with_logits(logits=tower_pred, labels=split_input_one_hot_label[i]))

                    cost.append(softmax_cross_entropy_cost)

    # Concat variables
    pred = tf.concat(pred, 0)
    cost = tf.reduce_mean(cost)

    # L2 regularization
    trainable_vars = tf.trainable_variables()
    l2_regularization = tf.add_n(
        [tf.nn.l2_loss(v) for v in trainable_vars if any(x in v.name for x in ['weights', 'biases'])])
    for v in trainable_vars:
        if any(x in v.name for x in ['weights', 'biases']):
            print(v.name + ' - included for L2 regularization!')
        else:
            print(v.name)

    cost = cost + self.l2_regularization_strength*l2_regularization

    with tf.name_scope('Accuracy'):
        # Evaluate model
        correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(batch_one_hot_labels, 1))
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
        prediction = tf.nn.softmax(pred, name='softmax')

    # Creates a variable to hold the global_step.
    global_step = tf.Variable(0, trainable=False, name='global_step')

    # Minimization
    update = optimizer.minimize(cost, global_step=global_step, colocate_gradients_with_ops=True)

我参加培训后:

Fri Nov 10 12:28:00 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:03:00.0 Off |                  N/A |
| 42%   65C    P2    62W / 198W |   7993MiB /  8114MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:04:00.0 Off |                  N/A |
| 33%   53C    P2   150W / 198W |   7886MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 00000000:05:00.0  On |                  N/A |
| 26%   54C    P2   170W / 198W |   7883MiB /  8108MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     23228      C   python                                      7982MiB |
|    1     23228      C   python                                      7875MiB |
|    2      4793      G   /usr/lib/xorg/Xorg                            40MiB |
|    2     23228      C   python                                      7831MiB |
+-----------------------------------------------------------------------------+

Fri Nov 10 12:28:36 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:03:00.0 Off |                  N/A |
| 42%   59C    P2    54W / 198W |   7993MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:04:00.0 Off |                  N/A |
| 33%   57C    P2   154W / 198W |   7886MiB /  8114MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 00000000:05:00.0  On |                  N/A |
| 27%   55C    P2   155W / 198W |   7883MiB /  8108MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     23228      C   python                                      7982MiB |
|    1     23228      C   python                                      7875MiB |
|    2      4793      G   /usr/lib/xorg/Xorg                            40MiB |
|    2     23228      C   python                                      7831MiB |
+-----------------------------------------------------------------------------+

您会看到第一个GPU运行时,其他两个GPU将处于空闲状态,反之亦然。交替频率约为0.5秒。

对于单个GPU,培训速度大约为 650 [images/second] ,所有3个GPU 1050 [images/second]

对这个问题有什么想法吗?

1 个答案:

答案 0 :(得分:0)

您需要确保所有可训练的变量都在控制器设备(通常是CPU)上,并且所有其他辅助设备(通常是GPU)都在并行使用CPU中的变量。