I am trying to train an LSTM model with multiple GPUs. Training with 2 GPUs takes roughly 50% of the time of training with 1 GPU. However, when I load the trained model to predict on the test data, the prediction time with 2 GPUs is almost the same as with 1 GPU. I don't understand why this happens. Can anyone give me some advice? My code is as follows:
with tf.device('/cpu:0'):
    tower_grads = []
    reuse_vars = False

    # tf Graph input
    X = tf.placeholder("float", [None, timesteps, num_input])
    Y = tf.placeholder("float", [None, num_classes])

    # Loop over all GPUs and construct their own computation graph
    temp_prediction = [0] * num_gpus
    index_end = 0
    for j in range(num_gpus):
        with tf.device(assign_to_device('/gpu:{}'.format(j), ps_device='/cpu:0')):
            # Split data between GPUs
            _end = min((j + 1) * batch_size, Y.shape[0])
            _x = X[j * batch_size:_end, :, :]
            _y = Y[j * batch_size:_end]
            prediction = lstm_multi(_x, reuse=reuse_vars)
            test_prediction = lstm_multi(_x, reuse=True)
            temp_prediction[j] = test_prediction
            loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
                logits=prediction, labels=_y))
            optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
            grads = optimizer.compute_gradients(loss_op)
            tower_grads.append(grads)
            reuse_vars = True
            index_end = j

    #mean_acc = np.mean(accuracy)
    batch_prediction = tf.concat(temp_prediction[0:index_end], 0)
    tower_grads = average_gradients(tower_grads)
    train_op = optimizer.apply_gradients(tower_grads)
    init = tf.global_variables_initializer()
    saver = tf.train.Saver()

# test
result = []  # collect per-batch predictions
with tf.Session() as sess:
    saver.restore(sess, model_path + 'best-lolmodel.ckpt')
    for i in range(train_step_num):
        end = min((i + 1) * batch_size * num_gpus, train_y.shape[0])
        batch_x = train_x[i * timesteps * batch_size * num_gpus:end * timesteps].toarray().reshape((-1, timesteps, num_input))
        batch_y = train_y[i * batch_size * num_gpus:end].toarray()
        result.append(sess.run(batch_prediction, feed_dict={X: batch_x, Y: batch_y}))
    result = np.vstack(result)
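(For context, assign_to_device and average_gradients are basically the helpers from the standard TensorFlow multi-GPU example; a minimal sketch of what they look like, not necessarily my exact code:)

PS_OPS = ['Variable', 'VariableV2', 'AutoReloadVariable']

def assign_to_device(device, ps_device='/cpu:0'):
    # Place variables on ps_device and all other ops on the given GPU device.
    def _assign(op):
        node_def = op if isinstance(op, tf.NodeDef) else op.node_def
        if node_def.op in PS_OPS:
            return ps_device
        return device
    return _assign

def average_gradients(tower_grads):
    # Average the gradients over all GPU towers, variable by variable.
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        # grad_and_vars is ((grad_gpu0, var), (grad_gpu1, var), ...)
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_mean(tf.concat(grads, 0), 0)
        average_grads.append((grad, grad_and_vars[0][1]))
    return average_grads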
In my code, each prediction step feeds 2 * batch_size samples to the 2 GPUs, but when num_gpus is set to 1, the prediction time is the same as with 2 GPUs.
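The prediction time I compare is just the wall-clock time of the prediction loop above; a simplified sketch of the measurement (same batch construction as above, timer around sess.run):

import time

start = time.time()
for i in range(train_step_num):
    end = min((i + 1) * batch_size * num_gpus, train_y.shape[0])
    batch_x = train_x[i * timesteps * batch_size * num_gpus:end * timesteps].toarray().reshape((-1, timesteps, num_input))
    batch_y = train_y[i * batch_size * num_gpus:end].toarray()
    sess.run(batch_prediction, feed_dict={X: batch_x, Y: batch_y})
print('total prediction time: %.3f s' % (time.time() - start))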