I am trying to train an LSTM model with multiple GPUs. Training with 2 GPUs takes roughly 50% of the time of training with 1 GPU. However, when I load the trained model to predict on the test data, the prediction time with 2 GPUs is almost the same as with 1 GPU. I don't understand why this happens. Can anyone give me some advice? My code is as follows:
with tf.device('/cpu:0'):
    tower_grads = []
    reuse_vars = False

    # tf Graph input
    X = tf.placeholder("float", [None, timesteps, num_input])
    Y = tf.placeholder("float", [None, num_classes])

    # Loop over all GPUs and construct their own computation graph
    temp_prediction = [0] * num_gpus
    index_end = 0
    for j in range(num_gpus):
        with tf.device(assign_to_device('/gpu:{}'.format(j), ps_device='/cpu:0')):
            # Split data between GPUs
            _end = min((j + 1) * batch_size, Y.shape[0])
            _x = X[j * batch_size:_end, :, :]
            _y = Y[j * batch_size:_end]
            prediction = lstm_multi(_x, reuse=reuse_vars)
            test_prediction = lstm_multi(_x, reuse=True)
            temp_prediction[j] = test_prediction
            loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
                logits=prediction, labels=_y))
            optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
            grads = optimizer.compute_gradients(loss_op)
            tower_grads.append(grads)
            reuse_vars = True
            index_end = j

    #mean_acc = np.mean(accuracy)
    batch_prediction = tf.concat(temp_prediction[0:index_end], 0)
    tower_grads = average_gradients(tower_grads)
    train_op = optimizer.apply_gradients(tower_grads)
    init = tf.global_variables_initializer()
    saver = tf.train.Saver()

# test
result = []  # collect per-batch predictions
with tf.Session() as sess:
    saver.restore(sess, model_path + 'best-lolmodel.ckpt')
    for i in range(train_step_num):
        end = min((i + 1) * batch_size * num_gpus, train_y.shape[0])
        batch_x = train_x[i * timesteps * batch_size * num_gpus:end * timesteps].toarray().reshape((-1, timesteps, num_input))
        batch_y = train_y[i * batch_size * num_gpus:end].toarray()
        result.append(sess.run(batch_prediction, feed_dict={X: batch_x, Y: batch_y}))
    result = np.vstack(result)
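(For context, assign_to_device and average_gradients are basically the helpers from the standard TensorFlow multi-GPU example; a minimal sketch of what they look like, not necessarily my exact code:)

PS_OPS = ['Variable', 'VariableV2', 'AutoReloadVariable']

def assign_to_device(device, ps_device='/cpu:0'):
    # Place variables on ps_device and all other ops on the given GPU device.
    def _assign(op):
        node_def = op if isinstance(op, tf.NodeDef) else op.node_def
        if node_def.op in PS_OPS:
            return ps_device
        return device
    return _assign

def average_gradients(tower_grads):
    # Average the gradients over all GPU towers, variable by variable.
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        # grad_and_vars is ((grad_gpu0, var), (grad_gpu1, var), ...)
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_mean(tf.concat(grads, 0), 0)
        average_grads.append((grad, grad_and_vars[0][1]))
    return average_grads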
In my code, each prediction step feeds 2 * batch_size samples to the 2 GPUs, but when num_gpus is set to 1, the prediction time is the same as with 2 GPUs.
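The prediction time I compare is just the wall-clock time of the prediction loop above; a simplified sketch of the measurement (same batch construction as above, timer around sess.run):

import time

start = time.time()
for i in range(train_step_num):
    end = min((i + 1) * batch_size * num_gpus, train_y.shape[0])
    batch_x = train_x[i * timesteps * batch_size * num_gpus:end * timesteps].toarray().reshape((-1, timesteps, num_input))
    batch_y = train_y[i * batch_size * num_gpus:end].toarray()
    sess.run(batch_prediction, feed_dict={X: batch_x, Y: batch_y})
print('total prediction time: %.3f s' % (time.time() - start))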