I am running the Inception distributed training code. Since the official version only supports a single GPU per machine, I modified the code to use multiple GPUs across multiple machines. After looking at Issue 54 and "distribute tensorflow with multiple gpu", I found that the most convenient way is to treat each GPU as a worker, which can easily be done with the help of {{1}}. Here I use ZhuFengdaaa's code (thanks to the author!). The main logic is to add a with tf.device() block parameterized by a gpu_id flag, specifically with tf.device('/job:worker/task:%d/gpu:%d' % (FLAGS.task_id, FLAGS.gpu_id)).
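To make that concrete, here is a minimal sketch of the device placement I mean (paraphrased rather than the exact code from the fork; gpu_id is the flag I added, while task_id already exists in the original inception flags):

import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_integer('task_id', 0, 'Index of this worker task in the cluster')
flags.DEFINE_integer('gpu_id', 0, 'Which GPU on this machine the worker uses')
FLAGS = flags.FLAGS

# Pin this worker's model replica to its own GPU within its own task.
with tf.device('/job:worker/task:%d/gpu:%d' % (FLAGS.task_id, FLAGS.gpu_id)):
    # The real code builds the Inception replica here, e.g.
    # logits = inception.inference(images, num_classes, for_training=True)
    dummy = tf.constant(1.0)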
I use two machines, pc1 and pc2, each with four GPUs (Titan X, 12 GB of memory each); each machine uses two of its GPUs, so there are four workers in total. I also use both machines as parameter servers. The TensorFlow version I am using is v1.0.0-alpha. The scripts I use for training are as follows:
# run worker_0 on pc1 on GPU 0
bazel-bin/inception/imagenet_distributed_train \
--batch_size=10 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=0 \
--gpu_id=0 \
--ps_hosts='pc1:3333,pc2:3333' \
--worker_hosts='pc1:2001,pc1:2002,pc2:2001,pc2:2002'
# run worker_1 on pc1 on GPU 1
bazel-bin/inception/imagenet_distributed_train \
--batch_size=10 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=1 \
--gpu_id=1 \
--ps_hosts='pc1:3333,pc2:3333' \
--worker_hosts='pc1:2001,pc1:2002,pc2:2001,pc2:2002'
# run ps_0 on pc1
CUDA_VISIBLE_DEVICES='' bazel-bin/inception/imagenet_distributed_train \
--job_name='ps' \
--task_id=0 \
--ps_hosts='pc1:3333,pc2:3333' \
--worker_hosts='pc1:2001,pc1:2002,pc2:2001,pc2:2002'
# run worker_2 on pc2 on GPU 0
bazel-bin/inception/imagenet_distributed_train \
--batch_size=10 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=2 \
--gpu_id=0 \
--ps_hosts='pc1:3333,pc2:3333' \
--worker_hosts='pc1:2001,pc1:2002,pc2:2001,pc2:2002'
# run worker_3 on pc2 on GPU 1
bazel-bin/inception/imagenet_distributed_train \
--batch_size=10 \
--data_dir=$HOME/imagenet-data \
--job_name='worker' \
--task_id=3 \
--gpu_id=1 \
--ps_hosts='pc1:3333,pc2:3333' \
--worker_hosts='pc1:2001,pc1:2002,pc2:2001,pc2:2002'
# run ps_1 on pc2
CUDA_VISIBLE_DEVICES='' bazel-bin/inception/imagenet_distributed_train \
--job_name='ps' \
--task_id=1 \
--ps_hosts='pc1:3333,pc2:3333' \
--worker_hosts='pc1:2001,pc1:2002,pc2:2001,pc2:2002'
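(For context, my understanding is that the --ps_hosts and --worker_hosts flags above are turned into a cluster spec roughly like the sketch below; this is a simplified illustration, not the exact code of imagenet_distributed_train.py, and running it verbatim would of course require the pc1/pc2 hosts to resolve.)

import tensorflow as tf

# Sketch: how the --ps_hosts / --worker_hosts strings become a cluster definition.
ps_hosts = 'pc1:3333,pc2:3333'.split(',')
worker_hosts = 'pc1:2001,pc1:2002,pc2:2001,pc2:2002'.split(',')

cluster_spec = tf.train.ClusterSpec({'ps': ps_hosts, 'worker': worker_hosts})
# Each process then starts a server for its own job and task, e.g. worker_1:
server = tf.train.Server(cluster_spec, job_name='worker', task_index=1)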
Then I got the following ran out of memory error on worker_1 and worker_3, and the program crashed:
W tensorflow/core/common_runtime/bfc_allocator.cc:274] *********************************************************************************************xxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 128B. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[32]
Traceback (most recent call last):
File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/imagenet_distributed_train.py", line 65, in <module>
tf.app.run()
File "/home/AIJ/tensorflow/_python_build/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/imagenet_distributed_train.py", line 61, in main
inception_distributed_train.train(server.target, dataset, cluster_spec)
File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/inception_distributed_train.py", line 291, in train
loss_value, step = sess.run([train_op, global_step])
File "/home/AIJ/tensorflow/_python_build/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/home/AIJ/tensorflow/_python_build/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/home/AIJ/tensorflow/_python_build/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/home/AIJ/tensorflow/_python_build/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10,149,149,32]
[[Node: inception_v3/conv0/BatchNorm/moments/sufficient_statistics/Sub = Sub[T=DT_FLOAT, _device="/job:worker/replica:0/task:3/gpu:1"](inception_v3/conv0/Conv2D, inception_v3/conv0/BatchNorm/moments/StopGradient)]]
[[Node: gradients/AddN_86_S4321 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:1/cpu:0", send_device="/job:worker/replica:0/task:3/gpu:1", send_device_incarnation=-3559016887808157144, tensor_name="edge_32623_gradients/AddN_86", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:1/cpu:0"]()]]
Caused by op u'inception_v3/conv0/BatchNorm/moments/sufficient_statistics/Sub', defined at:
File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/imagenet_distributed_train.py", line 65, in <module>
tf.app.run()
File "/home/AIJ/tensorflow/_python_build/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/imagenet_distributed_train.py", line 61, in main
inception_distributed_train.train(server.target, dataset, cluster_spec)
File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/inception_distributed_train.py", line 161, in train
logits = inception.inference(images, num_classes, for_training=True)
File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/inception_model.py", line 87, in inference
scope=scope)
File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/slim/inception_model.py", line 87, in inception_v3
scope='conv0')
File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/slim/scopes.py", line 155, in func_with_args
return func(*args, **current_args)
File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/slim/ops.py", line 234, in conv2d
outputs = batch_norm(conv, **batch_norm_params)
File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/slim/scopes.py", line 155, in func_with_args
return func(*args, **current_args)
File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/slim/ops.py", line 114, in batch_norm
mean, variance = tf.nn.moments(inputs, axis)
File "/home/AIJ/tensorflow/_python_build/tensorflow/python/ops/nn_impl.py", line 619, in moments
y, axes, shift=shift, keep_dims=keep_dims, name=name)
File "/home/AIJ/tensorflow/_python_build/tensorflow/python/ops/nn_impl.py", line 536, in sufficient_statistics
m_ss = math_ops.subtract(x, shift)
File "/home/AIJ/tensorflow/_python_build/tensorflow/python/ops/math_ops.py", line 372, in subtract
return gen_math_ops._sub(x, y, name)
File "/home/AIJ/tensorflow/_python_build/tensorflow/python/ops/gen_math_ops.py", line 2775, in _sub
result = _op_def_lib.apply_op("Sub", x=x, y=y, name=name)
File "/home/AIJ/tensorflow/_python_build/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/home/AIJ/tensorflow/_python_build/tensorflow/python/framework/ops.py", line 2395, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/AIJ/tensorflow/_python_build/tensorflow/python/framework/ops.py", line 1264, in __init__
self._traceback = _extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[10,149,149,32]
[[Node: inception_v3/conv0/BatchNorm/moments/sufficient_statistics/Sub = Sub[T=DT_FLOAT, _device="/job:worker/replica:0/task:3/gpu:1"](inception_v3/conv0/Conv2D, inception_v3/conv0/BatchNorm/moments/StopGradient)]]
[[Node: gradients/AddN_86_S4321 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:1/cpu:0", send_device="/job:worker/replica:0/task:3/gpu:1", send_device_incarnation=-3559016887808157144, tensor_name="edge_32623_gradients/AddN_86", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:1/cpu:0"]()]]
I used nvidia-smi to check the GPU memory usage on each machine. I found that even when I use only one GPU per machine (in which case the program runs fine), the memory of all four GPUs on each machine is almost fully occupied, i.e. it reaches ~12 GB. Based on this observation, I suspect the ran out of memory error occurs because there is not enough memory left to run a second worker on each machine. I tried to make the program allocate only as much GPU memory as it actually needs at runtime, but the error is still there.
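For clarity, by "allocate only as much GPU memory as needed" I mean the standard per-process GPU options, roughly as in the sketch below. This is only an illustration of the kind of config I tried; I am not certain it is wired into the right session inside the inception scripts:

import tensorflow as tf

# Grow GPU memory on demand instead of reserving the whole card up front.
gpu_options = tf.GPUOptions(allow_growth=True)
# Alternative: hard-cap the fraction of each GPU this process may use,
# e.g. ~45% so that two workers could share one 12 GB card.
# gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.45)

sess_config = tf.ConfigProto(gpu_options=gpu_options, allow_soft_placement=True)
sess = tf.Session(target='', config=sess_config)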
Can someone help me fix this problem? Thanks in advance.