Distributed TensorFlow: running multi-GPU on each machine: ran out of memory errors

Time: 2017-02-10 13:21:33

Tags: tensorflow distributed

I am running the inception distributed training code. Since the official version only supports a single GPU per machine, I modified the code to use multiple GPUs on multiple machines.

After looking at Issue 54 (distribute tensorflow with multiple gpu), I found that the most convenient way is to treat each GPU as a separate worker, which can easily be achieved with the help of with tf.device(). Here I use ZhuFengdaaa's code (thanks to the author!). The main logic is to add a gpu_id flag and use with tf.device('/job:worker/task:%d/gpu:%d' % (FLAGS.task_id, FLAGS.gpu_id)).
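To make this concrete, here is a minimal sketch of the kind of change I mean (simplified; the build_model wrapper and the flag definitions below are just my own illustration, not ZhuFengdaaa's exact code):

```python
import tensorflow as tf

from inception import inception_model as inception  # model code from tf_models/inception

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_integer('task_id', 0, 'Index of this worker task in the cluster.')
tf.app.flags.DEFINE_integer('gpu_id', 0, 'Which local GPU this worker task should use.')


def build_model(images, num_classes):
  # Pin every op created in this block to one specific GPU of one specific
  # worker task, so each (task_id, gpu_id) pair acts as an independent worker.
  with tf.device('/job:worker/task:%d/gpu:%d' % (FLAGS.task_id, FLAGS.gpu_id)):
    logits = inception.inference(images, num_classes, for_training=True)
  return logits
```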

Specifically, I use two machines, pc1 and pc2. Each machine has 4 GPUs (Titan X, 12 GB of memory each), and I use 2 GPUs on each machine, so there are 4 workers in total. I also use both machines as parameter servers. The TensorFlow version I am using is v1.0.0-alpha. The scripts I use for training are as follows:

```shell
# run worker_0 on pc1 on GPU 0
bazel-bin/inception/imagenet_distributed_train \
  --batch_size=10 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=0 \
  --gpu_id=0 \
  --ps_hosts='pc1:3333,pc2:3333' \
  --worker_hosts='pc1:2001,pc1:2002,pc2:2001,pc2:2002'

# run worker_1 on pc1 on GPU 1
bazel-bin/inception/imagenet_distributed_train \
  --batch_size=10 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=1 \
  --gpu_id=1 \
  --ps_hosts='pc1:3333,pc2:3333' \
  --worker_hosts='pc1:2001,pc1:2002,pc2:2001,pc2:2002'

# run ps_0 on pc1
CUDA_VISIBLE_DEVICES='' bazel-bin/inception/imagenet_distributed_train \
  --job_name='ps' \
  --task_id=0 \
  --ps_hosts='pc1:3333,pc2:3333' \
  --worker_hosts='pc1:2001,pc1:2002,pc2:2001,pc2:2002'

# run worker_2 on pc2 on GPU 0
bazel-bin/inception/imagenet_distributed_train \
  --batch_size=10 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=2 \
  --gpu_id=0 \
  --ps_hosts='pc1:3333,pc2:3333' \
  --worker_hosts='pc1:2001,pc1:2002,pc2:2001,pc2:2002'

# run worker_3 on pc2 on GPU 1
bazel-bin/inception/imagenet_distributed_train \
  --batch_size=10 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=3 \
  --gpu_id=1 \
  --ps_hosts='pc1:3333,pc2:3333' \
  --worker_hosts='pc1:2001,pc1:2002,pc2:2001,pc2:2002'

# run ps_1 on pc2
CUDA_VISIBLE_DEVICES='' bazel-bin/inception/imagenet_distributed_train \
  --job_name='ps' \
  --task_id=1 \
  --ps_hosts='pc1:3333,pc2:3333' \
  --worker_hosts='pc1:2001,pc1:2002,pc2:2001,pc2:2002'
```
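For reference, the --ps_hosts / --worker_hosts flags above describe the cluster sketched below. This is only my own sketch of how I understand imagenet_distributed_train.py builds the cluster spec and server, with the host lists hard-coded here instead of parsed from the flags:

```python
import tensorflow as tf

# Two parameter-server tasks (one per machine) and four worker tasks
# (two per machine, one per GPU actually used).
cluster_spec = tf.train.ClusterSpec({
    'ps': ['pc1:3333', 'pc2:3333'],
    'worker': ['pc1:2001', 'pc1:2002', 'pc2:2001', 'pc2:2002'],
})

# Each of the six processes above creates a server for its own task, e.g. the
# process launched with --job_name='worker' --task_id=1 on pc1:
server = tf.train.Server(cluster_spec, job_name='worker', task_index=1)
```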

Then I got the following ran out of memory errors on worker_1 and worker_3, and the program crashed:

```
W tensorflow/core/common_runtime/bfc_allocator.cc:274] *********************************************************************************************xxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 128B. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[32]
Traceback (most recent call last):
  File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/imagenet_distributed_train.py", line 65, in <module>
    tf.app.run()
  File "/home/AIJ/tensorflow/_python_build/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/imagenet_distributed_train.py", line 61, in main
    inception_distributed_train.train(server.target, dataset, cluster_spec)
  File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/inception_distributed_train.py", line 291, in train
    loss_value, step = sess.run([train_op, global_step])
  File "/home/AIJ/tensorflow/_python_build/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/home/AIJ/tensorflow/_python_build/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/home/AIJ/tensorflow/_python_build/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/home/AIJ/tensorflow/_python_build/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10,149,149,32]
     [[Node: inception_v3/conv0/BatchNorm/moments/sufficient_statistics/Sub = Sub[T=DT_FLOAT, _device="/job:worker/replica:0/task:3/gpu:1"](inception_v3/conv0/Conv2D, inception_v3/conv0/BatchNorm/moments/StopGradient)]]
     [[Node: gradients/AddN_86_S4321 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:1/cpu:0", send_device="/job:worker/replica:0/task:3/gpu:1", send_device_incarnation=-3559016887808157144, tensor_name="edge_32623_gradients/AddN_86", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:1/cpu:0"]()]]

Caused by op u'inception_v3/conv0/BatchNorm/moments/sufficient_statistics/Sub', defined at:
  File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/imagenet_distributed_train.py", line 65, in <module>
    tf.app.run()
  File "/home/AIJ/tensorflow/_python_build/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/imagenet_distributed_train.py", line 61, in main
    inception_distributed_train.train(server.target, dataset, cluster_spec)
  File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/inception_distributed_train.py", line 161, in train
    logits = inception.inference(images, num_classes, for_training=True)
  File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/inception_model.py", line 87, in inference
    scope=scope)
  File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/slim/inception_model.py", line 87, in inception_v3
    scope='conv0')
  File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/slim/scopes.py", line 155, in func_with_args
    return func(*args, **current_args)
  File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/slim/ops.py", line 234, in conv2d
    outputs = batch_norm(conv, **batch_norm_params)
  File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/slim/scopes.py", line 155, in func_with_args
    return func(*args, **current_args)
  File "/home/AIJ/tf_models/models/inception/bazel-bin/inception/imagenet_distributed_train.runfiles/inception/inception/slim/ops.py", line 114, in batch_norm
    mean, variance = tf.nn.moments(inputs, axis)
  File "/home/AIJ/tensorflow/_python_build/tensorflow/python/ops/nn_impl.py", line 619, in moments
    y, axes, shift=shift, keep_dims=keep_dims, name=name)
  File "/home/AIJ/tensorflow/_python_build/tensorflow/python/ops/nn_impl.py", line 536, in sufficient_statistics
    m_ss = math_ops.subtract(x, shift)
  File "/home/AIJ/tensorflow/_python_build/tensorflow/python/ops/math_ops.py", line 372, in subtract
    return gen_math_ops._sub(x, y, name)
  File "/home/AIJ/tensorflow/_python_build/tensorflow/python/ops/gen_math_ops.py", line 2775, in _sub
    result = _op_def_lib.apply_op("Sub", x=x, y=y, name=name)
  File "/home/AIJ/tensorflow/_python_build/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/home/AIJ/tensorflow/_python_build/tensorflow/python/framework/ops.py", line 2395, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/AIJ/tensorflow/_python_build/tensorflow/python/framework/ops.py", line 1264, in __init__
    self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[10,149,149,32]
     [[Node: inception_v3/conv0/BatchNorm/moments/sufficient_statistics/Sub = Sub[T=DT_FLOAT, _device="/job:worker/replica:0/task:3/gpu:1"](inception_v3/conv0/Conv2D, inception_v3/conv0/BatchNorm/moments/StopGradient)]]
     [[Node: gradients/AddN_86_S4321 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:1/cpu:0", send_device="/job:worker/replica:0/task:3/gpu:1", send_device_incarnation=-3559016887808157144, tensor_name="edge_32623_gradients/AddN_86", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:1/cpu:0"]()]]
```

I checked the GPU memory usage on each machine with nvidia-smi. I found that even when I use only one GPU on each machine (in which case the program runs fine), the memory of all four GPUs on each machine is almost fully occupied, i.e. it reaches ~12 GB. Based on this observation, I suspect the ran out of memory error occurs because there is not enough memory left on each machine to run one more worker. I have tried to make the program use only as much GPU memory as it needs based on runtime allocations, but the error still exists.

Could someone help me with this problem? Thanks in advance.

0 Answers:

No answers.