I am running TensorFlow GPU code on one node of a cluster and I get the error below. I can't figure out what is going on. I searched around and someone suggested the code might be creating multiple threads, but I wasn't able to fix it. Can anyone help? Thanks.
2018-07-12 16:30:47.271380: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ******************************************************************************______________________
2018-07-12 16:30:47.271434: W tensorflow/core/framework/op_kernel.cc:1198] Resource exhausted: OOM when allocating tensor with shape[132961,32,13,40] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
/share/spandh.ami1/sw/std/python/anaconda3-5.1.0/v5.1.0/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Traceback (most recent call last):
File "/share/spandh.ami1/sw/std/python/anaconda3-5.1.0/v5.1.0/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_call
return fn(*args)
File "/share/spandh.ami1/sw/std/python/anaconda3-5.1.0/v5.1.0/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1329, in _run_fn
status, run_metadata)
File "/share/spandh.ami1/sw/std/python/anaconda3-5.1.0/v5.1.0/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[132961,32,13,40] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: conv2d/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_neighbor_placeholder_0_1/_143, conv2d/kernel/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: Mean_1/_145 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_194_Mean_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Answer 0 (score: 0)
You have run out of GPU memory. TensorFlow grabs all the GPU memory it can get its hands on whenever a session is created. So if your program starts multiple processes that each create a session, the second session will have essentially nothing left to work with and will die with an OOM error.
One solution is to configure TensorFlow to allocate memory only as it is needed, at the cost of some efficiency:
import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing it all up front.
TF_CONFIG_ = tf.ConfigProto()
TF_CONFIG_.gpu_options.allow_growth = True
sess = tf.Session(config=TF_CONFIG_)
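If several processes really do have to share the same GPU, another option is to cap how much memory each session may claim. This is a minimal sketch using the same TensorFlow 1.x API as above; the 0.4 fraction is just an illustrative value, not something taken from your setup:

import tensorflow as tf

# Limit this process to roughly 40% of the GPU's memory so other sessions can coexist.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.4)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))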