I am running TensorFlow GPU code on one node of a cluster and I get the error below. I can't figure out what is going on. I searched around and someone suggested the code might be creating multiple threads, but I wasn't able to fix it. Can anyone help? Thanks.
2018-07-12 16:30:47.271380: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ******************************************************************************______________________
2018-07-12 16:30:47.271434: W tensorflow/core/framework/op_kernel.cc:1198] Resource exhausted: OOM when allocating tensor with shape[132961,32,13,40] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
/share/spandh.ami1/sw/std/python/anaconda3-5.1.0/v5.1.0/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Traceback (most recent call last):
File "/share/spandh.ami1/sw/std/python/anaconda3-5.1.0/v5.1.0/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_call
return fn(*args)
File "/share/spandh.ami1/sw/std/python/anaconda3-5.1.0/v5.1.0/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1329, in _run_fn
status, run_metadata)
File "/share/spandh.ami1/sw/std/python/anaconda3-5.1.0/v5.1.0/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[132961,32,13,40] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: conv2d/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_neighbor_placeholder_0_1/_143, conv2d/kernel/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: Mean_1/_145 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_194_Mean_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Answer 0 (score: 0)
You have run out of GPU memory. TensorFlow grabs all the GPU memory it can get its hands on whenever a session is created. So if your program starts multiple processes that each create a session, the second session will have essentially nothing left to work with and will die with an OOM error.
One solution is to configure TensorFlow to allocate memory only as it is needed, at the cost of some efficiency:
import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing it all up front.
TF_CONFIG_ = tf.ConfigProto()
TF_CONFIG_.gpu_options.allow_growth = True
sess = tf.Session(config=TF_CONFIG_)
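If several processes really do have to share the same GPU, another option is to cap how much memory each session may claim. This is a minimal sketch using the same TensorFlow 1.x API as above; the 0.4 fraction is just an illustrative value, not something taken from your setup:

import tensorflow as tf

# Limit this process to roughly 40% of the GPU's memory so other sessions can coexist.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.4)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))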