I am working on a sparse autoencoder model with 15 convolutional layers and 21 transposed convolutional layers, running on a multi-GPU system. The code runs fine on a small dataset, but on a large dataset I hit an OOM (resource exhausted) error. I reduced the batch size to 8, but I still get the same error. Any help would be appreciated.
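For context, the input pipeline follows the standard queue-runner pattern the trace points to (ReaderReadV2 -> DecodeRaw -> shuffle_batch). The sketch below is illustrative rather than my exact code: the record length of 30003 half-precision values comes from the trace, but the reader type, file list, and queue capacities are placeholders. Note that the trace shows the OOM raised by the cuda_host_bfc allocator on CPU:0, i.e. pinned host memory used by the input queue, not GPU memory itself.

import tensorflow as tf

# Sketch of the queue-based input pipeline named in the trace
# (ReaderReadV2 -> DecodeRaw -> shuffle_batch). The record length of
# 30003 half values is from the trace; the reader type, filenames,
# and queue capacities here are illustrative placeholders.
def input_pipeline(filenames, batch_size=8):
    filename_queue = tf.train.string_input_producer(filenames)
    # A fixed-length reader; its read op appears as ReaderReadV2.
    reader = tf.FixedLengthRecordReader(record_bytes=30003 * 2)  # 2 bytes per half
    _, value = reader.read(filename_queue)
    # DecodeRaw with out_type=DT_HALF, as in the failing node.
    record = tf.decode_raw(value, tf.float16)
    record.set_shape([30003])
    # shuffle_batch creates the random_shuffle_queue named in the trace.
    # Its capacity, not just batch_size, drives pinned-host-memory use.
    return tf.train.shuffle_batch(
        [record],
        batch_size=batch_size,
        capacity=10000,
        min_after_dequeue=1000,
        num_threads=4)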
Trace:
[[Node: tower_1/DecodeRaw/_193 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_15_tower_1/DecodeRaw", tensor_type=DT_HALF, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Exception in thread QueueRunnerThread-tower_1/shuffle_batch/random_shuffle_queue-tower_1/shuffle_batch/random_shuffle_queue_enqueue:
Traceback (most recent call last):
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run
enqueue_callable()
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1205, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[30003] and type half on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cuda_host_bfc
[[Node: tower_1/DecodeRaw = DecodeRaw[little_endian=true, out_type=DT_HALF, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_1/ReaderReadV2:1)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: tower_1/DecodeRaw/_193 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_15_tower_1/DecodeRaw", tensor_type=DT_HALF, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
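Following the hint printed with the error, I believe the allocation report can be enabled like this (sess and train_op are stand-ins for my actual session and training op):

import tensorflow as tf

# Ask TensorFlow to list the allocated tensors when an OOM occurs,
# as the hint in the error message suggests.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# 'sess' and 'train_op' stand in for my real session and training op.
sess.run(train_op, options=run_options)

Given that the OOM comes from cuda_host_bfc on CPU:0 rather than GPU memory, should I be reducing the input queue capacity instead of the batch size?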