I am working on a sparse autoencoder model with 15 convolutional layers and 21 transposed convolutional layers, running on a multi-GPU system. The code runs fine on a small dataset, but on a large dataset I hit an OOM (resource exhausted) error. I reduced the batch size to 8, but I still get the same error. Any help would be appreciated.
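For context, the input pipeline follows the standard queue-runner pattern the trace points to (ReaderReadV2 -> DecodeRaw -> shuffle_batch). The sketch below is illustrative rather than my exact code: the record length of 30003 half-precision values comes from the trace, but the reader type, file list, and queue capacities are placeholders. Note that the trace shows the OOM raised by the cuda_host_bfc allocator on CPU:0, i.e. pinned host memory used by the input queue, not GPU memory itself.

import tensorflow as tf

# Sketch of the queue-based input pipeline named in the trace
# (ReaderReadV2 -> DecodeRaw -> shuffle_batch). The record length of
# 30003 half values is from the trace; the reader type, filenames,
# and queue capacities here are illustrative placeholders.
def input_pipeline(filenames, batch_size=8):
    filename_queue = tf.train.string_input_producer(filenames)
    # A fixed-length reader; its read op appears as ReaderReadV2.
    reader = tf.FixedLengthRecordReader(record_bytes=30003 * 2)  # 2 bytes per half
    _, value = reader.read(filename_queue)
    # DecodeRaw with out_type=DT_HALF, as in the failing node.
    record = tf.decode_raw(value, tf.float16)
    record.set_shape([30003])
    # shuffle_batch creates the random_shuffle_queue named in the trace.
    # Its capacity, not just batch_size, drives pinned-host-memory use.
    return tf.train.shuffle_batch(
        [record],
        batch_size=batch_size,
        capacity=10000,
        min_after_dequeue=1000,
        num_threads=4)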
Trace:
[[Node: tower_1/DecodeRaw/_193 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_15_tower_1/DecodeRaw", tensor_type=DT_HALF, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Exception in thread QueueRunnerThread-tower_1/shuffle_batch/random_shuffle_queue-tower_1/shuffle_batch/random_shuffle_queue_enqueue:
Traceback (most recent call last):
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run
enqueue_callable()
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1205, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[30003] and type half on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cuda_host_bfc
[[Node: tower_1/DecodeRaw = DecodeRaw[little_endian=true, out_type=DT_HALF, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_1/ReaderReadV2:1)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: tower_1/DecodeRaw/_193 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_15_tower_1/DecodeRaw", tensor_type=DT_HALF, _device="/job:localhost/replica:0/task:0/device:GPU:1"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
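Following the hint printed with the error, I believe the allocation report can be enabled like this (sess and train_op are stand-ins for my actual session and training op):

import tensorflow as tf

# Ask TensorFlow to list the allocated tensors when an OOM occurs,
# as the hint in the error message suggests.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# 'sess' and 'train_op' stand in for my real session and training op.
sess.run(train_op, options=run_options)

Given that the OOM comes from cuda_host_bfc on CPU:0 rather than GPU memory, should I be reducing the input queue capacity instead of the batch size?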