I am doing research on a GeForce Nvidia 2080 Ti GPU. I am trying to build a U-Net model for brain tumor segmentation. My code runs for a few hundred batches and then raises an out-of-memory (OOM) error. What could be the problem? Here is my training code:

'''
def train_model(self, model):
    history = ""
    print(model.summary())
    for ep in range(self.num_epoch):
        for batch in range(self.number_of_batches):
            print(batch, "/", self.number_of_batches, "/", ep)
            # load one batch of images/labels from disk (channels-first layout)
            self.batch_images, self.batch_labels = self.get_batch(
                batch, self.all_files, file_format='channels_first')
            # fit() is called once per batch, with epochs=1
            history = model.fit(x=self.batch_images,
                                y=self.batch_labels,
                                shuffle=True,
                                epochs=1,
                                verbose=1)
            # self.save_model_weights(self.model, history, epoch=batch)
        print("Epoch loss", ep, "==", np.average(history.history['loss']))
        # save weights after each epoch
        self.save_model_weights(model, history, epoch=ep)
'''
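In case it is relevant, here is the direction I am considering: cast each batch to float32 (my loader may be returning float64, which would match the "type double" in the error below) and replace the per-batch fit() calls with train_on_batch(), so Keras does not rebuild its input pipeline on every call. This is only a sketch reusing my own get_batch and save_model_weights helpers; I have not confirmed it avoids the OOM:

'''
import numpy as np

def train_model(self, model):
    model.summary()
    for ep in range(self.num_epoch):
        epoch_losses = []
        for batch in range(self.number_of_batches):
            images, labels = self.get_batch(
                batch, self.all_files, file_format='channels_first')
            # float64 arrays make TF allocate "double" tensors; float32 halves that
            images = np.asarray(images, dtype=np.float32)
            labels = np.asarray(labels, dtype=np.float32)
            # one gradient step without rebuilding fit()'s tf.data pipeline
            loss = model.train_on_batch(images, labels)  # scalar loss (no extra metrics compiled)
            epoch_losses.append(loss)
        print("Epoch loss", ep, "==", np.average(epoch_losses))
        self.save_model_weights(model, None, epoch=ep)  # my helper; history arg unused here
'''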
and here is the error it generates:

'''
OP_REQUIRES failed at gather_op.cc:155 : Resource exhausted: OOM when allocating tensor with shape[5,4,240,240] and type double on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-02-06 20:06:02.680769: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Resource exhausted: OOM when allocating tensor with shape[5,4,240,240] and type double on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[{{node GatherV2}}]]
[[IteratorGetNext]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[IteratorGetNext/_4]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
2020-02-06 20:06:02.681752: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Resource exhausted: OOM when allocating tensor with shape[5,4,240,240] and type double on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[{{node GatherV2}}]]
[[IteratorGetNext]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Traceback (most recent call last):
File "E:/PyCharmProjects/TF_tutorials/BrainSeg/seg.py", line 335, in <module>
mu.train_model(segnet)
File "E:/PyCharmProjects/TF_tutorials/BrainSeg/seg.py", line 296, in train_model
verbose=1)
File "E:\PyCharmProjects\TF_tutorials\venv\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 819, in fit
use_multiprocessing=use_multiprocessing)
File "E:\PyCharmProjects\TF_tutorials\venv\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 342, in fit
total_epochs=epochs)
File "E:\PyCharmProjects\TF_tutorials\venv\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 128, in run_one_epoch
batch_outs = execution_function(iterator)
File "E:\PyCharmProjects\TF_tutorials\venv\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 98, in execution_function
distributed_function(input_fn))
File "E:\PyCharmProjects\TF_tutorials\venv\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 568, in __call__
result = self._call(*args, **kwds)
File "E:\PyCharmProjects\TF_tutorials\venv\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 599, in _call
return self._stateless_fn(*args, **kwds) # pylint: disable=not-callable
File "E:\PyCharmProjects\TF_tutorials\venv\lib\site-packages\tensorflow_core\python\eager\function.py", line 2363, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "E:\PyCharmProjects\TF_tutorials\venv\lib\site-packages\tensorflow_core\python\eager\function.py", line 1611, in _filtered_call
self.captured_inputs)
File "E:\PyCharmProjects\TF_tutorials\venv\lib\site-packages\tensorflow_core\python\eager\function.py", line 1692, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "E:\PyCharmProjects\TF_tutorials\venv\lib\site-packages\tensorflow_core\python\eager\function.py", line 545, in call
ctx=ctx)
File "E:\PyCharmProjects\TF_tutorials\venv\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[5,4,240,240] and type double on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[{{node GatherV2}}]]
[[IteratorGetNext]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[IteratorGetNext/_4]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[5,4,240,240] and type double on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[{{node GatherV2}}]]
[[IteratorGetNext]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_function_7060]
Function call stack:
distributed_function -> distributed_function
'''
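One detail that stands out to me: the tensor that fails to allocate is reported as type double on the CPU allocator (shape[5,4,240,240]), even though training runs on the GPU. So my NumPy batches are presumably being loaded as float64. A quick check I plan to add right after get_batch (plain NumPy, nothing model-specific):

'''
# If these print float64, each batch costs twice the host memory of float32,
# which would match the "type double" tensors in the OOM message above.
print(self.batch_images.dtype, self.batch_labels.dtype)
print(self.batch_images.nbytes / 1e6, "MB in the image batch")
'''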
Can anyone help me resolve this issue?