Solution to my problem
Unfortunately, my question was not answered in the supposed duplicate. While the graph is indeed modified during training, calling finalize did not solve the problem, because Keras turned out to be the underlying issue. I found the correct answer here: for every model on which I call predict(), I need to call _make_predict_function() after compiling, and before finalizing and fitting with model.fit(). This "warms up" the model, as described in that answer.
The answer also explains why this happens: Keras tries to save memory by building the graph as late as possible.
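The interaction between lazy graph building and a finalized graph can be illustrated without Keras at all. The sketch below is purely illustrative: the class `LazyModel`, its `finalize()` flag, and the stand-in predict function are all invented here to mimic how Keras builds its predict function on first use and how tf.get_default_graph().finalize() freezes the graph.

```python
class LazyModel:
    """Toy stand-in for a Keras model that builds its predict function lazily."""

    def __init__(self):
        self._predict_fn = None        # not built yet, like Keras after compile()
        self._graph_finalized = False  # stand-in for tf.Graph.finalize() state

    def _make_predict_function(self):
        # Mimics Keras: building the function would add nodes to the graph,
        # which is only allowed while the graph is not finalized.
        if self._predict_fn is not None:
            return  # already built; later calls no longer touch the graph
        if self._graph_finalized:
            raise RuntimeError("Graph is finalized and cannot be modified.")
        self._predict_fn = lambda x: x * 2  # stand-in for real inference

    def finalize(self):
        # Stand-in for tf.get_default_graph().finalize().
        self._graph_finalized = True

    def predict(self, x):
        # Lazy construction: the first predict() call tries to extend the graph.
        self._make_predict_function()
        return self._predict_fn(x)


model = LazyModel()
model._make_predict_function()  # the "warm-up": build before finalizing
model.finalize()
print(model.predict(21))  # works, because nothing needs to modify the graph now
```

Without the warm-up call, the first `predict()` after `finalize()` would raise, which is the same ordering problem the linked answer describes for Keras.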
Original question
I am training an LSTM encoder-decoder model with Keras and tensorflow-gpu on an Nvidia Tesla K80 (a card with two GPU cores) in a Gentoo machine. According to nvidia-smi, no other process is using the GPU, and TensorFlow can access it normally. My batch size is 4.
Training proceeds for a while without any warnings. After some time, however, an out-of-memory exception occurs and training stops mid-epoch. I do not understand how a memory error can happen in some later epoch with no prior warning, since I assumed that once the tensors are allocated, no additional memory should be needed.
In the past, TensorFlow notified me when it allocated more than 10% of my GPU memory, but this time that did not happen.
I know that I am working with large tensors, but I do not understand why they would keep growing over time, let alone during training.
Here is some information from the time the OOM occurred (after TensorFlow had logged every chunk in use); it happened in epoch 34 of the scheduled 50:
2018-08-11 06:15:07.676836: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Sum Total of in-use chunks: 8.43GiB
2018-08-11 06:15:07.676848: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit: 11286285517
InUse: 9053307136
MaxInUse: 10209047296
NumAllocs: 975991019
MaxAllocSize: 1223803648
2018-08-11 06:15:07.698578: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ******************************************_________*********************_________*******************
2018-08-11 06:15:07.815583: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[4,7328,9420] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "GridSearch.py", line 139, in <module>
for (models, vals) in gs:
File "../lstms/GridSearch.py", line 25, in next_function_call
yield self.function_call(**val), val
File "../lstms/modeling.py", line 168, in define_and_train
train.fit([x1, x2], y, epochs=n_epoch, callbacks=[checkpointer])
File "/usr/lib64/python3.6/site-packages/keras/engine/training.py", line 1042, in fit
validation_steps=validation_steps)
File "/usr/lib64/python3.6/site-packages/keras/engine/training_arrays.py", line 199, in fit_loop
outs = f(ins_batch)
File "/usr/lib64/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2661, in __call__
return self._call(inputs)
File "/usr/lib64/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2631, in _call
fetched = self._callable_fn(*array_vals)
File "/usr/lib64/python3.6/site-packages/tensorflow/python/client/session.py", line 1454, in __call__
self._session._session, self._handle, args, status, None)
File "/usr/lib64/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,7328,9420] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: dec_dense_48/add = Add[T=DT_FLOAT, _class=["loc:@training_24/Adam/gradients/dec_dense_48/add_grad/Sum"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](dec_dense_48/Reshape_2, dec_dense_48/Reshape_3)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Node: loss_24/mul/_2957 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3053_loss_24/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
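For a sense of scale, the single tensor named in the OOM message above, shape [4, 7328, 9420] with 4-byte floats, already accounts for roughly 1 GiB on its own; the arithmetic is plain Python and nothing Keras-specific:

```python
# Size of the tensor from the OOM message: shape [4, 7328, 9420], dtype float32.
elements = 4 * 7328 * 9420   # product of the tensor's dimensions
bytes_needed = elements * 4  # float32 takes 4 bytes per element

print(elements)              # 276,119,040 elements
print(bytes_needed / 2**30)  # about 1.03 GiB for this one tensor
```

With an 8.43 GiB sum of in-use chunks already allocated, one more allocation of this size is enough to exhaust the roughly 11 GiB limit reported in the stats.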
Update: A comment linked to another question, whose answer suggests using tf.get_default_graph().finalize(). After defining the model with the Keras functional API, I finalize the graph before starting training. But calling model.fit() then raises an exception.
So is this a bug in Keras, meaning that I cannot train my model without it modifying the graph? Or is this an entirely different problem?