Why does Tensorflow-GPU run out of memory mid-epoch?

Date: 2018-08-13 09:46:39

Tags: python tensorflow keras

The solution to my problem

Unfortunately, my problem was not answered by the question this was supposedly a duplicate of. While the graph is indeed modified during training, calling finalize did not fix the problem, because Keras turned out to be the underlying cause. I found the correct answer here. For every model on which I call predict(), I need to call _make_predict_function() after compiling it; as explained in that answer, the models have to be "warmed up" this way before finalizing the graph and calling model.fit().

The answer also explains why this happens: Keras tries to save memory by building graph components as late as possible.

Original question

I am training an LSTM encoder-decoder model with Keras and Tensorflow-GPU on an Nvidia Tesla K80 (which has two GPU cores) in a Gentoo machine. According to nvidia-smi, no other process is using the GPU, and TensorFlow has normal access to it. My batch size is 4.

Training runs for a while without any warnings. After some time, however, an out-of-memory exception is thrown and training stops mid-epoch. I don't understand how a memory error can occur partway through an epoch, with no prior warning, since I assumed that once the tensors are allocated, no additional memory should be needed.

In the past, Tensorflow has notified me when it allocated more than 10% of my GPU memory, but that did not happen this time.

Below is some of the output from when the OOM occurred (after tensorflow logged every chunk in use); it happened in epoch 34 of a planned 50:

```
2018-08-11 06:15:07.676836: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Sum Total of in-use chunks: 8.43GiB
2018-08-11 06:15:07.676848: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit:                 11286285517
InUse:                  9053307136
MaxInUse:              10209047296
NumAllocs:               975991019
MaxAllocSize:           1223803648

2018-08-11 06:15:07.698578: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ******************************************_________*********************_________*******************
2018-08-11 06:15:07.815583: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[4,7328,9420] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "GridSearch.py", line 139, in <module>
    for (models, vals) in gs:
  File "../lstms/GridSearch.py", line 25, in next_function_call
    yield self.function_call(**val), val
  File "../lstms/modeling.py", line 168, in define_and_train
    train.fit([x1, x2], y, epochs=n_epoch, callbacks=[checkpointer])
  File "/usr/lib64/python3.6/site-packages/keras/engine/training.py", line 1042, in fit
    validation_steps=validation_steps)
  File "/usr/lib64/python3.6/site-packages/keras/engine/training_arrays.py", line 199, in fit_loop
    outs = f(ins_batch)
  File "/usr/lib64/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2661, in __call__
    return self._call(inputs)
  File "/usr/lib64/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2631, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/lib64/python3.6/site-packages/tensorflow/python/client/session.py", line 1454, in __call__
    self._session._session, self._handle, args, status, None)
  File "/usr/lib64/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,7328,9420] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: dec_dense_48/add = Add[T=DT_FLOAT, _class=["loc:@training_24/Adam/gradients/dec_dense_48/add_grad/Sum"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](dec_dense_48/Reshape_2, dec_dense_48/Reshape_3)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
	 [[Node: loss_24/mul/_2957 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3053_loss_24/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
```

I know that I am working with large tensors, but I don't understand why they keep growing over time, let alone mid-training.

Update: A comment linked to another question, whose answer suggests using tf.get_default_graph().finalize(). After defining the model with the Keras functional API, I finalized the graph before starting the training. But calling model.fit() then raised an exception, because fit() modifies the graph.
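For a sense of scale, the single tensor in the failed allocation is already enormous; a quick back-of-the-envelope check in pure Python (assuming 4 bytes per element, from the `type float` in the log):

```python
# Size of the tensor that failed to allocate: shape [4, 7328, 9420],
# float32 (4 bytes per element), per the OOM message above.
batch, timesteps, features = 4, 7328, 9420
size_bytes = batch * timesteps * features * 4
print(size_bytes)                    # 1104476160
print(round(size_bytes / 2**30, 2))  # 1.03 (GiB) for this one tensor
```

With about 8.4 GiB already in use against an 11 GiB limit, one extra allocation of roughly this size can be enough to push past the limit, which would explain why the failure appears abruptly mid-epoch.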

So is this a bug in Keras, meaning that I cannot train a model without the graph being modified? Or is it a different problem entirely?

0 Answers:

There are no answers.