I recently converted some TensorFlow code from queues to the Dataset API, to take advantage of the iterator-reset feature for my validation dataset. It works fine, except that an OOM error consistently occurs during the third validation pass. (Validation takes 700+ steps, and the code always crashes at around step 400+ of the third validation run.)
val_step = 434, step_cost = 62.8875665665, total_cost = 24353.1902707
2017-10-27 11:47:08.602743: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 7.50MiB. Current allocation summary follows.
2017-10-27 11:47:08.605489: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-27 11:47:08.605520: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (512): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-27 11:47:08.605536: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1024): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
....
....
....
2017-10-27 11:47:08.749632: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 14.83GiB
2017-10-27 11:47:08.749646: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 15927646618
InUse: 15927646464
MaxInUse: 15927646464
NumAllocs: 457706168
MaxAllocSize: 42035200
2017-10-27 11:47:08.750506: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2017-10-27 11:47:08.750540: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[128,60,256]
Traceback (most recent call last):
File "lstm_cls_att_emb.py", line 1091, in <module>
tf.app.run()
File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "lstm_cls_att_emb.py", line 1087, in main
training()
File "lstm_cls_att_emb.py", line 746, in training
batch_check = FLAGS.val_batch_check)
File "lstm_cls_att_emb.py", line 411, in process_chunk
feed_dict=feed_dict)
File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[128,60,256]
[[Node: model_att_1/attention/strided_slice_199 = StridedSlice[Index=DT_INT32, T=DT_FLOAT, begin_mask=5, ellipsis_mask=0, end_mask=5, new_axis_mask=0, shrink_axis_mask=0, _device="/job:localhost/replica:0/task:0/gpu:0"](model_att_1/attention/concat, model_att_1/attention/strided_slice_199/stack, model_att_1/attention/strided_slice_199/stack_1, model_att_1/attention/strided_slice_199/stack_2)]]
[[Node: div_1/_69195 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_11676_div_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'model_att_1/attention/strided_slice_199', defined at:
File "lstm_cls_att_emb.py", line 1091, in <module>
tf.app.run()
File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "lstm_cls_att_emb.py", line 1087, in main
training()
File "lstm_cls_att_emb.py", line 594, in training
scope=model_scope)
File "lstm_cls_att_emb.py", line 320, in build_model
name="attention")
File "/data/wzhan/workspace/new_fraud_models/attention_emb.py", line 223, in __init__
memory = self.memory[:, time_step:time_step+attention_window_size, :] # batch_size * window_size * dim
File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 509, in _SliceHelper
name=name)
File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 677, in strided_slice
shrink_axis_mask=shrink_axis_mask)
File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3744, in strided_slice
shrink_axis_mask=shrink_axis_mask, name=name)
File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[128,60,256]
[[Node: model_att_1/attention/strided_slice_199 = StridedSlice[Index=DT_INT32, T=DT_FLOAT, begin_mask=5, ellipsis_mask=0, end_mask=5, new_axis_mask=0, shrink_axis_mask=0, _device="/job:localhost/replica:0/task:0/gpu:0"](model_att_1/attention/concat, model_att_1/attention/strided_slice_199/stack, model_att_1/attention/strided_slice_199/stack_1, model_att_1/attention/strided_slice_199/stack_2)]]
[[Node: div_1/_69195 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_11676_div_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Each time after saving the current model, I call

sess.run(validation_iterator.initializer)

to reset the iterator and start the validation steps.
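For reference, the reset-and-validate pattern I am using looks roughly like the sketch below. The toy data and the `validation_iterator` name are placeholders, not my actual pipeline; the sketch uses `tf.compat.v1` so it also runs on newer TensorFlow builds (on a 1.x install, a plain `import tensorflow as tf` exposes the same session/iterator API):

```python
import numpy as np
import tensorflow.compat.v1 as tf  # on TF 1.x, `import tensorflow as tf` is equivalent

tf.disable_eager_execution()  # needed only on TF 2.x

# Toy validation data standing in for the real input pipeline.
val_features = np.arange(20, dtype=np.float32).reshape(10, 2)

# 10 rows batched by 4 -> 3 batches per validation pass.
val_dataset = tf.data.Dataset.from_tensor_slices(val_features).batch(4)
validation_iterator = val_dataset.make_initializable_iterator()
next_batch = validation_iterator.get_next()

with tf.Session() as sess:
    for checkpoint in range(3):
        # ... train and save the model here ...
        sess.run(validation_iterator.initializer)  # reset before each validation pass
        steps = 0
        while True:
            try:
                sess.run(next_batch)   # in the real code: run the validation ops
                steps += 1
            except tf.errors.OutOfRangeError:
                break                  # iterator exhausted; validation pass done
```

The key point is that `sess.run(validation_iterator.initializer)` rewinds the same iterator for every pass, so no new graph nodes should be created between validation runs.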
Additional information: TensorFlow 1.3.0, TensorLayer 1.6.5, CUDA 8, cuDNN 6