OOM after resetting the validation dataset iterator for the third time

Date: 2017-10-27 17:44:22

Tags: validation tensorflow iterator dataset

I recently converted a piece of TensorFlow code from queues to the Dataset API in order to take advantage of being able to reset the iterator over the validation dataset. It works fine, except that an OOM error always occurs during the third validation pass. (Validation takes 700+ steps, and the code always crashes at around step 400+ of the third validation run.)
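For reference, the validation input pipeline is set up roughly like the sketch below (simplified: the arrays and the names val_features / val_labels are placeholders for the real pre-processed data, and in TF 1.3 the Dataset API still lives under tf.contrib.data):

import numpy as np
import tensorflow as tf

# Placeholder data standing in for the real pre-processed validation set.
val_features = np.zeros((1000, 60, 256), dtype=np.float32)
val_labels = np.zeros((1000,), dtype=np.int64)

# In TF 1.3 the Dataset API lives under tf.contrib.data (it moved to tf.data in 1.4).
val_dataset = tf.contrib.data.Dataset.from_tensor_slices(
    (val_features, val_labels)).batch(128)

# An initializable iterator can be re-initialized between validation passes,
# which is what replaced the old queue-based input.
validation_iterator = val_dataset.make_initializable_iterator()
next_val_batch = validation_iterator.get_next()

The full error output from the crash is below: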

val_step = 434, step_cost = 62.8875665665, total_cost = 24353.1902707
2017-10-27 11:47:08.602743: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 7.50MiB.  Current allocation summary follows.
2017-10-27 11:47:08.605489: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (256):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-27 11:47:08.605520: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (512):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-27 11:47:08.605536: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1024):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
....
....
....
2017-10-27 11:47:08.749632: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 14.83GiB
2017-10-27 11:47:08.749646: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit:                 15927646618
InUse:                 15927646464
MaxInUse:              15927646464
NumAllocs:               457706168
MaxAllocSize:             42035200

2017-10-27 11:47:08.750506: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2017-10-27 11:47:08.750540: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[128,60,256]
Traceback (most recent call last):
  File "lstm_cls_att_emb.py", line 1091, in <module>
    tf.app.run()
  File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "lstm_cls_att_emb.py", line 1087, in main
    training()
  File "lstm_cls_att_emb.py", line 746, in training
    batch_check = FLAGS.val_batch_check)
  File "lstm_cls_att_emb.py", line 411, in process_chunk
    feed_dict=feed_dict)
  File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[128,60,256]
   [[Node: model_att_1/attention/strided_slice_199 = StridedSlice[Index=DT_INT32, T=DT_FLOAT, begin_mask=5, ellipsis_mask=0, end_mask=5, new_axis_mask=0, shrink_axis_mask=0, _device="/job:localhost/replica:0/task:0/gpu:0"](model_att_1/attention/concat, model_att_1/attention/strided_slice_199/stack, model_att_1/attention/strided_slice_199/stack_1, model_att_1/attention/strided_slice_199/stack_2)]]
   [[Node: div_1/_69195 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_11676_div_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Caused by op u'model_att_1/attention/strided_slice_199', defined at:
  File "lstm_cls_att_emb.py", line 1091, in <module>
    tf.app.run()
  File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "lstm_cls_att_emb.py", line 1087, in main
    training()
  File "lstm_cls_att_emb.py", line 594, in training
    scope=model_scope)
  File "lstm_cls_att_emb.py", line 320, in build_model
    name="attention")
  File "/data/wzhan/workspace/new_fraud_models/attention_emb.py", line 223, in __init__
    memory = self.memory[:, time_step:time_step+attention_window_size, :]  # batch_size * window_size * dim
  File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 509, in _SliceHelper
    name=name)
  File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 677, in strided_slice
    shrink_axis_mask=shrink_axis_mask)
  File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3744, in            strided_slice
    shrink_axis_mask=shrink_axis_mask, name=name)
  File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in      apply_op
    op_def=op_def)
  File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/data/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[128,60,256]
   [[Node: model_att_1/attention/strided_slice_199 = StridedSlice[Index=DT_INT32, T=DT_FLOAT, begin_mask=5, ellipsis_mask=0, end_mask=5, new_axis_mask=0, shrink_axis_mask=0, _device="/job:localhost/replica:0/task:0/gpu:0"](model_att_1/attention/concat, model_att_1/attention/strided_slice_199/stack, model_att_1/attention/strided_slice_199/stack_1, model_att_1/attention/strided_slice_199/stack_2)]]
   [[Node: div_1/_69195 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_11676_div_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Each time after saving the current model, I call

sess.run(validation_iterator.initializer)

to reset the iterator and start the validation steps.
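The surrounding loop looks roughly like this (continuing from the sketch above; val_loss is a dummy op standing in for the real validation loss, and the training steps and checkpoint saving are elided):

# Dummy op in place of the real validation loss computed from next_val_batch.
val_loss = tf.reduce_mean(next_val_batch[0])

with tf.Session() as sess:
    for epoch in range(3):  # the OOM always hits during the third validation pass
        # ... run the training steps for this epoch, then save the checkpoint ...

        # Reset the validation iterator and run one full pass over the validation set.
        sess.run(validation_iterator.initializer)
        val_step = 0
        while True:
            try:
                step_cost = sess.run(val_loss)
                val_step += 1
            except tf.errors.OutOfRangeError:
                break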

Additional information: TensorFlow 1.3.0, TensorLayer 1.6.5, CUDA 8, cuDNN 6

0 Answers:

There are no answers yet.