TensorFlow OOM after a number of training steps when using the Object Detection API

Date: 2017-10-24 12:44:43

Tags: python tensorflow object-detection object-detection-api

I am training my own object detection model with the Google Object Detection API. Everything looks fine at first, and training proceeds like this:

2017-10-24 17:40:50.579603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: GeForce GTX 1050 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.392
pciBusID 0000:01:00.0
Total memory: 3.94GiB
Free memory: 3.55GiB
2017-10-24 17:40:50.579617: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 
2017-10-24 17:40:50.579621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y 
2017-10-24 17:40:50.579627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0)
2017-10-24 17:40:51.234252: I tensorflow/core/common_runtime/simple_placer.cc:675] Ignoring device specification /device:GPU:0 for node 'prefetch_queue_Dequeue' because the input edge from 'prefetch_queue' is a reference connection and already has a device field set to /device:CPU:0
INFO:tensorflow:Restoring parameters from ssd_mobilenet_v1_coco_11_06_2017/model.ckpt
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path training/model/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 1: loss = 14.9167 (3.799 sec/step)
INFO:tensorflow:global step 2: loss = 12.3885 (1.003 sec/step)
INFO:tensorflow:global step 3: loss = 11.5575 (0.825 sec/step)
2017-10-24 17:41:00.695594: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 7141 get requests, put_count=7131 evicted_count=1000 eviction_rate=0.140233 and unsatisfied allocation rate=0.15544
2017-10-24 17:41:00.695684: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:global step 4: loss = 10.8721 (0.772 sec/step)
INFO:tensorflow:global step 5: loss = 10.2290 (0.790 sec/step)
INFO:tensorflow:global step 6: loss = 9.5224 (0.799 sec/step)
INFO:tensorflow:global step 7: loss = 9.3629 (0.797 sec/step)
INFO:tensorflow:global step 8: loss = 9.1755 (0.847 sec/step)
INFO:tensorflow:global step 9: loss = 8.3156 (0.788 sec/step)
INFO:tensorflow:global step 10: loss = 8.2479 (0.817 sec/step)
INFO:tensorflow:global step 11: loss = 7.8164 (0.762 sec/step)
INFO:tensorflow:global step 12: loss = 7.5391 (0.769 sec/step)
INFO:tensorflow:global step 13: loss = 6.9219 (0.790 sec/step)
INFO:tensorflow:global step 14: loss = 6.9487 (0.781 sec/step)
INFO:tensorflow:global step 15: loss = 6.6061 (0.793 sec/step)
INFO:tensorflow:global step 16: loss = 6.3786 (0.813 sec/step)
INFO:tensorflow:global step 17: loss = 6.1362 (0.757 sec/step)
INFO:tensorflow:global step 18: loss = 6.1345 (0.766 sec/step)
INFO:tensorflow:global step 19: loss = 6.3627 (0.754 sec/step)
INFO:tensorflow:global step 20: loss = 6.1240 (0.775 sec/step)
INFO:tensorflow:global step 21: loss = 6.0264 (0.750 sec/step)
INFO:tensorflow:global step 22: loss = 5.6904 (0.747 sec/step)
INFO:tensorflow:global step 23: loss = 4.7453 (0.751 sec/step)
INFO:tensorflow:global step 24: loss = 4.7063 (0.766 sec/step)
INFO:tensorflow:global step 25: loss = 5.0677 (0.828 sec/step)

But after a number of steps, an OOM error occurs.

INFO:tensorflow:global step 5611: loss = 1.2254 (0.780 sec/step)
INFO:tensorflow:global step 5612: loss = 0.8521 (0.755 sec/step)
INFO:tensorflow:global step 5613: loss = 1.5406 (0.786 sec/step)
INFO:tensorflow:global step 5614: loss = 1.3886 (0.748 sec/step)
INFO:tensorflow:global step 5615: loss = 1.2802 (0.740 sec/step)
INFO:tensorflow:global step 5616: loss = 0.9879 (0.755 sec/step)
INFO:tensorflow:global step 5617: loss = 0.9560 (0.774 sec/step)
INFO:tensorflow:global step 5618: loss = 1.0467 (0.755 sec/step)
INFO:tensorflow:global step 5619: loss = 1.2808 (0.763 sec/step)
INFO:tensorflow:global step 5620: loss = 1.3788 (0.753 sec/step)
INFO:tensorflow:global step 5621: loss = 1.1395 (0.727 sec/step)
INFO:tensorflow:global step 5622: loss = 1.2390 (0.751 sec/step)
2017-10-24 18:53:05.076122: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.00MiB.  Current allocation summary follows.
2017-10-24 18:53:05.076191: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (256):   Total Chunks: 2, Chunks in use: 0 512B allocated for chunks. 8B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-24 18:53:05.076214: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (512):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-24 18:53:05.076245: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1024):  Total Chunks: 1, Chunks in use: 0 1.0KiB allocated for chunks. 4B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-24 18:53:05.076276: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2048):  Total Chunks: 4, Chunks in use: 0 8.0KiB allocated for chunks. 5.6KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-24 18:53:05.076299: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4096):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-24 18:53:05.076324: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8192):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-req

I suspect it may be related to multi-GPU training.

Caused by op 'Loss/ToInt32_60', defined at:
  File "train.py", line 205, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "train.py", line 201, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/home/yuxin/Project/my_object_detection/object_detection/trainer.py", line 192, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "/home/yuxin/Project/my_object_detection/slim/deployment/model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "/home/yuxin/Project/my_object_detection/object_detection/trainer.py", line 133, in _create_losses
    losses_dict = detection_model.loss(prediction_dict)
  File "/home/yuxin/Project/my_object_detection/object_detection/meta_architectures/ssd_meta_arch.py", line 431, in loss
    location_losses, cls_losses, prediction_dict, match_list)
  File "/home/yuxin/Project/my_object_detection/object_detection/meta_architectures/ssd_meta_arch.py", line 565, in _apply_hard_mining
    match_list=match_list)
  File "/home/yuxin/Project/my_object_detection/object_detection/core/losses.py", line 479, in __call__
    self._min_negatives_per_image)
  File "/home/yuxin/Project/my_object_detection/object_detection/core/losses.py", line 541, in _subsample_selection_to_desired_neg_pos_ratio
    num_positives = tf.reduce_sum(tf.to_int32(positives_indicator))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 770, in to_int32
    return cast(x, dtypes.int32, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 689, in cast
    return gen_math_ops.cast(x, base_type, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 403, in cast
    result = _op_def_lib.apply_op("Cast", x=x, DstT=DstT, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1917]
     [[Node: Loss/ToInt32_60 = Cast[DstT=DT_INT32, SrcT=DT_BOOL, _device="/job:localhost/replica:0/task:0/gpu:0"](Loss/Gather_220/_8451)]]

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1139, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1121, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1917]
     [[Node: Loss/ToInt32_60 = Cast[DstT=DT_INT32, SrcT=DT_BOOL, _device="/job:localhost/replica:0/task:0/gpu:0"](Loss/Gather_220/_8451)]]

I am training with the Object Detection API code (excerpted below). I only want to train on a single GPU; a sketch of the relevant flags follows the excerpt.

with tf.Graph().as_default():
    # Build a configuration specifying multi-GPU and multi-replicas.
    deploy_config = model_deploy.DeploymentConfig(
        num_clones=num_clones,
        clone_on_cpu=clone_on_cpu,
        replica_id=task,
        num_replicas=worker_replicas,
        num_ps_tasks=ps_tasks,
        worker_job_name=worker_job_name)

    # Place the global step on the device storing the variables.
    with tf.device(deploy_config.variables_device()):
      global_step = slim.create_global_step()

    with tf.device(deploy_config.inputs_device()):
      input_queue = _create_input_queue(train_config.batch_size // num_clones,
                                        create_tensor_dict_fn,
                                        train_config.batch_queue_capacity,
                                        train_config.num_batch_queue_threads,
                                        train_config.prefetch_queue_capacity,
                                        data_augmentation_options)

    # Gather initial summaries.
    summaries = set(tf.get_collection(tf.GraphKeys.SUMMARIES))
    global_summaries = set([])

    model_fn = functools.partial(_create_losses,
                                 create_model_fn=create_model_fn)
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
    first_clone_scope = clones[0].scope

    # Gather update_ops from the first clone. These contain, for example,
    # the updates for the batch_norm variables created by model_fn.
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, first_clone_scope)
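
For reference, the clone count above normally comes from command-line flags in the legacy train.py, so a single-GPU run keeps it at its default of 1. A minimal sketch of how those flags are typically defined (flag names assumed from the API's train.py; check your local copy):

import tensorflow as tf

flags = tf.app.flags
# Assumed flag names from the legacy object_detection/train.py; with the
# defaults below only one model clone is built, i.e. single-GPU training.
flags.DEFINE_integer('num_clones', 1, 'Number of model clones to deploy.')
flags.DEFINE_boolean('clone_on_cpu', False, 'Whether to place clones on the CPU.')
FLAGS = flags.FLAGS

# Typical invocation (paths are placeholders):
#   python object_detection/train.py --logtostderr \
#       --pipeline_config_path=training/ssd_mobilenet_v1.config \
#       --train_dir=training/model --num_clones=1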

I know that reducing the batch size can work around it. But why does training start fine and only hit an OOM error after a number of steps? Thank you very much.
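
(For reference, a minimal sketch of where the batch size lives, assuming an SSD MobileNet v1 style pipeline config; 24 is the sample default, and lowering it, e.g. to 12 or 8, reduces peak GPU memory. Only the relevant fragment is shown.)

train_config: {
  batch_size: 24
  # ... optimizer, fine_tune_checkpoint, data_augmentation_options, etc. unchanged
}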

1 Answer:

Answer 0 (score: 1)

Could you provide your config and training files?

We often find that users who run into OOM issues have input images with a very large resolution. Pre-shrinking the images to a smaller size in the TFRecord helps avoid these problems.
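
Pre-shrinking means resizing each image before it is serialized into the TFRecord, rather than feeding full-resolution JPEGs. A minimal Python sketch (the feature keys follow the API's usual tf_example layout; a real record also needs the normalized box and class fields, and max_dim and the path are placeholders):

import io
import PIL.Image
import tensorflow as tf

def resized_jpeg_bytes(path, max_dim=640):
    """Downscale an image so its longest side is at most max_dim; return JPEG bytes and size."""
    image = PIL.Image.open(path).convert('RGB')
    image.thumbnail((max_dim, max_dim), PIL.Image.ANTIALIAS)  # in-place, keeps aspect ratio
    buf = io.BytesIO()
    image.save(buf, format='JPEG')
    return buf.getvalue(), image.width, image.height

encoded, width, height = resized_jpeg_bytes('images/example.jpg')
example = tf.train.Example(features=tf.train.Features(feature={
    'image/encoded': tf.train.Feature(bytes_list=tf.train.BytesList(value=[encoded])),
    'image/format': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'jpeg'])),
    'image/width': tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
    'image/height': tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
    # ... plus the usual normalized image/object/bbox/* and class fields,
    # which are unaffected by the resize because they are in [0, 1].
}))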