Question

I'm a beginner with TensorFlow and ML, so please excuse any obvious mistakes or newbie questions.

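I'm currently working on an object detection problem and run into GPU memory capacity issues when training with any batch size other than 1. For GPU and CUDA information during training, see the picture.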

I'm using the Faster R-CNN Inception v2 model from the TensorFlow GitHub repository.

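The train.record file is 753.5 MB.

Can this be solved with a more efficient input pipeline, or is the model on TensorFlow's GitHub already optimized? Should I change the network architecture to reduce the number of variables? Is a batch size of 1 the only/best option for getting the best accuracy?

I'm trying to learn the best way to do this, so please ask if more information is needed.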

Model configuration:

model {
  faster_rcnn {
    num_classes: 3
    image_resizer {
      fixed_shape_resizer {
        height: 200
        width: 200
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2'
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
     # grid_anchor_generator {
     #   scales: [0.25, 0.5, 1.0, 2.0, 3.0]
     #   aspect_ratios: [0.25,0.5, 1.0, 2.0]
     #   height_stride: 8
     #   width_stride: 8
     # }
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0, 3.0]
        aspect_ratios: [1.0, 2.0, 3.0]
        height: 64
        width: 64 
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.01
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.4
    first_stage_max_proposals: 100
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: True
        dropout_keep_probability: 0.9
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.01
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.5
        max_detections_per_class: 20
        max_total_detections: 20
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 32
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0002
          schedule {
            step: 50000
            learning_rate: .00002
          }
          schedule {
            step: 100000
            learning_rate: .000002
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0


# PATH_TO_BE_CONFIGURED: Below line needs to match location of model checkpoint: Either use checkpoint from rcnn model, or checkpoint from previously trained model on other dataset. 
  fine_tune_checkpoint: "...model.ckpt"

  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  # num_steps: 200000

  data_augmentation_options {
    random_horizontal_flip {}
  }
  data_augmentation_options {
    random_crop_image {
      min_object_covered: 1.0
      min_aspect_ratio: 0.5
      max_aspect_ratio: 2
      min_area: 0.2
      max_area: 1.
    }
  }
  data_augmentation_options {
    random_distort_color {}
  }
}



# PATH_TO_BE_CONFIGURED: Need to make sure folder structure below is correct for both train-record and label_map.pbtxt
train_input_reader: {
  tf_record_input_reader {
    input_path: "...train.record"
  }
  label_map_path: ".../label_map/label_map.pbtxt"
  queue_capacity: 500
  min_after_dequeue: 250
}



#PATH_TO_BE_CONFIGURED: Make sure folder structure for eval_export, validation.record and label_map.pbtxt below are correct. 
eval_config: {
  num_examples: 30
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
  num_visualizations: 30
  eval_interval_secs: 600
  visualization_export_dir: "...eval_export"
}



eval_input_reader: {
  tf_record_input_reader {
    input_path: "/...test.record"
  }
  label_map_path: "/...label_map.pbtxt"
  shuffle: True
  num_readers: 1
}

Error message:

Caused by op 'CropAndResize', defined at:
  File "...models/research/object_detection/model_main.py", line 103, in <module>
    tf.app.run()
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "...models/research/object_detection/model_main.py", line 99, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
    return self.run_local()
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
    saving_listeners=saving_listeners)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/...models/research/object_detection/model_lib.py", line 252, in model_fn
    preprocessed_images, features[fields.InputDataFields.true_image_shape])
  File "...models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 680, in predict
    self._anchors.get(), image_shape, true_image_shapes))
  File "/...models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 771, in _predict_second_stage
    rpn_features_to_crop, proposal_boxes_normalized))
  File "...models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1498, in _compute_second_stage_input_feature_maps
    (self._initial_crop_size, self._initial_crop_size))
  File "/...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/ops/gen_image_ops.py", line 390, in crop_and_resize
    extrapolation_value=extrapolation_value, name=name)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2048,17,17,1088] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node CropAndResize (defined at ...models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py:1498) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[node control_dependency (defined at ...models/research/object_detection/model_lib.py:345) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Answer 1

I think you should change the batch_size line to: batch_size: 1
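A rough calculation helps show why the batch size matters here. The sketch below takes the tensor shape from the error message, [2048,17,17,1088] of type float, and estimates how much memory that single CropAndResize output needs. It assumes float means 4 bytes per element, and it assumes that 2048 is the batch size (32) multiplied by 64 proposals per image; that split is a guess from the numbers, not something the traceback states. The helper name `tensor_gib` is made up for this illustration.

```python
from math import prod

BYTES_PER_FLOAT32 = 4  # assumption: "type float" in the error is 32-bit


def tensor_gib(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the size of a dense tensor with the given shape in GiB."""
    return prod(shape) * bytes_per_element / 2**30


# Shape reported in the OOM message; 17 matches initial_crop_size in the config.
oom_shape = (2048, 17, 17, 1088)
print(f"batch_size=32: {tensor_gib(oom_shape):.2f} GiB for this one tensor")

# Assumption: the first dimension scales linearly with batch_size
# (2048 = 32 images * 64 proposals per image).
proposals_per_image = oom_shape[0] // 32
for batch_size in (1, 4, 8, 16):
    shape = (batch_size * proposals_per_image,) + oom_shape[1:]
    print(f"batch_size={batch_size}: {tensor_gib(shape):.2f} GiB")
```

This covers only one activation tensor. Training also has to hold gradients, other activations, and the model weights, so the real footprint is larger, and a batch size somewhere between 1 and 32 may still fit depending on the GPU.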

OOM when changing batch size (Faster R-CNN Inception v2)
