我是使用tensorflow和ML的初学者,请原谅任何明显的错误或新手问题。
我目前正在处理对象检测问题,并且在批量大小不等于1的训练中遇到GPU上的内存容量问题。关于训练期间的GPU和CUDA信息,请参见图片picture。 / p>
我正在使用Tensorflow Github中的Faster R-CNN Inpcetion v2模型。
train.record文件为753.5 MB。
可以通过更高效的输入管道解决此问题,还是已经优化了tensorflow的github上的模型?是否应该更改网络体系结构以减少变量数量?批次大小1是获得最佳准确性的唯一/最佳选择吗?
我正在努力学习最好的方法,如果需要更多信息,请询问。
模型配置:
model {
faster_rcnn {
num_classes: 3
image_resizer {
fixed_shape_resizer {
height: 200
width: 200
}
}
feature_extractor {
type: 'faster_rcnn_inception_resnet_v2'
first_stage_features_stride: 8
}
first_stage_anchor_generator {
# grid_anchor_generator {
# scales: [0.25, 0.5, 1.0, 2.0, 3.0]
# aspect_ratios: [0.25,0.5, 1.0, 2.0]
# height_stride: 8
# width_stride: 8
# }
grid_anchor_generator {
scales: [0.25, 0.5, 1.0, 2.0, 3.0]
aspect_ratios: [1.0, 2.0, 3.0]
height: 64
width: 64
height_stride: 8
width_stride: 8
}
}
first_stage_atrous_rate: 2
first_stage_box_predictor_conv_hyperparams {
op: CONV
regularizer {
l2_regularizer {
weight: 0.01
}
}
initializer {
truncated_normal_initializer {
stddev: 0.01
}
}
}
first_stage_nms_score_threshold: 0.0
first_stage_nms_iou_threshold: 0.4
first_stage_max_proposals: 100
first_stage_localization_loss_weight: 2.0
first_stage_objectness_loss_weight: 1.0
initial_crop_size: 17
maxpool_kernel_size: 1
maxpool_stride: 1
second_stage_box_predictor {
mask_rcnn_box_predictor {
use_dropout: True
dropout_keep_probability: 0.9
fc_hyperparams {
op: FC
regularizer {
l2_regularizer {
weight: 0.01
}
}
initializer {
variance_scaling_initializer {
factor: 1.0
uniform: true
mode: FAN_AVG
}
}
}
}
}
second_stage_post_processing {
batch_non_max_suppression {
score_threshold: 0.0
iou_threshold: 0.5
max_detections_per_class: 20
max_total_detections: 20
}
score_converter: SOFTMAX
}
second_stage_localization_loss_weight: 2.0
second_stage_classification_loss_weight: 1.0
}
}
train_config: {
batch_size: 32
optimizer {
momentum_optimizer: {
learning_rate: {
manual_step_learning_rate {
initial_learning_rate: 0.0002
schedule {
step: 50000
learning_rate: .00002
}
schedule {
step: 100000
learning_rate: .000002
}
}
}
momentum_optimizer_value: 0.9
}
use_moving_average: false
}
gradient_clipping_by_norm: 10.0
# PATH_TO_BE_CONFIGURED: Below line needs to match location of model checkpoint: Either use checkpoint from rcnn model, or checkpoint from previously trained model on other dataset.
fine_tune_checkpoint: "...model.ckpt"
from_detection_checkpoint: true
# Note: The below line limits the training process to 200K steps, which we
# empirically found to be sufficient enough to train the pets dataset. This
# effectively bypasses the learning rate schedule (the learning rate will
# never decay). Remove the below line to train indefinitely.
# num_steps: 200000
data_augmentation_options {
random_horizontal_flip {}
}
data_augmentation_options {
random_crop_image {
min_object_covered : 1.0
min_aspect_ratio: 0.5
max_aspect_ratio: 2
min_area: 0.2
max_area: 1.
}
}
data_augmentation_options {
random_distort_color {}
}
}
# PATH_TO_BE_CONFIGURED: Need to make sure folder structure below is correct for both train-record and label_map.pbtxt
train_input_reader: {
tf_record_input_reader {
input_path: "...train.record"
}
label_map_path: ".../label_map/label_map.pbtxt"
queue_capacity: 500
min_after_dequeue: 250
}
#PATH_TO_BE_CONFIGURED: Make sure folder structure for eval_export, validation.record and label_map.pbtxt below are correct.
eval_config: {
num_examples: 30
# Note: The below line limits the evaluation process to 10 evaluations.
# Remove the below line to evaluate indefinitely.
max_evals: 10
num_visualizations: 30
eval_interval_secs: 600
visualization_export_dir: "...eval_export"
}
eval_input_reader: {
tf_record_input_reader {
input_path: "/...test.record"
}
label_map_path: "/...label_map.pbtxt"
shuffle: True
num_readers: 1
}
错误消息:
Caused by op 'CropAndResize', defined at:
File "...models/research/object_detection/model_main.py", line 103, in <module>
tf.app.run()
File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "...models/research/object_detection/model_main.py", line 99, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
return self.run_local()
File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
saving_listeners=saving_listeners)
File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/...models/research/object_detection/model_lib.py", line 252, in model_fn
preprocessed_images, features[fields.InputDataFields.true_image_shape])
File "...models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 680, in predict
self._anchors.get(), image_shape, true_image_shapes))
File "/...models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 771, in _predict_second_stage
rpn_features_to_crop, proposal_boxes_normalized))
File "...models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1498, in _compute_second_stage_input_feature_maps
(self._initial_crop_size, self._initial_crop_size))
File "/...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/ops/gen_image_ops.py", line 390, in crop_and_resize
extrapolation_value=extrapolation_value, name=name)
File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "...anaconda3/envs/tf_imagerecog_new/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2048,17,17,1088] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node CropAndResize (defined at ...models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py:1498) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[node control_dependency (defined at ...models/research/object_detection/model_lib.py:345) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
答案 0 :(得分:1)
我认为您应该将batch_size行更改为: batch_size:1