I am following the TensorFlow Object Detection tutorial (Tensorflow Object Detection). I have downloaded the models from the GitHub link (Object Models). I am trying to detect flowers with custom data. The TFRecords and all the labels have been created as shown in the tutorial above. When I run train.py with
python train.py --logtostderr --train_dir=training\ --pipeline_config_path=training\ssd_mobilenet_v1_pets.config
I get the following error:
Instructions for updating:
Please switch to tf.train.create_global_step
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
2018-02-09 01:45:59.841297: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX
INFO:tensorflow:Restoring parameters from ssd_mobilenet_v1_coco_11_06_2017/model.ckpt
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path training\model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor with shape[24,1,3648,5472,3]
[[Node: batch = QueueDequeueManyV2[component_types=[DT_STRING, DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_INT32, DT_INT64, DT_INT32, DT_INT64, DT_INT32, DT_INT64, DT_INT32, DT_BOOL, DT_INT32, DT_FLOAT, DT_INT32, DT_STRING, DT_INT32, DT_STRING, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](batch/padding_fifo_queue, batch/n)]]
INFO:tensorflow:Caught OutOfRangeError. Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
INFO:tensorflow:Recording summary at step 0.
Traceback (most recent call last):
File "train.py", line 164, in <module>
tf.app.run()
File "C:\Users\ML\AppData\Local\Programs\Python\Python35\lib\site-packages \tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "train.py", line 160, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "C:\Users\Ashwin\Desktop\game making\OpenCV\Tensorflow Object Detection\models\research\object_detection\trainer.py", line 332, in train
saver=saver)
File "C:\Users\Ashwin\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 775, in train
sv.stop(threads, close_summary_writer=True)
File "C:\Users\Ashwin\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\training\supervisor.py", line 792, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "C:\Users\Ashwin\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\training\coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "C:\Users\Ashwin\AppData\Local\Programs\Python\Python35\lib\site-packages\six.py", line 693, in reraise
raise value
File "C:\Users\Ashwin\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\training\queue_runner_impl.py", line 238, in _run
enqueue_callable()
File "C:\Users\Ashwin\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1231, in _single_operation_run
target_list_as_strings, status, None)
File "C:\Users\Ashwin\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[24,1,3648,5472,3]
[[Node: batch = QueueDequeueManyV2[component_types=[DT_STRING, DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_INT32, DT_INT64, DT_INT32, DT_INT64, DT_INT32, DT_INT64, DT_INT32, DT_BOOL, DT_INT32, DT_FLOAT, DT_INT32, DT_STRING, DT_INT32, DT_STRING, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](batch/padding_fifo_queue, batch/n)]]
These lines seem to indicate the nature of the error:
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor with shape[24,1,3648,5472,3]
INFO:tensorflow:Caught **OutOfRangeError**. Stopping Training.
I looked at the GitHub page, where removing num_epochs seemed to solve the OutOfRange error, but I could not find any such parameter in the train.py downloaded from the repo. This is my first time using TensorFlow and I do not fully understand the mechanics.
My dataset sizes are: 124 images in the training labels and 93 images in the test labels.
Answer 0 (score: 2)
Your OOM problem (the acronym for Out Of Memory) looks like a known issue discussed in the Github issues of Tensorflow repository.
It does not always seem to be exactly the same problem for everyone. I will try to list the most popular solutions.
1. Reduce the batch_size in your .config file (see the config snippet right after this list). Keep in mind that training will then take longer.
2. Free up memory (RAM) on your machine. I don't know which OS you are on, but I would guess a Linux distribution. Here is a StackExchange question about it.
3. Make more checkpoints, i.e. save the trained model more often, so that less work is lost when it crashes and you have to resume.
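To illustrate the first suggestion, here is a minimal sketch of the train_config block in a pipeline config such as ssd_mobilenet_v1_pets.config. It assumes the standard Object Detection API config layout; the sample config ships with batch_size: 24 (which matches the shape[24,1,3648,5472,3] in your OOM message), and the other values shown are just the sample defaults or placeholders.
train_config {
  # The sample config uses batch_size: 24; with 3648x5472 source images this
  # easily exhausts RAM, so try a much smaller value first.
  batch_size: 4
  fine_tune_checkpoint: "ssd_mobilenet_v1_coco_11_06_2017/model.ckpt"
  from_detection_checkpoint: true
  num_steps: 200000
  # ...optimizer and data_augmentation_options left as in the sample config...
}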
Of course, none of the solutions here is optimal; the real fix is to have a more powerful GPU at your disposal.
If you have access to them, you could also consider using cloud resources (the most popular being AWS and Azure), but that can be expensive.
Answer 1 (score: 0)
With custom data you have to use a custom config file (in particular, you need to change the number of classes in it).
You will also find num_epochs in it, as well as the batch size for training and testing (reducing it will lower the risk of getting an OOM error); a hedged sketch of these fields is shown below. It looks like you are running on the CPU, so you need to make sure you have enough RAM to run your training. Otherwise you will have to use a smaller network, but SSD on MobileNets is already very small, so that won't be easy...
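For reference, a minimal, hedged sketch of where those fields typically live in a pipeline config like ssd_mobilenet_v1_pets.config (field names follow the Object Detection API's pipeline.proto; the paths and most values below are placeholders you would replace with your own):
model {
  ssd {
    num_classes: 5            # set this to the number of flower classes in your label map
    # ...feature extractor, anchors, box predictor as in the sample config...
  }
}
train_config {
  batch_size: 24              # reduce this to lower the risk of OOM
  # ...
}
train_input_reader {
  tf_record_input_reader {
    input_path: "data/flowers_train.record"
  }
  label_map_path: "data/flowers_label_map.pbtxt"
}
eval_config {
  num_examples: 93            # size of your test set
}
eval_input_reader {
  tf_record_input_reader {
    input_path: "data/flowers_test.record"
  }
  label_map_path: "data/flowers_label_map.pbtxt"
  shuffle: false
  num_readers: 1
  num_epochs: 1               # the num_epochs mentioned above lives in the input readers,
                              # not in train.py; removing it lets the reader loop indefinitely
}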