我正在跟踪谷歌的Object Detection API重新训练我自己的数据集,但遇到了一系列问题。
其中一项如下:
"Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 198, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 194, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 290, in train
saver=saver)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 776, in train
master, start_standard_services=False, config=session_config) as sess:
File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 960, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 788, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 949, in managed_session
start_standard_services=start_standard_services)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 713, in prepare_or_wait_for_session
max_wait_secs=max_wait_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 387, in wait_for_session
is_ready, not_ready_msg = self._model_ready(sess)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 435, in _model_ready
return _ready(self._ready_op, sess, "Model not ready")
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 492, in _ready
ready_value = sess.run(op)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
UnavailableError: {"created":"@1502405189.800982817","description":"EOF","file":"external/grpc/src/core/lib/iomgr/tcp_posix.c","file_line":235,"grpc_status":14}
"
pathname: "/var/sitecustomize/sitecustomize.py"
}
我不太确定grpc是什么 - 所以我对这个错误感到很沮丧。 任何可以帮助它的人都会很棒!! 谢谢!
答案 0 :(得分:1)
这可能是内存不足错误(请参阅this question)。
您可以尝试使用更大的机器类型,特别是对于主机,例如large_model
,complex_model_l
或complex_model_l_gpu
。您可以通过将文件传递到--config
参数gcloud
来执行此操作,其内容类似于以下内容:
trainingInput:
runtimeVersion: "1.0"
scaleTier: CUSTOM
masterType: complex_model_l_gpu
workerCount: 9
workerType: standard_gpu
parameterServerCount: 3
parameterServerType: standard