Error when running the Object Detection API on Google Cloud ML

Date: 2018-05-23 22:29:44

Tags: tensorflow google-cloud-platform google-cloud-ml

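I am having trouble running a job on Google Cloud ML to retrain the Object Detection API's SSD MobileNet model on my own training data. Note that training succeeds on my local machine. Details follow. I have tried different TensorFlow versions for gcloud (and in the corresponding cloud.yaml file), but every one of them fails. Locally I run TensorFlow 1.8 with the Object Detection API (+ slim).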

Note: I am trying to retrain the SSD MobileNet model, which I copied to my Google Cloud Storage bucket and which originally lived at object_detection\ssd_mobilenet_v1_coco_2017_11_17\model.ckpt.

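TensorFlow version (use command below): I tried many versions, including 1.8 (whether or not Google ML supports 1.8, it is the version I used locally to create the TFRecord training files).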

Note: I am trying to run on Google ML the same training example that works locally. I submit the job request with the gcloud tool, following the instructions at https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_on_cloud.md. The command, executed from tensorflow/models/research:

    gcloud ml-engine jobs submit training grewe_object_detection_6 \
        --runtime-version 1.8 \
        --job-dir=gs://BLAHBLAH-storage/Train \
        --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
        --module-name object_detection.train \
        --region us-central1 \
        --config object_detection/samples/cloud/cloud.yml \
        -- --

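Describe the problem: see the errors below. I tried changing the TensorFlow version. Training runs successfully locally with 1.8, and since that is the version used to package the TFRecord files, I believe it should be the one that works on Google ML. I updated the provided cloud.yaml (trying versions 1.2, 1.4, 1.6 and 1.8) and also tried updating setup.py in models/research, but nothing worked.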

I tried the following for my cloud.yaml file:

    trainingInput:
      runtimeVersion: "1.8"
      scaleTier: CUSTOM
      masterType: standard_gpu
      workerCount: 5
      workerType: standard_gpu
      parameterServerCount: 3
      parameterServerType: standard
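Because the same cluster shape is being tried against several runtime versions, it can help to generate each variant from one template so that only `runtimeVersion` changes. The following is only a sketch, not something from the original post: the helper names `make_training_config` and `render_config` are hypothetical, and it emits JSON on the assumption that the `--config` flag of gcloud accepts JSON as well as YAML.

```python
import json


def make_training_config(runtime_version):
    """Return the trainingInput settings from the question for one runtime version."""
    return {
        "trainingInput": {
            "runtimeVersion": runtime_version,
            "scaleTier": "CUSTOM",
            "masterType": "standard_gpu",
            "workerCount": 5,
            "workerType": "standard_gpu",
            "parameterServerCount": 3,
            "parameterServerType": "standard",
        }
    }


def render_config(runtime_version):
    """Serialize the config as indented JSON text (assumed acceptable to --config)."""
    return json.dumps(make_training_config(runtime_version), indent=2)


# The versions the question reports having tried.
VERSIONS_TRIED = ["1.2", "1.4", "1.6", "1.8"]
```

Calling `render_config(v)` for each entry in `VERSIONS_TRIED` and saving the text to a file gives one config per attempt, which makes it easier to tell which version a failed job actually used.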

I tried the following for setup.py:

    """Setup script for object_detection."""

    from setuptools import find_packages
    from setuptools import setup

    REQUIRED_PACKAGES = ['Pillow>=1.0', 'Matplotlib>=2.1', 'Cython>=0.28.1']

    setup(
        name='object_detection',
        version='0.1',
        install_requires=REQUIRED_PACKAGES,
        include_package_data=True,
        packages=[p for p in find_packages() if p.startswith('object_detection')],
        description='Tensorflow Object Detection Library',
    )

Here is the error log from the Google Cloud ML console. Error message:

    The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
    Traceback (most recent call last):
      [...]
      File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
        return self.gen.next()
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session
        self.stop(close_summary_writer=close_summary_writer)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
        ignore_live_threads=ignore_live_threads)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
        six.reraise(*self._exc_info_to_raise)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session
        start_standard_services=start_standard_services)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 726, in prepare_or_wait_for_session
        init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
        config=config)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 207, in _restore_checkpoint
        saver.restore(sess, ckpt.model_checkpoint_path)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1802, in restore
        {self.saver_def.filename_tensor_name: save_path})
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
        run_metadata_ptr)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
        feed_dict_tensor, options, run_metadata)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
        run_metadata)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
        raise type(e)(node_def, op, message)
    UnavailableError: OS Error

    The replica worker 0 exited with a non-zero status of 1. Termination reason: Error.
    Traceback (most recent call last):
      [...]
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
        ignore_live_threads=ignore_live_threads)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
        six.reraise(*self._exc_info_to_raise)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session
        start_standard_services=start_standard_services)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 734, in prepare_or_wait_for_session
        max_wait_secs=max_wait_secs)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 406, in wait_for_session
        sess)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 490, in _try_run_local_init_op
        sess.run(self._local_init_op)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
        run_metadata_ptr)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
        feed_dict_tensor, options, run_metadata)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
        run_metadata)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
        raise type(e)(node_def, op, message)
    UnavailableError: OS Error
      [[Node: init_ops/init_all_tables_S2 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:GPU:0", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=6383848822399600260, tensor_name="edge_29_init_ops/init_all_tables", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:GPU:0"]()]]

    The replica worker 1 exited with a non-zero status of 1. Termination reason: Error.
    Traceback (most recent call last):
      [...]
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 747, in train
        master, start_standard_services=False, config=session_config) as sess:
      File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
        return self.gen.next()
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session
        self.stop(close_summary_writer=close_summary_writer)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
        ignore_live_threads=ignore_live_threads)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
        six.reraise(*self._exc_info_to_raise)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session
        start_standard_services=start_standard_services)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 734, in prepare_or_wait_for_session
        max_wait_secs=max_wait_secs)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 406, in wait_for_session
        sess)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 490, in _try_run_local_init_op
        sess.run(self._local_init_op)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
        run_metadata_ptr)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
        feed_dict_tensor, options, run_metadata)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
        run_metadata)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
        raise type(e)(node_def, op, message)
    UnavailableError: OS Error

    The replica worker 2 exited with a non-zero status of 1. Termination reason: Error.
    Traceback (most recent call last):
      [...]
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 747, in train
        master, start_standard_services=False, config=session_config) as sess:
      File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
        return self.gen.next()
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session
        self.stop(close_summary_writer=close_summary_writer)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
        ignore_live_threads=ignore_live_threads)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
        six.reraise(*self._exc_info_to_raise)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session
        start_standard_services=start_standard_services)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 734, in prepare_or_wait_for_session
        max_wait_secs=max_wait_secs)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 406, in wait_for_session
        sess)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 490, in _try_run_local_init_op
        sess.run(self._local_init_op)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
        run_metadata_ptr)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
        feed_dict_tensor, options, run_metadata)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
        run_metadata)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
        raise type(e)(node_def, op, message)
    UnavailableError: OS Error

    The replica worker 4 exited with a non-zero status of 1. Termination reason: Error.
    Traceback (most recent call last):
      [...]
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 747, in train
        master, start_standard_services=False, config=session_config) as sess:
      File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
        return self.gen.next()
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session
        self.stop(close_summary_writer=close_summary_writer)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
        ignore_live_threads=ignore_live_threads)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
        six.reraise(*self._exc_info_to_raise)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session
        start_standard_services=start_standard_services)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 734, in prepare_or_wait_for_session
        max_wait_secs=max_wait_secs)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 406, in wait_for_session
        sess)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 490, in _try_run_local_init_op
        sess.run(self._local_init_op)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
        run_metadata_ptr)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
        feed_dict_tensor, options, run_metadata)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
        run_metadata)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
        raise type(e)(node_def, op, message)
    UnavailableError: OS Error

    To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=36123659232&resource=ml_job%2Fjob_id%2Fgrewe_object_detection_8&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22grewe_object_detection_8%22
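A message this long is easier to triage once it is reduced to which replicas exited and with what status. The sketch below is an illustrative helper, not part of the original post: the function names are hypothetical, and the regular expression assumes only the "The replica <role> <index> exited with a non-zero status of <n>" wording that appears in the log above.

```python
import re

# Matches the per-replica summary sentence seen in the Cloud ML error message.
# \s+ tolerates the arbitrary line breaks that pasted logs often contain.
REPLICA_EXIT = re.compile(
    r"The\s+replica\s+(\w+)\s+(\d+)\s+exited\s+with\s+a\s+non-zero\s+status\s+of\s+(\d+)"
)


def failed_replicas(log_text):
    """Return a list of (role, index, exit_status) tuples in order of appearance."""
    return [
        (role, int(index), int(status))
        for role, index, status in REPLICA_EXIT.findall(log_text)
    ]


def unreported_workers(log_text, worker_count):
    """Return worker indices below worker_count that the message does not mention."""
    reported = {idx for role, idx, _ in failed_replicas(log_text) if role == "worker"}
    return [i for i in range(worker_count) if i not in reported]
```

Applied to the message above with `worker_count=5` (the value in cloud.yaml), this kind of summary shows the master and workers 0, 1, 2 and 4 exiting with status 1, while worker 3 is simply not mentioned in the pasted text.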

2 answers:

Answer 0 (score: 0)

The problem can be resolved by passing 1.2 to the --runtime-version flag (as @iwz1992 mentioned) and by including TensorFlow and Jupyter in setup.py.
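As a rough illustration of the setup.py half of this answer, the dependency list from the question could be extended as follows. This is a sketch under assumptions: the answer gives no version pins or exact package spellings, so the unpinned `tensorflow` and `jupyter` entries are guesses, and `BASE_PACKAGES` and `EXTRA_PACKAGES` are hypothetical names introduced here.

```python
# Dependency list exactly as it appears in the question's setup.py.
BASE_PACKAGES = ['Pillow>=1.0', 'Matplotlib>=2.1', 'Cython>=0.28.1']

# Assumption: unpinned PyPI names; the answer does not say which versions to use.
EXTRA_PACKAGES = ['tensorflow', 'jupyter']

# This combined list is what would be passed as install_requires in setup().
REQUIRED_PACKAGES = BASE_PACKAGES + EXTRA_PACKAGES
```

The rest of setup.py would stay as it is, since it already passes `install_requires=REQUIRED_PACKAGES`. The tarball given to `--packages` would need to be rebuilt for the change to reach the job.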

Answer 1 (score: -1)

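This is a known issue. The Object Detection team is working on a new binary that fixes it. In the meantime, you can use runtime version 1.2 on Cloud ML Engine, and it should work.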
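For reference, the submit command from the question with the runtime version switched to 1.2 can be assembled as below. It is only a sketch: `build_submit_command` is a hypothetical helper, every flag and path is copied from the question's command, and the trainer arguments after `--` are left to the caller because they are truncated in the original post.

```python
def build_submit_command(job_name, job_dir, runtime_version="1.2", user_args=()):
    """Assemble the gcloud argument list from the question for a given runtime version."""
    argv = [
        "gcloud", "ml-engine", "jobs", "submit", "training", job_name,
        "--runtime-version", runtime_version,
        "--job-dir={}".format(job_dir),
        "--packages", "dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz",
        "--module-name", "object_detection.train",
        "--region", "us-central1",
        "--config", "object_detection/samples/cloud/cloud.yml",
        "--",  # everything after this separator is passed to the trainer module
    ]
    # Trainer arguments are not shown in the question, so none are assumed here.
    argv.extend(user_args)
    return argv
```

The returned list can be handed to `subprocess.check_call` from the tensorflow/models/research directory. It is probably simplest to set the same `runtimeVersion` in the config file so that the two values do not disagree.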