Error training a TensorFlow object detection model on Google Cloud ml-engine

Time: 2018-12-24 13:44:14

Tags: python tensorflow google-cloud-ml

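I am trying to train an object detection model on a custom dataset, but I run into an error when I queue the training job on ml-engine. The job fails after a few minutes. This is the command I use to launch the job: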

gcloud ml-engine jobs submit training mayemene_malaria_detector_29122018 \
    --job-dir=${YOUR_GCS_BUCKET}/train \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,dist/absl-py-0.6.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
    --module-name object_detection.model_main \
    --region europe-west1 \
    --config object_detection/samples/cloud/cloud.yml \
    --runtime-version=1.12 \
    -- \
    --model_dir=${YOUR_GCS_BUCKET}/train \
    --pipeline_config_path=${YOUR_GCS_BUCKET}/data/ssd_mobilenet_v1_coco.config

Here is a snapshot of the ml-engine log:

174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/model_main.py", line 109, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python2.7/site-packages/object_detection/model_main.py", line 105, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 637, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 647, in run_worker
    return self._start_distributed_training()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 788, in _start_distributed_training
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1237, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1195, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/root/.local/lib/python2.7/site-packages/object_detection/model_lib.py", line 307, in model_fn
    include_global_step=False))
  File "/root/.local/lib/python2.7/site-packages/object_detection/utils/variables_helper.py", line 126, in get_variables_available_in_checkpoint
    ckpt_reader = tf.train.NewCheckpointReader(checkpoint_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 326, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern), status)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://uganda-plasmodium-training-ml/model.ckpt

This is most likely caused by the following line in ssd_mobilenet_v1_coco.config:

fine_tune_checkpoint: "gs://uganda-plasmodium-training-ml/model.ckpt"

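Changing this line to match the bucket where the pre-trained MobileNet checkpoint is stored makes a difference. Could someone please tell me why this error occurs?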
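One thing worth checking for this kind of NotFoundError: a TensorFlow checkpoint prefix such as model.ckpt is not a single file. It stands for a model.ckpt.index file plus one or more model.ckpt.data-XXXXX-of-YYYYY shards, so the path in fine_tune_checkpoint has to be the exact prefix under which those files sit. The following is a minimal sketch, not a definitive diagnosis. The function names are made up for illustration, and the bucket listing (a data/ subfolder holding the checkpoint) is a hypothetical layout, not something taken from the log above.

```python
import re


def fine_tune_checkpoint(config_text):
    """Return the fine_tune_checkpoint value found in pipeline config text, or None."""
    match = re.search(
        r'^\s*fine_tune_checkpoint\s*:\s*"([^"]+)"', config_text, re.MULTILINE
    )
    return match.group(1) if match else None


def checkpoint_files(prefix, object_names):
    """Return the object names that make up the checkpoint identified by prefix."""
    # "<prefix>.index" or "<prefix>.data-00000-of-00001" style shard names.
    pattern = re.compile(re.escape(prefix) + r"\.(index|data-\d{5}-of-\d{5})")
    return sorted(name for name in object_names if pattern.fullmatch(name))


def checkpoint_is_complete(prefix, object_names):
    """True if both an index file and at least one data shard exist for prefix."""
    files = checkpoint_files(prefix, object_names)
    has_index = any(name.endswith(".index") for name in files)
    has_data = any(not name.endswith(".index") for name in files)
    return has_index and has_data


# The config line quoted in the question.
config_line = 'fine_tune_checkpoint: "gs://uganda-plasmodium-training-ml/model.ckpt"'
prefix = fine_tune_checkpoint(config_line)

# HYPOTHETICAL bucket listing, assumed only for illustration.
hypothetical_listing = [
    "gs://uganda-plasmodium-training-ml/data/model.ckpt.index",
    "gs://uganda-plasmodium-training-ml/data/model.ckpt.data-00000-of-00001",
    "gs://uganda-plasmodium-training-ml/data/model.ckpt.meta",
    "gs://uganda-plasmodium-training-ml/data/ssd_mobilenet_v1_coco.config",
]

print(prefix, checkpoint_is_complete(prefix, hypothetical_listing))
```

With a listing like the assumed one, the prefix from the config does not resolve to any checkpoint files, whereas a prefix that includes the subfolder would. A real listing could be obtained with gsutil ls on the bucket and passed in as object_names.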

0 answers:

No answers