Tensorflow object detection train.py fails when running on Cloud Machine Learning Engine

Date: 2017-10-17 22:11:48

Tags: machine-learning tensorflow google-cloud-platform object-detection

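I have a small example of the tensorflow object detection api that works locally. Everything looks great. My goal is to use their scripts to run it on Google Machine Learning Engine, which I have used extensively in the past. I am following these docs.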

Declare some relevant variables

declare PROJECT=$(gcloud config list project --format "value(core.project)")
declare BUCKET="gs://${PROJECT}-ml"
declare MODEL_NAME="DeepMeerkatDetection"
declare FOLDER="${BUCKET}/${MODEL_NAME}"
declare JOB_ID="${MODEL_NAME}_$(date +%Y%m%d_%H%M%S)"
declare TRAIN_DIR="${FOLDER}/${JOB_ID}"
declare EVAL_DIR="${BUCKET}/${MODEL_NAME}/${JOB_ID}_eval"
declare  PIPELINE_CONFIG_PATH="${FOLDER}/faster_rcnn_inception_resnet_v2_atrous_coco_cloud.config"
declare  PIPELINE_YAML="/Users/Ben/Documents/DeepMeerkat/training/Detection/cloud.yml"

My YAML looks like

trainingInput:
  runtimeVersion: "1.0"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard

The relevant paths are set in the config, e.g.

  fine_tune_checkpoint: "gs://api-project-773889352370-ml/DeepMeerkatDetection/checkpoint/faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017/model.ckpt"

I packaged object_detection and slim using setup.py.

Running

gcloud ml-engine jobs submit training "${JOB_ID}_train" \
    --job-dir=${TRAIN_DIR} \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config ${PIPELINE_YAML} \
    -- \
    --train_dir=${TRAIN_DIR} \
    --pipeline_config_path= ${PIPELINE_CONFIG_PATH}

produces a tensorflow (import?) error. It is a bit cryptic:

insertId:  "1inuq6gg27fxnkc"  
 logName:  "projects/api-project-773889352370/logs/ml.googleapis.com%2FDeepMeerkatDetection_20171017_141321_train"  
 receiveTimestamp:  "2017-10-17T21:38:34.435293164Z"  
 resource: {…}  
 severity:  "ERROR"  
 textPayload:  "The replica ps 0 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 198, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 145, in main
    model_config, train_config, input_config = get_configs_from_multiple_files()
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 127, in get_configs_from_multiple_files
    text_format.Merge(f.read(), train_config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 112, in read
    return pywrap_tensorflow.ReadFromStream(self._read_buf, length, status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
FailedPreconditionError: .

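I have seen this error in other questions related to Machine Learning Engine prediction, which suggests that it may (?) not be directly related to the object detection code. It feels more like the package was not built correctly, or a dependency is missing? I have updated gcloud to the latest version.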

Bens-MacBook-Pro:research ben$ gcloud --version
Google Cloud SDK 175.0.0
bq 2.0.27
core 2017.10.09
gcloud 
gsutil 4.27

It is hard to see how it relates to this question:

FailedPreconditionError when running TF Object Detection API with own model

Why would the code need to be initialized differently in the cloud?

Update #1.

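Strangely, eval.py works fine, so the problem cannot be the path to the config file or anything else shared by train.py and eval.py. eval.py patiently sits and waits for model checkpoints to be created.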

Another idea is that the checkpoint was somehow corrupted during upload. We can test this by bypassing the checkpoint and training from scratch.

In the .config, setting

  from_detection_checkpoint: false

produces the same precondition error, so it cannot be the model.

1 Answer:

Answer 0: (score: 0)

The root cause is a small typo:

--pipeline_config_path= ${PIPELINE_CONFIG_PATH}

There is an extra space after the equals sign. Try this:

gcloud ml-engine jobs submit training "${JOB_ID}_train" \
    --job-dir=${TRAIN_DIR} \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config ${PIPELINE_YAML} \
    -- \
    --train_dir=${TRAIN_DIR} \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH}
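To see why the stray space matters, here is a minimal sketch of how the shell splits the two spellings. The helpers `show_args` and `check_no_empty_flags` and the bucket path are hypothetical, made up for illustration; they are not part of gcloud or the object detection API. With the space, the flag arrives with an empty value and the path becomes a separate argument, so train.py presumably ends up trying to read an empty config path.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print each received argument on its own line, in brackets.
show_args() {
    local arg
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Hypothetical guard: fail if any argument is a --flag= with nothing after '='.
check_no_empty_flags() {
    local arg
    for arg in "$@"; do
        case "$arg" in
            --*=)
                echo "empty value for flag: $arg" >&2
                return 1
                ;;
        esac
    done
    return 0
}

# Placeholder value for illustration only (not the real bucket).
PIPELINE_CONFIG_PATH="gs://example-bucket/model/pipeline.config"

# With the stray space: two arguments, and the flag's value is empty.
show_args --pipeline_config_path= "${PIPELINE_CONFIG_PATH}"

# Without the space: a single argument that carries the path.
show_args --pipeline_config_path="${PIPELINE_CONFIG_PATH}"
```

A guard like `check_no_empty_flags` could be called with the same trailing arguments before `gcloud ml-engine jobs submit training`, so that a typo of this kind fails locally instead of after the job has been scheduled.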