Question

I am following the guide at the link below to reproduce the workflow with new data and a new model:

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md

Everything worked up to the last step, where I used the script below to launch the training job:

gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
--runtime-version 1.4 \
--job-dir=gs://marksbucket0000/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--region us-east1 \
--config /Users/markli/Desktop/chase_ad_object/project_2/cluster_config/cloud.yml \
-- \
--train_dir=gs://marksbucket0000/train \
--pipeline_config_path=gs://marksbucket0000/data/ssd_mobilenet_v1_coco.config

The job appears to have been submitted successfully:

Job [xxx_object_detection_xxxxxxx] submitted successfully.
Your job is still active. You may view the status of your job with the command

$ gcloud ml-engine jobs describe xxx_object_detection_xxxxxxx

or continue streaming the logs with the command

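However, the job stops because of the following error in the logs:

Since I am very new to Google Cloud ML Engine and the TensorFlow Object Detection API, I cannot find any clue in the logs that tells me which step I got wrong.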

The YAML cluster configuration file I am using is:

trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard

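If anyone could at least point me in the right direction for debugging, I would really appreciate it. Many thanks in advance!

---------------- Update on the question --------------

I actually got it working by changing setup.py as follows: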

"""Setup script for object_detection."""

from setuptools import find_packages
from setuptools import setup


# REQUIRED_PACKAGES = ['Pillow>=1.0', 'Matplotlib>=2.1', 'Cython>=0.28.1']
REQUIRED_PACKAGES = ['Tensorflow>=1.4.0','Pillow>=1.0','Matplotlib>=2.1','Cython>=0.28.1','Jupyter']

setup(
    name='object_detection',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    include_package_data=True,
    packages=[p for p in find_packages() if p.startswith('object_detection')],
    description='Tensorflow Object Detection Library',
)

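I did run into some "No module named ..." problems when running the training job, but there are plenty of online threads that lead quickly to the solutions, so I won't repeat them here.

However, I hit a problem when running the evaluation job, "unable to import pycocotools", and I found the solution here: https://github.com/tensorflow/models/issues/3470

Now both my training and evaluation jobs are up and running. Strangely, though, I cannot see any statistics for the evaluation job (e.g. the orange loss plot) on TensorBoard's Scalars tab, even though the eval job checkbox does appear as a view option:

I also checked the logs of the eval job and found that the node seems to keep skipping images. Could this be the cause of the problem? Might there be something wrong with the evaluation dataset?

Log messages from the eval job: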

Answer 1

The parallel interleave feature is only available in TensorFlow 1.5+. Try changing the line in your YAML to:

runtimeVersion: "1.8"
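
To make the version constraint concrete, here is a minimal Python sketch that checks whether a runtime version string is new enough for parallel interleave. The helper names (`parse_version`, `supports_parallel_interleave`) and the `MIN_PARALLEL_INTERLEAVE` constant are hypothetical; the 1.5 threshold is taken only from the statement above, not from any Cloud ML Engine API. Since the submit command in the question also passes `--runtime-version 1.4`, it is probably worth updating that flag as well so that it agrees with the YAML.

```python
# Minimum TensorFlow version for parallel interleave, per the answer above.
# This constant and both helpers are illustrative, not part of any library.
MIN_PARALLEL_INTERLEAVE = (1, 5)


def parse_version(version):
    """Turn a string such as "1.8" or "1.4.0" into a (major, minor) tuple."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def supports_parallel_interleave(runtime_version):
    """Return True if the runtime version is at least 1.5."""
    return parse_version(runtime_version) >= MIN_PARALLEL_INTERLEAVE


if __name__ == "__main__":
    # Compare the version from the question with the suggested one.
    for version in ("1.4", "1.8"):
        print(version, supports_parallel_interleave(version))
```

Comparing integer tuples rather than raw strings avoids the pitfall where "1.10" would sort before "1.5" as text.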
