Question

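I am trying to train my own detector model based on the Tensorflow samples and this post. I got training to work locally on a Macbook Pro. The problem is that I have no GPU, and running on the CPU is too slow (about 25 seconds per iteration).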

So I tried to run the job on Google Cloud ML Engine following the tutorial, but I cannot get it to run properly.

My folder structure is as follows:

+ data
 - train.record
 - test.record
+ models
 + train
 + eval
+ training
 - ssd_mobilenet_v1_coco

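The steps I took to move from local training to Google Cloud training were:

Create a bucket in Google Cloud Storage and replicate my local folder structure, files included;
Edit my pipeline.config file and change all paths from Users/dev/detector/ to gs://bucketname/;
Create a YAML file with the default configuration provided in the tutorial;
Run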

gcloud ml-engine jobs submit training object_detection_`date +%s` \
    --job-dir=gs://bucketname/models/train \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-east1 \
    --config /Users/dev/detector/training/cloud.yml \
    -- \
    --train_dir=gs://bucketname/models/train \
    --pipeline_config_path=gs://bucketname/data/pipeline.config
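The path change in the pipeline.config step above can be scripted instead of edited by hand. This is a minimal sketch, assuming the config stores its paths as plain strings that share one local prefix; `rewrite_paths` and the sample `input_path` line are hypothetical illustrations, not part of the Object Detection API.

```python
def rewrite_paths(config_text, local_prefix, bucket_prefix):
    """Replace every occurrence of a local path prefix with a bucket prefix."""
    # Normalise trailing slashes so the join never produces 'a//b'.
    local_prefix = local_prefix.rstrip('/') + '/'
    bucket_prefix = bucket_prefix.rstrip('/') + '/'
    return config_text.replace(local_prefix, bucket_prefix)


# Hypothetical excerpt of a pipeline.config, for illustration only.
sample = 'input_path: "/Users/dev/detector/data/train.record"'
converted = rewrite_paths(sample, '/Users/dev/detector', 'gs://bucketname')
print(converted)
```

A plain string replacement is enough here because the bucket mirrors the local folder layout, so only the prefix differs.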

Doing so gives me the following error message from MLUnits:

The replica ps 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 49, in <module>
    from object_detection import trainer
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 27, in <module>
    from object_detection.builders import preprocessor_builder
  File "/root/.local/lib/python2.7/site-packages/object_detection/builders/preprocessor_builder.py", line 21, in <module>
    from object_detection.protos import preprocessor_pb2
  File "/root/.local/lib/python2.7/site-packages/object_detection/protos/preprocessor_pb2.py", line 71, in <module>
    options=None, file=DESCRIPTOR),
TypeError: __new__() got an unexpected keyword argument 'file'

Thanks in advance.

Answer 1

Check the solution posted by andersskog here. It worked for me. I made a patch here. To apply the fix manually, follow these instructions:

Make sure the runtime version in your YAML file is 1.4, for example:

trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard

Change setup.py to the following:

"""Setup script for object_detection."""

import logging
import subprocess
from setuptools import find_packages
from setuptools import setup
from setuptools.command.install import install

class CustomCommands(install):
    """Install step that first installs python-tk through apt-get."""

    def RunCustomCommand(self, command_list):
        # Run the command, merging stderr into stdout so the log shows both.
        p = subprocess.Popen(
            command_list,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT)
        stdout_data, _ = p.communicate()
        logging.info('Log command output: %s', stdout_data)
        if p.returncode != 0:
            raise RuntimeError('Command %s failed: exit code: %s' %
                               (command_list, p.returncode))

    def run(self):
        # Install the system package first, then run the regular install.
        self.RunCustomCommand(['apt-get', 'update'])
        self.RunCustomCommand(
            ['apt-get', 'install', '-y', 'python-tk'])
        install.run(self)

REQUIRED_PACKAGES = ['Pillow>=1.0', 'protobuf>=3.3.0', 'Matplotlib>=2.1']

setup(
    name='object_detection',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    include_package_data=True,
    packages=[p for p in find_packages() if p.startswith('object_detection')],
    description='Tensorflow Object Detection Library',
    cmdclass={
        'install': CustomCommands,
    }
)

In object_detection/utils/visualization_utils.py, at line 24 (before the line that imports matplotlib.pyplot as plt), add:

import matplotlib
matplotlib.use('agg')

In object_detection/evaluator.py, at line 184, change

tf.train.get_or_create_global_step()

to

tf.contrib.framework.get_or_create_global_step()

Finally, in object_detection/builders/optimizer_builder.py, at line 103, change

tf.train.get_or_create_global_step()

to

tf.contrib.framework.get_or_create_global_step()
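Instead of hard-coding one location, the two edits above could share a small lookup that works whichever symbol the runtime provides. This is only a sketch: `resolve_global_step_fn` is a hypothetical helper, not part of the Object Detection API, and the example uses a stand-in object rather than a real TensorFlow import so that it runs anywhere (Python 3 is assumed for `SimpleNamespace`).

```python
from types import SimpleNamespace


def resolve_global_step_fn(tf_module):
    """Return whichever get_or_create_global_step the given module exposes."""
    train_ns = getattr(tf_module, 'train', None)
    train_fn = getattr(train_ns, 'get_or_create_global_step', None)
    if train_fn is not None:
        return train_fn
    # Fall back to the contrib location used in the manual fix.
    return tf_module.contrib.framework.get_or_create_global_step


# Stand-in imitating a TensorFlow build that only has the contrib symbol.
fake_old_tf = SimpleNamespace(
    train=SimpleNamespace(),
    contrib=SimpleNamespace(
        framework=SimpleNamespace(
            get_or_create_global_step=lambda: 'contrib')))

print(resolve_global_step_fn(fake_old_tf)())
```

With a real installation the call would be `resolve_global_step_fn(tf)()` after `import tensorflow as tf`.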

Hope this helps!

Answer 2

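The problem is the protobuf version. You have probably installed the latest protoc via brew; since version 3.5.0, protobuf has added a file field, so code generated by the newer protoc passes file=DESCRIPTOR, which an older protobuf runtime rejects: https://github.com/google/protobuf/blob/9f80df026933901883da1d556b38292e14836612/CHANGES.txt#L74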

So, in the setup.py change above, set the protobuf version in REQUIRED_PACKAGES to 'protobuf>=3.5.1'.
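To see whether a given environment already satisfies that pin, the version strings can be compared numerically. This is a minimal sketch; `version_tuple` and `meets_minimum` are hypothetical helpers, and the parser only reads the leading dotted digits, ignoring any suffix.

```python
import re


def version_tuple(version):
    """Turn '3.5.1' into (3, 5, 1), ignoring any trailing suffix."""
    match = re.match(r'(\d+(?:\.\d+)*)', version)
    if match is None:
        raise ValueError('unrecognised version string: %r' % version)
    return tuple(int(part) for part in match.group(1).split('.'))


def meets_minimum(installed, minimum='3.5.1'):
    """Compare versions as integer tuples, not as strings."""
    return version_tuple(installed) >= version_tuple(minimum)


print(meets_minimum('3.3.0'))   # False
print(meets_minimum('3.5.1'))   # True
print(meets_minimum('3.10.0'))  # True
```

Comparing tuples rather than raw strings matters for the last case, since '3.10.0' sorts before '3.5.1' as text. On a machine with protobuf installed, the installed string could be read from `google.protobuf.__version__`.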

Unable to train my Tensorflow detector model in Google Cloud
