Question

I am trying to train an object detection model with gcloud ml-engine, following the official documentation https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_on_cloud.md with runtime-version = 1.4. I also modified setup.py as described in https://github.com/tensorflow/models/issues/2739 but I still get errors:

worker-replica-3 2018-01-09 06:32:39.416080: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX

worker-replica-3 grpc epoll fd: 3

{
insertId: "1fwigqcg5k37j2o"
jsonPayload: {
created: 1515479559.41658
levelname: "ERROR"
lineno: 1051
message: " grpc epoll fd: 3"
pathname: "ev_epoll1_linux.c"
thread: 917
}

The final error message is:

The replica master 0 ran out-of-memory and exited with a non-zero status of 247.

I used the following command to launch the training job on Cloud ML Engine:

gcloud ml-engine jobs submit training object_detection_training_`date +%s` \
--job-dir=gs://mybucket/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--region asia-east1 \
--config object_detection/samples/cloud/cloud.yml \
-- \
--train_dir=gs://mybucket/train \
--pipeline_config_path=gs://mybucket/data/ssd_mobilenet_v1_coco.config \
--runtime-version 1.4

Answer 1

Currently only runtime version 1.2 is supported. We are working on support for other versions.
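As a rough sketch of what that means for the command in the question: the version would be changed to 1.2, and `--runtime-version` would be moved ahead of the bare `--`, since arguments after `--` are handed to `object_detection.train` rather than to gcloud. The bucket and config paths below are copied from the question. The `build_submit_cmd` helper is a hypothetical name used only for this dry run; it prints the command instead of submitting a job.

```shell
#!/bin/sh
# Dry-run sketch: prints the submit command rather than executing it.
# Assumption: runtime 1.2 is requested, as this answer suggests.
# --runtime-version is placed before the bare "--" so gcloud parses it;
# everything after "--" is passed through to object_detection.train.
build_submit_cmd() {
  job_name="object_detection_training_$(date +%s)"
  printf '%s ' gcloud ml-engine jobs submit training "$job_name" \
    --runtime-version 1.2 \
    --job-dir=gs://mybucket/train \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region asia-east1 \
    --config object_detection/samples/cloud/cloud.yml \
    -- \
    --train_dir=gs://mybucket/train \
    --pipeline_config_path=gs://mybucket/data/ssd_mobilenet_v1_coco.config
  printf '\n'
}

build_submit_cmd
```

If the printed command looks right, it can be pasted into a shell to submit the job. This only addresses the runtime version; it does not by itself explain the out-of-memory exit.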

Answer 2

FYI, that log message is not an ERROR. It was downgraded to an INFO log in the grpc codebase last August.

Error when running object detection training on Google ML Engine - grpc epoll fd: 3
