CloudML job does not terminate when using TensorFlow 1.9

Date: 2018-08-14 23:36:46

Tags: google-cloud-platform google-cloud-ml

After training with TF 1.9 (which is officially supported), our CloudML training jobs do not terminate. The jobs just sit there indefinitely. Interestingly, CloudML jobs running on TF 1.8 have no such problem. Our model is created via tf.Estimator.

A typical log (when using TF <= 1.8) is:

I  Job completed successfully.
I  Finished tearing down training program. 
I  ps-replica-0 Clean up finished.  ps-replica-0
I  ps-replica-0 Module completed; cleaning up.  ps-replica-0
I  ps-replica-0 Signal 15 (SIGTERM) was caught. Terminated by service. 
This is normal behavior.  ps-replica-0
I  Tearing down training program. 
I  master-replica-0 Task completed successfully.  master-replica-0
I  master-replica-0 Clean up finished.  master-replica-0
I  master-replica-0 Module completed; cleaning up.  master-replica-0
I  master-replica-0 Loss for final step: 0.054428928.  master-replica-0
I  master-replica-0 SavedModel written to: XXX  master-replica-0

When using TF 1.9, we see the following instead:

I  master-replica-0 Skip the current checkpoint eval due to throttle secs (30 secs). master-replica-0 
I  master-replica-0 Saving checkpoints for 20034 into gs://bg-dataflow/yuri/nine_gag_recommender_train_test/trained_model/model.ckpt. master-replica-0 
I  master-replica-0 global_step/sec: 17.7668 master-replica-0 
I  master-replica-0 SavedModel written to: XXX master-replica-0 

Any ideas?

1 Answer:

Answer 0 (score: 4)

After checking the logs for the job ID you sent, it appears that only half of the workers completed their task while the other half got stuck, so the master kept waiting for them to stay alive, which is what caused your job to hang.

By default, when using tf.Estimator, the master waits for all workers to be alive. In large-scale distributed training with many workers, it is important to set device_filters so that the master only depends on the parameter servers (PS) being alive, and likewise each worker should only depend on the PS being alive.

The solution is to set device filters in tf.ConfigProto() and pass it to the session_config argument of tf.estimator.RunConfig(). You can find more details here: https://cloud.google.com/ml-engine/docs/tensorflow/distributed-training-details#set-device-filters
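
For example, a minimal sketch of what that could look like, assuming the conventional CloudML Engine job names ("master", "worker", "ps") and the TF_CONFIG environment variable that CloudML Engine sets for each replica; the estimator/model_fn usage at the end is hypothetical:

    import json
    import os
    import tensorflow as tf

    # Read this replica's role from TF_CONFIG (set by CloudML Engine).
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    task_type = task.get("type", "master")
    task_index = task.get("index", 0)

    # Only wait on the parameter servers and on this replica itself,
    # instead of waiting on every other worker in the cluster.
    if task_type == "master":
        device_filters = ["/job:ps", "/job:master"]
    else:  # worker replicas
        device_filters = ["/job:ps", "/job:%s/task:%d" % (task_type, task_index)]

    session_config = tf.ConfigProto(device_filters=device_filters)
    run_config = tf.estimator.RunConfig(session_config=session_config)

    # Hypothetical usage -- model_fn, train_spec and eval_spec are assumed to exist:
    # estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
    # tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

With these filters in place, the master no longer blocks on workers that hang after finishing their shards, so the job can tear down cleanly as it did on TF 1.8.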