Cloud ML Engine job does not terminate and stays stuck at "Task completed successfully"

Time: 2019-05-04 09:20:30

Tags: python google-cloud-platform google-cloud-ml

I created a classification job that runs on Google Cloud ML Engine with the MNIST dataset, and I submit it from the cloud console:

gcloud ml-engine jobs submit training $JOBNAME \
    --region=$REGION \
    --module-name=trainer.task \
    --package-path=./trainer \
    --job-dir=$OUTDIR \
    --staging-bucket=gs://$BUCKET \
    --scale-tier=STANDARD_1 \
    --runtime-version 1.12 \
    -- \
    --bucket=${BUCKET} \
    --output_dir=${OUTDIR} \
    --train_steps=10000

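After training ends, my training job does not terminate, even though the job log shows "Task completed successfully". If we do not terminate the job ourselves, it keeps running on ML Engine indefinitely.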

I  worker-replica-3 loss = 0.00013220838, step = 9742 (37.280 sec)
I  worker-replica-0 global_step/sec: 9.33984
I  worker-replica-0 global_step/sec: 9.9101
I  worker-replica-3 Loss for final step: 0.0017736279.
I  worker-replica-1 Loss for final step: 0.0010084368. 
I  worker-replica-3 Module completed; cleaning up.
I  worker-replica-3 Clean up finished.
I  worker-replica-3 Task completed successfully.
I  worker-replica-2 Loss for final step: 0.0028514725.
I  worker-replica-1 Module completed; cleaning up.
I  worker-replica-1 Clean up finished.
I  worker-replica-1 Task completed successfully.
I  worker-replica-0 Loss for final step: 0.0015272798.
I  worker-replica-2 Module completed; cleaning up.
I  worker-replica-2 Clean up finished.
I  worker-replica-2 Task completed successfully.
I  worker-replica-0 Module completed; cleaning up.
I  worker-replica-0 Clean up finished.
I  worker-replica-0 Task completed successfully.
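To make an excerpt like this easier to read, here is a minimal Python sketch that reports which replicas logged the completion message. It assumes every line has the shape `<severity>  <type>-replica-<n> <message>`, as in the log above. The function name `summarize_replicas` is hypothetical, not part of any Google tool. Note that the excerpt contains only worker-replica lines, so it cannot show whether any other replica types in the job also finished.

```python
import re

# Assumed line shape, taken from the excerpt above:
#   "I  worker-replica-3 Task completed successfully."
LINE_RE = re.compile(r"^\S+\s+(?P<replica>[a-z]+-replica-\d+)\s+(?P<msg>.*\S)\s*$")
DONE_MSG = "Task completed successfully."


def summarize_replicas(log_text):
    """Map each replica seen in the log to True if it logged DONE_MSG."""
    finished = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not follow the assumed shape
        replica = match.group("replica")
        finished.setdefault(replica, False)
        if match.group("msg") == DONE_MSG:
            finished[replica] = True
    return finished


# A few lines copied from the log above.
SAMPLE_LOG = """\
I  worker-replica-3 Task completed successfully.
I  worker-replica-2 Loss for final step: 0.0028514725.
I  worker-replica-1 Task completed successfully.
"""

if __name__ == "__main__":
    for replica, done in sorted(summarize_replicas(SAMPLE_LOG).items()):
        print(replica, "finished" if done else "no completion message seen")
```

Running this over the full job log, not just the excerpt, would show whether any replica never reports completion. That might help narrow down what keeps the job alive.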

If I change the scale tier from STANDARD_1 to BASIC_GPU, like this:

gcloud ml-engine jobs submit training $JOBNAME \
    --region=$REGION \
    --module-name=trainer.task \
    --package-path=./trainer \
    --job-dir=$OUTDIR \
    --staging-bucket=gs://$BUCKET \
    --scale-tier=BASIC_GPU \
    --runtime-version 1.12 \
    -- \
    --bucket=${BUCKET} \
    --output_dir=${OUTDIR} \
    --train_steps=10000

then I run into the same problem as in this report: why-does-google-cloude-ml-training-job-give-zero-utilization-stats-for-hour

I also tried different versions of TensorFlow and still hit the same problem.
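Until the cause is found, the only workaround mentioned above is terminating the job by hand. A minimal sketch of scripting that from Python follows. It assumes the `gcloud` CLI is installed and authenticated, and it uses the `gcloud ml-engine jobs cancel` command. The helper names and the job name `mnist_job_example` are hypothetical. This only stops the stuck job and does not explain why it hangs.

```python
import subprocess


def build_cancel_command(job_name):
    """Build the gcloud invocation that cancels a running ML Engine job."""
    if not job_name:
        raise ValueError("job_name must be a non-empty string")
    return ["gcloud", "ml-engine", "jobs", "cancel", job_name]


def cancel_job(job_name, dry_run=True):
    """Print the cancel command (dry run) or execute it via gcloud."""
    cmd = build_cancel_command(job_name)
    if dry_run:
        # Only show what would be executed; nothing is cancelled.
        print(" ".join(cmd))
        return cmd
    # Raises CalledProcessError if gcloud exits with a non-zero status.
    subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical job name; substitute the value used for $JOBNAME.
    cancel_job("mnist_job_example", dry_run=True)
```

The dry-run default is deliberate, so that the script cannot cancel a healthy job by accident.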

0 answers:

No answers