I created a classification job that I run on Google Cloud ML Engine with the MNIST dataset, submitted from the cloud console:
gcloud ml-engine jobs submit training $JOBNAME \
--region=$REGION \
--module-name=trainer.task \
--package-path=./trainer \
--job-dir=$OUTDIR \
--staging-bucket=gs://$BUCKET \
--scale-tier=STANDARD_1 \
--runtime-version 1.12 \
-- \
--bucket=${BUCKET} \
--output_dir=${OUTDIR} \
--train_steps=10000
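For completeness, the shell variables referenced above are set roughly like this (the values here are placeholders, not my real bucket or job name):
BUCKET=my-mnist-bucket                      # placeholder bucket name
REGION=us-central1                          # placeholder region
OUTDIR=gs://$BUCKET/mnist/trained_model     # job/output directory
JOBNAME=mnist_$(date -u +%y%m%d_%H%M%S)     # unique job name per submission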
My training job does not terminate after training finishes, even though the job logs show "Task completed successfully". If I do not stop the job myself, it keeps running indefinitely on ML Engine (see the log excerpt and the workaround sketch below).
I worker-replica-3 loss = 0.00013220838, step = 9742 (37.280 sec)
I worker-replica-0 global_step/sec: 9.33984
I worker-replica-0 global_step/sec: 9.9101
I worker-replica-3 Loss for final step: 0.0017736279.
I worker-replica-1 Loss for final step: 0.0010084368.
I worker-replica-3 Module completed; cleaning up.
I worker-replica-3 Clean up finished.
I worker-replica-3 Task completed successfully.
I worker-replica-2 Loss for final step: 0.0028514725.
I worker-replica-1 Module completed; cleaning up.
I worker-replica-1 Clean up finished.
I worker-replica-1 Task completed successfully.
I worker-replica-0 Loss for final step: 0.0015272798.
I worker-replica-2 Module completed; cleaning up.
I worker-replica-2 Clean up finished.
I worker-replica-2 Task completed successfully.
I worker-replica-0 Module completed; cleaning up.
I worker-replica-0 Clean up finished.
I worker-replica-0 Task completed successfully.
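As a manual workaround (it only stops the job, it does not explain why it hangs), I can check the job state and cancel it by hand once the "Task completed successfully" lines appear:
# check whether ML Engine still reports the job as RUNNING
gcloud ml-engine jobs describe $JOBNAME --format='value(state)'
# cancel it manually so it stops consuming cluster resources
gcloud ml-engine jobs cancel $JOBNAME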
If I change the machine type from --scale-tier=STANDARD_1 to BASIC_GPU, like this:
gcloud ml-engine jobs submit training $JOBNAME \
--region=$REGION \
--module-name=trainer.task \
--package-path=./trainer \
--job-dir=$OUTDIR \
--staging-bucket=gs://$BUCKET \
--scale-tier=BASIC_GPU \
--runtime-version 1.12 \
-- \
--bucket=${BUCKET} \
--output_dir=${OUTDIR} \
--train_steps=10000
then it looks like the same problem as in this report: why-does-google-cloude-ml-training-job-give-zero-utilization-stats-for-hour.
I have also tried different TensorFlow versions and still run into the same issue.
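For reference, this is how I watch the stuck job on both tiers (JOBNAME is the same variable used in the submit commands):
# follow the job logs in real time; the final "Task completed successfully"
# lines appear, but the job itself stays in the RUNNING state
gcloud ml-engine jobs stream-logs $JOBNAME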