Question

When I run local training with a Google Cloud bucket as both the data source and the destination:

gcloud ml-engine local train --module-name trainer.task_v2s --package-path trainer/

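Because my dataset has 400 examples and I use a batch size of 20, I get normal results and the checkpoint is saved correctly at step 20: 400/20 = 20 steps = 1 epoch. These files are saved in the model directory in the bucket: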

model.ckpt-0.data-00000-of-00001
model.ckpt-0.index
model.ckpt-0.meta
model.ckpt-20.data-00000-of-00001
model.ckpt-20.index
model.ckpt-20.meta
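
The step count above can be double-checked with a small sketch. The function name steps_per_epoch is made up for illustration, and rounding up assumes that a final partial batch still counts as one step.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of training steps needed to see every example once."""
    # Round up so that a trailing partial batch is counted as a step
    # (an assumption; an input pipeline that drops the remainder would round down).
    return math.ceil(num_examples / batch_size)


# Values from the question: 400 examples, batch size 20.
print(steps_per_epoch(400, 20))
```

With the numbers from the question this gives 20, which matches the model.ckpt-20 checkpoint written at the end of the local run.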

Also, my local GPU is properly engaged:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1018      G   /usr/lib/xorg/Xorg                           212MiB |
|    0      1889      G   compiz                                        69MiB |
|    0      5484      C   ...rtualenvs/my_project/bin/python          2577MiB |
+-----------------------------------------------------------------------------+

When I now try to use the gcloud compute unit:

gcloud ml-engine jobs submit training my_job_name \
--module-name trainer.task_v2s --package-path trainer/ \
--staging-bucket gs://my-bucket --region europe-west1 \
--scale-tier BASIC_GPU --runtime-version 1.8 --python-version 3.5

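Saving the checkpoints takes about the same time, but they are saved in increments of 1 step, even though the data source has not changed. The loss also decreases much more slowly, as it would if only one example were being trained. The files look like this: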

model.ckpt-0.data-00000-of-00001
model.ckpt-0.index
model.ckpt-0.meta
model.ckpt-1.data-00000-of-00001
model.ckpt-1.index
model.ckpt-1.meta

The GPU is also not engaged at all:

+-----------------------------------------------------------------------------+  
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |      
+-----------------------------------------------------------------------------+
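
To see from inside the job whether TensorFlow itself detects a GPU, rather than relying only on the process table, a check along these lines could be logged at startup. This is a minimal sketch: the helper names gpu_names and visible_gpus are made up for illustration, and it assumes the TF 1.x module tensorflow.python.client.device_lib is importable in the training environment.

```python
def gpu_names(devices):
    """Return the names of all devices whose type is GPU.

    `devices` is any iterable of objects exposing `.name` and `.device_type`,
    which is the shape of what device_lib.list_local_devices() returns.
    """
    return [d.name for d in devices if d.device_type == "GPU"]


def visible_gpus():
    """List the GPU devices that the installed TensorFlow build can see."""
    # Imported lazily so gpu_names() stays usable without TensorFlow installed.
    from tensorflow.python.client import device_lib

    return gpu_names(device_lib.list_local_devices())
```

An empty list from visible_gpus() on a BASIC_GPU job would suggest that a CPU-only TensorFlow build is the one being imported.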

I am using a custom estimator with no clusterspec configured, since I assumed that is only needed for distributed computing, and the run_config looks like this:

Using config: {'_master': '', '_num_ps_replicas': 0, '_session_config': None, '_task_id': 0, '_model_dir': 'gs://my_bucket/model_dir', '_save_checkpoints_steps': None, '_tf_random_seed': None, '_task_type': 'master', '_keep_checkpoint_max': 5, '_evaluation_master': '', '_device_fn': None, '_save_checkpoints_secs': 600, '_save_summary_steps': 100, '_cluster_spec': , '_log_step_count_steps': 100, '_is_chief': True, '_global_id_in_cluster': 0, '_num_worker_replicas': 1, '_service': None, '_keep_checkpoint_every_n_hours': 10000, '_train_distribute': None}

From the logs I can also see the TF_CONFIG environment variable:

{'environment': 'cloud', 'cluster': {'master': ['127.0.0.1:2222']}, 'job': {'python_version': '3.5', 'run_on_raw_vm': True, 'package_uris': ['gs://my-bucket/my-project10/27cb2041a4ae5a14c18d6e7f8622d9c20789e3294079ad58ab5211d8e09a2669/MyProject-0.9.tar.gz'], 'runtime_version': '1.8', 'python_module': 'trainer.task_v2s', 'scale_tier': 'BASIC_GPU', 'region': 'europe-west1'}, 'task': {'cloud': 'qc6f9ce45ab3ea3e9-ml', 'type': 'master', 'index': 0}}
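
For reference, the relevant parts of TF_CONFIG can be pulled out with a few lines of Python. This is only a sketch: the function name describe_tf_config is hypothetical, the example string is a shortened copy of the value logged above, and it assumes the 'job' entry is a nested JSON object as shown in that log.

```python
import json
import os


def describe_tf_config(raw=None):
    """Summarize the cluster and task described by a TF_CONFIG JSON string."""
    if raw is None:
        # Fall back to the environment variable set by the training service.
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return {
        "task_type": task.get("type"),
        "task_index": task.get("index"),
        # Number of addresses listed for each job name in the cluster.
        "replicas": {job: len(addresses) for job, addresses in cluster.items()},
        "scale_tier": config.get("job", {}).get("scale_tier"),
    }


# Shortened copy of the TF_CONFIG value from the job log.
example = json.dumps({
    "environment": "cloud",
    "cluster": {"master": ["127.0.0.1:2222"]},
    "job": {"scale_tier": "BASIC_GPU", "region": "europe-west1"},
    "task": {"type": "master", "index": 0},
})
print(describe_tf_config(example))
```

For the logged value this reports a single master replica on BASIC_GPU, which is consistent with a non-distributed job.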

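My guess is that I need to configure something that I have not configured yet, but I don't know what. I also get some warnings at the beginning, but I don't think they are related: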

google-cloud-vision 0.29.0 has requirement requests<3.0dev,>=2.18.4, but you'll have requests 2.13.0 which is incompatible.

Answer 1

I just found my mistake: I needed to put tensorflow-gpu instead of tensorflow in my setup.py. Even better, as rhaertel80 stated, is to omit tensorflow altogether.
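
A setup.py along these lines reflects that fix. It is only a sketch: the package name and version are taken from the MyProject-0.9.tar.gz archive in the question's TF_CONFIG, the entry in REQUIRED_PACKAGES is a hypothetical placeholder, and the point is simply that no tensorflow or tensorflow-gpu entry appears, presumably so that the TensorFlow build shipped with the chosen runtime version is the one that gets used.

```python
import sys

# Extra dependencies the trainer needs on top of the Cloud ML Engine runtime.
# The entry below is a hypothetical placeholder; list your own packages here.
# Deliberately no "tensorflow" / "tensorflow-gpu": the runtime version
# (--runtime-version 1.8 in the question) is assumed to supply TensorFlow.
REQUIRED_PACKAGES = ["google-cloud-vision"]


def main():
    # Imported here so the dependency list can be inspected without setuptools.
    from setuptools import find_packages, setup

    setup(
        name="MyProject",
        version="0.9",
        packages=find_packages(),
        install_requires=REQUIRED_PACKAGES,
        include_package_data=True,
    )


# Run setuptools only when a command such as "sdist" is supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```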
