Unable to scale up a wide and deep model for training on Google Cloud ML

Date: 2017-10-23 13:20:04

Tags: tensorflow google-cloud-ml google-cloud-ml-engine

I am trying to build a wide and deep TensorFlow model and train it on Google Cloud.

I have been able to do this and train a smaller development version of it.

However, now that I am trying to scale up to more data and more training steps, my online training jobs keep failing.

The job runs for about 5 minutes and then I get the following error:

The replica worker 2 exited with a non-zero status of 1. Termination reason: Error. 
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=642488228368&resource=ml_job%2Fjob_id%2Fclickmodel_train_20171023_123542&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22clickmodel_train_20171023_123542%22

When I look at the logs, I can see this error, which seems to be the problem:

Command '['gsutil', '-q', 'cp', u'gs://pmc-ml/clickmodel/vy/output/packages/4fc20b9f4b7678fd97c8061807d18841050bd95dbbff16a6b78961303203e032/trainer-0.0.0.tar.gz', u'trainer-0.0.0.tar.gz']' returned non-zero exit status 1

I'm not sure what's going on here. I have a feeling it might be related to the type of machines I'm training the model on, but I've already tried going from STANDARD_1 to PREMIUM_1, and I've also tried a custom configuration with complex_model_l machines and large_model for the parameter servers.

The data I'm using has about 1,400 features.

I'm only training for 1,000 steps on one day of data, and I've already reduced the batch size. I can train like this locally, but when I try to train in the cloud (even with a small number of steps) I still hit this error.

I'm not sure what to try next...

It looks like the gsutil command that copies the packaged version of the model down to the local workers might be what's causing the problem. I wouldn't have thought 1,400 features would be enough to make a wide and deep model so large that I should worry about its size, so I'm not sure I really understand what's going on here, as I would have expected the other machine types and custom configurations to have fixed this.
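
One sanity check I could run (just a guess at a debugging step, assuming my local TensorFlow build has GCS support) is to confirm from a local Python shell that the staged trainer package the workers are trying to copy actually exists:

    # Rough sanity check: does the staged trainer package actually exist in GCS?
    import tensorflow as tf

    pkg = ("gs://pmc-ml/clickmodel/vy/output/packages/"
           "4fc20b9f4b7678fd97c8061807d18841050bd95dbbff16a6b78961303203e032/"
           "trainer-0.0.0.tar.gz")
    print(tf.gfile.Exists(pkg))  # False here would explain the failed gsutil cp on the workers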

P.S. Here is the yaml of the custom configuration I'm using:

trainingInput:
  scaleTier: CUSTOM
  masterType: large_model
  workerType: large_model
  parameterServerType: large_model
  workerCount: 15
  parameterServerCount: 10

My call to train the model looks like this:

  gcloud ml-engine jobs submit training $JOB_NAME \
    --stream-logs \
    --job-dir $OUTPUT_PATH \
    --runtime-version 1.2 \
    --config $CONFIG \
    --module-name trainer.task \
    --package-path $PACKAGE_PATH \
    --region $REGION \
    --scale-tier CUSTOM \
    -- \
    --train-files $TRAIN_DATA \
    --eval-files $EVAL_DATA \
    --train-steps 1000 \
    --verbosity DEBUG  \
    --eval-steps 100 \
    --num-layers 2 \
    --first-layer-size 200 \
    --scale-factor 0.99
The data above is just one day's worth, so I'm fairly sure my problem isn't that I'm passing in too much input data in terms of rows or steps. My batch size is also 100.

Update: I ran a hyperparameter tuning job, which is basically the same thing that had been working for me before. Here is the job info:

clickmodel_train_20171023_154805

Failed (10 min 19 sec)
Creation time: Oct 23, 2017, 4:48:08 PM
Start time: Oct 23, 2017, 4:48:12 PM
End time: Oct 23, 2017, 4:58:27 PM
Error message:
Hyperparameter Tuning Trial #1 Failed before any other successful trials were completed. The failed trial had parameters: num-layers=11, scale-factor=0.47899098586647881, first-layer-size=498, . The trial's error message was: The replica worker 4 exited with a non-zero status of 1. Termination reason: Error. Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 193, in <module> tf.gfile.DeleteRecursively(args.job_dir) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 432, in delete_recursively pywrap_tensorflow.DeleteRecursively(compat.as_bytes(dirname), status) File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__ self.gen.next() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status pywrap_tensorflow.TF_GetCode(status)) PermissionDeniedError: could not fully delete dir To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=642488228368&resource=ml_job%2Fjob_id%2Fclickmodel_train_20171023_154805&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22clickmodel_train_20171023_154805%22
Training input  
{
  "scaleTier": "CUSTOM",
  "masterType": "large_model",
  "workerType": "standard_gpu",
  "parameterServerType": "large_model",
  "workerCount": "10",
  "parameterServerCount": "5",
  "packageUris": [
    "gs://pmc-ml/clickmodel/vy/output/packages/326616fb7bab86d0d534c03f3260a0ff38c86112850b478ba28eca1e9d12d092/trainer-0.0.0.tar.gz"
  ],
  "pythonModule": "trainer.task",
  "args": [
    "--train-files",
    "gs://pmc-ml/clickmodel/vy/data/train_data_20170901*.csv",
    "--eval-files",
    "gs://pmc-ml/clickmodel/vy/data/dev_data_20170901*.csv",
    "--train-steps",
    "1000",
    "--verbosity",
    "DEBUG",
    "--eval-steps",
    "100",
    "--num-layers",
    "2",
    "--first-layer-size",
    "200",
    "--scale-factor",
    "0.99",
    "--train-batch-size",
    "100",
    "--eval-batch-size",
    "100"
  ],
  "hyperparameters": {
    "goal": "MAXIMIZE",
    "params": [
      {
        "parameterName": "first-layer-size",
        "minValue": 50,
        "maxValue": 500,
        "type": "INTEGER",
        "scaleType": "UNIT_LINEAR_SCALE"
      },
      {
        "parameterName": "num-layers",
        "minValue": 1,
        "maxValue": 15,
        "type": "INTEGER",
        "scaleType": "UNIT_LINEAR_SCALE"
      },
      {
        "parameterName": "scale-factor",
        "minValue": 0.1,
        "maxValue": 1,
        "type": "DOUBLE",
        "scaleType": "UNIT_REVERSE_LOG_SCALE"
      }
    ],
    "maxTrials": 12,
    "maxParallelTrials": 2,
    "hyperparameterMetricTag": "accuracy"
  },
  "region": "us-central1",
  "runtimeVersion": "1.2",
  "jobDir": "gs://pmc-ml/clickmodel/vy/output"
}

But now I'm getting this error:

16:58:06.188
The replica worker 4 exited with a non-zero status of 1. Termination reason: Error. Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 193, in <module> tf.gfile.DeleteRecursively(args.job_dir) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 432, in delete_recursively pywrap_tensorflow.DeleteRecursively(compat.as_bytes(dirname), status) File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__ self.gen.next() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status pywrap_tensorflow.TF_GetCode(status)) PermissionDeniedError: could not fully delete dir To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=642488228368&resource=ml_job%2Fjob_id%2Fclickmodel_train_20171023_154805&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22clickmodel_train_20171023_154805%22
{
 insertId:  "w77g2yg1zqa5fl"  
 logName:  "projects/pmc-analytical-data-mart/logs/ml.googleapis.com%2Fclickmodel_train_20171023_154805"  
 receiveTimestamp:  "2017-10-23T15:58:06.188221966Z"  
 resource: {…}  
 severity:  "ERROR"  
 textPayload:  "The replica worker 4 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 193, in <module>
    tf.gfile.DeleteRecursively(args.job_dir)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 432, in delete_recursively
    pywrap_tensorflow.DeleteRecursively(compat.as_bytes(dirname), status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
PermissionDeniedError: could not fully delete dir

To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=642488228368&resource=ml_job%2Fjob_id%2Fclickmodel_train_20171023_154805&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22clickmodel_train_20171023_154805%22"  
 timestamp:  "2017-10-23T15:58:06.188221966Z"  
}

So it does look like it could be some kind of permissions thing. I have added cloud-logs@google.com, cloud-ml-service@pmc-analytical-data-mart-8c548.iam.gserviceaccount.com and cloud-ml@google.com as admins on the pmc-ml bucket, so I'm wondering what else I might be missing.

Another update

I'm now also seeing these errors in the logs, though I'm not sure whether they are related:

{
 insertId:  "1986fw7g2uya0b9"  
 jsonPayload: {
  created:  1508774246.95985   
  levelname:  "ERROR"   
  lineno:  335   
  message:  "2017-10-23 15:57:26.959642: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 11.17G (11995578368 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY"   
  pathname:  "/runcloudml.py"   
 }
 labels: {
  compute.googleapis.com/resource_id:  "7863680028519935658"   
  compute.googleapis.com/resource_name:  "worker-f13b3addb0-7-s6dxq"   
  compute.googleapis.com/zone:  "us-central1-c"   
  ml.googleapis.com/job_id:  "clickmodel_train_20171023_154805"   
  ml.googleapis.com/job_id/log_area:  "root"   
  ml.googleapis.com/task_name:  "worker-replica-7"   
  ml.googleapis.com/trial_id:  "1"   
 }
 logName:  "projects/pmc-analytical-data-mart/logs/worker-replica-7"  
 receiveTimestamp:  "2017-10-23T15:57:32.288280956Z"  
 resource: {
  labels: {
   job_id:  "clickmodel_train_20171023_154805"    
   project_id:  "pmc-analytical-data-mart"    
   task_name:  "worker-replica-7"    
  }
  type:  "ml_job"   
 }
 severity:  "ERROR"  
 timestamp:  "2017-10-23T15:57:26.959845066Z"  
}

{
 insertId:  "11qijbbg2nchav0"  
 jsonPayload: {
  created:  1508774068.64571   
  levelname:  "ERROR"   
  lineno:  335   
  message:  "2017-10-23 15:54:28.645519: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 11.17G (11995578368 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY"   
  pathname:  "/runcloudml.py"   
 }
 labels: {
  compute.googleapis.com/resource_id:  "2962580336091050416"   
  compute.googleapis.com/resource_name:  "worker-a28b8b5d9c-8-ch8kg"   
  compute.googleapis.com/zone:  "us-central1-c"   
  ml.googleapis.com/job_id:  "clickmodel_train_20171023_154805"   
  ml.googleapis.com/job_id/log_area:  "root"   
  ml.googleapis.com/task_name:  "worker-replica-8"   
  ml.googleapis.com/trial_id:  "2"   
 }
 logName:  "projects/pmc-analytical-data-mart/logs/worker-replica-8"  
 receiveTimestamp:  "2017-10-23T15:54:59.620612418Z"  
 resource: {
  labels: {
   job_id:  "clickmodel_train_20171023_154805"    
   project_id:  "pmc-analytical-data-mart"    
   task_name:  "worker-replica-8"    
  }
  type:  "ml_job"   
 }
 severity:  "ERROR"  
 timestamp:  "2017-10-23T15:54:28.645709991Z"  
}

I might strip my input data files back to just 10 or so features to take one variable out of the equation, then rerun the same hyperparameter job and see whether I only get the permission error next time. If so, we can focus on that one first. The other two errors look memory-related, so maybe I just need bigger machines or smaller batches there - I reckon I should be able to google my way through those myself... I think... :)

Partial solution

OK, so after a lot of messing around I think I had two issues.

  1. I was reusing the same output job-dir (gs://pmc-ml/clickmodel/vy/output) every time I ran a job - I think that when a job fails this leaves some files behind that the next job then, for whatever reason, can't fully delete, which causes problems. I'm not 100% sure this was really an issue, but it seems like better practice to create a new output folder for each job anyway.
  2. I was passing "--scale-tier STANDARD_1" as an argument, and this seems to be what was causing the problem (am I just passing that argument in the wrong place? - if so, it's odd that it doesn't throw an error when the job is validated).
  3. This works:

    gcloud ml-engine jobs submit training test_023 \
    --job-dir gs://pmc-ml/clickmodel/vy/output_test_023 \
    --runtime-version 1.2 \
    --module-name trainer.task \
    --package-path /home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/trainer/ \
    --region us-central1 \
    -- \
    --train-files gs://pmc-ml/clickmodel/vy/rand_data/train_data_20170901_*.csv \
    --eval-files gs://pmc-ml/clickmodel/vy/rand_data/dev_data_20170901_*.csv \
    --train-steps 100 \
    --verbosity DEBUG
    

    But this fails:

    gcloud ml-engine jobs submit training test_024 \
    --job-dir gs://pmc-ml/clickmodel/vy/output_test_024 \
    --runtime-version 1.2 \
    --module-name trainer.task \
    --package-path /home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/trainer/ \
    --region us-central1 \
    --scale-tier STANDARD_1 \
    -- \
    --train-files gs://pmc-ml/clickmodel/vy/rand_data/train_data_20170901_*.csv \
    --eval-files gs://pmc-ml/clickmodel/vy/rand_data/dev_data_20170901_*.csv \
    --train-steps 100 \
    --verbosity DEBUG
    

    So I think my problem was that, when I started trying to scale up to the wider model with a lot more data, I began passing some of the machine-configuration settings as command-line args, and I'm not sure I was doing that correctly. It looks like I'm better off putting them all in the hptuning_config.yaml file and trying to scale up with a call like this:

    gcloud ml-engine jobs submit training test_022 \
    --job-dir gs://pmc-ml/clickmodel/vy/output_test_022 \
    --runtime-version 1.2 \
    --module-name trainer.task \
    --package-path /home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/trainer/ \
    --region us-central1 \
    --config /home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/hptuning_config.yaml \
    -- \
    --train-files gs://pmc-ml/clickmodel/vy/rand_data/train_data_20170901_*.csv \
    --eval-files gs://pmc-ml/clickmodel/vy/rand_data/dev_data_20170901_*.csv \
    --train-steps 100 \
    --verbosity DEBUG
    

    with hptuning_config.yaml looking like this:

    trainingInput:
      scaleTier: CUSTOM
      masterType: large_model
      workerType: standard_gpu
      parameterServerType: large_model
      workerCount: 10
      parameterServerCount: 5
      hyperparameters:
        goal: MAXIMIZE
        hyperparameterMetricTag: accuracy
        maxTrials: 5
        maxParallelTrials: 2
        params:
          - parameterName: first-layer-size
            type: INTEGER
            minValue: 20
            maxValue: 500
            scaleType: UNIT_LINEAR_SCALE
          - parameterName: num-layers
            type: INTEGER
            minValue: 1
            maxValue: 15
            scaleType: UNIT_LINEAR_SCALE
          - parameterName: scale-factor
            type: DOUBLE
            minValue: 0.01
            maxValue: 1.0
            scaleType: UNIT_REVERSE_LOG_SCALE
    

    So I'm now going to try adding back all of my features and training on 1 day of data, and then try scaling out to more days and more training steps, etc.

    As for passing "--scale-tier STANDARD_1", I'm not sure what the root cause was there, or whether it's a bug. Originally my thinking was that, rather than worrying about figuring out the different machine types etc., I would just pass "--scale-tier PREMIUM_1" when submitting the job and (hopefully) not have to worry about the actual machine types at all. So I think there may still be some sort of issue here.

1 Answer:

Answer 0: (score: 0)

There appear to be several problems here:

  1. Missing packages
  2. Recursive delete error
  3. GPU running out of memory

Missing packages. It looks like you are referencing a package-path that happens to be inside the job's output folder, and that output folder was most likely deleted (see #2). To prevent this, put your packages in their own folder that is not subject to any modification. You can do this when submitting jobs by using gcloud's --package-path option and providing a local --staging-dir.

Recursive delete error. The task.py you submitted is trying to delete a directory - probably the output directory. That can fail for several reasons; one possibility is that the Cloud ML service has insufficient permission to delete one or more of the files that already exist there. Check the ACLs, or consider creating a new output directory for every run.
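
As a rough illustration (this is not your task.py, just a sketch of the idea), one way to derive a fresh job directory per run, so nothing ever has to call tf.gfile.DeleteRecursively on a shared path:

    # Sketch only: build a timestamped job dir per run instead of reusing/deleting one.
    import argparse
    import datetime

    def unique_job_dir(base_dir):
        # e.g. gs://pmc-ml/clickmodel/vy/output -> gs://pmc-ml/clickmodel/vy/output/run_20171023_170000
        stamp = datetime.datetime.utcnow().strftime("run_%Y%m%d_%H%M%S")
        return base_dir.rstrip("/") + "/" + stamp

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--job-dir", required=True)
        args, _ = parser.parse_known_args()
        print(unique_job_dir(args.job_dir))  # pass this to the Estimator instead of args.job_dir

Generating a new --job-dir on the gcloud side for every submission (as you started doing with output_test_023, output_test_024) achieves the same thing.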

GPU out-of-memory error. A K80 only has 12 GB of RAM, so you either need to reduce the size of the model (e.g. fewer input features, smaller layers, etc.) or keep the big pieces off the GPU. In particular, you could consider putting the initial lookups on the CPU, since they probably won't benefit from the GPU anyway. That may be harder if you are using a "canned" estimator (e.g. DNNEstimator), which may not give you that level of control; in that case, either don't use GPUs or write your own model code.
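
As a sketch only (not your actual model code, and the exact APIs depend on your TensorFlow version; this assumes the tf.feature_column / tf.layers style), pinning the initial lookups to the CPU in custom model code looks roughly like this, so that only the dense layers compete for the K80's 12 GB:

    # Sketch only: keep the sparse-feature lookups in host memory and leave the
    # dense layers free to be placed on the GPU.
    import tensorflow as tf

    def build_deep_tower(features, deep_columns, hidden_units):
        with tf.device("/cpu:0"):
            # Embedding/one-hot lookups for the ~1400 columns stay on the CPU.
            net = tf.feature_column.input_layer(features, deep_columns)
        for units in hidden_units:
            net = tf.layers.dense(net, units, activation=tf.nn.relu)
        return tf.layers.dense(net, 1)  # logits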