I'm trying to build a wide and deep TensorFlow model and train it on Google Cloud.
I've been able to do this and train smaller development versions.
However, now that I'm trying to scale up to more data and more training steps, my online training jobs keep failing.
The job runs for about 5 minutes and then I get the following error:
The replica worker 2 exited with a non-zero status of 1. Termination reason: Error.
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=642488228368&resource=ml_job%2Fjob_id%2Fclickmodel_train_20171023_123542&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22clickmodel_train_20171023_123542%22
When I look at the logs, I can see these errors, which seem to be the problem:
Command '['gsutil', '-q', 'cp', u'gs://pmc-ml/clickmodel/vy/output/packages/4fc20b9f4b7678fd97c8061807d18841050bd95dbbff16a6b78961303203e032/trainer-0.0.0.tar.gz', u'trainer-0.0.0.tar.gz']' returned non-zero exit status 1
I'm not sure what's going on here. I suspect it may be related to the machine type I'm training on, but I've tried everything from "STANDARD_1" to "PREMIUM_1", and I've also tried custom machine types with "complex_model_l" workers and "large_model" parameter servers.
The data I'm using has around 1,400 features.
I'm training on just one day of data for only 1,000 steps, and I've already reduced the batch size. I can train like this locally, but when I try to train in the cloud (even with very few steps) I hit this error.
I'm not sure what to try next...
It looks like the gsutil command may be copying the packaged version of the model down to the local worker and running into problems there. I wouldn't have thought 1,400 features is enough for a wide-and-deep model to be worryingly large, so I'm not sure I'm thinking about this the right way, and I'd have expected the other machine types and custom configurations to get around any size issue.
P.S. Here is the yaml for the custom configuration I'm using:
trainingInput:
  scaleTier: CUSTOM
  masterType: large_model
  workerType: large_model
  parameterServerType: large_model
  workerCount: 15
  parameterServerCount: 10
The call I make to train the model looks like:
gcloud ml-engine jobs submit training $JOB_NAME \
--stream-logs \
--job-dir $OUTPUT_PATH \
--runtime-version 1.2 \
--config $CONFIG \
--module-name trainer.task \
--package-path $PACKAGE_PATH \
--region $REGION \
--scale-tier CUSTOM \
-- \
--train-files $TRAIN_DATA \
--eval-files $EVAL_DATA \
--train-steps 1000 \
--verbosity DEBUG \
--eval-steps 100 \
--num-layers 2 \
--first-layer-size 200 \
--scale-factor 0.99
UPDATE
I ran a hyperparameter tuning job, which is actually what had been working for me before. Here is the job info:
clickmodel_train_20171023_154805
Failed (10 min 19 sec)
Creation time: Oct 23, 2017, 4:48:08 PM
Start time: Oct 23, 2017, 4:48:12 PM
End time: Oct 23, 2017, 4:58:27 PM
Error message:
Hyperparameter Tuning Trial #1 Failed before any other successful trials were completed. The failed trial had parameters: num-layers=11, scale-factor=0.47899098586647881, first-layer-size=498, . The trial's error message was: The replica worker 4 exited with a non-zero status of 1. Termination reason: Error. Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 193, in <module> tf.gfile.DeleteRecursively(args.job_dir) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 432, in delete_recursively pywrap_tensorflow.DeleteRecursively(compat.as_bytes(dirname), status) File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__ self.gen.next() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status pywrap_tensorflow.TF_GetCode(status)) PermissionDeniedError: could not fully delete dir To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=642488228368&resource=ml_job%2Fjob_id%2Fclickmodel_train_20171023_154805&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22clickmodel_train_20171023_154805%22
Training input
{
"scaleTier": "CUSTOM",
"masterType": "large_model",
"workerType": "standard_gpu",
"parameterServerType": "large_model",
"workerCount": "10",
"parameterServerCount": "5",
"packageUris": [
"gs://pmc-ml/clickmodel/vy/output/packages/326616fb7bab86d0d534c03f3260a0ff38c86112850b478ba28eca1e9d12d092/trainer-0.0.0.tar.gz"
],
"pythonModule": "trainer.task",
"args": [
"--train-files",
"gs://pmc-ml/clickmodel/vy/data/train_data_20170901*.csv",
"--eval-files",
"gs://pmc-ml/clickmodel/vy/data/dev_data_20170901*.csv",
"--train-steps",
"1000",
"--verbosity",
"DEBUG",
"--eval-steps",
"100",
"--num-layers",
"2",
"--first-layer-size",
"200",
"--scale-factor",
"0.99",
"--train-batch-size",
"100",
"--eval-batch-size",
"100"
],
"hyperparameters": {
"goal": "MAXIMIZE",
"params": [
{
"parameterName": "first-layer-size",
"minValue": 50,
"maxValue": 500,
"type": "INTEGER",
"scaleType": "UNIT_LINEAR_SCALE"
},
{
"parameterName": "num-layers",
"minValue": 1,
"maxValue": 15,
"type": "INTEGER",
"scaleType": "UNIT_LINEAR_SCALE"
},
{
"parameterName": "scale-factor",
"minValue": 0.1,
"maxValue": 1,
"type": "DOUBLE",
"scaleType": "UNIT_REVERSE_LOG_SCALE"
}
],
"maxTrials": 12,
"maxParallelTrials": 2,
"hyperparameterMetricTag": "accuracy"
},
"region": "us-central1",
"runtimeVersion": "1.2",
"jobDir": "gs://pmc-ml/clickmodel/vy/output"
}
But now I'm hitting this error:
16:58:06.188
The replica worker 4 exited with a non-zero status of 1. Termination reason: Error. Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 193, in <module> tf.gfile.DeleteRecursively(args.job_dir) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 432, in delete_recursively pywrap_tensorflow.DeleteRecursively(compat.as_bytes(dirname), status) File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__ self.gen.next() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status pywrap_tensorflow.TF_GetCode(status)) PermissionDeniedError: could not fully delete dir To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=642488228368&resource=ml_job%2Fjob_id%2Fclickmodel_train_20171023_154805&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22clickmodel_train_20171023_154805%22
{
insertId: "w77g2yg1zqa5fl"
logName: "projects/pmc-analytical-data-mart/logs/ml.googleapis.com%2Fclickmodel_train_20171023_154805"
receiveTimestamp: "2017-10-23T15:58:06.188221966Z"
resource: {…}
severity: "ERROR"
textPayload: "The replica worker 4 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 193, in <module>
tf.gfile.DeleteRecursively(args.job_dir)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 432, in delete_recursively
pywrap_tensorflow.DeleteRecursively(compat.as_bytes(dirname), status)
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
PermissionDeniedError: could not fully delete dir
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=642488228368&resource=ml_job%2Fjob_id%2Fclickmodel_train_20171023_154805&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22clickmodel_train_20171023_154805%22"
timestamp: "2017-10-23T15:58:06.188221966Z"
}
It does look like it could be some sort of permissions issue. I've added cloud-logs@google.com, cloud-ml-service@pmc-analytical-data-mart-8c548.iam.gserviceaccount.com, and cloud-ml@google.com as admins on the pmc-ml bucket, so I'm wondering what else I might be missing.
ANOTHER UPDATE
I'm now also seeing these errors in the logs, though I'm not sure whether they're related:
{
insertId: "1986fw7g2uya0b9"
jsonPayload: {
created: 1508774246.95985
levelname: "ERROR"
lineno: 335
message: "2017-10-23 15:57:26.959642: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 11.17G (11995578368 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY"
pathname: "/runcloudml.py"
}
labels: {
compute.googleapis.com/resource_id: "7863680028519935658"
compute.googleapis.com/resource_name: "worker-f13b3addb0-7-s6dxq"
compute.googleapis.com/zone: "us-central1-c"
ml.googleapis.com/job_id: "clickmodel_train_20171023_154805"
ml.googleapis.com/job_id/log_area: "root"
ml.googleapis.com/task_name: "worker-replica-7"
ml.googleapis.com/trial_id: "1"
}
logName: "projects/pmc-analytical-data-mart/logs/worker-replica-7"
receiveTimestamp: "2017-10-23T15:57:32.288280956Z"
resource: {
labels: {
job_id: "clickmodel_train_20171023_154805"
project_id: "pmc-analytical-data-mart"
task_name: "worker-replica-7"
}
type: "ml_job"
}
severity: "ERROR"
timestamp: "2017-10-23T15:57:26.959845066Z"
}
and
{
insertId: "11qijbbg2nchav0"
jsonPayload: {
created: 1508774068.64571
levelname: "ERROR"
lineno: 335
message: "2017-10-23 15:54:28.645519: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 11.17G (11995578368 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY"
pathname: "/runcloudml.py"
}
labels: {
compute.googleapis.com/resource_id: "2962580336091050416"
compute.googleapis.com/resource_name: "worker-a28b8b5d9c-8-ch8kg"
compute.googleapis.com/zone: "us-central1-c"
ml.googleapis.com/job_id: "clickmodel_train_20171023_154805"
ml.googleapis.com/job_id/log_area: "root"
ml.googleapis.com/task_name: "worker-replica-8"
ml.googleapis.com/trial_id: "2"
}
logName: "projects/pmc-analytical-data-mart/logs/worker-replica-8"
receiveTimestamp: "2017-10-23T15:54:59.620612418Z"
resource: {
labels: {
job_id: "clickmodel_train_20171023_154805"
project_id: "pmc-analytical-data-mart"
task_name: "worker-replica-8"
}
type: "ml_job"
}
severity: "ERROR"
timestamp: "2017-10-23T15:54:28.645709991Z"
}
I might strip my input data files back to just 10 or so features to take one variable out of the equation. Then I'll re-run the same hyperparameter job and see whether I only get the permissions error next time; if so, we can focus on that one first. The other two look memory-related, so maybe I just need bigger machines or smaller batches - I reckon I should be able to Google my way through that myself... I think... :)
PARTIAL SOLUTION
OK, so after a lot of messing around, I think I have two problems here.
This works:
gcloud ml-engine jobs submit training test_023 \
--job-dir gs://pmc-ml/clickmodel/vy/output_test_023 \
--runtime-version 1.2 \
--module-name trainer.task \
--package-path /home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/trainer/ \
--region us-central1 \
-- \
--train-files gs://pmc-ml/clickmodel/vy/rand_data/train_data_20170901_*.csv \
--eval-files gs://pmc-ml/clickmodel/vy/rand_data/dev_data_20170901_*.csv \
--train-steps 100 \
--verbosity DEBUG
But this fails:
gcloud ml-engine jobs submit training test_024 \
--job-dir gs://pmc-ml/clickmodel/vy/output_test_024 \
--runtime-version 1.2 \
--module-name trainer.task \
--package-path /home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/trainer/ \
--region us-central1 \
--scale-tier STANDARD_1 \
-- \
--train-files gs://pmc-ml/clickmodel/vy/rand_data/train_data_20170901_*.csv \
--eval-files gs://pmc-ml/clickmodel/vy/rand_data/dev_data_20170901_*.csv \
--train-steps 100 \
--verbosity DEBUG
So I think my problem was that, as I tried to start scaling up with a wider model and lots more data, I started passing some machine-configuration options via command-line args, and I'm not sure I was doing that correctly. It looks like I'm better off putting them in the hptuning_config.yaml file and trying to scale with a call like this:
gcloud ml-engine jobs submit training test_022 \
--job-dir gs://pmc-ml/clickmodel/vy/output_test_022 \
--runtime-version 1.2 \
--module-name trainer.task \
--package-path /home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/trainer/ \
--region us-central1 \
--config /home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/hptuning_config.yaml \
-- \
--train-files gs://pmc-ml/clickmodel/vy/rand_data/train_data_20170901_*.csv \
--eval-files gs://pmc-ml/clickmodel/vy/rand_data/dev_data_20170901_*.csv \
--train-steps 100 \
--verbosity DEBUG
where hptuning_config.yaml looks like:
trainingInput:
  scaleTier: CUSTOM
  masterType: large_model
  workerType: standard_gpu
  parameterServerType: large_model
  workerCount: 10
  parameterServerCount: 5
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: accuracy
    maxTrials: 5
    maxParallelTrials: 2
    params:
      - parameterName: first-layer-size
        type: INTEGER
        minValue: 20
        maxValue: 500
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: num-layers
        type: INTEGER
        minValue: 1
        maxValue: 15
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: scale-factor
        type: DOUBLE
        minValue: 0.01
        maxValue: 1.0
        scaleType: UNIT_REVERSE_LOG_SCALE
So I'll now try adding back all my features and training on one day of data, then try scaling out to more days and more training steps, etc.
As for passing "--scale-tier STANDARD_1" failing, I'm not sure what the root cause is there, or whether it's a bug. Originally I was thinking that, rather than worrying about working out different machine types etc., I'd just pass "--scale-tier PREMIUM_1" when submitting jobs and (hopefully) not have to worry about the actual machine types at all. So I think there may still be some sort of issue here.
Answer (score: 0)
There seem to be a number of issues here:
1. Missing package. It looks like you are referencing a package-path that happens to be inside the job's output folder, and the output folder is most likely being deleted (see #2). To prevent that, put your packages in their own folder, unaffected by any jobs. You can do this by using gcloud's --package-path option when submitting the job and providing a local --staging-dir.
2. Recursive delete error. The task.py you submitted is trying to delete a directory, presumably the output directory. That can fail for several reasons, one being insufficient permissions for the CloudML service to delete one or more of the files present. Check the ACLs, or consider creating a new output directory for each run.
3. GPU out-of-memory error. The K80 has only 12 GB of RAM, so either reduce the size of your model (e.g. fewer input features, smaller layers, etc.), or consider placing the initial lookups on the CPU, since they may not benefit from the GPU anyway. The latter may be harder if you are using a "canned" Estimator (e.g. DNNEstimator), which may not give you enough control; in that case, either don't use GPUs, or you will have to write your own model code.
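The suggestion in #2, creating a new output directory for each run instead of deleting the old one, could be sketched like this. The helper name and the `run_` prefix are my own inventions; only the bucket path is taken from the question:

```python
from datetime import datetime

def unique_job_dir(base="gs://pmc-ml/clickmodel/vy/output"):
    """Return a fresh, timestamped job dir so no run ever has to delete a
    previous run's output (avoiding the PermissionDeniedError raised by
    tf.gfile.DeleteRecursively on the shared directory)."""
    stamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
    return "{}/run_{}".format(base, stamp)

# Pass the result as --job-dir when submitting the training job.
print(unique_job_dir())
```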
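As a rough sanity check on the model-size point in #3: a back-of-envelope estimate (entirely my own; layer sizes assumed, float32 parameters) suggests the dense layers of a ~1,400-feature wide-and-deep model are tiny compared to a K80's 12 GB, so any real memory pressure would more likely come from activations, batch size, or the framework's up-front allocation than from the weights alone:

```python
def dense_param_bytes(num_inputs, layer_sizes, bytes_per_param=4):
    """float32 bytes for the weights + biases of a stack of dense layers."""
    total, prev = 0, num_inputs
    for size in layer_sizes:
        total += (prev * size + size) * bytes_per_param  # W: prev*size, b: size
        prev = size
    return total

# ~1,400 features into the failing trial's first-layer-size of 498,
# with an assumed second layer of half that:
mb = dense_param_bytes(1400, [498, 249]) / 1e6
print(round(mb, 2), "MB")  # → 3.29 MB, versus ~12,000 MB on a K80
```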