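I am trying to use the Keras Cloud ML sample (https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/keras), and I cannot get the cloud training to run. Local training, with both python and gcloud, seems to work fine.
I have searched for a solution on stackexchange and Google, and read https://cloud.google.com/ml-engine/docs/how-tos/troubleshooting, but I seem to be the only one with this problem (usually a strong sign that the error is entirely mine!). In addition to the environment below, I also tried Python 3.6 and TensorFlow 1.3, without success.
I am a noob, so I am probably making some basic mistake, but I cannot spot it.
Any and all help is appreciated,
:-)
yarc68000.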
----------- environment --------
(env1) $ python --version
Python 2.7.13 :: Continuum Analytics, Inc.
(env1) $ conda list | grep 'h5py\|keras\|pandas\|numexpr\|tensorflow'
h5py 2.7.1 py27_1 conda-forge
keras 2.0.6 py27_0 conda-forge
numexpr 2.6.2 py27_1 conda-forge
pandas 0.20.3 py27_0 anaconda
tensorflow 1.2.1 <pip>
(env1) $ gcloud --version
Google Cloud SDK 172.0.1
alpha 2017.09.15
beta 2017.09.15
bq 2.0.26
core 2017.09.21
datalab 20170818
gcloud
gsutil 4.27
----------- job --------
(env1) $ export BUCKET=gs://j170922census1
(env1) $ gsutil mb $BUCKET
Creating gs://j170922census1/...
(env1) $ export TRAIN_FILE=gs://cloudml-public/census/data/adult.data.csv
(env1) $ export EVAL_FILE=gs://cloudml-public/census/data/adult.test.csv
(env1) $ export JOB_NAME="census_keras_$$"
(env1) $ export TRAIN_STEPS=200
(env1) $ gcloud ml-engine jobs submit training $JOB_NAME --stream-logs --runtime-version 1.2 --job-dir $BUCKET --package-path trainer --module-name trainer.task --region us-central1 -- --train-files $TRAIN_FILE --eval-files $EVAL_FILE --train-steps $TRAIN_STEPS
Job [census_keras_7639] submitted successfully.
INFO 2017-09-22 19:56:56 +0200 service Validating job requirements...
INFO 2017-09-22 19:56:57 +0200 service Job creation request has been successfully validated.
INFO 2017-09-22 19:56:57 +0200 service Job census_keras_7639 is queued.
INFO 2017-09-22 19:56:57 +0200 service Waiting for job to be provisioned.
INFO 2017-09-22 20:01:39 +0200 service Waiting for TensorFlow to start.
INFO 2017-09-22 20:02:55 +0200 master-replica-0 Running task with arguments: --cluster={"master": ["master-cc38d44a51-0:2222"]} --task={"type": "master", "index": 0} --job={
<..>
INFO 2017-09-22 20:04:00 +0200 master-replica-0 197/200 [============================>.] - ETA: 0s - loss: 0.6931 - acc: 0.7563
INFO 2017-09-22 20:04:00 +0200 master-replica-0 200/200 [==============================] - 1s - loss: 0.6931 - acc: 0.7600
INFO 2017-09-22 20:04:00 +0200 master-replica-0 Epoch 10/20
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 Traceback (most recent call last):
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 "__main__", fname, loader, pkg_name)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 exec code in run_globals
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 199, in <module>
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 dispatch(**parse_args.__dict__)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 121, in dispatch
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 callbacks=callbacks)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 return func(*args, **kwargs)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/models.py", line 1110, in fit_generator
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 initial_epoch=initial_epoch)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 return func(*args, **kwargs)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/engine/training.py", line 1849, in fit_generator
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 callbacks.on_epoch_begin(epoch)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/keras/callbacks.py", line 63, in on_epoch_begin
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 callback.on_epoch_begin(epoch, logs)
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 57, in on_epoch_begin
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 census_model = load_model(checkpoints[-1])
ERROR 2017-09-22 20:04:02 +0200 master-replica-0 IndexError: list index out of range
<..>
INFO 2017-09-22 20:04:53 +0200 service Finished tearing down TensorFlow.
INFO 2017-09-22 20:05:49 +0200 service Job failed.
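Answer 0 (score: 0)
There is actually a bug when running this sample on Cloud ML Engine, because checkpointing to GCS is currently disabled (Keras cannot write checkpoints to GCS natively). For an immediate workaround for the problem you are hitting, see this PR. Also have a look at the pending PR that fixes the checkpoint issue and makes the files available on GCS (a workaround for Keras not being able to write to GCS).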
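
To make the failure in the traceback concrete: `checkpoints[-1]` in `trainer/task.py` raises `IndexError` when the list of checkpoint files is empty, which is what one would expect if no checkpoint was ever written. Below is a minimal sketch of the two ideas mentioned above, not the code from either PR. The helper names `latest_checkpoint` and `copy_file`, and the `checkpoint.*` file pattern, are hypothetical and chosen only for illustration. The sketch assumes checkpoints are first saved to the local filesystem; on Cloud ML Engine one would presumably pass something like TensorFlow's `file_io.FileIO` as `open_dst` to write to a `gs://` path, but that part is an assumption and is not exercised here.

```python
import glob
import os


def latest_checkpoint(pattern):
    """Return the last matching path in sorted order, or None if none match.

    Returning None lets the caller skip evaluation instead of indexing an
    empty list (the IndexError seen in the traceback).
    """
    checkpoints = sorted(glob.glob(pattern))
    if not checkpoints:
        return None
    return checkpoints[-1]


def copy_file(src, dst, open_dst=open):
    """Copy a locally written file to dst.

    open_dst is a hypothetical hook: any callable taking (path, mode) and
    returning a writable file object. The built-in open is the default.
    """
    with open(src, "rb") as fin:
        with open_dst(dst, "wb") as fout:
            fout.write(fin.read())


def newest_or_skip(local_dir):
    """Example guard: look for local checkpoints before trying to load one."""
    path = latest_checkpoint(os.path.join(local_dir, "checkpoint.*"))
    if path is None:
        # Nothing saved yet, so there is nothing to evaluate this epoch.
        return None
    return path
```

In a callback such as the sample's `on_epoch_begin`, the result of `newest_or_skip` would be checked for `None` before calling `load_model`, so that an epoch without any saved checkpoint is skipped rather than crashing the job.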