遵循Running MNIST on Cloud TPU教程:
尝试训练时出现以下错误:
python /usr/share/models/official/mnist/mnist_tpu.py \
--tpu=$TPU_NAME \
--DATA_DIR=${STORAGE_BUCKET}/data \
--MODEL_DIR=${STORAGE_BUCKET}/output \
--use_tpu=True \
--iterations=500 \
--train_steps=2000
=>
alexryan@alex-tpu:~/tpu$ ./train-mnist.sh
W1025 20:21:39.351166 139745816463104 __init__.py:44] file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/__init__.py", line 41, in autodetect
from . import file_cache
File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
'file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth')
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
File "/usr/share/models/official/mnist/mnist_tpu.py", line 173, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/usr/share/models/official/mnist/mnist_tpu.py", line 152, in main
tpu_config=tf.contrib.tpu.TPUConfig(FLAGS.iterations, FLAGS.num_shards),
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_config.py", line 207, in __init__
self._master = cluster.master()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py", line 223, in master
job_tasks = self.cluster_spec().job_tasks(self._job_name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py", line 269, in cluster_spec
(compat.as_text(self._tpu), response['health']))
RuntimeError: TPU "alex-tpu" is unhealthy: "TIMEOUT"
alexryan@alex-tpu:~/tpu$
我与说明不同的唯一地方是:
我没有在云shell中运行ctpu,而是在Mac上运行了它。
>ctpu version
ctpu version: 1.7
TPU所在的区域与我的配置的默认区域不同,因此我将其指定为如下选项:
>cat ctpu-up.sh
ctpu up --zone us-central1-b --preemptible
我能够将MNIST文件从vm移到gcs存储桶,没问题:
alexryan@alex-tpu:~$ gsutil cp -r ./data ${STORAGE_BUCKET}
Copying file://./data/validation.tfrecords [Content-Type=application/octet-stream]...
Copying file://./data/train-images-idx3-ubyte.gz [Content-Type=application/octet-stream]...
我尝试了(Optional) Set up TensorBoard> Running cloud_tpu_profiler
转到Cloud Console> TPU>,然后单击您创建的TPU。 找到Cloud TPU的服务帐户名称并复制它,以便 例如:
service-11111111118@cloud-tpu.iam.myserviceaccount.com
在存储桶列表中,选择要使用的存储桶,然后选择显示 信息面板,然后选择编辑存储桶权限。贴上你的 服务帐户名称进入该存储桶的添加成员字段,并 选择以下权限:
“ Cloud Console> TPU”作为选项不存在
所以我使用了与虚拟机关联的服务帐户
“云控制台> Compute Engine> alex-tpu”
由于最后一条错误消息是“ RuntimeError:TPU“ alex-tpu”不健康:“ TIMEOUT”,因此我使用ctpu删除了vm,然后重新创建它并再次运行它。 这次我遇到了更多错误:
这似乎只是一个警告...
ImportError: file_cache is unavailable when using oauth2client >=
4.0.0 or google-auth
不确定这一点...
ERROR:tensorflow:Operation of type Placeholder (reshape_input) is not supported on the TPU. Execution will fail if this op is used in the graph.
这似乎杀死了训练...
INFO:tensorflow:Error recorded from training_loop: File system scheme '[local]' not implemented (file: '/tmp/tmpaiggRW/model.ckpt-0_temp_9216e11a1368405795d9b5282775f562') [[{{node save/SaveV2}} = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64],
_device="/job:worker/replica:0/task:0/device:CPU:0"](save/ShardedFilename, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, conv2d/bias/Read/ReadVariableOp, conv2d/kernel/Read/ReadVariableOp, conv2d_1/bias/Read/ReadVariableOp, conv2d_1/kernel/Read/ReadVariableOp, dense/bias/Read/ReadVariableOp, dense/kernel/Read/ReadVariableOp, dense_1/bias/Read/ReadVariableOp, dense_1/kernel/Read/ReadVariableOp, global_step/Read/ReadVariableOp)]]
Caused by op u'save/SaveV2', defined at: File "/usr/share/models/official/mnist/mnist_tpu.py", line 173, in <module>
tf.app.run() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv)) File "/usr/share/models/official/mnist/mnist_tpu.py", line 163, in main
estimator.train(input_fn=train_input_fn, max_steps=FLAGS.train_steps) File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2394, in train
saving_listeners=saving_listeners File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 356, in train
loss = self._train_model(input_fn, hooks, saving_listeners) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1215, in _train_model_default
saving_listeners) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1406, in _train_with_estimator_spec
log_step_count_steps=self._config.log_step_count_steps) as mon_sess: File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
stop_grace_period_secs=stop_grace_period_secs) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
self._sess = _RecoverableSession(self._coordinated_creator) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
_WrappedSession.__init__(self, self._create_session()) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
return self._sess_creator.create_session() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
self.tf_sess = self._session_creator.create_session() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 557, in create_session
self._scaffold.finalize() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 215, in finalize
self._saver.build() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1106, in build
self._build(self._filename, build_save=True, build_restore=True) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1143, in _build
build_save=build_save, build_restore=build_restore) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 778, in _build_internal
save_tensor = self._AddShardedSaveOps(filename_tensor, per_device) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 369, in _AddShardedSaveOps
return self._AddShardedSaveOpsForV2(filename_tensor, per_device) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 343, in _AddShardedSaveOpsForV2
sharded_saves.append(self._AddSaveOps(sharded_filename, saveables)) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 284, in _AddSaveOps
save = self.save_op(filename_tensor, saveables) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 202, in save_op
tensors) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1690, in save_v2
shape_and_slices=shape_and_slices, tensors=tensors, name=name) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
op_def=op_def) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
self._traceback = tf_stack.extract_stack()
UnimplementedError (see above for traceback): File system scheme '[local]' not implemented (file: '/tmp/tmpaiggRW/model.ckpt-0_temp_9216e11a1368405795d9b5282775f562')
更新
我收到此错误...
INFO:tensorflow:Error recorded from training_loop: File system scheme '[local]' not implemented
...即使--use_tpu = False
alexryan@alex-tpu:~/tpu$ cat train-mnist.sh
python /usr/share/models/official/mnist/mnist_tpu.py \
--tpu=$TPU_NAME \
--DATA_DIR=${STORAGE_BUCKET}/data \
--MODEL_DIR=${STORAGE_BUCKET}/output \
--use_tpu=False \
--iterations=500 \
--train_steps=2000
This stack overflow answer建议tpu尝试写入不存在的文件系统,而不是我指定的gcs存储桶。我不清楚为什么会这样。
答案 0 :(得分:1)
在第一种情况下,您创建的TPU似乎没有处于健康状态。因此,删除并重新创建TPU或整个VM是解决此问题的正确方法。
我认为第二种情况(您删除了虚拟机并重新创建)的错误是由于您的$ {STORAGE_BUCKET}不确定或不是正确的GCS存储桶。它应该是一个GCS存储桶。本地路径不起作用,并显示以下错误。 有关创建GCS存储桶的更多信息,请参见https://cloud.google.com/tpu/docs/tutorials/mnist
的“创建云存储桶”部分希望这能回答您的问题。
答案 1 :(得分:0)
遇到同样的问题,发现本教程中有错别字。如果选中mnist_tpu.py,则会发现参数必须为小写。
如果您进行更改,则可以正常工作。
python /usr/share/models/official/mnist/mnist_tpu.py \
--tpu=$TPU_NAME \
--data_dir=${STORAGE_BUCKET}/data \
--model_dir=${STORAGE_BUCKET}/output \
--use_tpu=True \
--iterations=500 \
--train_steps=2000