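I am trying to run distributed TensorFlow by following the Google Cloud ML example here. I am running it on a Kubernetes cluster and have configured all the environment variables correctly (2 parameter servers and 2 workers). I get the following error: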
2017-04-07T21:36:51.092443795Z {"environment": "cloud", "cluster": {"ps": ["census-ps-0:5000", "census-ps-1:5000"], "worker": ["census-worker-0:5000", "census-worker-1:5000"], "master": ["census-worker-0:5000"]}, "task": {"type": "master", "inxex": 0}}
2017-04-07T21:36:51.092473871Z {u'environment': u'cloud', u'cluster': {u'ps': [u'census-ps-0:5000', u'census-ps-1:5000'], u'worker': [u'census-worker-0:5000', u'census-worker-1:5000'], u'master': [u'census-worker-0:5000']}, u'task': {u'type': u'master', u'inxex': 0}}
2017-04-07T21:36:51.907203514Z W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-07T21:36:51.907227466Z W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-07T21:36:51.907231184Z W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-07T21:36:51.907234415Z W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-04-07T21:36:51.907237325Z W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-07T21:36:51.907240325Z W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-04-07T21:36:51.914365914Z I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job master -> {0 -> localhost:5000}
2017-04-07T21:36:51.914383815Z I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> census-ps-0:5000, 1 -> census-ps-1:5000}
2017-04-07T21:36:51.914387511Z I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> census-worker-0:5000, 1 -> census-worker-1:5000}
2017-04-07T21:36:51.914974731Z I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:5000
2017-04-07T21:36:54.784234307Z I tensorflow/core/distributed_runtime/master_session.cc:1012] Start master session dd8a251a59872860 with config:
2017-04-07T21:36:54.784259971Z gpu_options {
2017-04-07T21:36:54.784263535Z per_process_gpu_memory_fraction: 1
2017-04-07T21:36:54.784266273Z }
2017-04-07T21:36:54.784268677Z
2017-04-07T21:36:54.861483497Z export TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["census-ps-0:5000", "census-ps-1:5000"], "worker": ["census-worker-0:5000", "census-worker-1:5000"], "master": ["census-worker-0:5000"]}, "task": {"type": "master", "inxex": 0}}'Starting Census: Please lauch tensorboard to see results: tensorboard --logdir=$MODEL_DIR
2017-04-07T21:36:54.86148432Z Traceback (most recent call last):
2017-04-07T21:36:54.861527172Z File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
2017-04-07T21:36:54.861535317Z "__main__", fname, loader, pkg_name)
2017-04-07T21:36:54.861540705Z File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
2017-04-07T21:36:54.861627932Z exec code in run_globals
2017-04-07T21:36:54.861641191Z File "/code/task.py", line 192, in <module>
2017-04-07T21:36:54.86166076Z learn_runner.run(generate_experiment_fn(**arguments), job_dir)
2017-04-07T21:36:54.861668307Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 106, in run
2017-04-07T21:36:54.861692382Z return task()
2017-04-07T21:36:54.861698247Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 459, in train_and_evaluate
2017-04-07T21:36:54.86177589Z self.train(delay_secs=0)
2017-04-07T21:36:54.86178479Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 281, in train
2017-04-07T21:36:54.861792289Z monitors=self._train_monitors + extra_hooks)
2017-04-07T21:36:54.861795862Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 280, in new_func
2017-04-07T21:36:54.861845229Z return func(*args, **kwargs)
2017-04-07T21:36:54.863930393Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 426, in fit
2017-04-07T21:36:54.863933057Z loss = self._train_model(input_fn=input_fn, hooks=hooks)
2017-04-07T21:36:54.863935517Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 981, in _train_model
2017-04-07T21:36:54.863938172Z config=self.config.tf_config) as mon_sess:
2017-04-07T21:36:54.863940574Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 315, in MonitoredTrainingSession
2017-04-07T21:36:54.863943261Z return MonitoredSession(session_creator=session_creator, hooks=all_hooks)
2017-04-07T21:36:54.863945685Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 601, in __init__
2017-04-07T21:36:54.863948181Z session_creator, hooks, should_recover=True)
2017-04-07T21:36:54.863950474Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 434, in __init__
2017-04-07T21:36:54.863952972Z self._sess = _RecoverableSession(self._coordinated_creator)
2017-04-07T21:36:54.863955292Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 767, in __init__
2017-04-07T21:36:54.863957783Z _WrappedSession.__init__(self, self._create_session())
2017-04-07T21:36:54.863960045Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 772, in _create_session
2017-04-07T21:36:54.863965454Z return self._sess_creator.create_session()
2017-04-07T21:36:54.863967812Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 494, in create_session
2017-04-07T21:36:54.863970316Z self.tf_sess = self._session_creator.create_session()
2017-04-07T21:36:54.863972622Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 366, in create_session
2017-04-07T21:36:54.863975112Z self._scaffold.finalize()
2017-04-07T21:36:54.863977366Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 183, in finalize
2017-04-07T21:36:54.863979905Z self._saver.build()
2017-04-07T21:36:54.863982274Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1081, in build
2017-04-07T21:36:54.863984743Z restore_sequentially=self._restore_sequentially)
2017-04-07T21:36:54.863987905Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 671, in build
2017-04-07T21:36:54.86399038Z restore_sequentially, reshape)
2017-04-07T21:36:54.863992624Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 445, in _AddShardedRestoreOps
2017-04-07T21:36:54.863995148Z name="restore_shard"))
2017-04-07T21:36:54.863997503Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 402, in _AddRestoreOps
2017-04-07T21:36:54.863999968Z tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
2017-04-07T21:36:54.864002332Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 242, in restore_op
2017-04-07T21:36:54.864004812Z [spec.tensor.dtype])[0])
2017-04-07T21:36:54.864007694Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 668, in restore_v2
2017-04-07T21:36:54.864010199Z dtypes=dtypes, name=name)
2017-04-07T21:36:54.864012414Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
2017-04-07T21:36:54.86401491Z op_def=op_def)
2017-04-07T21:36:54.864017117Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
2017-04-07T21:36:54.864028044Z original_op=self._default_original_op, op_def=op_def)
2017-04-07T21:36:54.864030331Z File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
2017-04-07T21:36:54.864032899Z self._traceback = _extract_stack()
2017-04-07T21:36:54.864035157Z
2017-04-07T21:36:54.864037633Z InvalidArgumentError (see above for traceback): Cannot assign a device to node 'save/RestoreV2_102': Could not satisfy explicit device specification '/job:ps/task:1/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:master/replica:0/task:0/cpu:0, /job:ps/replica:0/task:0/cpu:0, /job:worker/replica:0/task:0/cpu:0
2017-04-07T21:36:54.864043209Z [[Node: save/RestoreV2_102 = RestoreV2[dtypes=[DT_STRING], _device="/job:ps/task:1/device:CPU:0"](save/Const, save/RestoreV2_102/tensor_names, save/RestoreV2_102/shape_and_slices)]]
2017-04-07T21:36:54.864046084Z
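To double-check the environment on each pod, the TF_CONFIG value printed at the top of the log can be run through a small sanity check like the one below. This is only a minimal sketch: the helper name `check_tf_config` is hypothetical (it is not part of TensorFlow), and it assumes the `task` entry is expected to carry the keys `type` and `index`, with `index` pointing into the cluster list for that job type. Note that the config printed in the log spells the second key `inxex`; I have not confirmed whether that is related to the error.

```python
import json
import os


def check_tf_config(raw):
    """Return a list of problems found in a TF_CONFIG JSON string.

    Hypothetical helper for illustration; not a TensorFlow API.
    """
    problems = []
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})

    # The task entry is assumed to need both of these keys.
    for key in ("type", "index"):
        if key not in task:
            problems.append(
                "task is missing key %r (found keys: %s)" % (key, sorted(task))
            )

    # The task type should name one of the jobs listed in the cluster.
    job = task.get("type")
    if job is not None and job not in cluster:
        problems.append("task type %r is not a job in the cluster" % (job,))

    # The task index should be a valid position in that job's host list.
    index = task.get("index")
    if job in cluster and isinstance(index, int):
        if not 0 <= index < len(cluster[job]):
            problems.append(
                "task index %d is out of range for job %r" % (index, job)
            )

    return problems


if __name__ == "__main__":
    # Reads the variable the same way the pod would see it.
    for problem in check_tf_config(os.environ.get("TF_CONFIG", "{}")):
        print(problem)
```

Running this inside each container (master, workers, and parameter servers) should show whether every pod receives a well-formed `task` entry for its own role.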