Cannot assign a device to node in distributed TensorFlow

Time: 2017-04-07 21:34:06

Tags: tensorflow kubernetes distributed

I am trying to run distributed TF by following the Google Cloud ML example here.

I am running it on a Kubernetes cluster with all environment variables configured correctly (2 ps and 2 workers). I get the following error:

2017-04-07T21:36:51.092443795Z {"environment": "cloud", "cluster": {"ps": ["census-ps-0:5000", "census-ps-1:5000"], "worker": ["census-worker-0:5000", "census-worker-1:5000"], "master": ["census-worker-0:5000"]}, "task": {"type": "master", "inxex": 0}}
2017-04-07T21:36:51.092473871Z {u'environment': u'cloud', u'cluster': {u'ps': [u'census-ps-0:5000', u'census-ps-1:5000'], u'worker': [u'census-worker-0:5000', u'census-worker-1:5000'], u'master': [u'census-worker-0:5000']}, u'task': {u'type': u'master', u'inxex': 0}}
2017-04-07T21:36:51.907203514Z W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-07T21:36:51.907227466Z W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-07T21:36:51.907231184Z W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-07T21:36:51.907234415Z W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-04-07T21:36:51.907237325Z W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-07T21:36:51.907240325Z W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-04-07T21:36:51.914365914Z I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job master -> {0 -> localhost:5000}
2017-04-07T21:36:51.914383815Z I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> census-ps-0:5000, 1 -> census-ps-1:5000}
2017-04-07T21:36:51.914387511Z I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> census-worker-0:5000, 1 -> census-worker-1:5000}
2017-04-07T21:36:51.914974731Z I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:5000
2017-04-07T21:36:54.784234307Z I tensorflow/core/distributed_runtime/master_session.cc:1012] Start master session dd8a251a59872860 with config: 
2017-04-07T21:36:54.784259971Z gpu_options {
2017-04-07T21:36:54.784263535Z   per_process_gpu_memory_fraction: 1
2017-04-07T21:36:54.784266273Z }
2017-04-07T21:36:54.784268677Z 
2017-04-07T21:36:54.861483497Z export TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["census-ps-0:5000", "census-ps-1:5000"], "worker": ["census-worker-0:5000", "census-worker-1:5000"], "master": ["census-worker-0:5000"]}, "task": {"type": "master", "inxex": 0}}'Starting Census: Please lauch tensorboard to see results: tensorboard --logdir=$MODEL_DIR
2017-04-07T21:36:54.86148432Z Traceback (most recent call last):
2017-04-07T21:36:54.861527172Z   File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
2017-04-07T21:36:54.861535317Z     "__main__", fname, loader, pkg_name)
2017-04-07T21:36:54.861540705Z   File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
2017-04-07T21:36:54.861627932Z     exec code in run_globals
2017-04-07T21:36:54.861641191Z   File "/code/task.py", line 192, in <module>
2017-04-07T21:36:54.86166076Z     learn_runner.run(generate_experiment_fn(**arguments), job_dir)
2017-04-07T21:36:54.861668307Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 106, in run
2017-04-07T21:36:54.861692382Z     return task()
2017-04-07T21:36:54.861698247Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 459, in train_and_evaluate
2017-04-07T21:36:54.86177589Z     self.train(delay_secs=0)
2017-04-07T21:36:54.86178479Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 281, in train
2017-04-07T21:36:54.861792289Z     monitors=self._train_monitors + extra_hooks)
2017-04-07T21:36:54.861795862Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 280, in new_func
2017-04-07T21:36:54.861845229Z     return func(*args, **kwargs)
2017-04-07T21:36:54.863930393Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 426, in fit
2017-04-07T21:36:54.863933057Z     loss = self._train_model(input_fn=input_fn, hooks=hooks)
2017-04-07T21:36:54.863935517Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 981, in _train_model
2017-04-07T21:36:54.863938172Z     config=self.config.tf_config) as mon_sess:
2017-04-07T21:36:54.863940574Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 315, in MonitoredTrainingSession
2017-04-07T21:36:54.863943261Z     return MonitoredSession(session_creator=session_creator, hooks=all_hooks)
2017-04-07T21:36:54.863945685Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 601, in __init__
2017-04-07T21:36:54.863948181Z     session_creator, hooks, should_recover=True)
2017-04-07T21:36:54.863950474Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 434, in __init__
2017-04-07T21:36:54.863952972Z     self._sess = _RecoverableSession(self._coordinated_creator)
2017-04-07T21:36:54.863955292Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 767, in __init__
2017-04-07T21:36:54.863957783Z     _WrappedSession.__init__(self, self._create_session())
2017-04-07T21:36:54.863960045Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 772, in _create_session
2017-04-07T21:36:54.863965454Z     return self._sess_creator.create_session()
2017-04-07T21:36:54.863967812Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 494, in create_session
2017-04-07T21:36:54.863970316Z     self.tf_sess = self._session_creator.create_session()
2017-04-07T21:36:54.863972622Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 366, in create_session
2017-04-07T21:36:54.863975112Z     self._scaffold.finalize()
2017-04-07T21:36:54.863977366Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 183, in finalize
2017-04-07T21:36:54.863979905Z     self._saver.build()
2017-04-07T21:36:54.863982274Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1081, in build
2017-04-07T21:36:54.863984743Z     restore_sequentially=self._restore_sequentially)
2017-04-07T21:36:54.863987905Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 671, in build
2017-04-07T21:36:54.86399038Z     restore_sequentially, reshape)
2017-04-07T21:36:54.863992624Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 445, in _AddShardedRestoreOps
2017-04-07T21:36:54.863995148Z     name="restore_shard"))
2017-04-07T21:36:54.863997503Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 402, in _AddRestoreOps
2017-04-07T21:36:54.863999968Z     tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
2017-04-07T21:36:54.864002332Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 242, in restore_op
2017-04-07T21:36:54.864004812Z     [spec.tensor.dtype])[0])
2017-04-07T21:36:54.864007694Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 668, in restore_v2
2017-04-07T21:36:54.864010199Z     dtypes=dtypes, name=name)
2017-04-07T21:36:54.864012414Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
2017-04-07T21:36:54.86401491Z     op_def=op_def)
2017-04-07T21:36:54.864017117Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
2017-04-07T21:36:54.864028044Z     original_op=self._default_original_op, op_def=op_def)
2017-04-07T21:36:54.864030331Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
2017-04-07T21:36:54.864032899Z     self._traceback = _extract_stack()
2017-04-07T21:36:54.864035157Z 
2017-04-07T21:36:54.864037633Z InvalidArgumentError (see above for traceback): Cannot assign a device to node 'save/RestoreV2_102': Could not satisfy explicit device specification '/job:ps/task:1/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:master/replica:0/task:0/cpu:0, /job:ps/replica:0/task:0/cpu:0, /job:worker/replica:0/task:0/cpu:0
2017-04-07T21:36:54.864043209Z   [[Node: save/RestoreV2_102 = RestoreV2[dtypes=[DT_STRING], _device="/job:ps/task:1/device:CPU:0"](save/Const, save/RestoreV2_102/tensor_names, save/RestoreV2_102/shape_and_slices)]]
2017-04-07T21:36:54.864046084Z 
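
One detail visible in the log above: the printed TF_CONFIG spells the task key as "inxex" rather than "index". Whether that is what causes the device error is not confirmed here. Below is a minimal sketch of a sanity check that could be run in each pod, assuming the task entry is expected to carry "type" and "index" keys as in the usual TF_CONFIG layout. The function name `check_tf_config` is hypothetical and not part of TensorFlow.

```python
import json
import os


def check_tf_config(raw):
    """Return a list of problems found in a TF_CONFIG JSON string.

    Hypothetical helper: assumes the task entry should contain
    the keys "type" and "index".
    """
    problems = []
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})

    # The task type should name one of the jobs listed in the cluster.
    task_type = task.get("type")
    if task_type not in cluster:
        problems.append("task type %r is not a job in cluster" % (task_type,))

    # The task index should be present and within range for that job.
    if "index" not in task:
        others = sorted(k for k in task if k != "type")
        problems.append("task has no 'index' key (other keys: %s)" % (others,))
    elif task_type in cluster:
        if not 0 <= task["index"] < len(cluster[task_type]):
            problems.append("task index %r is out of range" % (task["index"],))
    return problems


if __name__ == "__main__":
    # Print any problems with the TF_CONFIG of the current process.
    for problem in check_tf_config(os.environ.get("TF_CONFIG", "{}")):
        print(problem)
```

Run against the TF_CONFIG string from the log, this reports the missing "index" key. It only inspects the JSON and does not contact the cluster, so it cannot tell whether the ps and worker pods are reachable.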

0 answers:

No answers