TensorFlow parameter server does not start

Time: 2018-07-26 17:38:47

Tags: python tensorflow distributed-computing tensorflow-estimator

I am trying to distribute a training task using TensorFlow's Estimator train_and_evaluate function. In a single-node setup, training and evaluation already work as expected.

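(The definitions of the training task and the estimator should not be relevant, but in case anyone is interested, I posted the details of the setup in a previous post.)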
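For readers unfamiliar with the distributed mode: train_and_evaluate learns each process's role from the TF_CONFIG environment variable, a JSON object with a "cluster" and a "task" entry. The following is only a rough sketch of what that could look like for the parameter server process. The helper name make_tf_config and the chief and worker addresses are hypothetical and are not taken from the original setup; only the ps address localhost:2222 appears in this question.

```python
import json
import os


def make_tf_config(cluster, task_type, task_index):
    """Serialize a cluster layout and this process's role as a TF_CONFIG string.

    Hypothetical helper for illustration; not part of TensorFlow.
    """
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# Assumed layout: the chief and worker ports are made up for this sketch.
cluster = {
    "chief": ["localhost:2220"],
    "worker": ["localhost:2221"],
    "ps": ["localhost:2222"],
}

# Mark this process as parameter server 0 before the Estimator is created.
os.environ["TF_CONFIG"] = make_tf_config(cluster, "ps", 0)
```

Each process in the cluster would set the same "cluster" entry and a different "task" entry.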

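However, in distributed mode I run into a problem: the process running the parameter server job does not start properly and becomes unresponsive once it has been launched. This happens even with a minimal setup like the following: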

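import tensorflow as tf


def main():
    # Single-task cluster: one parameter server on localhost:2222.
    cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"]})
    server = tf.train.Server(cluster, job_name="ps", task_index=0)
    # Block so the server keeps serving.
    server.join()


if __name__ == '__main__':
    main()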

Running the script raises no errors, and according to the log the server should have started at localhost:2222:

$ python custom_estimator.py
2018-07-26 18:38:15.347685: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:883] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-07-26 18:38:15.348352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties: 
name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
totalMemory: 5.93GiB freeMemory: 5.42GiB
2018-07-26 18:38:15.348382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2018-07-26 18:38:15.538124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-26 18:38:15.538151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958]      0 
2018-07-26 18:38:15.538175: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   N 
2018-07-26 18:38:15.538373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:ps/replica:0/task:0/device:GPU:0 with 5193 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-07-26 18:38:15.587481: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-07-26 18:38:15.588413: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:369] Started server with target: grpc://localhost:2222

However, nothing appears to be bound to port 2222, and the process becomes unresponsive (i.e. it cannot be terminated with a KeyboardInterrupt):

$ netstat | grep 2222
# empty result
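One caveat about this check: on Linux, netstat without -l or -a omits listening sockets, so an empty result may not prove that nothing is bound. As a rough cross-check, a short script could simply try to open a TCP connection to the port. This is a minimal sketch; the helper name is_port_open is hypothetical, and a successful connection only shows that something is accepting connections there, not that it is the TensorFlow server.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        # create_connection raises OSError if the connection is refused
        # or does not complete within the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling is_port_open("localhost", 2222) while the parameter server script is running would then indicate whether anything is accepting connections on that port.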

Starting a worker or master node as described in my previous post does bind the port and starts a process that can be terminated with a KeyboardInterrupt.

$ netstat | grep 2222
tcp6       0      0 simon:2222              localhost:34216         ESTABLISHED
tcp6       0      0 localhost:34216         simon:2222              ESTABLISHED

Does anyone have a good guess as to what is going wrong here? Since I am not getting any errors, it is hard to provide more information.

My system information is as follows:

  • TensorFlow 1.8 (compiled from source)

  • CUDA 9.0

  • CUDNN 7.1.3

  • GPU: GeForce GTX 1060

  • Python 3.5.2

0 answers:

No answers