Question

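I am trying to distribute a training job using TensorFlow's Estimator and the train_and_evaluate function. In a single-node setup, training and evaluation already work as expected.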

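(The definitions of the training task and the estimator should not be relevant, but in case anyone is interested, I described the setup in detail in a previous post.)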

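In distributed mode, however, the process running the parameter server job does not start properly and becomes unresponsive once launched. This happens even with a setup as minimal as the following: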

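import tensorflow as tf

def main():
    # Single-task cluster: one parameter server on localhost:2222.
    cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"]})
    server = tf.train.Server(cluster, job_name="ps", task_index=0)
    # Block and serve requests from other tasks.
    server.join()

if __name__ == '__main__':
    main()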
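For context, when a parameter server is launched through train_and_evaluate rather than by hand, the cluster is described via the TF_CONFIG environment variable. The following is only a rough sketch of what such a configuration could look like for a single-ps cluster; the helper name make_tf_config and the chief address localhost:2223 are assumptions made for illustration and are not taken from my actual setup.

```python
import json
import os


def make_tf_config(task_type, task_index):
    """Build a TF_CONFIG JSON string for one task (hypothetical helper)."""
    cluster = {
        # Assumed address, chosen only for illustration.
        "chief": ["localhost:2223"],
        # Same address as in the minimal server script.
        "ps": ["localhost:2222"],
    }
    task = {"type": task_type, "index": task_index}
    return json.dumps({"cluster": cluster, "task": task})


# Each process would set its own task type and index before
# constructing the Estimator's RunConfig.
os.environ["TF_CONFIG"] = make_tf_config("ps", 0)
```

Each process in the cluster would use the same cluster dictionary and differ only in the task entry.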

Running the script raises no errors, and according to the log the server should have started at localhost:2222:

$ python custom_estimator.py
2018-07-26 18:38:15.347685: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:883] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-07-26 18:38:15.348352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties: 
name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
totalMemory: 5.93GiB freeMemory: 5.42GiB
2018-07-26 18:38:15.348382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2018-07-26 18:38:15.538124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-26 18:38:15.538151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958]      0 
2018-07-26 18:38:15.538175: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   N 
2018-07-26 18:38:15.538373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:ps/replica:0/task:0/device:GPU:0 with 5193 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-07-26 18:38:15.587481: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-07-26 18:38:15.588413: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:369] Started server with target: grpc://localhost:2222

However, nothing appears to be bound to port 2222, and the process becomes unresponsive (i.e. it cannot be terminated with a KeyboardInterrupt):

$ netstat | grep 2222
# empty result
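
As a cross-check that does not depend on netstat's default filtering (without -l or -a it lists only non-listening sockets), one could probe the port directly. This is a minimal sketch, not output from my system; the helper name port_is_open is hypothetical.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError if nothing accepts the
        # connection (e.g. connection refused or timeout).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage while the parameter server script is running:
# print(port_is_open("localhost", 2222))
```

A True result would mean something is accepting connections on the port even though the plain netstat listing shows nothing.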

Starting a worker or master node as described in my previous post does bind the port, and results in a process that can be terminated with a KeyboardInterrupt:

$ netstat | grep 2222
tcp6       0      0 simon:2222              localhost:34216         ESTABLISHED
tcp6       0      0 localhost:34216         simon:2222              ESTABLISHED

Does anyone have a good guess as to what is going wrong here? Since I am not getting any errors, it is hard to provide more information.

My system information is as follows:

TensorFlow 1.8 (compiled from source)
CUDA 9.0
CUDNN 7.1.3
GPU: GeForce GTX 1060
Python 3.5.2

TensorFlow parameter server fails to start
