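I am trying to distribute a learning task using TensorFlow's Estimator and the train_and_evaluate function. In a single-node setup, training and evaluation work as expected.
(The definitions of the training task and the estimator should not be relevant, but in case anyone is interested, I have posted the details of the setup in a previous post.)
In distributed mode, however, I run into a problem: the process running the parameter server job does not seem to start properly and becomes unresponsive once it is running. This happens even with a minimal setup like the following: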
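import tensorflow as tf

def main():
    # Single-task cluster: one parameter server on localhost:2222.
    cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"]})
    server = tf.train.Server(cluster, job_name="ps", task_index=0)
    # Block for as long as the server is running.
    server.join()

if __name__ == '__main__':
    main()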
Executing the script does not raise any errors, and according to the log the server should have started at localhost:2222:
$ python custom_estimator.py
2018-07-26 18:38:15.347685: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:883] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-07-26 18:38:15.348352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties:
name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
totalMemory: 5.93GiB freeMemory: 5.42GiB
2018-07-26 18:38:15.348382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2018-07-26 18:38:15.538124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-26 18:38:15.538151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958] 0
2018-07-26 18:38:15.538175: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: N
2018-07-26 18:38:15.538373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:ps/replica:0/task:0/device:GPU:0 with 5193 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-07-26 18:38:15.587481: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-07-26 18:38:15.588413: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:369] Started server with target: grpc://localhost:2222
However, nothing appears to be bound to port 2222, and the process becomes unresponsive (i.e. it cannot be terminated with a KeyboardInterrupt):
$ netstat | grep 2222
# empty result
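As a cross-check that does not depend on netstat's output, the port can also be probed directly. The snippet below is only a minimal sketch using the standard socket module; the helper name is_port_open is made up for illustration and is not part of TensorFlow. As far as I know, plain netstat lists only connected sockets unless -l or -a is given, so an empty result by itself may not prove that nothing is listening.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be established.

    Hypothetical helper for illustration; it only checks that something
    accepts connections on the port, not that it is a TensorFlow server.
    """
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing is accepting connections on the given address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Address taken from the ClusterSpec in the question.
    print("localhost:2222 reachable:", is_port_open("localhost", 2222))
```

Running this from a second terminal while the parameter server script is up should show whether the gRPC server is actually accepting connections.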
Starting a worker or master node as described in my previous post does bind the port, and the resulting process can be terminated with a KeyboardInterrupt:
$ netstat | grep 2222
tcp6 0 0 simon:2222 localhost:34216 ESTABLISHED
tcp6 0 0 localhost:34216 simon:2222 ESTABLISHED
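Does anyone have a good guess as to what is going wrong here? Since I do not get any errors, it is hard to provide more information.
My system information is as follows:
TensorFlow 1.8 (compiled from source)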
CUDA 9.0
CUDNN 7.1.3
GPU: GeForce GTX 1060
Python 3.5.2