I'm trying to run a distributed TensorFlow script across three machines: a local machine running the parameter server, and two remote machines that should run the worker jobs. I'm following this example from the TensorFlow documentation, passing an IP address and a unique port number for each worker job, and setting the `protocol` option of `tf.train.Server` to `'grpc'`. However, when I run the script, all three processes start on my localhost, and no jobs start on the remote machines. Am I missing a step?
My (abridged) code:
    import tensorflow as tf

    # Define flags
    tf.app.flags.DEFINE_string("ps_hosts", "localhost:2223",
                               "comma-separated list of hostname:port pairs")
    tf.app.flags.DEFINE_string("worker_hosts",
                               "server1.com:2224,server2.com:2225",
                               "comma-separated list of hostname:port pairs")
    tf.app.flags.DEFINE_string("job_name", "worker", "One of 'ps', 'worker'")
    tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")
    FLAGS = tf.app.flags.FLAGS

    ps_hosts = FLAGS.ps_hosts.split(",")
    worker_hosts = FLAGS.worker_hosts.split(",")
    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
    server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index, protocol='grpc')

    if FLAGS.job_name == "ps":
        server.join()
    elif FLAGS.job_name == "worker":
        # Between-graph replication
        with tf.device(tf.train.replica_device_setter(
                cluster=cluster,
                worker_device="/job:worker/task:{}".format(FLAGS.task_index))):
            # Create model...

        sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                                 logdir="./checkpoint",
                                 init_op=init_op,
                                 summary_op=summary,
                                 saver=saver,
                                 global_step=global_step,
                                 save_model_secs=600)
        with sv.managed_session(server.target,
                                config=config_proto) as sess:
            # Train model...
This code causes two problems.

From worker0:
2018-04-09 23:48:39.749679: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
From worker1:
2018-04-09 23:49:30.439166: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
Setting a `device_filter` gets rid of the earlier problem, but all jobs still start on my local machine instead of on the remote servers. How do I get the two worker jobs to run on the remote servers?
Answer 0 (score: 0):
My understanding is that you have to run this script on every host in the cluster: with the `--job_name=ps` argument on the parameter server, and with `--job_name=worker --task_index=[0|1]` on the workers.
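Concretely, the launches might look like the following sketch (the script name `trainer.py` is an assumption; the hostnames come from the flag defaults in the question):

```shell
# On the local machine (parameter server); trainer.py is a hypothetical name
python trainer.py --job_name=ps --task_index=0

# On server1.com (worker 0)
python trainer.py --job_name=worker --task_index=0

# On server2.com (worker 1)
python trainer.py --job_name=worker --task_index=1
```

Each process creates its own `tf.train.Server`; the entry in the `ClusterSpec` matching that process's `job_name` and `task_index` is the port it binds locally, while the other entries are only addresses it connects to. Running all three commands on one machine is exactly what makes every job start on localhost.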