We are trying to run cifar10 on multiple machines with several parameter servers and workers. Distributed training itself works: before training starts, all workers wait for all parameter servers to initialize. However, when watching network traffic with iftop, it appears that all traffic flows to a single parameter server (the traffic observed on the other parameter servers is negligible).

The code that sets up the workers and parameter servers is as follows:
# Construct the cluster and start the server
ps_spec = FLAGS.ps_hosts.split(",")
worker_spec = FLAGS.worker_hosts.split(",")

# Get the number of workers and parameter servers.
num_workers = len(worker_spec)
num_ps = len(ps_spec)

cluster = tf.train.ClusterSpec({"ps": ps_spec, "worker": worker_spec})

with tf.device(tf.train.replica_device_setter(
    ps_tasks=num_ps,
    worker_device=worker_device,
    ps_device="/job:ps/cpu:0",
    cluster=cluster)):
The complete code for distributed cifar10 training can be found here: https://github.com/nanditav/15712-TensorFlow/blob/master/tensorflow/models/image/cifar10/cifar10_replica.py
Answer 0 (score: 2)
There is an example of using an alternative placement strategy in device_setter_test, i.e.
with tf.device(tf.train.replica_device_setter(
    cluster=self._cluster_spec,
    ps_strategy=tf.contrib.training.GreedyLoadBalancingStrategy(
        2, _load_fn))):
  u = tf.Variable(tf.zeros([2, 2]))
  v = tf.Variable(tf.zeros([2, 1]))
  w = tf.Variable(tf.zeros([2, 2]))
  x = tf.Variable(tf.zeros([1, 3]))
  a = v + w
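To illustrate what the greedy strategy does, here is a plain-Python sketch of the assignment rule (this is not the TensorFlow implementation, and the variable names and sizes below are made up): each variable is placed on whichever parameter-server task currently has the least accumulated load.

```python
# Sketch: greedy load balancing assigns each variable, in creation
# order, to the parameter-server task with the least load so far.
def greedy_assign(var_sizes, num_ps):
    """var_sizes: list of (name, bytes); returns {name: ps_index}."""
    loads = [0] * num_ps
    placement = {}
    for name, size in var_sizes:
        ps = loads.index(min(loads))  # least-loaded task wins
        placement[name] = ps
        loads[ps] += size
    return placement

# Hypothetical variable sizes in bytes.
variables = [("conv1/w", 7_200), ("conv2/w", 409_600),
             ("fc1/w", 9_437_184), ("fc2/w", 707_584)]
print(greedy_assign(variables, num_ps=2))
# → {'conv1/w': 0, 'conv2/w': 1, 'fc1/w': 0, 'fc2/w': 1}
```

Contrast this with the default round-robin placement, which ignores variable sizes entirely and can therefore leave one task holding most of the bytes.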
With GreedyLoadBalancingStrategy, each variable is assigned to the parameter server with the smallest load so far. One thing that can sometimes happen is that a single variable is huge; in that case you would need to split it up first, either manually or with PartitionedVariable.
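A plain-Python sketch of why splitting helps (made-up sizes, not the TensorFlow API): no placement strategy can balance a single variable that is larger than everything else combined, because a variable lives on exactly one task, but its shards can be spread out.

```python
# Sketch: a single huge variable pins almost all load on one PS task;
# splitting it into shards lets the shards spread across tasks.
def split_into_shards(name, size, num_shards):
    """Partition one variable into roughly equal shards."""
    base, rem = divmod(size, num_shards)
    return [(name + "/part_%d" % i, base + (1 if i < rem else 0))
            for i in range(num_shards)]

def balance(var_sizes, num_ps):
    """Greedy least-loaded assignment; returns per-task total bytes."""
    loads = [0] * num_ps
    for _, size in sorted(var_sizes, key=lambda v: -v[1]):
        loads[loads.index(min(loads))] += size
    return loads

# Hypothetical model: one 64 MB embedding dwarfs everything else.
variables = [("embedding/w", 64_000_000), ("fc/w", 2_000_000), ("fc/b", 4_000)]
print(balance(variables, num_ps=4))
# → [64000000, 2000000, 4000, 0]   (one task carries almost all traffic)

sharded = split_into_shards("embedding/w", 64_000_000, 4) + variables[1:]
print(balance(sharded, num_ps=4))
# → [18000000, 16004000, 16000000, 16000000]   (nearly even)
```

This mirrors the symptom in the question: if one variable dominates the model, iftop would show nearly all parameter traffic flowing to the single server that holds it.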