Question

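I am working on a project in TensorFlow with a large model (one that does not fit on a 4 GB graphics card). TL;DR: running part of the model on the CPU and the other part on the GPU takes about 4 seconds/batch. We are building a distributed system that should run on 2-3 computers, and we would like to distribute the work in a way that ultimately speeds up training.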

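In detail: we have run into many problems because of the lack of proper documentation (or other tutorials/guides) for distributed TensorFlow. The best distributed setup we were able to build runs 1 ps and 1 worker (the worker runs one part on the CPU and the second part on the GPU) and takes about 6 seconds/batch. We then tried another setup, 1 ps and 2 workers (each worker using a 4 GB graphics card), and the best time we reached was about 7 seconds/batch. The last setup was 2 ps on the same computer and 2 workers, with each worker running the whole model, so each worker trains on a different batch.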
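To make the three setups easier to compare, here is a minimal sketch of the cluster layouts as plain dictionaries, in the same `{"ps": [...], "worker": [...]}` shape that is passed to `tf.train.ClusterSpec` in the code below. The host names, ports, and the `--job_name`/`--task_index` flag spellings are assumptions made for illustration; the question does not give them.

```python
# Hypothetical host:port values, chosen only for illustration.
SETUPS = {
    "1 ps + 1 worker": {
        "ps": ["host-a:2222"],
        "worker": ["host-a:2223"],
    },
    "1 ps + 2 workers": {
        "ps": ["host-a:2222"],
        "worker": ["host-a:2223", "host-b:2223"],
    },
    "2 ps + 2 workers": {
        "ps": ["host-a:2222", "host-a:2224"],  # both ps on the same computer
        "worker": ["host-a:2223", "host-b:2223"],
    },
}


def task_list(cluster):
    """Return (job_name, task_index, address) for every process in the cluster."""
    return [
        (job, index, address)
        for job in sorted(cluster)
        for index, address in enumerate(cluster[job])
    ]


for name, cluster in SETUPS.items():
    print(name)
    for job, index, address in task_list(cluster):
        # One process has to be started per line printed here.
        print("  --job_name=%s --task_index=%d  (%s)" % (job, index, address))
```

Each printed line corresponds to one process that has to be started, so the last setup needs four processes in total.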

import tensorflow as tf  # TensorFlow 1.x API

# Describe the cluster: parameter servers ("ps") and workers.
args.cluster = tf.train.ClusterSpec({"ps": args.ps_hosts.split(","), "worker": args.worker_hosts.split(",")})
args.server = tf.train.Server(args.cluster, job_name=args.job_name, task_index=args.task_index)
if args.job_name == "ps":
    # Parameter servers only host variables, so they block here.
    args.server.join()
else:
    with tf.device(tf.train.replica_device_setter(worker_device="/job:worker/task:%d" % self.task_index, cluster=args.cluster)):

#Rest of code
.....
# Part where I divide the half on the cpu and half on the gpu:
with tf.device(tf.train.replica_device_setter(worker_device="/job:worker/task:%d/cpu:0" % self.task_index, cluster=args.cluster)):
    logger.write("First half gradient on  cpu")
    # Gradients for the second half of tvars; // keeps the slice index an integer.
    testGradient2 = tf.gradients(self.cost, tvars[len(tvars) // 2:])
with tf.device(tf.train.replica_device_setter(worker_device="/job:worker/task:%d/gpu:0" % self.task_index, cluster=args.cluster)):
    logger.write("Second half gradient on gpu")
    # Gradients for the first half of tvars.
    testGradient1 = tf.gradients(self.cost, tvars[:len(tvars) // 2])
with tf.device(tf.train.replica_device_setter(worker_device="/job:worker/task:%d" % self.task_index, cluster=args.cluster)):

    # List concatenation: first-half gradients followed by second-half gradients.
    testGradient = testGradient1 + testGradient2
....
#Supervisor part and configuration and session setup 
sv = tf.train.Supervisor(is_chief=(self.task_index == 0), init_op=tf.global_variables_initializer())
# Soft placement lets TensorFlow fall back to another device when an op cannot run on the requested one.
config = tf.ConfigProto(allow_soft_placement=True)
self.sess = sv.prepare_or_wait_for_session(args.server.target, config=config)
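The gradient split in the code relies on two details that are easy to get wrong, so here is a small TensorFlow-free sketch of the same list logic. `tf.gradients` returns one entry per variable, in the order the variables were passed, so concatenating the first-half result with the second-half result keeps the gradients aligned with `tvars`. Also, under Python 3 `len(tvars) / 2` is a float and cannot be used as a slice index, which is why `//` is used here. The variable names and the `"d_"` stand-ins are hypothetical.

```python
def split_halves(items):
    """Split a list into a first half and a second half using an integer midpoint."""
    mid = len(items) // 2  # integer division; len(items) / 2 is a float in Python 3
    return items[:mid], items[mid:]


# Hypothetical trainable-variable names standing in for tvars.
tvars = ["w1", "b1", "w2", "b2", "w3"]
first, second = split_halves(tvars)

# Stand-ins for tf.gradients(cost, first) and tf.gradients(cost, second):
# one result per input variable, in the same order.
grads_first = ["d_" + v for v in first]
grads_second = ["d_" + v for v in second]

# Concatenate in the same order as the split so gradients line up with tvars.
merged = grads_first + grads_second
pairs = list(zip(merged, tvars))
print(pairs)
```

The `(gradient, variable)` pairs produced this way have the shape that an optimizer's `apply_gradients` expects.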

As I said, this code runs at 6.5 seconds/batch with 2 ps (on the same computer) and 2 workers. Is there any optimization or key point that I am missing?
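When comparing these timings, it may help to separate per-worker step time from aggregate throughput. The sketch below assumes that 6.5 seconds/batch is the step time measured on each worker and that the two workers process different batches concurrently, as described for the last setup. If 6.5 seconds is instead the time per batch across the whole cluster, this arithmetic does not apply and the distributed run is simply slower than the single machine.

```python
def effective_seconds_per_batch(step_seconds, concurrent_workers):
    """Average wall-clock time per batch when workers train distinct batches in parallel."""
    return step_seconds / concurrent_workers


# Figures from the question; how they were measured is an assumption.
single_machine = effective_seconds_per_batch(4.0, 1)  # CPU + GPU split on one computer
distributed = effective_seconds_per_batch(6.5, 2)     # 2 ps + 2 workers

speedup = single_machine / distributed
print("single machine: %.2f s/batch" % single_machine)
print("distributed:    %.2f s/batch" % distributed)
print("speedup:        %.2fx" % speedup)
```

Under that assumption the cluster averages 3.25 seconds per batch, a modest gain over the 4 seconds of the single-machine run, while each individual step is still slower because of the communication with the parameter servers.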

Slow distributed system using TensorFlow

0 answers: