Question

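The current architecture of Distributed TensorFlow is based on a "parameter server"-like framework. With tf.device(tf.train.replica_device_setter()), all tensor "Variables" are placed on the "parameter servers" ("PS"), and all other tensor operations are assigned to the "workers". As I understand it, this causes a lot of communication overhead between the "workers" and the "PS". The reason is that no worker holds a local replica of the "Variables" stored on the "PS", so training involves extra communication: each worker retrieves the variables from the "PS", computes intermediate results, and sends them back to the "PS" to update those "Variables"...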
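To make the placement rule above concrete, here is a toy sketch in plain Python. It is not TensorFlow code and not how replica_device_setter is implemented; the function name place_ops and the op list are hypothetical. It only assumes the default behavior that variables are spread over the PS tasks in round-robin order while everything else stays on the worker.

```python
# Toy illustration only: mimics the placement rule, not TensorFlow internals.
from itertools import cycle


def place_ops(ops, num_ps, worker_device="/job:worker/task:0"):
    """Map each (name, kind) pair to a device string.

    Ops of kind "variable" go to the PS tasks in round-robin order;
    every other op stays on the given worker device.
    """
    ps_devices = cycle([f"/job:ps/task:{i}" for i in range(num_ps)])
    placement = {}
    for name, kind in ops:
        if kind == "variable":
            placement[name] = next(ps_devices)
        else:
            placement[name] = worker_device
    return placement


# Hypothetical graph: three variables and two compute ops.
ops = [("w1", "variable"), ("b1", "variable"), ("matmul", "compute"),
       ("w2", "variable"), ("loss", "compute")]
placement = place_ops(ops, num_ps=2)
for name, device in placement.items():
    print(name, "->", device)
```

Under this rule every read or update of w1, b1, or w2 from the worker crosses the network, which is the overhead the question is about.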

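Now suppose we do not follow that rule and adopt a "DistBelief"-style architecture instead: all shared parameters (such as the neural network weights) are still placed on the "PS", but each worker now also holds a replica of the shared "Variables" stored on the "PS". The benefit is that during training a worker does not have to communicate with the "PS" at every step; it just uses its local replica to compute the gradients, and communication between the "workers" and the "PS" happens only when the shared parameters (the neural network weights) on the "PS" are updated. Is there a way to do this in Distributed TensorFlow?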
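The scheme described above can be sketched as a small plain-Python simulation. This is not TensorFlow code: the classes ParameterServer and Worker, the sync_every parameter, and the one-weight linear model are all assumptions made for illustration. Each worker trains on its local replica, accumulates gradients, and only contacts the PS every sync_every steps.

```python
class ParameterServer:
    """Holds the shared weights and counts how often workers contact it."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.pulls = 0
        self.pushes = 0

    def pull(self):
        self.pulls += 1
        return list(self.weights)

    def push(self, grads, lr):
        self.pushes += 1
        self.weights = [w - lr * g for w, g in zip(self.weights, grads)]


class Worker:
    """Trains on a local replica and syncs with the PS every `sync_every` steps."""

    def __init__(self, ps, sync_every, lr):
        self.ps, self.sync_every, self.lr = ps, sync_every, lr
        self.local = ps.pull()                  # local replica of the shared weights
        self.accum = [0.0] * len(self.local)    # gradients not yet sent to the PS
        self.steps = 0

    def train_step(self, x, y):
        # Gradient of 0.5 * (w . x - y)^2, computed on the local replica only.
        pred = sum(w * xi for w, xi in zip(self.local, x))
        grads = [(pred - y) * xi for xi in x]
        self.local = [w - self.lr * g for w, g in zip(self.local, grads)]
        self.accum = [a + g for a, g in zip(self.accum, grads)]
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.ps.push(self.accum, self.lr)   # send accumulated gradients
            self.accum = [0.0] * len(self.local)
            self.local = self.ps.pull()         # refresh the local replica


ps = ParameterServer([0.0])
workers = [Worker(ps, sync_every=5, lr=0.05) for _ in range(2)]
for step in range(100):
    for worker in workers:
        x = [1.0 + (step % 2)]                  # toy data for the target y = 3 * x
        worker.train_step(x, 3.0 * x[0])

print("shared weight:", ps.weights[0])
print("pushes:", ps.pushes, "pulls:", ps.pulls)
```

With two workers, 100 steps each, and sync_every=5, the PS receives 40 pushes instead of the 200 it would see if every step were pushed. The trade-off is that each worker computes gradients on weights that may be a few steps stale.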

Answer 1

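Your problem is that, in addition to the global variables handled by the chief worker, the non-chief workers have their own set of local variables, which need to be initialized when a worker restarts.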

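Take a look at this example from abenmao. You can create session-run hooks that initialize either the local variables or the global variables, and then create the MonitoredTrainingSession with the right hook depending on whether the worker is the chief.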

  ```python
  ma_hook = ma.make_ma_run_hook()
  # And also, create the hook which handles initialization and queues.
  ma_replicas_hook = ma.make_session_run_hook(is_chief)
  ```
  In the training program, every worker will run the train_op as if not
  model_average or synchronized. Note that if you want to run other ops like
  test op, you should use common session instead of monitoredSession:
  ```python
  with training.MonitoredTrainingSession(
      master=workers[worker_id].target, is_chief=is_chief,
      hooks=[ma_replicas_hook, ma_hook]) as mon_sess:
    while not mon_sess.should_stop():
      mon_sess.run(training_op)
  ```
  ...

  def make_session_run_hook(self, is_chief, num_tokens=0):
    """Creates a hook to handle ReplicasHook ops such as initialization."""
    if self._ma_run_hook is False:
      raise ValueError("make_session_run_hook Should be "
                       "called after make_ma_run_hook.")

    if is_chief:
      return self._ReplicasHook(self.chief_init_op,
                                self.ready_for_local_init_op,
                                self.get_chief_queue_runner(),
                                self.get_init_tokens_op(num_tokens))

    return self._ReplicasHook(self.local_step_init_op,
                              self.ready_for_local_init_op, None, None)
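
The docstring quoted above mentions model averaging. As a rough illustration of that general idea (not the implementation behind the hooks above), here is a plain-Python sketch in which every worker trains its own local copy and the copies are averaged after a fixed number of steps. The function names, the interval_steps and rounds parameters, and the toy data are assumptions made for this sketch.

```python
def average_models(replicas):
    """Element-wise mean of the workers' weight lists."""
    n = len(replicas)
    return [sum(ws) / n for ws in zip(*replicas)]


def local_sgd_step(weights, x, y, lr):
    """One SGD step on 0.5 * (w . x - y)^2 using only local weights."""
    pred = sum(w * xi for w, xi in zip(weights, x))
    return [w - lr * (pred - y) * xi for w, xi in zip(weights, x)]


def train_with_model_averaging(shards, init, lr=0.05, interval_steps=4, rounds=50):
    """Each worker trains locally; replicas are averaged every `interval_steps`."""
    replicas = [list(init) for _ in shards]
    for _ in range(rounds):
        for k, shard in enumerate(shards):
            for step in range(interval_steps):
                x, y = shard[step % len(shard)]
                replicas[k] = local_sgd_step(replicas[k], x, y, lr)
        # The only point where workers exchange anything.
        avg = average_models(replicas)
        replicas = [list(avg) for _ in shards]
    return replicas[0]


# Hypothetical data shards, one per worker, all drawn from y = 3 * x.
shards = [
    [([1.0], 3.0), ([2.0], 6.0)],
    [([3.0], 9.0), ([0.5], 1.5)],
]
final = train_with_model_averaging(shards, init=[0.0])
print("averaged weight:", final[0])
```

Here the workers exchange weights rather than gradients, and only once per interval, so the local copies play the role of the per-worker local variables that the hooks above initialize.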

How to implement the "DistBelief" architecture in Distributed TensorFlow
