Question

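The official way to execute multiple tf.Session() in parallel is to use tf.train.Server, as described in Distributed TensorFlow. On the other hand, the following works for Keras and, according to Keras + Tensorflow and Multiprocessing in Python, can presumably be adapted to TensorFlow without using tf.train.Server.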

import multiprocessing

def _training_worker(train_params):
    # Keras is imported inside the child process rather than in the parent.
    import keras
    model = obtain_model(train_params)
    model.fit(train_params)
    send_message_to_main_process(...)

def train_new_model(train_params):
    # args must be a tuple, hence the trailing comma.
    training_process = multiprocessing.Process(target=_training_worker, args=(train_params,))
    training_process.start()
    get_message_from_training_process(...)
    training_process.join()

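Is the first approach faster than the second? My code is written the second way, and because of the nature of my algorithm (AlphaZero), a single GPU is supposed to run many processes, each performing prediction on a tiny minibatch.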

Answer 1

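tf.train.Server is designed for distributed computation within a cluster, when there is a need to communicate between different nodes. This is especially useful when training is distributed across multiple machines or, in some cases, across multiple GPUs on a single machine. From the documentation: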

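An in-process TensorFlow server, for use in distributed training.

A tf.train.Server instance encapsulates a set of devices and a tf.Session target that can participate in distributed training. A server belongs to a cluster (specified by a tf.train.ClusterSpec), and corresponds to a particular task in a named job. The server can communicate with any other server in the same cluster.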

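Spawning multiple processes with multiprocessing.Process isn't a cluster in the TensorFlow sense, because the child processes don't interact with each other. This method is easier to set up, but it's limited to a single machine. Since you say you have just one machine, this might not be a strong argument, but if you ever plan to scale to a cluster of machines, you'll have to redesign the whole approach.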
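To make the "children don't interact" point concrete, here is a minimal sketch using only the standard library. The names fake_fit, training_worker and train_models are hypothetical, and fake_fit is just a stand-in for building and fitting a real model. The only communication channel is a queue from each child back to the parent; the workers never exchange anything with each other.

```python
import multiprocessing


def fake_fit(train_params):
    """Hypothetical stand-in for building and fitting a model."""
    return sum(train_params["data"]) * train_params["scale"]


def training_worker(train_params, results):
    # In real code the keras/tensorflow import would go here, inside the child.
    results.put((train_params["name"], fake_fit(train_params)))


def train_models(all_params):
    """Run one independent process per parameter set and collect the results."""
    results = multiprocessing.Queue()
    procs = [
        multiprocessing.Process(target=training_worker, args=(params, results))
        for params in all_params
    ]
    for proc in procs:
        proc.start()
    # Read one message per child before joining, so no child blocks on a full pipe.
    collected = dict(results.get() for _ in procs)
    for proc in procs:
        proc.join()
    return collected
```

On platforms that start children by re-importing the main module, train_models should be called from under an `if __name__ == "__main__":` guard.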

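tf.train.Server is thus a more universal and scalable solution. Besides, it allows organizing complex training with some non-trivial communication, such as asynchronous gradient updates. Whether training is faster depends on the task; I don't think there will be a significant difference on a single shared GPU.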
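As a rough illustration of what "asynchronous gradient updates" means, here is a toy sketch in plain Python threads. It is not how TensorFlow implements it, and ParameterStore, worker and train_async are hypothetical names. Each worker reads the shared weight, computes a gradient on its own, and applies it without waiting for the other workers, so the value it read may already be stale when its update lands.

```python
import threading


class ParameterStore:
    """Toy stand-in for a parameter server holding a single scalar weight."""

    def __init__(self, w=0.0):
        self.w = w
        self._lock = threading.Lock()

    def read(self):
        return self.w

    def apply_gradient(self, grad, lr):
        # Only the update itself is serialized; workers never wait for each other.
        with self._lock:
            self.w -= lr * grad


def worker(store, target, steps, lr):
    for _ in range(steps):
        w = store.read()            # may be stale by the time the update is applied
        grad = 2.0 * (w - target)   # gradient of the loss (w - target) ** 2
        store.apply_gradient(grad, lr)


def train_async(target=3.0, n_workers=4, steps=200, lr=0.05):
    store = ParameterStore()
    threads = [
        threading.Thread(target=worker, args=(store, target, steps, lr))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.w
```

With a small learning rate the shared weight still converges to the target even though updates interleave in an arbitrary order.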

For reference, here's what the code with a server looks like (a between-graph replication example):

import sys

import tensorflow as tf

# specify the cluster's architecture
cluster = tf.train.ClusterSpec({
  'ps': ['192.168.1.1:1111'],
  'worker': ['192.168.1.2:1111',
             '192.168.1.3:1111']
})

# parse command-line to specify machine
job_type = sys.argv[1]       # job type: "worker" or "ps"
task_idx = int(sys.argv[2])  # index of this task in the worker or ps list of the ClusterSpec

# create TensorFlow Server. This is how the machines communicate.
server = tf.train.Server(cluster, job_name=job_type, task_index=task_idx)

# parameter server is updated by remote clients.
# will not proceed beyond this if statement.
if job_type == 'ps':
  server.join()
else:
  # workers only
  with tf.device(tf.train.replica_device_setter(worker_device='/job:worker/task:%d' % task_idx,
                                                cluster=cluster)):
    # build your model here as if you only were using a single machine
    pass

  with tf.Session(server.target):
    # train your model here
    pass
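Every machine runs the same script with a different job type and task index, and the task index is expected to be an integer while sys.argv yields strings. A small validation helper can catch mistakes before the server is created. This is only a sketch: parse_task and the script name train.py are hypothetical, and CLUSTER is a plain dict mirroring the one passed to tf.train.ClusterSpec above.

```python
CLUSTER = {
    'ps': ['192.168.1.1:1111'],
    'worker': ['192.168.1.2:1111', '192.168.1.3:1111'],
}


def parse_task(argv, cluster):
    """Return (job_type, task_idx) from argv, validated against the cluster dict."""
    if len(argv) != 3:
        raise ValueError('usage: %s <job_type> <task_index>' % argv[0])
    job_type = argv[1]
    if job_type not in cluster:
        raise ValueError('unknown job type %r, expected one of %s'
                         % (job_type, sorted(cluster)))
    task_idx = int(argv[2])  # raises ValueError for non-numeric input
    if not 0 <= task_idx < len(cluster[job_type]):
        raise ValueError('task index %d out of range for job %r' % (task_idx, job_type))
    return job_type, task_idx
```

With this layout, the second worker machine would be started with something like `python train.py worker 1` and the parameter server with `python train.py ps 0`.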

Why use tf.train.Server to execute multiple tf.Session() in parallel?
