Question

I understand how distributed training works in TensorFlow, and I have implemented my own script for it.

What I want to do now is add the ability to assign some workers the task of evaluating the model asynchronously.

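Say we have 6 workers: I would like to use 4 of them for asynchronous training, 1 to evaluate the model periodically, and the remaining one to run inference with it periodically.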

My intuition for achieving this is to do the following:

....
elif FLAGS.job_name == "worker":
    num_workers = len(cluster_dict["worker"])

    # All workers except the last two train the model.
    if FLAGS.task_index < num_workers - 2:
        logging.info("Training worker started")
        ...
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster,
                ps_tasks=len(cluster_dict["ps"])
        )):
            train_model = Model(
                mode=tf.contrib.learn.ModeKeys.TRAIN
            )
            with tf.train.MonitoredTrainingSession(
                    is_chief=(FLAGS.task_index == 0),
                    master=server.target,
                    checkpoint_dir=ckpt_dir,
                    config=config_proto,
                    hooks=hooks
            ) as mon_sess:
                while not mon_sess.should_stop():
                    res = train_model.train(...)
                    ...

    # The second-to-last worker evaluates the model.
    elif FLAGS.task_index == num_workers - 2:
        logging.info("Evaluation worker started")
        ...
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster,
                ps_tasks=len(cluster_dict["ps"])
        )):
            eval_model = Model(
                mode=tf.contrib.learn.ModeKeys.EVAL
            )
            ...

    # The last worker runs inference.
    elif FLAGS.task_index == num_workers - 1:
        logging.info("Inference worker started")
        ...
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster,
                ps_tasks=len(cluster_dict["ps"])
        )):
            infer_model = Model(
                mode=tf.contrib.learn.ModeKeys.INFER
            )
            ...
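
To make the intended split concrete, here is a minimal pure-Python sketch of the role assignment, assuming the last two workers in the cluster are reserved for evaluation and inference. The helper name `worker_role` and the role strings are made up for illustration and are not part of TensorFlow.

```python
def worker_role(task_index, num_workers):
    """Map a worker task index to "train", "eval", or "infer".

    Assumption: the second-to-last worker evaluates, the last worker
    runs inference, and every earlier worker trains.
    """
    if num_workers < 3:
        raise ValueError("need at least 3 workers: train, eval, infer")
    if not 0 <= task_index < num_workers:
        raise ValueError("task_index out of range")
    if task_index < num_workers - 2:
        return "train"
    if task_index == num_workers - 2:
        return "eval"
    return "infer"


if __name__ == "__main__":
    # With 6 workers, tasks 0-3 train, task 4 evaluates, task 5 infers.
    for index in range(6):
        print(index, worker_role(index, 6))
```

In the script above this would be called as `worker_role(FLAGS.task_index, len(cluster_dict["worker"]))`, with the result used to pick the branch.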

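Now, what about the sessions for evaluation and inference? For training I can use tf.train.MonitoredTrainingSession, but for evaluation and inference I don't see an equally convenient solution; the only option I can see is a plain tf.Session.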

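As for the actual evaluation and inference loops, I was thinking of a while loop in which the worker periodically calls eval_model.eval(...) or infer_model.infer(...). But that means evaluation is triggered by elapsed time rather than by global_step: the only meaning I can give to "periodically" is sleeping between calls.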
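
One way to tie "periodically" to global_step instead of wall-clock time might be to poll for the newest checkpoint and evaluate only once its step has advanced far enough. The sketch below is framework-free and rests on two assumptions: checkpoint paths end in `-<global_step>` (as in `model.ckpt-1200`), and the caller supplies `latest_checkpoint_fn` and `eval_fn`. Both of those names, and `run_periodic_eval` itself, are hypothetical.

```python
import re
import time


def step_from_checkpoint_path(path):
    """Return the global step encoded in a path such as 'model.ckpt-1200'."""
    match = re.search(r"-(\d+)$", path)
    if match is None:
        raise ValueError("no global step suffix in %r" % path)
    return int(match.group(1))


def run_periodic_eval(latest_checkpoint_fn, eval_fn, every_n_steps,
                      poll_secs=30.0, max_evals=None, sleep_fn=time.sleep):
    """Call eval_fn(path, step) whenever training has advanced far enough.

    latest_checkpoint_fn: returns the newest checkpoint path, or None.
    eval_fn: hypothetical callback that restores `path` and evaluates it.
    every_n_steps: minimum global_step gap between two evaluations.
    Returns the list of global steps that were evaluated.
    """
    if every_n_steps < 1:
        raise ValueError("every_n_steps must be >= 1")
    evaluated_steps = []
    last_step = None
    while max_evals is None or len(evaluated_steps) < max_evals:
        path = latest_checkpoint_fn()
        if path is not None:
            step = step_from_checkpoint_path(path)
            if last_step is None or step - last_step >= every_n_steps:
                eval_fn(path, step)
                evaluated_steps.append(step)
                last_step = step
                continue
        # No checkpoint yet, or not enough new steps: wait, then poll again.
        sleep_fn(poll_secs)
    return evaluated_steps
```

In a TF 1.x script, `latest_checkpoint_fn` could be `lambda: tf.train.latest_checkpoint(ckpt_dir)`, and `eval_fn` could restore the checkpoint into a plain tf.Session with a tf.train.Saver before calling eval_model.eval(...). Sleeping is still used, but only as the polling interval; the decision to evaluate depends on global_step.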

Is this the right way to run training, evaluation and inference asynchronously in a distributed TensorFlow setup?

How to perform evaluation and inference in TensorFlow distributed training?

0 answers: