TensorFlow distributed training: worker process Supervisor initializes twice

Time: 2017-04-11 09:20:00

Tags: session tensorflow distributed managed supervisor

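I am trying to write a distributed TensorFlow program, but something looks strange to me. When I train with multiple distributed workers, the exported model's predictions are almost all around 0.00003 (max: 0.000031, min: 0.00000001, var: 0.00001, mean: 0.00003). The learning rate is 0.00125 and the model is logistic regression with a sigmoid activation as the last layer. When I use only one worker process, it works fine (prediction mean: 0.478, var: 0.41, max: 0.999, min: 0.00000001).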
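
Summary statistics like the ones quoted above can be computed from a list of exported predictions with standard-library Python. This is only a minimal sketch: `summarize_predictions` and the sample scores are hypothetical and not part of the original program, and it assumes "var" means the population variance.

```python
import statistics


def summarize_predictions(preds):
    """Return mean, population variance, max and min of a sequence of scores."""
    preds = list(preds)
    return {
        "mean": statistics.mean(preds),
        "var": statistics.pvariance(preds),  # assumption: population variance
        "max": max(preds),
        "min": min(preds),
    }


# Hypothetical scores, not taken from the actual model.
sample = [0.0, 0.25, 0.75, 1.0]
summary = summarize_predictions(sample)
print(summary)
```

Running the same helper on the single-worker export and on the multi-worker export would give directly comparable numbers.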

My code is structured as follows:

        with K.tf.device(
                K.tf.train.replica_device_setter(
                    ps_device="/job:servers",
                    worker_device="/job:%s/task:%d/cpu:0" % (
                        self.job_name + 's', self.train_id),
                    cluster=self.cluster)):
            self._build_model(self.job_name)
            self._build_data_pipeline(self.job_name, role=['train'])
            self._build_monitors(self.job_name)
            self._build_algo(self.job_name)
            self._build_summaries(self.job_name)
            self._build_save_model(self.job_name)
            self._build_train_conf(self.job_name)
        self._build_supervisor(self.job_name)

My Supervisor and managed_session code is as follows:

    logging.info('[is_chief] %s', is_chief)
    sv = K.tf.train.Supervisor(
        is_chief=is_chief,
        logdir=self.model_conf['export_dir'] + '/save',
        init_op=self.init_op,
        summary_op=self.summary_op,
        save_model_secs=0,
        save_summaries_secs=self.save_summaries_seconds,
        saver=self.saver,
        global_step=self.global_step)


    logging.info('sess target: %s', self.server.target)
    with sv.managed_session(
            master=self.server.target,
            config=threads_config,
            start_standard_services=False) as sess:
        sv.start_queue_runners(sess)

        # ... run the training steps here ...
        # Once the input data is exhausted, export the model manually.

But I found that the master session is started twice. What does this mean?

I 2017-04-11 08:59:38 framework.framework:3020 ============After build model==============
I 2017-04-11 08:59:38 framework.framework:3027 [is_chief] False
I 2017-04-11 08:59:38 framework.framework:2077 RUNTIME CHECKING ...
I 2017-04-11 08:59:38 framework.framework:2026 [CHECKER] ModelSize Checking ...
I 2017-04-11 08:59:38 framework.framework:2064 [CHECKER] GraphNotNone Checking ...
I 2017-04-11 08:59:38 framework.framework:2082 Runtime Checker result: True
I 2017-04-11 08:59:38 framework.framework:3094 sess target: b'grpc://localhost:15926'
I tensorflow/core/distributed_runtime/master_session.cc:928] Start master session 74d502743e860d45 with config:
use_per_session_threads: true

I tensorflow/core/distributed_runtime/master_session.cc:928] Start master session 12114aca8d44d336 with config:
use_per_session_threads: true
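
One way to confirm how many master sessions a process starts is to extract the session ids from its log. The sketch below is a standard-library helper written for illustration: `master_session_ids` is a hypothetical name, and the regular expression assumes the "Start master session <id>" wording shown in the log above.

```python
import re

# Matches the line printed by master_session.cc when a master session starts.
# Assumption: the id is a lowercase hexadecimal string, as in the log above.
_START_RE = re.compile(r"Start master session ([0-9a-f]+)")


def master_session_ids(log_text):
    """Return the master session ids in the order they appear in the log."""
    return _START_RE.findall(log_text)


sample_log = """\
I 2017-04-11 08:59:38 framework.framework:3094 sess target: b'grpc://localhost:15926'
I tensorflow/core/distributed_runtime/master_session.cc:928] Start master session 74d502743e860d45 with config:
use_per_session_threads: true

I tensorflow/core/distributed_runtime/master_session.cc:928] Start master session 12114aca8d44d336 with config:
use_per_session_threads: true
"""

ids = master_session_ids(sample_log)
print(len(ids), ids)
```

The two ids in the excerpt differ, so the log shows two distinct master sessions rather than one message printed twice.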

0 answers:

No answers