I am trying to write a distributed TensorFlow program, but something strange is happening. When I train with multiple distributed workers, the exported model's predictions are almost all around 0.00003 (max: 0.000031, min: 0.00000001, variance: 0.00001, mean: 0.00003; learning rate: 0.00125; model: logistic regression with a sigmoid activation as the last layer). When I use only one worker process, it works fine (prediction mean: 0.478, var: 0.41, max: 0.999, min: 0.00000001).
My code is structured as follows:
with K.tf.device(
        K.tf.train.replica_device_setter(
            ps_device="/job:servers",
            worker_device="/job:%s/task:%d/cpu:0" % (
                self.job_name + 's', self.train_id),
            cluster=self.cluster)):
    self._build_model(self.job_name)
    self._build_data_pipeline(self.job_name, role=['train'])
    self._build_monitors(self.job_name)
    self._build_algo(self.job_name)
    self._build_summaries(self.job_name)
    self._build_save_model(self.job_name)
    self._build_train_conf(self.job_name)
    self._build_supervisor(self.job_name)
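For context, here is a minimal, self-contained sketch of the cluster setup this structure assumes (the hosts, ports, and the tiny logistic-regression graph are placeholders, not my real code; job names follow the same '<job>s' convention as above):

import tensorflow as tf

# Hypothetical cluster: one parameter-server job ('servers') and one worker job ('workers').
cluster = tf.train.ClusterSpec({
    'servers': ['ps0.example.com:2222'],
    'workers': ['worker0.example.com:2222', 'worker1.example.com:2222'],
})

# Each process starts an in-process server for its own job/task.
server = tf.train.Server(cluster, job_name='workers', task_index=0)

# replica_device_setter pins variables to the ps job and ops to the local worker.
with tf.device(tf.train.replica_device_setter(
        ps_device='/job:servers',
        worker_device='/job:workers/task:0/cpu:0',
        cluster=cluster)):
    x = tf.placeholder(tf.float32, [None, 10])
    y = tf.placeholder(tf.float32, [None, 1])
    w = tf.Variable(tf.zeros([10, 1]))            # placed on /job:servers
    b = tf.Variable(tf.zeros([1]))
    pred = tf.sigmoid(tf.matmul(x, w) + b)        # logistic-regression output in (0, 1)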
My Supervisor and managed_session code is as follows:
logging.info('[is_chief] %s', is_chief)
sv = K.tf.train.Supervisor(
    is_chief=is_chief,
    logdir=self.model_conf['export_dir'] + '/save',
    init_op=self.init_op,
    summary_op=self.summary_op,
    save_model_secs=0,
    save_summaries_secs=self.save_summaries_seconds,
    saver=self.saver,
    global_step=self.global_step)
logging.info('sess target: %s', self.server.target)
with sv.managed_session(
        master=self.server.target,
        config=threads_config,
        start_standard_services=False) as sess:
    sv.start_queue_runners(sess)
    # ... run the training steps ...
    # ... when the input data is exhausted, export the model manually ...
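The two trailing comments stand in for the real loop; roughly, it looks like the sketch below (self.train_op and self.loss_op are hypothetical names for my training and loss tensors, and the checkpoint path is a placeholder):

    step = 0
    try:
        while not sv.should_stop():
            # One optimizer step; train_op also increments global_step.
            _, loss, step = sess.run(
                [self.train_op, self.loss_op, self.global_step])
            if step % 100 == 0:
                logging.info('step %d, loss %f', step, loss)
    except K.tf.errors.OutOfRangeError:
        # Input queues are exhausted; since save_model_secs=0 the Supervisor
        # never checkpoints on its own, so export manually from the chief.
        if is_chief:
            self.saver.save(
                sess,
                self.model_conf['export_dir'] + '/save/model.ckpt',
                global_step=step)
    finally:
        sv.request_stop()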
But I noticed that the master session is started twice. What does this mean?
I 2017-04-11 08:59:38 framework.framework:3020 ============After build model==============
I 2017-04-11 08:59:38 framework.framework:3027 [is_chief] False
I 2017-04-11 08:59:38 framework.framework:2077 RUNTIME CHECKING ...
I 2017-04-11 08:59:38 framework.framework:2026 [CHECKER] ModelSize Checking ...
I 2017-04-11 08:59:38 framework.framework:2064 [CHECKER] GraphNotNone Checking ...
I 2017-04-11 08:59:38 framework.framework:2082 Runtime Checker result: True
I 2017-04-11 08:59:38 framework.framework:3094 sess target: b'grpc://localhost:15926'
I tensorflow/core/distributed_runtime/master_session.cc:928] Start master session 74d502743e860d45 with config:
use_per_session_threads: true
I tensorflow/core/distributed_runtime/master_session.cc:928] Start master session 12114aca8d44d336 with config:
use_per_session_threads: true