I implemented between-graph replication and asynchronous training following the example at https://www.tensorflow.org/deploy/distributed. Then I started two parameter servers and one worker as follows:
python dnn.py --ps_hosts=localhost:19000,localhost:18000 --worker_hosts=localhost:11000 --job_name=ps --task_index=0
python dnn.py --ps_hosts=localhost:19000,localhost:18000 --worker_hosts=localhost:11000 --job_name=ps --task_index=1
python dnn.py --ps_hosts=localhost:19000,localhost:18000 --worker_hosts=localhost:11000 --job_name=worker --task_index=0
I have three questions about distributed TensorFlow.
First, according to the TensorFlow timeline of my program (shown below), all compute and variable-update ops are executed on the ps node while the worker node stays idle. This confuses me, because I thought the compute ops should run on the worker node rather than on the ps node. Could someone help me with this?
(screenshot: distributed TensorFlow timeline)
Second, in my program tf.train.replica_device_setter assigns only CPUs to the parameter servers, yet the ops end up running on both CPU and GPU. What is the correct way to assign CPUs/GPUs to the servers?
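For reference, my understanding from the linked guide is that the device setter should point at the cluster jobs rather than at specific local devices, roughly like this (just a sketch of what I expected, not my actual code, which is further below):

    # Sketch: pin variables to the ps job's CPUs and compute ops to this worker's GPU.
    with tf.device(tf.train.replica_device_setter(
            ps_device="/job:ps/cpu:0",
            worker_device="/job:worker/task:%d/gpu:0" % FLAGS.task_index,
            cluster=cluster)):
        loss = ...  # model definition as in my code
        train_op = tf.train.AdagradOptimizer(0.01).minimize(loss)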
Last but not least, if I start two parameter servers and three workers, will the two servers hold identical copies of the parameters? I would also like to know whether the three workers update the gradients of the same graph. Could anyone tell me?
P.S. I assigned devices with tf.train.replica_device_setter. However, in the example (https://www.tensorflow.org/deploy/distributed) no device is assigned to the local server. In my case, if I do not assign a device to the local server, I get an error like the following:
"Operation was explicitly assigned to /job:ps/task:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1 ... ]. Make sure the device specification refers to a valid device."
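To double-check which devices the session actually sees, I think a quick check like the following should work (just a diagnostic sketch, not part of my training code):

    # Diagnostic sketch: list the devices visible through the cluster master.
    # When connected to server.target I would expect /job:ps/... and /job:worker/...
    # entries; a plain local session only shows /job:localhost/... devices.
    with tf.Session(server.target) as sess:
        for d in sess.list_devices():
            print(d.name)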
My code:
import sys

import tensorflow as tf
from tensorflow.python.client import timeline

# FLAGS (ps_hosts, worker_hosts, job_name, task_index, max_steps, log_dir)
# and the TimeLiner helper are defined elsewhere in my script.

def train():
    tl = TimeLiner()
    # parse the parameter server hosts
    ps_hosts = FLAGS.ps_hosts.split(",")
    # parse the worker hosts
    worker_hosts = FLAGS.worker_hosts.split(",")
    cluster = tf.train.ClusterSpec({"ps": ps_hosts,
                                    "worker": worker_hosts})
    graph_options = tf.GraphOptions(enable_bfloat16_sendrecv=True)
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.3, allow_growth=True)
    config = tf.ConfigProto(graph_options=graph_options, gpu_options=gpu_options,
                            log_device_placement=False, allow_soft_placement=False)
    # start a server for this task
    server = tf.train.Server(cluster,
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index,
                             config=config)
    if FLAGS.job_name == "ps":
        server.join()
    elif FLAGS.job_name == "worker":
        with tf.variable_scope(tf.get_variable_scope()):
            with tf.device(tf.train.replica_device_setter(
                    ps_device="/job:localhost/replica:0/task:%d/device:CPU:0" % FLAGS.task_index,
                    worker_device="/job:localhost/replica:0/task:%d/device:GPU:0" % FLAGS.task_index,
                    cluster=cluster)):
                loss = ...
                global_step = tf.train.get_or_create_global_step()
                train_op = tf.train.AdagradOptimizer(0.01).minimize(loss, global_step=global_step)
        sys.stdout.flush()
        init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
        summary_op = tf.summary.merge_all()
        hooks = [tf.train.StopAtStepHook(last_step=FLAGS.max_steps)]
        total_training = 0
        graph_options = tf.GraphOptions(enable_bfloat16_sendrecv=True)
        gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.9, allow_growth=True)
        config = tf.ConfigProto(graph_options=graph_options, gpu_options=gpu_options,
                                log_device_placement=False, allow_soft_placement=True)
        with tf.train.MonitoredTrainingSession(master=server.target,
                                               is_chief=(FLAGS.task_index == 0),
                                               checkpoint_dir=FLAGS.log_dir,
                                               log_step_count_steps=100000,
                                               hooks=hooks,
                                               config=config) as mon_sess:
            mon_sess.run(init_op)
            options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
            run_metadata = tf.RunMetadata()
            while not mon_sess.should_stop():
                # run a training step asynchronously and trace it
                [_, tot_loss, step, summary] = mon_sess.run([train_op, loss, global_step, summary_op],
                                                            options=options,
                                                            run_metadata=run_metadata)
                fetched_timeline = timeline.Timeline(run_metadata.step_stats)
                chrome_trace = fetched_timeline.generate_chrome_trace_format()
                tl.update_timeline(chrome_trace)
        tl.save('timeline.json')
Thanks in advance!
Answer 0 (score: 0)
You may have heard of setting the device with tf.device("/cpu:0") or something similar before you start the session. Have you tried that?
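For example, something along these lines when building the graph (just a sketch of what I mean):

    # Sketch: explicitly pin a variable onto the local CPU before the session starts.
    with tf.device("/cpu:0"):
        w = tf.get_variable("w", shape=[10, 10])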
Answer 1 (score: 0)
Could it be because you passed the worker's task index into the ps_device of replica_device_setter?
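In other words, something like this instead of the /job:localhost/... strings in your code (a sketch, assuming your ClusterSpec jobs are named "ps" and "worker"):

    # Sketch of the change I mean: refer to the cluster jobs, not to /job:localhost,
    # and use the worker's task index only for the worker_device.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=cluster)):
        loss = ...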