I am trying to run distributed TensorFlow, but I have run into some trouble. First, the code processes 35 images/sec on a single GPU (GTX TITAN X) on a single host (Intel E5-2630 v3), but the distributed version only reaches 26 images/sec per process with 4 GPUs on a single host, and 8.5 images/sec on 2 hosts with 4 GPUs each. So the performance of the distributed version seems quite poor. Can someone give me some advice on why I am getting such poor results? Second, I would like to know whether more ps servers can improve performance, so I tried using 2 ps servers, but the program blocked with the log message:
CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
I run the program on a slurm system, so I use the Python multiprocessing module to start the ps server.
import multiprocessing
import os

import tensorflow as tf

# FLAGS, expand_hostlist, ImagenetData and inception_distributed_train are
# defined elsewhere in the project.

def get_slurm_env():
    node_list = expand_hostlist(os.environ['SLURM_NODELIST'])
    node_id = int(os.environ['SLURM_NODEID'])
    tasks_per_node = int(os.environ['SLURM_NTASKS_PER_NODE'])
    # It is difficult to assign the port and GPU id in the slurm env.
    # The GPU assigned on different hosts is not always the same, and you never
    # know which GPU is assigned on another host.
    # Different slurm jobs may run on the same machine, so the port numbers may
    # conflict as well.
    task_id = int(os.environ['SLURM_PROCID'])
    task_num = int(os.environ['SLURM_NTASKS'])
    visible_gpu_ids = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
    visible_gpu_ids = [int(gpu) for gpu in visible_gpu_ids]
    worker_port_list = [FLAGS.worker_port_start + incr for incr in range(len(visible_gpu_ids))]
    FLAGS.worker_hosts = ["%s:%d" % (name, port) for name in node_list for port in worker_port_list]
    assert len(FLAGS.worker_hosts) == task_num, \
        'Job count is not equal %d : %d' % (len(FLAGS.worker_hosts), task_num)
    FLAGS.worker_hosts = ','.join(FLAGS.worker_hosts)
    FLAGS.ps_hosts = ["%s:%d" % (name, FLAGS.ps_port_start) for name in node_list]
    FLAGS.ps_hosts = ','.join(FLAGS.ps_hosts)
    FLAGS.job_name = "worker"
    FLAGS.task_id = task_id
    os.environ['CUDA_VISIBLE_DEVICES'] = str(visible_gpu_ids[task_id % tasks_per_node])
def ps_runner(cluster, task_id):
    tf.logging.info('Setup ps process, id: %d' % FLAGS.task_id)
    os.environ['CUDA_VISIBLE_DEVICES'] = ""
    server = tf.train.Server(cluster, job_name="ps", task_index=task_id)
    server.join()
    tf.logging.info('Stop ps process, id: %d' % FLAGS.task_id)
def main(unused_args):
    get_slurm_env()
    # Extract all the hostnames for the ps and worker jobs to construct the
    # cluster spec.
    ps_hosts = FLAGS.ps_hosts.split(',')
    worker_hosts = FLAGS.worker_hosts.split(',')
    tf.logging.info('PS hosts are: %s' % ps_hosts)
    tf.logging.info('Worker hosts are: %s' % worker_hosts)
    cluster_spec = tf.train.ClusterSpec({'ps': ps_hosts,
                                         'worker': worker_hosts})
    if FLAGS.task_id == 0:
        p = multiprocessing.Process(target=ps_runner,
                                    args=({'ps': ps_hosts, 'worker': worker_hosts}, 0))
        p.start()
    server = tf.train.Server(
        {'ps': ps_hosts,
         'worker': worker_hosts},
        job_name=FLAGS.job_name,
        task_index=FLAGS.task_id)
    # `worker` jobs will actually do the work.
    dataset = ImagenetData(subset=FLAGS.subset)
    assert dataset.data_files()
    # Only the chief checks for or creates train_dir.
    if FLAGS.task_id == 0:
        if not tf.gfile.Exists(FLAGS.train_dir):
            tf.gfile.MakeDirs(FLAGS.train_dir)
    tf.logging.info('Setup worker process, id: %d' % FLAGS.task_id)
    inception_distributed_train.train(server.target, dataset, cluster_spec)
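Note that with two entries in ps_hosts, the code above only ever spawns ps task 0 (from worker task 0), so /job:ps/task:1 never starts, which matches the CreateSession message above. A minimal sketch of one possible fix, assuming get_slurm_env were changed to also expose node_id and tasks_per_node (e.g. through FLAGS, which is not part of the original code), is to let the first worker task on each node spawn that node's ps server:

    # Sketch only: start one ps task per node, spawned by the first worker
    # task on that node, so every host listed in ps_hosts runs a server.
    # FLAGS.node_id and FLAGS.tasks_per_node are assumed to be filled in by
    # a modified get_slurm_env(); they do not exist in the original code.
    if FLAGS.task_id % FLAGS.tasks_per_node == 0:
        p = multiprocessing.Process(target=ps_runner,
                                    args=({'ps': ps_hosts, 'worker': worker_hosts},
                                          FLAGS.node_id))
        p.start()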
Answer 0 (score: 0)
Would you be willing to consider an MPI-based solution, which does not require distributed-memory-specific changes to your distributed TensorFlow code? We recently developed a user-transparent distributed TensorFlow version using MaTEx. https://github.com/matex-org/matex
If you run into any problems, we will be able to help you.
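For a rough picture of what MPI-based data parallelism looks like, here is a generic mpi4py-style sketch (this is not the MaTEx API; load_shard and compute_gradients are hypothetical placeholders): each rank trains on its own shard of the data and gradients are averaged with an allreduce before every update, so no ps servers or ClusterSpec are needed.

# Generic mpi4py sketch of MPI data parallelism; the actual MaTEx interface
# may differ. load_shard() and compute_gradients() are hypothetical helpers.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # id of this process
size = comm.Get_size()   # total number of processes (e.g. one per GPU)

data = load_shard(rank, size)      # each rank reads only its shard of the data
grads = compute_gradients(data)    # local gradients as a NumPy array

# Sum the gradients across all ranks, then average, before applying the update.
avg_grads = np.empty_like(grads)
comm.Allreduce(grads, avg_grads, op=MPI.SUM)
avg_grads /= size

Such a script would be launched with mpirun (one process per GPU or host), which replaces the manual ps/worker bookkeeping shown in the question.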