Question

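Running AlexNet with distributed TensorFlow does not scale in terms of images/sec. I am using the AlexNet model from alexnet_benchmark.py, with a few modifications for distributed training, on an EC2 G2 (NVIDIA GRID K520) instance. With the distributed code it processes 5 6 images/sec on a single GPU on a single host, whereas without the distributed code it processes 112 images/sec on a single GPU. This seems very odd. Could you review the code below and point out what might be wrong with the way it runs distributed? The parameter server does not run on a GPU, but the workers are launched with the CUDA_VISIBLE_DEVICES prefix.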

ps_hosts = FLAGS.ps_hosts.split(",")
worker_hosts = FLAGS.worker_hosts.split(",")

# Create a cluster from the parameter server and worker hosts.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

# Create and start a server for the local task.
server = tf.train.Server(cluster,
                   job_name=FLAGS.job_name,
                   task_index=FLAGS.task_index)

if FLAGS.job_name == "ps":
    server.join()
elif FLAGS.job_name == "worker":

    gpu = FLAGS.task_index % 4

    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        #'/gpu:%d' % i
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        #worker_device='/gpu:%d' % gpu,
        cluster=cluster)):

        y, x = get_graph()

        y_ = tf.placeholder(tf.float32, [None, NUM_LABELS])

        cross_entropy = tf.reduce_mean( -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]) )

        global_step = tf.Variable(0)

        gradient_descent_opt = tf.train.GradientDescentOptimizer(LEARNING_RATE)

        num_workers = len(worker_hosts)
        sync_rep_opt = tf.train.SyncReplicasOptimizer(gradient_descent_opt, replicas_to_aggregate=num_workers,
                replica_id=FLAGS.task_index, total_num_replicas=num_workers)

        train_op = sync_rep_opt.minimize(cross_entropy, global_step=global_step)

        init_token_op = sync_rep_opt.get_init_tokens_op()
        chief_queue_runner = sync_rep_opt.get_chief_queue_runner()

        #saver = tf.train.Saver()
        summary_op = tf.merge_all_summaries()

        init_op = tf.initialize_all_variables()
        saver = tf.train.Saver()

    is_chief = (FLAGS.task_index == 0)

    # Create a "supervisor", which oversees the training process.
    sv = tf.train.Supervisor(is_chief=is_chief,
                             #logdir="/tmp/train_logs",
                             init_op=init_op,
                             summary_op=summary_op,
                             saver=saver,
                             global_step=global_step)
                             #save_model_secs=600)

    # The supervisor takes care of session initialization, restoring from
    # a checkpoint, and closing when done or an error occurs.
    with sv.managed_session(server.target) as sess:

        if is_chief:
            sv.start_queue_runners(sess, [chief_queue_runner])
            sess.run(init_token_op)

        num_steps_burn_in = 1000
        total_duration = 0
        total_duration_squared = 0
        step = 0

        while step <= 40000:

            print('Iteration %d' % step)
            sys.stdout.flush()
            batch_xs, batch_ys = get_data(BATCH_SIZE)
            train_feed = {x: batch_xs, y_: batch_ys}

            start_time = time.time()

            _, step = sess.run([train_op, global_step], feed_dict=train_feed)

            duration = time.time() - start_time
            if step > num_steps_burn_in:
                total_duration += duration
                total_duration_squared += duration * duration

                if not step % 1000:
                    iterations = step - num_steps_burn_in
                    images_processed = BATCH_SIZE * iterations
                    print('%s: step %d, images processed: %d, images per second: %.3f, time taken: %.2f' %
                            (datetime.now(), iterations, images_processed, images_processed/total_duration, total_duration))
                    sys.stdout.flush()
    sv.stop()

Answer 1

Your code looks fine. Here are a few points to keep in mind:

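The graphs built for the single-node and multi-node cases are different, so a direct comparison carries some variability. The distributed graph adds queues and synchronization in order to transfer gradient information between the parameter server and the workers.
Because AlexNet has a relatively fast forward and backward pass, the overhead of I/O transfers to and from the parameter server becomes more prominent. This may or may not show up with Inception V3 (I lean towards it not showing up).
Your post mentions that you are using separate EC2 instances for the parameter server and the worker; this is the best configuration. Running workers and parameter servers on the same node will definitely hurt performance a great deal.
As you add workers, you will undoubtedly have to increase the number of parameter servers serving them. With Inception, this starts to become necessary beyond about 32 independent workers.
Keep in mind that beyond about 16 workers there is evidence that convergence may be affected.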
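
To make the second point concrete, here is a rough back-of-the-envelope sketch. It is an illustration under stated assumptions, not a measurement: it assumes about 60 million float32 parameters (roughly the size of full AlexNet; the model returned by get_graph() in the question may differ), a hypothetical 1 Gbit/s link between worker and parameter server, and no overlap between transfer and compute. The function name and all numbers are made up for this example.

```python
def estimated_step_seconds(num_params, bytes_per_param,
                           bandwidth_bytes_per_s, compute_seconds):
    """Rough time for one training step: compute plus parameter-server traffic.

    Assumes the worker pushes one full set of gradients and pulls one full
    set of parameters per step, with no overlap between transfer and compute.
    """
    transfer_bytes = 2 * num_params * bytes_per_param
    return compute_seconds + transfer_bytes / bandwidth_bytes_per_s


# All values below are illustrative assumptions, not measurements.
ALEXNET_PARAMS = 60000000      # approximate parameter count of full AlexNet
LINK_BYTES_PER_S = 125000000   # hypothetical 1 Gbit/s link

transfer_only = estimated_step_seconds(ALEXNET_PARAMS, 4, LINK_BYTES_PER_S,
                                       compute_seconds=0.0)
print("seconds per step spent on transfer alone: %.2f" % transfer_only)
```

If the transfer term is large compared with the compute time of a single step, throughput is bounded by the network rather than by the GPU, which is the effect described above for a model with a fast forward and backward pass.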

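My suggestion is to try distributed Inception V3. That topology should show near-perfect scalability compared with its single-node counterpart. If it does, your hardware setup is fine; if it does not, double-check your hardware configuration.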

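If you are doing a scalability study, I recommend starting your relative performance measurements with one parameter server and one worker on separate instances, since comparisons against a single-node run will show some variation.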
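
For such a study, relative performance can be computed along the following lines. This is a minimal sketch with hypothetical helper names; it assumes you record the wall-clock duration of every step, discard a burn-in period as the question's loop does, and compare each configuration against a baseline of your choice (for example the one-parameter-server, one-worker run).

```python
def images_per_second(step_durations, batch_size, burn_in_steps):
    """Throughput over the steps that remain after discarding the burn-in."""
    timed = step_durations[burn_in_steps:]
    if not timed:
        raise ValueError("no steps left after burn-in")
    return batch_size * len(timed) / sum(timed)


def scaling_efficiency(throughput, baseline_throughput, num_workers):
    """Fraction of ideal linear scaling achieved relative to the baseline."""
    return throughput / (baseline_throughput * num_workers)


# Hypothetical durations in seconds; the first step is treated as burn-in.
durations = [9.0, 0.5, 0.5]
print(images_per_second(durations, batch_size=32, burn_in_steps=1))
```

An efficiency close to 1.0 means the configuration scales almost linearly with the number of workers; a value well below 1.0 points to communication or synchronization overhead.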
