Using TF 0.12.1, we are trying to understand how TensorFlow's performance breaks down. In particular, we are looking at the Inception-v3 model and how long a forward-pass step takes.
The first step we looked at was benchmarking just the inference step. To avoid queueing time, we set the training example to a constant tensor and run it through the Inception model. The train method in the code is given below.
# Excerpt adapted from inception_train.py in the tensorflow/models Inception
# code; FLAGS, slim, inception, _tower_loss and the RMSPROP_* constants are
# defined elsewhere in that file.
import time
from datetime import datetime

import tensorflow as tf
from six.moves import xrange  # pylint: disable=redefined-builtin


def train(dataset):
  """Train on dataset for a number of steps."""
  with tf.Graph().as_default(), tf.device('/cpu:0'):
    # Create a variable to count the number of train() calls. This equals the
    # number of batches processed * FLAGS.num_gpus.
    global_step = tf.get_variable(
        'global_step', [],
        initializer=tf.constant_initializer(0), trainable=False)

    # Calculate the learning rate schedule.
    num_batches_per_epoch = (dataset.num_examples_per_epoch() /
                             FLAGS.batch_size)
    decay_steps = int(num_batches_per_epoch * FLAGS.num_epochs_per_decay)

    # Decay the learning rate exponentially based on the number of steps.
    lr = tf.train.exponential_decay(FLAGS.initial_learning_rate,
                                    global_step,
                                    decay_steps,
                                    FLAGS.learning_rate_decay_factor,
                                    staircase=True)

    # Create an optimizer that performs gradient descent.
    opt = tf.train.RMSPropOptimizer(lr, RMSPROP_DECAY,
                                    momentum=RMSPROP_MOMENTUM,
                                    epsilon=RMSPROP_EPSILON)

    # Get images and labels for ImageNet and split the batch across GPUs.
    assert FLAGS.batch_size % FLAGS.num_gpus == 0, (
        'Batch size must be divisible by number of GPUs')
    split_batch_size = int(FLAGS.batch_size / FLAGS.num_gpus)

    num_classes = dataset.num_classes() + 1

    # Calculate the gradients for each model tower.
    tower_grads = []
    reuse_variables = None
    for i in xrange(FLAGS.num_gpus):
      with tf.device('/gpu:%d' % i):
        with tf.name_scope('%s_%d' % (inception.TOWER_NAME, i)) as scope:
          # Force all Variables to reside on the CPU.
          with slim.arg_scope([slim.variables.variable], device='/cpu:0'):
            # Calculate the loss for one tower of the ImageNet model. This
            # function constructs the entire ImageNet model but shares the
            # variables across all towers. Constant (zero) tensors stand in
            # for real input so no queueing time is measured.
            image_shape = (FLAGS.batch_size, FLAGS.image_size,
                           FLAGS.image_size, 3)
            labels_shape = (FLAGS.batch_size,)
            images = tf.zeros(image_shape, dtype=tf.float32)
            labels = tf.zeros(labels_shape, dtype=tf.int32)
            logits = _tower_loss(images, labels, num_classes,
                                 scope, reuse_variables)
          # Reuse variables for the next tower.
          reuse_variables = True

    # Build an initialization operation to run below.
    init = tf.initialize_all_variables()

    # Start running operations on the Graph. allow_soft_placement must be set
    # to True to build towers on GPU, as some of the ops do not have GPU
    # implementations.
    sess = tf.Session(config=tf.ConfigProto(
        allow_soft_placement=True,
        log_device_placement=FLAGS.log_device_placement))
    sess.run(init)

    # Start the queue runners.
    tf.train.start_queue_runners(sess=sess)

    for step in xrange(FLAGS.max_steps):
      start_time = time.time()
      sess.run(logits)
      duration = time.time() - start_time

      examples_per_sec = FLAGS.batch_size / float(duration)
      format_str = ('%s: step %d (%.1f examples/sec; %.3f '
                    'sec/batch)')
      print(format_str % (datetime.now(), step,
                          examples_per_sec, duration))
With 8 GPUs, a batch size of 32, and 1 parameter server, we observe 0.44 seconds per logits operation, i.e. per forward pass. However, when we run the timeline tool, we observe a much smaller inference time (see the figure below). In the GPU runtime there is an initial burst of activity, then a break, then a longer GPU burst. We assume the initial burst is the forward pass and the second burst is the backpropagation.
If the initial burst really is the forward-pass time, it is substantially less than 0.44 seconds. Can anyone explain the discrepancy between these results? Is the benchmarking app making a mistake, or is the timeline tool not giving the full picture? In addition, there are a couple of GPU operations before the first big burst that we cannot explain. Any insight into this would be much appreciated!
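For completeness, here is a minimal sketch of how such a timeline trace is typically captured, reusing the sess and logits built in train() above (the output filename is arbitrary):

# Capture a full trace for one step and dump it in Chrome-trace format.
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# Trace a single step.
sess.run(logits, options=run_options, run_metadata=run_metadata)

# Write a JSON file viewable at chrome://tracing.
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())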
Answer 0 (score: 0)
TensorFlow has seen many significant performance improvements since TF 0.12.1. If you are interested in solid performance numbers, please use the latest version of TensorFlow (version 1.2 as of this writing).
If you would like a high-performance model as a starting point, I strongly recommend https://github.com/tensorflow/benchmarks, which includes an Inception-v3 model.
As for trying to understand the detailed performance of a single step, I recommend using the C++ TensorFlow runtime. (The overhead from Python can be significant and could add uncertainty to your measurements.)
Additionally, run the experiment a number of times to allow the system to "warm up" and fully initialize; a sketch of such a timing loop follows.
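A minimal sketch of that measurement pattern, reusing sess and logits from the question's code (the step counts here are arbitrary):

WARMUP_STEPS = 10   # arbitrary: enough to let autotuning and allocators settle
TIMED_STEPS = 100   # arbitrary: average over many steps for a stable number

# Untimed warm-up steps.
for _ in xrange(WARMUP_STEPS):
    sess.run(logits)

# Timed steps: report an average rather than a single step.
start_time = time.time()
for _ in xrange(TIMED_STEPS):
    sess.run(logits)
avg_duration = (time.time() - start_time) / TIMED_STEPS
print('%.3f sec/batch averaged over %d steps' % (avg_duration, TIMED_STEPS))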
One caveat: if you are tuning a model, be sure to avoid setting allow_soft_placement=True. For now, it is better to verify that all the operations you expect are really placed on the GPUs. You can confirm this by looking at the log output controlled by the log_device_placement parameter.
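Concretely, that means constructing the session along these lines (a sketch; the placement log is printed to stderr as the graph is placed):

config = tf.ConfigProto(
    allow_soft_placement=False,  # fail loudly if an op cannot go on the GPU
    log_device_placement=True)   # log each op's assigned device
sess = tf.Session(config=config)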