Using TF 0.12.1, we are trying to understand how TensorFlow's performance breaks down. In particular, we are looking at the Inception-v3 model and how long a forward-pass step takes.
The first step we looked at was benchmarking just the inference step. To avoid queueing time, we set the training example to a constant tensor and run it through the Inception model. The train method in the code is given below.
# Excerpt adapted from inception_train.py in the tensorflow/models Inception
# code; FLAGS, slim, inception, _tower_loss and the RMSPROP_* constants are
# defined elsewhere in that file.
import time
from datetime import datetime

import tensorflow as tf
from six.moves import xrange  # pylint: disable=redefined-builtin


def train(dataset):
  """Train on dataset for a number of steps."""
  with tf.Graph().as_default(), tf.device('/cpu:0'):
    # Create a variable to count the number of train() calls. This equals the
    # number of batches processed * FLAGS.num_gpus.
    global_step = tf.get_variable(
        'global_step', [],
        initializer=tf.constant_initializer(0), trainable=False)

    # Calculate the learning rate schedule.
    num_batches_per_epoch = (dataset.num_examples_per_epoch() /
                             FLAGS.batch_size)
    decay_steps = int(num_batches_per_epoch * FLAGS.num_epochs_per_decay)

    # Decay the learning rate exponentially based on the number of steps.
    lr = tf.train.exponential_decay(FLAGS.initial_learning_rate,
                                    global_step,
                                    decay_steps,
                                    FLAGS.learning_rate_decay_factor,
                                    staircase=True)

    # Create an optimizer that performs gradient descent.
    opt = tf.train.RMSPropOptimizer(lr, RMSPROP_DECAY,
                                    momentum=RMSPROP_MOMENTUM,
                                    epsilon=RMSPROP_EPSILON)

    # Get images and labels for ImageNet and split the batch across GPUs.
    assert FLAGS.batch_size % FLAGS.num_gpus == 0, (
        'Batch size must be divisible by number of GPUs')
    split_batch_size = int(FLAGS.batch_size / FLAGS.num_gpus)

    num_classes = dataset.num_classes() + 1

    # Calculate the gradients for each model tower.
    tower_grads = []
    reuse_variables = None
    for i in xrange(FLAGS.num_gpus):
      with tf.device('/gpu:%d' % i):
        with tf.name_scope('%s_%d' % (inception.TOWER_NAME, i)) as scope:
          # Force all Variables to reside on the CPU.
          with slim.arg_scope([slim.variables.variable], device='/cpu:0'):
            # Calculate the loss for one tower of the ImageNet model. This
            # function constructs the entire ImageNet model but shares the
            # variables across all towers. Constant (zero) tensors stand in
            # for real input so no queueing time is measured.
            image_shape = (FLAGS.batch_size, FLAGS.image_size,
                           FLAGS.image_size, 3)
            labels_shape = (FLAGS.batch_size,)
            images = tf.zeros(image_shape, dtype=tf.float32)
            labels = tf.zeros(labels_shape, dtype=tf.int32)
            logits = _tower_loss(images, labels, num_classes,
                                 scope, reuse_variables)
          # Reuse variables for the next tower.
          reuse_variables = True

    # Build an initialization operation to run below.
    init = tf.initialize_all_variables()

    # Start running operations on the Graph. allow_soft_placement must be set
    # to True to build towers on GPU, as some of the ops do not have GPU
    # implementations.
    sess = tf.Session(config=tf.ConfigProto(
        allow_soft_placement=True,
        log_device_placement=FLAGS.log_device_placement))
    sess.run(init)

    # Start the queue runners.
    tf.train.start_queue_runners(sess=sess)

    for step in xrange(FLAGS.max_steps):
      start_time = time.time()
      sess.run(logits)
      duration = time.time() - start_time

      examples_per_sec = FLAGS.batch_size / float(duration)
      format_str = ('%s: step %d (%.1f examples/sec; %.3f '
                    'sec/batch)')
      print(format_str % (datetime.now(), step,
                          examples_per_sec, duration))
With 8 GPUs, a batch size of 32, and 1 parameter server, we observe 0.44 seconds per logits operation, i.e. per forward pass. However, when we run the timeline tool, we observe a much smaller inference time (see the figure below). In the GPU runtime there is an initial burst of activity, then a break, then a longer GPU burst. We assume the initial burst is the forward pass and the second burst is the backpropagation.
If the initial burst really is the forward-pass time, it is substantially less than 0.44 seconds. Can anyone explain the discrepancy between these results? Is the benchmarking app making a mistake, or is the timeline tool not giving the full picture? In addition, there are a couple of GPU operations before the first big burst that we cannot explain. Any insight into this would be much appreciated!
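For completeness, here is a minimal sketch of how such a timeline trace is typically captured, reusing the sess and logits built in train() above (the output filename is arbitrary):

# Capture a full trace for one step and dump it in Chrome-trace format.
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# Trace a single step.
sess.run(logits, options=run_options, run_metadata=run_metadata)

# Write a JSON file viewable at chrome://tracing.
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())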
Answer 0 (score: 0)
TensorFlow has seen many significant performance improvements since TF 0.12.1. If you are interested in solid performance numbers, please use the latest version of TensorFlow (version 1.2 as of this writing).
If you would like a high-performance model as a starting point, I strongly recommend https://github.com/tensorflow/benchmarks, which includes an Inception-v3 model.
As for trying to understand the detailed performance of a single step, I recommend using the C++ TensorFlow runtime. (The overhead from Python can be significant and could add uncertainty to your measurements.)
Additionally, run the experiment a number of times to allow the system to "warm up" and fully initialize; a sketch of such a timing loop follows.
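A minimal sketch of that measurement pattern, reusing sess and logits from the question's code (the step counts here are arbitrary):

WARMUP_STEPS = 10   # arbitrary: enough to let autotuning and allocators settle
TIMED_STEPS = 100   # arbitrary: average over many steps for a stable number

# Untimed warm-up steps.
for _ in xrange(WARMUP_STEPS):
    sess.run(logits)

# Timed steps: report an average rather than a single step.
start_time = time.time()
for _ in xrange(TIMED_STEPS):
    sess.run(logits)
avg_duration = (time.time() - start_time) / TIMED_STEPS
print('%.3f sec/batch averaged over %d steps' % (avg_duration, TIMED_STEPS))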
One caveat: if you are tuning a model, be sure to avoid setting allow_soft_placement=True. For now, it is better to verify that all the operations you expect are really placed on the GPUs. You can confirm this by looking at the log output controlled by the log_device_placement parameter.
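Concretely, that means constructing the session along these lines (a sketch; the placement log is printed to stderr as the graph is placed):

config = tf.ConfigProto(
    allow_soft_placement=False,  # fail loudly if an op cannot go on the GPU
    log_device_placement=True)   # log each op's assigned device
sess = tf.Session(config=config)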