TensorFlow 1.11 Estimator + Dataset: training is slow and irregular

Asked: 2018-10-31 09:21:54

Tags: python tensorflow tensorflow-datasets tensorflow-estimator

Question

  • Training on a custom TensorFlow 1.11 tf.data.Dataset with tf.estimator.Estimator runs much more slowly than a tf.keras run with the same model architecture that is fed the data directly (a rough sketch of that baseline follows this list)
  • However, it sometimes does run fast (measured in global_step/sec), but it is slow at the beginning and end of each epoch.
  • During the "fast" batches, GPU utilization is around 30%; during the slow ones it is around 1%.
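
For reference, the tf.keras comparison run looks roughly like the sketch below (reconstructed for illustration, not the exact script; the shapes mirror the minimal working example further down):

import numpy as np
import tensorflow as tf

# Reconstructed baseline (illustrative only): same architecture as the Estimator
# model in the MWE, with NumPy arrays fed directly to fit().
x_train = np.random.randn(100000, 98).astype(np.float32)           # 98 features
y_train = np.round(np.random.rand(100000, 11)).astype(np.float32)  # 11 binary targets

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='tanh', input_shape=(98,)),
    tf.keras.layers.Dense(11, activation='sigmoid'),
])
# Sigmoid outputs + binary_crossentropy match tf.losses.sigmoid_cross_entropy in the MWE.
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(x_train, y_train, batch_size=2**10, epochs=2)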

Possible causes

  • My input pipeline is not sufficiently optimized, so the GPU sits idle waiting for the CPU to prepare data. But I don't understand why that would only affect the last few mini-batches before the end of an epoch.
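
A sketch of one way to check whether the GPU really is idle waiting on the input pipeline (this was not part of the original run; classifier and train_input_fn are the objects defined in the minimal working example below):

import tensorflow as tf

# Dump Chrome-trace timelines every 200 steps; open the resulting
# timeline-*.json files in chrome://tracing and look for idle GPU gaps.
profiler_hook = tf.train.ProfilerHook(save_steps=200, output_dir='/tmp/profile')
classifier.train(input_fn=train_input_fn, steps=2000, hooks=[profiler_hook])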

What I have tried

Following the steps in the Input Pipeline Performance Guide. This sped up the batches that are not near an epoch boundary. I am not sure how to improve things further.
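
The tweaks were roughly of the kind sketched below (illustrative only, not necessarily the exact pipeline; train_data, train_target, cols, target_cols and BATCH_SIZE refer to the minimal working example in the next section):

def train_input_fn_tuned():
    # Illustrative, guide-style variant of train_input_fn; names come from the MWE below.
    features = {k: train_data[k].values for k in cols}
    dataset = tf.data.Dataset.from_tensor_slices(
        (features, train_target[target_cols].values))
    # Fused shuffle + repeat (available as tf.contrib.data in TF 1.11).
    dataset = dataset.apply(
        tf.contrib.data.shuffle_and_repeat(buffer_size=train_data.shape[0]))
    dataset = dataset.batch(BATCH_SIZE)
    # After batch(), prefetch() counts whole batches, so a small buffer is enough.
    dataset = dataset.prefetch(1)
    return dataset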

Minimal working example

import numpy as np
import pandas as pd
import tensorflow as tf

train_data = pd.DataFrame(np.random.randn(1030255, 1021)).rename(columns={c:str(c) for c in range(1021)})
train_target = pd.DataFrame(np.round(np.random.rand(1030255, 15))).rename(columns={c:str(c) for c in range(15)})
val_data = pd.DataFrame(np.random.randn(491077, 1021)).rename(columns={c:str(c) for c in range(1021)})
val_target = pd.DataFrame(np.round(np.random.rand(491077, 15))).rename(columns={c:str(c) for c in range(15)})

def model_fn(features, labels, mode, params):
    net = tf.feature_column.input_layer(features, params['feature_columns'])
    net = tf.layers.dense(net, units=params['hidden_units'], activation=tf.nn.tanh)
    logits = tf.layers.dense(net, params['outputs'], activation=None)
    loss = tf.losses.sigmoid_cross_entropy(labels, logits=logits)
    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode, loss=loss)
    elif mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdamOptimizer()
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

cols = [str(c) for c in np.sort(np.random.choice(1021, size=98, replace=False))]
target_cols = [str(c) for c in np.arange(11)]
VALIDATION_BATCH_SIZE = int(val_data.shape[0] / 4.0)
BATCH_SIZE = 2**10

def train_input_fn():
    features = {k: train_data[k].values for k in cols}
    dataset = tf.data.Dataset.from_tensor_slices((features, train_target[target_cols].values))
    dataset = dataset.repeat() \
        .shuffle(train_data.shape[0]) \
        .batch(BATCH_SIZE) \
        .prefetch(BATCH_SIZE)
    return dataset

def validation_input_fn():
    features = {k: val_data[k].values for k in cols}
    dataset = tf.data.Dataset.from_tensor_slices((features, val_target[target_cols].values))
    dataset = dataset.repeat() \
        .batch(VALIDATION_BATCH_SIZE) \
        .prefetch(VALIDATION_BATCH_SIZE)
    return dataset

feature_columns = [tf.feature_column.numeric_column(f) for f in cols]
run_cfg = tf.estimator.RunConfig(tf_random_seed=1, save_checkpoints_steps=1000, save_checkpoints_secs=None)
classifier = tf.estimator.Estimator(
        model_fn=model_fn,
        config=run_cfg,
        params={
            'feature_columns': feature_columns,
            'hidden_units': 128,
            'outputs': len(target_cols)
        })
tf.estimator.train_and_evaluate(
    classifier,
    train_spec=tf.estimator.TrainSpec(train_input_fn),
    eval_spec=tf.estimator.EvalSpec(validation_input_fn, steps=4, start_delay_secs=30, throttle_secs=30))

Example output

(The reported time for the first step after evaluation is misleadingly huge, because TensorFlow measures it against the last training step before the evaluation, so it includes the evaluation time. The slow last training step right before evaluation starts, however, is not such an artifact.)

INFO:tensorflow:Finished evaluation at 2018-10-31-08:17:35
INFO:tensorflow:Saving dict for global step 1000: global_step = 1000, loss = 0.694723
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1000: /tmp/tmpvQq9DV/model.ckpt-1000
INFO:tensorflow:global_step/sec: 0.930553
INFO:tensorflow:loss = 0.69407356, step = 1000 (107.463 sec)
INFO:tensorflow:global_step/sec: 171.673
INFO:tensorflow:loss = 0.69456786, step = 1100 (0.583 sec)
INFO:tensorflow:global_step/sec: 166.964
INFO:tensorflow:loss = 0.69411445, step = 1200 (0.599 sec)
INFO:tensorflow:global_step/sec: 172.226
INFO:tensorflow:loss = 0.6940959, step = 1300 (0.579 sec)
INFO:tensorflow:global_step/sec: 170.882
INFO:tensorflow:loss = 0.69440323, step = 1400 (0.586 sec)
INFO:tensorflow:global_step/sec: 173.453
INFO:tensorflow:loss = 0.69332886, step = 1500 (0.577 sec)
INFO:tensorflow:global_step/sec: 167.078
INFO:tensorflow:loss = 0.6950055, step = 1600 (0.598 sec)
INFO:tensorflow:global_step/sec: 159.763
INFO:tensorflow:loss = 0.69460225, step = 1700 (0.626 sec)
INFO:tensorflow:global_step/sec: 161.674
INFO:tensorflow:loss = 0.6940766, step = 1800 (0.617 sec)
INFO:tensorflow:global_step/sec: 8.83793
INFO:tensorflow:loss = 0.6936994, step = 1900 (11.315 sec)
INFO:tensorflow:Saving checkpoints for 2000 into /tmp/tmpvQq9DV/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-10-31-08:18:11
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpvQq9DV/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/4]
INFO:tensorflow:Evaluation [2/4]
INFO:tensorflow:Evaluation [3/4]
INFO:tensorflow:Evaluation [4/4]
INFO:tensorflow:Finished evaluation at 2018-10-31-08:19:36

Without the evaluation step

I thought the evaluation step (which is itself very slow) might be the problem. But when I only train, it runs even more slowly:

classifier.train(input_fn=train_input_fn, steps=5000)

[...]
INFO:tensorflow:Saving checkpoints for 1000 into /tmp/tmpIJctNp/model.ckpt.
INFO:tensorflow:global_step/sec: 6.95414
INFO:tensorflow:loss = 0.6948092, step = 1000 (14.381 sec)
INFO:tensorflow:global_step/sec: 12.0491
INFO:tensorflow:loss = 0.6942879, step = 1100 (8.298 sec)
INFO:tensorflow:global_step/sec: 8.98388
INFO:tensorflow:loss = 0.6939402, step = 1200 (11.131 sec)
INFO:tensorflow:global_step/sec: 8.86219
INFO:tensorflow:loss = 0.6946343, step = 1300 (11.284 sec)
INFO:tensorflow:global_step/sec: 8.74248
INFO:tensorflow:loss = 0.694865, step = 1400 (11.439 sec)

Thanks in advance for your help.

0 Answers:

There are no answers yet.