Problem
Training a custom tf.estimator.Estimator in TensorFlow 1.11 with a tf.data.Dataset input runs much slower than training a tf.keras model with the same architecture and directly fed data. What could be possible reasons?
What I've tried
Following the steps in the Input Pipeline Performance Guide. This sped up the batches that are not near an epoch boundary. I'm not sure how to improve it further.
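The epoch-boundary slowdown is consistent with the pipeline's shuffle(train_data.shape[0]) call, which buffers the entire training set and must refill that buffer at each epoch. A back-of-the-envelope check in plain Python, using the sizes from the example below, shows roughly how many steps one epoch takes and how large that buffer is:

```python
import math

ROWS = 1030255        # training rows in the example below
N_FEATURES = 98       # columns actually fed to the model
BATCH_SIZE = 2 ** 10  # 1024, as in the example

# Steps per epoch at batch size 1024
steps_per_epoch = math.ceil(ROWS / BATCH_SIZE)
print(steps_per_epoch)  # 1007

# shuffle(buffer_size=ROWS) keeps every element in memory; with float64
# features that is, for the features alone:
buffer_bytes = ROWS * N_FEATURES * 8
print(buffer_bytes / 2 ** 30)  # ~0.75 GiB
```

The slow batches in the logs below (around steps 1000 and 1900) line up with these epoch boundaries (1007 and 2014), which fits the buffer-refill explanation.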
Minimal working example
import numpy as np
import pandas as pd
import tensorflow as tf
train_data = pd.DataFrame(np.random.randn(1030255, 1021)).rename(columns={c:str(c) for c in range(1021)})
train_target = pd.DataFrame(np.round(np.random.rand(1030255, 15))).rename(columns={c:str(c) for c in range(15)})
val_data = pd.DataFrame(np.random.randn(491077, 1021)).rename(columns={c:str(c) for c in range(1021)})
val_target = pd.DataFrame(np.round(np.random.rand(491077, 15))).rename(columns={c:str(c) for c in range(15)})
def model_fn(features, labels, mode, params):
    net = tf.feature_column.input_layer(features, params['feature_columns'])
    net = tf.layers.dense(net, units=params['hidden_units'], activation=tf.nn.tanh)
    logits = tf.layers.dense(net, params['outputs'], activation=None)
    loss = tf.losses.sigmoid_cross_entropy(labels, logits=logits)
    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode, loss=loss)
    elif mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdamOptimizer()
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
cols = [str(c) for c in np.sort(np.random.choice(1021, size=98, replace=False))]
target_cols = [str(c) for c in np.arange(11)]
VALIDATION_BATCH_SIZE = int(val_data.shape[0] / 4.0)
BATCH_SIZE = 2**10
def train_input_fn():
    features = {k: train_data[k].values for k in cols}
    dataset = tf.data.Dataset.from_tensor_slices((features, train_target[target_cols].values))
    dataset = dataset.repeat() \
        .shuffle(train_data.shape[0]) \
        .batch(BATCH_SIZE) \
        .prefetch(BATCH_SIZE)
    return dataset
def validation_input_fn():
    features = {k: val_data[k].values for k in cols}
    dataset = tf.data.Dataset.from_tensor_slices((features, val_target[target_cols].values))
    dataset = dataset.repeat() \
        .batch(VALIDATION_BATCH_SIZE) \
        .prefetch(VALIDATION_BATCH_SIZE)
    return dataset
feature_columns = [tf.feature_column.numeric_column(f) for f in cols]
run_cfg = tf.estimator.RunConfig(tf_random_seed=1, save_checkpoints_steps=1000, save_checkpoints_secs=None)
classifier = tf.estimator.Estimator(
    model_fn=model_fn,
    config=run_cfg,
    params={
        'feature_columns': feature_columns,
        'hidden_units': 128,
        'outputs': len(target_cols)
    })
tf.estimator.train_and_evaluate(
    classifier,
    train_spec=tf.estimator.TrainSpec(train_input_fn),
    eval_spec=tf.estimator.EvalSpec(validation_input_fn, steps=4, start_delay_secs=30, throttle_secs=30))
Sample output
(The huge time reported for the first step after evaluation is misleading: TensorFlow attributes to it all the wall time since the last logged training step, which includes the evaluation itself. The slowdown of the last training step before evaluation begins is not such an artifact, however.)
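To make the log easier to interpret: the global_step/sec values are the reciprocal of the per-step rate, so the seconds reported per 100 logged steps can be recovered directly. A small check in plain Python, with rates copied from the log below:

```python
# global_step/sec rates copied from the training log
rate_after_eval = 0.930553   # first 100 steps logged after evaluation
rate_steady = 171.673        # steady-state training

# Seconds for 100 steps = 100 / (steps per second)
print(100 / rate_after_eval)  # ~107.5 s -- includes the evaluation pause
print(100 / rate_steady)      # ~0.58 s  -- actual training speed
```

This confirms that the 107-second entry at step 1000 is dominated by the evaluation pause rather than by training itself.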
INFO:tensorflow:Finished evaluation at 2018-10-31-08:17:35
INFO:tensorflow:Saving dict for global step 1000: global_step = 1000, loss = 0.694723
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1000: /tmp/tmpvQq9DV/model.ckpt-1000
INFO:tensorflow:global_step/sec: 0.930553
INFO:tensorflow:loss = 0.69407356, step = 1000 (107.463 sec)
INFO:tensorflow:global_step/sec: 171.673
INFO:tensorflow:loss = 0.69456786, step = 1100 (0.583 sec)
INFO:tensorflow:global_step/sec: 166.964
INFO:tensorflow:loss = 0.69411445, step = 1200 (0.599 sec)
INFO:tensorflow:global_step/sec: 172.226
INFO:tensorflow:loss = 0.6940959, step = 1300 (0.579 sec)
INFO:tensorflow:global_step/sec: 170.882
INFO:tensorflow:loss = 0.69440323, step = 1400 (0.586 sec)
INFO:tensorflow:global_step/sec: 173.453
INFO:tensorflow:loss = 0.69332886, step = 1500 (0.577 sec)
INFO:tensorflow:global_step/sec: 167.078
INFO:tensorflow:loss = 0.6950055, step = 1600 (0.598 sec)
INFO:tensorflow:global_step/sec: 159.763
INFO:tensorflow:loss = 0.69460225, step = 1700 (0.626 sec)
INFO:tensorflow:global_step/sec: 161.674
INFO:tensorflow:loss = 0.6940766, step = 1800 (0.617 sec)
INFO:tensorflow:global_step/sec: 8.83793
INFO:tensorflow:loss = 0.6936994, step = 1900 (11.315 sec)
INFO:tensorflow:Saving checkpoints for 2000 into /tmp/tmpvQq9DV/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-10-31-08:18:11
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpvQq9DV/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/4]
INFO:tensorflow:Evaluation [2/4]
INFO:tensorflow:Evaluation [3/4]
INFO:tensorflow:Evaluation [4/4]
INFO:tensorflow:Finished evaluation at 2018-10-31-08:19:36
Without the evaluation step
I thought the evaluation step (which is itself very slow) might be the problem. But when I only train, it runs even slower:
classifier.train(input_fn=train_input_fn, steps=5000)
[...]
INFO:tensorflow:Saving checkpoints for 1000 into /tmp/tmpIJctNp/model.ckpt.
INFO:tensorflow:global_step/sec: 6.95414
INFO:tensorflow:loss = 0.6948092, step = 1000 (14.381 sec)
INFO:tensorflow:global_step/sec: 12.0491
INFO:tensorflow:loss = 0.6942879, step = 1100 (8.298 sec)
INFO:tensorflow:global_step/sec: 8.98388
INFO:tensorflow:loss = 0.6939402, step = 1200 (11.131 sec)
INFO:tensorflow:global_step/sec: 8.86219
INFO:tensorflow:loss = 0.6946343, step = 1300 (11.284 sec)
INFO:tensorflow:global_step/sec: 8.74248
INFO:tensorflow:loss = 0.694865, step = 1400 (11.439 sec)
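For scale: the model itself is tiny, so writing checkpoints every 1000 steps should not dominate the runtime. A rough parameter count for the architecture in the example (98 numeric inputs, one 128-unit hidden layer, 11 outputs) can be sketched as:

```python
N_INPUTS, N_HIDDEN, N_OUTPUTS = 98, 128, 11

hidden_params = N_INPUTS * N_HIDDEN + N_HIDDEN   # weights + biases
output_params = N_HIDDEN * N_OUTPUTS + N_OUTPUTS
total_params = hidden_params + output_params
print(total_params)  # 14091 -- roughly 55 KB of float32 weights
```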
Thanks in advance for your help.