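Step count mismatch when using tf.estimator.Estimator

Asked: 2017-08-02 19:24:13

Tags: tensorflow

I am getting to grips with the TensorFlow Estimator framework. I finally have code that trains a model, and I am testing it with a simple MNIST autoencoder. I have two questions. First, why does training report a different number of steps than the number I pass to the estimator's train() method? Second, how do I use training hooks to do things like periodic evaluation or printing the loss every X steps? The documentation seems to say to use training hooks, but I can't find any actual examples of how to use them.

Here is my code: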

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import time
import shutil
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

from IPython import display
from tensorflow.examples.tutorials.mnist import input_data

data = input_data.read_data_sets('.')
display.clear_output()

def _model_fn(features, labels, mode=None, params=None):
    # define inputs
    image = tf.feature_column.numeric_column('images', shape=(784, ))
    inputs = tf.feature_column.input_layer(features, [image, ])
    # encoder
    e1 = tf.layers.dense(inputs, 512, activation=tf.nn.relu)
    e2 = tf.layers.dense(e1, 256, activation=tf.nn.relu)
    # decoder
    d1 = tf.layers.dense(e2, 512, activation=tf.nn.relu)
    model = tf.layers.dense(d1, 784, activation=tf.nn.relu)
    # training ops
    loss = tf.losses.mean_squared_error(labels, model)
    train = tf.train.AdamOptimizer().minimize(loss, global_step=tf.train.get_global_step())
    if mode == tf.estimator.ModeKeys.TRAIN:
        return tf.estimator.EstimatorSpec(mode=mode,
                                          loss=loss,
                                          train_op=train)

_train_input_fn = tf.estimator.inputs.numpy_input_fn({'images': data.train.images},
                                                     y=np.array(data.train.images),
                                                     batch_size=100,
                                                     shuffle=True)

shutil.rmtree("logs", ignore_errors=True)
tf.logging.set_verbosity(tf.logging.INFO)
estimator = tf.estimator.Estimator(_model_fn, 
                                   model_dir="logs", 
                                   config=tf.contrib.learn.RunConfig(save_checkpoints_steps=1000),
                                   params={})
estimator.train(_train_input_fn, steps=1000)

Here is the output I get (note that training stops at step 550, even though the code explicitly asks for 1000):

INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x12b9fa630>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': None, '_session_config': None, '_save_checkpoints_steps': 1000, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': 'logs'}
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into logs/model.ckpt.
INFO:tensorflow:loss = 0.102862, step = 1
INFO:tensorflow:global_step/sec: 41.8119
INFO:tensorflow:loss = 0.0191228, step = 101 (2.393 sec)
INFO:tensorflow:global_step/sec: 39.9923
INFO:tensorflow:loss = 0.0141014, step = 201 (2.500 sec)
INFO:tensorflow:global_step/sec: 40.9806
INFO:tensorflow:loss = 0.0116138, step = 301 (2.440 sec)
INFO:tensorflow:global_step/sec: 40.0043
INFO:tensorflow:loss = 0.00998991, step = 401 (2.500 sec)
INFO:tensorflow:global_step/sec: 39.2571
INFO:tensorflow:loss = 0.0124132, step = 501 (2.548 sec)
INFO:tensorflow:Saving checkpoints for 550 into logs/model.ckpt.
INFO:tensorflow:Loss for final step: 0.00940801.

<tensorflow.python.estimator.estimator.Estimator at 0x12b9fa780>

Update #1: I found the answer to my first question. Training stops at step 550 because numpy_input_fn() defaults to num_epochs=1. I am still looking for help with training hooks.

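2 answers:

Answer 0 (score: 0)

It looks like you went through all of your data in 550 steps. The default value of the num_epochs argument of numpy_input_fn is 1, so the data is iterated over only once, and training ends when the input is exhausted even if fewer steps than requested have run. https://www.tensorflow.org/api_docs/python/tf/estimator/inputs/numpy_input_fn

So you should pass num_epochs=None to numpy_input_fn. The input then repeats indefinitely and training runs for the number of steps you asked for.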
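The arithmetic behind the 550 steps can be sketched without TensorFlow. The snippet below is a plain-Python stand-in, not the numpy_input_fn implementation: `batches` and `run_training` are hypothetical helpers, and the figure of 55,000 examples is inferred from the question's log (550 steps at a batch size of 100).

```python
import itertools


def batches(num_examples, batch_size, num_epochs=1):
    """Yield batch sizes the way a finite input pipeline would.

    num_epochs=None means "repeat forever", mirroring the convention
    described in the answer above. (Hypothetical helper.)
    """
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        for start in range(0, num_examples, batch_size):
            yield min(batch_size, num_examples - start)


def run_training(input_iter, steps):
    """Consume one batch per step; stop at `steps` or when input runs out."""
    completed = 0
    for _ in input_iter:
        completed += 1
        if completed >= steps:
            break
    return completed


# 55,000 examples in batches of 100, as implied by the question's log.
one_epoch = run_training(batches(55000, 100, num_epochs=1), steps=1000)
repeating = run_training(batches(55000, 100, num_epochs=None), steps=1000)
print(one_epoch, repeating)
```

With a single epoch the loop runs out of data after 550 batches, while the repeating input lets it reach the requested 1000 steps.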

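Answer 1 (score: 0)

An Estimator can run in three modes:

  1. Train
  2. Evaluate
  3. Predict

Your current code is only set up to run in training mode. To include an evaluation step, you first need to make some changes to the model function: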

def _model_fn(features, labels, mode=None, params=None):
    # define inputs
    image = tf.feature_column.numeric_column('images', shape=(784, ))
    inputs = tf.feature_column.input_layer(features, [image, ])
    # encoder
    e1 = tf.layers.dense(inputs, 512, activation=tf.nn.relu)
    e2 = tf.layers.dense(e1, 256, activation=tf.nn.relu)
    # decoder
    d1 = tf.layers.dense(e2, 512, activation=tf.nn.relu)
    model = tf.layers.dense(d1, 784, activation=tf.nn.relu)
    # training ops
    loss = tf.losses.mean_squared_error(labels, model)
    train = tf.train.AdamOptimizer().minimize(loss, global_step=tf.train.get_global_step())
    if mode == tf.estimator.ModeKeys.TRAIN:
        return tf.estimator.EstimatorSpec(mode=mode,
                                          loss=loss,
                                          train_op=train)

    # evaluation metrics (tf.metrics.precision/recall cast their inputs to bool,
    # so for a reconstruction task these values are illustrative only)
    prec, prec_update_op = tf.metrics.precision(labels=labels, predictions=model, name='precision_op')
    recall, recall_update_op = tf.metrics.recall(labels=labels, predictions=model, name='recall_op')

    metrics = {'recall': (recall, recall_update_op),
               'precision': (prec, prec_update_op)}

    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics)

Now set up evaluation, and log the loss every 10 steps:

configuration = tf.estimator.RunConfig(
  model_dir = 'logs',
  keep_checkpoint_max=5,
  save_checkpoints_steps=1500,
  log_step_count_steps=10)  # set the frequency of logging steps for loss function

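estimator = tf.estimator.Estimator(model_fn=_model_fn, params={}, config=configuration)

# max_steps is the total number of training steps. The training input function
# must keep yielding data (e.g. num_epochs=None), or training ends when the data runs out.
train_spec = tf.estimator.TrainSpec(input_fn=_train_input_fn, max_steps=5000)
eval_spec = tf.estimator.EvalSpec(input_fn=_train_input_fn, steps=100, throttle_secs=600)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)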

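Notes:

  1. After each new checkpoint is saved (i.e. every 1500 steps), evaluation runs for 100 steps, and then training resumes.
  2. log_step_count_steps sets how often, in steps, the loss output is printed.
  3. The throttle_secs argument defines the minimum number of seconds between two consecutive evaluations. If a new checkpoint is saved before that many seconds have passed, the evaluation is skipped.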
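Note 3 can be restated as a small framework-free sketch. This is only the rule as worded above, not TensorFlow's scheduling code, and the exact behavior may differ between TensorFlow versions. The function `evaluations_triggered` and the checkpoint timestamps are hypothetical, and the sketch assumes the very first checkpoint is always evaluated.

```python
def evaluations_triggered(checkpoint_times, throttle_secs):
    """Return the checkpoint times that would lead to an evaluation.

    A checkpoint triggers an evaluation only if at least `throttle_secs`
    seconds have passed since the previous evaluation; otherwise it is
    skipped. Assumption: the first checkpoint is always evaluated.
    """
    evaluated = []
    last_eval = None
    for t in checkpoint_times:
        if last_eval is None or t - last_eval >= throttle_secs:
            evaluated.append(t)
            last_eval = t
    return evaluated


# Hypothetical checkpoint timestamps, in seconds since training started.
checkpoints = [40, 400, 700, 1250, 1300]
print(evaluations_triggered(checkpoints, throttle_secs=600))
```

Under this rule, checkpoints that arrive less than throttle_secs after the last evaluation are simply not evaluated, so a large throttle_secs combined with frequent checkpoints means many checkpoints never get an evaluation.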

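The code above trains and evaluates on the same dataset. If you want to evaluate on a different dataset, pass an input function for that dataset as the input_fn argument of tf.estimator.EvalSpec.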