I am using tf.estimator.train_and_evaluate to train a DNNClassifier. Suppose my training dataset has 10,000 rows and the batch size is set to 200, so one epoch takes 50 steps.
import tensorflow as tf

def json_serving_input_fn():
    # One placeholder per input feature; the same dict serves as both
    # the parsed features and the receiver tensors.
    inputs = {}
    for feat in my_input_columns:
        inputs[feat.name] = tf.placeholder(shape=[None], dtype=feat.dtype)
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)

my_exporter = tf.estimator.FinalExporter('test', json_serving_input_fn)

estimator = tf.estimator.DNNClassifier(
    config=config,
    feature_columns=my_input_columns,
    hidden_units=[100, 70, 50, 25]
)

train_spec = tf.estimator.TrainSpec(
    train_input_fn, max_steps=3000)

eval_spec = tf.estimator.EvalSpec(
    eval_input_fn,
    exporters=[my_exporter],
    name='test-eval')

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
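For context, FinalExporter performs a single export at the very end of training. For comparison, here is a minimal sketch (illustrative only, not code I ran) that swaps in tf.estimator.LatestExporter, which exports a SavedModel after every evaluation rather than only once at the end:

# Sketch: LatestExporter writes an export after each evaluation and
# keeps only the most recent exports_to_keep of them.
my_latest_exporter = tf.estimator.LatestExporter(
    'test',
    json_serving_input_fn,
    exports_to_keep=5)

eval_spec = tf.estimator.EvalSpec(
    eval_input_fn,
    exporters=[my_latest_exporter],
    name='test-eval')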
train_input_fn reads exactly one epoch of training data. Everything looks fine except for max_steps in train_spec. From my experiments, if I set max_steps to 50 or less, the model is exported as a saved_model.pb file under export/test/[timeStamp] in the model output directory. However, if I set it to anything greater than 50, the model still trains and there is still some output in the model output directory, but there is no export folder and therefore no saved_model.pb file.
I don't understand why this max_steps parameter affects model export. Any help is appreciated!
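For reference, here is a minimal sketch of what my train_input_fn does (the tf.data pipeline and the parse_csv_row helper are illustrative, not my exact code). The key point is that the dataset is repeated exactly once, so the iterator is exhausted after one epoch:

def train_input_fn():
    # Read the training CSV in a single pass: with repeat(1), the
    # iterator raises OutOfRangeError after one epoch, i.e. after
    # 50 steps at batch size 200 on 10,000 rows, even if max_steps
    # asks for more.
    dataset = tf.data.TextLineDataset('train.csv')  # illustrative path
    dataset = dataset.map(parse_csv_row)            # hypothetical CSV parser
    dataset = dataset.repeat(1)                     # exactly one epoch
    dataset = dataset.batch(200)
    return dataset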
Follow-up:
Added the code for my_exporter.
Added the entire log. (My actual training data has 3589 steps per epoch.)
('CSV_COLUMN_DEFAULTS is: ', [[''], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0]])
('The labels are: ', ['0', '1'])
INFO:tensorflow:TF_CONFIG environment variable: {u'environment': u'cloud', u'cluster': {}, u'job': {u'args': [u'--train-directories', u'/Users/git/test/data/train-csv', u'--eval-directories', u'/Users/git/test/data/test-csv', u'--header-file', u'/Users/git/test/data/header.csv', u'--label-column', u'IS_FRAUD', u'--id-columns', u'ID', u'--labels', u'0', u'1', u'--data-type-file', u'/Users/git/test/data/dataType.json', u'--dropout', u'0.1', u'--learning-rate', u'0.1', u'--job-dir', u'output'], u'job_name': u'trainer.task'}, u'task': {}}
Model dir output
Length of args: 2
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x11a89a910>, '_evaluation_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_master': '', '_device_fn': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_model_dir': 'output', '_train_distribute': None, '_save_summary_steps': 100}
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
No args.max_train_steps is set
args.num_epochs is: 1
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into output/model.ckpt.
INFO:tensorflow:loss = 1565.3971, step = 1
INFO:tensorflow:global_step/sec: 141.477
INFO:tensorflow:loss = 9.107332, step = 101 (0.707 sec)
INFO:tensorflow:global_step/sec: 268.697
INFO:tensorflow:loss = 11.751476, step = 201 (0.372 sec)
INFO:tensorflow:global_step/sec: 295.221
INFO:tensorflow:loss = 6.426797, step = 301 (0.339 sec)
INFO:tensorflow:global_step/sec: 278.098
INFO:tensorflow:loss = 9.810878, step = 401 (0.360 sec)
INFO:tensorflow:global_step/sec: 278.104
INFO:tensorflow:loss = 10.924436, step = 501 (0.359 sec)
INFO:tensorflow:global_step/sec: 287.359
INFO:tensorflow:loss = 4.5450187, step = 601 (0.348 sec)
INFO:tensorflow:global_step/sec: 293.911
INFO:tensorflow:loss = 3.6031227, step = 701 (0.340 sec)
INFO:tensorflow:global_step/sec: 284.665
INFO:tensorflow:loss = 17.361057, step = 801 (0.351 sec)
INFO:tensorflow:global_step/sec: 256.988
INFO:tensorflow:loss = 7.1942925, step = 901 (0.389 sec)
INFO:tensorflow:global_step/sec: 276.781
INFO:tensorflow:loss = 9.573512, step = 1001 (0.361 sec)
INFO:tensorflow:global_step/sec: 275.31
INFO:tensorflow:loss = 11.024918, step = 1101 (0.363 sec)
INFO:tensorflow:global_step/sec: 253.897
INFO:tensorflow:loss = 6.0349283, step = 1201 (0.394 sec)
INFO:tensorflow:global_step/sec: 263.648
INFO:tensorflow:loss = 8.061558, step = 1301 (0.379 sec)
INFO:tensorflow:global_step/sec: 247.25
INFO:tensorflow:loss = 7.8820763, step = 1401 (0.404 sec)
INFO:tensorflow:global_step/sec: 276.765
INFO:tensorflow:loss = 10.244268, step = 1501 (0.361 sec)
INFO:tensorflow:global_step/sec: 288.975
INFO:tensorflow:loss = 4.288355, step = 1601 (0.346 sec)
INFO:tensorflow:global_step/sec: 288.279
INFO:tensorflow:loss = 6.974848, step = 1701 (0.347 sec)
INFO:tensorflow:global_step/sec: 292.614
INFO:tensorflow:loss = 2.9634063, step = 1801 (0.342 sec)
INFO:tensorflow:global_step/sec: 289.016
INFO:tensorflow:loss = 1.9827827, step = 1901 (0.346 sec)
INFO:tensorflow:global_step/sec: 296.63
INFO:tensorflow:loss = 9.702525, step = 2001 (0.337 sec)
INFO:tensorflow:global_step/sec: 297.65
INFO:tensorflow:loss = 12.115634, step = 2101 (0.336 sec)
INFO:tensorflow:global_step/sec: 300.723
INFO:tensorflow:loss = 6.0941725, step = 2201 (0.333 sec)
INFO:tensorflow:global_step/sec: 294.678
INFO:tensorflow:loss = 6.589897, step = 2301 (0.339 sec)
INFO:tensorflow:global_step/sec: 293.163
INFO:tensorflow:loss = 7.630992, step = 2401 (0.341 sec)
INFO:tensorflow:global_step/sec: 291.493
INFO:tensorflow:loss = 8.936813, step = 2501 (0.343 sec)
INFO:tensorflow:global_step/sec: 290.751
INFO:tensorflow:loss = 3.8139243, step = 2601 (0.344 sec)
INFO:tensorflow:global_step/sec: 295.185
INFO:tensorflow:loss = 5.111582, step = 2701 (0.339 sec)
INFO:tensorflow:global_step/sec: 289.957
INFO:tensorflow:loss = 12.047464, step = 2801 (0.345 sec)
INFO:tensorflow:global_step/sec: 283.826
INFO:tensorflow:loss = 7.286241, step = 2901 (0.352 sec)
INFO:tensorflow:global_step/sec: 281.05
INFO:tensorflow:loss = 6.200834, step = 3001 (0.356 sec)
INFO:tensorflow:global_step/sec: 284.173
INFO:tensorflow:loss = 8.660253, step = 3101 (0.352 sec)
INFO:tensorflow:global_step/sec: 288.516
INFO:tensorflow:loss = 7.3261228, step = 3201 (0.347 sec)
INFO:tensorflow:global_step/sec: 288.446
INFO:tensorflow:loss = 9.964396, step = 3301 (0.347 sec)
INFO:tensorflow:global_step/sec: 289.999
INFO:tensorflow:loss = 9.252493, step = 3401 (0.345 sec)
INFO:tensorflow:global_step/sec: 278.542
INFO:tensorflow:loss = 2.6778622, step = 3501 (0.359 sec)
INFO:tensorflow:Saving checkpoints for 3589 into output/model.ckpt.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-03-11-19:55:46
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from output/model.ckpt-3589
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [10/100]
INFO:tensorflow:Evaluation [20/100]
INFO:tensorflow:Evaluation [30/100]
INFO:tensorflow:Evaluation [40/100]
INFO:tensorflow:Evaluation [50/100]
INFO:tensorflow:Evaluation [60/100]
INFO:tensorflow:Evaluation [70/100]
INFO:tensorflow:Evaluation [80/100]
INFO:tensorflow:Evaluation [90/100]
INFO:tensorflow:Evaluation [100/100]
INFO:tensorflow:Finished evaluation at 2019-03-11-19:55:47
INFO:tensorflow:Saving dict for global step 3589: accuracy = 0.93525, accuracy_baseline = 0.84825003, auc = 0.9633356, auc_precision_recall = 0.8730494, average_loss = 0.1824272, global_step = 3589, label/mean = 0.15175, loss = 7.297088, precision = 0.8799127, prediction/mean = 0.12074697, recall = 0.66392094
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 3589: output/model.ckpt-3589
INFO:tensorflow:Loss for final step: 0.59415644.
Follow-up 2: I found a better way to reproduce this issue. The code I used is the census sample here: census_code. The data I used is the census data here: census data.
I have 32,561 rows of training data. Since batch_size is set to 40, one epoch takes 815 steps. When I run the following command:
gcloud ml-engine local train \
  --module-name trainer.task \
  --package-path trainer/ \
  --job-dir $MODEL_DIR \
  -- \
  --train-files /Users/test/data/census/adult.data.csv \
  --eval-files /Users/test/data/census/adult.test.csv \
  --num-epochs 1 \
  --train-steps 815 \
  --eval-steps 100
there is an export folder, and in the log I see INFO:tensorflow:Performing the final export in the end of training. However, if I set train-steps to anything greater than 815 (I instruct the input_fn to read exactly one epoch, as specified by num-epochs), there is no export folder and hence no saved_model.pb file.
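For reference, the arithmetic behind the 815-step threshold (a minimal sketch; it assumes the input pipeline keeps the final partial batch):

import math
# 32,561 examples at batch size 40: one epoch supplies
# ceil(32561 / 40) = 815 batches, so any train-steps value above 815
# asks for more batches than a single epoch can provide.
print(math.ceil(32561 / 40))  # 815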