I am following the [deeplab tutorial][1] to run semantic segmentation on the VOC dataset. This is the command line I used:
python deeplab/train.py \
--logtostderr \
--training_number_of_steps=30000 \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=513 \
--train_crop_size=513 \
--train_batch_size=1 \
--dataset="pascal_voc_seg" \
--tf_initial_checkpoint="/data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt" \
--train_logdir="/data/DL-Phase3/carvana/train_on_train_set/train" \
--dataset_dir="/data/DL-Phase3/VOCdevkit/VOC2012/tfrecord"
The error log is listed below. As far as I can tell, there are two main warnings/errors:
WARNING:tensorflow:Variable decoder/decoder_conv1_depthwise/BatchNorm/moving_mean missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Loss is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN)]]
WARNING:tensorflow:Variable decoder/decoder_conv1_depthwise/BatchNorm/moving_mean missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable aspp2_pointwise/BatchNorm/gamma missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable decoder/decoder_conv0_depthwise/BatchNorm/gamma/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable aspp2_pointwise/BatchNorm/moving_mean missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable aspp1_depthwise/depthwise_weights/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable aspp1_depthwise/BatchNorm/moving_variance missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable decoder/decoder_conv1_pointwise/BatchNorm/beta/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable decoder/decoder_conv0_pointwise/BatchNorm/beta/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable aspp3_depthwise/BatchNorm/gamma/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable decoder/decoder_conv1_pointwise/weights missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable aspp1_depthwise/BatchNorm/gamma/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable aspp0/BatchNorm/moving_variance missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable decoder/decoder_conv1_depthwise/BatchNorm/beta/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable aspp3_pointwise/BatchNorm/gamma/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable decoder/decoder_conv0_pointwise/weights missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable decoder/decoder_conv0_depthwise/BatchNorm/beta missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable image_pooling/BatchNorm/moving_mean missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable aspp3_pointwise/BatchNorm/gamma missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable image_pooling/weights/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable aspp0/weights missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable concat_projection/BatchNorm/beta missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable aspp1_depthwise/BatchNorm/gamma missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable decoder/decoder_conv1_pointwise/weights/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable decoder/decoder_conv0_depthwise/BatchNorm/moving_mean missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable decoder/decoder_conv0_pointwise/BatchNorm/gamma missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable image_pooling/BatchNorm/gamma missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:From /data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py:736: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-06-12 18:32:03.287833: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
INFO:tensorflow:Restoring parameters from /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /data/DL-Phase3/carvana/train_on_train_set/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Loss is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN)]]
Caused by op 'CheckNumerics', defined at:
File "deeplab/train.py", line 392, in <module>
tf.app.run()
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "deeplab/train.py", line 335, in main
total_loss = tf.check_numerics(total_loss, 'Loss is inf or nan.')
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 565, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
op_def=op_def)
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Loss is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN)]]
Traceback (most recent call last):
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1350, in _do_call
return fn(*args)
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1329, in _run_fn
status, run_metadata)
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Loss is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "deeplab/train.py", line 392, in <module>
tf.app.run()
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "deeplab/train.py", line 385, in main
save_interval_secs=FLAGS.save_interval_secs)
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 767, in train
sess, train_op, global_step, train_step_kwargs)
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1128, in _run
feed_dict_tensor, options, run_metadata)
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1344, in _do_run
options, run_metadata)
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1363, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Loss is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN)]]
Caused by op 'CheckNumerics', defined at:
File "deeplab/train.py", line 392, in <module>
tf.app.run()
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "deeplab/train.py", line 335, in main
total_loss = tf.check_numerics(total_loss, 'Loss is inf or nan.')
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 565, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
op_def=op_def)
File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Loss is inf or nan. : Tensor had NaN values
[[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN)]]
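To get an overview of the first warning, the "missing in checkpoint" lines can be grouped by their top-level scope. The following is only a minimal sketch using the Python standard library, not part of the deeplab tutorial. The helper name `summarize_missing` is made up for illustration, and the sample input is four lines copied from the log above. It counts optimizer slots (names ending in `/Momentum`) separately, on the assumption that a checkpoint holding only pretrained weights would typically not contain them.

```python
import re
from collections import Counter

# Matches lines such as:
# WARNING:tensorflow:Variable aspp0/weights missing in checkpoint /path/model.ckpt
MISSING_RE = re.compile(r"WARNING:tensorflow:Variable (\S+) missing in checkpoint")


def summarize_missing(log_text):
    """Count distinct missing variables per top-level scope.

    Returns two Counters: one for model variables and one for
    optimizer slots (variable names ending in '/Momentum').
    """
    weights = Counter()
    slots = Counter()
    # set() drops warnings that are repeated verbatim in the log.
    for name in set(MISSING_RE.findall(log_text)):
        scope = name.split("/")[0]
        if name.endswith("/Momentum"):
            slots[scope] += 1
        else:
            weights[scope] += 1
    return weights, slots


# Four warning lines copied from the log above.
sample = """\
WARNING:tensorflow:Variable aspp0/weights missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable image_pooling/weights/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable decoder/decoder_conv0_pointwise/weights missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable decoder/decoder_conv1_pointwise/weights missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
"""

weights, slots = summarize_missing(sample)
for scope in sorted(set(weights) | set(slots)):
    print(scope, "weights:", weights[scope], "optimizer slots:", slots[scope])
```

Running the same function over the full log (for example after redirecting stderr to a file) would show whether any scope other than `aspp*`, `decoder`, `image_pooling` and `concat_projection` is reported missing. It says nothing by itself about the cause of the NaN loss.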