Question

I am following the [deeplab tutorial][1] to run semantic segmentation on the VOC dataset. This is the command line I used.

python deeplab/train.py \
    --logtostderr \
    --training_number_of_steps=30000 \
    --train_split="train" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_crop_size=513 \
    --train_crop_size=513 \
    --train_batch_size=1 \
    --dataset="pascal_voc_seg" \
    --tf_initial_checkpoint="/data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt" \
    --train_logdir="/data/DL-Phase3/carvana/train_on_train_set/train" \
    --dataset_dir="/data/DL-Phase3/VOCdevkit/VOC2012/tfrecord"

The error log is listed below. As far as I can tell, there are two main warnings/errors:

WARNING:tensorflow:Variable decoder/decoder_conv1_depthwise/BatchNorm/moving_mean missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt


INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Loss is inf or nan. : Tensor had NaN values                                                                                                          
         [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN)]]    




WARNING:tensorflow:Variable decoder/decoder_conv1_depthwise/BatchNorm/moving_mean missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                              
WARNING:tensorflow:Variable aspp2_pointwise/BatchNorm/gamma missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                                    
WARNING:tensorflow:Variable decoder/decoder_conv0_depthwise/BatchNorm/gamma/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                           
WARNING:tensorflow:Variable aspp2_pointwise/BatchNorm/moving_mean missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                              
WARNING:tensorflow:Variable aspp1_depthwise/depthwise_weights/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                         
WARNING:tensorflow:Variable aspp1_depthwise/BatchNorm/moving_variance missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                          
WARNING:tensorflow:Variable decoder/decoder_conv1_pointwise/BatchNorm/beta/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                            
WARNING:tensorflow:Variable decoder/decoder_conv0_pointwise/BatchNorm/beta/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                            
WARNING:tensorflow:Variable aspp3_depthwise/BatchNorm/gamma/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                           
WARNING:tensorflow:Variable decoder/decoder_conv1_pointwise/weights missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                            
WARNING:tensorflow:Variable aspp1_depthwise/BatchNorm/gamma/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                           
WARNING:tensorflow:Variable aspp0/BatchNorm/moving_variance missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                                    
WARNING:tensorflow:Variable decoder/decoder_conv1_depthwise/BatchNorm/beta/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                            
WARNING:tensorflow:Variable aspp3_pointwise/BatchNorm/gamma/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                           
WARNING:tensorflow:Variable decoder/decoder_conv0_pointwise/weights missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                            
WARNING:tensorflow:Variable decoder/decoder_conv0_depthwise/BatchNorm/beta missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                     
WARNING:tensorflow:Variable image_pooling/BatchNorm/moving_mean missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                                
WARNING:tensorflow:Variable aspp3_pointwise/BatchNorm/gamma missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                                    
WARNING:tensorflow:Variable image_pooling/weights/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                                     
WARNING:tensorflow:Variable aspp0/weights missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable concat_projection/BatchNorm/beta missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                                   
WARNING:tensorflow:Variable aspp1_depthwise/BatchNorm/gamma missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                                    
WARNING:tensorflow:Variable decoder/decoder_conv1_pointwise/weights/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                   
WARNING:tensorflow:Variable decoder/decoder_conv0_depthwise/BatchNorm/moving_mean missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                              
WARNING:tensorflow:Variable decoder/decoder_conv0_pointwise/BatchNorm/gamma missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                    
WARNING:tensorflow:Variable image_pooling/BatchNorm/gamma missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                                                                                                                      
WARNING:tensorflow:From /data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py:736: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.                                                                                                                                    
Instructions for updating:                                                                                                            
Please switch to tf.train.MonitoredTrainingSession                                                                                    
2018-06-12 18:32:03.287833: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX                                                                                      
INFO:tensorflow:Restoring parameters from /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt                      
INFO:tensorflow:Starting Session.                                                                                                     
INFO:tensorflow:Saving checkpoint to path /data/DL-Phase3/carvana/train_on_train_set/train/model.ckpt               
INFO:tensorflow:Starting Queues.                                                                                                      
INFO:tensorflow:global_step/sec: 0                                                                                                    
INFO:tensorflow:Recording summary at step 0.                                                                                          
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Loss is inf or nan. : Tensor had NaN values                                                                                                          
         [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN)]]                                                                                                                   

Caused by op 'CheckNumerics', defined at:
  File "deeplab/train.py", line 392, in <module>
    tf.app.run()                                
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))                                                                                                             
  File "deeplab/train.py", line 335, in main                                                                                          
    total_loss = tf.check_numerics(total_loss, 'Loss is inf or nan.')                                                                 
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 565, in check_numerics                                                                                                                      
    "CheckNumerics", tensor=tensor, message=message, name=name)                                                                       
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper                                                                                                             
    op_def=op_def)                                                                                                                    
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3160, in create_op                                                                                                                              
    op_def=op_def)                                                                                                                    
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1625, in __init__                                                                                                                               
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access                                                

InvalidArgumentError (see above for traceback): Loss is inf or nan. : Tensor had NaN values
         [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN)]]                                                                                                                   

Traceback (most recent call last):
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1350, in _do_call                                                                                                                              
    return fn(*args)                                                                                                                  
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1329, in _run_fn                                                                                                                               
    status, run_metadata)                                                                                                             
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__                                                                                                                        
    c_api.TF_GetCode(self.status.status))                                                                                             
tensorflow.python.framework.errors_impl.InvalidArgumentError: Loss is inf or nan. : Tensor had NaN values                             
         [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN)]]                                                                                                                   

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "deeplab/train.py", line 392, in <module>
    tf.app.run()                                
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))                                                                                                             
  File "deeplab/train.py", line 385, in main                                                                                          
    save_interval_secs=FLAGS.save_interval_secs)                                                                                      
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 767, in train                                                                                                                      
    sess, train_op, global_step, train_step_kwargs)                                                                                   
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step                                                                                                                 
    run_metadata=run_metadata)                                                                                                        
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 895, in run                                                                                                                                    
    run_metadata_ptr)
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1344, in _do_run
    options, run_metadata)
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Loss is inf or nan. : Tensor had NaN values
         [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN)]]

Caused by op 'CheckNumerics', defined at:
  File "deeplab/train.py", line 392, in <module>
    tf.app.run()
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "deeplab/train.py", line 335, in main
    total_loss = tf.check_numerics(total_loss, 'Loss is inf or nan.')
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 565, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
    op_def=op_def)
  File "/data/virtualE/tensorflow15/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Loss is inf or nan. : Tensor had NaN values
         [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="Loss is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN)]]
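
To make the long list of "missing in checkpoint" warnings easier to read, here is a minimal sketch that groups the reported variable names by their top-level scope and separates optimizer slots (names ending in `/Momentum`) from model variables. The function name `summarize_missing` and the shortened `sample_log` are hypothetical and only for illustration; the regular expression assumes the exact warning format shown in the log above.

```python
import re
from collections import defaultdict

# Assumed format, copied from the warnings in the log above:
# "WARNING:tensorflow:Variable <name> missing in checkpoint <path>"
MISSING_RE = re.compile(
    r"WARNING:tensorflow:Variable (\S+) missing in checkpoint (\S+)"
)


def summarize_missing(log_text):
    """Group variables reported as missing by their top-level scope."""
    groups = defaultdict(lambda: {"weights": set(), "optimizer_slots": set()})
    for match in MISSING_RE.finditer(log_text):
        name = match.group(1)
        scope = name.split("/", 1)[0]
        # Names ending in /Momentum are optimizer slot variables,
        # not model weights or batch-norm statistics.
        kind = "optimizer_slots" if name.endswith("/Momentum") else "weights"
        groups[scope][kind].add(name)
    return {
        scope: {kind: sorted(names) for kind, names in kinds.items()}
        for scope, kinds in groups.items()
    }


# Hypothetical excerpt: three warning lines taken from the log above.
sample_log = """\
WARNING:tensorflow:Variable decoder/decoder_conv1_depthwise/BatchNorm/moving_mean missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable aspp2_pointwise/BatchNorm/gamma missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
WARNING:tensorflow:Variable aspp1_depthwise/depthwise_weights/Momentum missing in checkpoint /data/DL-Phase3/deeplab-modelzoo/xception/model.ckpt
"""

summary = summarize_missing(sample_log)
for scope in sorted(summary):
    kinds = summary[scope]
    print(scope, len(kinds["weights"]), len(kinds["optimizer_slots"]))
```

Run against the full log, this shows at a glance which scopes the warnings refer to; in the log above they all fall under `decoder`, `aspp*`, `image_pooling`, and `concat_projection`.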

Error message from running deeplab v3 on the VOC dataset

0 answers: