Question

I trained DeepLab on a TPU with a custom dataset, using the following parameters:

python /home/rkm/tpu/models/experimental/deeplab/main.py \
--tpu=rkm \
--mode='train' \
--num_shards=8 \
--alsologtostderr=true \
--model_dir=${MODEL_DIR} \
--dataset_dir=${DATASET_DIR} \
--init_checkpoint=${INIT_CHECKPOINT} \
--train_split="train" \
--eval_split="val" \
--train_steps=5000 \
--steps_per_eval=2 \
--train_batch_size=64 \
--eval_batch_size=8 \
--model_variant=xception_65 \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--image_pyramid=1. \
--aspp_with_separable_conv=false \
--multi_grid=1 \
--multi_grid=2 \
--multi_grid=4 \
--decoder_use_separable_conv=false \
--use_tpu=True

I then downloaded the model to my computer and tried to evaluate it on a GPU, with the following parameters:

python "${WORK_DIR}"/eval.py \
--logtostderr \
--eval_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--eval_crop_size=513 \
--eval_crop_size=513 \
--checkpoint_dir="${TRAIN_LOGDIR}" \
--eval_logdir="${EVAL_LOGDIR}" \
--dataset_dir="${DATASET}" \
--max_number_of_evaluations=1

The error I get is:

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
     [[node save/RestoreV2 (defined at models_new/research/deeplab/vis.py:263) ]]

Based on this question, I set atrous_rates in both the training and the evaluation command, but the error still occurs.
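
The missing key contains `depthwise`, which hints that the evaluation graph builds separable ASPP convolutions, while the training command set `--aspp_with_separable_conv=false` and `--decoder_use_separable_conv=false` and the evaluation command sets neither. One way to spot this kind of mismatch is to diff the architecture flags of the two commands. The following is only a minimal sketch: `parse_flags`, `diff_flags` and the `ARCH_FLAGS` selection are hypothetical helpers written for illustration, not part of DeepLab, and the command strings are abbreviated copies of the ones above.

```python
import shlex
from collections import defaultdict

def parse_flags(command):
    """Collect --name=value flags; repeated flags accumulate into a list."""
    # Join shell line continuations so shlex does not see stray newlines.
    command = command.replace("\\\n", " ")
    flags = defaultdict(list)
    for token in shlex.split(command):
        if not token.startswith("--"):
            continue
        name, _, value = token[2:].partition("=")
        flags[name].append(value)
    return dict(flags)

def diff_flags(train_cmd, eval_cmd, names):
    """Return {flag: (train_value, eval_value)} for flags that differ."""
    train, evaluation = parse_flags(train_cmd), parse_flags(eval_cmd)
    report = {}
    for name in names:
        t, e = train.get(name), evaluation.get(name)
        if t != e:
            report[name] = (t, e)  # None means the flag was not passed
    return report

# Abbreviated copies of the commands from the question (assumption: only
# the flags that shape the network are relevant here).
TRAIN_CMD = (
    "main.py --model_variant=xception_65 "
    "--atrous_rates=6 --atrous_rates=12 --atrous_rates=18 "
    "--output_stride=16 --aspp_with_separable_conv=false "
    "--multi_grid=1 --multi_grid=2 --multi_grid=4 "
    "--decoder_use_separable_conv=false"
)
EVAL_CMD = (
    'eval.py --model_variant="xception_65" '
    "--atrous_rates=6 --atrous_rates=12 --atrous_rates=18 "
    "--output_stride=16 --decoder_output_stride=4"
)

# Hypothetical selection of flags that affect variable names in the graph.
ARCH_FLAGS = [
    "model_variant", "atrous_rates", "output_stride",
    "aspp_with_separable_conv", "decoder_use_separable_conv", "multi_grid",
]

mismatches = diff_flags(TRAIN_CMD, EVAL_CMD, ARCH_FLAGS)
for name, (train_value, eval_value) in mismatches.items():
    print(f"{name}: train={train_value} eval={eval_value}")
```

A flag reported with `eval=None` falls back to whatever default the evaluation script defines. If that default differs from the training value, the evaluation graph may expect variables such as `aspp1_depthwise/...` that were never written to the checkpoint. Passing the same values to the evaluation command would be the first thing to try.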

TensorFlow DeepLab: model trained on TPU raises NotFoundError during evaluation on GPU

0 answers: