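I am trying to validate my model every 5000 steps. Validation uses other pre-trained models to evaluate the model being trained. The first 5000 training steps run without errors, and the validation itself runs fine. However, on the forward pass of step 5001 I get this error: ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[22,32,256,256]. The tensor that runs out of memory is somewhere in the graph of the model being trained, not in the pre-trained validation models. I guess this happens because the pre-trained models used for validation are never released?
I would like to know the correct way to validate a model during training. Am I doing something wrong? Could someone tell me the right way to run validation during training? Many thanks!
Here is the simplified code: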
with tf.Session(config=config) as sess:
    sess.run(init)
    try:
        for step in range(STEP_START, TRAINING_STEPS):
            def should(freq):
                return freq > 0 and ((step + 1) % freq == 0 or step == TRAINING_STEPS - 1)
            fetches = {
                'train': model.train,
                'global_step': model.incr_global_step,
                'update_ops': model.update_ops,
            }
            results = sess.run(fetches)
            if should(args.validation_freq):
                psnr, ssim = validation_psnr_ssim.evaluation(isTest=0, CHECKPOINT_DIR=MODEL_SAVE_PATH, num=args.validation_num)
                acd = validation_acd.evaluation()
                mean_wer, mean_wer_norm = validation_wer.evaluation()
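
One workaround I am considering, in case the validation models really do stay on the GPU: run each validation in a separate process, so whatever memory it allocates is returned when that process exits. Below is a minimal sketch of that idea, not a confirmed fix. It assumes each validation routine can be moved into a standalone script that loads the latest checkpoint and prints its metrics as a JSON object on the last line of stdout. The helper name `run_validation_subprocess`, the script name `validate_psnr_ssim.py`, and its flags are all hypothetical. It also assumes the training process leaves some GPU memory free for the child, for example via `config.gpu_options.allow_growth = True` or `config.gpu_options.per_process_gpu_memory_fraction`.

```python
import json
import subprocess
import sys


def run_validation_subprocess(cmd, timeout=None):
    """Run a validation command in a child process and return its metrics.

    The child is expected (by assumption) to print a JSON object as the
    last non-empty line of stdout. Any memory the child allocates is
    released when it exits.
    """
    proc = subprocess.run(
        cmd,
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode output as str
        timeout=timeout,
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    lines = [line for line in proc.stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("validation command produced no output")
    return json.loads(lines[-1])


# Hypothetical use inside the training loop (script name and flags are
# placeholders, not part of the original code):
#
#     if should(args.validation_freq):
#         metrics = run_validation_subprocess([
#             sys.executable, "validate_psnr_ssim.py",
#             "--checkpoint_dir", MODEL_SAVE_PATH,
#             "--num", str(args.validation_num),
#         ])
#         psnr, ssim = metrics["psnr"], metrics["ssim"]
```

The trade-off is that the validation models are reloaded on every validation round, which is slower, but the training process never holds their graphs or weights.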