I am trying to run a Faster R-CNN model on a custom dataset, following the torchvision object detection example.
However, I have noticed that during training, if xmax is smaller than xmin, the loss for rpn_box_reg becomes NaN. Here, xmin and ymin represent the top-left corner and xmax and ymax the bottom-right corner. This is a summary of the output I get when printing the bounding boxes:
tensor([[ 44., 108., 49., 224.],
[ 29., 73., 210., 230.],
[ 31., 58., 139., 228.],
[ 22., 43., 339., 222.]], device='cuda:0')
Epoch: [0] [ 0/1173] eta: 0:09:46 lr: 0.000000 loss: 9.3683 (9.3683) loss_classifier: 1.7522 (1.7522) loss_box_reg: 0.0755 (0.0755) loss_objectness: 6.1522 (6.1522) loss_rpn_box_reg: 1.3884 (1.3884) time: 0.4997 data: 0.1162 max mem: 5696
tensor([[ 0., 0., 640., 512.]], device='cuda:0')
tensor([[ 28., 57., 197., 220.]], device='cuda:0')
tensor([[ 23., 46., 281., 222.]], device='cuda:0')
tensor([[ 20., 28., 328., 210.]], device='cuda:0')
tensor([[ 37., 45., 47., 161.],
[ 31., 39., 111., 154.]], device='cuda:0')
tensor([[ 0., 0., 640., 512.]], device='cuda:0')
tensor([[ 33., 85., 546., 222.],
[ 31., 85., 527., 213.]], device='cuda:0')
tensor([[ 40., 76., 29., 211.],
[ 64., 51., 26., 206.],
[ 40., 77., 1., 221.]], device='cuda:0')
Loss is nan, stopping training
{'loss_classifier': tensor(1.78, device='cuda:0', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(0., device='cuda:0', grad_fn=<DivBackward0>), 'loss_objectness': tensor(16.28, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_rpn_box_reg': tensor(nan, device='cuda:0', grad_fn=<DivBackward0>)}
An exception has occurred, use %tb to see the full traceback
As you can see, each box is given as [xmin, ymin, xmax, ymax].
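A quick way to spot offending annotations before they reach the model is to scan each target for boxes whose xmax is not greater than xmin (or ymax not greater than ymin). This is just a sketch; `find_invalid_boxes` is a hypothetical helper name, not part of torchvision:

```python
import torch

def find_invalid_boxes(boxes: torch.Tensor) -> torch.Tensor:
    """Return indices of boxes (shape [N, 4], [xmin, ymin, xmax, ymax])
    where xmax <= xmin or ymax <= ymin."""
    bad = (boxes[:, 2] <= boxes[:, 0]) | (boxes[:, 3] <= boxes[:, 1])
    return torch.nonzero(bad).flatten()

# First row matches the failing batch above: xmax (29) < xmin (40)
boxes = torch.tensor([[40., 76., 29., 211.],
                      [28., 57., 197., 220.]])
print(find_invalid_boxes(boxes).tolist())  # [0]
```

Running this over the whole dataset once (before training) would tell you exactly which annotations are flipped.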
I have tried lowering the learning rate, but I still get the same error:
optimizer = torch.optim.SGD(params, lr=0.00001,
                            momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                               step_size=1,
                                               gamma=0.1)
The problem seems to be that the loss_rpn_box_reg becomes NaN whenever the x1 (xmax) value is smaller than the x2 (xmin) value. For example, for the image below the bounding box is tensor([[53., 89., 7., 226.]]), i.e. [x2, y2, x1, y1]. When the x1 value is smaller than x2, the loss becomes NaN; however, when x1 > x2, the loss is fine and it actually trains well. As you can see, the values themselves are correct, since the cyclist gets the correct bounding box when it is drawn from these values. I hope this makes the problem I am facing clearer.
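Since the coordinate values themselves look correct and only their order is sometimes flipped, one possible workaround (a sketch under that assumption; `fix_boxes` is a hypothetical helper, e.g. to be called in the dataset's `__getitem__` before building the target dict) is to reorder each box so that the smaller coordinate always comes first, and drop any degenerate boxes with zero width or height:

```python
import torch

def fix_boxes(boxes: torch.Tensor) -> torch.Tensor:
    """Reorder each box (shape [N, 4]) so that (xmin, ymin) precedes
    (xmax, ymax), then drop boxes with zero width or height."""
    x1 = torch.min(boxes[:, 0], boxes[:, 2])
    y1 = torch.min(boxes[:, 1], boxes[:, 3])
    x2 = torch.max(boxes[:, 0], boxes[:, 2])
    y2 = torch.max(boxes[:, 1], boxes[:, 3])
    fixed = torch.stack([x1, y1, x2, y2], dim=1)
    # keep only boxes with strictly positive width and height
    keep = (fixed[:, 2] > fixed[:, 0]) & (fixed[:, 3] > fixed[:, 1])
    return fixed[keep]

# The flipped box from the example above, stored as [x2, y2, x1, y1]
boxes = torch.tensor([[53., 89., 7., 226.]])
print(fix_boxes(boxes).tolist())  # [[7.0, 89.0, 53.0, 226.0]]
```

This keeps every annotation usable while guaranteeing the xmin < xmax, ymin < ymax invariant that torchvision's detection models expect; fixing the annotations at the source would of course be the cleaner long-term solution.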