This is my first time playing with PyTorch, and I've noticed that when training my neural network, roughly a quarter of the time the loss shoots off toward infinity and shortly afterwards becomes nan. I've seen some other questions about nan losses, but the advice there essentially boils down to normalizing the inputs; the very first layer of my network is exactly such a normalization, though, and I still see the problem! The full network is a bit convoluted, so I've done some debugging to produce a small, easy-to-follow network that still shows the same issue.
The code is below; it takes sixteen inputs (in [0, 1]), passes them through batch normalization and then a fully connected layer to a single output. I want it to learn the function that always outputs 1, so I use the squared error from 1 as the loss.
import torch as t
import torch.nn as tn
import torch.optim as to

if __name__ == '__main__':
    board = t.rand([1,1,1,16])
    net = tn.Sequential \
        ( tn.BatchNorm2d(1)
        , tn.Conv2d(1, 1, [1,16])
        )
    optimizer = to.SGD(net.parameters(), lr=0.1)
    for i in range(10):
        net.zero_grad()
        nn_outputs = net.forward(board)
        loss = t.sum((nn_outputs - 1)**2)
        print(i, nn_outputs, loss)
        loss.backward()
        optimizer.step()
If you run it a few times, you will eventually see a run that looks like this:
0 tensor([[[[-0.7594]]]], grad_fn=<MkldnnConvolutionBackward>) tensor(3.0953, grad_fn=<SumBackward0>)
1 tensor([[[[4.0954]]]], grad_fn=<MkldnnConvolutionBackward>) tensor(9.5812, grad_fn=<SumBackward0>)
2 tensor([[[[5.5210]]]], grad_fn=<MkldnnConvolutionBackward>) tensor(20.4391, grad_fn=<SumBackward0>)
3 tensor([[[[-3.4042]]]], grad_fn=<MkldnnConvolutionBackward>) tensor(19.3966, grad_fn=<SumBackward0>)
4 tensor([[[[823.6523]]]], grad_fn=<MkldnnConvolutionBackward>) tensor(676756.7500, grad_fn=<SumBackward0>)
5 tensor([[[[3.5471e+08]]]], grad_fn=<MkldnnConvolutionBackward>) tensor(1.2582e+17, grad_fn=<SumBackward0>)
6 tensor([[[[2.8560e+25]]]], grad_fn=<MkldnnConvolutionBackward>) tensor(inf, grad_fn=<SumBackward0>)
7 tensor([[[[inf]]]], grad_fn=<MkldnnConvolutionBackward>) tensor(inf, grad_fn=<SumBackward0>)
8 tensor([[[[nan]]]], grad_fn=<MkldnnConvolutionBackward>) tensor(nan, grad_fn=<SumBackward0>)
9 tensor([[[[nan]]]], grad_fn=<MkldnnConvolutionBackward>) tensor(nan, grad_fn=<SumBackward0>)
Why is my loss going to nan, and what can I do about it?
Answer 0 (score: 1)
Welcome to PyTorch!
Here is how I would set up your training. Please check the comments.
# how the community usually does the imports:
import torch  # some people do: import torch as th
import torch.nn as nn
import torch.optim as optim

if __name__ == '__main__':
    # setting some parameters:
    batch_size = 32
    n_dims = 128
    # select GPU if available
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    # initializing a simple neural net
    net = nn.Sequential(nn.Linear(n_dims, n_dims // 2),  # batch norm is not usually used directly on the input
                        nn.BatchNorm1d(n_dims // 2),     # batch norm is used before the activation function (it centers the input and helps make the dims of the previous layers independent of each other)
                        nn.ReLU(),                       # the most common activation function
                        nn.Linear(n_dims // 2, 1))       # final layer
    net.to(device)  # the model is copied to the GPU if it is available
    optimizer = optim.SGD(net.parameters(), lr=0.01)  # it is better to start with a low lr and increase it in later experiments to avoid training divergence; the range [1.e-6, 5.e-2] is recommended
    for i in range(10):
        # generating random data:
        board = torch.rand([batch_size, n_dims])
        # for sequences:  [batch_size, channels, L]
        # for image data: [batch_size, channels, W, H]
        # for videos:     [batch_size, channels, L, W, H]
        board = board.to(device)  # the data is copied to the GPU if it is available
        optimizer.zero_grad()  # the convention the community uses, though the result is the same as net.zero_grad()
        nn_outputs = net(board)  # don't call net.forward(x), call net(x); PyTorch applies some hooks in net.__call__(x) that are needed for backpropagation
        loss = ((nn_outputs - 1)**2).mean()  # using .mean() makes your training less sensitive to the batch size
        print(i, nn_outputs, loss.item())
        loss.backward()
        optimizer.step()
One more comment about batch norm. Depending on the dimension, it computes the mean and standard deviation over the batch (check the docs: https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html#torch.nn.BatchNorm2d):
x_normalized = (x - x.mean(dim=0)) / (x.std(dim=0) + 1e-6) * scale + shift
scale and shift are learnable parameters. If you only provide one example per batch, x.std(0) = 0, and that will make x_normalized blow up to very large values.
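As a side illustration (my own sketch, not from the original answer), here is that normalization written out by hand on a small random batch; the tensor shapes and the 1e-6 epsilon are just assumptions for the example:

import torch

x = torch.rand(4, 3)              # a batch of 4 samples with 3 features
scale = torch.ones(3)             # learnable gamma, initialized to 1
shift = torch.zeros(3)            # learnable beta, initialized to 0

mean = x.mean(dim=0)              # per-feature mean over the batch dimension
std = x.std(dim=0)                # per-feature std over the batch dimension
x_normalized = (x - mean) / (std + 1e-6) * scale + shift

# each feature now has (roughly) zero mean and unit std across the batch
print(x_normalized.mean(dim=0))
print(x_normalized.std(dim=0))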
Answer 1 (score: 0)
For batch norm to be effective, you want your training batches to be larger.
In the case of 2d or 3d batch norm, the "effective batch" also includes the spatial/temporal dimensions of the feature maps. In your example you have batch_size=1 and only 16 samples arranged spatially. That is probably too few for batch norm to work effectively.
Try increasing the "effective batch":
board = t.rand([1, 1, 16, 16])
# or
board = t.rand([1, 16, 1, 16])
and see how this affects your training.
BTW, what happens in your toy example if you remove the batch norm?
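For reference, a minimal sketch of that experiment (my own, assuming we simply drop the BatchNorm2d layer from the question's toy setup and keep everything else the same):

import torch as t
import torch.nn as tn
import torch.optim as to

board = t.rand([1, 1, 1, 16])
net = tn.Conv2d(1, 1, [1, 16])  # the toy network without the batch norm layer
optimizer = to.SGD(net.parameters(), lr=0.1)
for i in range(10):
    optimizer.zero_grad()
    nn_outputs = net(board)
    loss = t.sum((nn_outputs - 1) ** 2)
    print(i, nn_outputs, loss.item())
    loss.backward()
    optimizer.step()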