(Reply to bug: https://github.com/zihualiu/pytorch_linear_bug)
I recently ran into a strange bug in PyTorch and I'm hoping you can help. One of my networks has a fully connected layer, referred to as net.fc_h1. During training, I noticed that this layer was outputting NaN before the activation, so I dropped into pdb hoping it would tell me something. Here is the log:
import pdb
import numpy as np
import torch.nn.functional as F

# in the network declaration:
def forward(self, obs):
    z1 = self.fc_h1(obs)
    # drop into the debugger as soon as the linear layer emits a NaN
    if np.isnan(np.sum(z1.data.numpy())):
        pdb.set_trace()
    h1 = F.tanh(z1)
    ...
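Incidentally, the NumPy round-trip in that guard isn't strictly necessary. A torch-native sketch of the same check (assuming a PyTorch version where tensor comparisons can be used this way; on >= 0.4 the direct spelling is torch.isnan(z1).any()):

# NaN is the only value that compares unequal to itself,
# so this condition holds iff z1 contains at least one NaN.
if (z1 != z1).any():
    pdb.set_trace()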
It did catch a NaN, but then I noticed in pdb that if I reran the same operation, the result was striking:
(Pdb) z1.sum()
Variable containing:
nan
[torch.FloatTensor of size 1]
(Pdb) self.fc_h1(obs).sum()
Variable containing:
771.5120
[torch.FloatTensor of size 1]
When I check whether my input or weights contain NaN, I get the following:
(Pdb) self.fc_h1.weight.max()
Variable containing:
0.2482
[torch.FloatTensor of size 1]
(Pdb) self.fc_h1.weight.mean()
Variable containing:
1.00000e-03 *
1.7761
[torch.FloatTensor of size 1]
(Pdb) self.fc_h1.weight.min()
Variable containing:
-0.2504
[torch.FloatTensor of size 1]
(Pdb) obs.max()
Variable containing:
6.9884
[torch.FloatTensor of size 1]
(Pdb) obs.min()
Variable containing:
-6.7855
[torch.FloatTensor of size 1]
(Pdb) obs.mean()
Variable containing:
1.00000e-02 *
-1.5033
[torch.FloatTensor of size 1]
(Pdb) self.fc_h1.bias.max()
Variable containing:
0.2482
[torch.FloatTensor of size 1]
(Pdb) self.fc_h1.bias.mean()
Variable containing:
1.00000e-03 *
3.9104
[torch.FloatTensor of size 1]
(Pdb) self.fc_h1.bias.min()
Variable containing:
-0.2466
[torch.FloatTensor of size 1]
So the input, weights, and bias all seem fine. Any insight into how a linear layer can produce NaN when everything feeding into it is well-formed?
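One caveat about spot checks like these: NaN compares false against every value, so reductions such as max and min are not guaranteed to surface a NaN even when one is present. A more direct diagnostic (a sketch reusing the names from the post, not a fix) is to count the NaN entries explicitly:

# (t != t) marks exactly the NaN entries, since NaN != NaN.
for name, t in [('weight', self.fc_h1.weight.data),
                ('bias', self.fc_h1.bias.data),
                ('obs', obs.data)]:
    print(name, 'NaN count:', int((t != t).sum()))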
Edit: it gets stranger. I tried running the forward pass again, and interestingly, repeated forward passes give me different results:
(Pdb) self.fc_h1(obs)
Variable containing:
2.2321e-01 -6.2586e-01 -1.9004e-01 ... -4.2521e-01 8.6175e-01 8.6866e-01
-7.2699e-02 7.8234e-01 -5.8862e-01 ... 2.4041e-01 -1.7577e-01 6.9928e-01
-7.2699e-02 7.8234e-01 -5.8862e-01 ... 2.4041e-01 -1.7577e-01 6.9928e-01
... ⋱ ...
-6.4686e-02 -1.5819e+00 5.7410e-01 ... -6.4127e-01 5.2837e-01 -1.3166e+00
3.9214e-01 2.8727e-01 -5.5699e-01 ... -8.3164e-01 -5.1795e-01 -3.7637e-01
-9.6061e-01 1.4780e-01 5.3614e-02 ... -1.5042e+00 6.0759e-02 -3.6862e-01
[torch.FloatTensor of size 4096x170]
(Pdb) self.fc_h1(obs)
Variable containing:
2.2321e-01 -6.2586e-01 -1.9004e-01 ... -4.2521e-01 8.6175e-01 8.6866e-01
-7.2699e-02 7.8234e-01 -5.8862e-01 ... 2.4041e-01 -1.7577e-01 6.9928e-01
-7.2699e-02 7.8234e-01 -5.8862e-01 ... 2.4041e-01 -1.7577e-01 6.9928e-01
... ⋱ ...
nan nan nan ... nan 5.2837e-01 -1.3166e+00
nan nan nan ... nan -5.1795e-01 -3.7637e-01
nan nan nan ... nan 6.0759e-02 -3.6862e-01
[torch.FloatTensor of size 4096x170]
I'm not using a GPU either, just the CPU.
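Getting different results from the same matrix multiply on repeated calls, with no GPU involved, points at the parallel CPU kernel rather than the data. One way to test that hypothesis (a diagnostic sketch, not a suggestion from the original thread) is to pin PyTorch to a single thread and see whether the run-to-run differences and the NaNs disappear:

import torch
torch.set_num_threads(1)  # disable intra-op parallelism while debugging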
Answer 0 (score: 0)
For me, I had been copying the code from the RNN name-classification example. That example was doing everything manually and updating the weights by hand, whereas I added an optimizer and criterion on top of it. In doing so I inadvertently passed a momentum value to the optimizer, and that was what was causing my problem.
Setting momentum back to its default value of 0 fixed the problem.
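A minimal sketch of the change described above (the learning rate and variable names are illustrative, not taken from the original post):

import torch.optim as optim

# What caused the NaNs: a momentum term added on top of an example
# that originally updated its weights by hand.
# optimizer = optim.SGD(net.parameters(), lr=0.005, momentum=0.9)

# The fix: leave momentum at its default of 0.
optimizer = optim.SGD(net.parameters(), lr=0.005)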