RuntimeError: Function 'DivBackward0' returned nan values in its 0th output, but it works when the tensors are loaded from disk?

Date: 2019-07-12 20:32:20

Tags: python python-3.6 pytorch

I'm trying to implement a particular loss function in PyTorch called SMAPE (commonly used in time-series forecasting). I have two variables, `model_outputs` and `target_outputs`, and the formula for the element-wise SMAPE is simple:

numerator = torch.abs(model_outputs - target_outputs)
denominator = torch.abs(model_outputs) + torch.abs(target_outputs)
elementwise_smape = torch.div(numerator, denominator)
nan_mask = torch.isnan(elementwise_smape)
loss = elementwise_smape[~nan_mask].mean()
assert ~torch.isnan(loss)  # loss = 0.023207199
loss.backward()

But when I run the code, I receive the following error: `RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.`

I used `autograd.detect_anomaly()` to trace the error to the line `elementwise_smape = torch.div(numerator, denominator)`. I tried the solutions suggested online (checking for NaN values, adding hooks, printing gradients), but nothing helped. So I decided to write the tensors to disk and build a standalone script:

import torch


model_outputs = torch.load('model_outputs.pt')
target_outputs = torch.load('target_outputs.pt')
print(model_outputs)
print(target_outputs)

model_outputs.requires_grad = True
target_outputs.requires_grad = False

numerator = torch.abs(model_outputs - target_outputs)
denominator = torch.abs(model_outputs) + torch.abs(target_outputs)
elementwise_smape = torch.div(numerator, denominator)
nan_mask = torch.isnan(elementwise_smape)
loss = elementwise_smape[~nan_mask].mean()
assert ~torch.isnan(loss)
print(loss.detach().numpy())
loss.backward()
print('Success!')

This works perfectly fine! To check that the problem isn't caused by any of the variables that `elementwise_smape` depends on (`model_outputs`, `numerator`, `denominator`), I tried substituting each of them in place of the dependent variable in the loss, and every one of them works:

nan_mask = torch.isnan(elementwise_smape)
loss = model_outputs[~nan_mask].mean()
loss.backward()  # succeeds

nan_mask = torch.isnan(elementwise_smape)
loss = numerator[~nan_mask].mean()
loss.backward()  # succeeds

nan_mask = torch.isnan(elementwise_smape)
loss = denominator[~nan_mask].mean()
loss.backward()  # succeeds

What exactly happens in the element-wise division that causes the problem when the model generates `model_outputs`, but not when `model_outputs` is loaded from disk?
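One way this exact error can arise (a sketch, not a confirmed diagnosis of my setup): assume anomaly detection is enabled and `target_outputs` contains a NaN, as Update #2 below shows mine does. Masking the NaN out of the loss does not help, because during backward `DivBackward0` still divides the incoming (zero) gradient by the NaN denominator, and `0 / nan = nan`, which anomaly detection flags. All values below are made up for illustration:

```python
import torch

# Assumption: anomaly detection is on, as in the failing training run.
torch.autograd.set_detect_anomaly(True)

model_outputs = torch.tensor([0.5, 1.0], requires_grad=True)
target_outputs = torch.tensor([float('nan'), 2.0])  # one nan target

numerator = torch.abs(model_outputs - target_outputs)
denominator = torch.abs(model_outputs) + torch.abs(target_outputs)
elementwise_smape = torch.div(numerator, denominator)  # first entry is nan

nan_mask = torch.isnan(elementwise_smape)
loss = elementwise_smape[~nan_mask].mean()  # nan excluded from the loss

# The loss itself is finite, but DivBackward0 computes
# grad_numerator = grad_output / denominator = 0 / nan = nan
# for the masked entry, so anomaly detection raises.
error = None
try:
    loss.backward()
except RuntimeError as e:
    error = str(e)
print(error)
```

If anomaly detection were *not* enabled, the same backward pass would complete silently (the NaN gradients still flow, they just aren't checked), which might explain, under these assumptions, why a standalone script behaves differently from the training run.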

Update #1: adding the full error:

Traceback of forward call that caused the error:
  File "/home/rylan/tsml/main.py", line 116, in <module>
    main(experiment_number)
  File "/home/rylan/tsml/main.py", line 23, in main
    loss_per_model_per_step=loss_per_model_per_step)
  File "/home/rylan/tsml/main.py", line 47, in run
    sequence_lengths)
  File "/home/rylan/tsml/main.py", line 80, in step
    sequence_lengths=sequence_lengths)
  File "/home/rylan/tsml/models.py", line 153, in loss
    model_output_log_probs=None)
  File "/home/rylan/tsml/losses.py", line 96, in calculate_mean_smape
    target_outputs=target_outputs)
  File "/home/rylan/tsml/losses.py", line 117, in calculate_elementwise_smape
    elementwise_smape = torch.div(numerator, denominator)


Traceback (most recent call last):
  File "/home/rylan/tsml/main.py", line 116, in <module>
    main(experiment_number)
  File "/home/rylan/tsml/main.py", line 23, in main
    loss_per_model_per_step=loss_per_model_per_step)
  File "/home/rylan/tsml/main.py", line 47, in run
    sequence_lengths)
  File "/home/rylan/tsml/main.py", line 89, in step
    total_loss.backward()
  File "/home/rylan/.local/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/rylan/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.

Process finished with exit code 1

Update #2: Finomnis suggested writing the tensors to disk, reading them back in, and comparing them with the originals:

torch.save(model_outputs, 'model_outputs.pt')
torch.save(target_outputs, 'target_outputs.pt')
model_outputs_2 = torch.load('model_outputs.pt')
target_outputs_2 = torch.load('target_outputs.pt')
(model_outputs == model_outputs_2).all()  # tensor(1, dtype=torch.uint8)
(target_outputs == target_outputs_2).all()  # tensor(0, dtype=torch.uint8)

I suspected that `(target_outputs == target_outputs_2).all()` evaluates to `False` because of the NaNs. This is correct:

(torch.isnan(target_outputs) == torch.isnan(target_outputs_2)).all()  # tensor(1, dtype=torch.uint8)
(target_outputs[~torch.isnan(target_outputs)] == target_outputs_2[~torch.isnan(target_outputs_2)]).all()  # tensor(1, dtype=torch.uint8)
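The comparison behavior above follows from IEEE-754 semantics: NaN compares unequal to everything, including itself, so an element-wise `==` over a tensor containing NaNs can never be all-true. A minimal self-contained illustration (values made up):

```python
import torch

a = torch.tensor([float('nan'), 1.0, 2.0])
b = a.clone()  # bitwise-identical copy

# nan != nan, so the element-wise comparison is [False, True, True]
all_equal = bool((a == b).all())

# comparing nan *positions* and the non-nan *values* separately succeeds
masks_match = bool((torch.isnan(a) == torch.isnan(b)).all())
values_match = bool((a[~torch.isnan(a)] == b[~torch.isnan(b)]).all())
print(all_equal, masks_match, values_match)  # False True True
```

In newer PyTorch versions, `torch.allclose(a, b, equal_nan=True)` performs this NaN-aware comparison in one call.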

Update #3: I tried the next suggestion, detaching `model_outputs`, to make sure the problem doesn't originate somewhere else:

def calculate_elementwise_smape(model_outputs,
                                target_outputs):
    model_outputs = model_outputs.detach()
    model_outputs.requires_grad = True

    target_outputs = target_outputs.detach()

    numerator = torch.abs(model_outputs - target_outputs)
    denominator = torch.abs(model_outputs) + torch.abs(target_outputs)
    elementwise_smape = torch.div(numerator, denominator)
    return elementwise_smape

But this produced the same error, `RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.`, at the same location: `elementwise_smape = torch.div(numerator, denominator)`.

To reiterate what I said earlier: if I load the tensors from disk and perform the exact same sequence of operations, this problem does not occur.
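For completeness, here is one way to sidestep the error regardless of its root cause (a workaround sketch under the assumption that the NaNs come from NaN targets, not a confirmed fix): filter invalid entries *before* the division, so `DivBackward0` never sees a NaN or zero denominator during backward. All values are made up:

```python
import torch

model_outputs = torch.tensor([0.5, 1.0, 3.0], requires_grad=True)
target_outputs = torch.tensor([float('nan'), 2.0, 3.0])

# Mask from the *inputs*, applied before any division.
valid = ~torch.isnan(target_outputs)
numerator = torch.abs(model_outputs[valid] - target_outputs[valid])
denominator = torch.abs(model_outputs[valid]) + torch.abs(target_outputs[valid])

# Also guard the 0/0 case, where both outputs are exactly zero.
nonzero = denominator > 0
loss = torch.div(numerator[nonzero], denominator[nonzero]).mean()
loss.backward()  # the division never touched a nan/zero denominator

print(loss.item())                              # finite SMAPE over valid entries
print(bool(torch.isfinite(model_outputs.grad).all()))
```

The boolean indexing is differentiable: backward simply scatters zeros into the masked positions, so the excluded entries contribute a clean zero gradient instead of a NaN.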

0 Answers:

No answers yet.