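I am trying to implement a loss function called SMAPE (commonly used in time series forecasting) in PyTorch. I have two variables, model_outputs and target_outputs, and the formula for the element-wise SMAPE is simple: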
numerator = torch.abs(model_outputs - target_outputs)
denominator = torch.abs(model_outputs) + torch.abs(target_outputs)
elementwise_smape = torch.div(numerator, denominator)
nan_mask = torch.isnan(elementwise_smape)
loss = elementwise_smape[~nan_mask].mean()
assert ~torch.isnan(loss) # loss = 0.023207199
loss.backward()
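But when I run the code, I get the following error: RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.
I used autograd.detect_anomaly() to trace the error to elementwise_smape = torch.div(numerator, denominator). I tried the solutions I could find online (checking for NaN values, adding hooks, printing gradients), but nothing helped. So I decided to write the tensors to disk and write a standalone script: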
import torch
model_outputs = torch.load('model_outputs.pt')
target_outputs = torch.load('target_outputs.pt')
print(model_outputs)
print(target_outputs)
model_outputs.requires_grad = True
target_outputs.requires_grad = False
numerator = torch.abs(model_outputs - target_outputs)
denominator = torch.abs(model_outputs) + torch.abs(target_outputs)
elementwise_smape = torch.div(numerator, denominator)
nan_mask = torch.isnan(elementwise_smape)
loss = elementwise_smape[~nan_mask].mean()
assert ~torch.isnan(loss)
print(loss.detach().numpy())
loss.backward()
print('Success!')
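This works perfectly fine! To check that the problem isn't caused by any of the variables that elementwise_smape depends on (model_outputs, numerator, denominator), I tried using each of them in the loss in place of elementwise_smape, and all of them work!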
nan_mask = torch.isnan(elementwise_smape)
loss = model_outputs[~nan_mask].mean()
loss.backward() # succeeds
nan_mask = torch.isnan(elementwise_smape)
loss = numerator[~nan_mask].mean()
loss.backward() # succeeds
nan_mask = torch.isnan(elementwise_smape)
loss = denominator[~nan_mask].mean()
loss.backward() # succeeds
What is happening in the element-wise division that causes the problem when model_outputs is produced by the model rather than loaded from disk?
Update #1: adding the full error:
Traceback of forward call that caused the error:
File "/home/rylan/tsml/main.py", line 116, in <module>
main(experiment_number)
File "/home/rylan/tsml/main.py", line 23, in main
loss_per_model_per_step=loss_per_model_per_step)
File "/home/rylan/tsml/main.py", line 47, in run
sequence_lengths)
File "/home/rylan/tsml/main.py", line 80, in step
sequence_lengths=sequence_lengths)
File "/home/rylan/tsml/models.py", line 153, in loss
model_output_log_probs=None)
File "/home/rylan/tsml/losses.py", line 96, in calculate_mean_smape
target_outputs=target_outputs)
File "/home/rylan/tsml/losses.py", line 117, in calculate_elementwise_smape
elementwise_smape = torch.div(numerator, denominator)
Traceback (most recent call last):
File "/home/rylan/tsml/main.py", line 116, in <module>
main(experiment_number)
File "/home/rylan/tsml/main.py", line 23, in main
loss_per_model_per_step=loss_per_model_per_step)
File "/home/rylan/tsml/main.py", line 47, in run
sequence_lengths)
File "/home/rylan/tsml/main.py", line 89, in step
total_loss.backward()
File "/home/rylan/.local/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/rylan/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.
Process finished with exit code 1
Update #2: Finomnis suggested writing the tensors to disk, reading them back, and comparing them with the original tensors:
torch.save(model_outputs, 'model_outputs.pt')
torch.save(target_outputs, 'target_outputs.pt')
model_outputs_2 = torch.load('model_outputs.pt')
target_outputs_2 = torch.load('target_outputs.pt')
(model_outputs == model_outputs_2).all() # tensor(1, dtype=torch.uint8)
(target_outputs == target_outputs_2).all() # tensor(0, dtype=torch.uint8)
I suspected that (target_outputs == target_outputs_2).all() evaluates to False because of NaNs. That turned out to be correct:
(torch.isnan(target_outputs) == torch.isnan(target_outputs_2)).all() # tensor(1, dtype=torch.uint8)
(target_outputs[~torch.isnan(target_outputs)] == target_outputs_2[~torch.isnan(target_outputs_2)]).all() # tensor(1, dtype=torch.uint8)
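The comparison above behaves this way because NaN never compares equal to anything, including itself. A minimal sketch of the same check on a small made-up tensor (the names a and b are hypothetical stand-ins for the saved and reloaded tensors, not the real data):

```python
import torch

# Hypothetical stand-ins for target_outputs and target_outputs_2.
a = torch.tensor([1.0, float('nan'), 3.0])
b = a.clone()

# Naive equality fails because NaN != NaN, even for an exact copy.
naive_equal = bool((a == b).all())

# Compare the NaN positions and the non-NaN values separately.
same_nan_positions = bool((torch.isnan(a) == torch.isnan(b)).all())
valid = ~torch.isnan(a)
same_valid_values = bool((a[valid] == b[valid]).all())

# torch.allclose can treat NaNs in matching positions as equal.
close_with_nan = torch.allclose(a, b, equal_nan=True)

print(naive_equal, same_nan_positions, same_valid_values, close_with_nan)
```

So the failed equality check says nothing about the save/load round trip, but it does confirm that target_outputs contains NaNs.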
Update #3: I tried the next suggestion, detaching model_outputs, to make sure the problem isn't coming from somewhere else:
def calculate_elementwise_smape(model_outputs,
                                target_outputs):
    model_outputs = model_outputs.detach()
    model_outputs.requires_grad = True
    target_outputs = target_outputs.detach()
    numerator = torch.abs(model_outputs - target_outputs)
    denominator = torch.abs(model_outputs) + torch.abs(target_outputs)
    elementwise_smape = torch.div(numerator, denominator)
    return elementwise_smape
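But this produces the same error, RuntimeError: Function 'DivBackward0' returned nan values in its 0th output., at the same place: elementwise_smape = torch.div(numerator, denominator)
To reiterate what I said earlier, the problem does not occur if I load the tensors from disk and perform exactly the same sequence of operations.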
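One plausible explanation, offered as a sketch rather than a confirmed diagnosis: since target_outputs contains NaNs, masking the result after the division keeps the loss finite, but the backward pass of the division still computes 0 / NaN for the masked-out elements. The standalone script shown earlier does not run under detect_anomaly, so NaN gradients there would pass silently instead of raising. The example below uses small made-up tensors and two hypothetical helper functions (masked_after_division and masked_before_division) to illustrate this.

```python
import torch

def masked_after_division(model_outputs, target_outputs):
    # Same order of operations as in the question: divide, then mask NaNs.
    numerator = torch.abs(model_outputs - target_outputs)
    denominator = torch.abs(model_outputs) + torch.abs(target_outputs)
    elementwise_smape = torch.div(numerator, denominator)
    nan_mask = torch.isnan(elementwise_smape)
    return elementwise_smape[~nan_mask].mean()

def masked_before_division(model_outputs, target_outputs):
    # Assumed alternative: drop NaN targets first, so NaNs never enter the graph.
    valid = ~torch.isnan(target_outputs)
    outputs = model_outputs[valid]
    targets = target_outputs[valid]
    numerator = torch.abs(outputs - targets)
    denominator = torch.abs(outputs) + torch.abs(targets)
    return torch.div(numerator, denominator).mean()

# Made-up data: one valid target and one NaN target.
target_outputs = torch.tensor([1.5, float('nan')])

# 1) Mask after dividing, without anomaly detection: no error is raised.
x1 = torch.tensor([1.0, 2.0], requires_grad=True)
loss1 = masked_after_division(x1, target_outputs)
loss1.backward()
grad_after = x1.grad.clone()

# 2) Mask before dividing: the loss is the same and the gradient is finite.
x2 = torch.tensor([1.0, 2.0], requires_grad=True)
loss2 = masked_before_division(x2, target_outputs)
loss2.backward()
grad_before = x2.grad.clone()

# 3) Mask after dividing, with anomaly detection: backward raises.
x3 = torch.tensor([1.0, 2.0], requires_grad=True)
anomaly_raised = False
with torch.autograd.detect_anomaly():
    try:
        masked_after_division(x3, target_outputs).backward()
    except RuntimeError:
        anomaly_raised = True

print(loss1.item(), grad_after)   # finite loss, gradient contains NaN
print(loss2.item(), grad_before)  # finite loss, finite gradient
print(anomaly_raised)
```

If this is indeed what is happening, the in-model run and the from-disk run would differ only in whether anomaly detection is active, not in the tensors themselves. A denominator of exactly zero (both values zero) would presumably trigger the same behavior through 0 / 0.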