I followed this tutorial and tried to modify it a bit to check whether I understood the material correctly. However, when I train with torch.optim.SGD as follows:
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F

# Dimensions and dtype as in the tutorial: batch size N, input D_in, hidden H, output D_out
N, D_in, H, D_out = 64, 1000, 100, 10
dtype = torch.float
device = torch.device("cuda:0")

x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

w1 = torch.nn.Parameter(torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True))
w2 = torch.nn.Parameter(torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True))

lr = 1e-6
optimizer = torch.optim.SGD([w1, w2], lr=lr)

for t in range(500):
    # Forward pass: x @ w1 -> ReLU -> @ w2
    layer_1 = x.matmul(w1)
    layer_1 = F.relu(layer_1)
    y_pred = layer_1.matmul(w2)

    # Sum-of-squares loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Zero gradients, backpropagate, take an SGD step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
the loss reaches Inf by the third iteration and then becomes nan, which is completely different from the manual weight update. The manual-update code is below (it is also in the tutorial link):
# Manual-update version from the tutorial (same N, D_in, H, D_out, dtype, device as above)
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass and sum-of-squares loss
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    loss.backward()

    # Update the weights in place with plain gradient descent, then reset the gradients
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        w1.grad.zero_()
        w2.grad.zero_()
I would like to know what is wrong with the modified version (the first snippet). When I replace SGD with Adam, the results are very good (the loss decreases after every iteration, with no Inf or nan).
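For completeness, the Adam run changes only the optimizer line of the first snippet, along the lines of the following (the learning rate shown is Adam's default, used here as a placeholder):

# Only the optimizer construction differs from the SGD snippet above;
# lr=1e-3 is Adam's default and is shown as a placeholder value.
optimizer = torch.optim.Adam([w1, w2], lr=1e-3)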