I am currently implementing policy gradients in PyTorch. For reasons that are irrelevant to the question, I cannot compute the gradients directly with .backward() as follows (this code works perfectly fine):
n_episodes = len(states)
states = torch.tensor(np.array([state for episode in states for state in episode[:-1]])).float()
actions = torch.tensor(np.array([action for episode in actions for action in episode])).float()
advantages = torch.tensor(self.compute_advantages(rewards, normalize=True)).float()
std = torch.exp(self.log_std)
log_probs = torch.distributions.normal.Normal(self.forward(states), std).log_prob(actions).flatten()
loss = - torch.dot(log_probs, advantages)
loss.backward()
self.optimizer.step()
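For context, the identity I am relying on is that since loss = -sum_i advantage_i * log_prob_i, its gradient is just -sum_i advantage_i * grad(log_prob_i), i.e. the batched backward pass and a per-state accumulation should agree. A minimal sanity check of that identity on a made-up linear "policy" (toy tensors, not my actual model) looks like this:

import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)        # stand-in for a single policy parameter
toy_states = torch.randn(5, 3)
toy_actions = torch.randn(5)
toy_advantages = torch.randn(5)

# batched version, analogous to the snippet above
log_probs = torch.distributions.Normal(toy_states @ w, 1.0).log_prob(toy_actions)
grad_batched, = torch.autograd.grad(-torch.dot(log_probs, toy_advantages), w)

# per-sample accumulation, analogous to what the loop below is meant to do
grad_manual = torch.zeros(3)
for i in range(5):
    lp = torch.distributions.Normal(toy_states[i] @ w, 1.0).log_prob(toy_actions[i])
    grad_manual -= torch.autograd.grad(lp, w)[0] * toy_advantages[i]

print(torch.allclose(grad_batched, grad_manual))  # expected: True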
I would like to compute the gradients manually, state by state. I know this is computationally much less efficient, but that is not the point. To my understanding, the following code should work:
from torch.autograd import grad

for i in range(len(actions)):
    state = states[i]
    action = actions[i]
    advantage = advantages[i]
    for name, param in self.named_parameters():
        std = torch.exp(self.log_std)
        dist = torch.distributions.normal.Normal(self.forward(torch.from_numpy(state).float()), std)
        param.grad -= grad(dist.log_prob(torch.from_numpy(action).float()), param)[0] * advantages[i]
self.optimizer.step()
However, the gradients computed this way are completely different from the ones obtained with .backward(). Am I doing something wrong?
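In case it matters, this is roughly how I compare the two results (a sketch; run_backward_version / run_manual_version are just placeholders for the two snippets above, run without the optimizer step):

# compare gradients from the two approaches; the run_* names are placeholders
self.optimizer.zero_grad(set_to_none=False)   # keep .grad as zero tensors
run_backward_version()                        # first snippet, up to loss.backward()
grads_auto = {name: p.grad.clone() for name, p in self.named_parameters()}

self.optimizer.zero_grad(set_to_none=False)   # so the -= accumulation starts from zeros
run_manual_version()                          # second snippet, the per-state loop
grads_manual = {name: p.grad.clone() for name, p in self.named_parameters()}

for name in grads_auto:
    print(name, torch.allclose(grads_auto[name], grads_manual[name]))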