I have 20 environments; the agent gets a reward of 0.1 when it reaches the goal and 0 otherwise.
What I'm trying to figure out is how to normalize the rewards. Say I run the environments for 300 time steps, so the reward matrix has shape 300x20.
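To make the snippets below self-contained, here is a toy stand-in for the rollout; the 5% goal-hit rate and the discount value are made up for illustration, not taken from my actual setup:

import torch

# Toy stand-in for the real rollout: 300 time steps x 20 parallel environments,
# with a sparse reward of 0.1 when the goal is hit and 0 otherwise.
num_steps, num_envs = 300, 20
goal_hits = (torch.rand(num_steps, num_envs) < 0.05).float()  # assumed 5% hit rate
rewards = 0.1 * goal_hits                                     # shape (300, 20): time x envs
discount = 0.99                                               # assumed value
device = rewards.device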
I would normally do something like this:
# rewards has shape (300, 20): time steps x environments
discounted_rewards = torch.zeros_like(rewards, dtype=torch.float32, device=device)
running_add = 0.0
for t in reversed(range(len(rewards))):
    running_add = rewards[t] + discount * running_add
    discounted_rewards[t] = running_add
# normalize along dim 0: statistics over the 300 time steps, one mean/std per environment
mean = discounted_rewards.mean(0, keepdim=True)
std = discounted_rewards.std(0, keepdim=True) + 1e-10
discounted_rewards = (discounted_rewards - mean) / std
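Just to spell out what that axis means, a quick shape check (on the toy tensors above): with dim=0 the mean and std are taken over the 300 time steps, so each environment (each column) is normalized by its own statistics.

# dim=0 reduces over time: one mean/std per environment
mean0 = discounted_rewards.mean(0, keepdim=True)   # shape (1, 20)
std0 = discounted_rewards.std(0, keepdim=True)     # shape (1, 20)
print(mean0.shape, std0.shape)                     # torch.Size([1, 20]) torch.Size([1, 20])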
However, in my assignment I normalized across the environments instead of across the time steps, i.e.:
# normalize along dim 1: statistics across the 20 environments, one mean/std per time step
mean = discounted_rewards.mean(1, keepdim=True)
std = discounted_rewards.std(1, keepdim=True) + 1e-10
discounted_rewards = (discounted_rewards - mean) / std
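Here the mean and std are computed across the 20 parallel environments at each time step, which amounts to using the batch of environments as a per-step baseline; again a quick shape check on the toy tensors:

# dim=1 reduces over environments: one mean/std per time step
mean1 = discounted_rewards.mean(1, keepdim=True)   # shape (300, 1)
std1 = discounted_rewards.std(1, keepdim=True)     # shape (300, 1)
print(mean1.shape, std1.shape)                     # torch.Size([300, 1]) torch.Size([300, 1])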
This seems to train faster. I'm using PPO, if that helps.
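For reference, this is roughly where the normalized returns end up in the PPO update; it's just a minimal sketch of the standard clipped objective, with random tensors standing in for the real policy log-probs rather than my actual training code:

# Minimal PPO clipped-objective sketch; the log-probs here are placeholders.
advantages = discounted_rewards.flatten()                 # normalized returns used as advantages
old_log_probs = torch.randn_like(advantages)              # log pi_old(a|s), placeholder
new_log_probs = old_log_probs + 0.01 * torch.randn_like(advantages)  # log pi_new(a|s), placeholder
clip_eps = 0.2                                            # common PPO default

ratio = torch.exp(new_log_probs - old_log_probs)
unclipped = ratio * advantages
clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
policy_loss = -torch.min(unclipped, clipped).mean()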
So my question is: