I have built a reinforcement learning model in PyTorch, using the Q-learning principle.
The model performs well for roughly the first 200 episodes. Then the reward starts to drop, and after that it just bounces around in the negative range. You can see it here: reward_2000_epoch
I have run several hyperparameter tests over epsilon, the epsilon decay factor, the learning rate, and gamma. I have also tested model variants with and without nn.Dropout and nn.ReLU (the epsilon-greedy action selection I varied is sketched after the model code below). But as the plot above shows, nothing produced better results; sometimes the results got worse. The model code is:
import torch
import torch.nn as nn
import torch.optim as optim

class Policy(nn.Module):
    def __init__(self):
        super(Policy, self).__init__()
        # env is created elsewhere: a 2-dimensional observation, 3 discrete actions
        self.state_space = env.observation_space.shape[0]  # 2
        self.action_space = env.action_space.n             # 3
        self.hidden = 50
        self.l1 = nn.Linear(self.state_space, self.hidden, bias=False)
        self.l2 = nn.Linear(self.hidden, self.action_space, bias=False)
        # build the network once instead of rebuilding it on every forward pass
        self.model = nn.Sequential(
            self.l1,
            nn.Dropout(p=0.6),
            nn.ReLU(),
            self.l2,
        )

    def forward(self, x):
        return self.model(x)
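For context, the action selection behind the 'choose action' placeholder in the training loop below is plain epsilon-greedy. This is only a minimal sketch; epsilon_start, epsilon_min, and decay_factor are placeholder names standing in for the values I actually swept over:

import random

# minimal epsilon-greedy sketch; epsilon_start, epsilon_min and
# decay_factor are placeholders for the values I swept over
epsilon = epsilon_start
Q = policy(torch.from_numpy(state).type(torch.FloatTensor))
if random.random() < epsilon:
    action = env.action_space.sample()            # explore
else:
    action = int(torch.argmax(Q).item())          # exploit current Q estimates
epsilon = max(epsilon_min, epsilon * decay_factor)  # anneal exploration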
The training setup and loop:
loss_fn = nn.MSELoss()
optimizer = optim.SGD(policy.parameters(), lr=learning_rate)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.99)

while not done:
    # choose action (epsilon-greedy, see the sketch above); Q holds the
    # network output for the current state from that step
    # step the environment forward and receive the next state and reward
    Q1 = policy(torch.from_numpy(state).type(torch.FloatTensor))
    maxQ1, _ = torch.max(Q1, -1)
    # build the Q-learning target: Q_target[action] = reward + gamma * max_a' Q(s', a')
    Q_target = Q.clone().detach()
    Q_target[action] = reward + torch.mul(maxQ1.detach(), gamma)
    # calculate the loss between predicted and target Q values
    loss = loss_fn(Q, Q_target)
    # update the policy network
    policy.zero_grad()
    loss.backward()
    optimizer.step()
    # find out if the episode is done
    if done:
        scheduler.step()  # decay the learning rate once per episode
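One detail that may matter for the 200-episode mark, though I am not sure it explains the drop: since scheduler.step() runs once per episode with gamma=0.99, the learning rate shrinks geometrically. A quick check of the effective rate (learning_rate is the initial value from above):

# effective learning rate after n episodes under StepLR(step_size=1, gamma=0.99)
for n in (100, 200, 500):
    print(n, learning_rate * 0.99 ** n)
# prints roughly 0.37x, 0.13x and 0.007x the initial rate, so by
# episode 200 the updates are already almost an order of magnitude smaller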
What can I do so that the model keeps learning even after 200 episodes, or at least holds on to the reward it has already reached?