I have built a reinforcement learning model in PyTorch, using the Q-learning principle.
The model performs well for roughly the first 200 episodes. Then the reward starts to drop, and after that it just bounces around in the negative range. You can see it here: reward_2000_epoch
I have run several hyperparameter tests over epsilon, the epsilon decay factor, the learning rate, and gamma. I have also tested model variants with and without nn.Dropout and nn.ReLU (the epsilon-greedy action selection I varied is sketched after the model code below). But as the plot above shows, nothing produced better results; sometimes the results got worse. The model code is:
import torch
import torch.nn as nn
import torch.optim as optim

class Policy(nn.Module):
    def __init__(self):
        super(Policy, self).__init__()
        # env is created elsewhere: a 2-dimensional observation, 3 discrete actions
        self.state_space = env.observation_space.shape[0]  # 2
        self.action_space = env.action_space.n             # 3
        self.hidden = 50
        self.l1 = nn.Linear(self.state_space, self.hidden, bias=False)
        self.l2 = nn.Linear(self.hidden, self.action_space, bias=False)
        # build the network once instead of rebuilding it on every forward pass
        self.model = nn.Sequential(
            self.l1,
            nn.Dropout(p=0.6),
            nn.ReLU(),
            self.l2,
        )

    def forward(self, x):
        return self.model(x)
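For context, the action selection behind the 'choose action' placeholder in the training loop below is plain epsilon-greedy. This is only a minimal sketch; epsilon_start, epsilon_min, and decay_factor are placeholder names standing in for the values I actually swept over:

import random

# minimal epsilon-greedy sketch; epsilon_start, epsilon_min and
# decay_factor are placeholders for the values I swept over
epsilon = epsilon_start
Q = policy(torch.from_numpy(state).type(torch.FloatTensor))
if random.random() < epsilon:
    action = env.action_space.sample()            # explore
else:
    action = int(torch.argmax(Q).item())          # exploit current Q estimates
epsilon = max(epsilon_min, epsilon * decay_factor)  # anneal exploration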
The training setup and loop:
loss_fn = nn.MSELoss()
optimizer = optim.SGD(policy.parameters(), lr=learning_rate)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.99)

while not done:
    # choose action (epsilon-greedy, see the sketch above); Q holds the
    # network output for the current state from that step
    # step the environment forward and receive the next state and reward
    Q1 = policy(torch.from_numpy(state).type(torch.FloatTensor))
    maxQ1, _ = torch.max(Q1, -1)
    # build the Q-learning target: Q_target[action] = reward + gamma * max_a' Q(s', a')
    Q_target = Q.clone().detach()
    Q_target[action] = reward + torch.mul(maxQ1.detach(), gamma)
    # calculate the loss between predicted and target Q values
    loss = loss_fn(Q, Q_target)
    # update the policy network
    policy.zero_grad()
    loss.backward()
    optimizer.step()
    # find out if the episode is done
    if done:
        scheduler.step()  # decay the learning rate once per episode
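One detail that may matter for the 200-episode mark, though I am not sure it explains the drop: since scheduler.step() runs once per episode with gamma=0.99, the learning rate shrinks geometrically. A quick check of the effective rate (learning_rate is the initial value from above):

# effective learning rate after n episodes under StepLR(step_size=1, gamma=0.99)
for n in (100, 200, 500):
    print(n, learning_rate * 0.99 ** n)
# prints roughly 0.37x, 0.13x and 0.007x the initial rate, so by
# episode 200 the updates are already almost an order of magnitude smaller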
What can I do so that the model keeps learning even after 200 episodes, or at least holds on to the reward it has already reached?