Question

I am implementing a Policy Network RL method for solving the Pong game, following this tutorial. However, I am having trouble understanding the custom_loss() function.

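The custom_loss() function takes two inputs, y_true and y_pred. In an RL setting, however, there is no y_true (i.e. no label), because you simply act in the environment and receive rewards.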

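So how is the loss computed in this example? My only assumption is that y_true is the action sampled from the distribution:

action = np.random.choice(range(num_actions), p=predict)

and that y_pred is the probability distribution (i.e. the output of the policy network):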

predict = model_predict.predict([state])[0]
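To make that assumption concrete, here is a minimal sketch of the sampling step followed by one-hot encoding. The values of `num_actions` and `predict` are made up for illustration (in the real code `predict` comes from `model_predict.predict([state])[0]`), and `action_onehot` is a hypothetical name for one row of what I take `actions_train` to be.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

num_actions = 3                       # hypothetical number of actions
predict = np.array([0.2, 0.5, 0.3])   # stand-in for the policy network output

# Sample an action index according to the predicted probabilities.
action = rng.choice(num_actions, p=predict)

# One-hot encode the sampled action (assumed format of one actions_train row).
action_onehot = np.zeros(num_actions)
action_onehot[action] = 1.0

print(action, action_onehot)
```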

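If that is the case, then the training call loss = model_train.train_on_batch([states, discounted_rewards], actions_train) never passes in the probability distribution output by the network; only the selected, one-hot encoded actions are passed.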
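For what it is worth, Keras's general convention is that the targets passed as the second argument of train_on_batch arrive in the loss as y_true, while y_pred is produced by the model's own forward pass on the inputs, so it never has to be passed in explicitly. The NumPy sketch below only imitates that data flow; `forward`, `train_on_batch_sketch`, `recording_loss`, and the weight and batch values are all hypothetical stand-ins, not Keras code and not the tutorial's code.

```python
import numpy as np

def forward(states, weights):
    """Hypothetical stand-in for the policy network: linear layer + softmax."""
    logits = states @ weights
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def train_on_batch_sketch(loss_fn, weights, inputs, targets):
    """Imitates the data flow only: no gradients, no weight update."""
    states, adv = inputs                # adv reaches the real loss via the closure
    y_pred = forward(states, weights)   # computed internally from the inputs
    return loss_fn(targets, y_pred)     # targets play the role of y_true

seen = {}

def recording_loss(y_true, y_pred):
    # Record what the loss function receives, then return a dummy value.
    seen["y_true"], seen["y_pred"] = y_true, y_pred
    return 0.0

states = np.ones((2, 4))                                   # made-up batch of 2 states
weights = np.zeros((4, 3))                                 # made-up weights, 3 actions
discounted_rewards = np.array([[1.0], [-1.0]])
actions_train = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])

train_on_batch_sketch(recording_loss, weights, [states, discounted_rewards], actions_train)
print(seen["y_true"])   # the one-hot actions
print(seen["y_pred"])   # the forward-pass probabilities
```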

Below is a snippet of the relevant code, though it may be easier to follow the link above.

...

def custom_loss(y_true, y_pred):
    # actual: 0 predict: 0 -> log(0 * (0 - 0) + (1 - 0) * (0 + 0)) = -inf
    # actual: 1 predict: 1 -> log(1 * (1 - 1) + (1 - 1) * (1 + 1)) = -inf
    # actual: 1 predict: 0 -> log(1 * (1 - 0) + (1 - 1) * (1 + 0)) = 0
    # actual: 0 predict: 1 -> log(0 * (0 - 1) + (1 - 0) * (0 + 1)) = 0
    log_lik = K.log(y_true * (y_true - y_pred) + (1 - y_true) * (y_true + y_pred))
    return K.mean(log_lik * adv, keepdims=True)

model_train = Model(inputs=[inp, adv], outputs=out)
model_train.compile(loss=custom_loss, optimizer=Adam(lr))
model_predict = Model(inputs=[inp], outputs=out)  

...

loss = model_train.train_on_batch([states, discounted_rewards], actions_train)

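I would expect the loss function to take into account the probability distribution output by the network, the chosen action, and the advantage (the discounted rewards), but I am confused by the custom loss function as implemented.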
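As a sanity check on how those three ingredients would combine, here is the same expression from custom_loss() evaluated with NumPy on a single made-up sample. The probabilities, the one-hot action, and the advantage value are illustrative assumptions, and `adv` is an ordinary array here rather than the Keras input tensor used in the snippet.

```python
import numpy as np

y_pred = np.array([[0.2, 0.5, 0.3]])   # made-up policy output for one state
y_true = np.array([[0.0, 1.0, 0.0]])   # one-hot encoding of the chosen action
adv = np.array([[2.0]])                # made-up advantage (discounted reward)

# Same expression as in custom_loss(), with K.log / K.mean swapped for NumPy.
inner = y_true * (y_true - y_pred) + (1 - y_true) * (y_true + y_pred)
log_lik = np.log(inner)
loss = np.mean(log_lik * adv, keepdims=True)

print(inner)   # 1 - p where y_true is 1, p where y_true is 0
print(loss)
```

For a one-hot y_true, the term inside the log reduces to 1 - p for the chosen action and p for every other action, and each row is then scaled by its advantage before averaging.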

Understanding Keras loss function inputs in reinforcement learning

0 answers: