I am trying to build a policy network in Keras that takes a state as input and outputs the probabilities of the possible actions to take.
My model is defined as follows and uses a custom loss function:
import tensorflow as tf
from tensorflow.keras import activations, initializers, losses
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Activation, Dense, Input
from tensorflow.keras.models import Model

# State input and the discounted reward of each sample (used only inside the loss).
inp = Input(shape=[8], name="input_x")
discounted_rewards = Input(shape=[1], name="discounted_rewards")

model = Dense(10,
              kernel_initializer=initializers.glorot_normal(seed=1),
              bias_initializer=initializers.glorot_normal(seed=1),
              activation="relu",
              name="dense_1")(inp)
model = Dense(10,
              kernel_initializer=initializers.glorot_normal(seed=1),
              bias_initializer=initializers.glorot_normal(seed=1),
              activation="relu",
              name="dense_2")(model)
unscaled = Dense(4,
                 kernel_initializer=initializers.glorot_normal(seed=1),
                 bias_initializer=initializers.glorot_normal(seed=1),
                 name="unscaled")(model)
out = Activation(activations.softmax)(unscaled)

# Weight the cross-entropy of the taken action by its discounted reward.
def custom_loss(y_true=None, y_pred=None):
    neg_log_prob = losses.categorical_crossentropy(y_true=y_true, y_pred=y_pred)
    return K.mean(neg_log_prob * discounted_rewards)

# Register the loss on the losses module so it can be looked up by name.
losses.custom_loss = custom_loss

# The training model takes the rewards as an extra input; the prediction model
# shares the same layers but only needs the state.
model_train = Model(inputs=[inp, discounted_rewards], outputs=out)
model_train.compile(loss=custom_loss, optimizer=tf.keras.optimizers.Adam(0.02))

model_predict = Model(inputs=[inp], outputs=out)
model_predict.compile(loss=custom_loss, optimizer=tf.keras.optimizers.Adam(0.02))

model_train.summary()
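For context, a training step with this two-input setup would presumably look like the sketch below; the array names, shapes, and values (states, actions_one_hot, rewards) are assumptions for illustration, not part of the original code:

import numpy as np

# Hypothetical batch: 32 states, the one-hot actions that were taken, and their
# discounted rewards (which can be positive or negative).
states = np.random.rand(32, 8).astype("float32")
actions_one_hot = np.eye(4)[np.random.randint(0, 4, size=32)].astype("float32")
rewards = np.random.randn(32, 1).astype("float32")

# The rewards are passed as the second input so that custom_loss can use them.
model_train.fit(x=[states, rewards], y=actions_one_hot, epochs=1, verbose=1)

# The prediction model shares the trained weights but only needs the state.
action_probs = model_predict.predict(states)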
My discounted rewards can be positive or negative, and I want the model to prefer actions that lead to positive rewards. As far as I know, categorical_crossentropy gives the negative log-probability of the action taken for the input state. So when I multiply it by positive rewards, the result is negative, and when the rewards are negative, the result is positive.
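As a side note, the per-sample term that custom_loss averages can be inspected directly with the Keras backend; the one-hot action, softmax output, and rewards below are made-up values for illustration only:

import numpy as np
from tensorflow.keras import backend as K
from tensorflow.keras import losses

# Made-up single sample: one-hot taken action and the network's softmax output.
y_true = K.constant([[0.0, 1.0, 0.0, 0.0]])
y_pred = K.constant([[0.1, 0.6, 0.2, 0.1]])

neg_log_prob = losses.categorical_crossentropy(y_true=y_true, y_pred=y_pred)

# Evaluate the weighted term for one positive and one negative discounted reward.
for reward in (2.0, -2.0):
    print("reward:", reward, "loss term:", K.eval(neg_log_prob * reward))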
However, when I train the model, the loss converges from negative values toward zero, even though I would expect it to take on larger and larger negative values, since that would mean higher rewards.
Why does it optimize toward zero instead of minimizing toward negative values?