Issues with policy gradient code for pong-v0 in Keras

Date: 2019-05-24 09:40:45

Tags: python keras reinforcement-learning pong policy-gradient-descent

I am new to machine learning and am trying to write a policy gradient agent for pong-v0. I compute the advantage function as the discounted reward minus a value estimator (baseline), and then multiply the advantage by the log probability of the action taken. The model spends a lot of time updating but never converges.
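For reference, this is a minimal sketch of the discounting and baseline subtraction described above. The function name discount_rewards, the gamma value, and the value_estimates array are assumptions for illustration and are not part of the original code; the per-point reset of the running sum follows the Pong-specific trick from Karpathy's post.

import numpy as np

def discount_rewards(rewards, gamma=0.99):
    # work backwards through the episode, accumulating a discounted running sum
    discounted = np.zeros_like(rewards, dtype=np.float32)
    running_add = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running_add = 0.0  # Pong-specific: a point was scored, so reset the sum
        running_add = running_add * gamma + rewards[t]
        discounted[t] = running_add
    return discounted

# toy episode: no reward until the last step, where a point is lost
episode_rewards = np.array([0.0, 0.0, 0.0, -1.0])
value_estimates = np.zeros_like(episode_rewards)   # hypothetical baseline predictions

returns = discount_rewards(episode_rewards)
advantages = returns - value_estimates             # advantage = discounted return minus baseline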

Basically, what I am trying to do is the following.

I have created a PolicyEstimator model for evaluating and updating the policy, and the training step looks roughly like this:

up_probability = model.predict(state_delta)             # predicted probability of moving UP
new_state, reward, done, _ = env.step(up_probability)   # step the environment
action_prob = 1 - up_probability                        # probability of the other action (DOWN)
reward = discounted_mean_reward(reward)                 # discounted return
gradient = action_prob * reward                         # advantage-weighted target

model.fit(state_delta, gradient)                        # one supervised update toward that target
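One detail worth noting: env.step expects a discrete Atari action rather than a probability. A minimal sketch of sampling a discrete action from the sigmoid output, following the convention used in Karpathy's post (action 2 is UP and 3 is DOWN for Pong); the variable names are assumptions:

import numpy as np

up_probability = float(model.predict(state_delta))
action = 2 if np.random.uniform() < up_probability else 3   # 2 = UP, 3 = DOWN in Pong-v0
new_state, reward, done, _ = env.step(action)

# probability assigned to the action that was actually taken
action_prob = up_probability if action == 2 else 1 - up_probability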

Here I import the gym environment and update the model with an advantage-modulated policy gradient update.
Most of the code is adapted from here: http://karpathy.github.io/2016/05/31/rl/
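For context, a rough sketch of the frame preprocessing that produces state_delta, modeled on the prepro function in that post (the function name preprocess and the loop variables are assumptions, not part of the original code):

import gym
import numpy as np

def preprocess(frame):
    # crop, downsample and binarize a 210x160x3 Pong frame into a flat 6400-dim vector
    frame = frame[35:195]          # crop the playing field
    frame = frame[::2, ::2, 0]     # downsample by a factor of 2, keep one color channel
    frame[frame == 144] = 0        # erase background type 1
    frame[frame == 109] = 0        # erase background type 2
    frame[frame != 0] = 1          # paddles and ball become 1
    return frame.astype(np.float32).ravel()

env = gym.make("Pong-v0")
observation = env.reset()
prev_state = None

cur_state = preprocess(observation)
# the network sees the difference between consecutive frames, so motion is visible
state_delta = cur_state - prev_state if prev_state is not None else np.zeros_like(cur_state)
prev_state = cur_state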

import numpy as np
from keras.models import Sequential
from keras.layers import Dense


class PolicyEstimator():
    """Policy function approximator."""

    def __init__(self, observation_space, hidden_layer_size, action_space):
        self.observation_space = observation_space
        self.hidden_layer_size = hidden_layer_size
        self.action_space = action_space
        self.model = self.build_model()

    def build_model(self):
        model = Sequential()
        # hidden layer on the flattened, preprocessed frame difference
        model.add(Dense(units=self.hidden_layer_size, input_dim=self.observation_space,
                        activation='relu', kernel_initializer='RandomNormal'))
        # output layer: sigmoid probability of moving the paddle up
        model.add(Dense(units=self.action_space, activation='sigmoid',
                        kernel_initializer='RandomNormal'))
        # compile the model using a standard supervised loss and optimizer
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model

    def predict(self, state):
        # add a batch dimension: (observation_space,) -> (1, observation_space)
        state = np.expand_dims(state, axis=1).T
        return self.model.predict(state)

    def update(self, state, logpg):
        # one gradient step toward the advantage-weighted targets
        self.model.fit(x=state, y=logpg, epochs=1, verbose=1)
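As a point of comparison only, one common way to express this kind of update with a binary_crossentropy model in Keras is to fit on the sampled action labels and pass the advantages through sample_weight, so the cross-entropy (the negative log-probability of the taken action) is scaled by the advantage. This is a sketch of that idea, not the original code; the batch shapes, the 6400-dim input, and the 200 hidden units are assumptions borrowed from Karpathy's setup.

import numpy as np

# dummy batch for illustration; in practice these come from a full episode
states = np.zeros((10, 6400), dtype=np.float32)       # frame differences
actions = np.ones((10, 1), dtype=np.float32)          # 1.0 where UP was taken, 0.0 where DOWN was taken
advantages = np.random.randn(10).astype(np.float32)   # discounted returns minus baseline

estimator = PolicyEstimator(observation_space=6400, hidden_layer_size=200, action_space=1)

# binary_crossentropy on the taken action is -log p(action); weighting each sample by its
# advantage turns this supervised fit into a policy-gradient-style update
estimator.model.fit(x=states, y=actions, sample_weight=advantages, epochs=1, verbose=0)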

Even after 50 consecutive episodes, the reward stays at -21.

0 Answers

There are no answers yet.