I'm new to machine learning and am trying to write an agent for Pong-v0. I'm using the policy gradient method: I compute the advantage as the discounted reward minus a value estimator (baseline), and then multiply the advantage by the log probability of the action taken. The model spends a lot of time updating without ever converging.
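To make the objective concrete, this is the update I have in mind; a minimal NumPy sketch with made-up numbers (discounted_returns, value_estimates, and action_probs are illustrative names, not from my actual code):

    import numpy as np

    # one short episode, illustrative numbers only
    discounted_returns = np.array([0.93, 0.96, 1.00])   # G_t, from discounting the rewards
    value_estimates    = np.array([0.50, 0.70, 0.90])   # baseline V(s_t) from the value estimator
    action_probs       = np.array([0.60, 0.80, 0.70])   # pi(a_t | s_t) for the actions taken

    advantage = discounted_returns - value_estimates    # A_t = G_t - V(s_t)
    loss = -np.mean(np.log(action_probs) * advantage)   # REINFORCE-with-baseline objective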
Basically, this is what I'm trying to do:
I have created a PolicyEstimator class that is used both to predict actions and to update the model:
    # probability of moving the paddle UP, given the frame difference
    up_probability = model.predict(state_delta)
    new_state, reward, done, _ = env.step(up_probability)
    # probability of the action actually taken
    action_prob = 1 - up_probability
    # discount (and normalize) the rewards over the episode
    reward = discounted_mean_reward(reward)
    # weight the action probability by the discounted reward
    gradient = action_prob * reward
    model.fit(state_delta, gradient)
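discounted_mean_reward is not shown above; what I intend there is the usual discount-and-normalize step from Karpathy's post. A sketch, assuming gamma = 0.99 (the exact implementation is mine, so treat it as an assumption):

    import numpy as np

    def discounted_mean_reward(rewards, gamma=0.99):
        """Discount episode rewards, then normalize to zero mean and unit variance."""
        discounted = np.zeros_like(rewards, dtype=np.float64)
        running = 0.0
        for t in reversed(range(len(rewards))):
            if rewards[t] != 0:
                running = 0.0          # Pong-specific: reset the sum at game boundaries
            running = running * gamma + rewards[t]
            discounted[t] = running
        # normalizing acts as a crude baseline and reduces gradient variance
        discounted -= discounted.mean()
        discounted /= (discounted.std() + 1e-8)
        return discounted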
Here I import the Gym environment and update the model with the advantage-modulated policy gradient update described above. Most of the code is adapted from here: http://karpathy.github.io/2016/05/31/rl/
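For reference, state_delta above is the difference between two consecutive preprocessed frames; the preprocessing follows the prepro function in Karpathy's post (crop, downsample, binarize, flatten to 6400 values):

    import numpy as np

    def prepro(frame):
        """Turn a 210x160x3 uint8 Pong frame into a 6400-dim binary float vector."""
        frame = frame[35:195]        # crop to the playing field
        frame = frame[::2, ::2, 0]   # downsample by a factor of 2, keep one color channel
        frame[frame == 144] = 0      # erase background (type 1)
        frame[frame == 109] = 0      # erase background (type 2)
        frame[frame != 0] = 1        # paddles and ball become 1
        return frame.astype(float).ravel()

    # state_delta = prepro(new_frame) - prepro(previous_frame)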
    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    class PolicyEstimator():
        """Policy function approximator."""

        def __init__(self, observation_space, hidden_layer_size, action_space):
            self.observation_space = observation_space
            self.hidden_layer_size = hidden_layer_size
            self.action_space = action_space
            self.model = self.build_model()

        def build_model(self):
            model = Sequential()
            # hidden layer
            model.add(Dense(units=self.hidden_layer_size, input_dim=self.observation_space,
                            activation='relu', kernel_initializer='RandomNormal'))
            # output layer: sigmoid probability of the UP action
            model.add(Dense(units=self.action_space, activation='sigmoid',
                            kernel_initializer='RandomNormal'))
            # compile the model using standard supervised-learning losses and optimizers
            model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
            return model

        def predict(self, state):
            # reshape the flat state vector into a (1, observation_space) batch
            state = np.expand_dims(state, axis=1).T
            return self.model.predict(state)

        def update(self, state, logpg):
            # a single supervised step toward the reward-weighted targets
            self.model.fit(x=state, y=logpg, epochs=1, verbose=1)
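And this is roughly the driver loop, using prepro and discounted_mean_reward from the sketches above; simplified to a single episode, and sampling the action from up_probability as in Karpathy's post (action 2 = UP, 3 = DOWN):

    import gym
    import numpy as np

    env = gym.make('Pong-v0')
    estimator = PolicyEstimator(observation_space=6400, hidden_layer_size=200, action_space=1)

    observation = env.reset()
    prev_frame = None
    states, rewards, probs = [], [], []

    done = False
    while not done:
        cur_frame = prepro(observation)
        # difference of consecutive frames so the policy can see motion
        state_delta = cur_frame - prev_frame if prev_frame is not None else np.zeros(6400)
        prev_frame = cur_frame

        up_probability = estimator.predict(state_delta)[0, 0]
        action = 2 if np.random.uniform() < up_probability else 3  # sample UP or DOWN

        observation, reward, done, _ = env.step(action)
        states.append(state_delta)
        rewards.append(reward)
        probs.append(up_probability)

    # one update per episode, weighting by the discounted rewards as above
    discounted = discounted_mean_reward(np.array(rewards))
    gradient = (1.0 - np.array(probs)) * discounted
    estimator.update(np.vstack(states), gradient)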
Even after 50 consecutive episodes, the reward stays at -21 (the lowest possible score in Pong).