I'm new to Stack Overflow, so I'm sorry if I'm doing something wrong or not explaining myself properly. I'm trying to build a neural network for what is basically a simplified PacMan framework. The network learns from rewards through a policy-gradient custom loss. I think I understand how the custom loss works: when the reward is high (or the loss is low), the network keeps doing what it was doing; otherwise it tries other actions. This is how my neural network is set up:
I have 3 inputs. last_state is 2 frames of the game, something like the normal game screen but in grayscale, with shape (256, 256, 1).
last_action is an array, for example:
[0, 1, 0, 0]
which translates to [up, down, right, left].
This is the last action chosen, either by the network or by the random policy.
Finally, last_reward is the reward that the reward function gave for taking last_action in last_state.
I have a single output, an action, which can be chosen either at random or by the neural network. The random policy is there for exploration.
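To make the shapes concrete, here is a minimal sketch of how the three inputs could be put together. The frame variables and the 0.5 reward are just placeholder values, not the real game code:

import numpy as np

# Sketch only: the two grayscale frames form a batch of 2 "states",
# the last action is one-hot encoded, and the last reward is one scalar per state.
frame_prev = np.zeros((256, 256, 1), dtype=np.float32)   # previous frame, grayscale
frame_now = np.zeros((256, 256, 1), dtype=np.float32)    # current frame, grayscale
state = np.stack([frame_prev, frame_now])                # shape (2, 256, 256, 1)

last_action = np.zeros((2, 4), dtype=np.float32)
last_action[:, 1] = 1.0                                  # [0, 1, 0, 0] -> "down"

last_reward = np.full((2, 1), 0.5, dtype=np.float32)     # reward given for that action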
Now that I think the neural network part is clear, let's move on to the rest of the code. This is my agent class, which the game runs:
import numpy as np
import tensorflow as tf
from keras import optimizers
from keras.models import Model
from keras.layers import Input, Conv2D, Flatten, Dense
class PolicyGradientCNN:

    def __init__(self, state_space, action_space, epsilon, epsilon_min, epsilon_decay):
        self.action_space = action_space
        self.state_space = state_space
        self.epsilon = epsilon
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay
        self.buildModel()
        adam = optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, decay=0.0)
        self.model.compile(loss=None, optimizer=adam, metrics=['mae'])
        self.model.summary()
    def buildModel(self):
        # We have 3 inputs: a state, an action, and the reward of that action in that state
        last_state = Input(shape=self.state_space, name='input')
        last_action = Input(shape=(self.action_space,), name='last_action')
        last_reward = Input(shape=(1,), name='reward')
        # Since the input is an image, we need convolutions
        f = Conv2D(32, 8, strides=(4, 4), activation='relu', input_shape=self.state_space, kernel_initializer='glorot_uniform')(last_state)
        f = Conv2D(64, 4, strides=(2, 2), activation='relu', input_shape=self.state_space, kernel_initializer='glorot_uniform')(f)
        f = Conv2D(64, 3, strides=(1, 1), activation='relu', input_shape=self.state_space, kernel_initializer='glorot_uniform')(f)
        f = Flatten()(f)
        f = Dense(1024, activation='relu', kernel_initializer='glorot_uniform')(f)
        f = Dense(512, activation='relu', kernel_initializer='glorot_uniform')(f)
        # We predict an action as the output, with the size of the action_space
        action_pred = Dense(self.action_space, activation='softmax', kernel_initializer='glorot_uniform')(f)
        self.model = Model(inputs=[last_state, last_action, last_reward], outputs=[action_pred])
        self.model.add_loss(self.customLoss(action_pred, last_action, last_reward))
    # This loss function is a policy gradient loss function
    def customLoss(self, action_pred, last_action, last_reward):
        neg_log_prob = tf.nn.softmax_cross_entropy_with_logits(logits=action_pred, labels=last_action)
        loss = tf.reduce_mean(neg_log_prob * last_reward)
        return loss
    # To choose an action we need some exploration; we make this possible with an epsilon
    def chooseAction(self, state):
        # Decay epsilon, but never go below epsilon_min
        self.epsilon *= self.epsilon_decay
        self.epsilon = max(self.epsilon_min, self.epsilon)
        print("Epsilon")
        print(self.epsilon)
        r = np.random.random()
        if r > self.epsilon:
            print(" ********************* CHOOSING A PREDICTED ACTION **********************")
            # Dummy action/reward inputs, only needed because the model has 3 inputs
            actions = np.ones((2, self.action_space))
            rewards = np.ones((2, 1))
            pred = self.model.predict([state, actions, rewards])
            action = pred
        else:
            print("******* CHOOSING A RANDOM ACTION *******")
            chose_action = np.random.choice(range(self.action_space))  # pick a random action uniformly
            action = np.zeros((2, 4))
            action[1][chose_action] = 1
        print("Chose action")
        print(action)
        return action
    # Update our target network
    def trainTarget(self, states, actions, discounted_episode_rewards):
        self.model.fit([states, actions, discounted_episode_rewards])
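Just to show how the class is meant to be used, here is a rough sketch. The epsilon values are placeholders and the actual game loop is in the full code linked at the end:

# Rough usage sketch (placeholder hyperparameters, not the real game loop)
agent = PolicyGradientCNN(state_space=(256, 256, 1), action_space=4,
                          epsilon=1.0, epsilon_min=0.05, epsilon_decay=0.995)

state = np.zeros((2, 256, 256, 1), dtype=np.float32)  # the 2 grayscale frames
action = agent.chooseAction(state)                    # predicted or random action

# ... play a step, observe the reward, and at the end of an episode:
# agent.trainTarget(states, actions, discounted_episode_rewards)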
I have to say that the reward is actually more of a cost, because I want the loss to increase over time if the algorithm doesn't win.
The NN chooses an action from the state (2 frames). I picked 2 frames because the ghosts can move within 2 frames, so the network should be able to predict their movement. This returns an array with a "score" for each action, and I want the chosen action to be the one with the highest score.
The NN is trained with the current state, last_action and last_reward. Of course it can't train on the first iteration, because there is no reward yet.
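trainTarget takes discounted_episode_rewards; this is the kind of standard discounted-return computation I mean (gamma = 0.95 is just a placeholder value, and the normalization is optional; the exact code is in the link at the end):

import numpy as np

def discountRewards(episode_rewards, gamma=0.95):
    # Each step gets its own reward plus gamma times the discounted future rewards
    discounted = np.zeros(len(episode_rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(episode_rewards))):
        running = episode_rewards[t] + gamma * running
        discounted[t] = running
    # Normalizing often stabilizes policy-gradient training
    discounted = (discounted - discounted.mean()) / (discounted.std() + 1e-8)
    return discounted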
Now that I've explained everything, let's talk about my problem.
When I train, the loss keeps increasing, but the output is always the same. For example, if in the first iteration the best "score" is the one for going up, say [0.9, 0.08, 0.02, 0.1], then from the next state onward the output is [1.0, 0.0, 0.0, 0.0], regardless of which action the random policy picked or what the loss value was. Why does this happen?
You can find the full code here!