I am trying to build a reinforcement learning algorithm from scratch using Proximal Policy Optimization. The problem I am facing is that my graph's optimizer object is not recognized when I try to call it with session.run(). The error I get is "NameError: name 'optimizer' is not defined". I am making the call from a method located inside the class that defines the LSTM network. I actually do the same thing from another class to get predictions from the network and that works, but I don't understand why this call does not work.
def __init__(self, input_size, output_size, session):
    """
    input_size: dimension of input environment - OpenAI cartpole = 4
    output_size: dimension of action space - OpenAI cartpole = 2
    """
    # LSTM expects input to be a 3D tensor
    # Reshape input for LSTM
    self.env = tf.placeholder(dtype=tf.float32, shape=[1, None, input_size])
    #self.environment = tf.reshape(self.env, shape=[1, None, input_size])
    #obsp = tf.placeholder(dtype=tf.float32, shape=[None,])
    #delta = tf.reshape(obsp, shape=[1, 1, 4])
    self.lstm1 = tf.keras.layers.LSTM(8, return_sequences=False)(self.env)
    # Softmax returns a probability distribution over actions and what the NN predicts will be best
    self.actor = tf.keras.layers.Dense(output_size, activation="softmax")(self.lstm1)
    self.critic = tf.keras.layers.Dense(1, activation=None)(self.lstm1)
    return_ = tf.placeholder(dtype=tf.float32, shape=[None, 1])
    actor_loss = tf.placeholder(dtype=tf.float32, shape=[None, 1])
    entropy = tf.placeholder(dtype=tf.float32, shape=[None, 1])
    critic_loss = (return_ - self.critic)**2
    loss = 0.5 * critic_loss + actor_loss - 0.001 * entropy
    # There is no connection between the optimizer and the rest of the graph
    optimizer = tf.train.AdamOptimizer(0.001).minimize(loss)
    init = tf.global_variables_initializer()
    initlocal = tf.local_variables_initializer()
    session.run([init, initlocal])
def update(self, session, epochs, batch_size, states, actions, log_probs, returns, advantages):
    """
    Update neural network with experience buffers
    ##############################################
    epochs: number of epochs to train neural network for
    batch_size: size of data batches used for training
    states: array of environment states
    actions: array of actions taken
    log_probs: log probability of actions
    returns: array of estimated returns at given times
    advantages: array of advantages, calculated from the difference between predicted and estimated returns
    """
    # Clipping value for PPO
    clip = 0.2
    for e in range(epochs):
        for state, action, old_log_prob, return_, advantage in self.make_batches(states, actions, log_probs, returns, advantages):
            new_actions, value = self.forward(state, session)
            new_log_prob = np.log(new_actions[action])
            ent = entropy(actions)  # NOTE: from scipy.stats import entropy
            #ratio = np.exp(new_log_prob - old_log_prob)
            ratio = np.exp(np.mean(np.mean(new_log_prob - old_log_prob, axis=1), axis=1))
            a = np.mean(ratio * advantage)
            b = np.mean(np.clip(ratio, 1 - clip, 1 + clip) * advantage)
            actor_loss = - np.min([a, b])
            session.run(optimizer, feed_dict={actor_loss: actor_loss, return_: returns, self.env: state, entropy: ent})
Answer 0 (score: 0):

A) I assume the two functions __init__ and update are in the same class. If so, do the following:

1) Add the keyword self. at the beginning of optimizer in the __init__ function.

You will get: self.optimizer = tf.train.AdamOptimizer(0.001).minimize(loss)

2) Add the keyword self. at the beginning of optimizer in the update function.

You will get: session.run(self.optimizer, feed_dict={actor_loss: actor_loss, return_: returns, self.env: state, entropy: ent})

B) If they are not in the same class, you have to pass optimizer as a parameter to the update function.

You will get: def update(self, session, epochs, batch_size, states, actions, log_probs, returns, advantages, optimizer):
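
For illustration, here is a minimal, self-contained sketch of option A in TF 1.x graph mode; the ToyModel class, its tiny loss, and the target_value argument are invented for this example and are not part of the question's code. It shows why storing the training op as an instance attribute makes it reachable from another method:

import tensorflow as tf  # TF 1.x graph mode, as in the question

class ToyModel:
    def __init__(self, session):
        # Tiny graph: minimise (w - target)^2 with Adam.
        self.target = tf.placeholder(dtype=tf.float32, shape=[])
        self.w = tf.Variable(0.0)
        loss = (self.w - self.target) ** 2
        # Stored on self, so other methods of the instance can reference it later.
        self.optimizer = tf.train.AdamOptimizer(0.001).minimize(loss)
        session.run(tf.global_variables_initializer())

    def update(self, session, target_value):
        # self.optimizer is an instance attribute, not a local name defined in __init__,
        # so there is no NameError when another method runs it.
        session.run(self.optimizer, feed_dict={self.target: target_value})

sess = tf.Session()
model = ToyModel(sess)
model.update(sess, 3.0)

The same pattern applies to the question's placeholders (return_, actor_loss, entropy): anything that update needs to feed or run has to be stored on self in __init__, not left as a local variable.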