TensorFlow agent that picks random actions

Time: 2019-05-25 17:47:34

Tags: python numpy tensorflow

I want to build an agent with TensorFlow. I have 9 action categories: roll left, roll right, brake, and so on. The output of the TensorFlow pipeline is an array[9], and based on it I simulate WSAD key-press combinations. Sometimes I want to pick a random action, but not a uniformly random one: it should be drawn according to the dense layer's softmax output. The function I want is exactly numpy.random.multinomial. However, tensorflow.random.multinomial only returns the index of the chosen action, not a tensor with the same shape as the input. I tried saving the actions and feeding them back to the agent later, during the teaching phase, but the example my code is based on would then also require feeding actions during the playing phase, which I don't want. I know this could be done with tensorflow.cond and tensorflow.equal, but the pipeline would look messy and I'm not sure about the performance. In other words: is there a TensorFlow function that behaves like numpy.random.multinomial, or is there a reason it doesn't exist and my agent's architecture is simply wrong?
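To make the difference concrete, here is a small sketch with made-up probabilities (not my real network output):

    import numpy as np
    import tensorflow as tf

    probs = [0.1, 0.2, 0.4, 0.1, 0.05, 0.05, 0.05, 0.03, 0.02]

    # numpy gives a length-9 one-hot-like count vector directly:
    print(np.random.multinomial(1, probs))            # e.g. [0 0 1 0 0 0 0 0 0]

    # tensorflow.random.multinomial expects log-probabilities (logits)
    # and returns only the sampled index:
    with tf.Session() as sess:
        idx = tf.random.multinomial(tf.log([probs]), 1)
        print(sess.run(idx))                          # e.g. [[2]]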

The agent itself:

class agentY():
    def __init__(self,lr,s_size,a_size,h_size):
        self.state_in = tf.placeholder(shape = [None]+list(s_size),dtype=tf.float32)
        conv1         = tf.layers.conv2d(self.state_in,32,4,strides=(4, 4))
        max_pool1     = tf.layers.max_pooling2d(conv1,32,4)
        flatten       = tf.layers.flatten(max_pool1)
        hidden        = tf.layers.dense(flatten,4096,activation=tf.nn.tanh)


        hidden_action       = tf.layers.dense(hidden,2048, activation=tf.nn.elu)
        self.action         = tf.layers.dense(hidden_action,9, activation=tf.nn.softmax)

        self.action_in      = tf.placeholder(shape =[None,9],dtype=tf.float32, name='acin') 
        cross_entropy       = tf.nn.softmax_cross_entropy_with_logits_v2(labels=self.action_in,
                                                                  logits=self.action)
        optimizer             = tf.train.AdamOptimizer(lr)
        grads_and_vars = optimizer.compute_gradients(cross_entropy)

        self.gradients = [grad for grad, variable in grads_and_vars]
        self.gradient_placeholders = []
        grads_and_vars_feed = []
        for grad, variable in grads_and_vars:
            gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
            self.gradient_placeholders.append(gradient_placeholder)
            grads_and_vars_feed.append((gradient_placeholder, variable))
        self.training_op = optimizer.apply_gradients(grads_and_vars_feed)

Playing phase:

    state = get_state()
    action = sess.run([myAgent.action], feed_dict={myAgent.state_in:[state]})
    action = numpy.random.multinomial(1, action[0][0])
    if do_action:
        releaseKeys()
        update_pressed_keys(categoriesToKeys(action))

    reward = reward + delta_time
    current_rewards.append(reward)
    current_gradients.append(myAgent.gradients)

Teaching phase:

    def teach_agent(agent, all_rewards, all_gradients, sess):
        # discount and normalize the raw rewards, then use them as per-step weights
        rewards = discount_and_normalize_rewards(all_rewards, 0.99)
        feed_dict = {}
        for var_index, gradient_placeholder in enumerate(agent.gradient_placeholders):
            mean_gradients = np.mean([reward * all_gradients[game_index][step][var_index]
                                      for game_index, game_rewards in enumerate(rewards)
                                          for step, reward in enumerate(game_rewards)], axis=0)
            feed_dict[gradient_placeholder] = mean_gradients
        sess.run(agent.training_op, feed_dict=feed_dict)

The teaching phase has not been tested yet. The code is based on the book Hands-On Machine Learning with Scikit-Learn and TensorFlow.
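For completeness, discount_and_normalize_rewards is not shown above; a minimal sketch, roughly along the lines of the book's version (details assumed):

    import numpy as np

    def discount_rewards(rewards, discount_rate):
        # propagate each reward backwards through the episode
        discounted = np.zeros(len(rewards))
        cumulative = 0.0
        for step in reversed(range(len(rewards))):
            cumulative = rewards[step] + cumulative * discount_rate
            discounted[step] = cumulative
        return discounted

    def discount_and_normalize_rewards(all_rewards, discount_rate):
        # discount per game, then normalize across all games together
        all_discounted = [discount_rewards(r, discount_rate) for r in all_rewards]
        flat = np.concatenate(all_discounted)
        mean, std = flat.mean(), flat.std()
        return [(d - mean) / std for d in all_discounted]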

1 Answer:

Answer 0 (score: 0)

I managed to get tensorflow.nn.softmax_cross_entropy_with_logits_v2() to run with labels and logits of shape [None, 1], where that tensor held the action index (category). It then crashed on the GPU side, and I realized I had been doing it wrong all along and had forgotten one of the most basic techniques: one-hot encoding. I was using the multinomial draw only to compute an index instead of one-hot encoding the result. Example below:

import tensorflow as tf
import numpy as np

p  = tf.placeholder(shape=[None, 4], dtype=tf.float32)
t  = tf.nn.softmax(p)                      # probabilities
t1 = tf.random.categorical(tf.log(t), 1)   # sampled index, shape [None, 1]
t2 = tf.one_hot(t1, 4,
                on_value=1.0, off_value=0.0,
                axis=-1)                   # one-hot vector, shape [None, 1, 4]

with tf.Session() as sess:
    inArray = [[0.8, 0.5, 0.1, 0.2]]
    index, outArray = sess.run([t1, t2], feed_dict={p: inArray})
    print("Index:", index)
    print("Array:", outArray)

This is of course a rookie mistake; I am new to ML and still struggling to understand TensorFlow. The agent now looks like this:

class agentY():
    def __init__(self,lr,s_size,a_size,h_size):
        self.state_in = tf.placeholder(shape = [None]+list(s_size),dtype=tf.float32)
        conv1         = tf.layers.conv2d(self.state_in,32,4,strides=(4, 4))
        max_pool1     = tf.layers.max_pooling2d(conv1,32,4)
        flatten       = tf.layers.flatten(max_pool1)
        hidden        = tf.layers.dense(flatten,4096,activation=tf.nn.tanh)


        hidden_action       = tf.layers.dense(hidden,2048, activation=tf.nn.elu)
        # raw logits (no activation): tf.multinomial and
        # softmax_cross_entropy_with_logits_v2 both expect unnormalized logits
        self.action_logits  = tf.layers.dense(hidden_action,9)
        sampled_action      = tf.multinomial(self.action_logits,1)               # shape [None, 1]
        self.action_out     = tf.one_hot(tf.squeeze(sampled_action, axis=1), 9,  # shape [None, 9]
                                         on_value=1.0, off_value=0.0, axis=-1)
        cross_entropy       = tf.nn.softmax_cross_entropy_with_logits_v2(labels=self.action_out,
                                                                         logits=self.action_logits)
        optimizer             = tf.train.AdamOptimizer(lr)
        grads_and_vars = optimizer.compute_gradients(cross_entropy)

        self.gradients = [grad for grad, variable in grads_and_vars]
        self.gradient_placeholders = []
        grads_and_vars_feed = []
        for grad, variable in grads_and_vars:
            gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
            self.gradient_placeholders.append(gradient_placeholder)
            grads_and_vars_feed.append((gradient_placeholder, variable))
        self.training_op = optimizer.apply_gradients(grads_and_vars_feed)

tf.reset_default_graph()
testAgent = agentY(0.1,(300,400,1),9,11) 

The problem now is that I compute and store gradients for every single decision the agent makes. This needs a lot of RAM, so I definitely do not recommend it (a possible way around this is sketched after the loop below). Take a look:

while True:
    time0 = time.time()
    #-----------------zzx
    if collectData:
        state = get_state()
        action_out, gradients = sess.run([myAgent.action_out,myAgent.gradients], feed_dict={myAgent.state_in:[state]})

        if do_action:
            releaseKeys()
            update_pressed_keys(categoriesToKeys(action_out))

        reward = reward + delta_time
        current_rewards.append(reward)
        current_gradients.append(gradients)
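One way to avoid storing per-step gradients, just as a sketch of an idea I have not tested: record only the states, the sampled one-hot actions and the rewards during play, and let the optimizer do a single minimize over the whole batch at teach time. The placeholders action_in and reward_in below are hypothetical additions to agentY.__init__, not code that exists above:

# hypothetical extra lines inside agentY.__init__ (untested sketch)
self.action_in   = tf.placeholder(shape=[None, 9], dtype=tf.float32)   # recorded one-hot actions
self.reward_in   = tf.placeholder(shape=[None], dtype=tf.float32)      # discounted, normalized rewards
cross_entropy    = tf.nn.softmax_cross_entropy_with_logits_v2(labels=self.action_in,
                                                              logits=self.action_logits)
loss             = tf.reduce_mean(self.reward_in * cross_entropy)      # reward-weighted policy loss
self.training_op = tf.train.AdamOptimizer(lr).minimize(loss)

During play only current_states, current_actions and current_rewards would grow, and teach_agent would feed them back in a single sess.run(training_op, ...) call.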

Later, in the function teach_agent(), I feed these gradients, weighted by the rewards, back into the agent's network (teach_agent is posted in the original question above). Before I go back to the book and try to understand its next example, a Q-Learning agent, could someone, if possible, explain Q-Learning, or other approaches to reinforcement learning, in a simple way?