How to use an RNN as a policy

Time: 2019-07-18 09:43:55

Tags: tensorflow recurrent-neural-network reinforcement-learning

I am working through Aurélien Géron's "Hands-On Machine Learning with Scikit-Learn and TensorFlow".

I found a nice example of reinforcement learning in it.

In this example, the author uses a simple neural network as the policy:

import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

n_inputs = 4   # size of the CartPole observation
n_hidden = 4
n_outputs = 1  # probability of action 0 (left)

learning_rate = 0.01

initializer = tf.contrib.layers.variance_scaling_initializer()

X = tf.placeholder(tf.float32, shape=[None, n_inputs])

hidden = fully_connected(X, n_hidden, activation_fn=tf.nn.elu, weights_initializer=initializer)
logits = fully_connected(hidden, n_outputs, activation_fn=None)
outputs = tf.nn.sigmoid(logits)  # probability of action 0 (left)
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
# sample 1 of the 2 actions according to the probabilities
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)
# target: act as if the sampled action were the correct one
y = 1. - tf.to_float(action)

cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(cross_entropy)
gradients = [grad for grad, variable in grads_and_vars]
gradient_placeholders = []
grads_and_vars_feed = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
training_op = optimizer.apply_gradients(grads_and_vars_feed)
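In the book, these gradient placeholders are later fed with the computed gradients multiplied by discounted, normalized rewards (the REINFORCE trick). As a sketch of my own (the helper name is hypothetical, not from the code above), the discounting step in plain NumPy looks roughly like this:

```python
import numpy as np

def discount_rewards(rewards, discount_rate):
    # propagate each reward backwards through time:
    # discounted[t] = rewards[t] + discount_rate * discounted[t + 1]
    discounted = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount_rate * running
        discounted[t] = running
    return discounted

# e.g. rewards [0, 0, 1] with rate 0.8 -> roughly [0.64, 0.8, 1.0]
result = discount_rewards([0.0, 0.0, 1.0], 0.8)
```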

As the output, I get 1 of 2 actions.

Based on this example, I tried to build my own policy: it has 3 output neurons and, on top of that, a recurrent neural network. I am trying to choose 1 of 3 actions.

My code at the moment:

n_inputs = 2  # two input vectors per time step
n_steps = 10  # the input vectors have 10 elements each
n_neurons = 30
n_outputs = 3

learning_rate = 0.01
initializer = tf.contrib.layers.variance_scaling_initializer()

# example of X below
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs]) 

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)
logits = tf.layers.dense(states, n_outputs) 
out_softmax = tf.nn.softmax(logits) # I'm expecting 3 outputs

# sample 1 of the 3 actions according to the softmax probabilities
action = tf.multinomial(tf.log(out_softmax), num_samples=1)
y = tf.to_float(action)

# gradients
#
# ....
#
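For reference, `tf.multinomial(tf.log(out_softmax), num_samples=1)` draws one action index per batch row from the softmax distribution. A minimal NumPy sketch of my own (not part of the graph above) of what that sampling does:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

rng = np.random.default_rng(seed=42)

logits = np.array([[2.0, 0.5, -1.0]])  # one batch row, 3 actions
probs = softmax(logits)

# draw one action index, weighted by the probabilities
action = rng.choice(3, p=probs[0])
```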

If my input vector looks like this:

#(X size = (1, 24, 2))
X = [[0,1,2,... 23], [0,1,2,... 23]]
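A quick NumPy check of my own addition: the nested-list literal above does not actually have shape (1, 24, 2). NumPy reports (2, 24), i.e. rank 2 rather than the rank-3 [batch, n_steps, n_inputs] layout the placeholder expects, which would be consistent with a rank/reshape check failing:

```python
import numpy as np

# the literal from the question: two 24-element vectors
v = list(range(24))
X = [v, v]

# rank 2, not the rank-3 shape the placeholder expects
assert np.array(X).shape == (2, 24)

# stacking the two vectors along the last axis and adding a
# batch dimension yields the [batch, n_steps, n_inputs] layout
X3 = np.stack([v, v], axis=-1)[np.newaxis]
assert X3.shape == (1, 24, 2)
```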

I get this error:

Check failed: NDIMS == new_sizes.size() (2 vs. 1)

Process finished with exit code -1073740791 (0xC0000409)

Why? I am sure the output should be a value like y = 0 (or 1, or 2), i.e. action 0, 1, or 2.

Maybe I am misunderstanding something? Can you help?

Of course, if my X looks like this:

# (X size = (n, 24, 2)) (n >= 2)
#e.g.:
X =[[[0,1,2,... 23], [0,1,2,... 23]],
    [[0,1,2,... 23], [0,1,2,... 23]],
    [[0,1,2,... 23], [0,1,2,... 23]],
    [[0,1,2,... 23], [0,1,2,... 23]]]

my policy works fine.

I hope this is a relevant post and that the problem is described well enough. If not, please let me know!

0 answers:

There are no answers yet.