I am learning policy gradients through the CartPole game. I have two implementations and both work well. However, I think the two examples use opposite conventions for choosing which action to take. Below is partial code from the two implementations; I only list the network model and how each one chooses an action.
Method 1
import cntk as C
import numpy as np

# state_dim, hidden_size and action_count are defined in the (omitted) setup code.
observations = C.sequence.input_variable(state_dim, np.float32, name="obs")

# Two-layer policy network: ReLU hidden layer, then a sigmoid output.
W1 = C.parameter(shape=(state_dim, hidden_size), init=C.glorot_uniform(), name="W1")
b1 = C.parameter(shape=hidden_size, name="b1")
layer1 = C.relu(C.times(observations, W1) + b1)

W2 = C.parameter(shape=(hidden_size, action_count), init=C.glorot_uniform(), name="W2")
b2 = C.parameter(shape=action_count, name="b2")
layer2 = C.times(layer1, W2) + b2
output = C.sigmoid(layer2, name="output")
while not done:
    state = np.reshape(observation, [1, state_dim]).astype(np.float32)

    # Run the policy network and get an action to take.
    prob = output.eval(arguments={observations: state})[0][0][0]

    # Sample from the Bernoulli output distribution to get a discrete action.
    action = 1 if np.random.uniform() < prob else 0
    y = 1 if action == 0 else 0  # create a "fake label" or pseudo label
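For reference, here is a standalone NumPy sketch of just the sampling line from method 1; the helper name and the fixed prob value are mine, not from the CNTK sample. It only counts how often each action comes out of "1 if np.random.uniform() < prob else 0":

import numpy as np

def sample_like_method_1(prob, n=100_000, seed=0):
    # Repeat "action = 1 if uniform() < prob else 0" n times and report
    # the empirical frequency of each action (hypothetical helper).
    rng = np.random.default_rng(seed)
    actions = (rng.uniform(size=n) < prob).astype(int)
    return {0: float(np.mean(actions == 0)), 1: float(np.mean(actions == 1))}

# With prob = 0.8, roughly 80% of the samples come out as action 1 and 20% as
# action 0, i.e. the same distribution as np.random.binomial(1, prob).
print(sample_like_method_1(0.8))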
Method 2
import tensorflow as tf
import numpy as np

# state_dim, hidden_size, n_outputs and initializer are defined in the (omitted) setup code.
observations = tf.placeholder(tf.float32, shape=[None, state_dim])
hidden = tf.layers.dense(observations, hidden_size, activation=tf.nn.relu, kernel_initializer=initializer)
logits = tf.layers.dense(hidden, n_outputs)
output = tf.nn.sigmoid(logits)  # probability of action 0 (left)
p_left_and_right = tf.concat(axis=1, values=[output, 1 - output])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)
y = 1. - tf.to_float(action)  # pseudo label: 1.0 when action 0 is sampled, 0.0 otherwise
while not done:
    state = np.reshape(observation, [1, state_dim]).astype(np.float32)

    # Run the policy network and get an action to take.
    action_val, gradients_val = sess.run([action, gradients], feed_dict={observations: state})
    action_val = action_val[0][0]
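Similarly, here is a plain NumPy sketch of what I understand tf.multinomial(tf.log(p_left_and_right), num_samples=1) to be doing in method 2; the helper name and the example value are mine:

import numpy as np

def sample_like_method_2(p_action_0, n=100_000, seed=0):
    # Mimic tf.multinomial over [p, 1 - p]: index 0 is drawn with probability
    # p_action_0 and index 1 with probability 1 - p_action_0 (hypothetical helper).
    rng = np.random.default_rng(seed)
    actions = rng.choice([0, 1], size=n, p=[p_action_0, 1.0 - p_action_0])
    return {0: float(np.mean(actions == 0)), 1: float(np.mean(actions == 1))}

# With output = 0.8 this picks action 0 about 80% of the time.
print(sample_like_method_2(0.8))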
My question is about the way an action is chosen from the policy network. Both methods compute the probability of action 0 (the variable output). To me, method 2, which samples from a multinomial built from output, makes sense. What confuses me is method 1; I think it does the opposite of method 2. If the probability of choosing action 0 is prob (output), shouldn't the action be set to 0 when np.random.uniform() < prob? I tried my idea, but the results were terrible. Am I misunderstanding something here?
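To make "my idea" concrete, this is the change I tried in method 1, with everything else left exactly as in the snippet above (the fixed prob value is only for illustration):

import numpy as np

prob = 0.8  # stands in for output.eval(...)[0][0][0]

# Original method 1 sampling rule:
action_original = 1 if np.random.uniform() < prob else 0

# What I expected to be correct if prob is the probability of action 0:
action_my_idea = 0 if np.random.uniform() < prob else 1
y = 1 if action_my_idea == 0 else 0  # same pseudo-label rule as before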