I am learning policy gradients through the CartPole game. I have two implementations and both work well. However, I think the two examples use opposite conventions for choosing which action to take. Below is partial code from the two implementations; I only list the network model and how each one chooses an action.
Method 1
import cntk as C
import numpy as np

# state_dim, hidden_size and action_count are defined in the (omitted) setup code.
observations = C.sequence.input_variable(state_dim, np.float32, name="obs")

# Two-layer policy network: ReLU hidden layer, then a sigmoid output.
W1 = C.parameter(shape=(state_dim, hidden_size), init=C.glorot_uniform(), name="W1")
b1 = C.parameter(shape=hidden_size, name="b1")
layer1 = C.relu(C.times(observations, W1) + b1)

W2 = C.parameter(shape=(hidden_size, action_count), init=C.glorot_uniform(), name="W2")
b2 = C.parameter(shape=action_count, name="b2")
layer2 = C.times(layer1, W2) + b2
output = C.sigmoid(layer2, name="output")
while not done:
    state = np.reshape(observation, [1, state_dim]).astype(np.float32)

    # Run the policy network and get an action to take.
    prob = output.eval(arguments={observations: state})[0][0][0]

    # Sample from the Bernoulli output distribution to get a discrete action.
    action = 1 if np.random.uniform() < prob else 0
    y = 1 if action == 0 else 0  # create a "fake label" or pseudo label
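For reference, here is a standalone NumPy sketch of just the sampling line from method 1; the helper name and the fixed prob value are mine, not from the CNTK sample. It only counts how often each action comes out of "1 if np.random.uniform() < prob else 0":

import numpy as np

def sample_like_method_1(prob, n=100_000, seed=0):
    # Repeat "action = 1 if uniform() < prob else 0" n times and report
    # the empirical frequency of each action (hypothetical helper).
    rng = np.random.default_rng(seed)
    actions = (rng.uniform(size=n) < prob).astype(int)
    return {0: float(np.mean(actions == 0)), 1: float(np.mean(actions == 1))}

# With prob = 0.8, roughly 80% of the samples come out as action 1 and 20% as
# action 0, i.e. the same distribution as np.random.binomial(1, prob).
print(sample_like_method_1(0.8))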
Method 2
import tensorflow as tf
import numpy as np

# state_dim, hidden_size, n_outputs and initializer are defined in the (omitted) setup code.
observations = tf.placeholder(tf.float32, shape=[None, state_dim])
hidden = tf.layers.dense(observations, hidden_size, activation=tf.nn.relu, kernel_initializer=initializer)
logits = tf.layers.dense(hidden, n_outputs)
output = tf.nn.sigmoid(logits)  # probability of action 0 (left)
p_left_and_right = tf.concat(axis=1, values=[output, 1 - output])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)
y = 1. - tf.to_float(action)  # pseudo label: 1.0 when action 0 is sampled, 0.0 otherwise
while not done:
    state = np.reshape(observation, [1, state_dim]).astype(np.float32)

    # Run the policy network and get an action to take.
    action_val, gradients_val = sess.run([action, gradients], feed_dict={observations: state})
    action_val = action_val[0][0]
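Similarly, here is a plain NumPy sketch of what I understand tf.multinomial(tf.log(p_left_and_right), num_samples=1) to be doing in method 2; the helper name and the example value are mine:

import numpy as np

def sample_like_method_2(p_action_0, n=100_000, seed=0):
    # Mimic tf.multinomial over [p, 1 - p]: index 0 is drawn with probability
    # p_action_0 and index 1 with probability 1 - p_action_0 (hypothetical helper).
    rng = np.random.default_rng(seed)
    actions = rng.choice([0, 1], size=n, p=[p_action_0, 1.0 - p_action_0])
    return {0: float(np.mean(actions == 0)), 1: float(np.mean(actions == 1))}

# With output = 0.8 this picks action 0 about 80% of the time.
print(sample_like_method_2(0.8))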
My question is about the way an action is chosen from the policy network. Both methods compute the probability of action 0 (the variable output). To me, method 2, which samples from a multinomial built from output, makes sense. What confuses me is method 1; I think it does the opposite of method 2. If the probability of choosing action 0 is prob (output), shouldn't the action be set to 0 when np.random.uniform() < prob? I tried my idea, but the results were terrible. Am I misunderstanding something here?
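To make "my idea" concrete, this is the change I tried in method 1, with everything else left exactly as in the snippet above (the fixed prob value is only for illustration):

import numpy as np

prob = 0.8  # stands in for output.eval(...)[0][0][0]

# Original method 1 sampling rule:
action_original = 1 if np.random.uniform() < prob else 0

# What I expected to be correct if prob is the probability of action 0:
action_my_idea = 0 if np.random.uniform() < prob else 1
y = 1 if action_my_idea == 0 else 0  # same pseudo-label rule as before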