C51 reinforcement learning algorithm is extremely slow

Date: 2019-08-14 01:28:31

Tags: python tensorflow machine-learning reinforcement-learning q-learning

I am working on reinforcement learning for a time series prediction problem. So far I have implemented a dueling DDQN algorithm with an LSTM, which seems to give decent results, although depending on the exact problem it sometimes converges slowly. I then implemented C51 distributional reinforcement learning to compare performance (I was hoping it would give better results).

I slightly adapted Google's C51 code from Dopamine to integrate it into my own code (the network and training parts). I also added Double Q-learning for selecting the next-state action (the original code does not use it). The problem, however, is that execution is really very slow. For comparison, my previous dueling DDQN took about 3.5 hours to train for 50,000 episodes, whereas the C51 algorithm has already taken almost 10 hours for only 3,000 episodes.

I would like to know whether something in my modifications is wrong, or whether the C51 algorithm really is that slow. I am using an NVIDIA GeForce RTX 2080 Ti.

Here is the network part:

#network part
self.weights_initializer = tf.contrib.slim.variance_scaling_initializer(factor=1.0 / np.sqrt(3.0), mode='FAN_IN', uniform=True)
self.net = tf.contrib.slim.fully_connected(
  self.rnn, # output of an LSTM
  num_actions * num_atoms,
  activation_fn=None,
  weights_initializer=self.weights_initializer)

self.logits = tf.reshape(self.net, [-1, num_actions, num_atoms])
self.probabilities = tf.contrib.layers.softmax(self.logits)
self.q_values = tf.reduce_sum(self._support * self.probabilities, axis=2)

self.predict = tf.argmax(self.q_values,1)

self.actions = tf.placeholder(shape=[None],dtype=tf.int32)    

self.target_distribution = tf.placeholder(shape=[None,num_atoms],dtype=tf.float32)

# size of indices: batch_size x 1.
self.indices = tf.range(tf.shape(self.logits)[0])[:, None]
# size of reshaped_actions: batch_size x 2.
self.reshaped_actions = tf.concat([self.indices, self.actions[:, None]], 1)
# For each element of the batch, fetch the logits for its selected action.
self.chosen_action_logits = tf.gather_nd(self.logits,
                                self.reshaped_actions)

self.td_error = tf.nn.softmax_cross_entropy_with_logits(labels=self.target_distribution,logits=self.chosen_action_logits)


# divide by the real length of episodes instead of averaging which is incorrect
self.loss = tf.cast(tf.reduce_sum(self.td_error), tf.float64) / tf.cast(tf.reduce_sum(self.seq_len), tf.float64)

if apply_grad_clipping:
   # calculate gradients and clip them to handle outliers
   tvars = tf.trainable_variables()
   grads, _ = tf.clip_by_global_norm(tf.gradients(self.loss, tvars), grad_clipping)
   self.updateModel = optimizer.apply_gradients(
        zip(grads, tvars),
        name="updateModel")
else:
   self.updateModel = optimizer.minimize(self.loss, name="updateModel")

Here is the training part:

# training part
if i >= pre_train_episodes:
        #Reset the lstm's hidden state
        state_train = np.zeros((num_layers, 2, batch_size, h_size))
        #Get a random batch of experiences.
        trainBatch = myBuffer.sample(batch_size)
        #Below we perform the Double-DQN update to the target Q-values
        num_samples = batch_size*trace_length
        # size of rewards: num_samples x 1
        rewards = trainBatch[:,2][:, None]

        # size of tiled_support: num_samples x num_atoms
        tiled_support = tf.tile(mainQN._support, [num_samples])
        tiled_support = tf.reshape(tiled_support, [num_samples, num_atoms])

        # size of is_terminal_multiplier: num_samples (1 for non-terminal, 0 for terminal)
        is_terminal_multiplier = -(np.array(trainBatch[:,4]) - 1)
        # Incorporate terminal state into the discount factor.
        # size of gamma_with_terminal: num_samples x 1
        gamma_with_terminal = gamma * is_terminal_multiplier
        gamma_with_terminal = gamma_with_terminal[:, None]

        # size of target_support: num_samples x num_atoms
        target_support = rewards + gamma_with_terminal * tiled_support


        next_qt_argmax = sess.run([mainQN.predict], feed_dict={\
                            mainQN.scalarInput:np.vstack(trainBatch[:,3]),\
                            mainQN.trainLength:trace_length,mainQN.state_in:state_train,mainQN.batch_size:batch_size})
        next_qt_argmax = np.reshape(next_qt_argmax, [-1, 1])
        probabilities = sess.run(targetQN.probabilities, feed_dict={\
                            targetQN.scalarInput:np.vstack(trainBatch[:,3]),\
                            targetQN.trainLength:trace_length,targetQN.state_in:state_train,targetQN.batch_size:batch_size})
        batch_indices = np.arange(num_samples)[:, None]
        batch_indexed_next_qt_argmax = np.concatenate([batch_indices, next_qt_argmax], axis=1)


        # size of next_probabilities: num_samples x num_atoms
        next_probabilities = tf.gather_nd(probabilities, batch_indexed_next_qt_argmax)


        target_distribution = project_distribution(target_support, next_probabilities, mainQN._support)
        target_distribution = target_distribution.eval()

        loss, _, _ = sess.run([mainQN.loss, mainQN.check_ops, mainQN.updateModel], \
                            feed_dict={mainQN.scalarInput:np.vstack(trainBatch[:,0]),mainQN.target_distribution:target_distribution,\
                            mainQN.actions:trainBatch[:,1],mainQN.trainLength:trace_length,\
                            mainQN.state_in:state_train,mainQN.batch_size:batch_size})

        # perform a soft/hard target-network update at the configured frequency
        if i % update_target_freq == 0 or update_target_freq == 1 or softUpdate:
            updateTarget(targetOps, sess)

The helper function:

# function used above to project the distribution on the provided support
def project_distribution(supports, weights, target_support,
                         validate_args=False):
  """Projects a batch of (support, weights) onto target_support.

  Based on equation (7) in (Bellemare et al., 2017):
  https://arxiv.org/abs/1707.06887
  In the rest of the comments we will refer to this equation simply as Eq7.

  This code is not easy to digest, so we will use a running example to clarify
  what is going on, with the following sample inputs:
  * supports =       [[0, 2, 4, 6, 8],
                      [1, 3, 4, 5, 6]]
  * weights =        [[0.1, 0.6, 0.1, 0.1, 0.1],
                      [0.1, 0.2, 0.5, 0.1, 0.1]]
  * target_support = [4, 5, 6, 7, 8]
  In the code below, comments preceded with 'Ex:' will be referencing the
  above values.

  Args:
    supports: Tensor of shape (batch_size, num_dims) defining supports for the
      distribution.
    weights: Tensor of shape (batch_size, num_dims) defining weights on the
      original support points. Although for the CategoricalDQN agent these
      weights are probabilities, it is not required that they are.
    target_support: Tensor of shape (num_dims) defining support of the
      projected distribution. The values must be monotonically increasing.
      Vmin and Vmax will be inferred from the first and last elements of this
      tensor, respectively. The values in this tensor must be equally spaced.
    validate_args: Whether we will verify the contents of the target_support
      parameter.

  Returns:
    A Tensor of shape (batch_size, num_dims) with the projection of a batch
    of (support, weights) onto target_support.

  Raises:
    ValueError: If target_support has no dimensions, or if shapes of supports,
      weights, and target_support are incompatible.
  """
  target_support_deltas = target_support[1:] - target_support[:-1]
  # delta_z = `\Delta z` in Eq7.
  delta_z = target_support_deltas[0]
  validate_deps = []
  supports.shape.assert_is_compatible_with(weights.shape)
  supports[0].shape.assert_is_compatible_with(target_support.shape)
  target_support.shape.assert_has_rank(1)
  if validate_args:
    # Assert that supports and weights have the same shapes.
    validate_deps.append(
        tf.Assert(
            tf.reduce_all(tf.equal(tf.shape(supports), tf.shape(weights))),
            [supports, weights]))
    # Assert that elements of supports and target_support have the same shape.
    validate_deps.append(
        tf.Assert(
            tf.reduce_all(
                tf.equal(tf.shape(supports)[1], tf.shape(target_support))),
            [supports, target_support]))
    # Assert that target_support has a single dimension.
    validate_deps.append(
        tf.Assert(
            tf.equal(tf.size(tf.shape(target_support)), 1), [target_support]))
    # Assert that the target_support is monotonically increasing.
    validate_deps.append(
        tf.Assert(tf.reduce_all(target_support_deltas > 0), [target_support]))
    # Assert that the values in target_support are equally spaced.
    validate_deps.append(
        tf.Assert(
            tf.reduce_all(tf.equal(target_support_deltas, delta_z)),
            [target_support]))

  with tf.control_dependencies(validate_deps):
    # Ex: `v_min, v_max = 4, 8`.
    v_min, v_max = target_support[0], target_support[-1]
    # Ex: `batch_size = 2`.
    batch_size = tf.shape(supports)[0]
    # `N` in Eq7.
    # Ex: `num_dims = 5`.
    num_dims = tf.shape(target_support)[0]
    # clipped_support = `[\hat{T}_{z_j}]^{V_max}_{V_min}` in Eq7.
    # Ex: `clipped_support = [[[ 4.  4.  4.  6.  8.]]
    #                         [[ 4.  4.  4.  5.  6.]]]`.
    clipped_support = tf.clip_by_value(supports, v_min, v_max)[:, None, :]
    # Ex: `tiled_support = [[[[ 4.  4.  4.  6.  8.]
    #                         [ 4.  4.  4.  6.  8.]
    #                         [ 4.  4.  4.  6.  8.]
    #                         [ 4.  4.  4.  6.  8.]
    #                         [ 4.  4.  4.  6.  8.]]
    #                        [[ 4.  4.  4.  5.  6.]
    #                         [ 4.  4.  4.  5.  6.]
    #                         [ 4.  4.  4.  5.  6.]
    #                         [ 4.  4.  4.  5.  6.]
    #                         [ 4.  4.  4.  5.  6.]]]]`.
    tiled_support = tf.tile([clipped_support], [1, 1, num_dims, 1])
    # Ex: `reshaped_target_support = [[[ 4.]
    #                                  [ 5.]
    #                                  [ 6.]
    #                                  [ 7.]
    #                                  [ 8.]]
    #                                 [[ 4.]
    #                                  [ 5.]
    #                                  [ 6.]
    #                                  [ 7.]
    #                                  [ 8.]]]`.
    reshaped_target_support = tf.tile(target_support[:, None], [batch_size, 1])
    reshaped_target_support = tf.reshape(reshaped_target_support,
                                         [batch_size, num_dims, 1])
    # numerator = `|clipped_support - z_i|` in Eq7.
    # Ex: `numerator = [[[[ 0.  0.  0.  2.  4.]
    #                     [ 1.  1.  1.  1.  3.]
    #                     [ 2.  2.  2.  0.  2.]
    #                     [ 3.  3.  3.  1.  1.]
    #                     [ 4.  4.  4.  2.  0.]]
    #                    [[ 0.  0.  0.  1.  2.]
    #                     [ 1.  1.  1.  0.  1.]
    #                     [ 2.  2.  2.  1.  0.]
    #                     [ 3.  3.  3.  2.  1.]
    #                     [ 4.  4.  4.  3.  2.]]]]`.
    numerator = tf.abs(tiled_support - reshaped_target_support)
    quotient = 1 - (numerator / delta_z)
    # clipped_quotient = `[1 - numerator / (\Delta z)]_0^1` in Eq7.
    # Ex: `clipped_quotient = [[[[ 1.  1.  1.  0.  0.]
    #                            [ 0.  0.  0.  0.  0.]
    #                            [ 0.  0.  0.  1.  0.]
    #                            [ 0.  0.  0.  0.  0.]
    #                            [ 0.  0.  0.  0.  1.]]
    #                           [[ 1.  1.  1.  0.  0.]
    #                            [ 0.  0.  0.  1.  0.]
    #                            [ 0.  0.  0.  0.  1.]
    #                            [ 0.  0.  0.  0.  0.]
    #                            [ 0.  0.  0.  0.  0.]]]]`.
    clipped_quotient = tf.clip_by_value(quotient, 0, 1)
    # Ex: `weights = [[ 0.1  0.6  0.1  0.1  0.1]
    #                 [ 0.1  0.2  0.5  0.1  0.1]]`.
    weights = weights[:, None, :]
    # inner_prod = `\sum_{j=0}^{N-1} clipped_quotient * p_j(x', \pi(x'))`
    # in Eq7.
    # Ex: `inner_prod = [[[[ 0.1  0.6  0.1  0.   0. ]
    #                      [ 0.   0.   0.   0.   0. ]
    #                      [ 0.   0.   0.   0.1  0. ]
    #                      [ 0.   0.   0.   0.   0. ]
    #                      [ 0.   0.   0.   0.   0.1]]
    #                     [[ 0.1  0.2  0.5  0.   0. ]
    #                      [ 0.   0.   0.   0.1  0. ]
    #                      [ 0.   0.   0.   0.   0.1]
    #                      [ 0.   0.   0.   0.   0. ]
    #                      [ 0.   0.   0.   0.   0. ]]]]`.
    inner_prod = clipped_quotient * weights
    # Ex: `projection = [[ 0.8  0.   0.1  0.   0.1]
    #                    [ 0.8  0.1  0.1  0.   0. ]]`.
    projection = tf.reduce_sum(inner_prod, 3)
    projection = tf.reshape(projection, [batch_size, num_dims])
    return projection
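
For reference, the helper can be sanity-checked against the docstring's running example (the constants and session boilerplate here are only for the check, not part of my training code):

import tensorflow as tf

supports = tf.constant([[0., 2., 4., 6., 8.],
                        [1., 3., 4., 5., 6.]])
weights = tf.constant([[0.1, 0.6, 0.1, 0.1, 0.1],
                       [0.1, 0.2, 0.5, 0.1, 0.1]])
target_support = tf.constant([4., 5., 6., 7., 8.])

projection = project_distribution(supports, weights, target_support)
with tf.Session() as sess:
    # Expected per the docstring: [[0.8 0.  0.1 0.  0.1]
    #                              [0.8 0.1 0.1 0.  0. ]]
    print(sess.run(projection))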

Thanks in advance!

1 Answer:

Answer 0 (score: 0)

If there were any problem with your GPU, TensorFlow would have notified you with a warning the first time you ran the script.

In general, the C51-DQN algorithm is slower than plain DQN. This is because computing a full distribution over returns for each action takes longer than computing a single expected value per action.
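
As a rough illustration of that difference (a minimal sketch in TF 1.x; the feature shape, layer calls, and support range below are mine, not taken from the question's code):

import tensorflow as tf

num_actions, num_atoms = 4, 51
features = tf.placeholder(tf.float32, [None, 128])          # hypothetical state features
support = tf.linspace(-10.0, 10.0, num_atoms)               # z_1 ... z_N atoms (assumed Vmin/Vmax)

# DQN head: a single scalar Q-value per action.
dqn_q = tf.layers.dense(features, num_actions)              # [batch, num_actions]

# C51 head: num_atoms logits per action; the Q-value is an expectation
# over the support, so there is strictly more work per action.
c51_logits = tf.layers.dense(features, num_actions * num_atoms)
c51_logits = tf.reshape(c51_logits, [-1, num_actions, num_atoms])
c51_probs = tf.nn.softmax(c51_logits)                       # [batch, num_actions, num_atoms]
c51_q = tf.reduce_sum(support * c51_probs, axis=2)          # [batch, num_actions]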

In addition, Google's Dopamine Rainbow/C51 implementation is faster than your custom implementation because its memory buffer is wired directly into the TF graph. This means TensorFlow does not waste time:

  1. retrieving experiences from memory (RAM),

  2. converting numpy arrays to tensors,

  3. performing the computations and concatenating columns,

  4. feeding the results to the network.

Instead, all of the above is done directly on the GPU.

If you want your program to be faster, there are a couple of things you can do (a sketch combining both follows this list):

  1. Store experiences in TF variables instead of RAM.

  2. Use @tf.function to put all computations (e.g., the forward pass on states) into the graph.
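
A minimal sketch of both ideas together, assuming TF 2.x (the buffer fields and the network/optimizer objects below are hypothetical placeholders, not from your code):

import tensorflow as tf

capacity, state_dim = 100000, 128

# 1. Keep the replay buffer in TF variables so sampled batches never leave the device.
buf_states = tf.Variable(tf.zeros([capacity, state_dim]), trainable=False)
buf_rewards = tf.Variable(tf.zeros([capacity]), trainable=False)
# ... one variable per stored field (actions, next states, terminal flags).

@tf.function  # 2. Compile the whole training step into a single graph.
def train_step(indices, network, optimizer):
    states = tf.gather(buf_states, indices)    # on-device sampling, no numpy round trip
    rewards = tf.gather(buf_rewards, indices)
    with tf.GradientTape() as tape:
        loss = network.loss(states, rewards)   # hypothetical loss method on your network object
    grads = tape.gradient(loss, network.trainable_variables)
    optimizer.apply_gradients(zip(grads, network.trainable_variables))
    return loss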