Why are gradients summed rather than averaged in backpropagation through time?

Asked: 2019-03-21 19:45:31

Tags: recurrent-neural-network backpropagation

In the following implementation of the backward pass for a vanilla RNN, the gradients for Wh, Wx and b are computed by adding up the gradients calculated at each time step. Intuitively, what does this do, and why wouldn't we average them instead?

import numpy as np


def rnn_backward(dh, cache):
    """
    Compute the backward pass for a vanilla RNN over an entire sequence of data.
    Inputs:
    - dh: Upstream gradients of all hidden states, of shape (N, T, H)
    - cache: List of per-time-step caches stored by the forward pass
    Returns a tuple of:
    - dx: Gradient of inputs, of shape (N, T, D)
    - dh0: Gradient of initial hidden state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
    - db: Gradient of biases, of shape (H,)
    """

    Wx, Wh, b, x, prev_h, next_h = cache[0]
    N, T, H = dh.shape
    D, H = Wx.shape

    # Initialise gradients.
    dx = np.zeros([N, T, D])
    dWx = np.zeros_like(Wx)
    dWh = np.zeros_like(Wh)
    db = np.zeros_like(b)
    dprev_h = np.zeros_like(prev_h)

    # Backprop in time - start at last calculated time step and work back.
    for t_step in reversed(range(T)):

        # Add the current timestep upstream gradient to previous calculated dh
        cur_dh = dprev_h + dh[:,t_step,:]

        # Calculate gradients at this time step.
        dx[:, t_step, :], dprev_h, dWx_temp, dWh_temp, db_temp = rnn_step_backward(cur_dh, cache[t_step])

        # Add gradient contributions from each time step.
        dWx += dWx_temp
        dWh += dWh_temp
        db += db_temp

    # dh0 is the last hidden state gradient calculated.
    dh0 = dprev_h
    return dx, dh0, dWx, dWh, db
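
For reference, rnn_step_backward is not shown in the question. A minimal sketch of what it might compute for a tanh step, assuming the per-step cache holds (Wx, Wh, b, x, prev_h, next_h) in the same order as the unpacking above, is:

def rnn_step_backward(dnext_h, step_cache):
    """
    Backward pass for one tanh RNN step: next_h = tanh(x.dot(Wx) + prev_h.dot(Wh) + b).
    - dnext_h: upstream gradient of the next hidden state, shape (N, H)
    - step_cache: (Wx, Wh, b, x, prev_h, next_h) from the forward pass (assumed layout)
    """
    Wx, Wh, b, x, prev_h, next_h = step_cache

    # Backprop through tanh: d tanh(z)/dz = 1 - tanh(z)^2 = 1 - next_h^2.
    dz = dnext_h * (1.0 - next_h ** 2)   # (N, H)

    dx = dz.dot(Wx.T)                    # (N, D)
    dprev_h = dz.dot(Wh.T)               # (N, H)
    dWx = x.T.dot(dz)                    # (D, H)
    dWh = prev_h.T.dot(dz)               # (H, H)
    db = dz.sum(axis=0)                  # (H,)

    return dx, dprev_h, dWx, dWh, db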

1 Answer:

Answer 0 (score: 0):

There is one set of gradients for each time step, just as an MLP neural network has one set of gradients for each training example. It would not make sense to average an MLP's gradients across training examples, and in the same way the gradients here are not averaged across time steps. Think of each time step as a separate training example for the RNN.
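
Another way to see it: the total loss is a sum of per-time-step losses, and Wx, Wh and b are shared across all steps, so by linearity the gradient of the total loss is the sum of the per-step gradient contributions; averaging would only rescale everything by 1/T, which amounts to shrinking the learning rate. A minimal numerical sketch with a toy scalar weight (a simplification that ignores the recurrent dependence between hidden states) illustrates the summation:

import numpy as np

# Toy check: with a single shared weight w and a total loss that is a sum of
# per-time-step losses, the analytic per-step gradients summed together match
# the gradient of the total loss -- which is what the dWx/dWh/db accumulation
# in rnn_backward does. Here h_t depends only on x_t, not on h_{t-1}.

np.random.seed(0)
T = 4
x = np.random.randn(T)
w = 0.7

def step_loss(w, t):
    h = np.tanh(w * x[t])    # toy "hidden state" at step t
    return 0.5 * h ** 2      # toy per-step loss

def total_loss(w):
    return sum(step_loss(w, t) for t in range(T))

# Analytic per-step gradients: dL_t/dw = h_t * (1 - h_t^2) * x_t.
per_step = [np.tanh(w * x[t]) * (1 - np.tanh(w * x[t]) ** 2) * x[t] for t in range(T)]
summed = sum(per_step)

# Numerical gradient of the *total* loss via central differences.
eps = 1e-6
numeric = (total_loss(w + eps) - total_loss(w - eps)) / (2 * eps)

print(summed, numeric)  # the two agree up to numerical error; the mean would be smaller by a factor of T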