In the following implementation of the backward pass for a vanilla RNN, the gradients of Wh, Wx, and b are computed by summing the gradients calculated at each time step. Intuitively, what does this accomplish, and why aren't they averaged instead?
import numpy as np

def rnn_backward(dh, cache):
    """
    Compute the backward pass for a vanilla RNN over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of all hidden states, of shape (N, T, H)
    - cache: List of per-time-step caches from the forward pass

    Returns a tuple of:
    - dx: Gradient of inputs, of shape (N, T, D)
    - dh0: Gradient of initial hidden state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
    - db: Gradient of biases, of shape (H,)
    """
    Wx, Wh, b, x, prev_h, next_h = cache[0]
    N, T, H = dh.shape
    D, H = Wx.shape

    # Initialise gradients.
    dx = np.zeros([N, T, D])
    dWx = np.zeros_like(Wx)
    dWh = np.zeros_like(Wh)
    db = np.zeros_like(b)
    dprev_h = np.zeros_like(prev_h)

    # Backprop through time: start at the last time step and work backwards.
    for t_step in reversed(range(T)):
        # Add the current time step's upstream gradient to the gradient
        # flowing back from the following time step.
        cur_dh = dprev_h + dh[:, t_step, :]
        # Calculate gradients at this time step.
        dx[:, t_step, :], dprev_h, dWx_temp, dWh_temp, db_temp = rnn_step_backward(cur_dh, cache[t_step])
        # Accumulate the weight and bias gradient contributions from each time step.
        dWx += dWx_temp
        dWh += dWh_temp
        db += db_temp

    # dh0 is the last hidden-state gradient calculated.
    dh0 = dprev_h
    return dx, dh0, dWx, dWh, db
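
The snippet relies on rnn_step_backward, which is not shown in the question. A minimal sketch of a single-step backward pass, assuming the forward step computed next_h = tanh(x @ Wx + prev_h @ Wh + b) and stored its cache as (Wx, Wh, b, x, prev_h, next_h) to match the unpacking in rnn_backward above, might look like this:

import numpy as np

def rnn_step_backward(dnext_h, cache):
    """Backward pass for one tanh RNN step (sketch; cache layout assumed
    to be (Wx, Wh, b, x, prev_h, next_h) as unpacked in rnn_backward)."""
    Wx, Wh, b, x, prev_h, next_h = cache
    # Backprop through tanh: d/dz tanh(z) = 1 - tanh(z)^2, and next_h = tanh(z).
    dz = dnext_h * (1.0 - next_h ** 2)   # (N, H)
    dx = dz @ Wx.T                       # (N, D)
    dprev_h = dz @ Wh.T                  # (N, H)
    dWx = x.T @ dz                       # (D, H)
    dWh = prev_h.T @ dz                  # (H, H)
    db = dz.sum(axis=0)                  # (H,)
    return dx, dprev_h, dWx, dWh, db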
Answer (score: 0)
Each time step contributes its own set of gradients, much as an MLP contributes a set of gradients for each training example; just as it would not make sense to average an MLP's gradients over its training examples, the gradients here are not averaged over time steps. Think of each time step as a separate training example for the RNN. Concretely, the same Wx, Wh, and b are reused at every time step, so by the chain rule the total gradient of the loss with respect to these shared parameters is the sum of the per-step contributions; averaging would just rescale the true gradient by a factor of 1/T.
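
To make that concrete, here is a small self-contained sketch (a toy scalar "RNN" with a single shared weight w, not the code from the question) that compares the summed per-step gradient against a numerical gradient: the sum matches, while dividing by the number of time steps would not.

import numpy as np

def loss(w, xs):
    # Tiny "RNN": h_t = tanh(w * x_t + h_{t-1}), loss = sum_t h_t.
    h, total = 0.0, 0.0
    for x in xs:
        h = np.tanh(w * x + h)
        total += h
    return total

def grad_w_by_summing(w, xs):
    # Forward pass, caching the hidden states.
    hs, h = [], 0.0
    for x in xs:
        h = np.tanh(w * x + h)
        hs.append(h)
    # Backward pass: accumulate dloss/dw as a SUM over time steps,
    # exactly as rnn_backward accumulates dWx, dWh and db.
    dw, dh_from_future = 0.0, 0.0
    for t in reversed(range(len(xs))):
        dh = 1.0 + dh_from_future       # direct loss term + path through h_{t+1}
        dz = dh * (1.0 - hs[t] ** 2)    # backprop through tanh
        dw += dz * xs[t]                # time step t's contribution to dw
        dh_from_future = dz             # gradient flowing back to h_{t-1}
    return dw

w, xs, eps = 0.7, [0.3, -1.2, 0.5], 1e-6
numeric = (loss(w + eps, xs) - loss(w - eps, xs)) / (2 * eps)
print("summed per-step gradient:", grad_w_by_summing(w, xs))
print("numerical gradient:      ", numeric)
# The two values agree; averaging (dividing the sum by T) would not match.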