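I recently tried to implement a vanilla RNN from scratch. I implemented everything and even ran an example that looked fine! But I noticed that the gradient check does not succeed: only some parts (specifically the output weights and bias) pass the gradient check, while the other weights (Whh, Whx) fail it.
I followed karpathy's / Coursera's implementation and made sure everything was implemented. Yet karpathy's / Coursera's code passes the gradient check and mine does not. At the moment I have no idea what is causing this!
Here is the snippet from the original code that is responsible for the backward pass: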
def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):
    gradients['dWya'] += np.dot(dy, a.T)
    gradients['dby'] += dy
    da = np.dot(parameters['Wya'].T, dy) + gradients['da_next'] # backprop into h
    daraw = (1 - a * a) * da # backprop through tanh nonlinearity
    gradients['db'] += daraw
    gradients['dWax'] += np.dot(daraw, x.T)
    gradients['dWaa'] += np.dot(daraw, a_prev.T)
    gradients['da_next'] = np.dot(parameters['Waa'].T, daraw)
    return gradients
def rnn_backward(X, Y, parameters, cache):
    # Initialize gradients as an empty dictionary
    gradients = {}
    # Retrieve from cache and parameters
    (y_hat, a, x) = cache
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    # each one should be initialized to zeros of the same dimension as its corresponding parameter
    gradients['dWax'], gradients['dWaa'], gradients['dWya'] = np.zeros_like(Wax), np.zeros_like(Waa), np.zeros_like(Wya)
    gradients['db'], gradients['dby'] = np.zeros_like(b), np.zeros_like(by)
    gradients['da_next'] = np.zeros_like(a[0])
    ### START CODE HERE ###
    # Backpropagate through time
    for t in reversed(range(len(X))):
        dy = np.copy(y_hat[t])
        # subtract 1 from the predicted probability of the correct class Y[t],
        # i.e. dy = y_hat[t] - one_hot(Y[t])
        dy[Y[t]] -= 1
        gradients = rnn_step_backward(dy, gradients, parameters, x[t], a[t], a[t-1])
    ### END CODE HERE ###
    return gradients, a
Here is my implementation:
def rnn_cell_backward(self, xt, h, h_prev, output, true_label, dh_next):
    """
    Runs a single backward pass once.
    Inputs:
    - xt: The input data of shape (Batch_size, input_dim_size)
    - h: The next hidden state at timestep t (which comes from the forward pass)
    - h_prev: The previous hidden state at timestep t-1
    - output: The output at the current timestep
    - true_label: The label for the current timestep, used for calculating loss
    - dh_next: The gradient of hidden state h (dh) which in the beginning
        is zero and is updated as we go backward in the backpropagation.
        The dh for the next round would come from the 'dh_prev' as we will see shortly!
        Just remember the backward pass is essentially a loop! We start at the end
        and traverse back to the beginning!
    Returns:
    - dW1: The gradient for W1
    - dW2: The gradient for W2
    - dW3: The gradient for W3
    - dbh: The gradient for bh
    - dbo: The gradient for bo
    - dh_prev: The gradient for the previous hidden state at timestep t-1. This will be used
        as the next dh for the next round of backpropagation.
    - per_ts_loss: The loss for the current timestep.
    """
    e = np.copy(output)
    # correct idx for each row (sample)!
    idxs = np.argmax(true_label, axis=1)
    # number of rows (samples) in our batch
    rows = np.arange(e.shape[0])
    # This is the vectorized version of error_t = output_t - label_t or simply e = output[t] - 1
    # where t refers to the index in which label is 1.
    e[rows, idxs] -= 1
    # This is used for our loss to see how well we are doing during training.
    per_ts_loss = output[rows, idxs].sum()
    # must have shape of W3 which is (vocabsize_or_output_dim_size, hidden_state_size)
    dW3 = np.dot(e.T, h)
    # dbo = e.1, since we have a batch we use np.sum
    # e is a vector, when it is subtracted from label, the result will be added to dbo
    dbo = np.sum(e, axis=0)
    # when calculating the dh, we also add the dh from the next timestep as well
    # when we are in the last timestep, the dh_next is initially zero.
    dh = np.dot(e, self.W3) + dh_next  # from later cell
    # the input part
    dtanh = (1 - h * h) * dh
    # dbh = dtanh.1, we use sum, since we have a batch
    dbh = np.sum(dtanh, axis=0)
    # compute the gradient of the loss with respect to the input xt
    # this is actually not needed! we only care about tunable
    # parameters, so we are only after W1, W2, W3, dbh and dbo
    # dxt = np.dot(dtanh, W1.T)
    # must have the shape of (vocab_size, hidden_state_size)
    dW1 = np.dot(xt.T, dtanh)
    # compute the gradient with respect to W2
    dh_prev = np.dot(dtanh, self.W2)
    # shape must be (HiddenSize, HiddenSize)
    dW2 = np.dot(h_prev.T, dtanh)
    return dW1, dW2, dW3, dbh, dbo, dh_prev, per_ts_loss
def rnn_layer_backward(self, Xt, labels, H, O):
    """
    Runs a full backward pass on the given data and returns the gradients.
    Inputs:
    - Xt: The input data of shape (Batch_size, timesteps, input_dim_size)
    - labels: The labels for the input data
    - H: The hidden states for the current layer produced in the forward pass
        of shape (Batch_size, timesteps, HiddenStateSize)
    - O: The output for the current layer of shape (Batch_size, timesteps, outputsize)
    Returns:
    - dW1: The gradient for W1
    - dW2: The gradient for W2
    - dW3: The gradient for W3
    - dbh: The gradient for bh
    - dbo: The gradient for bo
    - dh: The gradient for the hidden state at timestep t
    - loss: The current loss
    """
    dW1 = np.zeros_like(self.W1)
    dW2 = np.zeros_like(self.W2)
    dW3 = np.zeros_like(self.W3)
    dbh = np.zeros_like(self.bh)
    dbo = np.zeros_like(self.bo)
    dh_next = np.zeros_like(H[:, 0, :])
    hprev = None
    _, T_x, _ = Xt.shape
    loss = 0
    for t in reversed(range(T_x)):
        # this if-else block can be removed! and for hprev, we can simply
        # use H[:, t - 1, :] instead, but I also add this in case it makes
        # a difference! so far I have not seen any difference though!
        if t > 0:
            hprev = H[:, t - 1, :]
        else:
            hprev = np.zeros_like(H[:, 0, :])
        dw_1, dw_2, dw_3, db_h, db_o, dh_prev, e = self.rnn_cell_backward(Xt[:, t, :],
                                                                          H[:, t, :],
                                                                          hprev,
                                                                          O[:, t, :],
                                                                          labels[:, t, :],
                                                                          dh_next)
        dh_next = dh_prev
        dW1 += dw_1
        dW2 += dw_2
        dW3 += dw_3
        dbh += db_h
        dbo += db_o
        # Update the loss by subtracting the cross-entropy term of this time-step from it.
        loss -= np.log(e)
    return dW1, dW2, dW3, dbh, dbo, dh_next, loss
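The error term at the top of rnn_cell_backward can be checked in isolation. The following is only a minimal sketch, assuming that output holds softmax probabilities and that true_label is one-hot (neither is stated explicitly above); the sizes and the random data are made up for illustration. It shows that the fancy-indexing form is the same thing as output - true_label:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, vocab = 4, 6  # illustrative sizes, not taken from the question

# Assumption: `output` is a row-wise softmax over some logits.
logits = rng.standard_normal((batch, vocab))
output = np.exp(logits - logits.max(axis=1, keepdims=True))
output /= output.sum(axis=1, keepdims=True)

# Assumption: `true_label` is one-hot, one row per sample.
true_label = np.eye(vocab)[rng.integers(0, vocab, size=batch)]

# Same three steps as in rnn_cell_backward.
e = np.copy(output)
idxs = np.argmax(true_label, axis=1)   # index of the correct class per row
rows = np.arange(e.shape[0])
e[rows, idxs] -= 1                     # subtract 1 at the correct class only

# For one-hot labels this equals the plain elementwise difference.
same = np.allclose(e, output - true_label)
print(same)
```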
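I have commented everything and provided a minimal example that demonstrates this:
My code: (does not pass the gradient check)
And here is the implementation I used as my guide. It comes from karpathy / Coursera and passes all the gradient checks!: original code
At this point I am completely clueless, because I don't know why this isn't working! I am new to Python, so that may be why I can't find the problem!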
Answer 0 (score: 0)
Two months later, I think I found the culprit! I should have changed the following line:
# compute the gradient with respect to W2
dh_prev = np.dot(dtanh, self.W2)
to
# compute the gradient with respect to W2
# note the transpose here!
dh_prev = np.dot(dtanh, self.W2.T)
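When I originally wrote the backward pass, I only paid attention to the dimensions, and that led me to this mistake. Because W2 is square, with shape (HiddenSize, HiddenSize), both versions produce an output of the same shape, so the shapes alone could not reveal the error. It is really an example of mixing up features, which can happen when you reshape or transpose mindlessly (or fail to do so when you should!).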
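To make the fix concrete, here is a small numerical check. It is a sketch under stated assumptions, not the code from the question: I assume the forward step is h = tanh(xt @ W1 + h_prev @ W2 + bh), which is what the shapes in the backward pass suggest, and I use a toy loss (the sum of h) so that dh is all ones. All sizes and variable names are local to this example.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, input_dim, hidden = 4, 3, 5  # illustrative sizes

xt = rng.standard_normal((batch, input_dim))
h_prev = rng.standard_normal((batch, hidden))
W1 = rng.standard_normal((input_dim, hidden)) * 0.5
W2 = rng.standard_normal((hidden, hidden)) * 0.5   # square, so W2 and W2.T have the same shape
bh = np.zeros(hidden)

def forward(h_prev_in):
    # Assumed forward step, inferred from the shapes used in the backward pass.
    return np.tanh(xt @ W1 + h_prev_in @ W2 + bh)

def loss(h_prev_in):
    # Toy scalar loss: sum of the hidden state, so dL/dh is all ones.
    return forward(h_prev_in).sum()

# Analytic gradients with respect to h_prev, with and without the transpose.
h = forward(h_prev)
dtanh = (1 - h * h) * np.ones_like(h)
dh_prev_wrong = dtanh @ W2      # the original line
dh_prev_right = dtanh @ W2.T    # the corrected line

# Numerical gradient by central differences.
eps = 1e-5
numeric = np.zeros_like(h_prev)
for idx in np.ndindex(*h_prev.shape):
    old = h_prev[idx]
    h_prev[idx] = old + eps
    loss_plus = loss(h_prev)
    h_prev[idx] = old - eps
    loss_minus = loss(h_prev)
    h_prev[idx] = old
    numeric[idx] = (loss_plus - loss_minus) / (2 * eps)

def rel_error(a, b):
    return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b))

err_right = rel_error(numeric, dh_prev_right)
err_wrong = rel_error(numeric, dh_prev_wrong)
print("with transpose:   ", err_right)
print("without transpose:", err_wrong)
```

Both analytic results have shape (batch, hidden), so a shape assertion passes either way; only the comparison against the numerical gradient tells the two apart. A wrong dh_prev also explains why dW1, dW2 and dbh failed while dW3 and dbo passed: dh_prev only reaches the parameters through dh at earlier timesteps, and dW3 and dbo do not depend on dh at all.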
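To see what went wrong here, let me give an example.
Suppose we have a matrix of features describing people, and we dedicate one row to each person, so the matrix looks like this: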
Features | Age | height(cm) | weight(kg) |
matrix = | 20 | 185 | 75 |
| 85 | 155 | 95 |
| 40 | 205 | 120 |
Now, if we turn it into a NumPy array, we have the following:
m = np.array([[20, 185, 75],
[85, 155, 95],
[40, 205, 120]])
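A simple 3x3 array, right?
Now, how we interpret the matrix matters a great deal: every row and every column has a specific meaning. Each person is described by one row, and each column is one specific feature.
So you can see that the matrix has a "structure" that represents our data.
In other words, each data item is represented as a row, and each column specifies a feature. You have to respect these semantics when multiplying by another matrix; that is, when two matrices are multiplied, the rows and columns being combined must line up according to what they mean.
Let's take an example to make this clearer:
Suppose we have two matrices: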
m1 = np.array([[20, 185, 75],
[85, 155, 95],
[40, 205, 120]])
m2 = np.array([[0.9, 0.8, 0.85],
[0.1, 0.5, 0.4],
[0.6, 0.9, 0.8]])
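Both matrices contain data arranged by rows, so multiplying them gives the correct answer. However, changing the order of the data, for example with a transpose, destroys the semantics, and we end up multiplying unrelated data!
In my case, I needed to transpose the second matrix to get the order right for the operation at hand! And that fixed the gradient check!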
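As a quick illustration with the m1 and m2 from above (just a sketch; the variable names `product` and `product_t` are mine), multiplying by m2 and by m2.T both give a 3x3 result, but the numbers are different because different entries get paired up:

```python
import numpy as np

m1 = np.array([[20, 185, 75],
               [85, 155, 95],
               [40, 205, 120]])
m2 = np.array([[0.9, 0.8, 0.85],
               [0.1, 0.5, 0.4],
               [0.6, 0.9, 0.8]])

product = m1 @ m2      # rows of m1 combined with columns of m2
product_t = m1 @ m2.T  # rows of m1 combined with rows of m2

# Same shape, so a dimension check cannot tell them apart...
print(product.shape, product_t.shape)
# ...but the values differ.
print(np.allclose(product, product_t))
```

Which of the two is the right one depends entirely on what the rows and columns mean for the operation at hand, which is exactly what I overlooked in the backward pass.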