Here is the forward pass (the variable names are the initials of the parameters):
hf = sigmoid(X @ Wf + bf)   # forget gate
hi = sigmoid(X @ Wi + bi)   # input gate
ho = sigmoid(X @ Wo + bo)   # output gate
hc = tanh(X @ Wc + bc)      # candidate cell state
c = hf * c_old + hi * hc    # new cell state
h = ho * tanh(c)            # new hidden state
y = h @ Wy + by
prob = softmax(y)
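To make the shapes concrete, here is a minimal runnable sketch of that forward pass. The dimensions, the random initialization, and the inline softmax are my assumptions, not from the original code:

```python
import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

rng = np.random.default_rng(0)
D, H, V = 4, 3, 5                     # input, hidden, output sizes (assumed)
X = rng.standard_normal((1, D))       # one input row vector (assumed shape)
Wf, Wi, Wo, Wc = (rng.standard_normal((D, H)) for _ in range(4))
bf = bi = bo = bc = np.zeros(H)
Wy = rng.standard_normal((H, V))
by = np.zeros(V)
c_old = np.zeros((1, H))              # previous cell state

hf = sigmoid(X @ Wf + bf)             # forget gate
hi = sigmoid(X @ Wi + bi)             # input gate
ho = sigmoid(X @ Wo + bo)             # output gate
hc = np.tanh(X @ Wc + bc)             # candidate cell state
c = hf * c_old + hi * hc              # new cell state
h = ho * np.tanh(c)                   # new hidden state
y = h @ Wy + by
prob = np.exp(y - y.max())            # numerically stable softmax
prob /= prob.sum()

print(prob.shape)                     # (1, 5)
```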
And here is where I get stuck in the backward pass through the LSTM:
# Softmax loss gradient
dy = prob.copy()
dy[1, y_train] -= 1.
# Hidden to output gradient
dWy = h.T @ dy
dby = dy
# Note we're adding dh_next here
dh = dy @ Wy.T + dh_next
# Gradient for ho in h = ho * tanh(c)
dho = tanh(c) * dh
dho = dsigmoid(ho) * dho  # <-- problem here
I understand `dho = tanh(c) * dh`, but I don't see why `dho = dsigmoid(ho) * dho` follows.
Could some superhero explain how that step is derived?
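For context, the step in question is just the chain rule through the sigmoid, assuming `dsigmoid` is defined (as in many tutorials) on the sigmoid's cached *output*: since `ho = sigmoid(z)` with `z = X @ Wo + bo`, the derivative `d sigmoid(z)/dz = ho * (1 - ho) = dsigmoid(ho)`. A small numerical check of that assumed definition:

```python
import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

def dsigmoid(s):
    # Assumed definition: derivative of sigmoid expressed in terms of
    # its OUTPUT s = sigmoid(z), i.e. d sigmoid(z)/dz = s * (1 - s)
    return s * (1. - s)

z = 0.7                       # an arbitrary pre-activation value
ho = sigmoid(z)               # the value cached in the forward pass

# Central-difference numerical derivative of sigmoid at z
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

# dsigmoid applied to the cached output matches the true derivative
print(abs(dsigmoid(ho) - numeric) < 1e-9)   # True
```

So `dho = dsigmoid(ho) * dho` converts the gradient with respect to `ho` into the gradient with respect to the pre-activation `X @ Wo + bo`, which is what the later `dWo`/`dbo` updates need.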