I am trying to build an L-layer neural network for multi-class classification, with softmax activation in the output layer and sigmoid activation in the other layers.
The function used for training is as follows:
import numpy as np
import matplotlib.pyplot as plt

# sigmoid, softmax, initialize_parameters_deep, compute_multiclass_loss and
# NUMBER_OF_CLASSES are defined elsewhere in my code.

def L_layer_model(X, Y, layers_dims, learning_rate=0.01, num_iterations=5000, print_cost=True):
    """
    Implements an L-layer neural network: [LINEAR->SIGMOID]*(L-1)->LINEAR->SOFTMAX.

    Arguments:
    X -- data, numpy array of shape (number of features, number of examples)
    Y -- true "label" vector of shape (number of classes, number of examples)
    layers_dims -- list containing the input size and each layer size, of length (number of layers + 1)
    learning_rate -- learning rate of the gradient descent update rule
    num_iterations -- number of iterations of the optimization loop
    print_cost -- if True, it prints the cost every 100 steps

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    np.random.seed(1)
    costs = []                                   # keep track of cost

    # Parameters initialization.
    parameters = initialize_parameters_deep(layers_dims)
    L = len(parameters) // 2                     # number of layers in the neural network
    forward_calculated = {}
    m = Y.shape[1]

    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation: [LINEAR -> SIGMOID]*(L-1) -> LINEAR -> SOFTMAX.
        A = X
        forward_calculated["A0"] = X
        for l in range(1, L + 1):
            A_prev = A
            W = parameters['W' + str(l)]
            b = parameters['b' + str(l)]
            Z = np.dot(W, A_prev) + b
            assert Z.shape == (W.shape[0], A.shape[1])
            forward_calculated["Z" + str(l)] = Z        # store for back propagation
            if l != L:                                  # hidden layers
                A = sigmoid(Z)
            else:                                       # output layer
                A = softmax(Z)
            forward_calculated["A" + str(l)] = A        # store for back propagation

        assert forward_calculated["A" + str(L)].shape == (NUMBER_OF_CLASSES, X.shape[1])

        # Compute cost.
        Y_hat = forward_calculated["A" + str(L)]
        cost = compute_multiclass_loss(Y, Y_hat)

        # Back propagation.
        grads = {}
        grads['dZ' + str(L)] = forward_calculated["A" + str(L)] - Y
        grads['dW' + str(L)] = (1. / m) * np.dot(grads['dZ' + str(L)], forward_calculated["A" + str(L - 1)].T)
        grads['db' + str(L)] = (1. / m) * np.sum(grads['dZ' + str(L)], axis=1, keepdims=True)
        for l in range(L - 1, 0, -1):
            grads['dA' + str(l)] = np.dot(parameters["W" + str(l + 1)].T, grads['dZ' + str(l + 1)])
            grads['dZ' + str(l)] = grads['dA' + str(l)] * sigmoid(forward_calculated["Z" + str(l)]) * (1 - sigmoid(forward_calculated["Z" + str(l)]))
            grads['dW' + str(l)] = (1. / m) * np.dot(grads['dZ' + str(l)], forward_calculated["A" + str(l - 1)].T)
            grads['db' + str(l)] = (1. / m) * np.sum(grads['dZ' + str(l)], axis=1, keepdims=True)

        # Update parameters.
        for l in range(1, L + 1):
            parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
            parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]

        # Print and record the cost every 100 iterations.
        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))
            costs.append(cost)

    print(costs)

    # plot the cost
    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations (per hundreds)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return parameters
When I have only one hidden layer, the code works fine and the model gradually converges. However, with more than one hidden layer the model does not seem to converge; it predicts the same class for every example. Is there any mistake in my back-propagation formulas? The cost function I am using is the log loss:
def compute_multiclass_loss(Y, Y_hat):    # Y -> actual, Y_hat -> predicted
    L_sum = np.sum(np.multiply(Y, np.log(Y_hat)))
    m = Y.shape[1]
    L = -(1 / m) * L_sum
    L = np.squeeze(L)    # to make sure the cost's shape is what we expect (e.g. this turns [[17]] into 17)
    assert L.shape == ()
    return L
Answer (score: 1)
The code itself looks fine; however, this is a conceptual problem known as the vanishing gradient.
When you use a deep network, the closer you get to the input layers during back propagation, the more sigmoid-derivative factors get multiplied together in the gradient computation.
The maximum value of the sigmoid derivative is 0.25, and it is usually not even that large; for saturated units it can be around 0.001 or less. As these small factors accumulate, the gradient shrinks drastically.
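To make this concrete, here is a minimal sketch (plain NumPy, separate from your code) of how a product of sigmoid-derivative factors shrinks as it passes through more layers, even in the best case where every factor sits at its maximum of 0.25:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

print(sigmoid_derivative(0.0))   # 0.25, the maximum, reached only at z = 0
print(sigmoid_derivative(4.0))   # ~0.018, a saturated unit

# Back-propagating through k sigmoid layers multiplies roughly k such factors together.
for k in (1, 3, 5, 10):
    print(k, 0.25 ** k, sigmoid_derivative(4.0) ** k)

Even the optimistic 0.25 ** k column collapses quickly (around 1e-6 at k = 10), which matches the behaviour you see once you add more hidden layers.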
ReLU solves this to some extent. Its derivative is either 0 or 1, so if the gradient still vanishes it comes from the weights rather than the activation.
So, use ReLU instead of sigmoid in the hidden layers.
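As a rough sketch of what that change could look like against the code you posted (relu and relu_derivative are new helpers, not something from your question), only the hidden-layer activation and its derivative in the backward pass need to change:

import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def relu_derivative(Z):
    return (Z > 0).astype(float)    # 1 where Z > 0, 0 elsewhere

# Forward pass, hidden layers only (the output layer keeps softmax):
#     if l != L:
#         A = relu(Z)               # was: A = sigmoid(Z)
#     else:
#         A = softmax(Z)
#
# Backward pass, hidden layers:
#     grads['dZ' + str(l)] = grads['dA' + str(l)] * relu_derivative(forward_calculated["Z" + str(l)])

If you do switch to ReLU, it may also be worth revisiting the weight initialization (e.g. He initialization, scaling by sqrt(2 / n_prev)), since an initialization tuned for sigmoid can leave ReLU units stuck at zero; that part is outside the code you posted.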
This chapter of Michael Nielsen's book explains it in depth with the calculus:
http://neuralnetworksanddeeplearning.com/chap5.html#the_vanishing_gradient_problem