I am trying to understand the correct usage of L2 regularization in the MLP classifier and regressor. I am currently working from the following description: regularization
I am also looking at scikit-learn's implementation here.
The _backprop
method (docstring removed) is:
def _backprop(self, X, y, activations, deltas, coef_grads,
              intercept_grads):
    n_samples = X.shape[0]

    # Forward propagate
    activations = self._forward_pass(activations)

    # Get loss
    loss_func_name = self.loss
    if loss_func_name == 'log_loss' and self.out_activation_ == 'logistic':
        loss_func_name = 'binary_log_loss'
    loss = LOSS_FUNCTIONS[loss_func_name](y, activations[-1])
    # Add L2 regularization term to loss
    values = np.sum(
        np.array([np.dot(s.ravel(), s.ravel()) for s in self.coefs_]))
    loss += (0.5 * self.alpha) * values / n_samples

    # Backward propagate
    last = self.n_layers_ - 2

    # The calculation of delta[last] here works with following
    # combinations of output activation and loss function:
    # sigmoid and binary cross entropy, softmax and categorical cross
    # entropy, and identity with squared loss
    deltas[last] = activations[-1] - y

    # Compute gradient for the last layer
    coef_grads, intercept_grads = self._compute_loss_grad(
        last, n_samples, activations, deltas, coef_grads, intercept_grads)

    # Iterate over the hidden layers
    for i in range(self.n_layers_ - 2, 0, -1):
        deltas[i - 1] = safe_sparse_dot(deltas[i], self.coefs_[i].T)
        inplace_derivative = DERIVATIVES[self.activation]
        inplace_derivative(activations[i], deltas[i - 1])

        coef_grads, intercept_grads = self._compute_loss_grad(
            i - 1, n_samples, activations, deltas, coef_grads,
            intercept_grads)

    return loss, coef_grads, intercept_grads
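
For reference, the two lines that build values and then add them to loss implement the penalty 0.5 * alpha * sum_l ||W_l||^2 / n_samples. Here is a tiny standalone check with toy numbers (my own variable names, not scikit-learn code), just to show what quantity gets added to the loss:

import numpy as np

alpha = 0.0001       # scikit-learn's default L2 strength
n_samples = 200
coefs = [np.array([[0.5, -1.0], [2.0, 0.25]]),   # toy per-layer weight matrices
         np.array([[1.5], [-0.5]])]

# sum of squared weights over all layers, exactly like the 'values' line above
values = np.sum(np.array([np.dot(W.ravel(), W.ravel()) for W in coefs]))
penalty = (0.5 * alpha) * values / n_samples
print(penalty)       # ~1.95e-06 for these toy numbers; this is what gets added to loss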
_backprop returns loss as one of its three return values. However, within this method itself the loss is not used any further, even though the L2 regularization term has been added to it. Note the line deltas[last] = activations[-1] - y.
This is the initial gradient computation, and it works for both regression (with MSE loss) and classification (with cross-entropy loss). But shouldn't the regularization term be added here as well? Otherwise the regularization term is never used at all during back-propagation and the weight updates.
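
To make my expectation concrete: the derivative of the penalty 0.5 * alpha * ||W_l||^2 / n_samples with respect to W_l is alpha * W_l / n_samples, so that is the extra term I would expect to see added to each layer's weight gradient somewhere before the update. A minimal sketch of that expectation (my own helper name, not anything from the library):

import numpy as np

def expected_coef_grad(data_grad, W, alpha, n_samples):
    # data term plus the derivative of 0.5 * alpha * ||W||^2 / n_samples
    return data_grad + (alpha * W) / n_samples

W = np.array([[0.5, -1.0], [2.0, 0.25]])
data_grad = np.zeros_like(W)              # pretend the data gradient is zero
print(expected_coef_grad(data_grad, W, alpha=0.0001, n_samples=200))
# even with a zero data gradient, the penalty term alone pushes W toward zero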
_backprop is called from _fit_stochastic.
The relevant code:
batch_loss, coef_grads, intercept_grads = self._backprop(
    X[batch_slice], y[batch_slice], activations, deltas,
    coef_grads, intercept_grads)
accumulated_loss += batch_loss * (batch_slice.stop -
                                  batch_slice.start)

# update weights
grads = coef_grads + intercept_grads
self._optimizer.update_params(grads)
batch_loss
is used only to update accumulated_loss,
which tracks how much loss the model is accumulating. The weights are updated in self._optimizer.update_params(grads),
but the regularization term has no effect there.
If regularization were not used, it would make sense to take the final activation minus the expected result as the loss gradient. But since regularization is used, shouldn't the L2 regularization term that was added to the loss also be used in every weight update?
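
To illustrate what I mean by "used in every weight update": in a plain SGD step the L2 term only has an effect if its gradient is part of what the optimizer applies, in which case the weights shrink slightly on every step (weight decay). A hedged sketch in plain numpy (not the scikit-learn optimizer):

import numpy as np

def sgd_step(W, data_grad, lr, alpha=0.0, n_samples=1):
    # alpha > 0 adds the L2 penalty gradient alpha * W / n_samples to the update
    return W - lr * (data_grad + alpha * W / n_samples)

W = np.array([[0.5, -1.0], [2.0, 0.25]])
g = np.array([[0.1, 0.0], [0.0, -0.1]])

print(sgd_step(W, g, lr=0.1))                               # without regularization
print(sgd_step(W, g, lr=0.1, alpha=0.0001, n_samples=200))  # with L2: weights are pulled slightly toward zero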