Why does ScikitLearn's MLP class not use regularization when updating the weights?

Time: 2019-10-16 03:03:33

Tags: python machine-learning scikit-learn neural-network

I am trying to understand how L2 regularization is properly applied in the MLP classifier and regressor. I am currently working through the following description: regularization

I am also looking at SciKit Learn's implementation here. The _backprop method is (comments removed):

def _backprop(self, X, y, activations, deltas, coef_grads,
              intercept_grads):
        n_samples = X.shape[0]

        # Forward propagate
        activations = self._forward_pass(activations)

        # Get loss
        loss_func_name = self.loss
        if loss_func_name == 'log_loss' and self.out_activation_ == 'logistic':
            loss_func_name = 'binary_log_loss'
        loss = LOSS_FUNCTIONS[loss_func_name](y, activations[-1])
        # Add L2 regularization term to loss
        values = np.sum(
            np.array([np.dot(s.ravel(), s.ravel()) for s in self.coefs_]))
        loss += (0.5 * self.alpha) * values / n_samples

        # Backward propagate
        last = self.n_layers_ - 2

        # The calculation of delta[last] here works with following
        # combinations of output activation and loss function:
        # sigmoid and binary cross entropy, softmax and categorical cross
        # entropy, and identity with squared loss
        deltas[last] = activations[-1] - y

        # Compute gradient for the last layer
        coef_grads, intercept_grads = self._compute_loss_grad(
            last, n_samples, activations, deltas, coef_grads, intercept_grads)

        # Iterate over the hidden layers
        for i in range(self.n_layers_ - 2, 0, -1):
            deltas[i - 1] = safe_sparse_dot(deltas[i], self.coefs_[i].T)
            inplace_derivative = DERIVATIVES[self.activation]
            inplace_derivative(activations[i], deltas[i - 1])

            coef_grads, intercept_grads = self._compute_loss_grad(
                i - 1, n_samples, activations, deltas, coef_grads,
                intercept_grads)

        return loss, coef_grads, intercept_grads

It returns the loss as one of its three return values. However, within the method itself the loss is never used again, even though the L2 regularization term has been added to it. Note the line deltas[last] = activations[-1] - y. This is the initial gradient computation, and it is valid both for regression (with MSE loss) and for classification (with cross-entropy loss). But shouldn't the regularization term be added here as well? Otherwise the regularization term is never used at all during backpropagation and the weight updates.
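To make my expectation concrete, here is a minimal NumPy sketch (my own illustration, not scikit-learn code) of the calculus I have in mind: if the loss is augmented with 0.5 * alpha * sum(||W||^2) / n_samples, then each layer's weight gradient should pick up a corresponding alpha * W / n_samples term.

import numpy as np

def l2_penalty_and_grads(coefs, alpha, n_samples):
    # Toy illustration (not scikit-learn code): the L2 penalty that gets
    # added to the loss, and the extra gradient term I would expect it to
    # contribute for each layer's weight matrix.
    penalty = 0.5 * alpha * sum(np.dot(W.ravel(), W.ravel()) for W in coefs) / n_samples
    penalty_grads = [alpha * W / n_samples for W in coefs]  # d(penalty)/dW per layer
    return penalty, penalty_grads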

The method above is called from _fit_stochastic.

The relevant code:

batch_loss, coef_grads, intercept_grads = self._backprop(
    X[batch_slice], y[batch_slice], activations, deltas,
    coef_grads, intercept_grads)
accumulated_loss += batch_loss * (batch_slice.stop -
                                  batch_slice.start)

# update weights
grads = coef_grads + intercept_grads
self._optimizer.update_params(grads)

batch_loss is only used to update accumulated_loss, which tracks how much loss the model is accumulating. The weights are updated in self._optimizer.update_params(grads), but the regularization term has no effect there.

If no regularization were used, taking the final activation minus the expected output as the loss gradient would make sense. But since regularization is used, shouldn't the L2 regularization term be added to the loss and factored into every weight update?
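In other words, I would expect each stochastic step to fold the L2 term into the gradient before moving the weights, along the lines of the hypothetical sketch below (lr, coefs and data_grads are illustrative names of my own; this is not the scikit-learn optimizer):

def sgd_step_with_l2(coefs, data_grads, alpha, lr, n_samples):
    # Hypothetical update step (illustration only): the L2 contribution
    # alpha * W / n_samples is added to each data gradient before the
    # weights are moved.
    return [W - lr * (g + alpha * W / n_samples)
            for W, g in zip(coefs, data_grads)]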

0 Answers:

There are no answers yet.