Early stopping in sklearn's GradientBoostingRegressor

Asked: 2017-09-18 13:51:19

Tags: python-2.7 machine-learning scikit-learn

I am using a monitor class, implemented here:

# predict_stage is a private helper; in sklearn 0.18/0.19 it can be imported
# from this module (the path may differ in other versions).
from sklearn.ensemble.gradient_boosting import predict_stage

class Monitor():

    """Monitor for early stopping in Gradient Boosting for classification.

    The monitor checks the validation loss between each training stage. When
    too many successive stages have increased the loss, the monitor will return
    true, stopping the training early.

    Parameters
    ----------
    X_valid : array-like, shape = [n_samples, n_features]
      Training vectors, where n_samples is the number of samples
      and n_features is the number of features.
    y_valid : array-like, shape = [n_samples]
      Target values (integers in classification, real numbers in
      regression)
      For classification, labels must correspond to classes.
    max_consecutive_decreases : int, optional (default=5)
      Early stopping criteria: when the number of consecutive iterations that
      result in a worse performance on the validation set exceeds this value,
      the training stops.
    """

    def __init__(self, X_valid, y_valid, max_consecutive_decreases=5):
        self.X_valid = X_valid
        self.y_valid = y_valid
        self.max_consecutive_decreases = max_consecutive_decreases
        self.losses = []


    def __call__(self, i, clf, args):
        if i == 0:
            self.consecutive_decreases_ = 0
            self.predictions = clf._init_decision_function(self.X_valid)

        predict_stage(clf.estimators_, i, self.X_valid, clf.learning_rate,
                      self.predictions)
        self.losses.append(clf.loss_(self.y_valid, self.predictions))

        # Count consecutive stages where the validation loss got worse.
        if len(self.losses) >= 2 and self.losses[-1] > self.losses[-2]:
            self.consecutive_decreases_ += 1
        else:
            self.consecutive_decreases_ = 0

        if self.consecutive_decreases_ >= self.max_consecutive_decreases:
            print("Validation loss worsened for {} consecutive stages: "
                  "stopping early at iteration {}.".format(
                      self.consecutive_decreases_, i))
            return True
        else:
            return False

params = { 'n_estimators':             nEstimators,
           'max_depth':                maxDepth,
           'min_samples_split':        minSamplesSplit,
           'min_samples_leaf':         minSamplesLeaf,
           'min_weight_fraction_leaf': minWeightFractionLeaf,
           'min_impurity_decrease':    minImpurityDecrease,
           'learning_rate':            0.01,
           'loss':                    'quantile',
           'alpha':                    alpha,
           'verbose':                  0
           }
model = ensemble.GradientBoostingRegressor( **params )
model.fit( XTrain, yTrain, monitor = Monitor( XTest, yTest, 25 ) )

It works well. However, it is not clear to me which model this line


model.fit( XTrain, yTrain, monitor = Monitor( XTest, yTest, 25 ) )

returns:

1) No model at all

2) The model as trained at the moment of stopping

3) The model from 25 iterations earlier (note the monitor's argument)

If it is not (3), is it possible to make the estimator return (3)?

How can I do that?

It is worth mentioning that the xgboost library does this; however, it does not allow the loss function that I need.

1 Answer:

Answer 0 (score: 1):

The model returned is the fit as of the moment the "stopping rule" halted training, meaning your answer No. 2 is correct.

The problem with this "monitor code" is that the model you end up with includes the 25 extra (worse) iterations. The model selected should be your No. 3 answer.

I think the simple (and stupid) way to do that is to run the same model again (with a seed, so the results are identical) but keep only the iterations up to (i - max_consecutive_decreases).
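The retrain-to-the-best-stage idea above can be sketched with the public `staged_predict` API, which avoids the private `predict_stage` helper entirely. This is a minimal, self-contained sketch on synthetic data; the `pinball_loss` helper and all data here are illustrative assumptions, not from the original post:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def pinball_loss(y_true, y_pred, alpha):
    # Quantile (pinball) loss, matching sklearn's 'quantile' objective.
    diff = y_true - y_pred
    return np.mean(np.where(diff >= 0, alpha * diff, (alpha - 1.0) * diff))

# Synthetic data standing in for XTrain/XTest from the question.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X[:, 0] + 0.1 * rng.randn(200)
X_train, X_valid = X[:150], X[150:]
y_train, y_valid = y[:150], y[150:]

alpha = 0.5
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                  loss='quantile', alpha=alpha,
                                  random_state=0)
model.fit(X_train, y_train)

# Validation loss after each boosting stage, via the public staged_predict API.
val_losses = [pinball_loss(y_valid, pred, alpha)
              for pred in model.staged_predict(X_valid)]
best_iter = int(np.argmin(val_losses))  # 0-based index of the best stage

# Refit with the same seed, keeping only the stages up to the best one.
best_model = GradientBoostingRegressor(n_estimators=best_iter + 1,
                                       learning_rate=0.1, loss='quantile',
                                       alpha=alpha, random_state=0)
best_model.fit(X_train, y_train)
```

Since `predict` accumulates the per-stage trees stored in `model.estimators_`, truncating that array in place would avoid the second fit, but it relies on private internals and may break across versions; the double fit is the safer route.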