GridSearchCV - XGBoost - Early Stopping

Date: 2017-03-24 07:15:03

Tags: python-3.x scikit-learn regression data-science xgboost

I am trying to run a hyperparameter search on XGBoost using scikit-learn's GridSearchCV. During the grid search I would like the training to stop early, since that drastically reduces the search time and (I expect) gives better results on my prediction/regression task. I am using XGBoost through its scikit-learn API.

    model = xgb.XGBRegressor()
    GridSearchCV(model, paramGrid, verbose=verbose, fit_params={'early_stopping_rounds': 42}, cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]), n_jobs=n_jobs, iid=iid).fit(trainX, trainY)

I tried to pass the early-stopping parameter via fit_params, but then it throws the error below, essentially because the validation set required for early stopping is missing:

/opt/anaconda/anaconda3/lib/python3.5/site-packages/xgboost/callback.py in callback(env=XGBoostCallbackEnv(model=<xgboost.core.Booster o...teration=4000, rank=0, evaluation_result_list=[]))
    187         else:
    188             assert env.cvfolds is not None
    189 
    190     def callback(env):
    191         """internal function"""
--> 192         score = env.evaluation_result_list[-1][1]
        score = undefined
        env.evaluation_result_list = []
    193         if len(state) == 0:
    194             init(env)
    195         best_score = state['best_score']
    196         best_iteration = state['best_iteration']

How can I apply GridSearchCV to XGBoost with early_stopping_rounds?

Note: the model works without the grid search, and GridSearchCV also works without fit_params={'early_stopping_rounds': 42}.

2 answers:

Answer 0 (score: 13)

When using early_stopping_rounds you also have to pass eval_metric and eval_set as input arguments to the fit method. Early stopping works by computing the error on an evaluation set: the error has to improve at least once every early_stopping_rounds rounds, otherwise the training of additional trees is stopped early.

For details, see the documentation of xgboost's fit method.
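The stopping rule itself can be sketched in plain Python (a hypothetical illustration of the idea, not xgboost's actual implementation): training stops as soon as the evaluation metric has failed to improve for early_stopping_rounds consecutive rounds.

```python
def early_stopping_index(eval_errors, early_stopping_rounds):
    """Return the round at which training would stop, given the per-round
    errors on the evaluation set (lower is better).
    Hypothetical sketch of the early-stopping rule, not xgboost's code."""
    best_score = float("inf")
    best_round = 0
    for round_no, score in enumerate(eval_errors):
        if score < best_score:
            best_score = score
            best_round = round_no
        elif round_no - best_round >= early_stopping_rounds:
            # no improvement for early_stopping_rounds rounds: stop here
            return round_no
    return len(eval_errors) - 1  # never triggered: all rounds were used

# errors improve, then plateau; with a patience of 2 rounds we stop at round 4
print(early_stopping_index([0.9, 0.7, 0.6, 0.65, 0.66, 0.67], 2))  # → 4
```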

Here is a minimal, complete working example:

    import xgboost as xgb
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import TimeSeriesSplit

    cv = 2

    trainX = [[1], [2], [3], [4], [5]]
    trainY = [1, 2, 3, 4, 5]

    # these are the evaluation sets
    testX = trainX
    testY = trainY

    paramGrid = {"subsample": [0.5, 0.8]}

    fit_params = {"early_stopping_rounds": 42,
                  "eval_metric": "mae",
                  "eval_set": [[testX, testY]]}

    model = xgb.XGBRegressor()
    gridsearch = GridSearchCV(model, paramGrid, verbose=1,
                              fit_params=fit_params,
                              cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]))
    gridsearch.fit(trainX, trainY)
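A side note on the cv argument in the example above: TimeSeriesSplit(n_splits=cv).get_n_splits(...) simply returns the configured number of splits as a plain integer, so GridSearchCV falls back to ordinary (unshuffled) k-fold splitting. To actually cross-validate on time-ordered folds, you can pass the splitter object itself (a sketch of the alternative, not part of the original answer):

```python
from sklearn.model_selection import TimeSeriesSplit

# get_n_splits just returns the configured number of splits as an int
n = TimeSeriesSplit(n_splits=2).get_n_splits()
print(n)  # → 2

# passing the splitter object itself would make GridSearchCV use
# time-ordered folds, e.g.:
# gridsearch = GridSearchCV(model, paramGrid, cv=TimeSeriesSplit(n_splits=2))
```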

Answer 1 (score: 4)

Updating @glao's answer in response to @Vasim's comment/question, as of sklearn 0.21.3 (note that fit_params has been moved out of the GridSearchCV instantiation and into the fit() method; also, the import specifically pulls the sklearn wrapper module from xgboost):

    import xgboost.sklearn as xgb
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import TimeSeriesSplit

    cv = 2

    trainX = [[1], [2], [3], [4], [5]]
    trainY = [1, 2, 3, 4, 5]

    # these are the evaluation sets
    testX = trainX
    testY = trainY

    paramGrid = {"subsample": [0.5, 0.8]}

    fit_params = {"early_stopping_rounds": 42,
                  "eval_metric": "mae",
                  "eval_set": [[testX, testY]]}

    model = xgb.XGBRegressor()

    gridsearch = GridSearchCV(model, paramGrid, verbose=1,
                              cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]))

    gridsearch.fit(trainX, trainY, **fit_params)
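After fitting, the search results can be inspected through GridSearchCV's standard attributes (best_params_, best_score_). A self-contained sketch using a stand-in estimator (Ridge instead of XGBRegressor, just so the snippet runs without xgboost installed; the same attributes are available on the grid searches above):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

trainX = [[1], [2], [3], [4], [5], [6]]
trainY = [1, 2, 3, 4, 5, 6]

# Ridge is only a stand-in here to keep the sketch self-contained;
# the attributes below work the same way with the xgboost examples above
gs = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0]}, cv=2)
gs.fit(trainX, trainY)

print(gs.best_params_)  # the winning hyperparameter combination
print(gs.best_score_)   # its mean cross-validated score
```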