I read previous posts about LightGBM hanging with GridSearchCV() and corrected my code accordingly, but it still seems to hang, for more than 3 hours now!
I have 8 GB of RAM, and the data has 29,802 rows and 13 columns. Most of the columns are categorical values that have already been label-encoded to numbers.
Please see the code below. Looking forward to your valuable suggestions!
Initially, I got an AUC of 89% using lgb.train().
But after switching to LGBMClassifier() I got nowhere, so I turned to GridSearchCV().
I need LGBMClassifier() because I want score() and the other convenient wrappers that are not available when using lgb.train().
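For example, this is the kind of wrapper convenience I am after (a rough sketch; X_test and y_test stand for a hypothetical held-out split, not variables defined below):

import lightgbm as lgb

clf = lgb.LGBMClassifier(objective='binary')
clf.fit(X, y)
print(clf.score(X_test, y_test))  # mean accuracy via the sklearn API; lgb.train() has no such one-liner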
I have now commented out most of the parameter settings, but the grid search still doesn't seem to finish :(
X and y are my full training dataset:
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

params = {'boosting_type': 'gbdt',
          'max_depth': 15,
          'objective': 'binary',
          # 'nthread': 1,  # updated from nthread
          'num_leaves': 30,
          'learning_rate': 0.001,
          # 'max_bin': 512,
          # 'subsample_for_bin': 200,
          'subsample': 0.8,
          'subsample_freq': 500,
          # 'colsample_bytree': 0.8,
          # 'reg_alpha': 5,
          # 'reg_lambda': 10,
          # 'min_split_gain': 0.5,
          # 'min_child_weight': 1,
          # 'min_child_samples': 5,
          # 'scale_pos_weight': 1,
          # 'num_class': 1,
          'metric': 'roc_auc',
          'early_stopping': 10,
          'n_jobs': 1,
          }
gridParams = {
    'learning_rate': [0.001, 0.01],
    'n_estimators': [1000],
    'num_leaves': [12, 30, 80],
    'boosting_type': ['gbdt'],
    'objective': ['binary'],
    'random_state': [1],  # updated from 'seed'
    'colsample_bytree': [0.8, 1],
    'subsample': [0.5, 0.7, 0.75],
    'reg_alpha': [0.1, 1.2],
    'reg_lambda': [0.1, 1.2],
    'subsample_freq': [500, 1000],
    'max_depth': [15, 30, 80],
}
mdl = lgb.LGBMClassifier(**params)
grid = GridSearchCV(mdl, gridParams, return_train_score=True,
                    verbose=1,
                    cv=4,
                    n_jobs=1,  # only '1' will work
                    scoring='roc_auc')
grid.fit(X=X, y=y, eval_set=[[X, y]], early_stopping_rounds=10)  # never-ending fit
Output:
Fitting 4 folds for each of 864 candidates, totalling 3456 fits
[1] valid_0's binary_logloss: 0.686044
Training until validation scores don't improve for 10 rounds.
[2] valid_0's binary_logloss: 0.685749
[3] valid_0's binary_logloss: 0.685433
[4] valid_0's binary_logloss: 0.685134
[5] valid_0's binary_logloss: 0.684831
[6] valid_0's binary_logloss: 0.684517
[7] valid_0's binary_logloss: 0.684218
[8] valid_0's binary_logloss: 0.683904
[9] valid_0's binary_logloss: 0.683608
[10] valid_0's binary_logloss: 0.683308
[11] valid_0's binary_logloss: 0.683009
[12] valid_0's binary_logloss: 0.68271
[13] valid_0's binary_logloss: 0.682416
[14] valid_0's binary_logloss: 0.682123
[15] valid_0's binary_logloss: 0.681814
[16] valid_0's binary_logloss: 0.681522
[17] valid_0's binary_logloss: 0.681217
[18] valid_0's binary_logloss: 0.680922
[19] valid_0's binary_logloss: 0.680628
[20] valid_0's binary_logloss: 0.680322
[21] valid_0's binary_logloss: 0.680029
[22] valid_0's binary_logloss: 0.679736
[23] valid_0's binary_logloss: 0.679443
[24] valid_0's binary_logloss: 0.679151
[25] valid_0's binary_logloss: 0.678848
[26] valid_0's binary_logloss: 0.678546
[27] valid_0's binary_logloss: 0.678262
[28] valid_0's binary_logloss: 0.677974
[29] valid_0's binary_logloss: 0.677675
[30] valid_0's binary_logloss: 0.677393
[31] valid_0's binary_logloss: 0.677093
...
[997] valid_0's binary_logloss: 0.537612
[998] valid_0's binary_logloss: 0.537544
[999] valid_0's binary_logloss: 0.537481
[1000] valid_0's binary_logloss: 0.53741
Did not meet early stopping. Best iteration is:
[1000] valid_0's binary_logloss: 0.53741
... and it goes on and on ...
Please help!
Regards, 谢林
Answer 0 (score: 0)
Your problem is different from the one in the posts you referenced. You are training a lot (3456) of classifiers, each made of many (1000) very deep trees (12..80 leaves, max_depth 15..80), so training takes a very long time. The solutions are: make the trees more modest (the most practical way is to fix max_depth at -1 and vary only the number of leaves in the grid search; given your dataset size, perhaps between 10 and 40 leaves?); reduce the number of grid points (864 points is a lot: multiplying out your gridParams lists gives 2 x 3 x 2 x 3 x 2 x 2 x 2 x 3 = 864 candidates, times cv=4 folds = 3456 fits); or reduce the number of trees (= iterations) per model, either by cutting n_estimators from 1000 (a value apparently picked at random) down to something like 100, or by making early stopping actually work.
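A minimal sketch of what a more modest search could look like (the concrete values below are illustrative assumptions, not tuned recommendations):

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

# Depth is left unlimited (-1); tree complexity is controlled by num_leaves alone.
# 2 x 3 x 2 = 12 candidates instead of 864.
modest_grid = {
    'learning_rate': [0.01, 0.1],
    'num_leaves': [10, 20, 40],
    'colsample_bytree': [0.8, 1.0],
}
mdl = lgb.LGBMClassifier(
    boosting_type='gbdt',
    objective='binary',
    max_depth=-1,
    n_estimators=100,  # far fewer trees than 1000
    random_state=1,
    n_jobs=1,
)
grid = GridSearchCV(mdl, modest_grid, cv=4, scoring='roc_auc', verbose=1)
grid.fit(X, y)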
One more obvious problem: there is no point in using the training data (X, y) as the early-stopping evaluation set (eval_set=[[X,y]], early_stopping_rounds=10). On the data the model is trained on, the objective just keeps improving, so early stopping never triggers and every candidate runs for the maximum number of iterations (1000 trees in your case).