I am trying to do hyperparameter tuning with xgboost installed via pip install xgboost directly on a SageMaker notebook instance. Below is the code I use for the hyperparameter tuning:
from datetime import datetime

from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def train_xgb(X_train, y_train):
    """
    Train an XGBoost classifier, optimizing hyperparameters via grid search.
    :param X_train: training features
    :param y_train: training labels (0 for no pickup, 1 for pickup)
    :return: the best estimator found by the grid search
    """
    # grid search to find the best hyperparameters
    model = XGBClassifier()
    learning_rate = [0.01]    # [0.001, 0.01, 0.1]
    max_depth = [15, 30, 45]
    n_estimators = [150, 200, 300]
    reg_alpha = [0.01]        # [0.01, 0.1, 0.5]
    reg_lambda = [5]          # [0.1, 0.5, 5]
    min_child_weight = [0.5]  # [0.01, 0.1, 0.5]
    subsample = [0.5]         # [0.5, 0.7, 0.9]
    colsample_bytree = [0.7]  # [0.7, 0.5, 0.9]
    scale_pos_weight = [2]
    param = dict(learning_rate=learning_rate, n_estimators=n_estimators, max_depth=max_depth,
                 reg_alpha=reg_alpha, reg_lambda=reg_lambda, min_child_weight=min_child_weight,
                 subsample=subsample, colsample_bytree=colsample_bytree,
                 scale_pos_weight=scale_pos_weight)
    grid_search = GridSearchCV(model, param_grid=param, scoring='roc_auc', cv=StratifiedKFold(10))
    print('Time before grid search: ', datetime.now())
    result = grid_search.fit(X_train, y_train)
    print('Time after grid search: ', datetime.now())
    # summarize results
    print("Best %f using %s" % (result.best_score_, result.best_params_))
    # for the initial scan:
    # model = XGBClassifier(colsample_bytree=0.9, learning_rate=0.01, max_depth=15, min_child_weight=0.5,
    #                       n_estimators=50, reg_alpha=0.01, reg_lambda=5, subsample=0.7, scale_pos_weight=2)
    best_xgb = result.best_estimator_
    # note: GridSearchCV already refits the best estimator on the full training set
    # by default (refit=True), so this is one extra full fit
    best_xgb.fit(X_train, y_train)
    return best_xgb
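For reference, only max_depth and n_estimators vary (3 values each), so the grid expands to 9 candidate settings, and with StratifiedKFold(10) that means 90 full training runs. A quick sketch to verify the count, assuming the same param dict as above:

from sklearn.model_selection import ParameterGrid

param = dict(learning_rate=[0.01], n_estimators=[150, 200, 300], max_depth=[15, 30, 45],
             reg_alpha=[0.01], reg_lambda=[5], min_child_weight=[0.5],
             subsample=[0.5], colsample_bytree=[0.7], scale_pos_weight=[2])
n_combinations = len(ParameterGrid(param))  # 9 candidate settings
n_folds = 10
print(n_combinations * n_folds)             # 90 cross-validation fits in total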
My training data has about 2 million rows and 160 columns; the CSV file is about 0.5M KB (roughly 0.5 GB, not large at all). The engineered dataframe is about 1.5 GB. The notebook instance I chose is ml.c5.18xlarge. However, the code runs for about 2 hours at the grid_search.fit(X_train, y_train) step (I only gave two of the parameters 3 values each, so there are just 9 combinations in total), and then stops running without any warning or error output.
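One detail that may be relevant: my call uses GridSearchCV's defaults of n_jobs=None (candidates run sequentially in a single process) and verbose=0 (no progress output), which is why nothing is printed while it runs. A minimal sketch of the same call with progress logging and parallelism enabled (verbose and n_jobs are standard GridSearchCV parameters; model and param are as defined above):

grid_search = GridSearchCV(
    model,
    param_grid=param,
    scoring='roc_auc',
    cv=StratifiedKFold(10),
    n_jobs=4,   # run candidate fits in parallel processes (kept modest so it doesn't fight xgboost's own threads)
    verbose=2,  # print each candidate/fold as it completes, with timing
)
result = grid_search.fit(X_train, y_train)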
This seems to be a rare situation, since few of the people I have asked have run into it, but the lack of output makes it hard to debug. I would like to know: 1) is it normal for it to take this long to run? 2) any ideas on why this happens and how to debug it? Many thanks in advance!