AWS SageMaker stops running without any error/warning output

Time: 2019-08-16 21:36:06

Tags: python xgboost amazon-sagemaker hyperparameters

I am trying to do hyperparameter tuning directly in a SageMaker notebook instance, using xgboost installed with pip install xgboost. Here is my hyperparameter tuning code:

from datetime import datetime

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier


def train_xgb(X_train, y_train):
    """
    train an xgboost model, optimizing hyperparameters via grid search
    :param X_train: training features
    :param y_train: training labels (0 for nopickup, 1 for pickup)
    :return: the best model found by the grid search, refit on the full training set
    """
    # grid search to find the best hyperparameter

    model = XGBClassifier()
    learning_rate = [0.01]  # other candidates: [0.001, 0.01, 0.1]
    max_depth = [15, 30, 45]
    n_estimators = [150, 200, 300]
    reg_alpha = [0.01]  # other candidates: [0.01, 0.1, 0.5]
    reg_lambda = [5]  # other candidates: [0.1, 0.5, 5]
    min_child_weight = [0.5]  # other candidates: [0.01, 0.1, 0.5]
    subsample = [0.5]  # other candidates: [0.5, 0.7, 0.9]
    colsample_bytree = [0.7]  # other candidates: [0.7, 0.5, 0.9]
    scale_pos_weight = [2]
    param = dict(learning_rate=learning_rate, n_estimators=n_estimators, max_depth=max_depth, reg_alpha=reg_alpha,
                      reg_lambda=reg_lambda, min_child_weight=min_child_weight, subsample=subsample,
                      colsample_bytree=colsample_bytree, scale_pos_weight=scale_pos_weight)

    grid_search = GridSearchCV(model, param_grid = param, scoring='roc_auc', cv=StratifiedKFold(10))
    print('Time before grid search: ', datetime.now())
    result = grid_search.fit(X_train, y_train)
    print('Time after grid search: ', datetime.now())
    # summarize results
    print("Best %f using %s" % (result.best_score_, result.best_params_))

    # for initial scan
    # model = XGBClassifier(colsample_bytree=0.9, learning_rate=0.01, max_depth=15, min_child_weight=0.5, n_estimators=50,
    # reg_alpha=0.01, reg_lambda=5, subsample=0.7, scale_pos_weight=2)

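    # GridSearchCV (refit=True by default) has already refit the best estimator on all of
    # X_train, so the extra fit below repeats that work.
    best_xgb = result.best_estimator_
    best_xgb.fit(X_train, y_train)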

    return best_xgb

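My training data has about 2 million rows and 160 columns; the csv file is about 0.5M KB (roughly 500 MB, not large at all). The engineered dataframe is 1.5 GB. The notebook instance I chose is ml.c5.18xlarge. However, the code runs at the grid_search.fit(X_train, y_train) step for about 2 hours (I vary 2 parameters with 3 values each, so there are only 9 combinations in total) and then stops running without any warning/error output.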
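To put the "only 9 combinations" figure in context, the amount of work can be counted from the grid itself. The sketch below uses only the values from the code above and the standard library. The helper name grid_cost is made up for illustration, and the tree count assumes one tree per boosting round, which is the case for binary classification in xgboost.

```python
from itertools import product

# Grid copied from the question. Single-valued parameters are left out
# because they do not change the number of combinations.
param_grid = {
    "max_depth": [15, 30, 45],
    "n_estimators": [150, 200, 300],
}
n_folds = 10  # StratifiedKFold(10)


def grid_cost(param_grid, n_folds, refit=True):
    """Return (combinations, total model fits, trees built during cross-validation)."""
    keys = list(param_grid)
    combos = [dict(zip(keys, values)) for values in product(*param_grid.values())]
    cv_fits = len(combos) * n_folds
    # GridSearchCV adds one more fit on the full data when refit=True (the default).
    total_fits = cv_fits + (1 if refit else 0)
    # Assumes one tree per boosting round (binary classification).
    cv_trees = n_folds * sum(c["n_estimators"] for c in combos)
    return len(combos), total_fits, cv_trees


print(grid_cost(param_grid, n_folds))
```

So 9 combinations with 10-fold cross-validation mean 91 model fits and 19,500 trees during cross-validation, each CV fit using about 1.8 million rows with a max_depth of up to 45. Whether that explains the 2 hours depends on the thread settings, which the question does not show.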

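This seems to be a rare situation, since few of the people I asked have run into it, and the lack of output makes debugging hard. I would like to know: 1) Is it normal for this to take so long to run? 2) Any hints on why this happens and how to debug it? Many thanks in advance!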
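One possible way to get at least some trace when a long-running cell dies silently is to write timestamped records to a file on disk instead of relying on notebook output. The following is a minimal sketch using only the standard library. The helper name run_logged and the log file name are hypothetical, and the resource module is only available on Unix-like systems (which is assumed here for the notebook instance). If the process is killed outright, for example by the operating system when memory runs out, the log will contain a "started" record with no matching "finished" or "failed" record, which is itself a clue.

```python
import logging
import resource
import time


def run_logged(fn, *args, log_path="grid_search.log", **kwargs):
    """Call fn(*args, **kwargs) and append start/finish/failure records to log_path."""
    logger = logging.getLogger("run_logged")
    logger.setLevel(logging.INFO)
    # FileHandler appends by default and flushes after every record,
    # so records written before a crash remain on disk.
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    start = time.time()
    logger.info("started %s", getattr(fn, "__name__", repr(fn)))
    try:
        result = fn(*args, **kwargs)
    except BaseException:
        # Also records KeyboardInterrupt; the exception is re-raised unchanged.
        logger.exception("failed after %.1f s", time.time() - start)
        raise
    else:
        # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        logger.info("finished in %.1f s, peak RSS %d", time.time() - start, peak)
        return result
    finally:
        logger.removeHandler(handler)
        handler.close()


# Hypothetical usage with the function from the question:
# best_xgb = run_logged(train_xgb, X_train, y_train)
```

Independently of this wrapper, GridSearchCV accepts a verbose argument (for example verbose=2) that prints a line per fit, and an n_jobs argument that controls how many fits run in parallel. Both may help narrow down where the run stops.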

0 answers:

No answers yet