LightGBM early-stopping error suddenly spikes when performing k-fold cross-validation (due to pandas index union). What could be the cause?

Date: 2019-02-11 09:30:01

Tags: python cross-validation lightgbm

When running n-fold cross-validation on a LightGBM regressor with saved fold indices, the error initially decreases but then suddenly shoots up. This happens only on the first fold, and only when I use the saved folds called train_dev_kfolds, not when I use kfolds. Here is the output: [image of the error output]

I have checked the kfold indices against the dataframe's index, and everything appears to be in order: the indices are mutually exclusive and collectively exhaustive (with respect to the dataframe). Regenerating the kfolds still produces the same problem. Strangest of all, the model fits fine on the other folds, as well as on other splits over subsets of the dataframe (e.g. kfolds). Using LGBMRegressor (the sklearn-style API) does not trigger the problem; for example, model = LGBMRegressor(n_estimators=500).fit(X_train, y_train, eval_set=[(X_dev, y_dev)], eval_metric='rmse', early_stopping_rounds=200, verbose=True) produces normal output.
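For reference, the index checks described above amount to roughly the following sketch (assuming folds, df, train_idx and dev_idx as in the code below; the sort_values() calls make the exhaustiveness check order-insensitive):

train_idx, dev_idx = folds[0]

# mutually exclusive: no label appears in both parts of the fold
assert len(train_idx.intersection(dev_idx)) == 0

# collectively exhaustive: together the parts cover the dataframe's index
assert train_idx.union(dev_idx).sort_values().equals(df.index.sort_values())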

What could be causing this rather mysterious error? Here is the code that triggers it:

import lightgbm as lgb
from sklearn.metrics import mean_squared_error

# first saved fold: (train indices, dev indices)
train_idx = folds[0][0]
dev_idx = folds[0][1]

# feature columns and target column for this model
inc_cols, key_col = pp.get_cols(settings, 'model1', df)
X_train = df.loc[train_idx, inc_cols]
X_dev = df.loc[dev_idx, inc_cols]
y_train = df.loc[train_idx, key_col]
y_dev = df.loc[dev_idx, key_col]

lgb_train = lgb.Dataset(X_train,
                        label=y_train,
                        free_raw_data=False)
lgb_test = lgb.Dataset(X_dev,
                       label=y_dev,
                       free_raw_data=False)

model = lgb.train(
                params,
                lgb_train,
                valid_sets=[lgb_train, lgb_test],
                valid_names=['train', 'test'],
                num_boost_round=200,
                early_stopping_rounds=200,
                verbose_eval=10
                )
preds = model.predict(X_dev)
display(mean_squared_error(y_dev, preds)**.5)  # RMSE on the dev fold
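The block above evaluates only the first fold (folds[0]). For context, a minimal sketch of the full loop over all saved folds, under the same assumptions (params, df, inc_cols, key_col as above), would look like this:

import numpy as np

rmses = []
for train_idx, dev_idx in folds:
    X_train, y_train = df.loc[train_idx, inc_cols], df.loc[train_idx, key_col]
    X_dev, y_dev = df.loc[dev_idx, inc_cols], df.loc[dev_idx, key_col]

    model = lgb.train(
        params,
        lgb.Dataset(X_train, label=y_train, free_raw_data=False),
        valid_sets=[lgb.Dataset(X_dev, label=y_dev, free_raw_data=False)],
        valid_names=['dev'],
        num_boost_round=200,
        early_stopping_rounds=200,
        verbose_eval=10
    )
    # RMSE on the held-out part of this fold
    preds = model.predict(X_dev)
    rmses.append(mean_squared_error(y_dev, preds) ** .5)

print(np.mean(rmses))  # average RMSE across folds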

Here is the function used to save the indices:

from sklearn.model_selection import KFold
import pickle
import gc

def set_kf(root_train_df, root_test_df, settings):
    n_folds = settings['N_FOLDS']
    train_idx, dev_idx, test_idx = get_train_dev_test_idx(root_train_df, root_test_df, settings)
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    kfolds, train_dev_kfolds = [], []  # get_n_splits may be better for fine control

    # split on train only
    for kf_train_idx, kf_test_idx in kf.split(train_idx):
        kfolds.append((train_idx[kf_train_idx], train_idx[kf_test_idx]))

    # split on the union of train and dev
    if dev_idx is not None:
        train_dev_idx = train_idx.union(dev_idx)
        for kf_train_idx, kf_test_idx in kf.split(train_dev_idx):
            train_dev_kfolds.append((train_dev_idx[kf_train_idx], train_dev_idx[kf_test_idx]))
    else:
        train_dev_kfolds = None

    data = {
        'train_idx': train_idx,
        'dev_idx': dev_idx,
        'test_idx': test_idx,
        'kfolds': kfolds,
        'train_dev_kfolds': train_dev_kfolds,
        'kf': kf
    }

    with open(settings['cvs'], 'wb') as f:
        # pickle the 'data' dictionary using the highest protocol available
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

    del kfolds, train_dev_kfolds, data
    gc.collect()
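For completeness, the folds used in the training snippet at the top are then loaded back from this pickle along the following lines (a sketch; settings['cvs'] is the same path set_kf writes to):

import pickle

with open(settings['cvs'], 'rb') as f:
    data = pickle.load(f)

folds = data['train_dev_kfolds']  # or data['kfolds']
train_idx, dev_idx = folds[0]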

Here are the parameters used for the regression:

params = {
    'task': 'train',
    'boosting': 'goss',
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.01,
    'subsample': 0.9855232997390695,
    'max_depth': 7,
    'top_rate': 0.9064148448434349,
    'num_leaves': 63,
    'min_child_weight': 41.9612869171337,
    'other_rate': 0.0721768246018207,
    'reg_alpha': 9.677537745007898,
    'colsample_bytree': 0.5665320670155495,
    'min_split_gain': 9.820197773625843,
    'reg_lambda': 8.2532317400459,
    'min_data_in_leaf': 21,
    'verbose': 10,
    'seed': 42,
    'bagging_seed': 42,
    'drop_seed': 42,
    'device': 'gpu'
}

Edit: Creating the splits within the same script works fine and does not produce the problem. Could this be a pickle issue?
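One quick way to rule pickling in or out is a round-trip check on an index by itself; a minimal sketch (train_idx standing in for any of the saved indices):

import pickle

restored = pickle.loads(pickle.dumps(train_idx, pickle.HIGHEST_PROTOCOL))

# True, True if pickling alone preserves values, order and dtype
print(train_idx.equals(restored), train_idx.dtype == restored.dtype)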

Edit 2: Stranger still, when I load train_idx and dev_idx from the pickled object and perform the split on their union (using train_dev_idx = train_idx.union(dev_idx)), the same problem appears. However, if I take the combined index from an existing dataframe, train_dev_idx2 = train_dev_df.index, it works fine. Moreover, when I compare the two with train_dev_idx.equals(train_dev_idx2), the result is False, but when I take the difference of the two, the result is an empty index.
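Note that these two observations are not contradictory: Index.equals() compares both the elements and their order, while Index.difference() treats the indices as sets. A toy illustration (not the question's data):

import pandas as pd

a = pd.Index([3, 1, 2])
b = pd.Index([1, 2, 3])

print(a.equals(b))      # False: same elements, different order
print(a.difference(b))  # empty Index: identical as sets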

Edit 3: I have narrowed the problem down to train_idx.union(dev_idx). When I replace that line with train_dev_idx = root_train_df.index, everything appears to work. Can someone help me understand why that would be the case? All the elements in the two indices appear to be equal. And why would this make LightGBM's error spike?
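One documented pandas behaviour that is at least consistent with the observations above: with the default sort=None, Index.union tries to sort its result, so the combined index can come back in a different order than the dataframe's own index even though it contains exactly the same labels. A small sketch:

import pandas as pd

original = pd.Index([5, 3, 1, 4, 2, 0])  # e.g. a shuffled dataframe index
train, dev = original[:3], original[3:]

combined = train.union(dev)
print(combined)                       # Index([0, 1, 2, 3, 4, 5]): sorted
print(combined.equals(original))      # False, despite identical elements
print(combined.difference(original))  # empty: equal as sets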

0 Answers:

There are no answers yet.