When running n-fold cross-validation on a LightGBM regressor with saved indices, the error initially decreases but then suddenly spikes. This happens only on the first fold, and only when I use the splits called train_dev_kfolds, not when I use kfolds. Here is the output:
[image of the error output]
I have checked the kfolds against the dataframe's index, and everything appears to be in order: the indexes are mutually exclusive and collectively exhaustive (with respect to the dataframe). Regenerating the kfolds still produces the same problem. Strangest of all, the model fits fine on the other folds, as well as on other splits over subsets of the dataframe (e.g. kfolds). Using LGBMRegressor (the sklearn-style API) does not trigger the issue; for example, model = LGBMRegressor(n_estimators=500).fit(X_train, y_train, eval_set=[(X_dev, y_dev)], eval_metric='rmse', early_stopping_rounds=200, verbose=True) trains and reports metrics normally.
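A check along these lines is what I mean by mutually exclusive and collectively exhaustive (a minimal sketch; folds stands for the list of (train, dev) index pairs loaded from the pickle):

for fold_train_idx, fold_dev_idx in folds:
    # mutually exclusive: no row index appears in both parts of a fold
    assert fold_train_idx.intersection(fold_dev_idx).empty
    # collectively exhaustive: together the parts cover the dataframe exactly
    combined = fold_train_idx.union(fold_dev_idx)
    assert combined.difference(df.index).empty
    assert df.index.difference(combined).empty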
What could be causing this rather mysterious error?

import lightgbm as lgb
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error
# indices for the first fold (where the error spike occurs)
train_idx = folds[0][0]
dev_idx = folds[0][1]

# feature columns and target column for this model
inc_cols, key_col = pp.get_cols(settings, 'model1', df)
X_train = df.loc[train_idx, inc_cols]
X_dev = df.loc[dev_idx, inc_cols]
y_train = df.loc[train_idx, key_col]
y_dev = df.loc[dev_idx, key_col]

lgb_train = lgb.Dataset(X_train, label=y_train, free_raw_data=False)
lgb_test = lgb.Dataset(X_dev, label=y_dev, free_raw_data=False)

model = lgb.train(
    params,
    lgb_train,
    valid_sets=[lgb_train, lgb_test],
    valid_names=['train', 'test'],
    num_boost_round=200,
    early_stopping_rounds=200,
    verbose_eval=10,
)

preds = model.predict(X_dev)
display(mean_squared_error(y_dev, preds) ** .5)  # RMSE on the dev fold
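For context, folds above comes from the pickled dictionary written by set_kf below; the loading step is omitted above, but it is roughly the following (keys as saved by set_kf):

import pickle

with open(settings['cvs'], 'rb') as f:
    data = pickle.load(f)
folds = data['train_dev_kfolds']  # the problematic splits; data['kfolds'] behaves fine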
Here is the function used to save the indices:
def set_kf(root_train_df, root_test_df, settings):
    n_folds = settings['N_FOLDS']
    train_idx, dev_idx, test_idx = get_train_dev_test_idx(root_train_df, root_test_df, settings)
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    kfolds, train_dev_kfolds = [], []  # get_n_splits may be better for fine control

    # split on train only
    for kf_train_idx, kf_test_idx in kf.split(train_idx):
        kfolds.append((train_idx[kf_train_idx], train_idx[kf_test_idx]))

    # split on the union of train and dev
    if dev_idx is not None:
        train_dev_idx = train_idx.union(dev_idx)
        for kf_train_idx, kf_test_idx in kf.split(train_dev_idx):
            train_dev_kfolds.append((train_dev_idx[kf_train_idx], train_dev_idx[kf_test_idx]))
    else:
        train_dev_kfolds = None

    data = {
        'train_idx': train_idx,
        'dev_idx': dev_idx,
        'test_idx': test_idx,
        'kfolds': kfolds,
        'train_dev_kfolds': train_dev_kfolds,
        'kf': kf,
    }
    with open(settings['cvs'], 'wb') as f:
        # pickle the 'data' dictionary using the highest protocol available
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
    del kfolds, train_dev_kfolds, data
    gc.collect()
Here are the parameters used for the regression:
params = {
    'task': 'train',
    'boosting': 'goss',
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.01,
    'subsample': 0.9855232997390695,
    'max_depth': 7,
    'top_rate': 0.9064148448434349,
    'num_leaves': 63,
    'min_child_weight': 41.9612869171337,
    'other_rate': 0.0721768246018207,
    'reg_alpha': 9.677537745007898,
    'colsample_bytree': 0.5665320670155495,
    'min_split_gain': 9.820197773625843,
    'reg_lambda': 8.2532317400459,
    'min_data_in_leaf': 21,
    'verbose': 10,
    'seed': 42,
    'bagging_seed': 42,
    'drop_seed': 42,
    'device': 'gpu',
}
Edit: Creating the splits within the same script works fine and does not reproduce the problem. Could this be an issue with pickling?
Edit 2: Stranger still, the same problem appears when I load train_idx and dev_idx from the pickled object and perform the split on their union (via train_dev_idx = train_idx.union(dev_idx)). However, if I take the combined index from an existing dataframe, train_dev_idx2 = train_dev_df.index, it works fine. Moreover, comparing the two with train_dev_idx.equals(train_dev_idx2) returns False, yet taking their difference yields an empty index.
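To illustrate the .equals behavior (a toy example of my own, not the actual indexes): two pandas indexes holding the same elements in a different order are not .equals, while their set difference is still empty:

import pandas as pd

a = pd.Index([0, 1, 2, 3])
b = pd.Index([3, 2, 1, 0])    # same elements, different order

print(a.equals(b))            # False: .equals also compares element order
print(a.difference(b).empty)  # True: as sets, the two are identical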
Edit 3: I have narrowed the problem down to train_idx.union(dev_idx). When I replace that line with train_dev_idx = root_train_df.index, everything seems to work. Could someone help me understand why that is? All the elements of the two indexes appear to be equal. And why would this make LightGBM's error spike?
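One pandas behavior that may be relevant (again a toy demonstration, not my actual data): Index.union returns a sorted result by default, so train_idx.union(dev_idx) can come back in a different element order than root_train_df.index even when the contents match:

import pandas as pd

idx = pd.Index([5, 3, 1])    # original, unsorted order
other = pd.Index([4, 2, 0])

print(idx.union(other))      # Index([0, 1, 2, 3, 4, 5], dtype='int64') - sorted
print(idx.union(other).equals(idx.append(other)))  # False: order differs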