I regularly use XGBoost in R and would like to start using LightGBM on the same data. My goal is to use Cohen's kappa as the evaluation metric. However, I cannot get LightGBM to work correctly; no learning seems to take place. As a very simple example, I will use the Titanic dataset.
library(data.table)
library(dplyr)
library(caret)
titanic <- fread("https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv")
titanic_complete <- titanic %>%
  select(survived, pclass, sex, age, sibsp, parch, fare, embarked) %>%
  mutate_if(is.character, as.factor) %>%
  mutate(survived = as.factor(survived)) %>%
  na.omit()

train_class <- titanic_complete %>%
  select(survived) %>%
  pull()

train_numeric <- titanic_complete %>%
  select_if(is.numeric) %>%
  data.matrix()
ctrl <- trainControl(method = "none", search = "grid")
tune_grid_xgbTree <- expand.grid(
  nrounds = 700,
  eta = 0.1,
  max_depth = 3,
  gamma = 0,
  colsample_bytree = 0,
  min_child_weight = 1,
  subsample = 1)
set.seed(512)
fit_xgb <- train(
  x = train_numeric,
  y = train_class,
  tuneGrid = tune_grid_xgbTree,
  trControl = ctrl,
  method = "xgbTree",
  metric = "Kappa",
  verbose = TRUE)
confusionMatrix(predict(fit_xgb, train_numeric), train_class)
This gives me a Kappa of 0.57 on the training set (this is only to illustrate my problem; otherwise I would use cross-validation).
For LightGBM, I wrote Kappa as a custom evaluation function:
library(lightgbm)
lgb.kappa <- function(preds, y) {
  label <- getinfo(y, "label")
  k <- unlist(e1071::classAgreement(table(label, preds)))["kappa"]
  return(list(name = "kappa", value = as.numeric(k), higher_better = TRUE))
}
X_train <- titanic_complete %>% select(-survived) %>% data.matrix()
y_train <- titanic_complete %>% select(survived) %>% data.matrix()
y_train <- y_train - 1
dtrain <- lgb.Dataset(data = X_train, label = y_train)
Here I use the same parameter set as with XGBoost, although I have also tried different combinations without success.
fit_lgbm <- lgb.train(data = dtrain,
                      objective = "binary",
                      learning_rate = 0.1,
                      nrounds = 700,
                      colsample_bytree = 0,
                      eval = lgb.kappa,
                      min_child_weight = 1,
                      max_depth = 3)
No learning takes place: the algorithm outputs "No further splits with positive gain, best gain: -inf" and Kappa = 0.
If anyone has managed to implement LightGBM successfully (perhaps with a custom evaluation metric), I would be very happy for a hint on how to solve this.
Answer (score: 1):
"No learning takes place, the algorithm outputs 'No further splits with positive gain, best gain: -inf'"
This is because LightGBM's default parameter values are configured for larger datasets; the training dataset in the example above has only 714 rows. To deal with this, I suggest setting LightGBM's parameters to values that allow smaller leaf nodes, and limiting the number of leaves rather than the depth.
list(
  "min_data_in_leaf" = 3
  , "max_depth" = -1
  , "num_leaves" = 8
)
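For reference, a rough sketch of the defaults these settings relax (taken from the LightGBM parameter documentation; worth double-checking against the version you have installed):

# LightGBM defaults (approximate), shown only for comparison with the values above
default_params <- list(
  "min_data_in_leaf" = 20   # every leaf must contain at least 20 rows by default
  , "max_depth" = -1        # depth is unlimited by default; trees are bounded by num_leaves
  , "num_leaves" = 31
)

On a 714-row dataset these defaults leave much less room to split than on the large datasets they were tuned for, which is consistent with the "best gain: -inf" message.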
"and Kappa = 0."
I believe your implementation of Cohen's kappa is mistaken: the input to e1071::classAgreement() should be a table of counts (a confusion matrix), whereas preds are predicted probabilities. I think the implementation below is correct, based on the description of this metric on Wikipedia.
lgb.kappa <- function(preds, dtrain) {
  label <- getinfo(dtrain, "label")
  threshold <- 0.5
  thresholded_preds <- as.integer(preds > threshold)
  k <- unlist(e1071::classAgreement(table(label, thresholded_preds)))["kappa"]
  return(list(name = "kappa", value = as.numeric(k), higher_better = TRUE))
}
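As a quick sanity check of the thresholding and kappa computation, here is a toy example (the labels and probabilities below are made up purely for illustration):

# purely illustrative labels and predicted probabilities
label <- c(0, 0, 1, 1, 1)
preds <- c(0.2, 0.6, 0.7, 0.9, 0.4)
thresholded_preds <- as.integer(preds > 0.5)
unlist(e1071::classAgreement(table(label, thresholded_preds)))["kappa"]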
Finally, for a dataset of roughly 700 observations, I think 700 iterations are probably far too many. You can watch the value of the metric evaluated against the training data at each iteration by passing the training data in as a validation set.
Putting it all together, I think the code below accomplishes what the original question asked for.
library(data.table)
library(dplyr)
library(caret)
library(lightgbm)
titanic <- fread("https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv")
titanic_complete <- titanic %>%
  select(survived, pclass, sex, age, sibsp, parch, fare, embarked) %>%
  mutate_if(is.character, as.factor) %>%
  mutate(survived = as.factor(survived)) %>%
  na.omit()

train_class <- titanic_complete %>%
  select(survived) %>%
  pull()

train_numeric <- titanic_complete %>%
  select_if(is.numeric) %>%
  data.matrix()
lgb.kappa <- function(preds, dtrain) {
  label <- getinfo(dtrain, "label")
  threshold <- 0.5
  thresholded_preds <- as.integer(preds > threshold)
  k <- unlist(e1071::classAgreement(table(label, thresholded_preds)))["kappa"]
  return(list(name = "kappa", value = as.numeric(k), higher_better = TRUE))
}
X_train <- titanic_complete %>% select(-survived) %>% data.matrix()
y_train <- titanic_complete %>% select(survived) %>% data.matrix()
y_train <- y_train - 1
# train, printing out eval metrics at every iteration
fit_lgbm <- lgb.train(
  data = lgb.Dataset(
    data = X_train,
    label = y_train
  ),
  params = list(
    "min_data_in_leaf" = 3
    , "max_depth" = -1
    , "num_leaves" = 8
  ),
  objective = "binary",
  learning_rate = 0.1,
  nrounds = 10L,
  verbose = 1L,
  valids = list(
    "train" = lgb.Dataset(
      data = X_train,
      label = y_train
    )
  ),
  eval = lgb.kappa
)
# evaluate a custom function after training
fit_lgbm$eval_train(
  feval = lgb.kappa
)
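To compare against the Kappa reported by confusionMatrix() in the XGBoost example, a minimal sketch along these lines should work (assuming predict() returns probabilities, which the lightgbm R package does by default for objective = "binary"):

# predict probabilities on the training data, threshold at 0.5,
# and compute the confusion matrix / Kappa with caret, mirroring the XGBoost check
pred_probs <- predict(fit_lgbm, X_train)
pred_class <- factor(as.integer(pred_probs > 0.5), levels = c(0, 1))
confusionMatrix(pred_class, factor(y_train, levels = c(0, 1)))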