How to evaluate LightGBM in R using Cohen's kappa?

Asked: 2019-08-13 11:47:23

Tags: r xgboost lightgbm

I regularly use XGBoost in R and would now like to start using LightGBM on the same data. My goal is to use Cohen's kappa as the evaluation metric. However, I cannot get LightGBM to work correctly; no learning seems to take place. As a very simple example, I will use the Titanic dataset.

library(data.table)
library(dplyr)
library(caret)

titanic <- fread("https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv")

titanic_complete <- titanic %>%
   select(survived, pclass, sex, age, sibsp, parch, fare, embarked) %>% 
   mutate_if(is.character, as.factor) %>%
   mutate(survived = as.factor(survived)) %>% 
   na.omit()

train_class <- titanic_complete %>% 
   select(survived) %>% 
   pull()

train_numeric <- titanic_complete %>% 
   select_if(is.numeric) %>% 
   data.matrix()

ctrl <- trainControl(method = "none", search = "grid")

tune_grid_xgbTree <- expand.grid(
   nrounds = 700,
   eta = 0.1,
   max_depth = 3,
   gamma = 0,
   colsample_bytree = 0,
   min_child_weight = 1,
   subsample = 1)

set.seed(512)
fit_xgb <- train(
   x = train_numeric,
   y = train_class,
   tuneGrid = tune_grid_xgbTree,
   trControl = ctrl,
   method = "xgbTree",
   metric = "Kappa",
   verbose = TRUE)

confusionMatrix(predict(fit_xgb, train_numeric), train_class)

This gives me a Kappa of 0.57 on the training set (this is only to illustrate my problem; otherwise I would use cross-validation).
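
For reference, a minimal sketch of what the cross-validated version could look like with caret; the 5-fold setting is an arbitrary choice for illustration, not part of my actual workflow:

# 5-fold cross-validation, selecting on Kappa (arbitrary fold count for illustration)
ctrl_cv <- trainControl(method = "cv", number = 5)

set.seed(512)
fit_xgb_cv <- train(
   x = train_numeric,
   y = train_class,
   tuneGrid = tune_grid_xgbTree,
   trControl = ctrl_cv,
   method = "xgbTree",
   metric = "Kappa",
   verbose = TRUE)

# resampled Accuracy and Kappa averaged over the folds
fit_xgb_cv$results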

For LightGBM, I wrote Kappa as a custom evaluation function:

library(lightgbm)
lgb.kappa <- function(preds, y) {
   label <- getinfo(y, "label")
   k <- unlist(e1071::classAgreement(table(label, preds)))["kappa"]
   return(list(name = "kappa", value = as.numeric(k), higher_better = TRUE))
}

X_train <- titanic_complete %>% select(-survived) %>% data.matrix()
y_train <- titanic_complete %>% select(survived) %>% data.matrix()
y_train <- y_train - 1

dtrain <- lgb.Dataset(data = X_train, label = y_train)

Here I use the same set of parameters as in XGBoost, although I have also tried different combinations without success.

fit_lgbm <- lgb.train(data = dtrain,
                  objective = "binary",
                  learning_rate = 0.1,
                  nrounds = 700,
                  colsample_bytree = 0,
                  eval = lgb.kappa,
                  min_child_weight = 1,
                  max_depth = 3)

No learning takes place, the algorithm outputs "No further splits with positive gain, best gain: -inf" and Kappa = 0.

If anyone has successfully implemented LightGBM (perhaps with a custom evaluation metric), I would be very grateful for a hint on how to get this working.

1 Answer:

Answer 0 (score: 1)

"No learning takes place, the algorithm outputs 'No further splits with positive gain, best gain: -inf'"

This is because LightGBM's default parameter values are configured for large datasets. The training dataset in your example above has only 714 rows. To deal with this, I recommend setting LightGBM's parameters to values that permit smaller leaf nodes, and limiting the number of leaves instead of the depth, for example:

list(
    # LightGBM's defaults (min_data_in_leaf = 20, num_leaves = 31) are tuned for large datasets
    "min_data_in_leaf" = 3    # allow very small leaves on this 714-row dataset
    , "max_depth" = -1        # no depth limit; control complexity via num_leaves instead
    , "num_leaves" = 8
)

"and Kappa = 0."

I believe your implementation of Cohen's kappa is mistaken: the input to e1071::classAgreement() should be a table of counts (a confusion matrix), while preds is in the form of predicted probabilities. I think the implementation below is correct, based on the description of this metric on Wikipedia.

lgb.kappa <- function(preds, dtrain) {
    label <- getinfo(dtrain, "label")
    # preds are predicted probabilities, so threshold them into class labels
    # before building the confusion table of counts
    threshold <- 0.5
    thresholded_preds <- as.integer(preds > threshold)
    k <- unlist(e1071::classAgreement(table(label, thresholded_preds)))["kappa"]
    return(list(name = "kappa", value = as.numeric(k), higher_better = TRUE))
}
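
To sanity-check the metric on its own, you can call e1071::classAgreement() directly on a confusion table built from thresholded predictions; the two vectors below are made-up toy values purely for illustration:

# toy labels and predicted probabilities (hypothetical values)
label <- c(0, 0, 1, 1, 1, 0)
preds <- c(0.2, 0.6, 0.8, 0.3, 0.9, 0.1)

thresholded_preds <- as.integer(preds > 0.5)
table(label, thresholded_preds)   # 2x2 table of counts (confusion matrix)
unlist(e1071::classAgreement(table(label, thresholded_preds)))["kappa"]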

Finally, for a dataset of roughly 700 observations, I think 700 iterations is probably far too many. By passing the training data in as a validation set, you can see the value of the metric evaluated against the training data at every iteration.

Putting it all together, I think the code below accomplishes what the original question asks for.

library(data.table)
library(dplyr)
library(caret)
library(lightgbm)

titanic <- fread("https://raw.githubusercontent.com/pcsanwald/kaggle-titanic/master/train.csv")

titanic_complete <- titanic %>%
    select(survived, pclass, sex, age, sibsp, parch, fare, embarked) %>% 
    mutate_if(is.character, as.factor) %>%
    mutate(survived = as.factor(survived)) %>% 
    na.omit()

train_class <- titanic_complete %>% 
    select(survived) %>% 
    pull()

train_numeric <- titanic_complete %>% 
    select_if(is.numeric) %>% 
    data.matrix()

lgb.kappa <- function(preds, dtrain) {
    label <- getinfo(dtrain, "label")
    threshold <- 0.5
    thresholded_preds <- as.integer(preds > threshold)
    k <- unlist(e1071::classAgreement(table(label, thresholded_preds)))["kappa"]
    return(list(name = "kappa", value = as.numeric(k), higher_better = TRUE))
}

X_train <- titanic_complete %>% select(-survived) %>% data.matrix()
y_train <- titanic_complete %>% select(survived) %>% data.matrix()
y_train <- y_train - 1

# train, printing out eval metrics at every iteration
fit_lgbm <- lgb.train(
    data = lgb.Dataset(
        data = X_train,
        label = y_train
    ),
    params = list(
        "min_data_in_leaf" = 3
        , "max_depth" = -1
        , "num_leaves" = 8
    ),
    objective = "binary",
    learning_rate = 0.1,
    nrounds = 10L,
    verbose = 1L,
    valids = list(
        "train" = lgb.Dataset(
            data = X_train,
            label = y_train
        )
    ),
    eval = lgb.kappa
)

# evaluate a custom function after training
fit_lgbm$eval_train(
    feval = lgb.kappa
)
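
As a follow-up on the number of iterations: rather than guessing nrounds, you could hold out part of the data and use early stopping. The sketch below assumes the early_stopping_rounds argument of lgb.train() and an arbitrary 80/20 split; treat it as an outline rather than a drop-in solution:

# hold out 20% of the rows as a validation split (arbitrary choice)
set.seed(512)
valid_idx <- sample(seq_len(nrow(X_train)), size = floor(0.2 * nrow(X_train)))

dtrain_sub <- lgb.Dataset(
    data = X_train[-valid_idx, ],
    label = y_train[-valid_idx]
)
dvalid <- lgb.Dataset.create.valid(
    dtrain_sub,
    data = X_train[valid_idx, ],
    label = y_train[valid_idx]
)

fit_es <- lgb.train(
    data = dtrain_sub,
    params = list(
        "min_data_in_leaf" = 3
        , "max_depth" = -1
        , "num_leaves" = 8
    ),
    objective = "binary",
    learning_rate = 0.1,
    nrounds = 700L,
    valids = list("valid" = dvalid),
    eval = lgb.kappa,
    early_stopping_rounds = 10L   # stop once the validation metric stops improving
)

# iteration that achieved the best validation metric
fit_es$best_iter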