Question

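I want to downsample my data because I have a significant class imbalance. Without downsampling, my GBM model performs reasonably well; however, with r-caret's downSample, accuracy = 0.5. I applied the same downsampling to another GBM model and got exactly the same result. What gives?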

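library(caret)

set.seed(1914)
# downSample() returns the columns of x plus the outcome in a new factor column named "Class"
down_train_my_gbm <- downSample(x = combined_features,
                                y = combined_features$label)
# x still contained the original label column, so drop it and keep only "Class"
down_train_my_gbm$label <- NULL
my_gbm_combined_downsampled <- train(Class ~ .,
                                     data = down_train_my_gbm,
                                     method = "gbm",
                                     trControl = trainControl(method = "repeatedcv",
                                                              number = 10, repeats = 3,
                                                              classProbs = TRUE),
                                     preProcess = c("range"),
                                     verbose = FALSE)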

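I suspected the problem might be related to classProbs = TRUE. Changing it to FALSE brings the accuracy to > 0.95 ... but I then get exactly the same results across multiple models (which do not produce identical accuracies without downsampling). I'm baffled. What am I doing wrong here?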

Answer 1

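caret's train function lets you down-sample, up-sample, and so on through the sampling option of trainControl. Following the guide Subsampling During Resampling, in your case it would be: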

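ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     ## new option here: down-sample inside each resample
                     sampling = "down")

## imbal_train is the imbalanced training set used in the guide
model_with_down_sample <- train(Class ~ ., data = imbal_train,
                                method = "gbm",
                                ## twoClassSummary reports ROC, Sens and Spec rather than Accuracy
                                metric = "ROC",
                                preProcess = c("range"),
                                verbose = FALSE,
                                trControl = ctrl)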
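The snippet above relies on imbal_train, which is not defined in this thread. As a minimal sketch, assuming a simulated data set is an acceptable stand-in, something like the following builds an imbalanced imbal_train with caret's twoClassSim and shows what downSample does to the class counts. The sample size and the intercept and linearVars values are illustrative assumptions, not taken from the question.

```r
library(caret)

set.seed(1914)
# Assumption: simulated stand-in for the asker's data; parameter values are illustrative
imbal_train <- twoClassSim(5000, intercept = -20, linearVars = 20)

# Class counts before down-sampling (heavily skewed towards one class)
before <- table(imbal_train$Class)
print(before)

# Pass only the predictors as x, so the outcome appears once, as "Class"
predictors <- imbal_train[, setdiff(names(imbal_train), "Class")]
down <- downSample(x = predictors, y = imbal_train$Class)

# After down-sampling, every class has as many rows as the rarest class had
after <- table(down$Class)
print(after)
```

Passing only the predictor columns as x avoids having to delete a duplicated outcome column afterwards. Note that down-sampling the whole training set before calling train is not the same as sampling = "down", which down-samples inside each resample.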

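As a side note, avoid the formula interface (e.g. Class ~ .) and pass the columns directly instead: with many predictors, the formula interface has been shown to have memory and speed problems (https://github.com/topepo/caret/issues/263).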
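Passing the columns directly could look like the sketch below, which uses the x and y arguments of train instead of a formula. It assumes the gbm package is installed and uses a simulated data set from twoClassSim as a stand-in for the real data; the simulation parameters and the choice of 5-fold cross-validation are illustrative assumptions made to keep the example quick.

```r
library(caret)

set.seed(1914)
# Assumption: simulated stand-in for the real data; parameter values are illustrative
imbal_train <- twoClassSim(5000, intercept = -20, linearVars = 20)

# Split predictors and outcome explicitly instead of using Class ~ .
predictors <- imbal_train[, setdiff(names(imbal_train), "Class")]
outcome <- imbal_train$Class

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     sampling = "down")  # down-sample inside each resample

fit <- train(x = predictors, y = outcome,
             method = "gbm",
             metric = "ROC",             # metric reported by twoClassSummary
             preProcess = c("range"),
             verbose = FALSE,
             trControl = ctrl)

# Resampled ROC, Sens and Spec for each tuning combination
print(fit$results)
```

The resampling results here come from held-out folds that were not down-sampled, so they reflect the original class balance.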

Hope it helps.

How do I downsample with r-caret?
