I noticed that when I use broom's augment() function, the resulting data frame has more rows than the data I started with. For example:
# Statistical Modeling
## dummy vars
library(tidyverse)
library(caret) # trainControl(), train() and the *Summary() functions below come from caret
training_data <- mtcars
dummy <- caret::dummyVars(~ ., data = training_data, fullRank = TRUE, sep = ".")
training_data <- predict(dummy, mtcars) %>% as.data.frame()
clean_names <- names(training_data) %>% str_replace_all(" |`", "")
names(training_data) <- clean_names
## make target a factor
target <- training_data$mpg
target <- ifelse(target < 20, 0, 1) %>% make.names() %>% as.factor() # levels "X0" (mpg < 20) and "X1"
## custom evaluation metric function
my_summary <- function(data, lev = NULL, model = NULL) {
  a1 <- defaultSummary(data, lev, model)
  b1 <- twoClassSummary(data, lev, model)
  c1 <- prSummary(data, lev, model)
  out <- c(a1, b1, c1)
  out
}
## tuning & parameters
set.seed(123)
train_control <- trainControl(
  method = "cv",
  number = 3,
  sampling = "up", # oversample due to imbalanced data
  savePredictions = TRUE,
  verboseIter = TRUE,
  classProbs = TRUE,
  summaryFunction = my_summary
)
linear_model <- train(
  x = select(training_data, -mpg),
  y = target,
  trControl = train_control,
  method = "glm", # logistic regression
  family = "binomial",
  metric = "AUC"
)
library(broom)
linear_augment <- augment(linear_model$finalModel)
Now, if I look at the new augmented data frame and compare it with the original mtcars:
> nrow(mtcars)
[1] 32
> nrow(linear_augment)
[1] 36
I expected 32 rows, not 36. Why?
Answer 0 (score: 2)
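You are upsampling in the trainControl call, so the final model is fit on more samples than the original dataset contains. In mtcars, 18 cars have mpg < 20 and 14 have mpg >= 20; upsampling resamples the smaller class up to 18 rows, which gives 18 + 18 = 36.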
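To see where the 36 comes from without running the whole training pipeline, here is a minimal base R sketch of what upsampling does. It is not caret's internal code (caret ships its own upSample() helper); the index handling and the names `counts`, `idx`, and `upsampled` are my own, chosen only for illustration.

```r
# Illustrative sketch only: mimic upsampling with base R.
# Class labels as in the question: "X0" for mpg < 20, "X1" otherwise.
target <- factor(ifelse(mtcars$mpg < 20, "X0", "X1"))
counts <- table(target)

set.seed(123)
# For each class, keep every row and draw extra rows (with replacement)
# until the class is as large as the biggest class.
idx <- unlist(lapply(split(seq_along(target), target), function(i) {
  n_extra <- max(counts) - length(i)
  c(i, i[sample.int(length(i), n_extra, replace = TRUE)])
}))

upsampled <- mtcars[idx, ]
nrow(mtcars)     # rows in the original data
nrow(upsampled)  # rows after balancing the two classes
```

The number of rows after balancing is always the number of classes times the size of the largest class, regardless of the seed.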
## tuning & parameters
set.seed(123)
train_control <- trainControl(
  method = "cv",
  number = 3,
  # sampling = "up", # oversample due to imbalanced data
  savePredictions = TRUE,
  verboseIter = TRUE,
  classProbs = TRUE,
  summaryFunction = my_summary
)
linear_model <- train(
  x = select(training_data, -mpg),
  y = target,
  trControl = train_control,
  method = "glm", # logistic regression
  family = "binomial",
  metric = "AUC"
)
library(broom)
linear_augment <- augment(linear_model$finalModel)
Note that the upsampling line has been commented out. The row count now matches mtcars:
> dim(linear_augment)
[1] 32 19
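If you want to keep upsampling and still get one row per original observation, it may help to separate the data the model was fit on from the data you score. As far as I know, augment() falls back to the model frame of the fitted object when no data is supplied, which is why it reports the upsampled rows. The sketch below uses plain glm() and predict() rather than caret or broom; the single predictor `wt` and the names `up_data`, `rows_in_fit`, and `rows_scored` are assumptions made only to keep the example short.

```r
# Illustrative sketch only: a model fit on upsampled rows can still be
# scored on the original 32 rows.
target <- factor(ifelse(mtcars$mpg < 20, "X0", "X1"))

set.seed(123)
minority <- which(target == "X1")  # 14 rows, versus 18 rows of "X0"
n_extra <- sum(target == "X0") - length(minority)
extra <- minority[sample.int(length(minority), n_extra, replace = TRUE)]

# Original rows plus the resampled minority rows
up_data <- data.frame(mtcars, target = target)[c(seq_len(nrow(mtcars)), extra), ]

fit <- glm(target ~ wt, data = up_data, family = binomial)

rows_in_fit <- nrow(model.frame(fit))  # what the fitted object remembers
rows_scored <- length(predict(fit, newdata = mtcars, type = "response"))

rows_in_fit
rows_scored
```

With the caret model from the question, the equivalent idea would be to pass the original predictors to augment() through its newdata argument; I have not verified that against this exact model, so treat it as something to try rather than a confirmed fix.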