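I am new to machine learning and to caret. I am working on a supervised two-class classification problem, and I am stuck at preprocessing. In particular, I want to center, scale, create dummy variables, and impute missing values. I know this should be done separately on the training and test sets. My questions are: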
1) Should the dummy variables be created before imputation? It seems that imputation is not possible otherwise, and anyNA() still returns TRUE;
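2) Should the outcome also be converted to dummy variables? If so, how do I account for this in the train() function when training the classifier?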
3) After preprocessing (see the code below), the categorical variables are no longer a set of 1s and 0s. Is this correct? Here is an example:
library(caret)
set.seed(123)
# Simulated data: two categorical predictors, two numeric predictors, and a binary outcome
df <- data.frame(
"var_ord" = c(rep("a",300),rep("b",500),rep("c",200)),
"var_cat" = c(rep("Male",800),rep("Female",200)),
"var_num1" = floor(runif(1000, min=0, max=101)),
"var_num2" = floor(runif(1000, min=0, max=101)),
"outcome" = c(rep("pos",300),rep("neg",700))
)
# Randomly set about 15% of the values in every column (including outcome) to NA
df <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
# 80/20 train/test split
index <- createDataPartition(df$outcome, p = .8, list = FALSE, times = 1)
train <- df[index,]
test <- df[-index,]
# Center and scale (only the numeric columns are affected at this point)
preProcValues <- preProcess(train, method = c("center", "scale"))
train <- predict(preProcValues, train)
# Create dummy variables for every column, including outcome
dummy <- dummyVars(" ~ .", data = train)
train <- data.frame(predict(dummy, newdata = train))
# Impute missing values with k-nearest neighbours
preProcess_missingdata_model <- preProcess(train, method='knnImpute')
train <- predict(preProcess_missingdata_model, newdata = train)
anyNA(train)
head(train)
var_ord.a var_ord.b var_ord.c var_cat.Female var_cat.Male var_num1 var_num2
1 1.549346 -1.005206 -0.5047202 -0.4949989 0.4949989 -0.69301258 -0.81979728
2 1.549346 -1.005206 -0.5047202 -0.4949989 0.4949989 1.01365298 -0.28720490
4 1.549346 -1.005206 -0.5047202 -0.4949989 0.4949989 1.35498610 1.22065403
5 1.549346 -1.005206 -0.5047202 -0.4949989 0.4949989 1.52565265 1.18607011
6 1.549346 -1.005206 -0.5047202 -0.4949989 0.4949989 -1.54634537 -0.09353495
7 1.549346 -1.005206 -0.5047202 -0.4949989 0.4949989 -0.09226631 0.94398267
outcome.neg outcome.pos
1 -1.526418 1.526418
2 -1.526418 1.526418
4 -1.526418 1.526418
5 -1.526418 1.526418
6 -1.526418 1.526418
7 -1.526418 1.526418
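I would appreciate it if anyone could point out whether I am doing something wrong. Thanks! A link to an end-to-end tutorial would also be much appreciated.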