Preprocessing a data frame with mixed variables

Time: 2019-07-28 20:50:36

Tags: r r-caret preprocessor

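I am new to machine learning and to caret. I am working on a supervised two-class classification problem and I am stuck on preprocessing. In particular, I want to center, scale, dummy-code, and impute missing values. I understand that this should be done separately on the training and test sets. My questions are: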

1) Should dummy-coding be done before imputation? It seems that otherwise the imputation does not work, and anyNA() still returns TRUE.

2) Should the outcome also be dummy-coded? If so, how do I account for that in the train() function when training the classifier?

3) After preprocessing (see the code below), the categorical variables are no longer sets of 1s and 0s. Is this correct? Here is an example:

library(caret)  # provides createDataPartition(), preProcess(), dummyVars()

set.seed(123)

df <- data.frame(
    "var_ord" = c(rep("a",300),rep("b",500),rep("c",200)), 
    "var_cat" = c(rep("Male",800),rep("Female",200)), 
    "var_num1" = floor(runif(1000, min=0, max=101)), 
    "var_num2" = floor(runif(1000, min=0, max=101)), 
    "outcome" = c(rep("pos",300),rep("neg",700))
)

df <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))

index <- createDataPartition(df$outcome, p = .8, list = FALSE, times = 1)
train <- df[index,]
test  <- df[-index,]

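# Center and scale the training set
preProcValues <- preProcess(train, method = c("center", "scale"))
train <- predict(preProcValues, train)

# Dummy-code every categorical column; with the formula ~ . this includes the outcome
dummy <- dummyVars(" ~ .", data = train)
train <- data.frame(predict(dummy, newdata = train))

# Impute missing values with k-nearest neighbours
preProcess_missingdata_model <- preProcess(train, method='knnImpute')
train <- predict(preProcess_missingdata_model, newdata = train)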

anyNA(train)
head(train)

var_ord.a var_ord.b  var_ord.c var_cat.Female var_cat.Male    var_num1    var_num2
1  1.549346 -1.005206 -0.5047202     -0.4949989    0.4949989 -0.69301258 -0.81979728
2  1.549346 -1.005206 -0.5047202     -0.4949989    0.4949989  1.01365298 -0.28720490
4  1.549346 -1.005206 -0.5047202     -0.4949989    0.4949989  1.35498610  1.22065403
5  1.549346 -1.005206 -0.5047202     -0.4949989    0.4949989  1.52565265  1.18607011
6  1.549346 -1.005206 -0.5047202     -0.4949989    0.4949989 -1.54634537 -0.09353495
7  1.549346 -1.005206 -0.5047202     -0.4949989    0.4949989 -0.09226631  0.94398267
  outcome.neg outcome.pos
1   -1.526418    1.526418
2   -1.526418    1.526418
4   -1.526418    1.526418
5   -1.526418    1.526418
6   -1.526418    1.526418
7   -1.526418    1.526418
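
A note on the scaled values above: as far as I know, preProcess() with method = "knnImpute" centers and scales every column it is given, which would explain why the 0/1 dummy columns show up as two other constants. The following is a minimal base-R sketch of that effect, using a made-up indicator vector (x is hypothetical, not taken from the data above):

```r
# Hypothetical 0/1 dummy column: 300 ones and 700 zeros
x <- c(rep(1, 300), rep(0, 700))

# Center and scale by hand (z-score)
z <- (x - mean(x)) / sd(x)

# The column still takes only two distinct values, just not 0 and 1
sort(unique(z))

# Undoing the scaling recovers the original 0/1 coding
x_back <- round(z * sd(x) + mean(x))
```

The information in the column is unchanged by this transformation; only the coding differs. Values filled in by k-nearest-neighbour imputation are averages over neighbours, so imputed cells may fall between the two constants.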

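I would be grateful if someone could point out whether I am doing something wrong. Thanks! A link to an end-to-end tutorial would also be much appreciated.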
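
For reference, one possible arrangement of the steps asked about in questions 1) to 3) is sketched below. It is only a sketch under several assumptions: the data frame toy and all object names are hypothetical, the outcome is kept as a two-level factor instead of being dummy-coded, rows are assumed to have a non-missing outcome, and knnImpute is assumed to need the RANN package to be installed.

```r
library(caret)  # knnImpute is assumed to rely on the RANN package as well

set.seed(123)

# Hypothetical toy data, much smaller than the data frame in the question
toy <- data.frame(
  var_cat  = factor(sample(c("Male", "Female"), 200, replace = TRUE)),
  var_num1 = runif(200, 0, 100),
  var_num2 = runif(200, 0, 100),
  outcome  = factor(sample(c("pos", "neg"), 200, replace = TRUE))
)

# Inject missing values into two predictors (never both in the same row)
toy$var_num1[seq(5, 200, by = 10)] <- NA
toy$var_cat[seq(10, 200, by = 10)] <- NA

idx      <- createDataPartition(toy$outcome, p = 0.8, list = FALSE)
train_df <- toy[idx, ]
test_df  <- toy[-idx, ]

# Step 1: dummy-code the predictors only; the outcome sits on the
# left-hand side of the formula and is not part of the result
dv      <- dummyVars(outcome ~ ., data = train_df)
train_x <- as.data.frame(predict(dv, newdata = train_df))
test_x  <- as.data.frame(predict(dv, newdata = test_df))

# Step 2: estimate the imputation (which also centers and scales)
# on the training set, then apply the same fitted object to both sets
pp      <- preProcess(train_x, method = "knnImpute")
train_x <- predict(pp, newdata = train_x)
test_x  <- predict(pp, newdata = test_x)

# Step 3: keep the outcome as a factor with two levels
train_y <- train_df$outcome
test_y  <- test_df$outcome
```

With this layout, train_x and train_y could be handed to train() through its x and y arguments, and the test set is transformed with parameters estimated on the training set only. Whether dummy-coding before imputation is the best order for a given data set is a judgment call that this sketch does not settle.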

0 answers:

No answers