Convert a factor to binary dummy variables when not all factor levels are present

Date: 2019-01-21 14:09:55

Tags: r one-hot-encoding

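I have many data frames, each containing a factor that I want to expand into a set of binary dummy variables (one-hot encoding). However, not every possible level is present in each data frame, although I do know what all the possible levels are (there are 70 of them). I want to add all possible binary dummy variables to every data frame.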

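With the code below I can create dummies in each data frame, but only for the levels that actually occur, not for all possible levels. For example, set1.df has nobody in category "E" or "F", and set2.df has nobody in category "D". What I need is for set1.df to get the extra columns set1.dfE and set1.dfF, all zeros, and for set2.df to get the column set2.dfD, all zeros. I cannot rbind set1.df and set2.df before creating the dummy variables, because I need to do some processing on each data frame with the binary variables before binding. To reiterate, I know in advance which levels can occur in the data, e.g. "A" through "F".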

library(dummies)

person_id <- c(1,2,3,4,5,6,7,8,9,10)
person_cat <- c("A","B","C","A","B","C","D","A","A","A")
set1.df <- data.frame(person_id,person_cat)

person_id <- c(11,12,13,14,15,16,17,18,19,20)
person_cat <- c("A","B","C","A","B","C","E","E","F","A")
set2.df <- data.frame(person_id,person_cat)

dummies1 <- dummy(set1.df[,2])
dummies2 <- dummy(set2.df[,2])

dummies1
dummies2

The expected output is:

> dummies1
      set1.dfA set1.dfB set1.dfC set1.dfD set1.dfE set1.dfF
 [1,]        1        0        0        0        0        0
 [2,]        0        1        0        0        0        0
 [3,]        0        0        1        0        0        0
 [4,]        1        0        0        0        0        0
 [5,]        0        1        0        0        0        0
 [6,]        0        0        1        0        0        0
 [7,]        0        0        0        1        0        0
 [8,]        1        0        0        0        0        0
 [9,]        1        0        0        0        0        0
[10,]        1        0        0        0        0        0
> dummies2
      set2.dfA set2.dfB set2.dfC set2.dfD set2.dfE set2.dfF
 [1,]        1        0        0        0        0        0
 [2,]        0        1        0        0        0        0
 [3,]        0        0        1        0        0        0
 [4,]        1        0        0        0        0        0
 [5,]        0        1        0        0        0        0
 [6,]        0        0        1        0        0        0
 [7,]        0        0        0        0        1        0
 [8,]        0        0        0        0        1        0
 [9,]        0        0        0        0        0        1
[10,]        1        0        0        0        0        0

2 Answers:

Answer 0: (score: 0)

Here is one solution:

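# All levels known in advance, including those absent from set1.df
levels <- c('A', 'B', 'C', 'D', 'E', 'F')

# One row per person in set1.df, one column per possible level
data <- data.frame(matrix(NA, nrow = nrow(set1.df), ncol = length(levels)))
names(data) <- levels
for (i in seq_len(nrow(data))) {
  for (j in seq_along(data)) {
    # 1 if this person's category matches the column's level, else 0
    data[i, j] <- ifelse(set1.df[i, 2] == names(data)[j], 1, 0)
  }
}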

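Create an empty data frame with as many rows as there are IDs and as many columns as there are possible levels (not only the levels that occur in set1.df). Then loop over the cells and compare person_cat with each column name: a cell is 1 only when person_cat equals the column name (the category level), and 0 otherwise.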
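
The same comparison can also be written without explicit loops. The following is only a sketch under stated assumptions: `one_hot` is a hypothetical helper name (not part of the dummies package or of this answer), and the level set "A" through "F" is taken from the question.

```r
# Hypothetical helper: one-hot encode x against a fixed, known set of levels.
one_hot <- function(x, all_levels) {
  # outer() compares every value with every level, giving a logical matrix
  # with one row per value and one column per level; * 1L turns it into 0/1.
  m <- outer(as.character(x), all_levels, "==") * 1L
  colnames(m) <- all_levels
  m
}

# Levels assumed known in advance, as in the question
all_levels <- c("A", "B", "C", "D", "E", "F")
person_cat <- c("A", "B", "C", "A", "B", "C", "D", "A", "A", "A")

dummies1 <- one_hot(person_cat, all_levels)
dummies1
```

Because the columns come from `all_levels` rather than from the data, levels that never occur (here "E" and "F") still get a column of zeros. With 70 levels and many data frames, this is likely to be faster than filling the cells one by one.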

Answer 1: (score: 0)

library(dummies)

person_id <- c(1,2,3,4,5,6,7,8,9,10)
person_cat <- c("A","B","C","A","B","C","D","A","A","A")
person_cat <- factor(person_cat,levels=c("A","B","C","D","E","F"))
set1.df <- data.frame(person_id,person_cat)

person_id <- c(11,12,13,14,15,16,17,18,19,20)
person_cat <- c("A","B","C","A","B","C","E","E","F","A")
person_cat <- factor(person_cat,levels=c("A","B","C","D","E","F"))
set2.df <- data.frame(person_id,person_cat)

dummies1 <- dummy(set1.df[,2],drop=FALSE)
dummies2 <- dummy(set2.df[,2],drop=FALSE)

dummies1
dummies2
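
The idea in this answer is to declare the full level set when building the factor, so that the encoding step knows about levels that do not occur in the data; `drop=FALSE` then keeps the unused levels as all-zero columns. As a sketch of the same idea in base R only, `model.matrix()` is swapped in for `dummy()` below. This is an alternative to the answer's approach, not what it uses, and the resulting column names (`person_catA`, ...) differ from those in the expected output above.

```r
# Levels assumed known in advance, as in the question
all_levels <- c("A", "B", "C", "D", "E", "F")

person_id  <- 11:20
person_cat <- factor(c("A", "B", "C", "A", "B", "C", "E", "E", "F", "A"),
                     levels = all_levels)
set2.df <- data.frame(person_id, person_cat)

# "- 1" removes the intercept so that every level gets its own column;
# unused levels (here "D") are kept as all-zero columns.
dummies2 <- model.matrix(~ person_cat - 1, data = set2.df)
dummies2
```

Setting the levels once, at the point where each factor is created, means every data frame yields the same set of columns regardless of which categories happen to be present.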