One-hot encoding a data frame in R

Time: 2016-06-21 13:16:45

Tags: r dataframe packages data-cleaning

Consider a data frame df1 similar to the one shown below:

ID EDUCATION   OCCUPATION      BINARY_VAR
1  Undergrad   Student              1
2  Grad        Business Owner       1
3  Undergrad   Unemployed           0
4  PhD         Other                1

You can create your own random df1 with the R code below:

ID <- c(1:4)
# Spell out `replace`; `rep` only works through partial argument matching
EDUCATION <- sample(c('Undergrad', 'Grad', 'PhD'), 4, replace = TRUE)
OCCUPATION <- sample(c('Student', 'Business Owner', 'Unemployed', 'Other'), 4, replace = FALSE)
BINARY_VAR <- sample(c(0, 1), 4, replace = TRUE)
df1 <- data.frame(ID, EDUCATION, OCCUPATION, BINARY_VAR)

# Convert to factor
df1[, names(df1)] <- lapply(df1[, names(df1)] , factor)

From this, I need to derive another data frame df2 that looks like this:

ID Undergrad Grad PhD Student Business Owner Unemployed Other BINARY_VAR
1      1      0    0     1           0           0        0       1
2      1      1    0     0           1           0        0       1
3      1      0    0     0           0           1        0       0
4      1      1    1     0           0           0        1       1

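You will have noticed that where the level is PhD, the other factor levels under EDUCATION are set to 1 as well, because EDUCATION records the highest education level of each ID. However, this is a secondary objective.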

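I can't seem to figure out a way to get a data frame in which each column holds the truth value for one factor level of the parent data frame. Is there a package in R that can help? Or perhaps a way to code it by hand?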

Can I use melt?

I read through previously asked question(s) that look similar, but they deal with frequencies of occurrence.

Edit:

Following Sumedh's suggestion, one way to do this is with dummyVars from the caret package:

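library(caret)

# Build one dummy column per factor level of every predictor (ID is the response)
dummies <- dummyVars(ID ~ ., data = df1)
df2 <- data.frame(predict(dummies, newdata = df1))
# Keep the first 7 columns: the EDUCATION and OCCUPATION dummies,
# assuming all 3 EDUCATION levels occur in the random sample
df2 <- df2[1:7]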
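
For comparison, here is a minimal base R sketch of the same plain one-hot encoding that needs no extra package. It is only an illustration: one_hot() is a hypothetical helper name, not part of any library, and the data is fixed to the rows of the example table instead of being sampled at random.

```r
# Fixed example data matching the table in the question (not random)
df1 <- data.frame(
  ID         = 1:4,
  EDUCATION  = factor(c("Undergrad", "Grad", "Undergrad", "PhD"),
                      levels = c("Undergrad", "Grad", "PhD")),
  OCCUPATION = factor(c("Student", "Business Owner", "Unemployed", "Other"),
                      levels = c("Student", "Business Owner", "Unemployed", "Other")),
  BINARY_VAR = c(1, 1, 0, 1)
)

# Hypothetical helper: returns a 0/1 matrix with one column per factor level
one_hot <- function(f) {
  m <- outer(as.character(f), levels(f), "==") + 0L  # logical -> integer
  colnames(m) <- levels(f)
  m
}

# check.names = FALSE keeps the space in "Business Owner"
df2 <- data.frame(ID = df1$ID,
                  one_hot(df1$EDUCATION),
                  one_hot(df1$OCCUPATION),
                  BINARY_VAR = df1$BINARY_VAR,
                  check.names = FALSE)
df2
```

This version does not apply the "PhD implies Grad and Undergrad" rule; each row has exactly one 1 among the EDUCATION columns and one among the OCCUPATION columns.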

1 Answer:

Answer 0: (score: 0)

tidyr and dplyr combined with the base table() function should work:

ID <- c(1:4)
EDUCATION <- c('Undergrad', 'Grad', 'PhD', 'Undergrad')
OCCUPATION <- c('Student', 'Business Owner', 'Unemployed', 'Other')
BINARY_VAR <- sample(c(0, 1), 4, replace = TRUE)
df1 <- data.frame(ID, EDUCATION, OCCUPATION, BINARY_VAR)

# Convert to factor
df1[, names(df1)] <- lapply(df1[, names(df1)] , factor)

library(dplyr)
library(tidyr)

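# One row per ID, one 0/1 column per EDUCATION level
edu <- as.data.frame(table(df1[, 1:2])) %>% spread(EDUCATION, Freq)

# A PhD implies the lower education levels as well
for (i in 1:nrow(edu))
  if (edu[i, ]$PhD == 1)
    edu[i, ]$Undergrad <- edu[i, ]$Grad <- 1

# Same table for OCCUPATION, then join the two on ID
truth_table <- merge(edu,
      as.data.frame(table(df1[, c(1, 3)])) %>% spread(OCCUPATION, Freq),
      by = "ID")

# Rows are in ID order, so they line up with df1
truth_table$BINARY_VAR <- df1$BINARY_VAR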
truth_table

ID Grad PhD Undergrad Business Owner Other Student Unemployed BINARY_VAR
1    0   0         1              0     0       1          0          1
2    1   0         0              1     0       0          0          1
3    1   1         1              0     0       0          1          0
4    0   0         1              0     1       0          0          1

Edit: added a for loop to update the education levels below PhD, inspired by Sumedh's earlier suggestion.
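
The loop above only fills in the lower levels for PhD rows, so a Grad row still shows Undergrad as 0, unlike the df2 requested in the question. A possible base R sketch of the full cumulative rule follows. It assumes the ordering Undergrad < Grad < PhD, which the question implies but never states outright, and the variable names (lv, rank_of, edu_cum) are illustrative only.

```r
# Assumed ordering of education levels, lowest first
lv  <- c("Undergrad", "Grad", "PhD")
edu <- c("Undergrad", "Grad", "PhD", "Undergrad")  # same data as in the answer

# Position of each person's highest level in the assumed ordering
rank_of <- match(edu, lv)

# Column j is 1 when the person's highest level is at or above level j
edu_cum <- outer(rank_of, seq_along(lv), ">=") + 0L
colnames(edu_cum) <- lv

data.frame(ID = 1:4, edu_cum)
```

With this encoding a Grad row gets Undergrad = 1 and Grad = 1, and a PhD row gets all three columns set to 1.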