Find the unique values in a comma-separated character vector, then one-hot encode

Date: 2018-10-03 16:43:46

Tags: r split one-hot-encoding

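I have a character vector whose elements are comma-separated strings, and I am looking for a way to one-hot encode it using the unique values found in those strings. I believe I first have to find the unique values (by splitting on the commas) so that they can serve as the columns of the one-hot encoding, but I am not sure. For example, suppose I have the following character vector: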

people_names
Bob,Megan,Mike,Sarah
Mike,Sarah
Megan,Sarah
Bob

I would like to create a one-hot encoded data frame corresponding to this vector, like this:

Bob   Megan   Mike   Sarah
  1       1      1       1
  0       0      1       1
  0       1      0       1
  1       0      0       0

Thanks for your help. I really appreciate it.

2 answers:

Answer 0 (score: 2)

people_names = c("Bob,Megan,Mike,Sarah",
                 "Mike,Sarah",
                 "Megan,Sarah",
                 "Bob")

library(tidyverse)

data.frame(people_names) %>%                # create a dataframe
  mutate(id = row_number(),                 # add row id (useful for reshaping)
         value = 1) %>%                     # add a column of 1s to denote existence
  separate_rows(people_names) %>%           # create one row per name keeping relevant info
  spread(people_names, value, fill = 0) %>% # reshape
  select(-id)                               # remove row id

#   Bob Megan Mike Sarah
# 1   1     1    1     1
# 2   0     0    1     1
# 3   0     1    0     1
# 4   1     0    0     0
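
In current tidyr releases, spread() is superseded by pivot_wider(), so the same pipeline can also be sketched as below. This is a variant of the approach above, not part of the original answer. It assumes tidyr 1.1.0 or later (so that values_fill accepts a single value), and the result name one_hot is only illustrative.

```r
library(dplyr)
library(tidyr)

people_names <- c("Bob,Megan,Mike,Sarah",
                  "Mike,Sarah",
                  "Megan,Sarah",
                  "Bob")

one_hot <- tibble(people_names) %>%
  mutate(id = row_number(),                  # row id keeps the original rows apart
         value = 1) %>%                      # column of 1s to denote existence
  separate_rows(people_names, sep = ",") %>% # one row per name
  pivot_wider(names_from = people_names,     # one column per unique name
              values_from = value,
              values_fill = 0) %>%           # absent names become 0
  select(-id)                                # remove row id

one_hot
```

Note that pivot_wider() orders the new columns by first appearance rather than alphabetically; for this data the two orders happen to coincide.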

Answer 1 (score: 0)

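Alternatively, the splitstackshape package has a helper function that you might find useful. It is called with ::: here, the operator R uses to reach a package's internal objects. The output is a matrix.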

splitstackshape:::charMat(strsplit(people_names, ","), fill = 0L)
#     Bob Megan Mike Sarah
#[1,]   1     1    1     1
#[2,]   0     0    1     1
#[3,]   0     1    0     1
#[4,]   1     0    0     0

In the same package, you can also try cSplit_e, which returns a data frame:

library(splitstackshape)
out <- cSplit_e(
  data.frame(people_names),
  split.col = "people_names",
  sep = ",",
  mode = "binary",
  type = "character",
  fill = 0L,
  drop = TRUE
)
# remove prefix of column names
(out <- setNames(out, sub("people_names_", "", names(out), fixed = TRUE))) 

Data

people_names = c("Bob,Megan,Mike,Sarah",
                 "Mike,Sarah",
                 "Megan,Sarah",
                 "Bob")
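
For comparison, the idea described in the question (split on commas, collect the unique values, then use them as columns) can also be sketched in base R without any packages. This is a minimal sketch and not taken from either answer. The names name_list, all_names and one_hot are illustrative, and it assumes the vector contains no NA elements.

```r
people_names <- c("Bob,Megan,Mike,Sarah",
                  "Mike,Sarah",
                  "Megan,Sarah",
                  "Bob")

# Split each element on commas; the result is a list of character vectors
name_list <- strsplit(people_names, ",", fixed = TRUE)

# Unique values across all elements; these become the column names
all_names <- sort(unique(unlist(name_list)))

# Start from an all-zero integer matrix, one row per input element
one_hot <- matrix(0L,
                  nrow = length(name_list),
                  ncol = length(all_names),
                  dimnames = list(NULL, all_names))

# Set a 1 in every column whose name occurs in the corresponding element
for (i in seq_along(name_list)) {
  one_hot[i, name_list[[i]]] <- 1L
}

# Convert to a data frame if that is the preferred result type
one_hot_df <- as.data.frame(one_hot)
one_hot_df
```

If the names may carry surrounding whitespace (for example "Bob, Megan"), wrapping the split result in trimws() before taking unique values would be one way to handle it.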