Joining two datasets on a comma-separated column

Time: 2018-02-22 19:12:00

Tags: r dplyr

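I have a large dataset that I code with both human-readable and machine-readable identifiers. I'd like to type in only the human-readable codes and use a merge in R to add the machine-readable ones. The only catch is that I sometimes put multiple identifiers in one column, separated by commas. It looks a bit like this: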

df <- as.data.frame(cbind(identifier=c("a","a, b","b","b, c","c"), data=c(1,2,3,4,5)))

codebook <- as.data.frame(cbind(id=c("a","b", "c","d"),code=c('9999','8888','7777','6666')))

The result I'd like to get is the following:

 answer <- as.data.frame(cbind(identifier=c("a","a, b","b","b, c","c"), code=c('9999', '9999, 8888', '8888', '8888, 7777', '7777'), data=c(1,2,3,4,5)))

I have tried tidyr's separate() and unite() together with dplyr, but I wonder whether there is an easier way.

2 answers:

Answer 0 (score: 0)

This doesn't give your exact output, but it may be easier to work with (it is more "tidy", if you like Wickham's phrasing):

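library(dplyr)
library(reshape2)  # provides melt()

df %>%
  # split "a, b" on ", " into its first and second identifier
  mutate(new_1 = gsub("(.*), (.*)", "\\1", identifier),
         new_2 = gsub("(.*), (.*)", "\\2", identifier)) %>%
  # a single identifier is returned unchanged in both columns, so drop the duplicate
  mutate(new_2 = ifelse(new_1 == new_2, NA, new_2)) %>%
  select(data, new_1, new_2) %>%
  melt("data") %>%
  inner_join(codebook, by = c("value" = "id"))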

#   data variable value code
# 1    1    new_1     a 9999
# 2    2    new_1     a 9999
# 3    3    new_1     b 8888
# 4    4    new_1     b 8888
# 5    5    new_1     c 7777
# 6    2    new_2     b 8888
# 7    4    new_2     c 7777
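
If the exact layout from the question is still needed, the long result can be collapsed back to one row per data value. The following is only a sketch: `long` and `wide` are hypothetical names, and `long` is rebuilt by hand here from the output shown above so that the example runs on its own.

```r
library(dplyr)

# `long` is a hypothetical name for the joined long-format result shown above
long <- data.frame(
  data  = c(1, 2, 3, 4, 5, 2, 4),
  value = c("a", "a", "b", "b", "c", "b", "c"),
  code  = c("9999", "9999", "8888", "8888", "7777", "8888", "7777"),
  stringsAsFactors = FALSE
)

# paste identifiers and codes back together, one row per data value
wide <- long %>%
  group_by(data) %>%
  summarise(identifier = toString(value), code = toString(code)) %>%
  select(identifier, code, data)
```

Rows within each group keep their original order, so the identifiers and codes stay aligned after collapsing.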

Answer 1 (score: 0)

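Try separate_rows. First convert the factor columns to character. Then use separate_rows to split df into one row per identifier, join it to the codebook, and collapse the rows back together. Note that the result has character columns.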

library(dplyr)
library(tidyr)

df %>%
  mutate_all(as.character) %>%
  separate_rows(identifier) %>%
  left_join(codebook %>% mutate_all(as.character), by = c("identifier" = "id")) %>%
  group_by(data) %>%
  summarize(identifier = toString(identifier), code = toString(code)) %>%
  ungroup()

giving:

# A tibble: 5 x 3
  data  identifier code      
  <chr> <chr>      <chr>     
1 1     a          9999      
2 2     a, b       9999, 8888
3 3     b          8888      
4 4     b, c       8888, 7777
5 5     c          7777
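
For comparison, the same lookup can be sketched in base R without a join. This is not from either answer; it assumes the data frames are built with data.frame() and stringsAsFactors = FALSE (so data stays numeric, unlike the cbind version in the question), and `lookup` and `parts` are hypothetical helper names.

```r
# same data as in the question, built without cbind() so types are preserved
df <- data.frame(identifier = c("a", "a, b", "b", "b, c", "c"),
                 data = c(1, 2, 3, 4, 5),
                 stringsAsFactors = FALSE)
codebook <- data.frame(id = c("a", "b", "c", "d"),
                       code = c("9999", "8888", "7777", "6666"),
                       stringsAsFactors = FALSE)

# named vector used as a lookup table: id -> code
lookup <- setNames(codebook$code, codebook$id)

# split each identifier on ", ", look up every piece, and paste the codes back together
parts <- strsplit(df$identifier, ", ", fixed = TRUE)
df$code <- vapply(parts, function(ids) toString(lookup[ids]), character(1))

# reorder columns to match the layout requested in the question
df <- df[, c("identifier", "code", "data")]
```

One caveat with this sketch: an identifier that is missing from the codebook shows up as the string "NA" in code rather than raising an error.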