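I have a large dataset that I code with both human-readable and machine-readable identifiers. I want to enter only the human-readable codes and use a merge in R to add the machine-readable ones. The only catch is that I put multiple identifiers in a single column, separated by commas. It looks something like this: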
df <- as.data.frame(cbind(identifier=c("a","a, b","b","b, c","c"), data=c(1,2,3,4,5)))
codebook <- as.data.frame(cbind(id=c("a","b", "c","d"),code=c('9999','8888','7777','6666')))
The result I'd like to get looks like this:
answer <- as.data.frame(cbind(identifier=c("a","a, b","b","b, c","c"), code=c('9999', '9999, 8888', '8888', '8888, 7777', '7777'), data=c(1,2,3,4,5)))
I've tried separate() and unite() (from tidyr, used alongside dplyr), but I'm wondering whether there is a simpler way.
Answer 0 (score: 0)
This doesn't give your exact output, but it may be easier to work with (it's "tidier", if you like Wickham's phrasing):
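library(dplyr)
library(reshape2)  # provides melt()

df %>%
  # split "a, b" on the comma (plus optional spaces) into first and second identifier
  mutate(new_1 = gsub("(.*), *(.*)", "\\1", identifier),
         new_2 = gsub("(.*), *(.*)", "\\2", identifier)) %>%
  # rows with a single identifier are left unchanged by gsub, so there is no second one
  mutate(new_2 = ifelse(new_1 == new_2, NA, new_2)) %>%
  select(data, new_1, new_2) %>%
  melt("data") %>%
  inner_join(codebook, by = c("value" = "id"))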
#   data variable value code
# 1    1    new_1     a 9999
# 2    2    new_1     a 9999
# 3    3    new_1     b 8888
# 4    4    new_1     b 8888
# 5    5    new_1     c 7777
# 6    2    new_2     b 8888
# 7    4    new_2     c 7777
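If the comma-separated layout from the question is still needed, the long table this answer produces can be collapsed again, one row per `data` value. The following is only a minimal base R sketch: the data frame name `long` is an assumption (it is typed in by hand here to mirror the printed result), and `aggregate()` with `toString` is just one way to do the grouping.

```r
# Hand-typed copy of the long result printed by this answer.
# The name `long` is hypothetical; in practice it would be the pipeline's result.
long <- data.frame(
  data     = c(1, 2, 3, 4, 5, 2, 4),
  variable = c("new_1", "new_1", "new_1", "new_1", "new_1", "new_2", "new_2"),
  value    = c("a", "a", "b", "b", "c", "b", "c"),
  code     = c("9999", "9999", "8888", "8888", "7777", "8888", "7777"),
  stringsAsFactors = FALSE
)

# Group by `data` and paste each group's identifiers and codes back together.
# Rows keep their original order within each group, so new_1 comes before new_2.
collapsed <- aggregate(long[c("value", "code")],
                       by = long["data"],
                       FUN = toString)

# Rename `value` to match the column name used in the question.
names(collapsed)[names(collapsed) == "value"] <- "identifier"
collapsed
```

This keeps the tidy long form for analysis and only rebuilds the comma-separated strings when they are needed for display or export.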
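Answer 1 (score: 0)
Try separate_rows. First convert the factor columns to character. Then use separate_rows to unnest df, join it with the codebook, and collapse it back. Note that the result has character columns.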
library(dplyr)
library(tidyr)
df %>%
  mutate_all(as.character) %>%
  separate_rows(identifier) %>%
  left_join(codebook %>% mutate_all(as.character), by = c("identifier" = "id")) %>%
  group_by(data) %>%
  summarize(identifier = toString(identifier), code = toString(code)) %>%
  ungroup()
which gives:
# A tibble: 5 x 3
  data  identifier code
  <chr> <chr>      <chr>
1 1     a          9999
2 2     a, b       9999, 8888
3 3     b          8888
4 4     b, c       8888, 7777
5 5     c          7777
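For comparison, the same split, look up, and re-join idea can be sketched in base R without any packages. This is only a sketch under some assumptions: the sample data is rebuilt with data.frame() so that `data` stays numeric, `lookup` and `translate_ids` are hypothetical names introduced here, and identifiers are assumed to be separated by commas with optional surrounding spaces.

```r
# Sample data from the question, rebuilt with data.frame() (an assumption;
# the question uses as.data.frame(cbind(...)), which turns `data` into text).
df <- data.frame(
  identifier = c("a", "a, b", "b", "b, c", "c"),
  data = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)
codebook <- data.frame(
  id = c("a", "b", "c", "d"),
  code = c("9999", "8888", "7777", "6666"),
  stringsAsFactors = FALSE
)

# Named character vector used as a lookup table: id -> code.
lookup <- setNames(codebook$code, codebook$id)

# Hypothetical helper: split one identifier string on commas, trim spaces,
# look up each piece, and paste the codes back together with ", ".
# An identifier missing from the codebook shows up as the text "NA".
translate_ids <- function(x) {
  ids <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  toString(unname(lookup[ids]))
}

# Apply the helper to every row; vapply guarantees one string per row.
df$code <- vapply(df$identifier, translate_ids, character(1), USE.NAMES = FALSE)
df
```

Unlike the pipeline above, this leaves the original row order and column types untouched, at the cost of a row-by-row loop that may be slower on a very large dataset.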