My data frame looks like this:
> dput(data)
structure(list(Comments = c("This is good", "What is the price", "You are no good", "help the needy", "What a beautiful day", "how can I help you", "You are my best friend", "she is my friend", "which one is the best", "How can she do that"
), ID = c("A1", "B2", "A1", "C3", "D4", "C3", "E5", "E5", "E5",
"E5")), class = "data.frame", row.names = c(NA, 10L))
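For each unique ID, I want to get all the words that are common to the comments in that group.
Following a suggestion, I tried the code below: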
check <- aggregate(Comments ~ ID, demo, function(x){
temp = table(unlist(lapply(strsplit(x, ","), unique)))
temp = names(temp)[which(temp == max(temp) & temp > 1)]
if (length(temp) == 0) temp = ""
temp
})
This returns the unique IDs, but the common words come back as empty rows. I also tried:
demo %>%
mutate(Words = strsplit(Comments, " ")) %>%
unnest %>%
intersect(Comments) %>%
group_by(ID, Comments) %>%
summarise(Words = toString(Comments))
This gives me an error.
My expected output is:
ID Comments
A1 "good"
B2 ""
C3 "help"
D4 ""
E5 "best, friend, she, is, my"
Thanks in advance!
Answer 0 (score: 0)
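Using dplyr, we can create a column with row_number() to find the common words within each ID. We use tidyr::separate_rows to split the words into separate rows, filter to keep only the words that occur in more than one row of Comments, then group_by ID and collapse them into a comma-separated string.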
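library(dplyr)
data %>%
  # remember which comment each word came from; a factor ID lets
  # .drop = FALSE keep IDs that end up with no common words
  mutate(row = row_number(),
         ID = factor(ID)) %>%
  # one row per word
  tidyr::separate_rows(Comments, sep = "\\s+") %>%
  group_by(ID, Comments) %>%
  # keep words that occur in more than one comment of the same ID
  filter(n_distinct(row) > 1) %>%
  group_by(ID, .drop = FALSE) %>%
  summarise(Comments = toString(unique(Comments)))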
# ID Comments
# <fct> <chr>
#1 A1 good
#2 B2 ""
#3 C3 help
#4 D4 ""
#5 E5 my, best, friend, she, is
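If you would rather stay in base R, the same idea (split each comment into words, count in how many comments of an ID each word occurs, keep those seen more than once) can be sketched with aggregate(). This is only a minimal sketch: common_words() is a hypothetical helper written for this example, not a function from any package, and the data frame is rebuilt from the dput() in the question.

```r
# Same sample data as in the question
data <- data.frame(
  Comments = c("This is good", "What is the price", "You are no good",
               "help the needy", "What a beautiful day", "how can I help you",
               "You are my best friend", "she is my friend",
               "which one is the best", "How can she do that"),
  ID = c("A1", "B2", "A1", "C3", "D4", "C3", "E5", "E5", "E5", "E5"),
  stringsAsFactors = FALSE
)

# common_words() is a hypothetical helper, not part of any package
common_words <- function(comments) {
  # split each comment on whitespace and keep each word once per comment
  words <- lapply(strsplit(comments, "\\s+"), unique)
  # count in how many comments each word occurs
  counts <- table(unlist(words))
  # keep words seen in more than one comment, collapsed into one string;
  # toString() of an empty vector gives ""
  toString(names(counts)[as.vector(counts) > 1])
}

result <- aggregate(Comments ~ ID, data = data, FUN = common_words)
result
```

Because table() sorts its names, the words for each ID come out in alphabetical order rather than in order of first appearance.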
Answer 1 (score: 0)
With dplyr and tidyr (both loaded by tidyverse), we can do:
library(tidyverse)
data %>%
  # one row per word
  separate_rows(Comments) %>%
  count(Comments, ID) %>%
  # keep words that occur more than once within an ID
  filter(n > 1) %>%
  select(-n) %>%
  # add back the IDs that have no common words
  complete(ID = unique(data$ID), fill = list(Comments = "")) %>%
  group_by(ID) %>%
  summarise(Comments = toString(Comments))
# A tibble: 5 x 2
# ID Comments
# <chr> <chr>
#1 A1 good
#2 B2 ""
#3 C3 help
#4 D4 ""
#5 E5 best, friend, is, my, she
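Note that the count()-based pipeline above compares words exactly as typed and counts rows, so "How" and "how" are treated as different words, and a word repeated inside a single comment would be counted twice. The question does not say whether that matters, so the following is only a sketch under the assumption that matching should be case-insensitive and that each word should count once per comment. The row column and the result_ci name are introduced here purely for illustration.

```r
library(dplyr)
library(tidyr)

# Same sample data as in the question
data <- data.frame(
  Comments = c("This is good", "What is the price", "You are no good",
               "help the needy", "What a beautiful day", "how can I help you",
               "You are my best friend", "she is my friend",
               "which one is the best", "How can she do that"),
  ID = c("A1", "B2", "A1", "C3", "D4", "C3", "E5", "E5", "E5", "E5"),
  stringsAsFactors = FALSE
)

result_ci <- data %>%
  # assumption: matching should ignore case
  mutate(row = row_number(), Comments = tolower(Comments)) %>%
  separate_rows(Comments, sep = "\\s+") %>%
  # count each word at most once per comment
  distinct(ID, row, Comments) %>%
  count(ID, Comments) %>%
  filter(n > 1) %>%
  group_by(ID) %>%
  summarise(Comments = toString(Comments)) %>%
  # bring back the IDs that have no common words
  right_join(distinct(data, ID), by = "ID") %>%
  mutate(Comments = coalesce(Comments, "")) %>%
  arrange(ID)

result_ci
```

On the sample data this yields the same words as above, since none of the repeated words differ only by case.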