Suppose I have the following data:
> summary_table[, c('condition_list', 'condition_count')]
# A tibble: 4,306 x 2
condition_list condition_count
<chr> <int>
1 true control,control email 2
2 true control,control email 1
3 treatment, control email 1
4 true control, control email 1
5 control email, true control 1
6 control email 1
7 control email, treatment 1
8 control email,true control 2
9 treatment 1
10 control email, true control 1
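Note that the `condition_list` column consists of comma-delimited strings indicating assignment to certain conditions, but some of these assignments are isomorphic: they list the same conditions in a different order. I want to count the number of rows for each condition: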
summary_table %>%
  group_by(condition_list) %>%
  summarize(n = n())
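However, this treats each specific ordering in `condition_list` as a separate group. I want it to treat "control email, true control" the same as "true control, control email". What is the best way to do this?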
> dput(dputter)
structure(list(condition_list = c("true control,control email",
"true control", "treatment", "true control", "control email",
"control email", "control email", "control email,true control",
"treatment", "control email", "true control,treatment", "treatment,true control",
"treatment,true control,control email", "control email", "treatment",
"true control,control email", "control email", "treatment", "true control,treatment",
"control email", "control email,true control", "treatment", "control email",
"control email", "control email,true control", "control email",
"control email", "true control", "treatment", "true control",
"treatment", "true control", "true control", "control email",
"true control", "control email", "control email", "true control",
"treatment", "treatment,true control,control email", "true control",
"true control", "treatment,control email", "true control", "true control",
"control email", "control email", "treatment", "control email",
"true control"), condition_count = c(2L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 1L, 1L, 2L, 2L, 3L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L,
1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 3L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -50L))
Answer 0 (score: 2)
Here is a tidyverse solution:
library(tidyverse)
summary_table %>%
  mutate(condition_list =
    strsplit(condition_list, ",") %>%
      map(sort) %>%
      map_chr(paste, collapse = ",")
  ) %>%
  group_by(condition_list) %>%
  tally()
# A tibble: 7 x 2
# condition_list n
# <chr> <int>
#1 control email 17
#2 control email,treatment 1
#3 control email,treatment,true control 2
#4 control email,true control 5
#5 treatment 9
#6 treatment,true control 3
#7 true control 13
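The printed sample in the question shows some entries with a space after the comma (for example "treatment, control email"), while the `dput` data has none. If such spaces are present, sorting the raw pieces would leave " control email" and "control email" as different labels. The sketch below is a hedged variant of the same split, sort, and paste idea that also trims whitespace; `normalize_conditions` and the small `example` tibble are hypothetical names introduced here for illustration, not part of the original answer.

```r
library(dplyr)

# Hypothetical helper: turn each comma-delimited string into a canonical form
# by splitting on ",", trimming whitespace, sorting, and pasting back together.
normalize_conditions <- function(x) {
  vapply(
    strsplit(x, ",", fixed = TRUE),
    function(parts) paste(sort(trimws(parts)), collapse = ","),
    character(1)
  )
}

# Small made-up example with inconsistent ordering and spacing.
example <- tibble(
  condition_list = c(
    "true control,control email",
    "control email, true control",
    "treatment"
  )
)

normalized_counts <- example %>%
  mutate(condition_list = normalize_conditions(condition_list)) %>%
  count(condition_list)

normalized_counts
```

Here `count(condition_list)` is shorthand for `group_by(condition_list) %>% tally()`; both orderings of the two-condition assignment should end up in a single group.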
Answer 1 (score: 1)
Do you mean something like this?
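library(tidyverse)
dputter %>%
  # split each string into a list column of individual conditions
  mutate(condition_list = str_split(condition_list, ",")) %>%
  # one row per individual condition (name the column explicitly)
  unnest(condition_list) %>%
  group_by(condition_list) %>%
  tally()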
## A tibble: 3 x 2
# condition_list n
# <chr> <int>
#1 control email 25
#2 treatment 15
#3 true control 23
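Explanation: we use `str_split` (or `strsplit` in base R) to split the entries on `","`, which produces a `list` column; we then `unnest` that column so each condition gets its own row, and finally summarise.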
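As a possible shortcut for the split-then-unnest step, tidyr's `separate_rows()` does both in one call. The following is a minimal sketch on a made-up `example` tibble (a hypothetical name, not from the thread); the separator `",\\s*"` is an assumption that also absorbs any spaces after the comma.

```r
library(dplyr)
library(tidyr)

# Small made-up example; the real data would be `dputter` or `summary_table`.
example <- tibble(
  condition_list = c(
    "true control,control email",
    "control email, true control",
    "treatment"
  )
)

per_condition <- example %>%
  # split on commas (plus optional trailing spaces), one row per condition
  separate_rows(condition_list, sep = ",\\s*") %>%
  # count how many rows mention each individual condition
  count(condition_list)

per_condition
```

Note that this counts individual conditions, as in this answer, rather than order-insensitive combinations of conditions.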