Group by a column of comma-separated strings, ignoring the order of the items within each string

Asked: 2019-03-04 23:03:00

Tags: r dplyr

Suppose I have the following data:

> summary_table[, c('condition_list', 'condition_count')]
# A tibble: 4,306 x 2
   condition_list             condition_count
   <chr>                                <int>
 1 true control,control email               2
 2 true control,control email               1
 3 treatment, control email                 1
 4 true control, control email              1
 5 control email, true control              1
 6 control email                            1
 7 control email, treatment                 1
 8 control email,true control               2
 9 treatment                                1
10 control email, true control              1

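Note that the condition_list column consists of comma-delimited strings indicating assignment to certain conditions, but some of these assignments are equivalent: they list the same conditions in a different order. I want to count the number of rows for each combination of conditions: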

summary_table %>% group_by(condition_list) %>%
  summarize(n= n())

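However, this treats each specific ordering in condition_list as a separate group. I want "control email,true control" to be treated the same as "true control,control email". What is the best way to do this?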

> dput(dputter)
structure(list(condition_list = c("true control,control email", 
"true control", "treatment", "true control", "control email", 
"control email", "control email", "control email,true control", 
"treatment", "control email", "true control,treatment", "treatment,true control", 
"treatment,true control,control email", "control email", "treatment", 
"true control,control email", "control email", "treatment", "true control,treatment", 
"control email", "control email,true control", "treatment", "control email", 
"control email", "control email,true control", "control email", 
"control email", "true control", "treatment", "true control", 
"treatment", "true control", "true control", "control email", 
"true control", "control email", "control email", "true control", 
"treatment", "treatment,true control,control email", "true control", 
"true control", "treatment,control email", "true control", "true control", 
"control email", "control email", "treatment", "control email", 
"true control"), condition_count = c(2L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 1L, 1L, 2L, 2L, 3L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 
1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 3L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -50L))

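2 answers:

Answer 0 (score: 2)

Here is a tidyverse solution. It splits each string on the comma, sorts the pieces, and pastes them back together, so that every ordering of the same conditions maps to the same key: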

library(tidyverse)

summary_table %>% 
  mutate(condition_list = 
           strsplit(condition_list, ",") %>% 
           map(sort) %>% 
           map_chr(paste, collapse = ",")
         ) %>%
  group_by(condition_list) %>% 
  tally()
# A tibble: 7 x 2
#  condition_list                           n
#  <chr>                                <int>
#1 control email                           17
#2 control email,treatment                  1
#3 control email,treatment,true control     2
#4 control email,true control               5
#5 treatment                                9
#6 treatment,true control                   3
#7 true control                            13
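
The printed sample in the question shows some entries with a space after the comma (for example "treatment, control email"), while the dput data has none. If such entries exist in the full table, splitting on "," alone would leave a leading space on some pieces and split what should be one group. The following is a minimal base R sketch of the same split-sort-paste idea that also tolerates whitespace around the commas. The helper name normalize_conditions is hypothetical and is not part of the original answer.

```r
# Hypothetical helper: build an order-independent key from a
# comma-separated string, tolerating spaces around the commas.
normalize_conditions <- function(x) {
  # Trim outer whitespace, then split on commas with optional surrounding spaces
  parts <- strsplit(trimws(x), "\\s*,\\s*")
  # Sort the pieces of each entry and glue them back together
  vapply(parts, function(p) paste(sort(p), collapse = ","), character(1))
}

# Small illustrative input (not the full summary_table)
x <- c("true control, control email",
       "control email,true control",
       "treatment")

keys <- normalize_conditions(x)
print(table(keys))
```

The resulting keys could be assigned back to condition_list inside mutate() before group_by() and tally(). NA entries are not handled specially in this sketch.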

Answer 1 (score: 1)

Do you mean something like this?

dputter %>%
    mutate(condition_list = str_split(condition_list, ",")) %>%
    unnest(condition_list) %>%  # name the column explicitly; a bare unnest() is deprecated in tidyr >= 1.0
    group_by(condition_list) %>%
    tally()
## A tibble: 3 x 2
#  condition_list     n
#  <chr>          <int>
#1 control email     25
#2 treatment         15
#3 true control      23

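Explanation: we use str_split (or strsplit in base R) to split the entries on ",", which produces a list column; we then unnest it and tally. Note that this counts how often each individual condition occurs, not how often each combination of conditions occurs.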
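
As a rough illustration of the base R route mentioned in the explanation, the same per-condition count can be sketched with strsplit, unlist, and table. The input vector below is a small made-up subset chosen for illustration, not the full data.

```r
# Small illustrative input (an assumption, not the full dputter data)
x <- c("true control,control email",
       "treatment",
       "control email,true control",
       "treatment,true control,control email")

# Split each entry on the comma, flatten the list, and count each condition
counts <- table(unlist(strsplit(x, ",")))
print(counts)
```

Because the entries are flattened before counting, a row that lists three conditions contributes three counts, which is why the totals exceed the number of rows.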