Counting combinations across groups in R

Time: 2018-11-14 17:12:10

Tags: r dplyr tidyverse tidyr data-manipulation

My data is set up as follows:

df=data.frame(ID=c('A', 'A','A','B','B','C','C','C', 'C', 'C','D', 'E', 'E'),
                    drink_freq = c('Coffee Light', 'Water Heavy', 'Tea Medium',
                                   'Coffee Medium', 'Water Light', 
                                   'Espresso Light', 'Coffee Medium', 'Water Light', 'Soda Light', 'Tea Medium',
                                   'Coffee Heavy',
                                   'Coffee Medium', 'Soda Light'))

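What I would like to do is create some kind of contingency table that shows how often each combination of segments occurs among users. So, for example, Soda Light with Coffee Medium and Coffee Medium with Water Light would each be 2, while Coffee Light with Water Heavy would be 1.

I feel like this shouldn't be difficult, but I'm having trouble writing the code for it, because users can belong to a varying number of groups.

1 answer:

Answer 0: (score: 0)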

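Here is a tidyverse solution that builds every unique pair of drinks (each pair appears once, so the order within a pair is fixed) and counts how many users the two drinks have in common: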

df <- data.frame(ID = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'D', 'E', 'E'),
                 drink_freq = c('Coffee Light', 'Water Heavy', 'Tea Medium',
                                'Coffee Medium', 'Water Light',
                                'Espresso Light', 'Coffee Medium', 'Water Light', 'Soda Light', 'Tea Medium',
                                'Coffee Heavy',
                                'Coffee Medium', 'Soda Light'),
                 stringsAsFactors = FALSE)

library(tidyverse)

# One row per unordered pair of drinks; counts = number of IDs that have both
data.frame(t(combn(unique(df$drink_freq), 2)), stringsAsFactors = FALSE) %>%
  mutate(counts = map2_dbl(X1, X2, ~length(intersect(df$ID[df$drink_freq == .x],
                                                     df$ID[df$drink_freq == .y]))))

#                X1             X2 counts
# 1    Coffee Light    Water Heavy 1
# 2    Coffee Light     Tea Medium 1
# 3    Coffee Light  Coffee Medium 0
# 4    Coffee Light    Water Light 0
# 5    Coffee Light Espresso Light 0
# 6    Coffee Light     Soda Light 0
# 7    Coffee Light   Coffee Heavy 0
# 8     Water Heavy     Tea Medium 1
# 9     Water Heavy  Coffee Medium 0
# 10    Water Heavy    Water Light 0
# 11    Water Heavy Espresso Light 0
# 12    Water Heavy     Soda Light 0
# 13    Water Heavy   Coffee Heavy 0
# 14     Tea Medium  Coffee Medium 1
# 15     Tea Medium    Water Light 1
# 16     Tea Medium Espresso Light 1
# 17     Tea Medium     Soda Light 1
# 18     Tea Medium   Coffee Heavy 0
# 19  Coffee Medium    Water Light 2
# 20  Coffee Medium Espresso Light 1
# 21  Coffee Medium     Soda Light 2
# 22  Coffee Medium   Coffee Heavy 0
# 23    Water Light Espresso Light 1
# 24    Water Light     Soda Light 1
# 25    Water Light   Coffee Heavy 0
# 26 Espresso Light     Soda Light 1
# 27 Espresso Light   Coffee Heavy 0
# 28     Soda Light   Coffee Heavy 0

You can then reshape the output above into a contingency table.
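One way that reshaping step could look is sketched below. It is not part of the original answer: it uses base R only (`mapply` in place of `map2_dbl`, and `xtabs` for the cross-tabulation), and the names `drinks`, `pairs`, and `tab` are chosen here for illustration. Because each pair is listed only once, only one triangle of the resulting table is filled.

```r
# Same data as in the question
df <- data.frame(ID = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'D', 'E', 'E'),
                 drink_freq = c('Coffee Light', 'Water Heavy', 'Tea Medium',
                                'Coffee Medium', 'Water Light',
                                'Espresso Light', 'Coffee Medium', 'Water Light',
                                'Soda Light', 'Tea Medium',
                                'Coffee Heavy',
                                'Coffee Medium', 'Soda Light'),
                 stringsAsFactors = FALSE)

# One row per unordered pair of drinks, with the number of shared IDs
drinks <- unique(df$drink_freq)
pairs <- data.frame(t(combn(drinks, 2)), stringsAsFactors = FALSE)
pairs$counts <- mapply(
  function(x, y) length(intersect(df$ID[df$drink_freq == x],
                                  df$ID[df$drink_freq == y])),
  pairs$X1, pairs$X2, USE.NAMES = FALSE
)

# Use the same level order on both axes so rows and columns line up
pairs$X1 <- factor(pairs$X1, levels = drinks)
pairs$X2 <- factor(pairs$X2, levels = drinks)

# Cross-tabulate; cells for pairs that are not listed stay 0
tab <- xtabs(counts ~ X1 + X2, data = pairs)
tab
```

A single cell can then be looked up by name, for example `tab["Coffee Medium", "Water Light"]`.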

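Note that if you want to reshape the result into a symmetric table, you need to modify the code above so that it generates every ordered pair, with each pair of drinks appearing in both orders, like this: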

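# Every ordered pair of drinks, so (a, b) and (b, a) both appear
expand.grid(X1 = unique(df$drink_freq),
            X2 = unique(df$drink_freq), stringsAsFactors = FALSE) %>%
  mutate(counts = map2_dbl(X1, X2, ~length(intersect(df$ID[df$drink_freq == .x],
                                                     df$ID[df$drink_freq == .y])))) %>%
  # Drop the pairs of a drink with itself
  filter(X1 != X2)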
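As an alternative that is not part of the original answer, the symmetric table can probably be obtained directly in base R by taking the cross-product of an ID-by-drink incidence matrix. The sketch below assumes that each ID should count at most once per drink; the names `inc` and `co` are illustrative.

```r
# Same data as in the question
df <- data.frame(ID = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'D', 'E', 'E'),
                 drink_freq = c('Coffee Light', 'Water Heavy', 'Tea Medium',
                                'Coffee Medium', 'Water Light',
                                'Espresso Light', 'Coffee Medium', 'Water Light',
                                'Soda Light', 'Tea Medium',
                                'Coffee Heavy',
                                'Coffee Medium', 'Soda Light'),
                 stringsAsFactors = FALSE)

# ID x drink incidence matrix: 1 if the ID has that drink, 0 otherwise
inc <- unclass(table(df$ID, df$drink_freq))
inc[inc > 1] <- 1  # assumption: a duplicated ID/drink row still counts once

# t(inc) %*% inc: cell [a, b] is the number of IDs that have both a and b
co <- crossprod(inc)
co
```

Here the off-diagonal cells hold the pair counts, and the diagonal holds the number of users per drink. If only pairs are of interest, the diagonal can be blanked with `diag(co) <- 0`.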