Counting combinations of a set of elements that share the same value

Time: 2017-08-08 20:59:04

Tags: r combinations

Sorry for my English; I need some help.

Using this dataset:

+--------+------------+---------+---------+----------+
| PEOPLE |    DATE    | EVENT_A | EVENT_B |  EVENT_C |
+--------+------------+---------+---------+----------+
| MIKE   | 04/08/2013 |       1 |       1 |        1 |
| PETE   | 10/08/2013 |       1 |       0 |        1 |
| PETE   | 25/08/2013 |       1 |       0 |        1 |
| PETE   | 15/09/2013 |       1 |       0 |        1 |
| MIKE   | 28/09/2013 |       1 |       1 |        1 |
| PETE   | 19/10/2013 |       1 |       1 |        1 |
| MIKE   | 30/10/2013 |       0 |       1 |        1 |
| MIKE   | 09/11/2013 |       1 |       1 |        1 |
+--------+------------+---------+---------+----------+

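Basically, I need to count, for each combination of n events, the number of rows in which all of those events have the value 1. I don't know which approach to take in R to achieve this. The output should look like this: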

+-------+-------+------------------------+---------+---------+--------+
| #MIKE | #PETE | #N EVENTS COMBINATIONS |         |         |        |
+-------+-------+------------------------+---------+---------+--------+
|     3 |     1 | COMBINATIONS WITH 2    | EVENT A | EVENT B |        |
|     2 |     4 | COMBINATIONS WITH 2    | EVENT A | EVENT C |        |
|     4 |     1 | COMBINATIONS WITH 2    | EVENT B | EVENT C |        |
|     3 |     2 | COMBINATIONS WITH 3    | EVENT A | EVENT B | EVENT C|
+-------+-------+------------------------+---------+---------+--------+

I need this for every person and for any number of unique events (columns).

Thanks in advance, Vince.

1 answer:

Answer 0: (score: 0)

One possibility is to use dplyr, pipes, and tidyr.

Given your data, I would solve your problem like this:

library(dplyr)  # for data manipulation and piping
library(tidyr)  # for data reshaping

# create the data (tibble() replaces the deprecated data_frame())
df <- tibble(
 people = c("Mike", "Pete", "Pete", "Pete", "Mike", "Pete", "Mike", "Mike"),
 event_a = c(rep(1, 6), 0, 1),
 event_b = c(1, 0, 0, 0, rep(1, 4)),
 event_c = c(rep(1, 8))
)

# create a dummy var for each event-combination
df2 <- df %>% 
 mutate(ab = event_a & event_b,
        ac = event_a & event_c,
        bc = event_b & event_c,
        abc = event_a & event_b & event_c)

# reshape data to the long format using tidyr::gather
df3 <- df2 %>% 
 # we don't need the original events anymore -> deselect them
 select(-contains("event")) %>% 
 # reshape from wide to long
 gather("var", "value", -people) %>%
 # keep only the positive matches
 filter(value)

df3 %>% 
 # for each combination ...
 group_by(var) %>% 
 # ... count the number of cases
 summarise(n_mike = sum(people == "Mike"),
           n_pete = sum(people == "Pete")) %>%
 # create the text-variable
 mutate(event_combs = sprintf("Combinations with %d", nchar(var))) %>% 
 # reorder to have it your format
 select(n_mike, n_pete, event_combs, var)
#> # A tibble: 4 x 4
#>   n_mike n_pete         event_combs   var
#>    <int>  <int>               <chr> <chr>
#> 1      3      1 Combinations with 2    ab
#> 2      3      1 Combinations with 3   abc
#> 3      3      4 Combinations with 2    ac
#> 4      4      1 Combinations with 2    bc

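Generalization

To generalize this to arbitrarily* many events (*up to 26, since we use letters; extending it further is left as an exercise...), we can use expand.grid() to generate all possible event combinations and then use apply to find the rows that match each combination.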
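To make the expand.grid() step concrete, here is a minimal base-R sketch for three events. The names `events`, `grid`, and `labels` are only illustrative and do not appear in the rest of the answer.

```r
# All on/off patterns for three events (the letters are illustrative labels).
events <- c("a", "b", "c")
grid <- expand.grid(rep(list(0:1), length(events)))

# The first row is the all-zero pattern (no event selected), so drop it.
grid <- grid[-1, ]

# Turn each remaining row into a label such as "a", "ab" or "abc".
labels <- apply(grid, 1, function(row) {
  paste(events[as.logical(row)], collapse = "")
})
labels <- unname(labels)
# 2^3 - 1 = 7 labels: a, b, ab, c, ac, bc, abc
```

Each row of `grid` is one selection of events, so with k events there are 2^k - 1 selections to check.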

The full code looks like this:

df <- tibble(
 people = c("Mike", "Pete", "Pete", "Pete", "Mike", "Pete", "Mike", "Mike"),
 event_a = c(rep(1, 6), 0, 1),
 event_b = c(1, 0, 0, 0, rep(1, 4)),
 event_c = c(rep(1, 8)),
 event_d = c(1, 1, 0, 0, 0, 1, 0, 1)
)

# take only the events
df_events <- df %>% select(starts_with("event"))
# create all possible event combinations
# also: discard the first row (all zeros, i.e. no event selected)
event_combs <- expand.grid(rep(list(0:1), ncol(df_events)))[-1, ]

# 'loop' over the possible combinations, and find the matches
res_list <- apply(event_combs, 1, function(row) {
 # row now contains which events we choose
 row <- as.logical(row)
 # var now contains the names of the selected events, e.g. 'a', 'abc', or 'bc'
 var <- paste(letters[1:length(row)][row], collapse = "")

 # combine the data into a tibble
 tibble(var = var,
        people = df$people,
        # check if per row all selected events are true
        value = rowSums(df_events[, row]) == sum(row))
})

# bind the results together
df3 <- bind_rows(res_list) %>% filter(value)

# same as before...
df3 %>% 
 group_by(var) %>% 
 summarise(n_mike = sum(people == "Mike"),
           n_pete = sum(people == "Pete"))
#> # A tibble: 15 x 3
#>      var n_mike n_pete
#>    <chr>  <int>  <int>
#>  1     a      3      4
#>  2    ab      3      1
#>  3   abc      3      1
#>  4  abcd      2      1
#>  5   abd      2      1
#>  6    ac      3      4
#>  7   acd      2      2
#>  8    ad      2      2
#>  9     b      4      1
#> 10    bc      4      1
#> 11   bcd      2      1
#> 12    bd      2      1
#> 13     c      4      4
#> 14    cd      2      2
#> 15     d      2      2
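
If only combinations of two or more events are wanted, as in the question, and the 26-letter limit is a concern, one alternative is base R's combn(), labelling each combination by its column names. The following is a sketch under those assumptions, not a drop-in replacement: `count_combs()` is a hypothetical helper written for this example, and the `" & "` label format is arbitrary.

```r
# Same data as above, as a plain data.frame (base R only).
df <- data.frame(
  people  = c("Mike", "Pete", "Pete", "Pete", "Mike", "Pete", "Mike", "Mike"),
  event_a = c(rep(1, 6), 0, 1),
  event_b = c(1, 0, 0, 0, rep(1, 4)),
  event_c = rep(1, 8),
  event_d = c(1, 1, 0, 0, 0, 1, 0, 1)
)

event_cols <- grep("^event", names(df), value = TRUE)
persons <- sort(unique(df$people))

# count_combs() is a hypothetical helper: for every combination of k event
# columns, count per person the rows in which all k events equal 1.
count_combs <- function(data, cols, k) {
  combs <- combn(cols, k, simplify = FALSE)
  rows <- lapply(combs, function(sel) {
    # TRUE for rows where every selected event is 1
    all_one <- rowSums(data[, sel, drop = FALSE] == 1) == k
    # one count per person, named by person
    counts <- vapply(persons,
                     function(p) sum(all_one & data$people == p),
                     integer(1))
    cbind(data.frame(n_events = k,
                     events = paste(sel, collapse = " & ")),
          as.data.frame(t(counts)))
  })
  do.call(rbind, rows)
}

# all combination sizes from 2 up to the number of event columns
res <- do.call(rbind, lapply(2:length(event_cols), function(k) {
  count_combs(df, event_cols, k)
}))
```

`res` then has one row per combination, with one count column per person; the counts should agree with the rows for `ab`, `ac`, ..., `abcd` in the table above.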