R: subset by column status and count unique records based on another data frame

Time: 2019-04-03 05:32:24

Tags: r

I have a dataset that looks like this:

customer_id    group_a    group_b    group_c    group_d
123            true       false      true       false
456            false      true       false      true
789            false      true       true       false

I also have records for each customer in a dataset like this:

customer_id    date
123            01/01/2019
123            01/02/2019
123            01/03/2019
123            01/04/2019
123            01/04/2019  

456            01/01/2019
456            01/02/2019
456            01/03/2019

789            01/01/2019
789            01/03/2019
789            01/03/2019
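
For anyone who wants to reproduce this, here is one way the two tables above might be entered in R. The names df1 and df2 follow the answer's code; storing the flags as the strings "true"/"false" (rather than logical values) is an assumption, since the question does not say how the columns are typed.

```r
# Hypothetical reconstruction of the two tables shown above.
# df1: one row per customer with group-membership flags (assumed to be strings).
df1 <- data.frame(
  customer_id = c(123, 456, 789),
  group_a = c("true", "false", "false"),
  group_b = c("false", "true", "true"),
  group_c = c("true", "false", "true"),
  group_d = c("false", "true", "false"),
  stringsAsFactors = FALSE
)

# df2: one row per customer record; the duplicated rows are intentional.
df2 <- data.frame(
  customer_id = c(rep(123, 5), rep(456, 3), rep(789, 3)),
  date = c("01/01/2019", "01/02/2019", "01/03/2019", "01/04/2019", "01/04/2019",
           "01/01/2019", "01/02/2019", "01/03/2019",
           "01/01/2019", "01/03/2019", "01/03/2019"),
  stringsAsFactors = FALSE
)
```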

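For each group, I would like to get the number of unique records by date (each customer counted once per date) among the customers flagged "true" for that group, along with the total number of customers in each group. The result should look like this: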

date         group    record   total
01/01/2019   a        1        1
01/02/2019   a        1        1
01/03/2019   a        1        1
01/04/2019   a        1        1

01/01/2019   b        2        2
01/02/2019   b        1        2
01/03/2019   b        2        2
01/04/2019   b        0        2

01/01/2019   c        2        2
01/02/2019   c        1        2
01/03/2019   c        2        2
01/04/2019   c        1        2

01/01/2019   d        1        1
01/02/2019   d        1        1
01/03/2019   d        1        1
01/04/2019   d        0        1

1 Answer:

Answer 0: (score: 1)

I don't find this very elegant, but the result matches your expected output, so here it is.


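library(lubridate)
library(dplyr)
library(tidyr)

# df1: customer/group flags (first table); df2: customer records (second table).
# Parse the month/day/year strings into Date values.
df2$date <- mdy(df2$date)

# Attach each record's group flags, reshape to one row per record and group,
# and keep only the groups the customer belongs to.
# Note: member == "true" assumes the flags are stored as strings.
df2 %>% 
  inner_join(df1, by = "customer_id", copy = TRUE) %>%
  gather(key = "group", value = "member", group_a:group_d) %>%
  filter(member == "true") %>% 
  # Add the missing date/group combinations (customer_id is NA there).
  complete(date, group) %>%
  select(date, group, customer_id) ->  df3

# Count distinct customers per group and date, then join the per-group totals.
# na.rm = TRUE keeps the NA rows added by complete() out of the counts.
df3 %>%
  group_by(group, date) %>% 
  summarise(record = n_distinct(customer_id, na.rm = TRUE)) %>% 
  left_join( df3 %>%
             group_by(group) %>%
             summarise(total = n_distinct(customer_id, na.rm = TRUE)),
             by = "group") %>% ungroup() %>%
  select(date, group, record, total) -> result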

This gives:

# A tibble: 16 x 4
   date       group   record total
   <date>     <chr>    <int> <int>
 1 2019-01-01 group_a      1     1
 2 2019-01-02 group_a      1     1
 3 2019-01-03 group_a      1     1
 4 2019-01-04 group_a      1     1
 5 2019-01-01 group_b      2     2
 6 2019-01-02 group_b      1     2
 7 2019-01-03 group_b      2     2
 8 2019-01-04 group_b      0     2
 9 2019-01-01 group_c      2     2
10 2019-01-02 group_c      1     2
11 2019-01-03 group_c      2     2
12 2019-01-04 group_c      1     2
13 2019-01-01 group_d      1     1
14 2019-01-02 group_d      1     1
15 2019-01-03 group_d      1     1
16 2019-01-04 group_d      0     1
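
As a possible variation, the same counts can be sketched in a single pipeline with pivot_longer(), which supersedes gather() in tidyr 1.0.0 and later. This is only a sketch under assumptions: the data frames are re-entered here so the block runs on its own, the flags are assumed to be the strings "true"/"false", and result2 is a name made up for illustration.

```r
library(dplyr)
library(tidyr)

# Sample data re-entered from the question (flags assumed to be strings).
df1 <- data.frame(
  customer_id = c(123, 456, 789),
  group_a = c("true", "false", "false"),
  group_b = c("false", "true", "true"),
  group_c = c("true", "false", "true"),
  group_d = c("false", "true", "false"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  customer_id = rep(c(123, 456, 789), times = c(5, 3, 3)),
  date = sprintf("01/%02d/2019", c(1, 2, 3, 4, 4, 1, 2, 3, 1, 3, 3)),
  stringsAsFactors = FALSE
)

result2 <- df2 %>%
  # Parse month/day/year strings with base R instead of lubridate.
  mutate(date = as.Date(date, format = "%m/%d/%Y")) %>%
  inner_join(df1, by = "customer_id") %>%
  # One row per record and group; names_prefix strips "group_" from the labels.
  pivot_longer(starts_with("group_"), names_to = "group",
               names_prefix = "group_", values_to = "member") %>%
  filter(member == "true") %>%
  # Total distinct customers per group, carried along as a column.
  group_by(group) %>%
  mutate(total = n_distinct(customer_id)) %>%
  # Distinct customers per group and date.
  group_by(group, date, total) %>%
  summarise(record = n_distinct(customer_id), .groups = "drop") %>%
  # Add group/date combinations with no records and set their count to 0.
  complete(nesting(group, total), date, fill = list(record = 0L)) %>%
  select(date, group, record, total) %>%
  arrange(group, date)
```

On the sample data this should yield the same 16 rows as above, with the groups labelled a to d as in the question. Filling the missing combinations after counting avoids the NA placeholder rows, so na.rm is not needed.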