Finding new and active users per week from user_id and date

Time: 2019-05-01 09:24:53

Tags: r dataframe dplyr

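Background

Suppose we run a subscription-based business (which we do). When customers order our product, they have a number of customization options. For this exercise, we will assume the following: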

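● When a user signs up, a record is generated for that user in the "Orders" table.

○ This will be the first record in Orders for that user ID.

○ The first date in the "Orders" table will be the date the user signed up.

● A user's first order ships on the same date they signed up.

● Users can change their delivery frequency at any time, and can even request additional boxes.

○ For this task, we won't worry about delivery frequency, mainly because the data in this example is randomly generated and the frequency cadence observed in this dataset defies natural logic ;)

● If or when a user cancels, they remain "active" for 14 days (inclusive) after their last order in the "Orders" table.

● A user is considered "active" on every day between their first order and their last order.

○ For this task, we won't worry about "reactivation", i.e. users who cancelled and then signed up again at a later date. To simplify this exercise, we will treat those users as though they never cancelled.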
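Taken together, the assumptions above imply that each user is active over a single interval: from their first order date through 14 days after their last order date. A minimal sketch of that interval, assuming a toy `orders_demo` table (hypothetical data, not the real Orders table) and reading "14 days (inclusive)" as last order date plus 14:

```r
library(dplyr)

# Hypothetical toy data; only user_id and date matter for this step.
orders_demo <- data.frame(
  user_id = c(1L, 1L, 2L),
  date    = as.Date(c("2017-01-01", "2017-01-22", "2017-01-08"))
)

# One row per user: the sign-up date and the last day counted as active.
active_window <- orders_demo %>%
  group_by(user_id) %>%
  summarise(
    first_active = min(date),       # first order = sign-up date
    last_active  = max(date) + 14,  # assumed reading of the 14-day rule
    .groups = "drop"
  )
```

Because reactivations are ignored, one interval per user is enough; a user with a long gap between orders is still treated as active throughout.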

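Definitions

● Define a cohort as the set of users who first became active in the same period.

● Define the retention rate of a given cohort for a period as the ratio N / D, where N = the number of users in the cohort who were active in this period and were also active in the previous period, and D = the number of users in the cohort who were active in the previous period.

● Define a period as either a calendar month or a calendar week starting on Sunday, as specified in the question.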
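The definitions above can be sketched in base R. The helper names `week_start`, `month_start` and `retention` are illustrative assumptions, not part of the original question:

```r
# Sunday that starts the calendar week containing each date.
# format(x, "%w") gives the weekday as "0" (Sunday) through "6" (Saturday).
week_start <- function(x) x - as.integer(format(x, "%w"))

# First day of the calendar month containing each date.
month_start <- function(x) as.Date(format(x, "%Y-%m-01"))

# Retention N / D for one cohort and one period.
# active_now / active_prev: user_ids of cohort members active in the
# current and the previous period, respectively.
retention <- function(active_now, active_prev) {
  length(intersect(active_now, active_prev)) / length(unique(active_prev))
}
```

A user's cohort is then `week_start()` or `month_start()` applied to their first order date. Note that `retention()` returns `NaN` when nobody in the cohort was active in the previous period.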

Question

Generate a table with the columns:

date | count_new | count_active

count_new: how many new users signed up each week?

count_active: how many users were active each week?

Partial data

    id user_id total       date payment_status
1       1       1 12783 2017-01-01           paid
2     258       1 12783 2017-01-22           paid
3    1072       1 12783 2017-02-26           paid
4    2086       1 12783 2017-03-26           paid
5    2387       1 12783 2017-04-02           paid
6    3860       1 12783 2017-04-30           paid
7    5546       1 12783 2017-05-28           paid
8       2       2  9516 2017-01-01           paid
9      68       2  9516 2017-01-08           paid
10      3       3 14536 2017-01-01           paid
11    372       3 14536 2017-01-29           paid
12    879       3 14536 2017-02-19           paid
13   1796       3 14536 2017-03-19           paid
14   3451       3 14536 2017-04-23           paid
15   4651       3 14536 2017-05-14           paid
16   5547       3 14536 2017-05-28           paid
17   6920       3 14536 2017-06-18           paid
18   7385       3 14536 2017-06-25           paid
19  10024       3 14536 2017-07-30         unpaid
20  11581       3 14536 2017-07-30         unpaid
21  13138       3 14536 2017-07-30         unpaid
22  14695       3 14536 2017-07-30         unpaid
23      4       4  5755 2017-01-01           paid
24    497       4  5755 2017-02-05           paid
25   1285       4  5755 2017-03-05           paid
26   2699       4  5755 2017-04-09           paid
27   3057       4  5755 2017-04-16           paid
28      5       5 10102 2017-01-01           paid
29    498       5 10102 2017-02-05           paid
30   1529       5 10102 2017-03-12           paid
31   2087       5 10102 2017-03-26           paid
32   2388       5 10102 2017-04-02           paid
33      6       6 13552 2017-01-01           paid
34     69       6 13552 2017-01-08           paid




structure(list(id = 1:100, user_id = c(1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 
20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 
33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 
46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L, 
59L, 60L, 61L, 62L, 63L, 64L, 65L, 66L, 67L, 2L, 6L, 10L, 12L, 
17L, 21L, 27L, 29L, 36L, 37L, 40L, 49L, 55L, 59L, 61L, 67L, 68L, 
69L, 70L, 71L, 72L, 73L, 74L, 75L, 76L, 77L, 78L, 79L, 80L, 81L, 
82L, 83L, 84L), total = c(12783L, 9516L, 14536L, 5755L, 10102L, 
13552L, 6940L, 12154L, 14639L, 8034L, 10912L, 12255L, 8016L, 
6483L, 9841L, 14813L, 10934L, 5194L, 7753L, 5544L, 13813L, 9739L, 
13630L, 5281L, 10607L, 14873L, 13441L, 12998L, 10162L, 8110L, 
8269L, 9118L, 12308L, 14144L, 5789L, 7364L, 11921L, 5276L, 11695L, 
6669L, 7872L, 12890L, 7636L, 11682L, 14620L, 10876L, 12273L, 
14560L, 6787L, 13150L, 5559L, 13086L, 6957L, 6862L, 12442L, 10948L, 
12293L, 8398L, 8796L, 14986L, 6235L, 12077L, 5013L, 11953L, 7891L, 
13551L, 14988L, 9516L, 13552L, 8034L, 12255L, 10934L, 13813L, 
13441L, 10162L, 7364L, 11921L, 6669L, 6787L, 12442L, 8796L, 6235L, 
14988L, 10769L, 10875L, 10603L, 12522L, 5475L, 9343L, 6860L, 
11969L, 7392L, 9487L, 13016L, 6284L, 9801L, 6581L, 9164L, 11898L, 
9210L), date = structure(c(17167, 17167, 17167, 17167, 17167, 
17167, 17167, 17167, 17167, 17167, 17167, 17167, 17167, 17167, 
17167, 17167, 17167, 17167, 17167, 17167, 17167, 17167, 17167, 
17167, 17167, 17167, 17167, 17167, 17167, 17167, 17167, 17167, 
17167, 17167, 17167, 17167, 17167, 17167, 17167, 17167, 17167, 
17167, 17167, 17167, 17167, 17167, 17167, 17167, 17167, 17167, 
17167, 17167, 17167, 17167, 17167, 17167, 17167, 17167, 17167, 
17167, 17167, 17167, 17167, 17167, 17167, 17167, 17167, 17174, 
17174, 17174, 17174, 17174, 17174, 17174, 17174, 17174, 17174, 
17174, 17174, 17174, 17174, 17174, 17174, 17174, 17174, 17174, 
17174, 17174, 17174, 17174, 17174, 17174, 17174, 17174, 17174, 
17174, 17174, 17174, 17174, 17174), class = "Date"), payment_status = c("paid", 
"paid", "paid", "paid", "paid", "paid", "paid", "paid", "paid", 
"paid", "paid", "paid", "paid", "paid", "paid", "paid", "paid", 
"paid", "paid", "paid", "paid", "paid", "paid", "paid", "paid", 
"paid", "paid", "paid", "paid", "paid", "paid", "paid", "paid", 
"paid", "paid", "paid", "paid", "paid", "paid", "paid", "paid", 
"paid", "paid", "paid", "paid", "paid", "paid", "paid", "paid", 
"paid", "paid", "paid", "paid", "paid", "paid", "paid", "paid", 
"paid", "paid", "paid", "paid", "paid", "paid", "paid", "paid", 
"paid", "paid", "paid", "paid", "paid", "paid", "paid", "paid", 
"paid", "paid", "paid", "paid", "paid", "paid", "paid", "paid", 
"paid", "paid", "paid", "paid", "paid", "paid", "paid", "paid", 
"paid", "paid", "paid", "paid", "paid", "paid", "paid", "paid", 
"paid", "paid", "paid")), row.names = c(NA, 100L), class = "data.frame")

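1 answer:

Answer 0 (score: 0)

So I managed to compute count_new by finding the first occurrence of each user_id, then merging with the initial data to add a column that says, by date and ID, whether the user is new, and then counting the new users by date: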

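library(dplyr)

# First order per user; by the stated assumptions this is the sign-up record.
firstshow <- Orders %>%
  group_by(user_id) %>%
  arrange(date) %>%
  slice(1L) %>%
  mutate(new = "new")

# Merge the flag back onto the full order table by date and user.
newdata <- merge.data.frame(Orders, firstshow, by = c("date", "user_id"), all = TRUE)

# Count the flagged (new) users per date.
count <- newdata %>%
  filter(new == "new") %>%
  group_by(date) %>%
  tally()
names(count)[2] <- "count_new"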
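
The code above produces count_new only, and it groups by order date rather than by week (the two coincide in the sample because every date there is a Sunday). A minimal sketch of how both columns might be built per week, assuming toy data in `orders_demo` (hypothetical), a helper `week_start` (hypothetical), and counting a user as active in a week if any day of that week falls between their first order and 14 days after their last order:

```r
library(dplyr)

# Hypothetical toy data standing in for the Orders table.
orders_demo <- data.frame(
  user_id = c(1L, 1L, 2L, 3L),
  date    = as.Date(c("2017-01-01", "2017-01-22", "2017-01-01", "2017-01-10"))
)

# Sunday that starts the week containing each date ("%w": 0 = Sunday).
week_start <- function(x) x - as.integer(format(x, "%w"))

# One row per user: first active day and last active day (last order + 14).
users <- orders_demo %>%
  group_by(user_id) %>%
  summarise(first_active = min(date),
            last_active  = max(date) + 14,
            .groups = "drop")

# Every Sunday from the first sign-up week to the last active week.
weeks <- seq(week_start(min(users$first_active)),
             week_start(max(users$last_active)),
             by = "week")

weekly <- data.frame(
  date = weeks,
  # New: the user's first order falls inside the week [w, w + 6].
  count_new = vapply(seq_along(weeks), function(i)
    sum(users$first_active >= weeks[i] & users$first_active <= weeks[i] + 6),
    integer(1)),
  # Active: the user's active interval overlaps the week [w, w + 6].
  count_active = vapply(seq_along(weeks), function(i)
    sum(users$first_active <= weeks[i] + 6 & users$last_active >= weeks[i]),
    integer(1))
)
```

Working from one row per user also sidesteps double counting when a user has several orders on their first date, which the merge on date and user_id would flag more than once.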