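Detecting new factor levels and their cumulative count by time and group

Time: 2018-05-06 22:02:05

Tags: r time-series

Suppose a retailer wants to detect whether a customer bought a new product category on each visit, and how many unique categories the customer has bought cumulatively as of each visit. In this example, Tom buys paper at time 1 and at time 2, but the paper at time 2 does not count as a new product category because he already bought paper at time 1. The total cumulative unique product count varies at the "time" level. If time means a week, we are interested in the unique products bought up to that week.

Data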

user<-c("Tom","Tom","Tom","Tom","Tom","Jim","Jim")
t<-c("1", "1", "1","2","2","1","2")
product<-c("cpu","paper","ssd","watch","paper","water","water")
dt<-data.frame(user,t,product)
  user t product
1  Tom 1     cpu
2  Tom 1   paper
3  Tom 1     ssd
4  Tom 2   watch
5  Tom 2   paper
6  Jim 1   water
7  Jim 2   water

Expected output

  user t product new_product_dummy total_cumulative_unique_product
1  Tom 1     cpu                 y                               3
2  Tom 1   paper                 y                               3
3  Tom 1     ssd                 y                               3
4  Tom 2   watch                 y                               4
5  Tom 2   paper                 n                               4
6  Jim 1   water                 y                               1
7  Jim 2   water                 n                               1

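My logic is to compare each purchased product against the latest cumulative set of unique factor levels, but I can't work out how to code it.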

1 answer:

Answer 0: (score: 1)

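I don't understand why total_cumulative_unique_product equals 3 in the first three rows, since that does not look like a cumulative count. I therefore assume it is a mistake (if it is in fact correct, skip to Option 2).

Option 1

You can do the following with a tidyverse approach: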

library(tidyverse);
dt %>%
    group_by(user, product) %>%
    mutate(
        n = 1:n(),
        new_product_dummy = ifelse(n == 1, "y", "n")) %>%
    select(-n) %>%
    group_by(user) %>%
    mutate(
        total_cumulative_unique_product = cumsum(new_product_dummy == "y"))
## A tibble: 7 x 5
## Groups:   user [2]
#  user  t     product new_product_dummy total_cumulative_unique_product
#  <fct> <fct> <fct>   <chr>                                       <int>
#1 Tom   1     cpu     y                                               1
#2 Tom   1     paper   y                                               2
#3 Tom   1     ssd     y                                               3
#4 Tom   2     watch   y                                               4
#5 Tom   2     paper   n                                               4
#6 Jim   1     water   y                                               1
#7 Jim   2     water   n                                               1
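
For comparison, the same row-by-row logic can be sketched in base R. This is only an illustration under the assumption that the rows are already in time order within each user; it rebuilds the question's data so that it runs on its own, and `is_new` is a helper name introduced here, not something from the question.

```r
# Rebuild the example data from the question
dt <- data.frame(
    user = c("Tom", "Tom", "Tom", "Tom", "Tom", "Jim", "Jim"),
    t = c("1", "1", "1", "2", "2", "1", "2"),
    product = c("cpu", "paper", "ssd", "watch", "paper", "water", "water"),
    stringsAsFactors = FALSE)

# TRUE on the first occurrence of each (user, product) pair, in row order
is_new <- !duplicated(dt[, c("user", "product")])

dt$new_product_dummy <- ifelse(is_new, "y", "n")

# Running count of first occurrences, restarted for each user
dt$total_cumulative_unique_product <-
    ave(as.integer(is_new), dt$user, FUN = cumsum)

print(dt)
```

Here `duplicated()` plays the role of the `n == 1` test in the pipeline above, and `ave()` with `cumsum` corresponds to the grouped `mutate()`.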

Option 2

To reproduce your expected output exactly, you can do:

dt %>%
    group_by(user, product) %>%
    mutate(
        n = 1:n(),
        new_product_dummy = ifelse(n == 1, "y", "n")) %>%
    select(-n) %>%
    group_by(user) %>%
    mutate(
        total_cumulative_unique_product = cumsum(new_product_dummy == "y")) %>%
    group_by(user, t) %>%
    mutate(
        total_cumulative_unique_product = max(total_cumulative_unique_product))
## A tibble: 7 x 5
## Groups:   user, t [4]
#  user  t     product new_product_dummy total_cumulative_unique_product
#  <fct> <fct> <fct>   <chr>                                       <dbl>
#1 Tom   1     cpu     y                                              3.
#2 Tom   1     paper   y                                              3.
#3 Tom   1     ssd     y                                              3.
#4 Tom   2     watch   y                                              4.
#5 Tom   2     paper   n                                              4.
#6 Jim   1     water   y                                              1.
#7 Jim   2     water   n                                              1.
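
The extra step works because the running count never decreases within a user, so its maximum inside a (user, t) group is the count reached by the end of that period. A rough base R sketch of that idea follows, again assuming the rows are in time order within each user; `is_new`, `running` and `user_period` are names made up for this illustration.

```r
# Rebuild the example data from the question
dt <- data.frame(
    user = c("Tom", "Tom", "Tom", "Tom", "Tom", "Jim", "Jim"),
    t = c("1", "1", "1", "2", "2", "1", "2"),
    product = c("cpu", "paper", "ssd", "watch", "paper", "water", "water"),
    stringsAsFactors = FALSE)

# First occurrence of each (user, product) pair, then a per-user running count
is_new <- !duplicated(dt[, c("user", "product")])
running <- ave(as.integer(is_new), dt$user, FUN = cumsum)

# One group per (user, t) combination that actually occurs in the data
user_period <- interaction(dt$user, dt$t, drop = TRUE)

# Spread the end-of-period count over every row of that period
dt$total_cumulative_unique_product <- ave(running, user_period, FUN = max)

print(dt)
```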

Update

The first-occurrence test depends on row order, so to make sure the rows are sorted by t within each user group:

dt %>%
    arrange(user, t) %>%
    group_by(user, product) %>%
    mutate(
        n = 1:n(),
        new_product_dummy = ifelse(n == 1, "y", "n")) %>%
    select(-n) %>%
    group_by(user) %>%
    mutate(
        total_cumulative_unique_product = cumsum(new_product_dummy == "y")) %>%
    group_by(user, t) %>%
    mutate(
        total_cumulative_unique_product = max(total_cumulative_unique_product))
## A tibble: 7 x 5
## Groups:   user, t [4]
#  user  t     product new_product_dummy total_cumulative_unique_product
#  <fct> <fct> <fct>   <chr>                                       <dbl>
#1 Jim   1     water   y                                              1.
#2 Jim   2     water   n                                              1.
#3 Tom   1     cpu     y                                              3.
#4 Tom   1     paper   y                                              3.
#5 Tom   1     ssd     y                                              3.
#6 Tom   2     watch   y                                              4.
#7 Tom   2     paper   n                                              4.