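Suppose a retailer wants to detect whether a customer buys a new product category on each visit, along with the cumulative number of unique categories purchased as of each visit. In this example, Tom buys paper at time 1 and at time 2, but the paper at time 2 does not count as a new product category because he already bought paper at time 1. The total cumulative unique product count varies at the "time" level. If time means a week, we are interested in the unique products purchased up to that week.
Data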
user<-c("Tom","Tom","Tom","Tom","Tom","Jim","Jim")
t<-c("1", "1", "1","2","2","1","2")
product<-c("cpu","paper","ssd","watch","paper","water","water")
dt<-data.frame(user,t,product)
  user t product
1  Tom 1     cpu
2  Tom 1   paper
3  Tom 1     ssd
4  Tom 2   watch
5  Tom 2   paper
6  Jim 1   water
7  Jim 2   water
Expected output
  user t product new_product_dummy total_cumulative_unique_product
1  Tom 1     cpu                 y                               3
2  Tom 1   paper                 y                               3
3  Tom 1     ssd                 y                               3
4  Tom 2   watch                 y                               4
5  Tom 2   paper                 n                               4
6  Jim 1   water                 y                               1
7  Jim 2   water                 n                               1
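My idea is to compare each purchased product against the latest cumulative set of unique products (factor levels), but I can't work out how to code it.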
Answer 0 (score: 1)
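I don't understand why total_cumulative_unique_product equals 3 for the first three rows, since that does not look like a cumulative count. So I assume it is a mistake (if it is actually correct, skip to option 2).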
Option 1: using a tidyverse approach, you can do the following:
library(tidyverse)
dt %>%
  group_by(user, product) %>%
  mutate(
    # the first row of each (user, product) pair is a new product
    new_product_dummy = if_else(row_number() == 1, "y", "n")) %>%
  group_by(user) %>%
  mutate(
    # running count of new products within each user
    total_cumulative_unique_product = cumsum(new_product_dummy == "y"))
## A tibble: 7 x 5
## Groups: user [2]
# user t product new_product_dummy total_cumulative_unique_product
# <fct> <fct> <fct> <chr> <int>
#1 Tom 1 cpu y 1
#2 Tom 1 paper y 2
#3 Tom 1 ssd y 3
#4 Tom 2 watch y 4
#5 Tom 2 paper n 4
#6 Jim 1 water y 1
#7 Jim 2 water n 1
Option 2: to reproduce your expected output exactly, you can do:
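dt %>%
  group_by(user, product) %>%
  mutate(
    # the first row of each (user, product) pair is a new product
    new_product_dummy = if_else(row_number() == 1, "y", "n")) %>%
  group_by(user) %>%
  mutate(
    # running count of new products within each user
    total_cumulative_unique_product = cumsum(new_product_dummy == "y")) %>%
  group_by(user, t) %>%
  mutate(
    # every row of a (user, t) group gets that period's final count
    total_cumulative_unique_product = max(total_cumulative_unique_product))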
## A tibble: 7 x 5
## Groups: user, t [4]
# user t product new_product_dummy total_cumulative_unique_product
# <fct> <fct> <fct> <chr> <dbl>
#1 Tom 1 cpu y 3.
#2 Tom 1 paper y 3.
#3 Tom 1 ssd y 3.
#4 Tom 2 watch y 4.
#5 Tom 2 paper n 4.
#6 Jim 1 water y 1.
#7 Jim 2 water n 1.
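For comparison, the same two steps (flag the first occurrence, then carry the per-period maximum) can be sketched in base R. This is only an illustrative alternative, not part of the original answer; it assumes the rows are already in time order within each user, and the helper names is_new and running are made up for the example.

```r
user <- c("Tom", "Tom", "Tom", "Tom", "Tom", "Jim", "Jim")
t <- c("1", "1", "1", "2", "2", "1", "2")
product <- c("cpu", "paper", "ssd", "watch", "paper", "water", "water")
dt <- data.frame(user, t, product)

# A (user, product) pair that has appeared in an earlier row is not new
is_new <- !duplicated(dt[c("user", "product")])
dt$new_product_dummy <- ifelse(is_new, "y", "n")

# Running count of new products within each user
running <- ave(as.integer(is_new), dt$user, FUN = cumsum)

# Give every row of a (user, t) group that group's final running count;
# drop = TRUE avoids empty user/t combinations being passed to max()
dt$total_cumulative_unique_product <-
  ave(running, interaction(dt$user, dt$t, drop = TRUE), FUN = max)

print(dt)
```

duplicated() on the two-column data frame plays the role of the grouped row-number check, and ave() stands in for group_by() plus mutate().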
To make sure the rows are sorted by t within each user before the first occurrence is flagged:
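dt %>%
  # sort first, so "first occurrence" means the earliest purchase
  arrange(user, t) %>%
  group_by(user, product) %>%
  mutate(
    # the first row of each (user, product) pair is a new product
    new_product_dummy = if_else(row_number() == 1, "y", "n")) %>%
  group_by(user) %>%
  mutate(
    # running count of new products within each user
    total_cumulative_unique_product = cumsum(new_product_dummy == "y")) %>%
  group_by(user, t) %>%
  mutate(
    # every row of a (user, t) group gets that period's final count
    total_cumulative_unique_product = max(total_cumulative_unique_product))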
## A tibble: 7 x 5
## Groups: user, t [4]
# user t product new_product_dummy total_cumulative_unique_product
# <fct> <fct> <fct> <chr> <dbl>
#1 Jim 1 water y 1.
#2 Jim 2 water n 1.
#3 Tom 1 cpu y 3.
#4 Tom 1 paper y 3.
#5 Tom 1 ssd y 3.
#6 Tom 2 watch y 4.
#7 Tom 2 paper n 4.
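One caveat about the sorting step: in the question's data t is stored as text (a character vector or a factor, depending on the R version), so it is ordered as strings or factor levels rather than as numbers. With only the values 1 and 2 that makes no difference, but it can once there are ten or more periods. A small sketch of the issue, using made-up values t_chr, t_fct and t_num that are not in the original data:

```r
# Hypothetical time values, stored as text like t in the question
t_chr <- c("1", "2", "10")

# Character sorting compares strings, so "10" typically lands before "2"
print(sort(t_chr))

# Factor levels are built from the sorted strings as well
t_fct <- factor(t_chr)
print(levels(t_fct))

# Converting through character gives the real numeric order
t_num <- as.numeric(as.character(t_fct))
print(sort(t_num))
```

If that applies to your data, sorting on the converted value, for example arrange(user, as.numeric(as.character(t))), should keep the weeks in calendar order.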