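I have a dataset of customers with their purchases and purchase dates. I need to find a "next product". A next product is a product that the customer has not bought in the past 24 months; when that is the case, the current product becomes the next product. Note that the name "next product" does not mean I am trying to predict anything. I only want to check whether the customer bought this particular product in the past 24 months, and if not, the next product is the current product.
Here is a minimal working example: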
> test.dt
    ID   Product Diff_date
 1:  1 Product_C    0 days
 2:  1 Product_A   91 days
 3:  1 Product_A  122 days
 4:  1 Product_A  700 days
 5:  1 Product_A  700 days
 6:  1 Product_A  700 days
 7:  1 Product_A  731 days
 8:  1 Product_A  731 days
 9:  1 Product_C  761 days
10:  1 Product_A  761 days
11:  2 Product_A   30 days
12:  2 Product_B   60 days
13:  2 Product_C   91 days
The data.table with the desired outcome, generated by hand:
> test.dt.outcome
    ID   Product Diff_date Next_product
 1:  1 Product_C    0 days         None
 2:  1 Product_A   91 days         None
 3:  1 Product_A  122 days         None
 4:  1 Product_A  700 days         None
 5:  1 Product_A  700 days         None
 6:  1 Product_A  700 days         None
 7:  1 Product_A  731 days         None
 8:  1 Product_A  731 days         None
 9:  1 Product_C  761 days    Product_C
10:  1 Product_A  761 days         None
11:  2 Product_A   30 days         None
12:  2 Product_B   60 days         None
13:  2 Product_C   91 days         None
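We can see that the customer with ID = 1 bought Product_C earlier, but not within the last 24 * 30 = 720 days, so the next product is Product_C. The customer with ID = 2, on the other hand, bought different products, but all within the 24-month window, so no next product is defined for them.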
I am looking for a solution using the data.table package, but other approaches are welcome too.
library(data.table)

# Input data: one row per purchase; Diff_date is a difftime in days
test.dt <- setDT(structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L), Product = c("Product_C", "Product_A", "Product_A",
"Product_A", "Product_A", "Product_A", "Product_A", "Product_A",
"Product_C", "Product_A", "Product_A", "Product_B", "Product_C"
), Diff_date = structure(c(0, 91, 122, 700, 700, 700, 731, 731,
761, 761, 30, 60, 91), units = "days", class = "difftime")), .Names = c("ID",
"Product", "Diff_date"), row.names = c(NA, -13L), class = c("data.table",
"data.frame")))

# Desired outcome: same rows plus the hand-filled Next_product column
test.dt.outcome <- setDT(structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L), Product = c("Product_C", "Product_A", "Product_A",
"Product_A", "Product_A", "Product_A", "Product_A", "Product_A",
"Product_C", "Product_A", "Product_A", "Product_B", "Product_C"
), Diff_date = structure(c(0, 91, 122, 700, 700, 700, 731, 731,
761, 761, 30, 60, 91), units = "days", class = "difftime"), Next_product = c("None",
"None", "None", "None", "None", "None", "None", "None", "Product_C",
"None", "None", "None", "None")), .Names = c("ID", "Product",
"Diff_date", "Next_product"), row.names = c(NA, -13L), class = c("data.table",
"data.frame")))