R data.table: find newly encountered levels within a time period

Time: 2017-03-06 19:22:59

Tags: r data.table

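Background

I have a dataset of customers, their associated purchases, and the purchase dates. I need to find a "next product". A next product is defined as a product the customer has not purchased in the past 24 months; in that case the current product becomes the next product. Note that the name "next product" does not mean I am trying to predict anything. I only want to check whether the customer bought this particular product in the past 24 months, and if not, the next product is the current product.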

Data

Here is a minimal working example:

> test.dt
    ID   Product Diff_date
 1:  1 Product_C    0 days
 2:  1 Product_A   91 days
 3:  1 Product_A  122 days
 4:  1 Product_A  700 days
 5:  1 Product_A  700 days
 6:  1 Product_A  700 days
 7:  1 Product_A  731 days
 8:  1 Product_A  731 days
 9:  1 Product_C  761 days
10:  1 Product_A  761 days
11:  2 Product_A   30 days
12:  2 Product_B   60 days
13:  2 Product_C   91 days

The data.table with the desired result, generated by hand:

> test.dt.out
    ID   Product Diff_date Next_product
 1:  1 Product_C    0 days         None
 2:  1 Product_A   91 days         None
 3:  1 Product_A  122 days         None
 4:  1 Product_A  700 days         None
 5:  1 Product_A  700 days         None
 6:  1 Product_A  700 days         None
 7:  1 Product_A  731 days         None
 8:  1 Product_A  731 days         None
 9:  1 Product_C  761 days    Product_C
10:  1 Product_A  761 days         None
11:  2 Product_A   30 days         None
12:  2 Product_B   60 days         None
13:  2 Product_C   91 days         None

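We can see that the customer with ID = 1 bought Product_C before, but not within the last 24 * 30 = 720 days, so the next product is Product_C. The customer with ID = 2, on the other hand, bought different products, but all within the 24-month window, so no next product is defined.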
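To make the rule concrete, here is a minimal sketch of one reading of it that reproduces the hand-made table: a row gets a Next_product only when the same customer has bought the same product before and the most recent earlier purchase is more than 720 days back. That reading is an assumption (first-time purchases get "None"), and the helper columns `row_id` and `gap` and the variable `window_days` are names made up for this illustration. `Diff_date` is stored as plain numeric days here to keep the arithmetic simple.

```r
library(data.table)

# Example data copied from the printed table (Diff_date as numeric days).
dt <- data.table(
  ID = c(rep(1L, 10), rep(2L, 3)),
  Product = c("Product_C", rep("Product_A", 7), "Product_C", "Product_A",
              "Product_A", "Product_B", "Product_C"),
  Diff_date = c(0, 91, 122, 700, 700, 700, 731, 731, 761, 761, 30, 60, 91)
)

window_days <- 24 * 30  # 720 days, as in the question

# Remember the original row order, then sort so that shift() sees each
# customer's purchases of a product in chronological order.
dt[, row_id := .I]
setorder(dt, ID, Product, Diff_date, row_id)

# Days since the same customer last bought the same product
# (NA for the first purchase of that product).
dt[, gap := Diff_date - shift(Diff_date), by = .(ID, Product)]

# Assumption: only a repeat purchase after a gap longer than the window counts.
dt[, Next_product := fifelse(!is.na(gap) & gap > window_days, Product, "None")]

# Restore the original order and drop the helper columns.
setorder(dt, row_id)
dt[, c("row_id", "gap") := NULL]

print(dt)
```

With a `difftime` column such as the one in the dput, converting first with `as.numeric(Diff_date, units = "days")` should let the same comparison work. Note that under this sketch, repeated purchases on the same day have a gap of 0, so only the first of them could be flagged.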

Solution

I am looking for a solution that uses the data.table package, but other approaches are welcome too.

Dputs

library(data.table)
test.dt <- setDT(structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L), Product = c("Product_A", "Product_A", "Product_A", 
"Product_A", "Product_A", "Product_A", "Product_A", "Product_A", 
"Product_C", "Product_A", "Product_A", "Product_B", "Product_C"
), Diff_date = structure(c(0, 91, 122, 700, 700, 700, 731, 731, 
761, 761, 30, 60, 91), units = "days", class = "difftime")), .Names = c("ID", 
"Product", "Diff_date"), row.names = c(NA, -13L), class = c("data.table", 
"data.frame")))


test.dt.out <- setDT(structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L), Product = c("Product_A", "Product_A", "Product_A", 
"Product_A", "Product_A", "Product_A", "Product_A", "Product_A", 
"Product_C", "Product_A", "Product_A", "Product_B", "Product_C"
), Diff_date = structure(c(0, 91, 122, 700, 700, 700, 731, 731, 
761, 761, 30, 60, 91), units = "days", class = "difftime"), Next_product = c("None", 
"None", "None", "None", "None", "None", "None", "None", "Product_C", 
"None", "None", "None", "None")), .Names = c("ID", "Product", 
"Diff_date", "Next_product"), row.names = c(NA, -13L), class = c("data.table", 
"data.frame")))

0 Answers:

No answers