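I have a dataset of household retail purchases. For each household trip, I want to check whether any brand bought on that trip was also bought on the previous trip: if so, loyal = 1, otherwise loyal = 0. The dataset is large, with billions of observations, so the more efficient the better. :)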
library(data.table)
household <- as.integer(c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3))
trip <- as.integer(c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9))
brand <- as.integer(c(1,2,3,4,5,6,7,5,1,6,8,9,9,2,8,1,3,4,5,6,7,8,9,1,1,2,3,4,1,5,6,7,1,8,9,2))
DT <- data.table(household,trip,brand)
Desired output:
> DT
      household trip loyal
 [1,]         1    1    NA
 [2,]         1    2     0
 [3,]         1    3     1
 [4,]         2    4    NA
 [5,]         2    5     0
 [6,]         2    6     0
 [7,]         3    7    NA
 [8,]         3    8     1
 [9,]         3    9     1
I tried something like this, but it obviously does not produce the desired output.
DT$loyal <- 0
for (h in unique(DT$household)){
for (t in unique(DT$trip)){
DT[brand %in% (DT[trip=t-1]$brand)]$loyal <- 1
}}
Answer 0 (score: 1)
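You could do a self join to get the index of trips with a repeated brand, and then join that index onto the unique combinations of `household` and `trip`. Something like this comes to mind: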
# Create a column of the previous trip
DT[, prev_trip := trip - 1L]
# Self join
indx <-
DT[DT
,.(household, trip)
,on = .(household, prev_trip = trip, brand)
,nomatch = 0L]
# Unique combinations of `household` and `trip`, joined with the index
res <- unique(DT[, .(household, trip)])[indx, on = .(household, trip), loyal := 1L]
res
#    household trip loyal
# 1:         1    1    NA
# 2:         1    2    NA
# 3:         1    3     1
# 4:         2    4    NA
# 5:         2    5    NA
# 6:         2    6    NA
# 7:         3    7    NA
# 8:         3    8     1
# 9:         3    9     1
I'm not sure whether the 0s are crucial, since they don't look very useful to me, but they can be added afterwards if needed.
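If the 0s do matter, here is a minimal sketch of one way to fill them in. It rests on two assumptions: trip ids are consecutive integers within each household, and the first trip of each household should stay NA, as in the question's desired output. The data is rebuilt from the question so the block runs on its own. The explicit `unique()` around the index and the `replace()` step are illustrative choices, not part of the code above.

```r
library(data.table)

# Sample data from the question
DT <- data.table(
  household = rep(1:3, each = 12L),
  trip      = rep(1:9, each = 4L),
  brand     = as.integer(c(1,2,3,4,5,6,7,5,1,6,8,9,9,2,8,1,3,4,5,6,
                           7,8,9,1,1,2,3,4,1,5,6,7,1,8,9,2))
)

# Previous trip id (assumption: trips are consecutive within a household)
DT[, prev_trip := trip - 1L]

# (household, trip) pairs where at least one brand was also bought
# on the previous trip of the same household
indx <- unique(
  DT[DT, .(household, trip),
     on = .(household, prev_trip = trip, brand),
     nomatch = 0L]
)

# One row per trip, defaulting to 0, then flag the matched trips with 1
res <- unique(DT[, .(household, trip)])
res[, loyal := 0L]
res[indx, on = .(household, trip), loyal := 1L]

# The first trip of each household has no previous trip, so set it to NA
res[, loyal := replace(loyal, which.min(trip), NA_integer_), by = household]

res$loyal
# NA  0  1 NA  0  0 NA  1  1
```

Defaulting to 0 first and overwriting with 1 keeps everything as update-by-reference joins, which avoids copying the table. That may matter at the scale mentioned in the question.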