Question

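I have a dataset of products purchased by households on retail trips. For each household trip, I want to check whether any brand bought during that trip was also bought on the previous trip: if so, loyal = 1, otherwise loyal = 0. The dataset is large, with billions of observations, so the more efficient the better. :)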

library(data.table)
household <-  as.integer(c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3))
trip      <- as.integer(c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9))
brand     <- as.integer(c(1,2,3,4,5,6,7,5,1,6,8,9,9,2,8,1,3,4,5,6,7,8,9,1,1,2,3,4,1,5,6,7,1,8,9,2))
DT <- data.table(household,trip,brand)

Desired output:

> DT
     household trip loyal
[1,]         1    1    NA
[2,]         1    2     0
[3,]         1    3     1
[4,]         2    4    NA
[5,]         2    5     0
[6,]         2    6     0
[7,]         3    7    NA
[8,]         3    8     1
[9,]         3    9     1

I tried something like this, but it obviously does not produce the desired output.

DT$loyal <- 0
for (h in unique(DT$household)){
  for (t in unique(DT$trip)){
    DT[brand %in% (DT[trip=t-1]$brand)]$loyal <- 1
  }}

Answer 1

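You could self-join to get the matching rows, and then join that index onto the unique combinations of household and trip. Something along these lines: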

# Create a column of the previous trip
DT[, prev_trip := trip - 1L]

# Self join
indx <- 
  DT[DT 
   ,.(household, trip)
   ,on = .(household, prev_trip = trip, brand)
   ,nomatch = 0L]

# Unique combinations of `household` and `trip`, update-joined with the index
res <- unique(DT[, .(household, trip)])[indx, on = .(household, trip), loyal := 1L]
res
#    household trip loyal
# 1:         1    1    NA
# 2:         1    2    NA
# 3:         1    3     1
# 4:         2    4    NA
# 5:         2    5    NA
# 6:         2    6    NA
# 7:         3    7    NA
# 8:         3    8     1
# 9:         3    9     1

I'm not sure the 0s are crucial, as they don't look very useful to me, but they can be added afterwards if needed.
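
If the 0s are wanted, one possible way to add them afterwards is sketched below. It assumes that the first trip of each household should stay NA (as in the desired output) and that every other unmatched trip becomes 0. The helper column `first_trip` is a name made up for this illustration, and the data is rebuilt so that the snippet runs on its own.

```r
library(data.table)

# Rebuild the example data from the question
DT <- data.table(
  household = rep(1:3, each = 12L),
  trip      = rep(1:9, each = 4L),
  brand     = as.integer(c(1,2,3,4,5,6,7,5,1,6,8,9,9,2,8,1,3,4,5,6,
                           7,8,9,1,1,2,3,4,1,5,6,7,1,8,9,2))
)

# Same steps as above: self-join on the previous trip, then update-join
DT[, prev_trip := trip - 1L]
indx <- DT[DT, .(household, trip),
           on = .(household, prev_trip = trip, brand), nomatch = 0L]
res <- unique(DT[, .(household, trip)])[indx, on = .(household, trip), loyal := 1L]

# Flag the first trip of each household (hypothetical helper column)
res[, first_trip := trip == min(trip), by = household]

# Any later trip that is still NA had no brand in common with the previous trip
res[is.na(loyal) & !first_trip, loyal := 0L]

# Drop the helper column again
res[, first_trip := NULL]
res
```

Everything here is done by reference with `:=`, so no extra copy of `res` is made, which should matter at the scale mentioned in the question.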
