I have a data set named visitsPerDay that looks like this, but with 405,890 rows and 10,406 unique CUST_IDs:
> CUST_ID Date
> 1 2013-09-19
> 1 2013-10-03
> 1 2013-10-08
> 1 2013-10-12
> 1 2013-10-20
> 1 2013-10-25
> 1 2013-11-01
> 1 2013-11-02
> 1 2013-11-08
> 1 2013-11-15
> 1 2013-11-23
> 1 2013-12-02
> 1 2013-12-04
> 1 2013-12-09
> 2 2013-09-16
> 2 2013-09-17
> 2 2013-09-18
What I'd like to do is create a new variable that holds the lagged difference between visit dates. This is the code I'm currently using:
visitsPerDay <- visitsPerDay[order(visitsPerDay$CUST_ID), ]
visitsPerDay$MBTV <- NA_real_
cust_id <- 0
for (i in 1:nrow(visitsPerDay)) {
  if (visitsPerDay$CUST_ID[i] != cust_id) {
    # first visit for this customer: there is no previous visit to compare with
    cust_id <- visitsPerDay$CUST_ID[i]
    visitsPerDay$MBTV[i] <- NA
  } else {
    visitsPerDay$MBTV[i] <- as.numeric(visitsPerDay$Date[i] - visitsPerDay$Date[i-1])
  }
}
I feel this is surely not the most efficient way to do it. Does anyone have a better approach? Thanks!
Answer 0 (score: 1)
Here is a data.table solution. It will probably be faster and more readable:
library(data.table)
dt = data.table(visitsPerDay)
dt[, MBTV := c(NA, diff(as.Date(Date))), by = CUST_ID]
dt
# CUST_ID Date MBTV
# 1: 1 2013-09-19 NA days
# 2: 1 2013-10-03 14 days
# 3: 1 2013-10-08 5 days
# 4: 1 2013-10-12 4 days
# 5: 1 2013-10-20 8 days
# 6: 1 2013-10-25 5 days
# 7: 1 2013-11-01 7 days
# 8: 1 2013-11-02 1 days
# 9: 1 2013-11-08 6 days
#10: 1 2013-11-15 7 days
#11: 1 2013-11-23 8 days
#12: 1 2013-12-02 9 days
#13: 1 2013-12-04 2 days
#14: 1 2013-12-09 5 days
#15: 2 2013-09-16 NA days
#16: 2 2013-09-17 1 days
#17: 2 2013-09-18 1 days
Answer 1 (score: 0)
Here is an approach using tapply:
# convert 'Date' to class 'Date' (may already be done)
visitsPerDay$Date <- as.Date(visitsPerDay$Date)
# note: the rows must already be sorted by CUST_ID, because tapply returns
# its results grouped by customer
visitsPerDay <- transform(visitsPerDay,
                          MBTV = unlist(tapply(Date,
                                               CUST_ID,
                                               FUN = function(x) c(NA, diff(x)))))
Result:
CUST_ID Date MBTV
11 1 2013-09-19 NA
12 1 2013-10-03 14
13 1 2013-10-08 5
14 1 2013-10-12 4
15 1 2013-10-20 8
16 1 2013-10-25 5
17 1 2013-11-01 7
18 1 2013-11-02 1
19 1 2013-11-08 6
110 1 2013-11-15 7
111 1 2013-11-23 8
112 1 2013-12-02 9
113 1 2013-12-04 2
114 1 2013-12-09 5
21 2 2013-09-16 NA
22 2 2013-09-17 1
23 2 2013-09-18 1
Edit: a faster approach:
# transform 'Date' to values of class 'Date' (maybe already done)
visitsPerDay$Date <- as.Date(visitsPerDay$Date)
visitsPerDay$MBTV <- c(NA_integer_,
"is.na<-"(diff(visitsPerDay$Date),
!duplicated(visitsPerDay$CUST_ID)[-1]))
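The one-liner above is compact, so here is a step-by-step sketch of what it appears to do, run on a few rows taken from the question's sample. The names `visits`, `gaps` and `starts_new_customer` are introduced here for illustration only, and the sketch assumes the rows are already sorted by CUST_ID and then by Date.

```r
# A few rows from the question's sample, already sorted by CUST_ID, then Date
visits <- data.frame(
  CUST_ID = c(1, 1, 1, 2, 2, 2),
  Date = as.Date(c("2013-09-19", "2013-10-03", "2013-10-08",
                   "2013-09-16", "2013-09-17", "2013-09-18"))
)

# Step 1: gap in days between every pair of consecutive rows, ignoring customers
gaps <- as.numeric(diff(visits$Date))

# Step 2: TRUE wherever a row is the first one seen for its customer
# (row 1 is dropped so the vector lines up with 'gaps')
starts_new_customer <- !duplicated(visits$CUST_ID)[-1]

# Step 3: a gap that crosses a customer boundary is meaningless, so blank it out
gaps[starts_new_customer] <- NA

# The very first row has no previous visit either
visits$MBTV <- c(NA, gaps)
print(visits)
```

Because the whole column is differenced in one call, there is no per-customer function call, which is the likely reason this version is faster than the tapply one.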
Answer 2 (score: 0)
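You can speed up the process by doing a bucket sort instead of a regular sort, since you are sorting by cust_id. Note that the bottleneck in the algorithm (in terms of big-O notation) is the sort, which is O(n log n).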
The following pseudocode assumes the data is sorted by date (the code suggested in the other answers needs the same assumption):
//bucket sort:
customers <- new array of size 10406
for each (cust_id,date):
    if customers[cust_id] == nil:
        customers[cust_id] = []
    customers[cust_id].append(date)
//find differences:
for each list in customers:
    i <- list.iter()
    prev = i.next()
    while (i.hasNext()):
        curr <- i.next()
        output diff(prev,curr)
        prev <- curr
The code above runs in O(n), which in theory is better than your approach (for sufficiently large input), at the cost of more memory consumption.
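For readers who want to try the bucket idea in R, here is a minimal sketch that uses base R's split() to build one bucket of row positions per customer. This is one possible translation of the pseudocode, not a claim about how split() is implemented or about its complexity. The names `visits`, `buckets` and `mbtv` are made up for this example, and the rows are assumed to be sorted by Date only.

```r
# A few rows from the question's sample, sorted by Date (not by CUST_ID)
visits <- data.frame(
  CUST_ID = c(2, 2, 2, 1, 1),
  Date = as.Date(c("2013-09-16", "2013-09-17", "2013-09-18",
                   "2013-09-19", "2013-10-03"))
)

# Bucket step: one vector of row positions per customer, kept in date order
buckets <- split(seq_len(nrow(visits)), visits$CUST_ID)

# Difference step: walk each bucket and write the gaps back to the original rows
mbtv <- rep(NA_real_, nrow(visits))
for (rows in buckets) {
  # the first visit in a bucket stays NA; later visits get the gap in days
  mbtv[rows] <- c(NA, as.numeric(diff(visits$Date[rows])))
}
visits$MBTV <- mbtv
print(visits)
```

Writing the results back through the stored row positions keeps the data frame in its original order, so no sort by CUST_ID is needed at any point.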