I have two data frames, dfUsers and purchases, generated with the following code:
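set.seed(1)
library(data.table)
# One row per user. Only 3 start dates are drawn for 5 users, so they are
# recycled (in the output further down, a/d and b/e share the same window).
dfUsers <- data.table(user = letters[1:5],
                      startDate = sample(seq.Date(from = as.Date('2016-01-01'), to = Sys.Date(), by = '1 day'), 3)
)
# Every user gets the same window length (one random value between 30 and 90 days)
dfUsers$endDate <- dfUsers$startDate + sample(30:90, 1)
# 500 purchases on random dates; the 50 sampled amounts are recycled to 500 rows
purchases <- data.table(
  user = sample(letters[1:5], 500, replace = TRUE),
  purchaseDate = sample(seq.Date(from = as.Date('2016-01-01'), to = Sys.Date(), by = '1 day'), 500, replace = TRUE),
  amount = runif(50, 300, 500)
)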
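For each user, I want to sum all purchases made between startDate and endDate.
My current approach uses dplyr's mutate with a function applied row by row, but it becomes very slow as both tables grow.
I am still learning R, so I would like to know whether there is a more efficient way to solve this kind of problem.
The function: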
addPurchases <- function(u, startDate, endDate) {
purchases[user == u & startDate <= purchaseDate & endDate >= purchaseDate, sum(amount)]
}
The dplyr chain:
library(dplyr)
dfUsers %>%
rowwise() %>%
mutate(totalPurchase = addPurchases(user, startDate, endDate))
Answer 0 (score: 4)
A fast, clean, and memory-efficient solution is to use a non-equi join.
purchases[dfUsers, on = .(user, purchaseDate >= startDate, purchaseDate <= endDate),
sum(amount), by = .EACHI]
# user purchaseDate purchaseDate V1
#1: a 2016-07-06 2016-09-29 6929.469
#2: b 2016-09-20 2016-12-14 6563.416
#3: c 2017-02-08 2017-05-04 3607.794
#4: d 2016-07-06 2016-09-29 5591.748
#5: e 2016-09-20 2016-12-14 5727.622
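To make the mechanics of the join easier to follow, here is a minimal sketch on hypothetical toy data (the tables `users` and `buys` and the column name `totalPurchase` are made up for illustration and are not the OP's data). It names the result column instead of leaving it as V1, and uses `na.rm = TRUE` so that a user with no purchases inside the window gets 0 rather than NA.

```r
library(data.table)

# Hypothetical toy data: two users, one date window each
users <- data.table(
  user      = c("a", "b"),
  startDate = as.Date(c("2016-01-01", "2016-02-01")),
  endDate   = as.Date(c("2016-01-31", "2016-02-29"))
)
buys <- data.table(
  user         = c("a", "a", "a", "b"),
  purchaseDate = as.Date(c("2016-01-10", "2016-01-20", "2016-03-01", "2016-01-15")),
  amount       = c(100, 50, 999, 10)
)

# For each row of `users`, match the rows of `buys` with the same user and a
# purchaseDate inside [startDate, endDate]. by = .EACHI evaluates the sum once
# per row of `users`, so the full set of matched rows is never materialized.
res <- buys[users,
            on = .(user, purchaseDate >= startDate, purchaseDate <= endDate),
            .(totalPurchase = sum(amount, na.rm = TRUE)),
            by = .EACHI]

print(res)
```

User a has two purchases inside the window (100 and 50); user b's only purchase falls before the window starts, so b's total comes out as 0 here.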
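Answer 1 (score: 2)
Answer 2 (score: 1)
A solution using data.table: merge the two tables on user, keep the purchases that fall inside each user's window, and compute the sum by user: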
library(data.table)
# Using the OP's data
merge(dfUsers,
purchases,
"user")[purchaseDate >= startDate & purchaseDate <= endDate,
sum(amount),
user]
# user V1
# 1: a 6929.469
# 2: b 6563.416
# 3: c 3607.794
# 4: d 5591.748
# 5: e 5727.622
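One thing worth knowing about the merge-then-filter approach: the merge first pairs every purchase with its user's window, and users with no purchase inside the window disappear from the result after filtering. The sketch below illustrates this on hypothetical toy data (`users`, `buys`, and `totalPurchase` are illustrative names, not the OP's), and shows one possible way to bring the missing users back.

```r
library(data.table)

# Hypothetical toy data: two users, one date window each
users <- data.table(
  user      = c("a", "b"),
  startDate = as.Date(c("2016-01-01", "2016-02-01")),
  endDate   = as.Date(c("2016-01-31", "2016-02-29"))
)
buys <- data.table(
  user         = c("a", "a", "a", "b"),
  purchaseDate = as.Date(c("2016-01-10", "2016-01-20", "2016-03-01", "2016-01-15")),
  amount       = c(100, 50, 999, 10)
)

# Step 1: equi-join on user. Every purchase is paired with its user's window,
# so the intermediate table has one row per purchase (4 rows here).
joined <- merge(users, buys, by = "user")

# Step 2: keep purchases inside the window, then sum per user.
res <- joined[purchaseDate >= startDate & purchaseDate <= endDate,
              .(totalPurchase = sum(amount)),
              by = user]

# User b has no purchase inside the window and is absent from `res`.
# Joining back onto the user list restores b with an NA total.
full <- res[users[, .(user)], on = "user"]

print(res)
print(full)
```

On large tables the intermediate `joined` table can be much bigger than the final result, which is the cost the non-equi join avoids.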