An efficient way to do a row-wise mutate

Time: 2017-12-08 19:58:45

Tags: r function dplyr data.table

I have two data frames, dfUsers and purchases, generated with the following code:

set.seed(1)
library(data.table)

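# Only 3 start dates are sampled for 5 users. rep() makes the recycling explicit,
# since data.table() may refuse to recycle a length-3 column to 5 rows.
dfUsers <- data.table(user = letters[1:5],
                      startDate = rep(sample(seq.Date(from = as.Date('2016-01-01'), to = Sys.Date(), by = '1 day'), 3), length.out = 5)
                      )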

dfUsers$endDate <- dfUsers$startDate + sample(30:90,1)

purchases <- data.table(
  user = sample(letters[1:5], 500, replace = TRUE),
  purchaseDate = sample(seq.Date(from = as.Date('2016-01-01'), to = Sys.Date(), by = '1 day'), 500, replace = TRUE),
  amount = runif(50,300, 500)
)

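For each user, I want to add up all purchases made between startDate and endDate (inclusive).

My current approach is to use dplyr's mutate with a function applied row by row, but this becomes very slow as the two tables grow.

I am still learning R, so I would like to know whether there is a more efficient way to solve this kind of problem.

Function: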

addPurchases <- function(u, startDate, endDate) {
  purchases[user == u & startDate <= purchaseDate & endDate >= purchaseDate, sum(amount)]
}

dplyr call:

library(dplyr)
dfUsers %>% 
  rowwise() %>%
  mutate(totalPurchase = addPurchases(user, startDate, endDate))

3 answers:

Answer 0: (score: 4)

A fast, clean and memory-efficient solution is to use a non-equi join.

purchases[dfUsers, on = .(user, purchaseDate >= startDate, purchaseDate <= endDate),
          sum(amount), by = .EACHI]
#   user purchaseDate purchaseDate       V1
#1:    a   2016-07-06   2016-09-29 6929.469
#2:    b   2016-09-20   2016-12-14 6563.416
#3:    c   2017-02-08   2017-05-04 3607.794
#4:    d   2016-07-06   2016-09-29 5591.748
#5:    e   2016-09-20   2016-12-14 5727.622
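To make the behaviour of the non-equi join easier to follow, here is a minimal sketch on a small fixed data set. The tables `users` and `buys` and the column name `totalPurchase` are made up for illustration and are not part of the question's data. The sketch also adds `na.rm = TRUE`, an assumption about the desired result, so that a user with no purchases in the window gets 0 rather than NA.

```r
library(data.table)

# Hypothetical toy data (not the question's random data)
users <- data.table(
  user      = c("a", "b", "c"),
  startDate = as.Date(c("2017-01-01", "2017-02-01", "2017-03-01")),
  endDate   = as.Date(c("2017-01-31", "2017-02-28", "2017-03-31"))
)
buys <- data.table(
  user         = c("a", "a", "a", "b", "b"),
  purchaseDate = as.Date(c("2017-01-05", "2017-01-20", "2017-02-10",
                           "2017-02-15", "2017-03-02")),
  amount       = c(10, 20, 40, 5, 7)
)

# For each row of `users`, match the purchases of the same user whose date
# lies in [startDate, endDate]; by = .EACHI evaluates j once per `users` row.
res <- buys[users,
            on = .(user, purchaseDate >= startDate, purchaseDate <= endDate),
            .(totalPurchase = sum(amount, na.rm = TRUE)),
            by = .EACHI]
res
```

Because the join is evaluated per row of `users`, user "c" still appears in the result even though it has no matching purchases.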

Answer 1: (score: 2)

A solution using dplyr. The idea is to merge the data frames by user, filter the data by date, and then summarise the total amount by user.

library(dplyr)

dfUsers2 <- dfUsers %>%
  full_join(purchases, by = "user") %>%
  filter(purchaseDate >= startDate, purchaseDate <= endDate) %>%
  group_by(user) %>%
  summarise(Total = sum(amount, na.rm = TRUE))
dfUsers2
# # A tibble: 5 x 2
#    user    Total
#   <chr>    <dbl>
# 1     a 6929.469
# 2     b 6563.416
# 3     c 3607.794
# 4     d 5591.748
# 5     e 5727.622
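As a possible variation on the join-then-filter idea, newer dplyr releases (1.1.0 and later, which postdate this answer) can express the date conditions inside the join itself via `join_by()`. The sketch below is not the answerer's code; it uses made-up toy tables `users` and `buys` and assumes that users without purchases in the window should be kept with a total of 0.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for join_by()

# Hypothetical toy data (not the question's random data)
users <- tibble(
  user      = c("a", "b", "c"),
  startDate = as.Date(c("2017-01-01", "2017-02-01", "2017-03-01")),
  endDate   = as.Date(c("2017-01-31", "2017-02-28", "2017-03-31"))
)
buys <- tibble(
  user         = c("a", "a", "a", "b", "b"),
  purchaseDate = as.Date(c("2017-01-05", "2017-01-20", "2017-02-10",
                           "2017-02-15", "2017-03-02")),
  amount       = c(10, 20, 40, 5, 7)
)

# Join on user and on the date window in one step; left_join keeps users
# that have no matching purchase (their amount is NA, summed as 0).
res <- users %>%
  left_join(buys,
            by = join_by(user,
                         startDate <= purchaseDate,
                         endDate >= purchaseDate)) %>%
  group_by(user) %>%
  summarise(totalPurchase = sum(amount, na.rm = TRUE))
res
```

Compared with joining on user alone and filtering afterwards, this avoids building the full per-user combination of rows before the date filter is applied.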

Answer 2: (score: 1)

A solution using data.table: merge the two tables, then compute the sum by user:

library(data.table)
# Using OP's data
merge(dfUsers, 
      purchases, 
      "user")[purchaseDate >= startDate & purchaseDate <= endDate, 
              sum(amount), 
              user]
#    user       V1
# 1:    a 6929.469
# 2:    b 6563.416
# 3:    c 3607.794
# 4:    d 5591.748
# 5:    e 5727.622
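One property of the merge-then-filter approach is worth seeing on a small example: the merge first builds one row per (user, purchase) pair, and because it is an inner join followed by a filter, users with no purchases in their window drop out of the result. The sketch below uses made-up toy tables `users` and `buys` (not the question's data) to illustrate this.

```r
library(data.table)

# Hypothetical toy data (not the question's random data)
users <- data.table(
  user      = c("a", "b", "c"),
  startDate = as.Date(c("2017-01-01", "2017-02-01", "2017-03-01")),
  endDate   = as.Date(c("2017-01-31", "2017-02-28", "2017-03-31"))
)
buys <- data.table(
  user         = c("a", "a", "a", "b", "b"),
  purchaseDate = as.Date(c("2017-01-05", "2017-01-20", "2017-02-10",
                           "2017-02-15", "2017-03-02")),
  amount       = c(10, 20, 40, 5, 7)
)

# Inner join on user: every purchase of a user is paired with that user's window
joined <- merge(users, buys, by = "user")

# Keep only purchases inside the window, then total them per user
res <- joined[purchaseDate >= startDate & purchaseDate <= endDate,
              .(totalPurchase = sum(amount)),
              by = user]
res
```

Here user "c" is absent from `res`, and `joined` holds all five purchases before filtering. On large tables that intermediate result can become big, which is one reason a non-equi join may use less memory.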