Fastest way to compute a 21-day rolling sum for an ActivityType

Asked: 2015-12-24 17:04:19

Tags: r data.table dplyr zoo

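I have a large data frame (3M+ rows). I am trying to count how many times a given ActivityType occurs within a 21-day window. I modeled my solution on "Rolling Sum by Another Variable in R", but it takes a very long time even for a single ActivityType. I would not expect 3M+ rows to take an excessive amount of time. Here is my attempt: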

dt <- read.table(text='

                         Name      ActivityType     ActivityDate                
                         John       Email            1/1/2014           
                         John       Email            1/3/2014                
                         John       Webinar          1/5/2014          
                         John       Webinar          1/20/2014          
                         John       Webinar          3/25/2014          
                         John       Email            4/1/2014           
                         John       Email            4/20/2014          
                         Tom        Email            1/1/2014           
                         Tom       Webinar           1/5/2014           
                         Tom       Webinar           1/20/2014          
                         Tom       Webinar           3/25/2014          
                         Tom       Email             4/1/2014           
                         Tom       Email             4/20/2014          

                         ', header=T, row.names = NULL)

library(data.table)
library(reshape2)
dt$ActivityType <- factor(dt$ActivityType)
dt$ActivityDate <- as.Date(dt$ActivityDate, "%m/%d/%Y")
dt <- dt[order(dt$Name, dt$ActivityDate), ]

dt <- dcast(dt, Name + ActivityDate ~ ActivityType, fun.aggregate = length)
setDT(dt)
# Build reference table
Ref <- dt[, list(Compare_Value = list(I(Email)), Compare_Date = list(I(ActivityDate))), by = c("Name")]
# Use mapply to get last 21 days of value by Name
dt[, Email_RollingSum := mapply(ActivityDate = ActivityDate, Name = Name, function(ActivityDate, Name) {
    d <- as.numeric(Ref$Compare_Date[[Name]] - ActivityDate)
    sum((d <= 0 & d >= -21) * Ref$Compare_Value[[Name]])})]

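This only handles ActivityType = Email; I would then have to repeat the same thing for every other ActivityType level. The link I took the solution from mentions using "mcapply" instead of "mapply". Please tell me how to use mcapply, or suggest any other solution that would speed this up.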
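For reference, the parallel counterpart of mapply in R's bundled parallel package is mcmapply; reading the "mcapply" above as a reference to it is an assumption. A minimal sketch on made-up toy data (the names dates, is_email and count_in_window are hypothetical) shows that it takes the same arguments plus mc.cores:

```r
library(parallel)

# Toy data (made up for illustration): one person's activity log
dates    <- as.Date(c("2014-01-01", "2014-01-03", "2014-01-05", "2014-03-25"))
is_email <- c(TRUE, TRUE, FALSE, FALSE)

# Count emails dated within [dates[i] - 21, dates[i]]
count_in_window <- function(i) {
  sum(is_email & dates >= dates[i] - 21 & dates <= dates[i])
}

# Forking is not available on Windows, where mc.cores must be 1
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

serial_counts   <- mapply(count_in_window, seq_along(dates))
parallel_counts <- mcmapply(count_in_window, seq_along(dates), mc.cores = n_cores)
```

Note that this only spreads the per-row work across cores; every row still scans its whole group, so the total amount of work is unchanged.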

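Below is the expected output. For each row, I take the 21 days up to and including its ActivityDate as the time window, and count how many times ActivityType = "Email" occurs within that window.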

    Name  ActivityType  ActivityDate  Email_RollingSum
    John  Email         1/1/2014      1
    John  Email         1/3/2014      2
    John  Webinar       1/5/2014      2
    John  Webinar       1/20/2014     2
    John  Webinar       3/25/2014     0
    John  Email         4/1/2014      1
    John  Email         4/20/2014     2
    Tom   Email         1/1/2014      1
    Tom   Webinar       1/5/2014      1
    Tom   Webinar       1/20/2014     1
    Tom   Webinar       3/25/2014     0
    Tom   Email         4/1/2014      1
    Tom   Email         4/20/2014     2
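The window rule can be pinned down with a brute-force reference implementation in base R. This is only a sketch for checking the definition on the sample data, assuming both ends of the window are inclusive (as in the attempt above); the helper count_emails is a hypothetical name, and the approach compares every row against every other row, so it is not meant for 3M rows.

```r
# Sample data from the question
dt <- data.frame(
  Name = c(rep("John", 7), rep("Tom", 6)),
  ActivityType = c("Email", "Email", "Webinar", "Webinar", "Webinar", "Email", "Email",
                   "Email", "Webinar", "Webinar", "Webinar", "Email", "Email"),
  ActivityDate = as.Date(c("1/1/2014", "1/3/2014", "1/5/2014", "1/20/2014",
                           "3/25/2014", "4/1/2014", "4/20/2014",
                           "1/1/2014", "1/5/2014", "1/20/2014",
                           "3/25/2014", "4/1/2014", "4/20/2014"), "%m/%d/%Y"),
  stringsAsFactors = FALSE
)

# For row i: count Email rows of the same Name dated within
# [ActivityDate[i] - 21, ActivityDate[i]], both ends inclusive
count_emails <- function(i) {
  sum(dt$Name == dt$Name[i] &
      dt$ActivityType == "Email" &
      dt$ActivityDate >= dt$ActivityDate[i] - 21 &
      dt$ActivityDate <= dt$ActivityDate[i])
}

dt$Email_RollingSum <- vapply(seq_len(nrow(dt)), count_emails, integer(1))
```

On the sample data this reproduces the Email_RollingSum column shown above, which makes it usable as a correctness check for faster solutions.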

2 Answers:

Answer 0 (score: 6)

setDT(dt)
dt[, ActivityDate := as.Date(ActivityDate, '%m/%d/%Y')]

# add index to keep track of rows
dt[, idx := .I]

# match the dates we're looking for using a rolling join and extract the row numbers
rr = dt[.(Name = Name, ActivityDate = ActivityDate - 21, refIdx = idx),
       .(idx, refIdx), on = c('Name', 'ActivityDate'), roll = -Inf]
#    idx refIdx
# 1:   1      1
# 2:   1      2
# 3:   1      3
# 4:   1      4
# 5:   5      5
# 6:   5      6
# 7:   6      7
# 8:   8      8
# 9:   8      9
#10:   8     10
#11:  11     11
#12:  11     12
#13:  12     13

# extract the above rows and count occurrences using dcast
dcast(rr[, {seq = idx:refIdx; dt[seq]}, by = 1:nrow(rr)], nrow ~ ActivityType)
#   nrow Email Webinar
#1     1     1       0
#2     2     2       0
#3     3     2       1
#4     4     2       2
#5     5     0       1
#6     6     1       1
#7     7     2       0
#8     8     1       0
#9     9     1       1
#10   10     1       2
#11   11     0       1
#12   12     1       1
#13   13     2       0
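The rolling join is the step that does the work in the code above: with roll = -Inf, each lookup date is matched to the first row of the same group dated on or after it, which is the first row inside that 21-day window. A toy sketch of this behaviour, where the tables x and lookup and their columns are made up for illustration:

```r
library(data.table)

# Made-up reference table: three dated rows for one id
x <- data.table(
  id  = "a",
  day = as.Date(c("2014-01-01", "2014-01-10", "2014-02-15")),
  row = 1:3
)

# Made-up lookup dates; the last one falls after every row in x
lookup <- data.table(
  id  = "a",
  day = as.Date(c("2013-12-20", "2014-01-05", "2014-02-01", "2014-03-01"))
)

# roll = -Inf carries the next observation backward: each lookup day is
# matched to the first row of x with the same id and day >= lookup day
matched <- x[lookup, on = c("id", "day"), roll = -Inf]
matched$row
```

The first three lookups resolve to rows 1, 2 and 3. The last one has no later row to roll back from and comes back as NA, a case that cannot arise in the answer because each lookup date is 21 days before a date that exists in the table.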

Answer 1 (score: 4)

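Try an approach in which the data table serves both as the list of names and dates and as the source of the email counts. In data.table this is done by using DT itself as the i argument of DT, together with by = .EACHI.

The following uses the approach described above, with some changes that may improve speed by 30-40%, depending on your data.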

library(data.table)
# convert character dates to Date types
dt$ActivityDate <- as.Date(dt$ActivityDate, "%m/%d/%Y") 
# convert to a 'data.table' and define key
setDT(dt, key = "Name")
# count emails and webinars
dt <- dt[dt[,.(Name, type = ActivityType, date = ActivityDate)],
         .(type, date,
           Email = sum(ActivityType == "Email" & between(ActivityDate, date-21, date)),
           Webinar = sum(ActivityType == "Webinar" & between(ActivityDate, date-21, date))),
         by=.EACHI]
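To make the by = .EACHI mechanics easier to follow, here is a minimal sketch on a made-up four-row table. The names log_dt, lookup and ref_date are hypothetical, and on = "Name" is used in place of a key. For each row of lookup, j is evaluated once and sees the matching rows of log_dt together with that lookup row's ref_date.

```r
library(data.table)

# Made-up four-row activity log
log_dt <- data.table(
  Name         = c("John", "John", "John", "Tom"),
  ActivityType = c("Email", "Email", "Webinar", "Email"),
  ActivityDate = as.Date(c("2014-01-01", "2014-01-03", "2014-03-25", "2014-01-01"))
)

# One lookup row per original row; ref_date is a hypothetical column name
lookup <- log_dt[, .(Name, ref_date = ActivityDate)]

# For each lookup row, j sees all rows of log_dt with the same Name
# (ActivityType, ActivityDate) plus that lookup row's ref_date
res <- log_dt[lookup,
              .(ref_date,
                Email = sum(ActivityType == "Email" &
                            ActivityDate >= ref_date - 21 &
                            ActivityDate <= ref_date)),
              on = "Name", by = .EACHI]
res$Email
```

Because every lookup row scans all rows of its Name, the cost grows with the square of the group size, so how well this scales depends on how many rows each Name has.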