Efficient way to compute a rolling aggregate by date over the past 30 days

Time: 2019-01-10 03:21:01

Tags: r loops optimization aggregate

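My code runs correctly, but it takes a huge amount of time to finish. If possible, I would appreciate some ways to optimize it so that the rolling aggregation is performed over multiple columns.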

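I have tried a few other approaches, writing a function and using library(data.table) to vectorize the work on the data frame, but without success: I actually got half of what I should get, and I could only do one column at a time.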

#   Creating functions
fun <- function(x, date, thresh) {
    D <- as.matrix(dist(date)) #distance matrix between dates
    D <- D <= thresh
    D[lower.tri(D)] <- FALSE #don't sum to future
    R <- D * x #FALSE is treated as 0
    colMeans(R, na.rm = TRUE)
}

library(data.table)
setDT(df_2)
df_2[, invoiceDate := as.Date(invoiceDate, format = "%m/%d/%Y")]
setkey(df_2, cod_unb, cod_pdv, invoiceDate)

df_2[, volume_total_diario_RT30 := fun(volume_total_diario, invoiceDate, 30), by = list(cod_unb, cod_pdv)]
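
One possible reason the attempt above returns values that are too small: `colMeans(R)` divides each column sum by the total number of rows, because the entries outside the window are zeros rather than missing values, so `na.rm = TRUE` does not drop them. The following is a minimal sketch of a corrected version, assuming the intent is the mean over the current date and the preceding `thresh` days. The name `roll_mean` and the sample vectors are illustrative only (the values are taken from the expected output further down).

```r
# Sketch only: `roll_mean` is a hypothetical name, not part of the original code.
roll_mean <- function(x, date, thresh) {
  d <- as.numeric(date)
  # gap[i, j] = number of days from observation i to observation j
  gap <- outer(d, d, function(from, to) to - from)
  # observation i belongs to the window of j if it is 0..thresh days older
  in_window <- gap >= 0 & gap <= thresh
  # x recycles down the rows, so element [i, j] is x[i] when i is in j's window
  colSums(in_window * x) / colSums(in_window)
}

dates <- as.Date(c("2018-11-03", "2018-11-09", "2018-11-16", "2018-11-24",
                   "2018-11-30", "2018-12-07", "2018-12-15", "2018-12-21",
                   "2018-12-28", "2019-01-04"))
volume <- c(0.48, 0.79035, 1.32105, 0.6414, 0.6,
            1.79175, 1.4421, 0.48, 0.5535, 0.36)

res <- roll_mean(volume, dates, 30)
res
```

This still builds an n-by-n matrix per group, so memory grows quadratically with the group size, and it does not handle NA values in `x`.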

This is my current code. It works correctly, but it takes far too long (more than 8 hours to process 30 days):

years <- c(2017:2019)
months <- c(1:12)
days <- c(1:31)

df_final <- df_n[1,c('cod_unb','cod_pdv','cpf_cnpj','idade_pdv_meses','status_telefone','col1','col2','col3','years','months','days')] # placeholder row with the target columns; dropped after the loop

for (i in years) {
    for (j in months) {
        for (k in days) {
            if (j == 1){
                df_temp <- df_n[(df_n$years == i & df_n$months == j & df_n$days <= k) | (df_n$years == (i-1) & df_n$months == 12 & df_n$days >= k),]    
            }                                    
            if (j != 1){                                   
                df_temp <- df_n[(df_n$years == i & df_n$months == j & df_n$days <= k) | (df_n$years == i & df_n$months == (j - 1) & df_n$days >= k),] 
            }

            # Aggregate.
            if (nrow(df_temp) >= 1) {
                df_temp <- aggregate(df_temp[, c('col1','col2','col3')], by = list(df_temp$cod_unb, df_temp$cod_pdv, df_temp$cpf_cnpj, df_temp$idade_pdv_meses, df_temp$status_telefone), FUN = mean)

                # restore the names of the grouping columns
                names(df_temp)[names(df_temp) == "Group.1"] <- "cod_unb"
                names(df_temp)[names(df_temp) == "Group.2"] <- "cod_pdv"
                names(df_temp)[names(df_temp) == "Group.3"] <- "cpf_cnpj"
                names(df_temp)[names(df_temp) == "Group.4"] <- "idade_pdv_meses"
                names(df_temp)[names(df_temp) == "Group.5"] <- "status_telefone"

                df_temp$years <- i
                df_temp$months <- j
                df_temp$days <- k
                df_final <- rbind(df_final, df_temp)
            }
        }                       
    }           
}

df_final <- df_final[-1,]

The expected output is the column R30:

cod_unb;cod_pdv;Years;Months;Days;date;volume_total_diario;R30
111;1005;2018;11;3;03/11/2018;0.48;
111;1005;2018;11;9;09/11/2018;0.79035;
111;1005;2018;11;16;16/11/2018;1.32105;
111;1005;2018;11;24;24/11/2018;0.6414;
111;1005;2018;11;30;30/11/2018;0.6;
111;1005;2018;12;7;07/12/2018;1.79175;1.02891
111;1005;2018;12;15;15/12/2018;1.4421;1.15926
111;1005;2018;12;21;21/12/2018;0.48;0.99105
111;1005;2018;12;28;28/12/2018;0.5535;0.97347
111;1005;2019;1;4;04/01/2019;0.36;0.92547

1 answer:

Answer 0 (score: 1)

If I understand correctly, the OP wants to aggregate values over a rolling 30-day period and append these aggregates to the original data.

This can be solved efficiently by aggregating in a non-equi join.

Here is an example for one variable, using the sample data provided by the OP:

library(data.table)
# coerce to data.table, coerce character date to class IDate
setDT(df_n)[, date := as.IDate(date, "%d/%m/%Y")]
# intermediate result for demonstration:
df_n[.(upper = date, lower = date - 30), on = .(date <= upper, date >= lower), 
     mean(volume_total_diario), by = .EACHI]
          date       date       V1
 1: 2018-11-03 2018-10-04 0.480000
 2: 2018-11-09 2018-10-10 0.635175
 3: 2018-11-16 2018-10-17 0.863800
 4: 2018-11-24 2018-10-25 0.808200
 5: 2018-11-30 2018-10-31 0.766560
 6: 2018-12-07 2018-11-07 1.028910
 7: 2018-12-15 2018-11-15 1.159260
 8: 2018-12-21 2018-11-21 0.991050
 9: 2018-12-28 2018-11-28 0.973470
10: 2019-01-04 2018-12-05 0.925470

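The intermediate result shows the upper and lower bounds of the date range included in each aggregate, together with the aggregated value for each period. It can be used to add a new column to df_n: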

# update df_n by appending new column
setDT(df_n)[, R30_new := df_n[.(upper = date, lower = date - 30), on = .(date <= upper, date >= lower), 
                       mean(volume_total_diario), by = .EACHI]$V1]
df_n
    cod_unb cod_pdv Years Months Days       date volume_total_diario     R30  R30_new
 1:     111    1005  2018     11    3 2018-11-03             0.48000      NA 0.480000
 2:     111    1005  2018     11    9 2018-11-09             0.79035      NA 0.635175
 3:     111    1005  2018     11   16 2018-11-16             1.32105      NA 0.863800
 4:     111    1005  2018     11   24 2018-11-24             0.64140      NA 0.808200
 5:     111    1005  2018     11   30 2018-11-30             0.60000      NA 0.766560
 6:     111    1005  2018     12    7 2018-12-07             1.79175 1.02891 1.028910
 7:     111    1005  2018     12   15 2018-12-15             1.44210 1.15926 1.159260
 8:     111    1005  2018     12   21 2018-12-21             0.48000 0.99105 0.991050
 9:     111    1005  2018     12   28 2018-12-28             0.55350 0.97347 0.973470
10:     111    1005  2019      1    4 2019-01-04             0.36000 0.92547 0.925470

The values of R30 and R30_new are identical; R30_new also contains results for the first 5 rows.
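
As a sanity check, the join result can be compared with a slow but obvious row-by-row computation. This is only a sketch: the table `dt`, the explicit `windows` lookup table and the `brute` vector are names introduced here for illustration, and the window is assumed to be the current date plus the preceding 30 days, as in the join above.

```r
library(data.table)

dt <- data.table(
  date = as.IDate(c("2018-11-03", "2018-11-09", "2018-11-16", "2018-11-24",
                    "2018-11-30", "2018-12-07", "2018-12-15", "2018-12-21",
                    "2018-12-28", "2019-01-04")),
  volume_total_diario = c(0.48, 0.79035, 1.32105, 0.6414, 0.6,
                          1.79175, 1.4421, 0.48, 0.5535, 0.36)
)

# One row per observation with the bounds of its 30-day window
windows <- dt[, .(upper = date, lower = date - 30L)]

# Aggregate in the non-equi join, one result per row of `windows`
joined <- dt[windows, on = .(date <= upper, date >= lower),
             mean(volume_total_diario), by = .EACHI]$V1

# Reference implementation: filter and average for each row separately
brute <- vapply(seq_len(nrow(dt)), function(i) {
  in_window <- dt$date <= dt$date[i] & dt$date >= dt$date[i] - 30L
  mean(dt$volume_total_diario[in_window])
}, numeric(1))

all.equal(joined, brute)
```

The row-by-row version is fine for checking a small sample, but it scans the whole table once per row, which is the kind of cost the join avoids.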

Caveat

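For the sake of clarity, the other grouping variables have been ignored, but they can easily be included. Also, the solution can be extended to aggregate multiple value columns.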
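
One way the grouping variables might be included is to add them as equality conditions to the join, so that each window only matches rows of the same group. This is a sketch under assumptions: the column names cod_unb and cod_pdv come from the question, while the second outlet (2001), its values, and the names `dt`, `windows` and `roll` are made up for illustration.

```r
library(data.table)

# Hypothetical sample with two outlets; only the column names are from the question
dt <- data.table(
  cod_unb = 111L,
  cod_pdv = c(1005L, 1005L, 1005L, 2001L, 2001L),
  date = as.IDate(c("2018-11-03", "2018-11-09", "2018-12-07",
                    "2018-11-09", "2018-11-16")),
  volume_total_diario = c(0.48, 0.79035, 1.79175, 10, 20)
)

# One lookup row per original row: group keys plus the window bounds
windows <- dt[, .(cod_unb, cod_pdv, upper = date, lower = date - 30L)]

# Equality on the group keys, inequality on the dates
roll <- dt[windows,
           on = .(cod_unb, cod_pdv, date <= upper, date >= lower),
           .(R30 = mean(volume_total_diario)),
           by = .EACHI]

# by = .EACHI returns one row per row of `windows`, in the same order as dt
dt[, R30 := roll$R30]
dt
```

Here the third row of outlet 1005 only averages the observations from 2018-11-09 and 2018-12-07, and the rows of outlet 2001 never enter the windows of outlet 1005.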

Data

library(data.table)
df_n <- fread("
cod_unb;cod_pdv;Years;Months;Days;date;volume_total_diario;R30
111;1005;2018;11;3;03/11/2018;0.48;
111;1005;2018;11;9;09/11/2018;0.79035;
111;1005;2018;11;16;16/11/2018;1.32105;
111;1005;2018;11;24;24/11/2018;0.6414;
111;1005;2018;11;30;30/11/2018;0.6;
111;1005;2018;12;7;07/12/2018;1.79175;1.02891
111;1005;2018;12;15;15/12/2018;1.4421;1.15926
111;1005;2018;12;21;21/12/2018;0.48;0.99105
111;1005;2018;12;28;28/12/2018;0.5535;0.97347
111;1005;2019;1;4;04/01/2019;0.36;0.92547
")

Edit: Aggregating multiple variables

As the OP asked for a way to perform the rolling aggregation over multiple columns, here is an example.

First, we need to create an additional value variable in the OP's sample dataset:

df_n <- fread("
cod_unb;cod_pdv;Years;Months;Days;date;volume_total_diario;R30
111;1005;2018;11;3;03/11/2018;0.48;
111;1005;2018;11;9;09/11/2018;0.79035;
111;1005;2018;11;16;16/11/2018;1.32105;
111;1005;2018;11;24;24/11/2018;0.6414;
111;1005;2018;11;30;30/11/2018;0.6;
111;1005;2018;12;7;07/12/2018;1.79175;1.02891
111;1005;2018;12;15;15/12/2018;1.4421;1.15926
111;1005;2018;12;21;21/12/2018;0.48;0.99105
111;1005;2018;12;28;28/12/2018;0.5535;0.97347
111;1005;2019;1;4;04/01/2019;0.36;0.92547
")[
  , date := as.IDate(date, "%d/%m/%Y")][, var2 := .I][]
df_n
    cod_unb cod_pdv Years Months Days       date volume_total_diario     R30 var2
 1:     111    1005  2018     11    3 2018-11-03             0.48000      NA    1
 2:     111    1005  2018     11    9 2018-11-09             0.79035      NA    2
 3:     111    1005  2018     11   16 2018-11-16             1.32105      NA    3
 4:     111    1005  2018     11   24 2018-11-24             0.64140      NA    4
 5:     111    1005  2018     11   30 2018-11-30             0.60000      NA    5
 6:     111    1005  2018     12    7 2018-12-07             1.79175 1.02891    6
 7:     111    1005  2018     12   15 2018-12-15             1.44210 1.15926    7
 8:     111    1005  2018     12   21 2018-12-21             0.48000 0.99105    8
 9:     111    1005  2018     12   28 2018-12-28             0.55350 0.97347    9
10:     111    1005  2019      1    4 2019-01-04             0.36000 0.92547   10

So, a column var2 has been added, which simply contains the row number.

This is the code for aggregating multiple columns using the same aggregation function:

cols <- c("volume_total_diario", "var2")
setDT(df_n)[, paste0("mean_", cols) := 
       df_n[.(upper = date, lower = date - 30), 
            on = .(date <= upper, date >= lower), 
            lapply(.SD, mean), 
            .SDcols = cols, by = .EACHI][
              , .SD, .SDcols = cols]][]
df_n
    cod_unb cod_pdv Years Months Days       date volume_total_diario     R30 var2 mean_volume_total_diario mean_var2
 1:     111    1005  2018     11    3 2018-11-03             0.48000      NA    1                 0.480000       1.0
 2:     111    1005  2018     11    9 2018-11-09             0.79035      NA    2                 0.635175       1.5
 3:     111    1005  2018     11   16 2018-11-16             1.32105      NA    3                 0.863800       2.0
 4:     111    1005  2018     11   24 2018-11-24             0.64140      NA    4                 0.808200       2.5
 5:     111    1005  2018     11   30 2018-11-30             0.60000      NA    5                 0.766560       3.0
 6:     111    1005  2018     12    7 2018-12-07             1.79175 1.02891    6                 1.028910       4.0
 7:     111    1005  2018     12   15 2018-12-15             1.44210 1.15926    7                 1.159260       5.0
 8:     111    1005  2018     12   21 2018-12-21             0.48000 0.99105    8                 0.991050       6.0
 9:     111    1005  2018     12   28 2018-12-28             0.55350 0.97347    9                 0.973470       7.0
10:     111    1005  2019      1    4 2019-01-04             0.36000 0.92547   10                 0.925470       8.0

Note that the new columns have been named programmatically.
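
The same pattern could be wrapped in a small helper so that a different aggregation function gets its own programmatically named columns. The following is a sketch, not part of the original answer: `roll_apply`, its arguments and the `sum_` prefix are hypothetical, and the helper assumes the date column is called `date`.

```r
library(data.table)

dt <- data.table(
  date = as.IDate(c("2018-11-03", "2018-11-09", "2018-11-16", "2018-11-24",
                    "2018-11-30", "2018-12-07", "2018-12-15", "2018-12-21",
                    "2018-12-28", "2019-01-04")),
  volume_total_diario = c(0.48, 0.79035, 1.32105, 0.6414, 0.6,
                          1.79175, 1.4421, 0.48, 0.5535, 0.36)
)
dt[, var2 := .I]

cols <- c("volume_total_diario", "var2")

# Hypothetical helper: applies `fun` to each column in `cols` over a window of
# the current date plus the preceding `window_days` days
roll_apply <- function(tbl, cols, fun, window_days = 30L) {
  windows <- tbl[, .(upper = date, lower = date - window_days)]
  tbl[windows, on = .(date <= upper, date >= lower),
      lapply(.SD, fun), .SDcols = cols, by = .EACHI][
        , .SD, .SDcols = cols]
}

means <- roll_apply(dt, cols, mean)
sums  <- roll_apply(dt, cols, sum)

# Name the new columns by prefixing the function name
dt[, paste0("mean_", cols) := means]
dt[, paste0("sum_", cols) := sums]
dt
```

Each call performs its own join, which keeps the helper simple at the cost of repeating the join once per aggregation function.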