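My code runs fine, but it takes a very long time to finish. I am looking for ways to optimize it, if possible, so that I can perform a rolling aggregation on multiple columns.
I have tried a few other approaches, such as writing a function and vectorizing the operation on the data frame with library(data.table), but without success: I actually got half of what I should have, and I could only process one column at a time.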
# Creating functions
library(data.table)

fun <- function(x, date, thresh) {
  D <- as.matrix(dist(date))  # distance matrix between dates
  D <- D <= thresh
  D[lower.tri(D)] <- FALSE    # don't sum to future
  R <- D * x                  # FALSE is treated as 0
  colMeans(R, na.rm = TRUE)
}

setDT(df_2)
df_2[, invoiceDate := as.Date(invoiceDate, format = "%m/%d/%Y")]
setkey(df_2, cod_unb, cod_pdv, invoiceDate)
df_2[, volume_total_diario_RT30 := fun(volume_total_diario, invoiceDate, 30), by = list(cod_unb, cod_pdv)]
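A possible explanation for the shortfall, offered as a sketch rather than a verified diagnosis: colMeans() divides each column sum by the number of rows in the group, not by the number of observations that fall inside the window, because the masked entries count as zeros. Dividing the windowed sums by the windowed counts avoids this. The name roll_mean is hypothetical, and the sketch assumes that the dates are sorted in ascending order and that x has no missing values.

```r
# Hypothetical variant of fun(): trailing-window mean with the window size as denominator.
# Assumes `date` is sorted ascending and `x` contains no NA.
roll_mean <- function(x, date, thresh) {
  D <- as.matrix(dist(as.numeric(date))) <= thresh  # TRUE where two dates are within `thresh` days
  D[lower.tri(D)] <- FALSE                          # drop observations that lie in the future
  colSums(D * x) / colSums(D)                       # windowed sum divided by windowed count
}

# First three rows of the sample data shown further down
dates  <- as.Date(c("2018-11-03", "2018-11-09", "2018-11-16"))
values <- c(0.48, 0.79035, 1.32105)
res <- roll_mean(values, dates, 30)
print(res)
```

This still builds an n x n matrix per group, so it is likely to stay memory-hungry for groups with many rows.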
This is my current code. It works, but it takes far too long (more than 8 hours to process 30 days):
years <- c(2017:2019)
months <- c(1:12)
days <- c(1:31)
# placeholder first row so that rbind() has matching columns; dropped at the end
df_final <- df_n[1, c('cod_unb','cod_pdv','cpf_cnpj','idade_pdv_meses','status_telefone','col1','col2','col3','years','months','days')]
for (i in years) {
  for (j in months) {
    for (k in days) {
      if (j == 1) {
        df_temp <- df_n[(df_n$years == i & df_n$months == j & df_n$days <= k) | (df_n$years == (i-1) & df_n$months == 12 & df_n$days >= k),]
      } else {
        df_temp <- df_n[(df_n$years == i & df_n$months == j & df_n$days <= k) | (df_n$years == i & df_n$months == (j - 1) & df_n$days >= k),]
      }
      # Aggregate
      if (nrow(df_temp) >= 1) {
        df_temp <- aggregate(df_temp[, c('col1','col2','col3')], by = list(df_temp$cod_unb,df_temp$cod_pdv,df_temp$cpf_cnpj,df_temp$idade_pdv_meses,df_temp$status_telefone), FUN = mean)
        names(df_temp)[names(df_temp) == "Group.1"] <- "cod_unb"
        names(df_temp)[names(df_temp) == "Group.2"] <- "cod_pdv"
        names(df_temp)[names(df_temp) == "Group.3"] <- "cpf_cnpj"
        names(df_temp)[names(df_temp) == "Group.4"] <- "idade_pdv_meses"
        names(df_temp)[names(df_temp) == "Group.5"] <- "status_telefone"
        df_temp$years <- i
        df_temp$months <- j
        df_temp$days <- k
        df_final <- rbind(df_final, df_temp)
      }
    }
  }
}
df_final <- df_final[-1, ]  # drop the placeholder row
The expected output is the column R30:
cod_unb;cod_pdv;Years;Months;Days;date;volume_total_diario;R30
111;1005;2018;11;3;03/11/2018;0.48;
111;1005;2018;11;9;09/11/2018;0.79035;
111;1005;2018;11;16;16/11/2018;1.32105;
111;1005;2018;11;24;24/11/2018;0.6414;
111;1005;2018;11;30;30/11/2018;0.6;
111;1005;2018;12;7;07/12/2018;1.79175;1.02891
111;1005;2018;12;15;15/12/2018;1.4421;1.15926
111;1005;2018;12;21;21/12/2018;0.48;0.99105
111;1005;2018;12;28;28/12/2018;0.5535;0.97347
111;1005;2019;1;4;04/01/2019;0.36;0.92547
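Answer 0 (score: 1)
If I understand correctly, the OP wants to aggregate values over a rolling period of 30 days and append these aggregates to the original data.
This can be solved efficiently by aggregating in a non-equi join.
Here is an example for one variable, using the sample data provided by the OP: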
library(data.table)
# coerce to data.table, coerce character date to class IDate
setDT(df_n)[, date := as.IDate(date, "%d/%m/%Y")]
# intermediate result for demonstration:
df_n[.(upper = date, lower = date - 30), on = .(date <= upper, date >= lower),
mean(volume_total_diario), by = .EACHI]
          date       date       V1
 1: 2018-11-03 2018-10-04 0.480000
 2: 2018-11-09 2018-10-10 0.635175
 3: 2018-11-16 2018-10-17 0.863800
 4: 2018-11-24 2018-10-25 0.808200
 5: 2018-11-30 2018-10-31 0.766560
 6: 2018-12-07 2018-11-07 1.028910
 7: 2018-12-15 2018-11-15 1.159260
 8: 2018-12-21 2018-11-21 0.991050
 9: 2018-12-28 2018-11-28 0.973470
10: 2019-01-04 2018-12-05 0.925470
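The intermediate result shows the upper and lower limits of the date range included in the aggregation, together with the aggregated value for each period. It can be used to append a new column to df_n: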
# update df_n by appending new column
setDT(df_n)[, R30_new := df_n[.(upper = date, lower = date - 30), on = .(date <= upper, date >= lower),
mean(volume_total_diario), by = .EACHI]$V1]
df_n
    cod_unb cod_pdv Years Months Days       date volume_total_diario     R30  R30_new
 1:     111    1005  2018     11    3 2018-11-03             0.48000      NA 0.480000
 2:     111    1005  2018     11    9 2018-11-09             0.79035      NA 0.635175
 3:     111    1005  2018     11   16 2018-11-16             1.32105      NA 0.863800
 4:     111    1005  2018     11   24 2018-11-24             0.64140      NA 0.808200
 5:     111    1005  2018     11   30 2018-11-30             0.60000      NA 0.766560
 6:     111    1005  2018     12    7 2018-12-07             1.79175 1.02891 1.028910
 7:     111    1005  2018     12   15 2018-12-15             1.44210 1.15926 1.159260
 8:     111    1005  2018     12   21 2018-12-21             0.48000 0.99105 0.991050
 9:     111    1005  2018     12   28 2018-12-28             0.55350 0.97347 0.973470
10:     111    1005  2019      1    4 2019-01-04             0.36000 0.92547 0.925470
The values of R30 and R30_new are identical; in addition, R30_new contains results for the first 5 rows.
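To see what the join computes for a single row, the window for 2018-12-07 can be checked by hand in base R. This is only an illustrative check on the sample values, and the variable names are made up for the example. Both bounds are inclusive, so the window covers the current day plus the 30 preceding days.

```r
# Sample observations up to and including 2018-12-07
dates  <- as.Date(c("2018-11-03", "2018-11-09", "2018-11-16",
                    "2018-11-24", "2018-11-30", "2018-12-07"))
values <- c(0.48, 0.79035, 1.32105, 0.6414, 0.6, 1.79175)

upper <- as.Date("2018-12-07")
lower <- upper - 30                      # 2018-11-07

# Same condition as in the non-equi join: lower <= date <= upper
in_window <- dates >= lower & dates <= upper
check <- mean(values[in_window])         # 2018-11-03 falls outside the window
print(check)                             # 1.02891, the R30 value given for that row
```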
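For the sake of clarity, the other grouping variables have been ignored, but they can easily be included. The solution can also be extended to aggregate multiple value columns.
Data used in the examples above: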
library(data.table)
df_n <- fread("
cod_unb;cod_pdv;Years;Months;Days;date;volume_total_diario;R30
111;1005;2018;11;3;03/11/2018;0.48;
111;1005;2018;11;9;09/11/2018;0.79035;
111;1005;2018;11;16;16/11/2018;1.32105;
111;1005;2018;11;24;24/11/2018;0.6414;
111;1005;2018;11;30;30/11/2018;0.6;
111;1005;2018;12;7;07/12/2018;1.79175;1.02891
111;1005;2018;12;15;15/12/2018;1.4421;1.15926
111;1005;2018;12;21;21/12/2018;0.48;0.99105
111;1005;2018;12;28;28/12/2018;0.5535;0.97347
111;1005;2019;1;4;04/01/2019;0.36;0.92547
")
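The grouping variables left out above could be brought in by adding them to the join condition as equality conditions. The following is a minimal sketch under assumptions: the small table dt and its second cod_pdv value are invented for illustration, and only cod_unb and cod_pdv are used as groups.

```r
library(data.table)

# Invented two-group sample; only the first group reuses values from the OP's data
dt <- data.table(
  cod_unb = 111L,
  cod_pdv = rep(c(1005L, 2006L), each = 3),
  date = as.IDate(c("2018-11-03", "2018-11-09", "2018-12-07",
                    "2018-11-03", "2018-12-20", "2018-12-28")),
  volume_total_diario = c(0.48, 0.79035, 1.79175, 1, 2, 4)
)

# Equality on the grouping columns, non-equi conditions on the date window.
# by = .EACHI returns one aggregate per row of the lookup table, in row order.
roll <- dt[.(cod_unb = cod_unb, cod_pdv = cod_pdv, upper = date, lower = date - 30L),
           on = .(cod_unb, cod_pdv, date <= upper, date >= lower),
           .(R30_new = mean(volume_total_diario)), by = .EACHI]

dt[, R30_new := roll$R30_new]
print(dt)
```

Each window now only sees rows that share the same cod_unb and cod_pdv, so observations from one point of sale do not leak into another.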
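Since the OP asked for a way to perform the rolling aggregation on multiple columns, here is an example.
First, we need to create an additional value column, var2, in the OP's sample dataset: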
df_n <- fread("
cod_unb;cod_pdv;Years;Months;Days;date;volume_total_diario;R30
111;1005;2018;11;3;03/11/2018;0.48;
111;1005;2018;11;9;09/11/2018;0.79035;
111;1005;2018;11;16;16/11/2018;1.32105;
111;1005;2018;11;24;24/11/2018;0.6414;
111;1005;2018;11;30;30/11/2018;0.6;
111;1005;2018;12;7;07/12/2018;1.79175;1.02891
111;1005;2018;12;15;15/12/2018;1.4421;1.15926
111;1005;2018;12;21;21/12/2018;0.48;0.99105
111;1005;2018;12;28;28/12/2018;0.5535;0.97347
111;1005;2019;1;4;04/01/2019;0.36;0.92547
")[
, date := as.IDate(date, "%d/%m/%Y")][, var2 := .I][]
df_n
    cod_unb cod_pdv Years Months Days       date volume_total_diario     R30 var2
 1:     111    1005  2018     11    3 2018-11-03             0.48000      NA    1
 2:     111    1005  2018     11    9 2018-11-09             0.79035      NA    2
 3:     111    1005  2018     11   16 2018-11-16             1.32105      NA    3
 4:     111    1005  2018     11   24 2018-11-24             0.64140      NA    4
 5:     111    1005  2018     11   30 2018-11-30             0.60000      NA    5
 6:     111    1005  2018     12    7 2018-12-07             1.79175 1.02891    6
 7:     111    1005  2018     12   15 2018-12-15             1.44210 1.15926    7
 8:     111    1005  2018     12   21 2018-12-21             0.48000 0.99105    8
 9:     111    1005  2018     12   28 2018-12-28             0.55350 0.97347    9
10:     111    1005  2019      1    4 2019-01-04             0.36000 0.92547   10
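So, the column var2 has been added, which simply contains the row number.
Here is the code to aggregate multiple columns using the same aggregation function: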
cols <- c("volume_total_diario", "var2")
setDT(df_n)[, paste0("mean_", cols) :=
df_n[.(upper = date, lower = date - 30),
on = .(date <= upper, date >= lower),
lapply(.SD, mean),
.SDcols = cols, by = .EACHI][
, .SD, .SDcols = cols]][]
df_n
    cod_unb cod_pdv Years Months Days       date volume_total_diario     R30 var2 mean_volume_total_diario mean_var2
 1:     111    1005  2018     11    3 2018-11-03             0.48000      NA    1                 0.480000       1.0
 2:     111    1005  2018     11    9 2018-11-09             0.79035      NA    2                 0.635175       1.5
 3:     111    1005  2018     11   16 2018-11-16             1.32105      NA    3                 0.863800       2.0
 4:     111    1005  2018     11   24 2018-11-24             0.64140      NA    4                 0.808200       2.5
 5:     111    1005  2018     11   30 2018-11-30             0.60000      NA    5                 0.766560       3.0
 6:     111    1005  2018     12    7 2018-12-07             1.79175 1.02891    6                 1.028910       4.0
 7:     111    1005  2018     12   15 2018-12-15             1.44210 1.15926    7                 1.159260       5.0
 8:     111    1005  2018     12   21 2018-12-21             0.48000 0.99105    8                 0.991050       6.0
 9:     111    1005  2018     12   28 2018-12-28             0.55350 0.97347    9                 0.973470       7.0
10:     111    1005  2019      1    4 2019-01-04             0.36000 0.92547   10                 0.925470       8.0
Note that the new columns have been named programmatically.
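For data shaped like the OP's real table, with several value columns per point of sale, the same idea could be wrapped in a small helper. This is a sketch under assumptions, not a tested production function: roll_mean_cols is a hypothetical name, the grouping columns cod_unb and cod_pdv are hard-coded, and the toy table with col1 and col2 is invented for illustration.

```r
library(data.table)

# Hypothetical helper: appends "mean_<col>" columns holding the trailing-window
# mean of each column in `cols`, computed within cod_unb / cod_pdv groups.
# Modifies `dt` by reference and assumes `dt` has an IDate column named `date`.
roll_mean_cols <- function(dt, cols, window = 30L) {
  agg <- dt[.(cod_unb = cod_unb, cod_pdv = cod_pdv, upper = date, lower = date - window),
            on = .(cod_unb, cod_pdv, date <= upper, date >= lower),
            lapply(.SD, mean), .SDcols = cols, by = .EACHI]
  dt[, (paste0("mean_", cols)) := agg[, cols, with = FALSE]]
  dt[]
}

# Invented toy data with two value columns and two points of sale
toy <- data.table(
  cod_unb = 111L,
  cod_pdv = c(1005L, 1005L, 1005L, 2006L, 2006L),
  date = as.IDate(c("2018-11-03", "2018-11-09", "2018-12-15",
                    "2018-11-03", "2018-11-20")),
  col1 = c(1, 3, 5, 10, 20),
  col2 = c(2, 4, 6, 8, 10)
)

roll_mean_cols(toy, c("col1", "col2"))
print(toy)
```

Because := updates by reference, toy itself gains the new columns; passing copy(toy) instead would leave the original untouched.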