在编写有效函数以计算加权差值时,我需要您的帮助。最好使用data.table,因为我的数据集有100万条记录。
这是一个具有两个第1组和第2组的小df。它按从1到最大5的层顺序排列(也可以少于5层,例如group = 2只有3个)
group <- c(1, 1, 1, 1, 1, 2, 2, 2)
layer <- c(1, 2, 3, 4, 5, 1, 2, 3)
value <- c(10.62, 10.61, 10.60, 10.40, 10.34, 15.40, 15.39, 15.35)
reference_value <- c(10.90, 10.90, 10.90, 10.90, 10.90, 15.70, 15.70, 15.70)
size <- c(30, 120, 250, 40, 20, 70, 80, 90)
cumsum_size <- c(30, 150, 400, 440, 460, 70, 150, 240)
df <- data.frame(group, layer, value, reference_value, size, cumsum_size)
我需要为每个组的不同目标尺寸计算与 reference_value 列的加权差。说,我有以下
target_sizes <- c(20, 50, 200, 500)
要计算加权差异,需要将target_size与cumsum_size进行比较以决定它在哪一层,因此,对于df $ group == 1,计算应为:
Case 1
target size 20
# it stays on layer 1 as 20 < 30
weighted impact is
abs(10.62-10.90)*20
Case 2
# target size 50
it takes 30 from layer 1 and additional 20 from layer 2
weighted impact is
abs(10.62 - 10.90)*30 + abs(10.61 - 10.90)*20
Case 3
# target size 200
it takes 30 from layer 1, 120 from layer 2 and 50 from layer 3
weighted impact is
abs(10.62 - 10.90)*30 + abs(10.61 - 10.90)*120 + abs(10.60 - 10.90)*50
Case 4
# target size 500
as the cumsum_size in the last layer 5 (460) less than 500
it should return NA
答案 0 :(得分:1)
这是一种可能的方法。
#since abs diff never changes, calculate it first
DT[, absDiff := abs(value - reference_value)]
#merge the unique set of group & cumsum_size with DT
dat <- merge(DT, DT[,CJ(isTarg=TRUE, group=unique(group), cumsum_size=target_sizes)],
by=c("group", "cumsum_size"), all=TRUE)
#backfill the absDiff using zoo::na.locf
dat[, absDiff := na.locf(absDiff, na.rm=FALSE, fromLast=TRUE), by=.(group)]
#calculate the desired results by group and filter for chosen target_sizes
dat[, result := cumsum(diff(c(0, cumsum_size)) * absDiff), by=.(group)][(isTarg)]
输出:
group cumsum_size layer value reference_value size absDiff isTarg result
1: 1 20 NA NA NA NA 0.28 TRUE 5.6
2: 1 50 NA NA NA NA 0.29 TRUE 14.2
3: 1 200 NA NA NA NA 0.30 TRUE 58.2
4: 1 500 NA NA NA NA NA TRUE NA
5: 2 20 NA NA NA NA 0.30 TRUE 6.0
6: 2 50 NA NA NA NA 0.30 TRUE 15.0
7: 2 200 NA NA NA NA 0.35 TRUE 63.3
8: 2 500 NA NA NA NA NA TRUE NA
数据和数据包:
library(data.table)
library(zoo)
group <- c(1, 1, 1, 1, 1, 2, 2, 2)
layer <- c(1, 2, 3, 4, 5, 1, 2, 3)
value <- c(10.62, 10.61, 10.60, 10.40, 10.34, 15.40, 15.39, 15.35)
reference_value <- c(10.90, 10.90, 10.90, 10.90, 10.90, 15.70, 15.70, 15.70)
size <- c(30, 120, 250, 40, 20, 70, 80, 90)
cumsum_size <- c(30, 150, 400, 440, 460, 70, 150, 240)
DT <- data.table(group, layer, value, reference_value, size, cumsum_size)
target_sizes <- c(20, 50, 200, 500)