In data.table

Time: 2018-07-12 07:49:28

Tags: r data.table

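I need your help writing an efficient function to calculate a weighted difference. Preferably with data.table, since my dataset has 1 million records.

Here is a small df with two groups, 1 and 2. Within each group it is ordered by layer, from 1 up to a maximum of 5 (a group can have fewer than 5 layers; for example, group = 2 has only 3).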

group <- c(1, 1, 1, 1, 1, 2, 2, 2)
layer <- c(1, 2, 3, 4, 5, 1, 2, 3)
value <- c(10.62, 10.61, 10.60, 10.40, 10.34, 15.40, 15.39, 15.35)
reference_value <- c(10.90, 10.90, 10.90, 10.90, 10.90, 15.70, 15.70, 15.70)
size <- c(30, 120, 250, 40, 20, 70, 80, 90)
cumsum_size <- c(30, 150, 400, 440, 460, 70, 150, 240)

df <- data.frame(group, layer, value, reference_value, size, cumsum_size)

For each group, I need to calculate the weighted difference from the reference_value column for several target sizes. Say I have the following:

target_sizes <- c(20, 50, 200, 500)

To calculate the weighted difference, target_size has to be compared with cumsum_size to decide which layer it reaches, so for df$group == 1 the calculation should be:

# Case 1: target size 20
# it stays on layer 1, as 20 < 30
# weighted impact is
abs(10.62 - 10.90) * 20

# Case 2: target size 50
# it takes 30 from layer 1 and an additional 20 from layer 2
# weighted impact is
abs(10.62 - 10.90) * 30 + abs(10.61 - 10.90) * 20

# Case 3: target size 200
# it takes 30 from layer 1, 120 from layer 2 and 50 from layer 3
# weighted impact is
abs(10.62 - 10.90) * 30 + abs(10.61 - 10.90) * 120 + abs(10.60 - 10.90) * 50

# Case 4: target size 500
# the cumsum_size in the last layer 5 (460) is less than 500,
# so it should return NA
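
The four cases can be reproduced for a single group with a few lines of base R. This is only a sketch to pin down the rule being asked for: the helper name weighted_impact and its arguments are made up for illustration and are not part of the original post. It assumes the vectors are already ordered by layer.

```r
# Hypothetical helper (illustration only): weighted impact for ONE group.
# Assumes value and size are ordered by layer.
weighted_impact <- function(value, reference_value, size, target) {
  layer_end   <- cumsum(size)          # same as cumsum_size
  layer_start <- layer_end - size      # size already consumed before each layer
  # the target exceeds the total available size: no answer
  if (target > layer_end[length(layer_end)]) return(NA_real_)
  # amount taken from each layer: fill the layers in order until target is reached
  taken <- pmin(size, pmax(target - layer_start, 0))
  sum(abs(value - reference_value) * taken)
}

# group 1 from the example data
g1_value <- c(10.62, 10.61, 10.60, 10.40, 10.34)
g1_size  <- c(30, 120, 250, 40, 20)

impacts <- sapply(c(20, 50, 200, 500), function(target)
  weighted_impact(g1_value, 10.90, g1_size, target))
print(impacts)
```

For target 200, taken works out to 30, 120, 50, 0, 0, which matches Case 3. Calling this once per group and target would be slow on 1 million records, which is why a vectorised data.table solution is preferable.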

1 answer:

Answer 0 (score: 1)

Here is one possible approach.

# the absolute difference never changes, so calculate it first
DT[, absDiff := abs(value - reference_value)]

# add one row per (group, target size) pair, flagged with isTarg,
# alongside the existing layer rows; the result is sorted by group and cumsum_size
dat <- merge(DT, DT[, CJ(isTarg=TRUE, group=unique(group), cumsum_size=target_sizes)],
    by=c("group", "cumsum_size"), all=TRUE)

# backfill absDiff from the next layer boundary using zoo::na.locf;
# targets beyond the last layer keep NA
dat[, absDiff := na.locf(absDiff, na.rm=FALSE, fromLast=TRUE), by=.(group)]

# accumulate (segment width * absDiff) within each group,
# then keep only the rows for the chosen target_sizes
dat[, result := cumsum(diff(c(0, cumsum_size)) * absDiff), by=.(group)][(isTarg)]

Output:

   group cumsum_size layer value reference_value size absDiff isTarg result
1:     1          20    NA    NA              NA   NA    0.28   TRUE    5.6
2:     1          50    NA    NA              NA   NA    0.29   TRUE   14.2
3:     1         200    NA    NA              NA   NA    0.30   TRUE   58.2
4:     1         500    NA    NA              NA   NA      NA   TRUE     NA
5:     2          20    NA    NA              NA   NA    0.30   TRUE    6.0
6:     2          50    NA    NA              NA   NA    0.30   TRUE   15.0
7:     2         200    NA    NA              NA   NA    0.35   TRUE   63.3
8:     2         500    NA    NA              NA   NA      NA   TRUE     NA
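
To see why the merge, backfill and cumulative sum give these numbers, the same idea can be traced for group 1 alone in base R. This is a sketch, not the answer's code: findInterval stands in for the merge plus na.locf step, and the variable names are chosen here for illustration.

```r
# Group 1 only; layer_end corresponds to cumsum_size in the example data
layer_end    <- c(30, 150, 400, 440, 460)
abs_diff     <- abs(c(10.62, 10.61, 10.60, 10.40, 10.34) - 10.90)
target_sizes <- c(20, 50, 200, 500)

# breakpoints: layer boundaries and target sizes together, sorted
# (this is what the merged table holds in cumsum_size for one group)
breaks <- sort(union(layer_end, target_sizes))

# layer that each breakpoint falls in; a boundary belongs to the layer it closes.
# A breakpoint past the last layer gets index 6, so abs_diff[6] is NA.
layer_idx <- findInterval(breaks, layer_end, left.open = TRUE) + 1
seg_diff  <- abs_diff[layer_idx]

# width of each segment times the absDiff of its layer, accumulated
result <- cumsum(diff(c(0, breaks)) * seg_diff)
at_targets <- result[breaks %in% target_sizes]
print(at_targets)
```

Each segment between consecutive breakpoints lies inside exactly one layer, so multiplying its width by that layer's absDiff and accumulating gives the weighted impact at every breakpoint. The NA for a target beyond the last layer propagates through cumsum, which is what produces the NA rows in the output.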

Data and packages:

library(data.table)
library(zoo)
group <- c(1, 1, 1, 1, 1, 2, 2, 2)
layer <- c(1, 2, 3, 4, 5, 1, 2, 3)
value <- c(10.62, 10.61, 10.60, 10.40, 10.34, 15.40, 15.39, 15.35)
reference_value <- c(10.90, 10.90, 10.90, 10.90, 10.90, 15.70, 15.70, 15.70)
size <- c(30, 120, 250, 40, 20, 70, 80, 90)
cumsum_size <- c(30, 150, 400, 440, 460, 70, 150, 240)

DT <- data.table(group, layer, value, reference_value, size, cumsum_size)
target_sizes <- c(20, 50, 200, 500)
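
If the zoo dependency is unwanted, the backfill can probably be done with data.table's own nafill using type = "nocb" (next observation carried backward). The sketch below is a variation on the answer, not the answer itself, and assumes a data.table version recent enough to ship nafill. The names targets and res are introduced here for illustration.

```r
library(data.table)

DT <- data.table(
  group           = c(1, 1, 1, 1, 1, 2, 2, 2),
  value           = c(10.62, 10.61, 10.60, 10.40, 10.34, 15.40, 15.39, 15.35),
  reference_value = c(10.90, 10.90, 10.90, 10.90, 10.90, 15.70, 15.70, 15.70),
  cumsum_size     = c(30, 150, 400, 440, 460, 70, 150, 240)
)
target_sizes <- c(20, 50, 200, 500)

DT[, absDiff := abs(value - reference_value)]

# one flagged row per (group, target size) pair
targets <- CJ(group = unique(DT$group), cumsum_size = target_sizes, isTarg = TRUE)
dat <- merge(DT, targets, by = c("group", "cumsum_size"), all = TRUE)

# backfill within each group; targets past the last layer stay NA
dat[, absDiff := nafill(absDiff, type = "nocb"), by = group]
dat[, result := cumsum(diff(c(0, cumsum_size)) * absDiff), by = group]

# keep only the target rows and the columns of interest
res <- dat[(isTarg), .(group, target_size = cumsum_size, result)]
print(res)
```

The result column should agree with the output shown in the answer. Note that nafill only handles numeric columns, which is sufficient here because absDiff is a double.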