Question

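I am trying to compute a mean over an expanding window, but the structure of my data means that every previous answer I found is missing at least one piece of what I need (the closest one: link).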

My data looks like this:

  Company TimePeriod IndividualID Date.Indiv.Acted  Value 
1  1         2015          A           2015-01-01    400
2  1         2015          B           2015-02-01    200
3  1         2015          A           2015-06-15    400
4  1         2015          C           2015-07-12    300
5  1         2016          A           2016-07-15    400
6  1         2016          B           2016-08-09    100
7  1         2016          C           2016-09-10    400
8  1         2016          A           2016-10-11    100
9  2         2004          A           2004-07-12    200
10 2         2004          B           2004-08-12    300

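For each Date.Indiv.Acted, I need the cumulative mean of Value within its Company-TimePeriod group. However, I need to drop duplicates while keeping the most recent copy. The first two means are no problem: they cover row 1, and then rows 1 and 2. But the mean over rows 1, 2 and 3 should drop row 1, because its IndividualID is repeated in row 3. Essentially, I have forecast data, and I want each mean to use only each individual's most recent forecast.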

So my final data would look like this (the row annotations are only there for explanation and are not needed in the data):

  Company TimePeriod IndividualID Date.Indiv.Acted  Value CumMean 
1  1         2015          A           2015-01-01    400   400
2  1         2015          B           2015-02-01    200   300 (row 1 and 2)
3  1         2015          A           2015-06-15    400   300 (row 2 and 3)
4  1         2015          C           2015-07-12    300   300 (2,3,4)
5  1         2016          A           2016-07-15    400   400 (5)
6  1         2016          B           2016-08-09    100   250 (5,6)
7  1         2016          C           2016-09-10    400   300 (5,6,7)
8  1         2016          A           2016-10-11    100   200 (6,7,8)
9  2         2004          A           2004-07-12    200   200 (9)
10 2         2004          B           2004-08-12    300   250 (9,10)

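A data.table solution would be ideal, but I'm not picky, as long as it runs on fairly large data (about 20 million rows) and doesn't take until the heat death of the universe.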

Any help would be much appreciated.

Answer 1

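library(data.table)
setDT(dt)  # dt is the data.frame defined at the end of this answer
# occ: running count of each individual's rows within Company-TimePeriod
dt[, occ := seq_len(.N), by = .(Company, TimePeriod, IndividualID)]
# n: number of distinct individuals seen so far in the group
dt[, n := cumsum(!duplicated(IndividualID)), by = .(Company, TimePeriod)]
dt[, Value1 := Value]
# x: change in Value since the same individual's previous row
dt[, x := c(0, diff(Value)), by = .(Company, TimePeriod, IndividualID)]
# on repeat rows add only the change, so the old value drops out of the running sum
dt[occ > 1, Value1 := x]
dt[, Cummean := cumsum(Value1) / n, by = .(Company, TimePeriod)]
dt[, c("occ", "n", "Value1", "x") := NULL][]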
#    Company TimePeriod IndividualID Date.Indiv.Acted Value Cummean
# 1:       1       2015            A       2015-01-01   400     400
# 2:       1       2015            B       2015-02-01   200     300
# 3:       1       2015            A       2015-06-15   400     300
# 4:       1       2015            C       2015-07-12   300     300
# 5:       1       2016            A       2016-07-15   400     400
# 6:       1       2016            B       2016-08-09   100     250
# 7:       1       2016            C       2016-09-10   400     300
# 8:       1       2016            A       2016-10-11   100     200
# 9:       2       2004            A       2004-07-12   200     200
#10:       2       2004            B       2004-08-12   300     250

dt <- structure(list(Company = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2), TimePeriod = c(2015, 
2015, 2015, 2015, 2016, 2016, 2016, 2016, 2004, 2004), IndividualID = c("A", 
"B", "A", "C", "A", "B", "C", "A", "A", "B"), Date.Indiv.Acted = c("2015-01-01", 
"2015-02-01", "2015-06-15", "2015-07-12", "2016-07-15", "2016-08-09", 
"2016-09-10", "2016-10-11", "2004-07-12", "2004-08-12"), Value = c(400, 
200, 400, 300, 400, 100, 400, 100, 200, 300)), row.names = c(NA, 
-10L), class = "data.frame")
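
One way to read the code above: on an individual's repeat row, adding only the change in Value to the running sum has the same effect as removing the old value and adding the new one, and n counts the distinct individuals seen so far. The following base R walk-through of the 2016 group of Company 1 is a sketch for illustration, not part of the original answer; the names `prev`, `last_seen` and `delta` are hypothetical.

```r
# Rows 5 to 8 of the sample data (Company 1, TimePeriod 2016), in date order.
id    <- c("A", "B", "C", "A")
value <- c(400, 100, 400, 100)

# prev[i]: the same individual's previous Value, or 0 on first appearance.
prev <- numeric(length(value))
last_seen <- list()
for (i in seq_along(value)) {
  if (!is.null(last_seen[[id[i]]])) prev[i] <- last_seen[[id[i]]]
  last_seen[[id[i]]] <- value[i]
}

delta <- value - prev             # contribution of each row to the running sum
n     <- cumsum(!duplicated(id))  # distinct individuals seen so far
cummean <- cumsum(delta) / n
print(cummean)
```

The last row contributes 100 - 400 = -300, which swaps A's old forecast for the new one while n stays at 3.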

Answer 2

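I particularly dislike loops, but I think this one is simple enough to follow step by step. It can easily be changed to compute some other metric instead of the mean (for example, a cumulative variance).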

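# function that drops duplicates and calculates the mean of what is left
fun.attempt <- function(dat, dup, value){
  #dat: data set (a data.table)
  #dup: string column to look for duplicates
  #value: string column to calculate the mean

  # keep only the last row for each duplicated key
  x <- dat[!duplicated(get(dup), fromLast = TRUE), .(get(value))]

  # running mean over the deduplicated rows
  y <- cumsum(x) / seq_len(nrow(x))

  # only the last entry is needed: the mean over all kept rows
  y <- y[nrow(y)]
  return(y)
}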

# foo is assumed to be the question's data as a data.table (it already has the CumMean column)
foo[, grp := .GRP, by = .(Company, TimePeriod)] # to create a more efficient loop
hl <- list() # as storage

for(k in unique(foo$grp)){

  got <- foo[grp == k] # running the cumulative mean for each grouping

  for(y in seq_len(nrow(got))){
    # applying customized function to rows 1..y of the group
    got[y, cummean2 := fun.attempt(got[1:y], 'IndividualID', 'Value')]

  }

  hl[[k]] <- got # storing the subsetted data.tables

}

Now just bind the list of data.tables together. The CumMean column is your original calculation; cummean2 is mine.

rbindlist(hl)
    Company TimePeriod IndividualID Date.Indiv.Acted Value CumMean grp cummean2
 1:       1       2015            A       2015-01-01   400     400   1      400
 2:       1       2015            B       2015-02-01   200     300   1      300
 3:       1       2015            A       2015-06-15   400     300   1      300
 4:       1       2015            C       2015-07-12   300     300   1      300
 5:       1       2016            A       2016-07-15   400     400   2      400
 6:       1       2016            B       2016-08-09   100     250   2      250
 7:       1       2016            C       2016-09-10   400     300   2      300
 8:       1       2016            A       2016-10-11   100     200   2      200
 9:       2       2004            A       2004-07-12   200     200   3      200
10:       2       2004            B       2004-08-12   300     250   3      250
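
To illustrate the remark that the metric can be swapped, here is a minimal base R sketch that takes the statistic as an argument. It is not the code from this answer: `expanding_latest` and `by_group` are hypothetical helper names, and the sketch assumes the rows are already in date order within each Company-TimePeriod group.

```r
# Sample data from the question, in date order within each group.
df <- data.frame(
  Company      = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2),
  TimePeriod   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2004, 2004),
  IndividualID = c("A", "B", "A", "C", "A", "B", "C", "A", "A", "B"),
  Value        = c(400, 200, 400, 300, 400, 100, 400, 100, 200, 300),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each row i of one group, apply `stat` to the
# latest Value of every individual seen in rows 1..i.
expanding_latest <- function(id, value, stat) {
  vapply(seq_along(value), function(i) {
    seen <- seq_len(i)
    keep <- !duplicated(id[seen], fromLast = TRUE)
    stat(value[seen][keep])
  }, numeric(1))
}

# Hypothetical wrapper: run the helper separately per Company-TimePeriod.
by_group <- function(df, stat) {
  out <- rep(NA_real_, nrow(df))
  grp <- paste(df$Company, df$TimePeriod)
  for (g in unique(grp)) {
    rows <- which(grp == g)
    out[rows] <- expanding_latest(df$IndividualID[rows], df$Value[rows], stat)
  }
  out
}

df$cummean <- by_group(df, mean)  # same metric as in the answer
df$cumvar  <- by_group(df, var)   # NA while only one individual has been seen
print(df)
```

Like the loop above, this recomputes the statistic from scratch for every row, so the cost grows roughly with the square of the group size; on very large groups that may be too slow.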

Cumulative (expanding-window) mean by group, with duplicates checked at each calculation
