data.table with rolling join计算平均条件日期

时间:2013-11-12 21:34:03

标签: r data.table

我可以通过循环遍历我的数据集多次,但认为必须有一种更有效的方法来通过data.table执行此操作。 这就是数据集的样子:

CaseID         Won     OwnerID      Time_period    Finished
  1            yes        A              1              no
  1            yes        A              3              no
  1            yes        A              5              yes
  2            no         A              4              no
  2            no         A              6              yes
  3            yes        A              2              yes
  4            no         A              3              yes
  5            15         B              2              no

对于每一行,按照所有者,我想生成在该时间段之前完成的案例数量的平均值。

CaseID         Won     OwnerID      Time_period     Finished     AvgWonByOwner  
  1            yes        A              1              no            NA
  1            yes        A              3              no             1
  1            yes        A              5              yes           .5
  2            no         A              4              no            .5
  2            no         A              6              yes           2/3
  3            yes        A              2              yes           NA
  4            no         A              3              yes           1
  5            15         B              2              no            NA

详细看这一点,看起来非常复杂。我认为你可以通过某种滚动合并来做到这一点,但我不知道如何设置一个条件,其中平均值仅在行的日期之前从Won计算,并且必须具有相同的ownerID

编辑1:最后一栏中的数字说明

AvgWonByOwner          Explanation
   NA                  t = 1, No cases finished yet, this could be 0 too
   1                   t = 3, case 3 finished and is won, so average wins is 1
  .5                   t = 5, case 3 finished, won; case 4 finished lost; average = .5
  .5                   t = 4, case 3 finished, won; case 4 finished lost; average = .5
  2/3                  t = 6, case 3 finished, won, case 4 finished lost, case 1 finished won, average: 2/3
   NA                  t = 1, No cases finished yet, this could be 0 too
   1                   t = 3, case 3 finished and is won, so average wins is 1
   NA                  t = 1, No cases finished yet, this could be 0 too

2 个答案:

答案 0 :(得分:4)

dt = data.table(structure(list(CaseID = c(1, 1, 1, 2, 2, 3, 4, 5), Won = structure(c(3L, 
3L, 3L, 2L, 2L, 3L, 2L, 1L), .Label = c("15", "no", "yes"), class = "factor"), 
    OwnerID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("A", 
    "B"), class = "factor"), Time_period = c(1L, 3L, 5L, 4L, 
    6L, 2L, 3L, 2L), Finished = structure(c(1L, 1L, 2L, 1L, 2L, 
    2L, 2L, 1L), .Label = c("no", "yes"), class = "factor")), .Names = c("CaseID", 
"Won", "OwnerID", "Time_period", "Finished"), row.names = c(NA, 
-8L), class = c("data.table", "data.frame")))

# order
setkey(dt, OwnerID, Time_period)

# calculate the required ratio but including current time
dt[, ratio := cumsum(Finished == "yes" & Won == "yes") /
              cumsum(Finished == "yes"),
     by = list(OwnerID)]

# shift to satisfy the strict inequality as per OP
dt[, avgWon := c(NaN, ratio[-.N]), by = OwnerID]

# take the first one for each time (that is last one from previous time)
# so that all of the outcomes happening at same time are accounted for
dt[, avgWon := avgWon[1], by = key(dt)]

dt[order(OwnerID, CaseID)]
#   CaseID Won OwnerID Time_period Finished     ratio    avgWon
#1:      1 yes       A           1       no       NaN       NaN
#2:      1 yes       A           3       no 1.0000000 1.0000000
#3:      1 yes       A           5      yes 0.6666667 0.5000000
#4:      2  no       A           4       no 0.5000000 0.5000000
#5:      2  no       A           6      yes 0.5000000 0.6666667
#6:      3 yes       A           2      yes 1.0000000       NaN
#7:      4  no       A           3      yes 0.5000000 1.0000000
#8:      5  15       B           2       no       NaN       NaN

答案 1 :(得分:2)

## Compute a data.table recording the win percentage at end of each time period
B <- dt[Finished=="yes",]
B[,winpct := (cumsum(Won=="yes")/seq_along(Won)),by=OwnerID]

## Shift forward by one time step, as per OP's description of problem
B[,Time_period := Time_period + 1]
setkeyv(B, key(dt))

## Append win percentage column back to original data.table
cbind(dt, AvgWonByOwner=B[dt, winpct, roll=TRUE][["winpct"]])
#    CaseID Won OwnerID Time_period Finished AvgWonByOwner
# 1:      1 yes       A           1       no            NA
# 2:      3 yes       A           2      yes            NA
# 3:      1 yes       A           3       no     1.0000000
# 4:      4  no       A           3      yes     1.0000000
# 5:      2  no       A           4       no     0.5000000
# 6:      1 yes       A           5      yes     0.5000000
# 7:      2  no       A           6      yes     0.6666667
# 8:      5  15       B           2       no            NA