我可以通过循环遍历我的数据集多次,但认为必须有一种更有效的方法来通过data.table执行此操作。 这就是数据集的样子:
CaseID Won OwnerID Time_period Finished
1 yes A 1 no
1 yes A 3 no
1 yes A 5 yes
2 no A 4 no
2 no A 6 yes
3 yes A 2 yes
4 no A 3 yes
5 15 B 2 no
对于每一行,按照所有者,我想生成在该时间段之前完成的案例数量的平均值。
CaseID Won OwnerID Time_period Finished AvgWonByOwner
1 yes A 1 no NA
1 yes A 3 no 1
1 yes A 5 yes .5
2 no A 4 no .5
2 no A 6 yes 2/3
3 yes A 2 yes NA
4 no A 3 yes 1
5 15 B 2 no NA
详细看这一点,看起来非常复杂。我认为你可以通过某种滚动合并来做到这一点,但我不知道如何设置一个条件,其中平均值仅在行的日期之前从Won计算,并且必须具有相同的ownerID
编辑1:最后一栏中的数字说明
AvgWonByOwner Explanation
NA t = 1, No cases finished yet, this could be 0 too
1 t = 3, case 3 finished and is won, so average wins is 1
.5 t = 5, case 3 finished, won; case 4 finished lost; average = .5
.5 t = 4, case 3 finished, won; case 4 finished lost; average = .5
2/3 t = 6, case 3 finished, won, case 4 finished lost, case 1 finished won, average: 2/3
NA t = 1, No cases finished yet, this could be 0 too
1 t = 3, case 3 finished and is won, so average wins is 1
NA t = 1, No cases finished yet, this could be 0 too
答案 0 :(得分:4)
dt = data.table(structure(list(CaseID = c(1, 1, 1, 2, 2, 3, 4, 5), Won = structure(c(3L,
3L, 3L, 2L, 2L, 3L, 2L, 1L), .Label = c("15", "no", "yes"), class = "factor"),
OwnerID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("A",
"B"), class = "factor"), Time_period = c(1L, 3L, 5L, 4L,
6L, 2L, 3L, 2L), Finished = structure(c(1L, 1L, 2L, 1L, 2L,
2L, 2L, 1L), .Label = c("no", "yes"), class = "factor")), .Names = c("CaseID",
"Won", "OwnerID", "Time_period", "Finished"), row.names = c(NA,
-8L), class = c("data.table", "data.frame")))
# order
setkey(dt, OwnerID, Time_period)
# calculate the required ratio but including current time
dt[, ratio := cumsum(Finished == "yes" & Won == "yes") /
cumsum(Finished == "yes"),
by = list(OwnerID)]
# shift to satisfy the strict inequality as per OP
dt[, avgWon := c(NaN, ratio[-.N]), by = OwnerID]
# take the first one for each time (that is last one from previous time)
# so that all of the outcomes happening at same time are accounted for
dt[, avgWon := avgWon[1], by = key(dt)]
dt[order(OwnerID, CaseID)]
# CaseID Won OwnerID Time_period Finished ratio avgWon
#1: 1 yes A 1 no NaN NaN
#2: 1 yes A 3 no 1.0000000 1.0000000
#3: 1 yes A 5 yes 0.6666667 0.5000000
#4: 2 no A 4 no 0.5000000 0.5000000
#5: 2 no A 6 yes 0.5000000 0.6666667
#6: 3 yes A 2 yes 1.0000000 NaN
#7: 4 no A 3 yes 0.5000000 1.0000000
#8: 5 15 B 2 no NaN NaN
答案 1 :(得分:2)
## Compute a data.table recording the win percentage at end of each time period
B <- dt[Finished=="yes",]
B[,winpct := (cumsum(Won=="yes")/seq_along(Won)),by=OwnerID]
## Shift forward by one time step, as per OP's description of problem
B[,Time_period := Time_period + 1]
setkeyv(B, key(dt))
## Append win percentage column back to original data.table
cbind(dt, AvgWonByOwner=B[dt, winpct, roll=TRUE][["winpct"]])
# CaseID Won OwnerID Time_period Finished AvgWonByOwner
# 1: 1 yes A 1 no NA
# 2: 3 yes A 2 yes NA
# 3: 1 yes A 3 no 1.0000000
# 4: 4 no A 3 yes 1.0000000
# 5: 2 no A 4 no 0.5000000
# 6: 1 yes A 5 yes 0.5000000
# 7: 2 no A 6 yes 0.6666667
# 8: 5 15 B 2 no NA