Question

I have data with 3 columns that looks roughly like this:

uid <- c(1,1,1,1,1,1,2,2,2)
sale <- c(0,1,1,0,0,0,0,1,0)
e <- as.data.frame(cbind(uid, sale))
e$uid <- as.factor(e$uid)
e$sincesale <- NA

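For each unique ID, I want to apply the same procedure: count the number of days since the last sale.

I can easily come up with a for loop that does this. The problem is that I have millions of rows, so the loop takes far too long to finish. I would like to use tapply over e$uid. However, tapply only accepts a vector as input.

What alternative (faster than a for-loop) could I use?

My for-loop: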

for (i in 2:length(e$uid)){
  #working within the good with the same unique id (uid)
  if (e$uid[i] == e$uid[i-1]){
    if (e$sale[i]==1){
      #sale days themselves get no count (NA), as in the result tables
      e$sincesale[i] <- NA
    }
    if (e$sale[i]==0){
      #if sale just ended, number of days since sale is 1
      if (e$sale[i-1]==1){
        e$sincesale[i] <- 1
      }
      #if sale ended a few periods ago add 1 to previous value of "sincesale"
      if (e$sale[i-1] == 0){
        e$sincesale[i] <- e$sincesale[i-1] + 1
      }
    }
  }
}

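UPD:

Well, to be honest, I tried to work this out on my own last night and this morning, but I could not find a solution to a new problem. I tried the suggested methods, but one small issue is that they start counting "sincesale" from the very first row of each group (because sale == 0 is true for the first rows as well, even though no sale has taken place there yet). The following example input shows the result produced by the for loop ("sincesale") and by the suggested dplyr approach ("sincesale4"):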

uid <- c(1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,3,3,3,3,4,4,4)
sale <- c(0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,1,0,0,0,0)
e <- as.data.frame(cbind(uid, sale))
e$uid <- as.factor(e$uid)

   uid sale first sincesale sincesale4
1    1    0     1        NA          0
2    1    0     1        NA          1
3    1    1     0        NA          1
4    1    0     0         1          2
5    1    0     0         2          3
6    1    0     0         3          4
7    2    0     1        NA          0
8    2    1     1        NA          0
9    2    0     0         1          1
10   2    1     0        NA          1
11   3    0     1        NA          0
12   3    0     1        NA          1
13   3    0     0        NA          2
14   3    0     0        NA          3
15   3    0     0        NA          4
16   3    0     0        NA          5
17   3    1     0        NA          5
18   3    1     0        NA          5
19   3    0     0         1          6
20   4    0     1        NA          0
21   4    0     1        NA          1
22   4    0     0        NA          2

Answer 1

Use ave to look within each uid group and take the cumulative sum (cumsum) of the non-sale days:

e$sincesale2 <- ave(!e$sale, e$uid, FUN=cumsum)-1

#  uid sale sincesale sincesale2
#1   1    0        NA          0
#2   1    1        NA          0
#3   1    1        NA          0
#4   1    0         1          1
#5   1    0         2          2
#6   1    0         3          3
#7   2    0        NA          0
#8   2    1        NA          0
#9   2    0         1          1
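
To see why this works, it may help to run the same expression by hand on a single group. The sketch below uses the sale vector of uid 1 from the first example; the names sale_one, non_sale and running are only illustrative. It also shows, on a second made-up vector, that the result is a running total of non-sale days and does not restart after a later sale.

```r
# sale values of uid 1 from the first example in the question
sale_one <- c(0, 1, 1, 0, 0, 0)

# !sale_one is TRUE on non-sale days; cumsum() counts each TRUE as 1
non_sale <- !sale_one
running  <- cumsum(non_sale)

# subtract 1 so the first non-sale day of the group starts at 0
since_one <- running - 1
print(since_one)
# [1] 0 0 0 1 2 3

# illustrative vector with two separate sales: the count keeps growing
# across the second sale instead of restarting at 1
sale_two  <- c(0, 1, 0, 1, 0)
since_two <- cumsum(!sale_two) - 1
print(since_two)
# [1] 0 0 1 1 2
```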

Translated to data.table, this would be:

library(data.table)
setDT(e)
e[, sincesale3 := cumsum(!sale)-1, by=uid]

Or with dplyr, with a hat tip to @RonakShah:

library(dplyr)
e %>%
  group_by(uid) %>%
  mutate(sincesale4 = cumsum(!sale)-1)
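
For the case raised in the question's update (no count before the first sale of a group, and a count that restarts after every sale), one possible vectorised variant is sketched below. It is not part of the answers above: the helper days_since_sale and the column name sincesale5 are hypothetical, and the sketch assumes the rows of each uid are already in chronological order. The idea is to track, with cummax, the position of the most recent sale within the group and subtract it from the current position.

```r
# example input from the question's update
uid  <- c(1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,3,3,3,3,4,4,4)
sale <- c(0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,1,0,0,0,0)
e <- data.frame(uid = factor(uid), sale = sale)

# hypothetical helper: days since the most recent sale within one group
days_since_sale <- function(sale) {
  idx <- seq_along(sale)
  # position of the latest sale at or before each row (0 if none so far)
  last_sale <- cummax(ifelse(sale == 1, idx, 0L))
  out <- idx - last_sale
  # no count on sale days themselves or before the first sale of the group
  out[sale == 1 | last_sale == 0] <- NA
  out
}

# apply the helper per uid, in the same way as the ave() call above
e$sincesale5 <- ave(e$sale, e$uid, FUN = days_since_sale)
print(e)
```

On this input, sincesale5 should agree with the "sincesale" column that the for loop produces in the table from the update. The same helper can be dropped into the grouped data.table or dplyr calls in place of cumsum(!sale)-1.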

Converting a for-loop into an -apply function where the input is a data frame rather than a vector
