This code generates a dataset similar to my own:
df <- c(seq(as.Date("2012-01-01"), as.Date("2012-01-10"), "days"))
df <- as.data.frame(df)
df <- rbind(df, df)
id <- c(rep.int(1, 10), rep.int(2, 10))
id <- as.data.frame(id)
cnt <- c(1:3, 0, 0, 4, 5:8, 0, 1, 0, 1:7)
cnt <- as.data.frame(cnt)
df <- cbind(id, df, cnt)
names(df) <- c("id", "date", "cnt")
df$date[df$date == "2012-01-10"] <- "2012-01-20"
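For each id, I am trying to find the sum of the variable 'cnt' over the past 7 days. The dates are sometimes not consecutive (see the last date for each id in 'df' above).
Here is the loop: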
system.time(
for(i in 1:length(df$date)) {
df$cnt.weekly[i] <-
sum(df$cnt[which((df$date == df$date[i] - 1) & df$id == df$id[i])],
df$cnt[which((df$date == df$date[i] - 2) & df$id == df$id[i])],
df$cnt[which((df$date == df$date[i] - 3) & df$id == df$id[i])],
df$cnt[which((df$date == df$date[i] - 4) & df$id == df$id[i])],
df$cnt[which((df$date == df$date[i] - 5) & df$id == df$id[i])],
df$cnt[which((df$date == df$date[i] - 6) & df$id == df$id[i])])})
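For reference, the loop above adds up 'cnt' for the six calendar days before each row's date, within the same id, and leaves out the row's own day. The following is a minimal sketch that states that window explicitly. It is not faster than the loop (it still scans every row for every row); it is only meant as a compact definition to check faster versions against. The use of vapply and the name in_window are choices made for this illustration.

```r
# Rebuild the toy data in one step (same values as the code above).
df <- data.frame(
  id   = rep(c(1, 2), each = 10),
  date = rep(c(seq(as.Date("2012-01-01"), as.Date("2012-01-09"), by = "day"),
               as.Date("2012-01-20")), 2),
  cnt  = c(1:3, 0, 0, 4, 5:8, 0, 1, 0, 1:7)
)

# For each row: sum cnt over the same id and dates in [date - 6, date - 1].
df$cnt.weekly <- vapply(seq_len(nrow(df)), function(i) {
  in_window <- df$id == df$id[i] &
    df$date >= df$date[i] - 6 &
    df$date <= df$date[i] - 1
  sum(df$cnt[in_window])   # sum of an empty selection is 0
}, numeric(1))

df$cnt.weekly[1:10]
# id 1: 0 1 3 6 6 6 10 14 18 0
```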
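I will eventually run this loop on an 8-million-row data.frame (thousands of ids), so although the toy example here is fast, the real thing is very slow.
I have had great luck with the data.table package in other parts of my code, but I can't figure out how to make it work here. Maybe lapply inside data.table?
Thanks in advance!
Answer 0 (score: 5)
How about: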
> DT = as.data.table(df)
> DT
id date cnt
[1,] 1 2012-01-01 1
[2,] 1 2012-01-02 2
[3,] 1 2012-01-03 3
[4,] 1 2012-01-04 0
[5,] 1 2012-01-05 0
[6,] 1 2012-01-06 4
[7,] 1 2012-01-07 5
[8,] 1 2012-01-08 6
[9,] 1 2012-01-09 7
[10,] 1 2012-01-20 8
[11,] 2 2012-01-01 0
[12,] 2 2012-01-02 1
[13,] 2 2012-01-03 0
[14,] 2 2012-01-04 1
[15,] 2 2012-01-05 2
[16,] 2 2012-01-06 3
[17,] 2 2012-01-07 4
[18,] 2 2012-01-08 5
[19,] 2 2012-01-09 6
[20,] 2 2012-01-20 7
Then take the cumulative sum within each group. This step is currently ugly, but := by group (coming soon in 1.8.1) will tidy it up.
> DT[,cumcnt:=DT[,cumsum(cnt),by=id][[2]]]
id date cnt cumcnt
[1,] 1 2012-01-01 1 1
[2,] 1 2012-01-02 2 3
[3,] 1 2012-01-03 3 6
[4,] 1 2012-01-04 0 6
[5,] 1 2012-01-05 0 6
[6,] 1 2012-01-06 4 10
[7,] 1 2012-01-07 5 15
[8,] 1 2012-01-08 6 21
[9,] 1 2012-01-09 7 28
[10,] 1 2012-01-20 8 36
[11,] 2 2012-01-01 0 0
[12,] 2 2012-01-02 1 1
[13,] 2 2012-01-03 0 1
[14,] 2 2012-01-04 1 2
[15,] 2 2012-01-05 2 4
[16,] 2 2012-01-06 3 7
[17,] 2 2012-01-07 4 11
[18,] 2 2012-01-08 5 16
[19,] 2 2012-01-09 6 22
[20,] 2 2012-01-20 7 29
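As a sketch of what the grouped form looks like once := accepts by (data.table 1.8.1 or later is assumed here), the cumulative sum can be assigned in one step. The data construction below is a compact stand-in for the question's toy data.

```r
library(data.table)

# Same toy data as in the question, built directly as a data.table.
DT <- data.table(
  id   = rep(c(1, 2), each = 10),
  date = rep(c(seq(as.Date("2012-01-01"), as.Date("2012-01-09"), by = "day"),
               as.Date("2012-01-20")), 2),
  cnt  = c(1:3, 0, 0, 4, 5:8, 0, 1, 0, 1:7)
)

# Running total of cnt within each id, added by reference.
DT[, cumcnt := cumsum(cnt), by = id]

DT[id == 2, cumcnt]
# 0 1 1 2 4 7 11 16 22 29
```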
Now join to the date 7 days earlier, rolling back to the previous observation so that irregular dates are allowed:
> setkey(DT,id,date)
> DT[,before7dayago:=DT[SJ(id,date-7),cumcnt,roll=TRUE,mult="last"]]
id date cnt cumcnt before7dayago
[1,] 1 2012-01-01 1 1 NA
[2,] 1 2012-01-02 2 3 NA
[3,] 1 2012-01-03 3 6 NA
[4,] 1 2012-01-04 0 6 NA
[5,] 1 2012-01-05 0 6 NA
[6,] 1 2012-01-06 4 10 NA
[7,] 1 2012-01-07 5 15 NA
[8,] 1 2012-01-08 6 21 1
[9,] 1 2012-01-09 7 28 3
[10,] 1 2012-01-20 8 36 28
[11,] 2 2012-01-01 0 0 NA
[12,] 2 2012-01-02 1 1 NA
[13,] 2 2012-01-03 0 1 NA
[14,] 2 2012-01-04 1 2 NA
[15,] 2 2012-01-05 2 4 NA
[16,] 2 2012-01-06 3 7 NA
[17,] 2 2012-01-07 4 11 NA
[18,] 2 2012-01-08 5 16 0
[19,] 2 2012-01-09 6 22 1
[20,] 2 2012-01-20 7 29 22
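The call above was written against a 2012 release of data.table. A sketch of the same rolling join as it might look in a more recent version, assuming the on= argument is available (so no key is needed), is shown below. The names lookup, before, and sum7 are made up for this illustration.

```r
library(data.table)

DT <- data.table(
  id   = rep(c(1, 2), each = 10),
  date = rep(c(seq(as.Date("2012-01-01"), as.Date("2012-01-09"), by = "day"),
               as.Date("2012-01-20")), 2),
  cnt  = c(1:3, 0, 0, 4, 5:8, 0, 1, 0, 1:7)
)
DT[, cumcnt := cumsum(cnt), by = id]

# One lookup row per original row: same id, date shifted back 7 days.
lookup <- DT[, .(id, date = date - 7)]

# Rolling join: for each lookup row take cumcnt at that date, or at the
# last earlier date for the same id; NA when the id has no earlier date.
before <- DT[lookup, cumcnt, on = .(id, date), roll = TRUE]

DT[, before7dayago := before]
DT[, sum7 := cumcnt - before7dayago]

DT[id == 1, before7dayago]
# NA NA NA NA NA NA NA 1 3 28
```

The roll applies only to the last join column (date), within matching values of id, so one id's totals never leak into another's.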
Finally, subtract one from the other.
> DT[,`7daysum`:=cumcnt-before7dayago]
id date cnt cumcnt before7dayago 7daysum
[1,] 1 2012-01-01 1 1 NA NA
[2,] 1 2012-01-02 2 3 NA NA
[3,] 1 2012-01-03 3 6 NA NA
[4,] 1 2012-01-04 0 6 NA NA
[5,] 1 2012-01-05 0 6 NA NA
[6,] 1 2012-01-06 4 10 NA NA
[7,] 1 2012-01-07 5 15 NA NA
[8,] 1 2012-01-08 6 21 1 20
[9,] 1 2012-01-09 7 28 3 25
[10,] 1 2012-01-20 8 36 28 8
[11,] 2 2012-01-01 0 0 NA NA
[12,] 2 2012-01-02 1 1 NA NA
[13,] 2 2012-01-03 0 1 NA NA
[14,] 2 2012-01-04 1 2 NA NA
[15,] 2 2012-01-05 2 4 NA NA
[16,] 2 2012-01-06 3 7 NA NA
[17,] 2 2012-01-07 4 11 NA NA
[18,] 2 2012-01-08 5 16 0 16
[19,] 2 2012-01-09 6 22 1 21
[20,] 2 2012-01-20 7 29 22 7
That should be very fast.
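One thing to watch: the 7daysum column above covers the current day plus the six days before it, and is NA until a full week of history exists. For id 1 on 2012-01-08 it gives 20 (2+3+0+0+4+5+6). The loop in the question instead covers the six days before the current day and returns partial sums at the start, which would give 14 for that row. If the loop's window is the one wanted, the same cumulative-sum idea can probably be adapted as sketched below, assuming a data.table version with on= support. The helper cum_at and the name weekly are hypothetical.

```r
library(data.table)

DT <- data.table(
  id   = rep(c(1, 2), each = 10),
  date = rep(c(seq(as.Date("2012-01-01"), as.Date("2012-01-09"), by = "day"),
               as.Date("2012-01-20")), 2),
  cnt  = c(1:3, 0, 0, 4, 5:8, 0, 1, 0, 1:7)
)
DT[, cumcnt := cumsum(cnt), by = id]

# Cumulative count per id as of (date - offset), rolled back to the last
# observed date; 0 when the id has no observation on or before that date.
cum_at <- function(offset) {
  x <- DT[DT[, .(id, date = date - offset)], cumcnt,
          on = .(id, date), roll = TRUE]
  x[is.na(x)] <- 0
  x
}

# Sum over [date - 6, date - 1] = total up to date - 1 minus total up to date - 7.
weekly <- cum_at(1) - cum_at(7)
DT[, cnt.weekly := weekly]

DT[id == 1, cnt.weekly]
# 0 1 3 6 6 6 10 14 18 0
```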