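I'm trying to find, for each group, the first date that is followed by at least one record in the next week. A week here does not start on a Monday; it is simply defined as a span of seven days.
Treating a given date as the first day of week one, I want to test whether the second "week" (7 to 13 days after that date) contains at least one date record.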
library(data.table)
dt <- data.table(date  = c(1, 9, 10, 15, 18, 3, 4, 7, 7, 19, 21, 27),
                 group = c(rep("a", 5), rep("b", 7)))
> dt
    date group
 1:    1     a
 2:    9     a
 3:   10     a
 4:   15     a
 5:   18     a
 6:    3     b
 7:    4     b
 8:    7     b
 9:    7     b
10:   19     b
11:   21     b
12:   27     b
A for loop that works on a data.frame looks like this:
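df <- data.frame(dt)
df$count <- 0L  # create the column before filling it row by row
for (i in seq_len(nrow(df))) {
  # number of records in the same group dated 7 to 13 days after row i
  df$count[i] <- sum(df$date >= df$date[i] + 7 &
                     df$date < df$date[i] + 14 &
                     df$group == df$group[i])
}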
> df
   date group count
1     1     a     2
2     9     a     1
3    10     a     1
4    15     a     0
5    18     a     0
6     3     b     0
7     4     b     0
8     7     b     1
9     7     b     1
10   19     b     1
11   21     b     0
12   27     b     0
For each group, the first date with a count greater than 0 gives the start date of the first week: 1 in group "a" and 7 in group "b".
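As a minimal sketch of that last step, assuming the counts have already been computed and stored in a column named count, the first qualifying date per group could be pulled out as follows. The names res and start are made up for illustration, and the rows are assumed to be sorted by date within each group.

```r
library(data.table)

# Example data from the question, with the counts from the loop filled in.
res <- data.table(
  date  = c(1, 9, 10, 15, 18, 3, 4, 7, 7, 19, 21, 27),
  group = c(rep("a", 5), rep("b", 7)),
  count = c(2, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0)
)

# Keep rows with at least one record in the following week, then take the
# first remaining date in each group.
first_start <- res[count > 0, .(start = date[1]), by = group]
first_start
```

On the example data this should give 1 for group "a" and 7 for group "b", matching the values stated above.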
My real data.table has more than ten million rows, so ideally I'd like a function equivalent to the for loop above that I could use like this:
dt[, date/sum(date), by=group]
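The problem is that I don't understand how to write a function that uses indices in a way that works with data.table. Any help is much appreciated.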
Answer 0 (score: 3)
I think this works:
# set the key for the rolling merges
setkey(dt, group, date)
# find start and end point of the intervals you want
start = dt[J(group, date + 7 ), .I, roll = -Inf, by = .EACHI]$I
end = dt[J(group, date + 13), .I, roll = Inf, by = .EACHI]$I
# if start is 0, the first condition is not satisfied, so set count to 0
dt[, count := (start != 0) * (end - start + 1)]
dt
# date group count
# 1: 1 a 2
# 2: 9 a 1
# 3: 10 a 1
# 4: 15 a 0
# 5: 18 a 0
# 6: 3 b 0
# 7: 4 b 0
# 8: 7 b 1
# 9: 7 b 1
#10: 19 b 1
#11: 21 b 0
#12: 27 b 0
Answer 1 (score: 1)
Unfortunately, the solution suggested by @eddi no longer works with R 3.1.2 and data.table 1.9.4. It fails with this error:
Error in dt[J(group, date + 13), .I, roll = Inf]$.I :
$ operator is invalid for atomic vectors
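The code below works, but it is a quick-and-dirty workaround that uses the new foverlaps function. Surely there must be a way to fix the rolling-join solution?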
# Find start and end point of the intervals you want
dt[, start := date + 7]
dt[, end := date + 13]
# Make two data tables for overlapping dates.
dt2 <- dt[, c("group", "start", "end"), with=FALSE]
dt[, date2 := date] # copy date (foverlaps need an interval).
# Sort by date and overlap-merge with week ranges.
setkey(dt, group, date, date2)
dt3 <- foverlaps(dt2, dt, by.x=c("group", "start", "end"))
# Count unique values to get number of records in following week.
setkey(dt, group, start, end)
setkey(dt3, group, i.start, i.end)
dt4 <- unique(dt)[dt3]
dt4[, count := ifelse(is.na(i.start), 0L, length(unique(i.start))), by=date]
# Cleaning up.
dt5 <- dt[unique(dt4)]
dt5 <- dt5[, c("date", "group", "count"), with=FALSE]
# > dt5
# date group count
# 1: 1 a 2
# 2: 9 a 1
# 3: 10 a 1
# 4: 15 a 0
# 5: 18 a 0
# 6: 3 b 0
# 7: 4 b 0
# 8: 7 b 1
# 9: 7 b 1
#10: 19 b 1
#11: 21 b 0
#12: 27 b 0
I'm very curious about a simple fix, if there is one.
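One possibly simpler route, offered as a sketch rather than as a repair of the rolling join itself, is a non-equi join. This assumes a data.table version that supports inequality conditions in on (1.9.8 or later, so newer than the 1.9.4 mentioned above). The names win, lo, hi and cnt are made up for illustration.

```r
library(data.table)

dt <- data.table(date  = c(1, 9, 10, 15, 18, 3, 4, 7, 7, 19, 21, 27),
                 group = c(rep("a", 5), rep("b", 7)))

# One window per row: [date + 7, date + 14) within the same group.
win <- dt[, .(group, lo = date + 7, hi = date + 14)]

# Join dt to the windows and count the matching rows for each window.
# With by = .EACHI there is one result row per row of win, in the same order,
# and .N is 0 for windows that match nothing.
cnt <- dt[win, on = .(group, date >= lo, date < hi), .N, by = .EACHI]$N

dt[, count := cnt]
dt
```

Because the join is done in one call instead of row by row, it avoids the explicit loop, though memory use on ten million rows would still need checking.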
Answer 2 (score: -1)
Why not just use the loop you already wrote?
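dt[, count := 0L]  # create the count column before filling it
for (i in seq_len(nrow(dt))) {
  # set() updates a single cell by reference, without copying the table
  set(dt, i, "count", sum(dt$date >= dt$date[i] + 7 &
                          dt$date < dt$date[i] + 14 &
                          dt$group == dt$group[i]))
}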
dt
# date group count
# 1: 1 a 2
# 2: 9 a 1
# 3: 10 a 1
# 4: 15 a 0
# 5: 18 a 0
# 6: 3 b 0
# 7: 4 b 0
# 8: 7 b 1
# 9: 7 b 1
#10: 19 b 1
#11: 21 b 0
#12: 27 b 0
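by works much like tapply. You split the data.table into mini data.tables by the values of a column (e.g. group), run a function on each whole mini data.table, return something for each one, and then combine the returned pieces to produce your output.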
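To illustrate that split-apply-combine pattern on the question's data, here is a minimal sketch of a function that receives one group's dates and returns one count per date. The helper name count_next_week is made up for illustration, and the left.open argument of findInterval assumes R 3.3.0 or later.

```r
library(data.table)

dt <- data.table(date  = c(1, 9, 10, 15, 18, 3, 4, 7, 7, 19, 21, 27),
                 group = c(rep("a", 5), rep("b", 7)))

# Hypothetical helper: for each date d, count the dates x in the same vector
# with d + 7 <= x < d + 14.
count_next_week <- function(d) {
  s <- sort(d)  # findInterval() requires a sorted vector
  # With left.open = TRUE, findInterval() returns how many elements of s are
  # strictly below each query value, so the difference counts the window.
  findInterval(d + 14, s, left.open = TRUE) -
    findInterval(d + 7, s, left.open = TRUE)
}

# by = group hands each group's date column to the helper separately.
dt[, count := count_next_week(date), by = group]
dt
```

The helper is vectorised over a whole group, so it is called once per group rather than once per row.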