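I am using R with data.table, and I want to group a data.table by running (time) intervals, i.e. by overlapping bins. For each of these running intervals I want to count the occurrences of equal data pairs. Moreover, these "equal data pairs" do not have to be exactly equal; being equal within some range is enough.
A simple version of the problem looks like this: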
#Time X Y Counts
# ... ... ... 1
#I would like to do:
DT[, sum(counts), by = list(Time, X, Y)]
#with Time, X and Y being in overlapping intervals.
findInterval() would give me bins with "hard borders", not overlapping ones.
The problem in more detail: suppose I have a data.table:
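To illustrate the "hard borders" point, here is a minimal sketch using base R's findInterval(). The break points c(0, 5, 10) and the variable names are made up for the illustration; every time value lands in exactly one bin, so neighbouring values on either side of a break never share a group.

```r
# Illustrative time values and arbitrary break points (assumptions for this sketch).
t_example <- c(1, 1, 2, 4, 5, 5, 6, 7, 8, 8, 8, 8, 12, 13)
breaks <- c(0, 5, 10)

# findInterval() assigns each value to the single half-open bin [breaks[k], breaks[k + 1]).
bins <- findInterval(t_example, breaks)

# Values 4 and 5 are only one time unit apart but end up in different bins.
split(t_example, bins)
```

With running intervals, by contrast, the values 4 and 5 would each appear in the other's window.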
library(data.table)
Time <- c(1,1,2,4,5,5,6,7,8,8,8,8,12,13)
# repeated Time values are allowed.
X <- c(6,6,7,10,5,7,6,3,9,10,6,3,3,6)
Y <- c(2,6,10,3,4,6,6,9,4,9,6,6,9,9)
DT <- data.table(Time, X, Y)
Time X Y
1: 1 6 2
2: 1 6 6
3: 2 7 10
4: 4 10 3
5: 5 5 4
6: 5 7 6
7: 6 6 6
8: 7 3 9
9: 8 9 4
10: 8 10 9
11: 8 6 6
12: 8 3 6
13: 12 3 9
14: 13 6 9
And some predefined interval sizes:
Timeinterval <- 5
#for a time value of 10 this means to look from 10-5 to 10+5
RangeX.percentage <- 0.5
RangeY.percentage <- 0.5
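The result should give me an additional column, let's call it "count", holding the number of simultaneous occurrences of the data pair X and Y, taking the ranges of X and Y into account.
I thought about some kind of grouping by time interval, such as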
c(1, 1, 2, 4, 5, 5, 6) #for the first item: (1-5):(1+5)
c(1, 1, 2, 4, 5, 5, 6) # for the second item: (1-5):(1+5)
c(1, 1, 2, 4, 5, 5, 6, 7) # for the third item: (2-5):(2+5)
#...
c(8, 8, 8, 8, 12, 13) # for the last item (13-5):(13+5)
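The window membership sketched above can be written out with plain base R. This is only a brute-force illustration with inclusive bounds (t - Timeinterval to t + Timeinterval); the name `windows` is hypothetical, and the approach compares every row with every other row, so it is not meant for a million rows.

```r
Time <- c(1, 1, 2, 4, 5, 5, 6, 7, 8, 8, 8, 8, 12, 13)
Timeinterval <- 5

# For every time value t, collect all time values inside [t - Timeinterval, t + Timeinterval].
windows <- lapply(Time, function(t) {
  Time[Time >= t - Timeinterval & Time <= t + Timeinterval]
})

windows[[1]]   # window around the first item (Time = 1)
windows[[14]]  # window around the last item (Time = 13)
```

Note that the loop further down uses a strict comparison (`<`), so values exactly Timeinterval away are excluded there.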
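together with the following conditions on the data (but maybe there is a simpler version of that part):
Edit: to make clear what the result should look like: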
Ranges <- DT[ , list(
X* (1 + RangeX.percentage), X* (1 - RangeX.percentage),
Y* (1 + RangeY.percentage), Y* (1 - RangeY.percentage))]
DT2 <- cbind(DT, Ranges, count = rep(1, nrow(DT)))
setnames(DT2, c("Time","X","Y","XR1","XR2","YR1","YR2","count"))
for (i in seq_len(nrow(DT2))) {
  # main part of the question: how to get this done within data.table
  DT2.subset <- DT2[abs(Time - DT2$Time[i]) < Timeinterval]
  # subsequent comparison of X and Y; the result is written into DT2's count column
  set(DT2, i, "count", sum(DT2.subset$X < DT2$XR1[i] &
                           DT2.subset$X > DT2$XR2[i] &
                           DT2.subset$Y < DT2$YR1[i] &
                           DT2.subset$Y > DT2$YR2[i]))
}
DT2
Time X Y XR1 XR2 YR1 YR2 count
1: 1 6 2 9.0 3.0 3.0 1.0 1
2: 1 6 6 9.0 3.0 9.0 3.0 3
3: 2 7 10 10.5 3.5 15.0 5.0 4
4: 4 10 3 15.0 5.0 4.5 1.5 3
5: 5 5 4 7.5 2.5 6.0 2.0 1
6: 5 7 6 10.5 3.5 9.0 3.0 6
7: 6 6 6 9.0 3.0 9.0 3.0 4
8: 7 3 9 4.5 1.5 13.5 4.5 2
9: 8 9 4 13.5 4.5 6.0 2.0 3
10: 8 10 9 15.0 5.0 13.5 4.5 4
11: 8 6 6 9.0 3.0 9.0 3.0 4
12: 8 3 6 4.5 1.5 9.0 3.0 1
13: 12 3 9 4.5 1.5 13.5 4.5 2
14: 13 6 9 9.0 3.0 13.5 4.5 1
Since my full data.table has more than a million rows, checking all of DT$Time for every single row is a mess in terms of computation time.
Answer 0 (score: 4)
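You could try data.table::foverlaps.
We will create Ranges like you did, just adding the Time range and a row index (for aggregating later). The main issue here is that you don't want <= or >= but rather < and >, and foverlaps treats its intervals as closed, so we have to shrink the Time interval by 1 on each side (this works because Time holds integers). Then we also add a Time interval to DT, key it, and run foverlaps. The final stage is counting the observations per row.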
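To see why the bounds get shifted by one, here is a small toy sketch (the tables `win` and `pts` are made up for the illustration). It assumes foverlaps with its default type = "any", which counts an interval as overlapping even when it only touches the boundary of the other one.

```r
library(data.table)

# One query window [1, 5] and three zero-length "point" intervals (illustrative data).
win <- data.table(start = 1, end = 5)
pts <- data.table(start = c(4, 5, 6), end = c(4, 5, 6))
setkey(pts, start, end)  # the second argument of foverlaps must be keyed

# nomatch = 0L drops windows without any overlapping point.
hits <- foverlaps(win, pts, nomatch = 0L)
hits$start  # matched points, including the one sitting exactly on the window's upper bound
```

The point at 5 is matched although a strict `<` comparison would exclude it, which is why the window is narrowed to [Time - Timeinterval + 1, Time + Timeinterval - 1] in the code that follows.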
DT[, Time2 := Time] ## Add higher interval to DT
setkey(DT, Time, Time2) ## key (for foverlaps)
Ranges <-
DT[ , .(Time = Time - Timeinterval + 1, ## Add lower Time interval
Time2 = Time + Timeinterval - 1, ## Add higher Time interval
XR1 = X* (1 - RangeX.percentage),
XR2 = X* (1 + RangeX.percentage),
YR1 = Y* (1 - RangeY.percentage),
YR2 = Y* (1 + RangeY.percentage),
indx = .I)] ## Add row index
# Run foverlaps and count incidences by condition while updating DT by reference
DT[,
count := foverlaps(Ranges, DT)[X > XR1 & X < XR2 & Y > YR1 & Y < YR2,
.N,
keyby = indx]$N]
DT
# Time X Y Time2 count
# 1: 1 6 2 1 1
# 2: 1 6 6 1 3
# 3: 2 7 10 2 4
# 4: 4 10 3 4 3
# 5: 5 5 4 5 1
# 6: 5 7 6 5 6
# 7: 6 6 6 6 4
# 8: 7 3 9 7 2
# 9: 8 9 4 8 3
# 10: 8 10 9 8 4
# 11: 8 6 6 8 4
# 12: 8 3 6 8 1
# 13: 12 3 9 12 2
# 14: 13 6 9 13 1
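As a possible alternative (not part of the answer above), the strict inequalities could also be expressed directly with a data.table non-equi join, which avoids the +-1 adjustment and therefore should not depend on Time being integer. This is a sketch under that assumption; the table `bounds` and its column names (tlo, thi, xlo, xhi, ylo, yhi) are hypothetical, and non-equi joins need data.table 1.9.8 or later.

```r
library(data.table)

Time <- c(1, 1, 2, 4, 5, 5, 6, 7, 8, 8, 8, 8, 12, 13)
X <- c(6, 6, 7, 10, 5, 7, 6, 3, 9, 10, 6, 3, 3, 6)
Y <- c(2, 6, 10, 3, 4, 6, 6, 9, 4, 9, 6, 6, 9, 9)
DT <- data.table(Time, X, Y)

Timeinterval <- 5
RangeX.percentage <- 0.5
RangeY.percentage <- 0.5

# One row of open-interval bounds per row of DT (names are illustrative only).
bounds <- DT[, .(tlo = Time - Timeinterval,
                 thi = Time + Timeinterval,
                 xlo = X * (1 - RangeX.percentage),
                 xhi = X * (1 + RangeX.percentage),
                 ylo = Y * (1 - RangeY.percentage),
                 yhi = Y * (1 + RangeY.percentage))]

# For each row of bounds, count the rows of DT lying strictly inside all three ranges.
counts <- DT[bounds,
             on = .(Time > tlo, Time < thi, X > xlo, X < xhi, Y > ylo, Y < yhi),
             .N, by = .EACHI]$N

DT[, count := counts]
DT
```

On the example data this should give the same count column as the loop in the question. How it compares to foverlaps in speed on a million rows has not been measured here.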