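I would like to aggregate a data.table based on two conditions, one of which depends on another row. Here is my problem and a reproducible example:
I have origin-destination pairs. For each origin, I want to sum the points in the destination if condition1 is satisfied. However, there are two tricky issues: (1) the points of each destination should be counted only once per origin, and (2) the points should be summed only when the return trip satisfies condition2. That is, the points in A-B can only be summed if condition1==T for A-B and condition2==T for the pair B-A.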
library(data.table)
dt <- data.table( origin = c("A", "A", "A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "B", "B", "B", "B", "B", "C", "C", "B", "A", "C", "C", "C", "C", "C", "A", "A", "C", "C", "B", "B"),
destination = c("A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "A", "A", "B", "B", "B", "C", "C", "B", "B", "A", "B", "C", "C", "C", "A", "A", "C", "C", "B", "B", "C", "C"),
points_in_dest = c(5, 5, 5, 5, 4, 4, 5, 5, 3, 3, 5, 5, 4, 4, 4, 3, 3, 4, 4, 5, 4, 3, 3, 3, 5,5, 3, 3, 4, 4, 3, 3),
depart_time = c(7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 7, 8, 16, 18, 8, 16, 7, 8, 18, 7, 8, 16, 18, 7, 8, 16, 18),
travel_time = c(0, 0, 0, 0, 70, 10, 70, 10, 10, 10, 70, 70, 0, 0, 0, 70, 10, 10, 70, 70, 10, 0, 0, 0, 10, 70, 10, 70, 10, 70, 70, 10) )
dt[ depart_time<=8 & travel_time < 60, condition1 := T] # condition 1 - trips must be in the morning and shorter than 60 min
dt[ depart_time>=16 & travel_time < 60, condition2 := T] # condition 2 - trips must be in the afternoon and shorter than 60 min
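If I consider only condition1 when summing the points, this is what I get. Note that this query addresses neither issue: (1) it double counts points when more than one row for the same origin-destination pair satisfies condition1, and (2) it does not exclude points when condition2 is not satisfied.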
dt[ condition1==T, .(points = sum(points_in_dest)), by=.(origin)]
> origin points
> 1: A 20
> 2: B 11
> 3: C 15
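This is the desired output:
> origin points
> 1: A 9
> 2: B 7
> 3: C 12
My real data has about 80 million rows, so I would appreciate efficient solutions, possibly based on data.table. I realize this is a tricky problem and would appreciate any help. Thanks in advance.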
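This is a common problem in the time geography of accessibility under space-time constraints. The question is how many job opportunities you could choose from given your space-time constraints. For example, suppose you live in zone A. There are 5 jobs in zone A, 4 in zone B, and 3 in zone C, and you are qualified for all of them. However, you can only take a job at a location if you can get to the office in the morning (condition1) and get back home after 4 pm (condition2).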
Answer 0 (score: 3)
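Since you only want to count each combination once, I would suggest a join (matching destination to origin and origin to destination) on the unique subsets for the two conditions, and then simply summing the points by origin.
I encountered a bug in data.table while solving this, hence the setattr(res, "sorted", NULL) line (which removes the key). This workaround should not affect performance. I've filed a bug report.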
setkey(dt, origin, destination) ## doing this so the `unique` function will work faster
res <- unique(dt[(condition1)])[unique(dt[(condition2)]),
on = c(destination = "origin", origin = "destination"),
nomatch = 0L]
setattr(res, "sorted", NULL) ### Fixing the bug
res[, .(points = sum(points_in_dest)), keyby = origin]
# origin points
# 1: A 9
# 2: B 7
# 3: C 12
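
For readers who would rather not rely on the key workaround, here is a minimal sketch of the same idea without setkey or setattr. It is not the code from the answer above: the table names outbound, inbound, and valid are hypothetical, the conditions are stored as plain TRUE/FALSE columns instead of TRUE/NA, and it assumes points_in_dest is constant for a given destination (true for the toy data).

```r
library(data.table)

# Toy data copied from the question
dt <- data.table(
  origin = c("A", "A", "A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "B", "B", "B", "B",
             "B", "C", "C", "B", "A", "C", "C", "C", "C", "C", "A", "A", "C", "C", "B", "B"),
  destination = c("A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "A", "A", "B", "B", "B", "C",
                  "C", "B", "B", "A", "B", "C", "C", "C", "A", "A", "C", "C", "B", "B", "C", "C"),
  points_in_dest = c(5, 5, 5, 5, 4, 4, 5, 5, 3, 3, 5, 5, 4, 4, 4, 3,
                     3, 4, 4, 5, 4, 3, 3, 3, 5, 5, 3, 3, 4, 4, 3, 3),
  depart_time = c(7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 7,
                  8, 16, 18, 8, 16, 7, 8, 18, 7, 8, 16, 18, 7, 8, 16, 18),
  travel_time = c(0, 0, 0, 0, 70, 10, 70, 10, 10, 10, 70, 70, 0, 0, 0, 70,
                  10, 10, 70, 70, 10, 0, 0, 0, 10, 70, 10, 70, 10, 70, 70, 10)
)

# Assumption: store the conditions as TRUE/FALSE (the question uses TRUE/NA)
dt[, condition1 := depart_time <= 8 & travel_time < 60]   # morning trip under 60 min
dt[, condition2 := depart_time >= 16 & travel_time < 60]  # afternoon trip under 60 min

# One row per outbound pair satisfying condition1 (avoids double counting)
outbound <- unique(dt[(condition1), .(origin, destination, points_in_dest)],
                   by = c("origin", "destination"))

# One row per return pair satisfying condition2
inbound <- unique(dt[(condition2), .(origin, destination)])

# Keep an outbound pair only if the reversed pair exists in inbound:
# names in `on` refer to outbound columns, values to inbound columns
valid <- outbound[inbound,
                  on = c(origin = "destination", destination = "origin"),
                  nomatch = 0L]

res <- valid[, .(points = sum(points_in_dest)), keyby = origin]
print(res)
#    origin points
# 1:      A      9
# 2:      B      7
# 3:      C     12
```

Selecting only the columns needed before calling unique keeps the intermediate tables small, which may matter at 80 million rows, though I have not benchmarked this against the keyed version.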