Aggregate data.table based on a condition in another row

Time: 2016-05-15 14:44:29

Tags: r dataframe data.table aggregate

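I would like to aggregate a data.table based on two conditions, one of which is attached to another row. Here is my problem and a reproducible example: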

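I have origin-destination pairs. For each origin, I want to sum the number of points in the destinations for which condition1 is met. However, there are two tricky things: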

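  1. The points in each origin-destination pair can only be summed once.
  2. The points should only be summed if condition2 is met in the reverse flow. That is, the points in A-B should only be summed if condition1==T in A-B and condition2==T in B-A.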
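
To make the second rule concrete, here is a minimal sketch on a hypothetical four-row table. The names pairs, reverse, reverse_condition2, flagged and counted are made up for illustration and are not part of the data used in this question. The sketch looks up condition2 on the reverse pair and marks a pair as countable only when condition1 holds on the pair itself and condition2 holds on its reverse.

```r
library(data.table)

# Hypothetical toy table: one row per origin-destination pair
pairs <- data.table(
  origin      = c("A", "B", "A", "C"),
  destination = c("B", "A", "C", "A"),
  condition1  = c(TRUE, FALSE, TRUE, FALSE),
  condition2  = c(FALSE, TRUE, FALSE, FALSE)
)

# Swap origin and destination so that condition2 of the return trip (d -> o)
# lines up with the outbound trip (o -> d)
reverse <- pairs[, .(origin = destination,
                     destination = origin,
                     reverse_condition2 = condition2)]

# Attach the reverse-pair flag to each pair, then apply both rules
flagged <- merge(pairs, reverse, by = c("origin", "destination"))
flagged[, counted := condition1 & reverse_condition2]

print(flagged[counted == TRUE, .(origin, destination)])
```

In this toy table only A-B qualifies: A-C satisfies condition1, but the return trip C-A does not satisfy condition2.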

Reproducible example:

    library(data.table)
    dt <-  data.table( origin = c("A", "A", "A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "B", "B", "B", "B", "B", "C", "C", "B", "A", "C", "C", "C", "C", "C", "A", "A", "C", "C", "B", "B"),
                       destination = c("A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "A", "A", "B", "B", "B", "C", "C", "B", "B", "A", "B", "C", "C", "C", "A", "A", "C", "C", "B", "B", "C", "C"),
                       points_in_dest = c(5, 5, 5, 5, 4, 4, 5, 5, 3, 3, 5, 5, 4, 4, 4, 3, 3, 4, 4, 5, 4, 3, 3, 3, 5,5, 3, 3, 4, 4, 3, 3),
                       depart_time = c(7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 7, 8, 16, 18, 8, 16, 7, 8, 18, 7, 8, 16, 18, 7, 8, 16, 18),   
                       travel_time = c(0, 0, 0, 0, 70, 10, 70, 10, 10, 10, 70, 70, 0, 0, 0, 70, 10, 10, 70, 70, 10, 0, 0, 0, 10, 70, 10, 70, 10, 70, 70, 10) )
    
    dt[ depart_time<=8  & travel_time < 60, condition1 := T] # condition 1 - trips must be in the morning and shorter than 60 min
    dt[ depart_time>=16 & travel_time < 60, condition2 := T] # condition 2 - trips must be in the afternoon and shorter than 60 min
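
One detail about flags created this way: assigning with := inside a filtered i sets the flag only on the matching rows, so every other row holds NA rather than FALSE. The sketch below shows the difference on a hypothetical four-row table (toy, flag_subset and flag_full are illustrative names). Filtering still works with either form, because data.table treats NA in a logical i as FALSE.

```r
library(data.table)

# Hypothetical toy table with the same two columns used to build the conditions
toy <- data.table(depart_time = c(7, 8, 16, 18),
                  travel_time = c(0, 70, 10, 70))

# Assigning inside a filtered `i` leaves non-matching rows as NA
toy[depart_time <= 8 & travel_time < 60, flag_subset := TRUE]

# Assigning the logical expression directly gives TRUE/FALSE on every row
toy[, flag_full := depart_time <= 8 & travel_time < 60]

print(toy)
```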
    

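If I sum the points considering only condition1, this is what I get. Note that this query does not deal with the two problems: (1) it double counts points when more than one trip for the same origin-destination pair satisfies condition1, and (2) it does not exclude points when condition2 is not satisfied.

    dt[ condition1==T, .(points = sum(points_in_dest)), by=.(origin)]
    
    >    origin points
    > 1:      A     20
    > 2:      B     11
    > 3:      C     15

Desired output:

    >    origin points
    > 1:      A      9
    > 2:      B      7
    > 3:      C     12

My real data frame has about 80 million rows, so I would appreciate efficient solutions, possibly based on data.table. I realize this is a tricky problem and I would appreciate any help. Thanks in advance.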

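Background

This is a common problem in time geography when measuring accessibility under space-time constraints. The question is how many job opportunities you can reach given your space-time constraints. For example, you live in zone A. There are 5 jobs in zone A, 4 jobs in zone B and 3 jobs in zone C, and you are qualified to work in all of them. However, you can only take a job if you can get to the office in the morning (condition1) and get back home after 4 pm (condition2).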

1 answer:

Answer 0 (score: 3)

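Since you want to count each combination only once, I suggest joining the unique subsets under the two conditions (matching destination to origin and origin to destination), and then simply summing the points by origin.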

I ran into a bug in data.table while solving this, hence the setattr(res, "sorted", NULL) line (which removes the key). This workaround should not affect performance. I've filed a bug report.

setkey(dt, origin, destination) ## doing this so the `unique` function will work faster
res <- unique(dt[(condition1)])[unique(dt[(condition2)]), 
                                on = c(destination = "origin", origin = "destination"), 
                                nomatch = 0L]
setattr(res, "sorted", NULL) ### Fixing the bug
res[, .(points = sum(points_in_dest)), keyby = origin]
#    origin points
# 1:      A      9
# 2:      B      7
# 3:      C     12
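
The same idea can also be sketched without setting a key or using the setattr workaround. This is an alternative formulation, not the code above: it builds the two unique pair tables explicitly and inner-joins them with merge. The names morning, afternoon and reachable are illustrative, and the conditions are recomputed from depart_time and travel_time so the block runs on its own.

```r
library(data.table)

# Data from the question
dt <- data.table(
  origin = c("A", "A", "A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "B", "B", "B", "B", "B", "C", "C", "B", "A", "C", "C", "C", "C", "C", "A", "A", "C", "C", "B", "B"),
  destination = c("A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "A", "A", "B", "B", "B", "C", "C", "B", "B", "A", "B", "C", "C", "C", "A", "A", "C", "C", "B", "B", "C", "C"),
  points_in_dest = c(5, 5, 5, 5, 4, 4, 5, 5, 3, 3, 5, 5, 4, 4, 4, 3, 3, 4, 4, 5, 4, 3, 3, 3, 5, 5, 3, 3, 4, 4, 3, 3),
  depart_time = c(7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 7, 8, 16, 18, 8, 16, 7, 8, 18, 7, 8, 16, 18, 7, 8, 16, 18),
  travel_time = c(0, 0, 0, 0, 70, 10, 70, 10, 10, 10, 70, 70, 0, 0, 0, 70, 10, 10, 70, 70, 10, 0, 0, 0, 10, 70, 10, 70, 10, 70, 70, 10)
)

# One row per origin-destination pair with a morning trip under 60 min (condition 1)
morning <- unique(dt[depart_time <= 8 & travel_time < 60,
                     .(origin, destination, points_in_dest)])

# One row per pair with an afternoon trip under 60 min (condition 2),
# with the columns swapped so the return trip d -> o matches the outbound o -> d
afternoon <- unique(dt[depart_time >= 16 & travel_time < 60,
                       .(origin = destination, destination = origin)])

# Inner join: keep outbound pairs whose return trip also qualifies
reachable <- merge(morning, afternoon, by = c("origin", "destination"))

res <- reachable[, .(points = sum(points_in_dest)), keyby = origin]
print(res)
#    origin points
# 1:      A      9
# 2:      B      7
# 3:      C     12
```

Selecting only the needed columns before calling unique keeps the intermediate tables small, which may matter on a table with tens of millions of rows. Whether it is faster than the keyed join above would need to be measured on the real data.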