Merge data tables by time interval overlap

Time: 2015-10-25 00:58:40

Tags: r merge group-by data.table dplyr

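Let's say I have two tables: one with appointments and one with receptions. Each table has a filial ID, a medic ID, a start and end time (the planned time for an appointment, the actual time for a reception), and some other data. I want to count how many appointments have a reception within the time interval of the appointment. A reception can begin before the appointment's start time or after it, it can fall entirely inside the appointment interval, and so on.

Below I made two tables, one for appointments and one for receptions. I wrote a nested loop, but it runs very slowly. My tables contain approximately 50 million rows each, so I need a fast solution. How can I do this without a loop? Thanks in advance!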
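For experimenting with the answers, tables of this shape can be sketched as follows. The column names and timestamps are taken from the output printed in the answers; the UTC time zone, the column types, and the random `A` and `B` values are assumptions, not the asker's data.

```r
library(data.table)

# Appointments: ten one-hour slots, five per filial (reconstructed, see note above).
slot_start <- as.POSIXct("2015-01-01 14:30:00", tz = "UTC") + 3600 * 0:4
app <- data.table(
  med.id     = 1:10,
  filial.id  = rep(c(100, 200), each = 5),
  start.time = rep(slot_start, 2),
  end.time   = rep(slot_start + 3599, 2),  # each slot ends at hh:29:59
  A          = rnorm(10)                   # placeholder payload column
)

# Receptions: only some medics actually received a patient.
re <- data.table(
  med.id     = c(1L, 3L, 4L, 6L, 7L),
  filial.id  = c(100, 100, 100, 200, 200),
  start.time = as.POSIXct(c("2015-01-01 14:25:00", "2015-01-01 16:32:00",
                            "2015-01-01 17:25:00", "2015-01-01 15:35:00",
                            "2015-01-01 15:50:00"), tz = "UTC"),
  end.time   = as.POSIXct(c("2015-01-01 15:25:00", "2015-01-01 17:36:00",
                            "2015-01-01 18:40:00", "2015-01-01 15:49:00",
                            "2015-01-01 16:12:00"), tz = "UTC"),
  B          = rnorm(5)                    # placeholder payload column
)
```

Both tables use the same time zone and the same types for `med.id` and `filial.id`, which keeps the joins in the answers straightforward.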

2 answers:

Answer 0 (score: 2)

Use foverlaps():

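library(data.table)
# foverlaps() needs its second table (re) keyed, with the interval columns last
setkey(re, med.id, filial.id, start.time, end.time)
# which=TRUE returns matching row indices (xid, yid); count matches per app row
olaps = foverlaps(app, re, which=TRUE, nomatch=0L)[, .N, by=xid]
# start every appointment at 0, then fill in the counts for the matched rows
app[, count := 0L][olaps$xid, count := olaps$N]
app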
#     med.id filial.id          start.time            end.time           A count
#  1:      1       100 2015-01-01 14:30:00 2015-01-01 15:29:59  0.60878560     1
#  2:      2       100 2015-01-01 15:30:00 2015-01-01 16:29:59 -0.11545284     0
#  3:      3       100 2015-01-01 16:30:00 2015-01-01 17:29:59  0.68992084     1
#  4:      4       100 2015-01-01 17:30:00 2015-01-01 18:29:59  0.04703938     1
#  5:      5       100 2015-01-01 18:30:00 2015-01-01 19:29:59 -0.95315419     0
#  6:      6       200 2015-01-01 14:30:00 2015-01-01 15:29:59  0.26193554     0
#  7:      7       200 2015-01-01 15:30:00 2015-01-01 16:29:59  1.55206077     1
#  8:      8       200 2015-01-01 16:30:00 2015-01-01 17:29:59  0.44517362     0
#  9:      9       200 2015-01-01 17:30:00 2015-01-01 18:29:59  0.11475881     0
# 10:     10       200 2015-01-01 18:30:00 2015-01-01 19:29:59 -0.66139828     0
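To make the intermediate index table easier to follow, here is a minimal sketch of the same counting pattern on toy data. The tables `x` and `y`, their column names, and the integer "times" are all hypothetical; only `foverlaps()` and its `which` and `nomatch` arguments come from the answer.

```r
library(data.table)

# Hypothetical toy intervals; integers stand in for timestamps.
x <- data.table(id = c(1L, 1L, 2L), start = c(1L, 10L, 1L), end = c(5L, 15L, 5L))
y <- data.table(id = c(1L, 1L, 2L), start = c(4L, 5L, 20L), end = c(6L, 7L, 25L))
setkey(y, id, start, end)  # the second table must be keyed, interval columns last

# which = TRUE returns row-index pairs (xid, yid) instead of merged rows;
# nomatch = 0L drops rows of x that overlap nothing in y.
idx <- foverlaps(x, y, by.x = c("id", "start", "end"), which = TRUE, nomatch = 0L)

# Count matches per row of x and write them back; unmatched rows keep 0.
n_by_x <- idx[, .N, by = xid]
x[, count := 0L][n_by_x$xid, count := n_by_x$N]
```

Here the first row of `x` overlaps two rows of `y` with the same `id`, so the count is a number of overlapping intervals rather than a 0/1 flag.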

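PS: Please go through the vignettes to learn how to use data.table effectively.

Answer 1 (score: 1)

I actually don't think you need to merge by time overlap at all: your code is really merging on med.id and filial.id and then doing a simple comparison.

First, for clarity, let's rename the start.time and end.time fields: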

setnames(app, c("start.time", "end.time"), c("app.start.time", "app.end.time"))
setnames(re, c("start.time", "end.time"), c("re.start.time", "re.end.time"))

Then merge the two data.tables on the keys med.id and filial.id, like this:

app_re <- re[app, on=c("med.id", "filial.id")]
#    med.id filial.id       re.start.time         re.end.time          B
# 1:      1       100 2015-01-01 14:25:00 2015-01-01 15:25:00  0.4307760
# 2:      2       100                <NA>                <NA>         NA
# 3:      3       100 2015-01-01 16:32:00 2015-01-01 17:36:00 -1.2933755
# 4:      4       100 2015-01-01 17:25:00 2015-01-01 18:40:00 -1.2374469
# 5:      5       100                <NA>                <NA>         NA
# 6:      6       200 2015-01-01 15:35:00 2015-01-01 15:49:00 -0.8054822
# 7:      7       200 2015-01-01 15:50:00 2015-01-01 16:12:00  2.5742241
# 8:      8       200                <NA>                <NA>         NA
# 9:      9       200                <NA>                <NA>         NA
# 10:    10       200                <NA>                <NA>         NA
#          app.start.time        app.end.time           A
# 1:  2015-01-01 14:30:00 2015-01-01 15:29:59 -0.26828337
# 2:  2015-01-01 15:30:00 2015-01-01 16:29:59  0.24246341
# 3:  2015-01-01 16:30:00 2015-01-01 17:29:59  1.55824948
# 4:  2015-01-01 17:30:00 2015-01-01 18:29:59  1.25829302
# 5:  2015-01-01 18:30:00 2015-01-01 19:29:59  1.14244558
# 6:  2015-01-01 14:30:00 2015-01-01 15:29:59 -0.41234563
# 7:  2015-01-01 15:30:00 2015-01-01 16:29:59  0.07710022
# 8:  2015-01-01 16:30:00 2015-01-01 17:29:59 -1.46421985
# 9:  2015-01-01 17:30:00 2015-01-01 18:29:59  1.21682394
# 10: 2015-01-01 18:30:00 2015-01-01 19:29:59  1.11197318

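Then you can create the count variable using the same conditions as before:

# Flag a reception that starts before and ends after the appointment start,
# or that starts strictly inside the appointment interval
app_re[, count := as.integer(
  (re.start.time < app.start.time & re.end.time > app.start.time) |
    (re.start.time < app.end.time & re.start.time > app.start.time))]
# Appointments without a reception have NA times; set their count to 0
app_re[is.na(count), count := 0L]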

This should be much faster than a for loop.
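Note that the two strict comparisons above miss a reception that starts exactly at the appointment's start time. If such boundary cases should count, a general closed-interval test is one option. A minimal sketch, where the helper name `intervals_overlap` and the sample timestamps are hypothetical:

```r
# Hypothetical helper: TRUE when the closed intervals [s1, e1] and [s2, e2]
# share at least one point.
intervals_overlap <- function(s1, e1, s2, e2) {
  s1 <= e2 & s2 <= e1
}

# One appointment slot and three candidate receptions (illustrative values).
app_start <- as.POSIXct("2015-01-01 14:30:00", tz = "UTC")
app_end   <- as.POSIXct("2015-01-01 15:29:59", tz = "UTC")
re_start  <- as.POSIXct(c("2015-01-01 14:25:00",   # starts before the slot
                          "2015-01-01 14:30:00",   # starts exactly at slot start
                          "2015-01-01 15:35:00"),  # starts after the slot ends
                        tz = "UTC")
re_end    <- as.POSIXct(c("2015-01-01 15:25:00",
                          "2015-01-01 15:00:00",
                          "2015-01-01 15:49:00"), tz = "UTC")

hits <- intervals_overlap(re_start, re_end, app_start, app_end)
```

The first two receptions overlap the slot and the third does not. Also keep in mind that if a medic can have several receptions per filial, the merge produces one row per appointment-reception pair, so the flags would need to be summed per appointment.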