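Merge data by nearest date in R

Time: 2018-03-07 16:17:14

Tags: r date merge left-join

How can I left_join two data frames by the nearest date? My script currently joins on the exact date, but I would like it to fall back to the nearest date when there is no exact match.

This is what I have so far: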

MASTER_DATABASE <- left_join(ptnamesMID, CTDB, by = c("LAST_NAME", "FIRST_NAME", "Measure_date" = "VISIT_DATE"))

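2 answers:

Answer 0 (score: 1)

Rolling joins in data.table have an argument roll = "nearest" which does exactly what the OP expects.

Unfortunately, the OP did not provide sample data, so I had to make up my own.

Create sample datasets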

set.seed(123L)
dates <- seq(as.Date("2017-01-01"), as.Date("2017-12-31"), by = "1 day")
ptnamesMID <- data.frame(ID = 1:10, Measure_date = sample(dates, 10L))
CTDB <- data.frame(ID = sample.int(10, 30L, TRUE), VISIT_DATE = sample(dates, 30L, TRUE))

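Here, ID is used in place of LAST_NAME and FIRST_NAME for simplicity. ptnamesMID consists of 10 rows, while CTDB contains 30 rows of randomly sampled visit dates.

Rolling join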

library(data.table)
# coerce to data.table and append join columns to preserve the original columns 
setDT(CTDB)[, join_date := VISIT_DATE]
setDT(ptnamesMID)[, join_date := Measure_date]
# rolling join
CTDB[ptnamesMID, on = .(ID, join_date), roll = "nearest"]
    ID VISIT_DATE  join_date Measure_date
 1:  1 2017-06-20 2017-04-15   2017-04-15
 2:  2 2017-05-17 2017-10-14   2017-10-14
 3:  3 2017-06-10 2017-05-29   2017-05-29
 4:  4 2017-10-17 2017-11-16   2017-11-16
 5:  5 2017-06-13 2017-12-06   2017-12-06
 6:  6 2017-02-16 2017-01-17   2017-01-17
 7:  7 2017-07-24 2017-07-09   2017-07-09
 8:  8 2017-10-23 2017-12-28   2017-12-28
 9:  9 2017-02-20 2017-07-16   2017-07-16
10: 10 2017-08-31 2017-06-12   2017-06-12

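In data.table syntax, CTDB[ptnamesMID, ...] is equivalent to a left join of ptnamesMID with CTDB, i.e., all rows of ptnamesMID are kept in the result set.

Answer 1 (score: 0)

Without an example it is hard to help with your use case. I would try one of David Robinson's packages:

https://cran.r-project.org/web/packages/fuzzyjoin/fuzzyjoin.pdf

Here is the example for interval_join: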
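To connect this back to the columns in the question, here is a minimal sketch of the same rolling join keyed on LAST_NAME, FIRST_NAME and the date. The toy rows below are made up for illustration and are not the OP's data; the point is only to show that the name columns must match exactly while the last column in `on` is the one that rolls to the nearest value.

```r
library(data.table)

# Hypothetical toy data that reuses the column names from the question
ptnamesMID <- data.table(
  LAST_NAME    = c("Smith", "Smith", "Jones", "Lee"),
  FIRST_NAME   = c("Ann", "Ann", "Bob", "Cy"),
  Measure_date = as.Date(c("2017-03-01", "2017-09-10", "2017-05-05", "2017-01-01"))
)
CTDB <- data.table(
  LAST_NAME  = c("Smith", "Smith", "Jones"),
  FIRST_NAME = c("Ann", "Ann", "Bob"),
  VISIT_DATE = as.Date(c("2017-02-20", "2017-08-01", "2017-05-05"))
)

# Copy the date columns so the originals survive the join
CTDB[, join_date := VISIT_DATE]
ptnamesMID[, join_date := Measure_date]

# Names are matched exactly; join_date (last in `on`) rolls to the nearest visit
MASTER_DATABASE <- CTDB[ptnamesMID,
                        on = .(LAST_NAME, FIRST_NAME, join_date),
                        roll = "nearest"]

# Drop the helper column once the join is done
MASTER_DATABASE[, join_date := NULL]
print(MASTER_DATABASE)
```

In this sketch the person with no rows in CTDB ("Lee") is still kept, with VISIT_DATE set to NA, which is the left-join behaviour described above.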

library(fuzzyjoin)

if (requireNamespace("IRanges", quietly = TRUE)) {
  x1 <- data.frame(id1 = 1:3, start = c(1, 5, 10), end = c(3, 7, 15))
  x2 <- data.frame(id2 = 1:3, start = c(2, 4, 16), end = c(4, 8, 20))
  interval_inner_join(x1, x2)
  # Allow them to be separated by a gap with a maximum:
  interval_inner_join(x1, x2, maxgap = 1) # let 1 join with 2
  interval_inner_join(x1, x2, maxgap = 20) # everything joins each other
  # Require that they overlap by more than a particular amount
  interval_inner_join(x1, x2, minoverlap = 3)
  # other types of joins:
  interval_full_join(x1, x2)
  interval_left_join(x1, x2)
  interval_right_join(x1, x2)
  interval_semi_join(x1, x2)
  interval_anti_join(x1, x2)
}
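For the nearest-date problem in the question specifically, one alternative that stays within dplyr (not taken from either answer, so treat it as a sketch) is to join on the names only and then keep, for each row of ptnamesMID, the visit with the smallest absolute date difference. The toy data and the helper columns row_id and date_gap are hypothetical.

```r
library(dplyr)

# Hypothetical toy data that reuses the column names from the question
ptnamesMID <- data.frame(
  LAST_NAME    = c("Smith", "Jones", "Lee"),
  FIRST_NAME   = c("Ann", "Bob", "Cy"),
  Measure_date = as.Date(c("2017-03-01", "2017-05-05", "2017-01-01"))
)
CTDB <- data.frame(
  LAST_NAME  = c("Smith", "Smith", "Jones"),
  FIRST_NAME = c("Ann", "Ann", "Bob"),
  VISIT_DATE = as.Date(c("2017-02-20", "2017-08-01", "2017-05-05"))
)

MASTER_DATABASE <- ptnamesMID %>%
  mutate(row_id = row_number()) %>%                    # remember each original row
  left_join(CTDB, by = c("LAST_NAME", "FIRST_NAME")) %>%  # all visits per person
  mutate(date_gap = abs(as.numeric(VISIT_DATE - Measure_date))) %>%  # gap in days
  arrange(row_id, date_gap) %>%                        # closest visit first, NA last
  distinct(row_id, .keep_all = TRUE) %>%               # keep one visit per original row
  select(-row_id)

print(MASTER_DATABASE)
```

This builds every name-matched pair before filtering, so it can get large when people have many visits; the data.table rolling join avoids that intermediate table.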