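How can I left_join two data frames by the nearest date? My script currently joins on the exact date, but I would like it to fall back to the nearest date when there is no exact match.
This is what I have so far: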
MASTER_DATABASE <- left_join(ptnamesMID, CTDB, by = c("LAST_NAME", "FIRST_NAME", "Measure_date" = "VISIT_DATE"))
Answer 0 (score: 1)
Rolling joins in data.table take the argument roll = "nearest", which does exactly what the OP is asking for.
Unfortunately, the OP did not provide sample data, so I had to make up my own.
set.seed(123L)
dates <- seq(as.Date("2017-01-01"), as.Date("2017-12-31"), by = "1 day")
ptnamesMID <- data.frame(ID = 1:10, Measure_date = sample(dates, 10L))
CTDB <- data.frame(ID = sample.int(10, 30L, TRUE), VISIT_DATE = sample(dates, 30L, TRUE))
Here, ID is used in place of LAST_NAME and FIRST_NAME for simplicity. ptnamesMID consists of 10 rows, while CTDB contains 30 rows with randomly sampled visit dates.
library(data.table)
# coerce to data.table and append join columns to preserve the original columns
setDT(CTDB)[, join_date := VISIT_DATE]
setDT(ptnamesMID)[, join_date := Measure_date]
# rolling join
CTDB[ptnamesMID, on = .(ID, join_date), roll = "nearest"]
    ID VISIT_DATE  join_date Measure_date
 1:  1 2017-06-20 2017-04-15   2017-04-15
 2:  2 2017-05-17 2017-10-14   2017-10-14
 3:  3 2017-06-10 2017-05-29   2017-05-29
 4:  4 2017-10-17 2017-11-16   2017-11-16
 5:  5 2017-06-13 2017-12-06   2017-12-06
 6:  6 2017-02-16 2017-01-17   2017-01-17
 7:  7 2017-07-24 2017-07-09   2017-07-09
 8:  8 2017-10-23 2017-12-28   2017-12-28
 9:  9 2017-02-20 2017-07-16   2017-07-16
10: 10 2017-08-31 2017-06-12   2017-06-12
In data.table syntax, CTDB[ptnamesMID, ...] is equivalent to a left join of ptnamesMID with CTDB, i.e., all rows of ptnamesMID are kept in the result set.
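To make the nearest-match and left-join behaviour easier to check by eye, here is a minimal sketch on a tiny hand-made data set that uses the OP's LAST_NAME and FIRST_NAME columns instead of ID. The names, dates, and the gap_days column are made up for illustration and are not taken from the OP's data.

```r
library(data.table)

# Hypothetical toy data with the OP's column names (values are invented)
ptnamesMID <- data.table(
  LAST_NAME    = c("Smith", "Jones", "Lee"),
  FIRST_NAME   = c("Ann", "Bob", "Cy"),
  Measure_date = as.Date(c("2017-03-10", "2017-06-01", "2017-09-15"))
)
CTDB <- data.table(
  LAST_NAME  = c("Smith", "Smith", "Jones"),
  FIRST_NAME = c("Ann", "Ann", "Bob"),
  VISIT_DATE = as.Date(c("2017-03-01", "2017-03-12", "2017-08-20"))
)

# Copy the date columns so that both originals survive the join
CTDB[, join_date := VISIT_DATE]
ptnamesMID[, join_date := Measure_date]

# Exact match on the names, nearest match on the last join column (the date)
MASTER_DATABASE <- CTDB[ptnamesMID,
                        on = .(LAST_NAME, FIRST_NAME, join_date),
                        roll = "nearest"]

# Drop the helper column and record how far apart the two dates are
MASTER_DATABASE[, join_date := NULL]
MASTER_DATABASE[, gap_days := abs(as.numeric(VISIT_DATE - Measure_date))]
MASTER_DATABASE
```

Smith is matched to the 2017-03-12 visit (2 days away) rather than 2017-03-01 (9 days away), Jones gets the only visit on record, and Lee, who has no row in CTDB, is kept with VISIT_DATE set to NA. Since roll = "nearest" puts no limit on the distance, gap_days can be used afterwards to discard matches that are too far apart.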
Answer 1 (score: 0)
Without an example, it is hard to help with your use case. I would try one of David Robinson's packages:
https://cran.r-project.org/web/packages/fuzzyjoin/fuzzyjoin.pdf
Here is the example for interval_join:
library(fuzzyjoin)

# interval joins rely on the Bioconductor package IRanges
if (requireNamespace("IRanges", quietly = TRUE)) {
  x1 <- data.frame(id1 = 1:3, start = c(1, 5, 10), end = c(3, 7, 15))
  x2 <- data.frame(id2 = 1:3, start = c(2, 4, 16), end = c(4, 8, 20))
  interval_inner_join(x1, x2)
  # Allow them to be separated by a gap with a maximum:
  interval_inner_join(x1, x2, maxgap = 1) # let 1 join with 2
  interval_inner_join(x1, x2, maxgap = 20) # everything joins each other
  # Require that they overlap by more than a particular amount
  interval_inner_join(x1, x2, minoverlap = 3)
  # other types of joins:
  interval_full_join(x1, x2)
  interval_left_join(x1, x2)
  interval_right_join(x1, x2)
  interval_semi_join(x1, x2)
  interval_anti_join(x1, x2)
}
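Note that the interval joins above match overlapping ranges, not the single closest date. If you would rather stay with the left_join from the question, one possible approach (a sketch using plain dplyr, not fuzzyjoin) is to join on the key only and then keep the visit with the smallest date gap for each measurement. The toy data and the gap_days column are assumptions made up for illustration, and ID stands in for the name columns.

```r
library(dplyr)

# Hypothetical toy data (values are invented)
ptnamesMID <- data.frame(
  ID           = c(1, 2, 3),
  Measure_date = as.Date(c("2017-03-10", "2017-06-01", "2017-09-15"))
)
CTDB <- data.frame(
  ID         = c(1, 1, 2),
  VISIT_DATE = as.Date(c("2017-03-01", "2017-03-12", "2017-08-20"))
)

nearest <- ptnamesMID %>%
  # pair every measurement with all visits of the same ID
  left_join(CTDB, by = "ID") %>%
  # absolute distance in days between measurement and visit
  mutate(gap_days = abs(as.numeric(VISIT_DATE - Measure_date))) %>%
  group_by(ID, Measure_date) %>%
  # keep one row per measurement: the visit with the smallest gap
  slice_min(gap_days, n = 1, with_ties = FALSE) %>%
  ungroup()

nearest
```

This builds every ID-level pairing before filtering, so it can get slow or memory-hungry when each person has many visits. It also requires dplyr 1.0.0 or later for slice_min.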