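I am trying to determine whether multiple dates in one data frame fall within multiple date ranges in another data frame. The dates and date ranges should be compared within each ID. I then want to update the first data frame with information from the second. Both data frames can have zero to many records per ID. For example, df1 might look like this: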
UID1 ID Date
1 1 05/12/10
2 1 07/25/11
3 1 07/31/12
4 2 11/04/03
5 2 10/06/04
6 3 10/07/08
7 3 06/16/12
while df2 looks like this (note that ID = 2 has no records in df2):
UID2 ID StartDate EndDate
1 1 07/22/09 09/13/09
2 1 03/19/10 11/29/10
3 1 05/09/11 09/04/11
4 3 05/18/12 08/15/12
5 3 01/15/13 04/21/13
I would like to end up with a new df1 that looks like this:
UID1 ID Date UID2 InRange DaysSinceStart
1 1 05/12/10 2 TRUE 54
2 1 07/25/11 3 TRUE 77
3 1 07/31/12 NA FALSE NA
4 2 11/04/03 NA FALSE NA
5 2 10/06/04 NA FALSE NA
6 3 10/07/08 NA FALSE NA
7 3 06/16/12 4 TRUE 29
Any suggestions?
Answer 0: (score: 2)
I suggest using data.table. Explanations are inline.
Data:
library(data.table)
dt1 <- fread("
UID1 ID Date
1 1 05/12/10
2 1 07/25/11
3 1 07/31/12
4 2 11/04/03
5 2 10/06/04
6 3 10/07/08
7 3 06/16/12
")[, Date:=as.Date(Date, "%m/%d/%y")]
cols <- c("StartDate", "EndDate")
dt2 <- fread("
UID2 ID StartDate EndDate
1 1 07/22/09 09/13/09
2 1 03/19/10 11/29/10
3 1 05/09/11 09/04/11
4 3 05/18/12 08/15/12
5 3 01/15/13 04/21/13
")[, (cols) := lapply(.SD, function(x) as.Date(x, "%m/%d/%y")), .SDcols=cols]
The actual work starts here:
# left join: attach every dt2 range with the same ID to each dt1 row
dt <- dt2[dt1, on = "ID", allow.cartesian = TRUE]
# for each dt1 row, keep the range(s) that contain Date, or NA if none does
res <- dt[, {
  # which() drops the NA comparisons produced by IDs with no rows in dt2
  chosen <- which(StartDate <= Date & Date <= EndDate)
  if (length(chosen) > 0) {
    # Date falls within at least one range
    list(UID2 = UID2[chosen], StartDate = StartDate[chosen])
  } else {
    # no range for this ID, or Date lies outside all of them
    list(UID2 = NA_integer_, StartDate = as.Date(NA))
  }
}, by = c("UID1", "ID", "Date")]
# flag matches and count the days since the start of the matched range
res[, `:=`(InRange = !is.na(UID2),
           DaysSinceStart = as.numeric(Date - StartDate))][,
    StartDate := NULL]
res
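As a possible alternative to the cartesian join above, the same result can probably be obtained with a data.table non-equi update join, which matches on ID and on the date range in a single step. This is only a sketch, not part of the original answer: the names `d1`, `d2`, `out` and `to_date` are made up for illustration, and it assumes that each date falls within at most one range per ID (if ranges overlapped, the update would keep only one of the matches).

```r
library(data.table)

# Hypothetical helper: parse the mm/dd/yy strings used in the question
to_date <- function(x) as.Date(x, format = "%m/%d/%y")

# Same sample data as above, built directly instead of via fread()
d1 <- data.table(
  UID1 = 1:7,
  ID   = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
  Date = to_date(c("05/12/10", "07/25/11", "07/31/12", "11/04/03",
                   "10/06/04", "10/07/08", "06/16/12"))
)
d2 <- data.table(
  UID2      = 1:5,
  ID        = c(1L, 1L, 1L, 3L, 3L),
  StartDate = to_date(c("07/22/09", "03/19/10", "05/09/11", "05/18/12", "01/15/13")),
  EndDate   = to_date(c("09/13/09", "11/29/10", "09/04/11", "08/15/12", "04/21/13"))
)

# Work on a copy, because := modifies the table by reference
out <- copy(d1)

# Non-equi update join: for every d2 range, find the rows of `out` with the
# same ID and StartDate <= Date <= EndDate, and copy UID2 and StartDate over.
# Rows of `out` without a matching range keep NA in the new columns.
out[d2, on = .(ID, Date >= StartDate, Date <= EndDate),
    `:=`(UID2 = i.UID2, StartDate = i.StartDate)]

# Derive the remaining columns, then drop the helper column
out[, InRange := !is.na(UID2)]
out[, DaysSinceStart := as.numeric(Date - StartDate)]
out[, StartDate := NULL]
out
```

With the sample data, UID1 1, 2 and 7 should come out with UID2 2, 3 and 4 and DaysSinceStart 54, 77 and 29, as in the desired df1. The main design difference is that no intermediate table with every ID-level combination is built, which may matter when there are many ranges per ID.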