I created the following two dummy datasets:
id<-c(8,8,50,87,141,161,192,216,257,282)
date<-c("2011-03-03","2011-12-12","2010-08-18","2009-04-28","2010-11-29","2012-04-02","2013-01-08","2007-01-22","2009-06-03","2009-12-02")
data<-data.frame(cbind(id,date))
id<-c(3,8,11,11,11,11,11,11,19,19,19,19,19,50,50,50,50,50,87,87,87,87,87,87,282,282,282,282,282,282,282,282,282,282,288,288,288,288,288,288,288,288,288,288,288,288,288)
date<-c("2010-11-04","2011-02-25","2009-07-26","2009-07-27","2009-08-09","2009-08-10","2009-08-30","2004-01-20","2006-02-13","2006-07-18","2007-04-20","2008-05-12","2008-05-29","2009-06-10","2010-08-17","2010-08-15","2011-05-13","2011-06-08","2007-08-09","2008-01-19","2008-02-19","2009-04-28","2009-05-16","2009-05-20","2005-05-14","2007-04-15","2007-07-25","2007-10-12","2007-10-23","2007-10-27","2007-11-20","2009-11-28","2012-08-16","2012-08-16","2008-11-17","2009-10-23","2009-10-27","2009-10-27","2009-10-27","2009-10-27","2009-10-28","2010-06-15","2010-06-17","2010-06-23","2010-07-27","2010-07-27","2010-07-28")
ns<-data.frame(cbind(id,date))
Note that only some of the ids in data are present in ns, and vice versa.
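To make that overlap concrete, here is a small base R check. The vectors `data_id` and `ns_id` are illustrative copies of the id columns above (the ns ids are de-duplicated); they are not names from the original code.

```r
# Illustrative copies of the id columns (assumed names, not from the original code)
data_id <- c(8, 8, 50, 87, 141, 161, 192, 216, 257, 282)
ns_id   <- c(3, 8, 11, 19, 50, 87, 282, 288)

both      <- intersect(data_id, ns_id)  # ids present in both datasets
only_data <- setdiff(data_id, ns_id)    # ids in data with no match in ns
only_ns   <- setdiff(ns_id, data_id)    # ids in ns that never occur in data

print(both)
print(only_data)
print(only_ns)
```

Only the ids in `both` can ever yield TRUE or FALSE; the ids in `only_data` are the ones that end up as NA in the desired output.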
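For each value of data$id, I am trying to find out whether there is an ns$date in the 14 days before data$date, where data$id == ns$id, and to report the difference in days.
The output I need is a vector/column ("received") with the same number of rows as data, holding TRUE/FALSE depending on whether some ns$date[ns$id == data$id] falls less than 14 days before the corresponding data$date, plus a similar vector containing the actual number of days wherever "received" is TRUE. I hope that makes sense.
Here is where I have got to so far: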
library(lubridate)  # provides ymd()
# convert dates
data$date <- ymd(data$date)
ns$date <- ymd(ns$date)
# left join datasets
tmp <- merge(data, ns, by="id", all.x=TRUE)
# NOTE: merge() automatically renames data$date to date.x and ns$date to date.y
# create variable to say when there is a date difference less than 14 days
# use >= 0 so that a same-day match (id 87 in the desired output) counts as received
tmp$received <- with(tmp, difftime(date.x, date.y, units="days") < 14 & difftime(date.x, date.y, units="days") >= 0)
#create a variable that reports the days difference
tmp$dif<-ifelse(tmp$received==TRUE,difftime(tmp$date.x,tmp$date.y, units="days"),NA)
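The question "Find if date is within 14 days if id matches between datasets in R" gave me an idea, but its result does not include the difference in days (tmp$dif).
In the final table I only need the smallest difference for each row of data, taken over the rows where tmp$received is TRUE.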
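One possible way to get from the merged table to one row per row of data, sketched in base R. This is only an illustration under some assumptions: it uses a trimmed-down copy of the two datasets (all values taken from the data above), and the `row` column and the `min_or_na` helper are hypothetical names introduced here.

```r
# Trimmed-down copies of the datasets above (subset chosen for illustration)
data <- data.frame(
  id   = c(8, 8, 50, 87, 141, 282),
  date = as.Date(c("2011-03-03", "2011-12-12", "2010-08-18",
                   "2009-04-28", "2010-11-29", "2009-12-02"))
)
ns <- data.frame(
  id   = c(3, 8, 50, 50, 50, 87, 87, 282, 282),
  date = as.Date(c("2010-11-04", "2011-02-25", "2010-08-17", "2010-08-15",
                   "2011-05-13", "2009-04-28", "2009-05-16", "2009-11-28",
                   "2012-08-16"))
)

# Hypothetical helper column: remembers which row of data each merged row came from
data$row <- seq_len(nrow(data))

# Left join: date.x comes from data, date.y from ns (NA when the id is not in ns)
tmp <- merge(data, ns, by = "id", all.x = TRUE)
gap <- as.numeric(tmp$date.x - tmp$date.y)  # days between the two dates
tmp$received <- gap >= 0 & gap < 14         # NA stays NA for unmatched ids
tmp$dif <- ifelse(tmp$received, gap, NA)

# Smallest non-missing value, or NA when there is none (hypothetical helper)
min_or_na <- function(x) if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)

# Collapse back to one value per original row of data
data$received <- as.vector(tapply(tmp$received, tmp$row, any))
data$dif      <- as.vector(tapply(tmp$dif, tmp$row, min_or_na))
data$row      <- NULL
print(data)
```

Grouping on the row index rather than on id keeps the two rows for id 8 separate, which matters because they have different dates.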
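I hope this makes more sense now. If not, please let me know what needs further clarification. M
PS: As requested, I have added what the desired output should look like (the same number of rows as data, i.e. 10; there are no rows for ids that are in ns but not in data). I should have realized this would help.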
    id       date received dif
1    8 2011-03-03     TRUE   6
2    8 2011-12-12    FALSE  NA
3   50 2010-08-18     TRUE   1
4   87 2009-04-28     TRUE   0
5  141 2010-11-29       NA  NA
6  161 2012-04-02       NA  NA
7  192 2013-01-08       NA  NA
8  216 2007-01-22       NA  NA
9  257 2009-06-03       NA  NA
10 282 2009-12-02     TRUE   4
Answer 0 (score: 0)
Here is a data.table approach.
Convert to data.table objects:
library(data.table)
setkey(setDT(data), id)
setkey(setDT(ns), id)
Merge (join on id):
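# Rename data's date column first so the joined column is called date.1
# whatever the data.table version (recent versions would otherwise name it i.date)
setnames(data, "date", "date.1")
ns <- ns[data]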
Convert to Date class:
ns[, c("date", "date.1") := lapply(.SD, as.Date), .SDcols = c("date", "date.1")]
Compute the difference in days and the TRUE/FALSE flag:
ns[, `:=`(timediff = date.1 - date,
Logical = (date.1 - date) < 14)]
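As a quick illustration of what that subtraction produces, here is a minimal base R sketch using the first matching pair of dates for id 8. The names `d_data` and `d_ns` are made up for this example.

```r
d_data <- as.Date("2011-03-03")  # date from data (assumed name)
d_ns   <- as.Date("2011-02-25")  # matching date from ns (assumed name)

gap <- d_data - d_ns             # subtracting Dates gives a difftime in days
days <- as.numeric(gap)          # plain number of days
within_window <- days >= 0 & days < 14

print(gap)
print(within_window)
```

Note that a test of the form `gap < 14` alone is also TRUE for negative gaps (ns dates after the data date), which is why the next step keeps only rows with a non-negative difference.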
Select only the rows we are interested in:
res <- ns[is.na(timediff) | timediff >= 0, list(received = any(Logical), dif = timediff[Logical]), by = list(id, date.1)]
Sort by id and date:
res[, id := as.numeric(as.character(id))]
setkey(res, id, date.1)
Subset by minimum distance:
res[, list(diff = min(dif)), by = list(id, date.1, received)]
#      id     date.1 received    diff
#  1:   8 2011-03-03     TRUE  6 days
#  2:   8 2011-12-12    FALSE NA days
#  3:  50 2010-08-18     TRUE  1 days
#  4:  87 2009-04-28     TRUE  0 days
#  5: 141 2010-11-29       NA NA days
#  6: 161 2012-04-02       NA NA days
#  7: 192 2013-01-08       NA NA days
#  8: 216 2007-01-22       NA NA days
#  9: 257 2009-06-03       NA NA days
# 10: 282 2009-12-02     TRUE  4 days