Combining merge and aggregate in R

Time: 2014-08-12 07:57:40

Tags: r

I created the following two dummy datasets:

id<-c(8,8,50,87,141,161,192,216,257,282)
date<-c("2011-03-03","2011-12-12","2010-08-18","2009-04-28","2010-11-29","2012-04-02","2013-01-08","2007-01-22","2009-06-03","2009-12-02")
data<-data.frame(cbind(id,date))

id<-c(3,8,11,11,11,11,11,11,19,19,19,19,19,50,50,50,50,50,87,87,87,87,87,87,282,282,282,282,282,282,282,282,282,282,288,288,288,288,288,288,288,288,288,288,288,288,288)
date<-c("2010-11-04","2011-02-25","2009-07-26","2009-07-27","2009-08-09","2009-08-10","2009-08-30","2004-01-20","2006-02-13","2006-07-18","2007-04-20","2008-05-12","2008-05-29","2009-06-10","2010-08-17","2010-08-15","2011-05-13","2011-06-08","2007-08-09","2008-01-19","2008-02-19","2009-04-28","2009-05-16","2009-05-20","2005-05-14","2007-04-15","2007-07-25","2007-10-12","2007-10-23","2007-10-27","2007-11-20","2009-11-28","2012-08-16","2012-08-16","2008-11-17","2009-10-23","2009-10-27","2009-10-27","2009-10-27","2009-10-27","2009-10-28","2010-06-15","2010-06-17","2010-06-23","2010-07-27","2010-07-27","2010-07-28")
ns<-data.frame(cbind(id,date))

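Note that only some of the ids in data are also present in ns, and vice versa.

For each value of data$id, I am trying to find out whether there is an ns$date within the 14 days before data$date, where data$id == ns$id, and to report the difference in days.

The output I need is a vector/column ("received") with the same number of rows as data, holding TRUE/FALSE depending on whether there is an ns$date[ns$id == data$id] less than 14 days before the corresponding data$date, plus a similar vector holding the actual number of days wherever "received" is TRUE. I hope that makes sense now.

This is where I have got to so far: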

# convert dates (ymd() comes from the lubridate package)
library(lubridate)
data$date <- ymd(data$date)
ns$date <- ymd(ns$date)

# left join datasets
tmp <- merge(data, ns, by="id", all.x=TRUE)
# NOTE: this automatically renames data$date to date.x and ns$date to date.y

# create variable to say when there is a date difference of less than 14 days
# (>= 0 rather than > 0, so that a same-day match such as id 87 counts)
tmp$received <- with(tmp, difftime(date.x, date.y, units="days") < 14 & difftime(date.x, date.y, units="days") >= 0)
# create a variable that reports the days difference
tmp$dif <- ifelse(tmp$received, as.numeric(difftime(tmp$date.x, tmp$date.y, units="days")), NA)

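The question "Find if date is within 14 days if id matches between datasets in R" gave me the idea, but its result does not include the day difference in tmp$dif.

In the resulting table I only need the smallest difference for each data$id among the rows where tmp$received is TRUE.

I hope this makes more sense now. If not, please let me know what needs further clarification. M

PS: As requested, I have added what the desired output should look like (same number of rows as data, i.e. 10; rows that appear in ns but not in data are left out). I should have realised earlier that this would help.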

   id         date   received     dif
1   8   2011-03-03       TRUE       6
2   8   2011-12-12      FALSE      NA
3   50  2010-08-18       TRUE       1
4   87  2009-04-28       TRUE       0
5   141 2010-11-29         NA      NA
6   161 2012-04-02         NA      NA
7   192 2013-01-08         NA      NA
8   216 2007-01-22         NA      NA
9   257 2009-06-03         NA      NA
10  282 2009-12-02       TRUE       4

1 answer:

Answer 0: (score: 0)

Here is a data.table approach.

Convert to data.table objects

library(data.table)
setkey(setDT(data), id) 
setkey(setDT(ns), id)

Merge

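ns <- ns[data]
# recent data.table versions name the second date column i.date instead of date.1;
# rename it so that the steps below work either way
if ("i.date" %in% names(ns)) setnames(ns, "i.date", "date.1")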
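To make the join semantics concrete, here is a minimal sketch on two hypothetical toy tables (x and y are illustration names, not the question's data). In a keyed join x[y], every row of y is kept: ids found in x are expanded to one row per match, and ids missing from x get NA in x's columns.

```r
library(data.table)

# Hypothetical toy tables, keyed on id
x <- data.table(id = c(1, 1, 2),
                date = c("2020-01-01", "2020-01-05", "2020-02-01"),
                key = "id")
y <- data.table(id = c(1, 3),
                date = c("2020-01-10", "2020-03-01"),
                key = "id")

# For each row of y, look up the matching rows of x
joined <- x[y]

# id 1 matches two rows of x, id 3 matches none, so 3 rows in total;
# id 2 (present only in x) is dropped
nrow(joined)

# x's date column is NA for the unmatched id 3
joined[id == 3]
```

This is why ns[data] in the step above behaves like a left join on data: ids that exist only in ns disappear, while ids that exist only in data survive with NA dates.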

Convert to Date

ns[, c("date", "date.1") := lapply(.SD, as.Date), .SDcols = c("date", "date.1")]

Compute the day difference and the TRUE/FALSE flag

ns[, `:=`(timediff = date.1 - date,
          Logical = (date.1 - date) < 14)]
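As a rough illustration of what this step computes, the sketch below subtracts a few hypothetical dates (event and earlier are made-up names and values). Subtracting two Date vectors yields a difftime in days, and comparing it with a number propagates NA. Note that the answer's Logical column only tests the upper bound; the lower bound (>= 0) is applied in the next step, whereas this sketch applies both at once.

```r
# Hypothetical dates for illustration only
event   <- as.Date("2011-03-03")
earlier <- as.Date(c("2011-02-25", "2010-01-01", "2011-03-10", NA))

# Date - Date gives a difftime measured in days
timediff <- event - earlier
as.numeric(timediff)   # 6 426 -7 NA

# TRUE only when the earlier date lies 0 to 13 days before the event;
# an NA date gives an NA flag
within14 <- timediff >= 0 & timediff < 14
within14               # TRUE FALSE FALSE NA
```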

Select only the rows we are interested in

res <- ns[is.na(timediff) | timediff >= 0, list(received = any(Logical), dif = timediff[Logical]), by = list(id, date.1)]

Sort by id and date
res[, id := as.numeric(as.character(id))]
setkey(res, id, date.1)

Subset by the minimal distance

res[, list(diff = min(dif)), by = list(id, date.1, received)]

#      id     date.1 received    diff
#  1:   8 2011-03-03     TRUE  6 days
#  2:   8 2011-12-12    FALSE NA days
#  3:  50 2010-08-18     TRUE  1 days
#  4:  87 2009-04-28     TRUE  0 days
#  5: 141 2010-11-29       NA NA days
#  6: 161 2012-04-02       NA NA days
#  7: 192 2013-01-08       NA NA days
#  8: 216 2007-01-22       NA NA days
#  9: 257 2009-06-03       NA NA days
# 10: 282 2009-12-02     TRUE  4 days
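For comparison, the same merge-then-aggregate idea can be sketched in base R. This is only a sketch under stated assumptions: it uses a reduced subset of the question's data with the dates already stored as Date, the helper min_or_na is a hypothetical name introduced here, and "received" is taken to be NA when the id does not occur in ns at all.

```r
# Reduced subset of the question's data (assumption: dates already as Date)
data <- data.frame(id = c(8, 8, 50, 141),
                   date = as.Date(c("2011-03-03", "2011-12-12",
                                    "2010-08-18", "2010-11-29")))
ns <- data.frame(id = c(8, 50, 50, 50),
                 date = as.Date(c("2011-02-25", "2010-08-17",
                                  "2010-08-15", "2011-05-13")))

# Left join: keep every row of data; the date columns become date.x and date.y
tmp <- merge(data, ns, by = "id", all.x = TRUE)

# Days from the ns date to the data date (NA where there was no match)
tmp$dif <- as.numeric(tmp$date.x - tmp$date.y)

# Blank out differences outside the window of 0 to 13 days
in_window <- !is.na(tmp$dif) & tmp$dif >= 0 & tmp$dif < 14
tmp$dif[!in_window] <- NA

# Hypothetical helper: smallest non-missing value, or NA if there is none
min_or_na <- function(x) if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)

# One row per (id, date.x); na.pass keeps the groups whose dif is all NA
res <- aggregate(dif ~ id + date.x, data = tmp, FUN = min_or_na,
                 na.action = na.pass)

# TRUE if a date fell in the window, FALSE if the id matched but none did,
# NA if the id is absent from ns
res$received <- ifelse(res$id %in% ns$id, !is.na(res$dif), NA)
res <- res[order(res$id, res$date.x), c("id", "date.x", "received", "dif")]
res
```

On this subset the rows for id 8 come out as TRUE with 6 days and FALSE with NA, id 50 as TRUE with 1 day, and id 141 as NA, which agrees with the corresponding rows of the data.table result above.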