Question

I have a dataset that looks like this:

df=data.frame(c(1,2,2,2,3,4,4),
as.Date(c("2015-01-29","2015-02-02","2015-02-02","2015-02-02","2014-05-04","2014-05-04","2014-05-04")),
as.Date(c( "2010-10-01","2009-09-01","2014-01-01","2014-02-01","2009-01-01","2014-03-01","2013-03-01")),
as.Date(c("2016-04-30","2013-12-31","2014-01-31","2016-04-30","2014-02-28","2014-08-31","2013-05-01"))); 
names(df)=c('id','poi','start','end')

> df
  id        poi      start        end
1  1 2015-01-29 2010-10-01 2016-04-30
2  2 2015-02-02 2009-09-01 2013-12-31
3  2 2015-02-02 2014-01-01 2014-01-31
4  2 2015-02-02 2014-02-01 2016-04-30
5  3 2014-05-04 2009-01-01 2014-02-28
6  4 2014-05-04 2014-03-01 2014-08-31
7  4 2014-05-04 2013-03-01 2013-05-01

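The start and end dates are insurance coverage start and end dates. Sometimes several rows have the same start date because they are for different types of insurance. I'm interested in keeping the ids that have continuous insurance coverage for 1 year before and 1 year after the poi. There can only be 1 poi per id.

My output would be a list of ids that have coverage for 1 year before and 1 year after the poi. In this case, it would exclude ids 3 and 4 because they do not have coverage for 1 year after the poi.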

  ids=c(1,2)

I have tried the following, but honestly I don't know how to get to my goal.

Any help would be greatly appreciated.

library(reshape2)
library(dplyr)  # mutate() comes from dplyr
df.melt=melt(df,
             id=c("id","poi"))

df.melt=mutate(df.melt, flag=ave(id,id,variable,FUN=seq_along))
df.melt=mutate(df.melt, variable=paste(variable,flag,sep ="_"))
df.cast=dcast(df.melt, id+poi~variable)

Answer 1

If you want to evaluate each row on its own, using dplyr and lubridate:

library(dplyr)
library(lubridate)

# filter to only rows with a POI within the desired range
df %>% filter(poi - years(1) >= start, 
              poi + years(1) <= end)

#   id        poi      start        end
# 1  1 2015-01-29 2010-10-01 2016-04-30
# 2  2 2015-02-02 2014-02-01 2016-04-30

If you would rather evaluate all the rows of an ID together, perhaps something like

# group to summarize IDs separately
df %>% group_by(id, poi) %>% 
    # collapse rows to min start and max end for each ID
    summarise(start = min(start), 
              end = max(end)) %>% 
    # filter to only rows with a POI within the desired range
    filter(poi - years(1) >= start, 
           poi + years(1) <= end)

# Source: local data frame [2 x 4]
# Groups: id [2]
# 
#      id        poi      start        end
#   (dbl)     (date)     (date)     (date)
# 1     1 2015-01-29 2010-10-01 2016-04-30
# 2     2 2015-02-02 2009-09-01 2016-04-30

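This approach ignores gaps in coverage, if gaps are possible in your data. If they are, lubridate::interval and int_overlaps may come in handy for carefully reducing the number of rows.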
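To make the gap check concrete, here is a minimal sketch in base R. It is a deliberate substitution, not the lubridate::interval / int_overlaps route mentioned above. The helpers shift_years and covers_window are hypothetical names introduced only for illustration, and the sketch assumes that a period ending the day before the next one starts counts as continuous coverage.

```r
# Rebuild the example data from the question
df <- data.frame(
  id    = c(1, 2, 2, 2, 3, 4, 4),
  poi   = as.Date(c("2015-01-29", "2015-02-02", "2015-02-02", "2015-02-02",
                    "2014-05-04", "2014-05-04", "2014-05-04")),
  start = as.Date(c("2010-10-01", "2009-09-01", "2014-01-01", "2014-02-01",
                    "2009-01-01", "2014-03-01", "2013-03-01")),
  end   = as.Date(c("2016-04-30", "2013-12-31", "2014-01-31", "2016-04-30",
                    "2014-02-28", "2014-08-31", "2013-05-01"))
)

# Hypothetical helper: shift a single Date by n whole calendar years
shift_years <- function(d, n) {
  seq(d, by = paste(n, "year"), length.out = 2)[2]
}

# Hypothetical helper: TRUE if the union of the [start, end] periods
# covers every day from `from` to `to` without a gap.
# Assumption: a period ending the day before the next starts is contiguous.
covers_window <- function(start, end, from, to) {
  o <- order(start)
  start <- start[o]
  end <- end[o]
  covered_to <- from - 1                    # last day known to be covered
  for (i in seq_along(start)) {
    if (start[i] > covered_to + 1) break    # gap before this period
    if (end[i] > covered_to) covered_to <- end[i]
    if (covered_to >= to) return(TRUE)
  }
  covered_to >= to
}

# Check each id against the window [poi - 1 year, poi + 1 year]
ids <- unique(df$id)
ok <- vapply(ids, function(i) {
  rows <- df[df$id == i, ]
  poi <- rows$poi[1]                        # the question states one poi per id
  covers_window(rows$start, rows$end,
                shift_years(poi, -1), shift_years(poi, 1))
}, logical(1))

ids[ok]
```

For the example data this keeps ids 1 and 2. Unlike the min/max summary, an id whose periods leave an uncovered stretch inside the window would be dropped.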

Answer 2

I think this does what you need; if it doesn't, you should be able to get there by adjusting the greater-than and less-than signs:

 df[(df$poi-df$start)/365>1&(df$end-df$poi)/365>1,]

 > df[(df$poi-df$start)/365>1&(df$end-df$poi)/365>1,]
   id        poi      start        end
 1  1 2015-01-29 2010-10-01 2016-04-30
 4  2 2015-02-02 2014-02-01 2016-04-30

This gives you the two rows of df that hold the values you want.

Now just the ids:

 df$id[(df$poi-df$start)/365>1&(df$end-df$poi)/365>1]
 > df$id[(df$poi-df$start)/365>1&(df$end-df$poi)/365>1]
 [1] 1 2
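
Two things are worth keeping in mind with this row-wise test: dividing by 365 only approximates a calendar year, and an id with several qualifying rows would be returned more than once. A minimal sketch of the same check with explicit day counts and unique(), rebuilding the question's data so it runs on its own (the variable name keep is introduced here for illustration):

```r
# Rebuild the example data from the question
df <- data.frame(
  id    = c(1, 2, 2, 2, 3, 4, 4),
  poi   = as.Date(c("2015-01-29", "2015-02-02", "2015-02-02", "2015-02-02",
                    "2014-05-04", "2014-05-04", "2014-05-04")),
  start = as.Date(c("2010-10-01", "2009-09-01", "2014-01-01", "2014-02-01",
                    "2009-01-01", "2014-03-01", "2013-03-01")),
  end   = as.Date(c("2016-04-30", "2013-12-31", "2014-01-31", "2016-04-30",
                    "2014-02-28", "2014-08-31", "2013-05-01"))
)

# Subtracting two Dates gives a difference in days; 365 days stands in
# for "one year" here, which is an approximation around leap years.
keep <- as.numeric(df$poi - df$start) > 365 &
        as.numeric(df$end - df$poi) > 365

# unique() guards against an id appearing once per qualifying row
unique(df$id[keep])
```

Like the original one-liner, this evaluates each row separately, so it does not combine several coverage periods belonging to the same id.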

R: collapsing date fields with a condition
