Question

I have a dataset of roughly 90,000 rows in which an individual can have multiple enrollments in a program. For example:

id = c(1,1,3,3,5,5)
entry_date = c('2014-01-01', '2014-12-01', '2000-03-12', '2002-07-09', '2011-11-05','2016-12-01')
exit_date = c('2014-01-02', '2015-02-04', '2001-04-05', '2006-09-11', '2016-09-01', '2017-02-02')
test <- data.frame(id, entry_date, exit_date)
test


id entry_date   exit_date
1  2014-01-01  2014-01-02
1  2014-12-01  2015-02-04
3  2000-03-12  2001-04-05
3  2002-07-09  2006-09-11
5  2011-11-05  2016-09-01
5  2016-12-01  2017-02-02

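I am trying to subset everyone whose program duration (from entry_date to exit_date) includes all or part of 2014. So, based on the sample data, I would like to keep the following rows: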

id  entry_date    exit_date
1   2014-01-01   2014-01-02 
1   2014-12-01   2015-02-04
5   2011-11-05   2016-09-01

Thanks for any suggestions.

Answer 1

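One approach I can think of is to extract the year from entry_date and exit_date, use mapply with seq to build the sequence of years between them, and keep the rows where 2014 is present in that sequence.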

test[mapply(function(x, y) 2014 %in% seq(x,y) ,
 as.numeric(format(as.Date(test$entry_date), "%Y")), 
 as.numeric(format(as.Date(test$exit_date), "%Y"))), ]

#  id entry_date  exit_date
#1  1 2014-01-01 2014-01-02
#2  1 2014-12-01 2015-02-04
#5  5 2011-11-05 2016-09-01
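
To make the check inside mapply concrete, here is a minimal sketch of what it does for a single row. The variable names years and spans_2014 are only illustrative and are not part of the answer above.

```r
# Years covered by the enrollment 2011-11-05 to 2016-09-01
years <- seq(2011, 2016)
years
# [1] 2011 2012 2013 2014 2015 2016

# 2014 falls inside that sequence, so the row is kept
spans_2014 <- 2014 %in% years
spans_2014
# [1] TRUE

# The enrollment 2000-03-12 to 2001-04-05 never reaches 2014, so it is dropped
2014 %in% seq(2000, 2001)
# [1] FALSE
```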

Answer 2

I think you should split entry_date and exit_date into c(year, month, day) before putting them into the data frame. Either way, using dplyr and tidyr:

library(dplyr)
library(tidyr)
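test %>%
  separate(entry_date, c("entry_year", "entry_month", "entry_day"), "-") %>%
  separate(exit_date, c("exit_year", "exit_month", "exit_day"), "-") %>%
  # separate() returns character columns, so convert the years before comparing
  filter(as.integer(entry_year) <= 2014 & as.integer(exit_year) >= 2014)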

This gives:

  id entry_year entry_month entry_day exit_year exit_month exit_day
1  1       2014          01        01      2014         01       02
2  1       2014          12        01      2015         02       04
3  5       2011          11        05      2016         09       01

Answer 3

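@RonakShah has provided a very smart solution to the problem. However, since the OP mentioned a large dataset, I would like to point out that combining lubridate and data.table can make it faster.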

library(lubridate)
library(data.table)
setDT(test)

test[year(ymd(entry_date)) <= 2014 & year(ymd(exit_date)) >= 2014]
#   id entry_date  exit_date
#1:  1 2014-01-01 2014-01-02
#2:  1 2014-12-01 2015-02-04
#3:  5 2011-11-05 2016-09-01
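
All of the answers compare years. As a possible alternative, the same idea can be written as a plain interval-overlap test on Date values, which would also work for a window that is not a whole calendar year. This is only a base R sketch under that assumption: the names test_dates, window_start and window_end are hypothetical and do not appear in the answers above.

```r
# Rebuild the sample data with real Date columns
test_dates <- data.frame(
  id = c(1, 1, 3, 3, 5, 5),
  entry_date = as.Date(c("2014-01-01", "2014-12-01", "2000-03-12",
                         "2002-07-09", "2011-11-05", "2016-12-01")),
  exit_date = as.Date(c("2014-01-02", "2015-02-04", "2001-04-05",
                        "2006-09-11", "2016-09-01", "2017-02-02"))
)

# Window of interest (assumed here to be the calendar year 2014)
window_start <- as.Date("2014-01-01")
window_end   <- as.Date("2014-12-31")

# Two intervals overlap when each one starts before the other one ends
overlaps <- test_dates$entry_date <= window_end &
  test_dates$exit_date >= window_start

result <- test_dates[overlaps, ]
result
#   id entry_date  exit_date
# 1  1 2014-01-01 2014-01-02
# 2  1 2014-12-01 2015-02-04
# 5  5 2011-11-05 2016-09-01
```

The comparison is vectorized, so it avoids a per-row function call on the full 90,000 rows.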

Subsetting based on a date range within two date variables
