Question

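Is there a quick way to drop the first x years of data from a data frame in R? I want to drop the first year for each id. My data is sorted by id and date, and the dates within an id are one month apart. The approach I am currently considering is to create a running count from 1 to N for each id and then drop the rows where the count is 1 through 12, but I would like to know whether there is a better way in case my data contains some missing dates.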

For example, my data might look like this:

id | date
__________
 a | 2009-01-01
 a | 2009-02-01
 a | 2009-03-01
 a | 2009-04-01
 a | 2009-05-01
 a | 2009-06-01
 a | 2009-07-01
 a | 2009-08-01
 a | 2009-09-01
 a | 2009-10-01
 a | 2009-11-01
 a | 2009-12-01
 a | 2010-01-01
 a | 2010-02-01
 a | 2010-03-01
 b | 2003-07-01
 b | 2003-08-01
 b | 2003-09-01
 b | 2003-10-01
 b | 2003-11-01
 b | 2003-12-01
 b | 2004-01-01
 b | 2004-02-01
 b | 2004-03-01
 b | 2004-04-01
 b | 2004-05-01
 b | 2004-06-01
 b | 2004-07-01
 b | 2004-08-01
 c | 2007-03-01

My desired output drops the first year of data for each id:

id | date
__________
 a | 2010-01-01
 a | 2010-02-01
 a | 2010-03-01
 b | 2004-07-01
 b | 2004-08-01
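
For reference, the counter approach described above could be sketched roughly as follows. This is only a minimal sketch: the data frame name `df` and the helper column `n` are assumptions made for illustration, and it only gives the right answer when no months are missing, which is exactly the concern raised in the question.

```r
# Hypothetical sample data: monthly rows per id, sorted by id and date
df <- data.frame(
  id   = rep(c("a", "b", "c"), times = c(15, 14, 1)),
  date = c(seq(as.Date("2009-01-01"), by = "month", length.out = 15),
           seq(as.Date("2003-07-01"), by = "month", length.out = 14),
           as.Date("2007-03-01"))
)

# Running row count within each id: 1, 2, ..., N
df$n <- ave(seq_along(df$id), df$id, FUN = seq_along)

# Keep only the rows after the first 12 of each id
dfnew <- df[df$n > 12, c("id", "date")]
```

If a month were missing for some id, the 12th row would no longer mark the end of the first year, so a date-based cutoff is safer.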

Answer 1

Using base R:

# attach the year (as.Date might not be needed if yours is already a date)
df$year <- format(as.Date(df$date),format = '%Y')

# attach the minimum year for each id
df$minyear <- ave(x = df$year,df$id,FUN = min)

# subset by the minyear variable
dfnew <- df[df$year != df$minyear, ]

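Update

Oh, I see: you don't want to drop the data from the first calendar year, but the data within one year of each id's first date. lubridate makes this easy.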

# add year to date
require(lubridate)
df$addyear <- ymd(df$date) %m+% years(1)

# find minimum cutoff date for each id
df$mindate <- ave(x = df$addyear,df$id,FUN = min)

# subset by mindate (parse date first so the comparison is Date against Date)
dfnew <- df[ymd(df$date) >= df$mindate, ]

Answer 2

A simple way to do it:

df = read.csv(text="id,date
 a,2009-01-01
 a,2009-02-01
 a,2009-03-01
 a,2009-04-01
 a,2009-05-01
 a,2009-06-01
 a,2009-07-01
 a,2009-08-01
 a,2009-09-01
 a,2009-10-01
 a,2009-11-01
 a,2009-12-01
 a,2010-01-01
 a,2010-02-01
 a,2010-03-01
 b,2003-07-01
 b,2003-08-01
 b,2003-09-01
 b,2003-10-01
 b,2003-11-01
 b,2003-12-01
 b,2004-01-01
 b,2004-02-01
 b,2004-03-01
 b,2004-04-01
 b,2004-05-01
 b,2004-06-01
 b,2004-07-01
 b,2004-08-01
 c,2007-03-01", strip.white = TRUE)


library(lubridate)
df$date <- ymd(df$date)

library(dplyr)
df %>% group_by(id) %>% filter(year(date) > min(year(date)))
#>    id       date
#> 1   a 2010-01-01
#> 2   a 2010-02-01
#> 3   a 2010-03-01
#> 4   b 2004-01-01
#> 5   b 2004-02-01
#> 6   b 2004-03-01
#> 7   b 2004-04-01
#> 8   b 2004-05-01
#> 9   b 2004-06-01
#> 10  b 2004-07-01
#> 11  b 2004-08-01
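
Note that the filter above compares calendar years, so for id b it keeps everything from 2004-01-01 onward rather than from 2004-07-01. If the goal is to drop the first 12 months after each id's earliest date, a variant along these lines might work. It is a sketch under assumptions: the sample data is rebuilt here so the block runs on its own, and the name `result` is only illustrative.

```r
library(dplyr)
library(lubridate)

# Hypothetical sample data mirroring the question
df <- data.frame(
  id   = rep(c("a", "b", "c"), times = c(15, 14, 1)),
  date = c(seq(as.Date("2009-01-01"), by = "month", length.out = 15),
           seq(as.Date("2003-07-01"), by = "month", length.out = 14),
           as.Date("2007-03-01"))
)

# Keep rows dated at least one year after the earliest date of their id
result <- df %>%
  group_by(id) %>%
  filter(date >= min(date) %m+% years(1)) %>%
  ungroup()
```

On this sample data, `result` holds the five rows from the desired output in the question (three for a, two for b, none for c).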

Answer 3

I started from ARobertson's code and ended up with this to get the result I wanted:

# make sure date is a Date, then find the earliest date for each id
df$date <- as.Date(df$date)
df$mindate <- ave(x = df$date, df$id, FUN = min)

# the cutoff is one year after the earliest date
d <- as.POSIXlt(df$mindate)
d$year <- d$year + 1
df$cutoff_date <- as.Date(d)

# keep only rows on or after the cutoff
dfnew <- df[df$date >= df$cutoff_date, ]
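
Since the question asks about dropping x years in general, the same idea could be wrapped in a small function. This is a sketch only: `drop_first_years` and its argument `x` are hypothetical names, and it assumes the data frame has columns `id` and `date`. Because the cutoff is computed from dates rather than row counts, missing months should not matter.

```r
# Hypothetical helper: drop the first `x` years of rows for each id
drop_first_years <- function(df, x = 1) {
  dates <- as.Date(df$date)
  # earliest date per id, repeated for every row of that id
  first <- ave(dates, df$id, FUN = min)
  # shift the earliest date forward by x years via the POSIXlt year field
  cutoff <- as.POSIXlt(first)
  cutoff$year <- cutoff$year + x
  df[dates >= as.Date(cutoff), , drop = FALSE]
}

# Hypothetical sample data mirroring the question, with one month removed
sample_df <- data.frame(
  id   = rep(c("a", "b", "c"), times = c(15, 14, 1)),
  date = c(seq(as.Date("2009-01-01"), by = "month", length.out = 15),
           seq(as.Date("2003-07-01"), by = "month", length.out = 14),
           as.Date("2007-03-01"))
)
sample_df <- sample_df[sample_df$date != as.Date("2009-06-01"), ]

kept_one_year  <- drop_first_years(sample_df, x = 1)
kept_two_years <- drop_first_years(sample_df, x = 2)
```

With the gap at 2009-06-01, a row counter would misplace the cutoff for id a, whereas the date-based cutoff still starts at 2010-01-01.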

Remove the first year of data by id in R

3 answers: