Suppose I have two variables, ID and dates:
ID <- c(1,1,1,2,2,2,3,3,3,4,4,4)
Datestr <- c("01/05/2014", "01/16/2014", "01/08/2014","07/05/2014", "07/01/2014", "07/02/2014", "02/05/2014", "02/11/2014", "02/02/2014","01/01/2014", "01/11/2014", "01/03/2014")
dates <- as.Date(Datestr, "%m/%d/%Y")
Mydata <- data.frame(ID, dates)
ID dates
1 1 2014-01-05
2 1 2014-01-16
3 1 2014-01-08
4 2 2014-07-05
5 2 2014-07-01
6 2 2014-07-02
7 3 2014-02-05
8 3 2014-02-11
9 3 2014-02-02
10 4 2014-01-01
11 4 2014-01-11
12 4 2014-01-03
Now I need to remove the duplicate IDs and keep, for each ID, only the row with the earliest date:
ID dates
1 1 2014-01-05
5 2 2014-07-01
9 3 2014-02-02
10 4 2014-01-01
Answer 0 (score: 4)
You can use aggregate:
aggregate(dates ~ ID, Mydata, min)
ID dates
1 1 2014-01-05
2 2 2014-07-01
3 3 2014-02-02
4 4 2014-01-01
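Note that aggregate returns only the grouping column and the aggregated column. If the data had further columns worth keeping, one possible approach is to merge the aggregated result back onto the original rows. The following is only a sketch: the data frame Mydata2 and its column value are hypothetical and do not appear in the question.

```r
ID <- c(1, 1, 1, 2, 2, 2)
dates <- as.Date(c("2014-01-05", "2014-01-16", "2014-01-08",
                   "2014-07-05", "2014-07-01", "2014-07-02"))
# `value` is a hypothetical extra column, not part of the original question
Mydata2 <- data.frame(ID, dates, value = letters[1:6])

# Earliest date per ID (only ID and dates survive this step)
earliest <- aggregate(dates ~ ID, Mydata2, min)

# Merge back on both keys to recover the remaining columns of those rows
result <- merge(earliest, Mydata2, by = c("ID", "dates"))
result
```

If an ID has several rows sharing its earliest date, the merge returns all of them, so a further deduplication step may be needed in that case.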
Or with dplyr:
library(dplyr)
group_by(Mydata, ID) %>% filter(min_rank(dates) == 1L)
#Source: local data frame [4 x 2]
#Groups: ID
#
# ID dates
#1 1 2014-01-05
#2 2 2014-07-01
#3 3 2014-02-02
#4 4 2014-01-01
Or
group_by(Mydata, ID) %>% slice(which.min(dates))
#Source: local data frame [4 x 2]
#Groups: ID
#
# ID dates
#1 1 2014-01-05
#2 2 2014-07-01
#3 3 2014-02-02
#4 4 2014-01-01
Or
group_by(Mydata, ID) %>% arrange(dates) %>% slice(1)
#Source: local data frame [4 x 2]
#Groups: ID
#
# ID dates
#1 1 2014-01-05
#2 2 2014-07-01
#3 3 2014-02-02
#4 4 2014-01-01
There is also a data.table option:
library(data.table)
setDT(Mydata)[,.SD[which.min(dates)], ID]
# ID dates
#1: 1 2014-01-05
#2: 2 2014-07-01
#3: 3 2014-02-02
#4: 4 2014-01-01
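The options in this answer give the same result on the example data, but they can differ when an ID has more than one row with the same earliest date. A small sketch of that difference, using a made-up data frame tied (the column note is hypothetical and only there to tell the rows apart):

```r
library(dplyr)

# Hypothetical data: ID 1 has two rows sharing the earliest date
tied <- data.frame(
  ID    = c(1, 1, 1, 2),
  dates = as.Date(c("2014-01-05", "2014-01-05", "2014-01-08", "2014-07-01")),
  note  = c("a", "b", "c", "d")
)

# min_rank() gives tied dates the same rank, so both tied rows are kept
keep_ties <- tied %>% group_by(ID) %>% filter(min_rank(dates) == 1L)

# which.min() returns the index of the first minimum, so one row per ID
first_only <- tied %>% group_by(ID) %>% slice(which.min(dates))

nrow(keep_ties)   # 3
nrow(first_only)  # 2
```

Which behaviour is preferable depends on whether tied rows count as duplicates that should be dropped.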
Answer 1 (score: 4)
If you only want the first (earliest) date for each ID, you can use min.
With dplyr:
library(dplyr)
group_by(Mydata, ID) %>% summarise(dates = min(dates))
# ID dates
# 1 1 2014-01-05
# 2 2 2014-07-01
# 3 3 2014-02-02
# 4 4 2014-01-01
Or with data.table:
library(data.table)
as.data.table(Mydata)[, .(dates = min(dates)), by = ID][]
# ID dates
# 1: 1 2014-01-05
# 2: 2 2014-07-01
# 3: 3 2014-02-02
# 4: 4 2014-01-01
Answer 2 (score: 3)
Using duplicated would be the most efficient approach, IMO: sort by date, then keep the first occurrence of each ID.
Mydata <- Mydata[order(Mydata$dates), ]
Mydata[!duplicated(Mydata$ID), ]
# ID dates
# 10 4 2014-01-01
# 1 1 2014-01-05
# 9 3 2014-02-02
# 5 2 2014-07-01
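The result of the duplicated approach is ordered by date rather than by ID, and Mydata is overwritten with the sorted copy. If the output should follow ID order, as in the question's expected result, one possible variation is to sort by ID first and then by date, storing the sorted copy under a separate name (sorted and result are names chosen here for illustration):

```r
ID <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4)
Datestr <- c("01/05/2014", "01/16/2014", "01/08/2014",
             "07/05/2014", "07/01/2014", "07/02/2014",
             "02/05/2014", "02/11/2014", "02/02/2014",
             "01/01/2014", "01/11/2014", "01/03/2014")
dates <- as.Date(Datestr, "%m/%d/%Y")
Mydata <- data.frame(ID, dates)

# Sort by ID, then by date within each ID; Mydata itself is left unchanged
sorted <- Mydata[order(Mydata$ID, Mydata$dates), ]

# Keep the first (earliest) row of each ID
result <- sorted[!duplicated(sorted$ID), ]
result
```

This keeps the original row names, so the rows can be traced back to their positions in Mydata.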
Or use data.table for an additional efficiency gain:
library(data.table)
unique(setorder(setDT(Mydata), dates), by = "ID")
# ID dates
# 1: 4 2014-01-01
# 2: 1 2014-01-05
# 3: 3 2014-02-02
# 4: 2 2014-07-01
Or, again with duplicated:
setorder(setDT(Mydata), dates)[!duplicated(ID)]
# ID dates
# 1: 4 2014-01-01
# 2: 1 2014-01-05
# 3: 3 2014-02-02
# 4: 2 2014-07-01