Filter by first (minimum) date

Time: 2015-01-28 22:01:25

Tags: r dplyr

My data looks roughly like this:

  Snap Date   ID   Stage
1 2014-01-01  A1   One     
2 2014-01-02  A1   One 
3 2014-01-03  A1   One 
4 2014-01-04  A1   Two
5 2014-01-05  A1   Two
6 2014-01-01  B9   One 
7 2014-01-02  B9   One 
8 2014-01-03  B9   Two 
9 2014-01-04  B9   Three

How can I filter down to the entries where Stage actually changes, and drop everything in between?

Desired output:

  Snap Date   ID   Stage  
1 2014-01-01  A1   One 
4 2014-01-04  A1   Two
6 2014-01-01  B9   One 
8 2014-01-03  B9   Two 
9 2014-01-04  B9   Three

Also, how would the solution change if I wanted to filter on multiple columns?

   Snap Date   ID   Stage  Colour
1  2014-01-01  A1   One    Red 
2  2014-01-02  A1   One    Red
3  2014-01-03  A1   One    Green
4  2014-01-04  A1   One    Green
5  2014-01-05  A1   Two    Green
6  2014-01-06  A1   Two    Green
7  2014-01-07  A1   Two    Blue
8  2014-01-08  A1   Two    Blue
9  2014-01-09  A1   Three  Blue
10 2014-01-10  A1   Three  Blue   
11 2014-01-11  A1   Four   Blue
12 2014-01-12  A1   Four   Blue
13 2014-01-13  A1   Four   Blue
14 2014-01-14  A1   Four   Blue
15 2014-01-15  A1   Four   Blue
16 2014-01-04  B9   One    Green
17 2014-01-05  B9   One    Green
18 2014-01-06  B9   Two    Green
19 2014-01-07  B9   Three  Green

2 answers:

Answer 0: (score: 4)

Another option with dplyr is:

library(dplyr)
DF %>%
  mutate(Snap.Date = as.Date(Snap.Date)) %>%  # make sure the dates are stored as Date, not character
  group_by(ID, Stage, Colour) %>%             # one group per ID/Stage/Colour combination
  slice(which.min(Snap.Date))                 # keep only the row with the (first) minimum date in each group

#Source: local data frame [9 x 4]
#Groups: ID, Stage, Colour
#
#   Snap.Date ID Stage Colour
#1 2014-01-11 A1  Four   Blue
#2 2014-01-03 A1   One  Green
#3 2014-01-01 A1   One    Red
#4 2014-01-09 A1 Three   Blue
#5 2014-01-07 A1   Two   Blue
#6 2014-01-05 A1   Two  Green
#7 2014-01-04 B9   One  Green
#8 2014-01-07 B9 Three  Green
#9 2014-01-06 B9   Two  Green

This approach does not require the data to be sorted beforehand.
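One caveat: grouping by ID, Stage and Colour keeps a single row per distinct combination, so a stage that recurs later for the same ID (for example One, Two, then One again) would be reported only once. If every actual change should be kept, a comparison against the previous row may be a better fit. The following is a minimal sketch rather than part of the original answer; the toy data frame and the name `changes` are made up for illustration, and the column name `Snap.Date` is assumed from the code above.

```r
library(dplyr)

# Hypothetical toy data: Stage goes One -> Two -> One for a single ID
DF <- data.frame(
  Snap.Date = as.Date("2014-01-01") + 0:5,
  ID        = "A1",
  Stage     = c("One", "One", "Two", "Two", "One", "One"),
  stringsAsFactors = FALSE
)

changes <- DF %>%
  arrange(ID, Snap.Date) %>%                               # lag() depends on row order, so sort first
  group_by(ID) %>%                                         # compare rows only within the same ID
  filter(row_number() == 1 | Stage != lag(Stage)) %>%      # keep the first row and every row where Stage differs from the previous one
  ungroup()

print(changes)
```

Unlike the grouped `slice()` above, this version does depend on row order, which is why it sorts explicitly. To track several columns, the condition can be extended, e.g. `Stage != lag(Stage) | Colour != lag(Colour)`.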

Answer 1: (score: 3)

You can use data.table's unique function and its by argument, which you can adjust as needed.

Original question

library(data.table)
unique(setDT(df), by = c("ID", "Stage"))
#    Snap       Date ID Stage
# 1:    1 2014-01-01 A1   One
# 2:    4 2014-01-04 A1   Two
# 3:    6 2014-01-01 B9   One
# 4:    8 2014-01-03 B9   Two
# 5:    9 2014-01-04 B9 Three

For Edit3: just add Colour to the by argument

unique(df, by = c("ID", "Stage", "Colour"))
#    Snap       Date ID Stage Colour
# 1:    1 2014-01-01 A1   One    Red
# 2:    3 2014-01-03 A1   One  Green
# 3:    5 2014-01-05 A1   Two  Green
# 4:    7 2014-01-07 A1   Two   Blue
# 5:    9 2014-01-09 A1 Three   Blue
# 6:   11 2014-01-11 A1  Four   Blue
# 7:   16 2014-01-04 B9   One  Green
# 8:   18 2014-01-06 B9   Two  Green
# 9:   19 2014-01-07 B9 Three  Green
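Note that `unique()` with `by` keeps the first row it encounters for each combination, so the results above are the earliest dates only because the sample data is already ordered by date. If that cannot be taken for granted, sorting first is a cheap safeguard. A minimal sketch, assuming the column names `Date`, `ID` and `Stage` from the output above; the toy table and the name `first_rows` are hypothetical:

```r
library(data.table)

# Hypothetical toy data, deliberately out of date order
df <- data.table(
  Date  = as.Date(c("2014-01-03", "2014-01-01", "2014-01-02", "2014-01-04")),
  ID    = "A1",
  Stage = c("One", "One", "One", "Two")
)

setorder(df, ID, Date)                           # sort in place so the earliest date comes first within each ID
first_rows <- unique(df, by = c("ID", "Stage"))  # keeps the first row per ID/Stage combination

print(first_rows)
```

`setorder()` reorders the table by reference, so no copy is made.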

Another option is to use which.min (as you mentioned)

df[, .SD[which.min(Date)], .(ID, Stage, Colour)]

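Or using dplyr

library(dplyr)
# .keep_all = TRUE is needed in dplyr >= 0.5 to retain the other columns (Snap, Date)
distinct(df, ID, Stage, Colour, .keep_all = TRUE)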