Filtering dates on a grouped dataset using dplyr

Asked: 2018-06-20 10:01:27

Tags: r filter dplyr

Suppose I have the following dataset:

library(dplyr)

name <- c("b", "a", "a", "b","b","a", "b", "c",  "c",  "c",  "c", "a")
class <- c(0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1)
date <- c("10-06-2018", "11-06-2018", "12-06-2018", "13-06-2018", "14-06-2018", "15-06-2018", "16-06-2018","17-06-2018", "18-06-2018", "19-06-2018", "20-06-2018", "21-06-2018")
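# the date strings use "-" separators, so the format must be "%d-%m-%Y"
dates <- as.Date(date, "%d-%m-%Y")
# store the parsed Date column (not the raw strings) in the data frame
df <- data.frame(name, class, dates)

df <- df %>%
  group_by(name) %>%
  arrange(dates) %>%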
  ungroup() %>%
  arrange(name)

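I want to filter the dataset so that, for each name group, I keep the earliest date with class 0 and the earliest date with class 1 that comes after that class 0 date. In this case, I would get: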

df.new <- df[c(2,3,5,6,9,11), ]

1 Answer:

Answer 0: (score: 0)

There may be a more concise solution, but here is one way to do it:

#split into two dataframes
# find the min dates for class == 0
df0 <- df %>%
 filter(class == 0) %>%
 group_by(name) %>%
 summarise(dates0 = min(dates))

# find min date of class == 1 that is coming after class == 0
# and join the two dataframes
df1 <- df %>%
 filter(class == 1) %>%
 select(-class) %>%
 left_join(df0, by = 'name')

# keep only the class == 1 dates that fall after the first class == 0 date,
# then take the earliest of them per name
df1 <- df1 %>%
 filter(dates > dates0) %>%
 group_by(name) %>%
 summarise(dates = min(dates)) %>%
 mutate(class = 1)

# combine the two dataframes into one with the correct dates
df <- df0 %>%
 mutate(class = 0) %>%
 rename(dates = dates0) %>%
 bind_rows(df1) %>%
 group_by(name) %>%
 arrange(dates) %>%
 ungroup() %>%
 arrange(name)
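
For comparison, the same result can probably be reached in a single pipeline, without splitting and re-joining. The following is only a sketch: it assumes every name has at least one class 0 row (true for the example data), and the names `dat`, `first0` and `res` are introduced here for illustration and do not come from the question.

```r
library(dplyr)

# Rebuild the example data, parsing the dates with the "%d-%m-%Y" format.
name  <- c("b", "a", "a", "b", "b", "a", "b", "c", "c", "c", "c", "a")
class <- c(0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1)
dates <- as.Date(sprintf("%02d-06-2018", 10:21), "%d-%m-%Y")
dat   <- data.frame(name, class, dates)

res <- dat %>%
  group_by(name) %>%
  # earliest class 0 date within each name group
  # (assumption: each group contains at least one class 0 row)
  mutate(first0 = min(dates[class == 0])) %>%
  # keep all class 0 rows, and only the class 1 rows after that date
  filter(class == 0 | dates > first0) %>%
  # earliest remaining date per (name, class) pair
  group_by(name, class) %>%
  summarise(dates = min(dates)) %>%
  ungroup() %>%
  arrange(name, dates)
```

Computing `first0` with a grouped `mutate()` keeps the class 0 reference date next to every row, so the "class 1 after class 0" condition becomes a plain `filter()` rather than a join.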