I have a data frame that looks like this:
Date Pulled  Date        Col3  Col4
2019-01-19   2019-01-17     8     9
2019-01-19   2019-01-18    14     9
2019-01-20   2019-01-18     8     0
2019-01-20   2019-01-18    15    14
2019-01-18   2019-01-17    18     7
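I want to write logic that says: whenever the same Date value appears under different Date Pulled values, keep only the rows with the maximum Date Pulled for that Date. For the data above, the result should be: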
Date Pulled  Date        Col3  Col4
2019-01-19   2019-01-17     8     9
2019-01-20   2019-01-18     8     0
2019-01-20   2019-01-18    15    14
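For context, every day I pull a report that covers 7 days of data. If I combine the results, the same dates show up more than once (hence the duplicate values in the Date column). I only want to keep the most recent report I pulled, hence keeping the maximum pull date.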
Answer 0: (score: 2)
Assuming "Col1" (Date Pulled) and "Col2" (Date) are of class Date, group by "Col2" and filter the rows where "Col1" equals the max of "Col1" within the group.
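library(dplyr)

# Col1 = Date Pulled, Col2 = Date
df1 %>%
  group_by(Col2) %>%           # one group per report date
  filter(Col1 == max(Col1))    # keep rows from the latest pull in each group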
# A tibble: 3 x 4
# Groups: Col2 [2]
# Col1 Col2 Col3 Col4
# <date> <date> <int> <int>
#1 2019-01-19 2019-01-17 8 9
#2 2019-01-20 2019-01-18 8 0
#3 2019-01-20 2019-01-18 15 14
df1 <- structure(list(Col1 = structure(c(17915, 17915, 17916, 17916,
17914), class = "Date"), Col2 = structure(c(17913, 17914, 17914,
17914, 17913), class = "Date"), Col3 = c(8L, 14L, 8L, 15L, 18L
), Col4 = c(9L, 9L, 0L, 14L, 7L)), row.names = c(NA, -5L), class = "data.frame")
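For comparison, the same per-group maximum filter can be sketched in base R without dplyr. This is only an illustration and not part of the answer above: the data is rebuilt inline so the snippet runs on its own, and latest_pull and result are hypothetical names.

```r
# Rebuild the example data (Col1 = Date Pulled, Col2 = Date).
df1 <- data.frame(
  Col1 = as.Date(c("2019-01-19", "2019-01-19", "2019-01-20",
                   "2019-01-20", "2019-01-18")),
  Col2 = as.Date(c("2019-01-17", "2019-01-18", "2019-01-18",
                   "2019-01-18", "2019-01-17")),
  Col3 = c(8L, 14L, 8L, 15L, 18L),
  Col4 = c(9L, 9L, 0L, 14L, 7L)
)

# For every row, the latest pull date within its Col2 group.
# Dates are handled as day counts so ave() returns plain numbers.
latest_pull <- ave(as.numeric(df1$Col1), as.character(df1$Col2), FUN = max)

# Keep only the rows that come from the latest pull for their date.
result <- df1[as.numeric(df1$Col1) == latest_pull, ]
print(result)
```

Like the filter() version, this keeps ties, so every row from the latest pull for a given date is retained.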
Answer 1: (score: 1)
"I only want to keep the most recent report I pulled, hence keeping the maximum pull date."
This seems to work:
library(dplyr)

# Latest pull per Date, joined back to keep only the matching rows
inner_join(
  DT,
  DT %>% group_by(Date) %>% summarise(Pulled = max(Pulled))
)
Joining, by = c("Pulled", "Date")
Pulled Date Col3 Col4
1 2019-01-19 2019-01-17 8 9
2 2019-01-20 2019-01-18 8 0
3 2019-01-20 2019-01-18 15 14
where
DT = structure(list(Pulled = c("2019-01-19", "2019-01-19", "2019-01-20",
"2019-01-20", "2019-01-18"), Date = c("2019-01-17", "2019-01-18",
"2019-01-18", "2019-01-18", "2019-01-17"), Col3 = c(8L, 14L,
8L, 15L, 18L), Col4 = c(9L, 9L, 0L, 14L, 7L)), row.names = c(NA,
-5L), class = "data.frame")
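(That is, I didn't bother converting to the Date class. max() on yyyy-mm-dd strings still picks the latest date, because that format sorts chronologically as text.)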
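A small check of that shortcut, as a sketch rather than part of the answer: it compares max() on the character values with max() on the converted dates, then shows a day-first format where the text comparison would mislead. All variable names here are hypothetical.

```r
# Pull dates as yyyy-mm-dd strings, as in DT$Pulled.
pulled <- c("2019-01-19", "2019-01-19", "2019-01-20",
            "2019-01-20", "2019-01-18")

max_as_text <- max(pulled)                   # plain string comparison
max_as_date <- format(max(as.Date(pulled)))  # true date comparison
same_answer <- identical(max_as_text, max_as_date)

# Counter-example: day-first strings do not sort chronologically.
day_first <- c("19-01-2019", "02-03-2019")
text_max  <- max(day_first)
date_max  <- format(max(as.Date(day_first, format = "%d-%m-%Y")), "%d-%m-%Y")
misleading <- text_max != date_max

print(c(same_answer = same_answer, misleading = misleading))
```

If the columns arrive in any format other than year-month-day, converting with as.Date() first is the safer route.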