Conditionally removing rows based on date and time

Time: 2019-01-31 06:47:16

Tags: r datetime

I am trying to implement a way to filter this data frame, df:

structure(list(Name = c("Jim", "Jane", "Jose", "Matt", "Mickey", 
"Tom", "Peter", "Jane", "Jim", "Jose"), Progress = c("65", "20", 
"80", "20", "65", "45", "20", "70", "25", "80"), EndDate = c("11/25/2018 16:45", 
"11/25/2018 18:05", "11/25/2018 14:20", "12/1/2018 22:52", "11/29/2018 18:15", 
"12/2/2018 15:27", "11/26/2018 12:07", "11/30/2018 11:18", "11/29/2018 18:04", 
"11/29/2018 21:12")), row.names = c(NA, -10L), class = "data.frame")

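I want to filter it so that if there are duplicate responses in the Name column (for example, Jim appears twice), I keep the row with the earliest date and time according to EndDate, but only if the value in the Progress column is greater than 70. Otherwise, I want to keep the row with the later date and time in the EndDate column.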

3 answers:

Answer 0 (score: 3)

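Based on the condition, we convert 'EndDate' to a date-time class, then arrange by 'Name' and 'EndDate'. Grouped by 'Name', if the first element of 'Progress' is greater than 70, we return index 1, else the last row index (n()), in slice to subset the rows.

library(dplyr)
library(lubridate)

df %>%
  mutate(EndDate = mdy_hm(EndDate)) %>%
  arrange(Name, EndDate) %>%
  group_by(Name) %>%
  slice(if(first(Progress) > 70) 1 else n())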

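Note: if there are multiple 'DateTime' formats, one option is a parser that accepts several formats, such as lubridate's parse_date_time used in the next answer, instead of mdy_hm.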
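The note above concerns inputs that mix more than one date-time format. As a minimal base R sketch of that idea (not the answerer's code), the helper below tries each format in turn and keeps the first successful parse per element. The name parse_multi, the second sample string, and the UTC time zone are assumptions made for illustration.

```r
# Hypothetical helper (not part of the original answer): parse a character
# vector whose elements may follow different date-time formats.
parse_multi <- function(x, formats, tz = "UTC") {
  # Start with an all-NA POSIXct vector of the right length.
  out <- as.POSIXct(rep(NA_real_, length(x)), origin = "1970-01-01", tz = tz)
  for (fmt in formats) {
    todo <- is.na(out)          # elements not parsed yet
    if (!any(todo)) break
    # Elements that do not match `fmt` stay NA and are retried with the next format.
    out[todo] <- as.POSIXct(x[todo], format = fmt, tz = tz)
  }
  out
}

# One value in the question's format, one in an assumed ISO-like format.
mixed <- c("11/25/2018 16:45", "2018-11-29 18:04:00")
parsed <- parse_multi(mixed, c("%m/%d/%Y %H:%M", "%Y-%m-%d %H:%M:%S"))
```

Elements that match none of the supplied formats remain NA, so it is worth checking anyNA(parsed) before grouping and slicing.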

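Answer 1 (score: 3)

Using dplyr, we first convert EndDate to a date-time object with parse_date_time from lubridate, then group_by Name and slice. If a Name has more than one row, we keep the row with the minimum EndDate when any Progress is greater than 70, and the row with the maximum EndDate otherwise. If only one row exists for a Name, we simply select that row.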

library(dplyr)
library(lubridate)

df %>%
  mutate(EndDate = parse_date_time(EndDate,c("%m-%d-%y %H:%M","%Y-%m-%d %H:%M:%S"))) %>%
  group_by(Name) %>%
  slice(ifelse(n() > 1, 
        ifelse(any(Progress > 70), which.min(EndDate), which.max(EndDate)), 1))


#  Name   Progress EndDate            
#  <chr>  <chr>    <dttm>             
#1 Jane   70       2018-11-30 11:18:00
#2 Jim    25       2018-11-29 18:04:00
#3 Jose   80       2018-11-25 14:20:00
#4 Matt   20       2018-12-01 22:52:00
#5 Mickey 65       2018-11-29 18:15:00
#6 Peter  20       2018-11-26 12:07:00
#7 Tom    45       2018-12-02 15:27:00
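For readers without dplyr, the same selection rule can be sketched in base R. This is an illustrative rewrite, not the answerer's code: pick_row and progress_num are hypothetical names, the time zone is assumed to be UTC, and Progress is converted to numeric before the comparison. In the original data Progress is character, so `Progress > 70` is a string comparison that agrees with the numeric one here only because every value has two digits.

```r
# Same data as in the question.
df <- data.frame(
  Name     = c("Jim", "Jane", "Jose", "Matt", "Mickey",
               "Tom", "Peter", "Jane", "Jim", "Jose"),
  Progress = c("65", "20", "80", "20", "65", "45", "20", "70", "25", "80"),
  EndDate  = c("11/25/2018 16:45", "11/25/2018 18:05", "11/25/2018 14:20",
               "12/1/2018 22:52", "11/29/2018 18:15", "12/2/2018 15:27",
               "11/26/2018 12:07", "11/30/2018 11:18", "11/29/2018 18:04",
               "11/29/2018 21:12"),
  stringsAsFactors = FALSE
)

# Parse the timestamps (UTC is an assumption) and get a numeric Progress.
df$EndDate <- as.POSIXct(df$EndDate, format = "%m/%d/%Y %H:%M", tz = "UTC")
progress_num <- as.numeric(df$Progress)

# Hypothetical helper: given the row indices of one Name, return the row to keep.
pick_row <- function(idx) {
  if (length(idx) == 1L) return(idx)            # single row: keep it
  if (any(progress_num[idx] > 70)) {
    idx[which.min(df$EndDate[idx])]             # earliest EndDate
  } else {
    idx[which.max(df$EndDate[idx])]             # latest EndDate
  }
}

keep <- vapply(split(seq_len(nrow(df)), df$Name), pick_row, integer(1))
result <- df[keep, ]
rownames(result) <- NULL
```

On the question's data this keeps the same seven rows as the output shown above, one per Name.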

Answer 2 (score: 1)

This can (of course) also be done with data.table.

Sample data

df <- structure(list(
  Name = c("Jim", "Jane", "Jose", "Matt", "Mickey", "Tom", "Peter", "Jane", "Jim", "Jose"),
  Progress = c("65", "20", "80", "20", "65", "45", "20", "70", "25", "80"),
  EndDate = c("11/25/2018 16:45", "11/25/2018 18:05", "11/25/2018 14:20",
              "12/1/2018 22:52", "11/29/2018 18:15", "12/2/2018 15:27",
              "11/26/2018 12:07", "11/30/2018 11:18", "11/29/2018 18:04",
              "11/29/2018 21:12")),
  row.names = c(NA, -10L), class = "data.frame")

Code

library(data.table)

# create the data.table (can also be done using setDT(df))
dt <- as.data.table(df)
# set the dates to a proper POSIXct format
dt[, EndDate := as.POSIXct(EndDate, format = "%m/%d/%Y %H:%M")]
# order on EndDate (by reference!)
setorder(dt, EndDate)
# summarise by Name: if the first Progress > 70 keep it, else keep the last Progress
# note: Progress is character, so this is a string comparison; it matches the
# numeric comparison here only because every value has two digits
dt[, list(Progress = ifelse(Progress[1] > 70, Progress[1], Progress[.N])), by = .(Name)][]
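The summary above returns only Name and Progress. If the whole row, including EndDate, is wanted, one possible variant subsets .SD per group instead. This is a sketch under assumptions rather than the answerer's code: the numeric conversion of Progress and the UTC time zone are additions, and res is a hypothetical name.

```r
library(data.table)

# Same data as in the question.
dt <- data.table(
  Name     = c("Jim", "Jane", "Jose", "Matt", "Mickey",
               "Tom", "Peter", "Jane", "Jim", "Jose"),
  Progress = c("65", "20", "80", "20", "65", "45", "20", "70", "25", "80"),
  EndDate  = c("11/25/2018 16:45", "11/25/2018 18:05", "11/25/2018 14:20",
               "12/1/2018 22:52", "11/29/2018 18:15", "12/2/2018 15:27",
               "11/26/2018 12:07", "11/30/2018 11:18", "11/29/2018 18:04",
               "11/29/2018 21:12")
)

# Parse the timestamps (UTC is an assumption) and sort by reference.
dt[, EndDate := as.POSIXct(EndDate, format = "%m/%d/%Y %H:%M", tz = "UTC")]
setorder(dt, EndDate)

# Per Name: keep the first (earliest) row if its Progress exceeds 70,
# otherwise keep the last (latest) row. Progress is compared as a number.
res <- dt[, .SD[if (as.numeric(Progress[1]) > 70) 1L else .N], by = Name]
```

Because .SD holds every non-grouping column, res carries Name, Progress, and EndDate for the selected row of each group.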

Benchmark

microbenchmark::microbenchmark(
  data.table = {
    dt[, EndDate := as.POSIXct( EndDate, format = "%m/%d/%Y %H:%M") ]
    setorder( dt, EndDate )
    dt[ , list( Progress = ifelse( Progress[1] > 70, Progress[1], Progress[.N] ) ), by = .(Name)][]
  },
  tidyverse1 = {
    df %>%  
      mutate(EndDate = mdy_hm(EndDate)) %>%
      arrange(Name, EndDate) %>% 
      group_by(Name) %>%
      slice(if(first(Progress) > 70) 1 else n())
  },
  tidyverse2 = {
    df %>%
      mutate(EndDate = mdy_hm(EndDate)) %>%
      group_by(Name) %>%
      slice(ifelse(n() > 1, 
                   ifelse(any(Progress > 70), which.min(EndDate), which.max(EndDate)), 1))
  }
)


# Unit: milliseconds
#       expr      min       lq     mean   median       uq      max neval
# data.table 1.654241 2.030820 2.709023 2.556978 2.782023 30.36590   100
# tidyverse1 6.847731 7.218286 8.742247 7.516838 8.034861 72.00902   100
# tidyverse2 6.173201 6.506398 7.286639 6.764582 7.088591 52.10180   100