Create timestamp columns based on multiple filter conditions (R, dplyr)

Time: 2020-02-05 17:26:05

Tags: r dplyr lubridate

I have a dataset, df:

 Read      Box       ID      Time                             Subject 
 T         out               10/1/2019 9:00:01 AM
 T         out               10/1/2019 9:00:02 AM             Re:
 T         out               10/1/2019 9:00:03 AM             Re:
 T         out               10/1/2019 9:02:59 AM             Re:
 T         out               10/1/2019 9:03:00 AM
 F                           10/1/2019 9:05:00 AM
 T         out               10/1/2019 9:06:00 AM             Fwd:
 T         out               10/1/2019 9:06:02 AM             Fwd:
 T         in                10/1/2019 9:07:00 AM
 T         in                10/1/2019 9:07:02 AM
 T         out               10/1/2019 9:07:04 AM
 T         out               10/1/2019 9:07:05 AM             Fw:
 T         out               10/1/2019 9:07:06 AM             Fw:
           hello             10/1/2019 9:07:08 AM

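Based on certain conditions in this dataset, I would like to create a starttime column and an endtime column.

I want to create a start time when all of the following hold on consecutive rows: the first word of the Subject column starts with RE:, re, FWD or FW, Read == "T", Box == "out" and ID == "".

A start time is generated the first time these conditions occur. For this dataset, the first start time is 10/1/2019 9:00:02 AM, because that is where the required conditions are first met (Subject is FW:, RE: or FWD, Read = T, Box = out, ID = ""). An end time is created when any of these conditions stops being true, so the first end time is 10/1/2019 9:02:59 AM (row 4), the last row on which the conditions still hold. My ultimate goal is to create a duration column from these two times.

This is my desired output when RE, Fwd and Fw are included: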

  starttime                    endtime                     duration

  10/1/2019 9:00:02 AM        10/1/2019 9:02:59 AM         177 secs
  10/1/2019 9:06:00 AM        10/1/2019 9:06:02 AM         2 secs
  10/1/2019 9:07:05 AM        10/1/2019 9:07:06 AM         1 secs

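In addition, how would I specify, in separate code, the start and end times for these conditions: Read = T, Box = out, ID = "", and the first word of the Subject column does not contain Re, Fwd or Fw?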

This is my desired output when RE, Fwd and Fw are excluded:

  starttime                    endtime                     duration

  10/1/2019 9:00:01 AM        10/1/2019 9:00:01 AM         0 secs
  10/1/2019 9:03:00 AM        10/1/2019 9:03:00 AM         0 secs
  10/1/2019 9:07:04 AM        10/1/2019 9:07:04 AM         0 secs

dput:

 structure(list(Read = structure(c(3L, 3L, 3L, 3L, 3L, 2L, 3L, 
3L, 3L, 3L, 4L, 4L, 3L, 1L), .Label = c("", "F", "T", "T "), class = "factor"), 
Box = structure(c(3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 2L, 2L, 
3L, 3L, 3L, 1L), .Label = c("", "in", "out"), class = "factor"), 
ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L), .Label = c("", "hello"), class = "factor"), 
Time = structure(1:14, .Label = c("10/1/2019 9:00:01 AM", 
"10/1/2019 9:00:02 AM", "10/1/2019 9:00:03 AM", "10/1/2019 9:02:59 AM", 
"10/1/2019 9:03:00 AM", "10/1/2019 9:05:00 AM", "10/1/2019 9:06:00 AM", 
"10/1/2019 9:06:02 AM", "10/1/2019 9:07:00 AM", "10/1/2019 9:07:02 AM", 
"10/1/2019 9:07:04 AM", "10/1/2019 9:07:05 AM", "10/1/2019 9:07:06 AM", 
"10/1/2019 9:07:08 AM"), class = "factor"), Subject = structure(c(1L, 
4L, 4L, 4L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 1L), .Label = c("", 
"Fw:", "Fwd:", "Re:"), class = "factor")), class = "data.frame", row.names = c(NA, 
-14L))

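The suggested code works, and I would also like to include the Subject column conditions:
one version where Subject is FW, FWD or RE (ignoring upper/lower case), and one where Subject is not FW, FWD or RE (ignoring case).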

library(dplyr)

df %>%
mutate(Time = lubridate::mdy_hms(Time), 
cond = Read == "T" & Box == "out" & ID == "" & Subject == "FW" & Subject  == "FWD" & Subject == "RE" (ignore.case = TRUE)
grp = cumsum(!cond)) %>%
filter(cond) %>%
group_by(grp) %>%
summarise(starttime = first(Time), 
endtime = last(Time), 
duration = difftime(endtime, starttime, units = "secs")) %>%
select(-grp)

library(dplyr)

df %>%
mutate(Time = lubridate::mdy_hms(Time), 
cond = Read == "T" & Box == "out" & ID == "" & Subject != "FW" & Subject != "FWD" & Subject != "RE" (ignore.case = TRUE)
grp = cumsum(!cond)) %>%
filter(cond) %>%
group_by(grp) %>%
summarise(starttime = first(Time), 
endtime = last(Time), 
duration = difftime(endtime, starttime, units = "secs")) %>%
select(-grp)

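1 Answer:

Answer 0: (score: 1)

A whole part of your question has already been answered in your other question (Create start and endtime columns based on multiple conditions in R (dplyr, lubridate)). I know it is difficult, but next time please focus on the part you do not yet understand and narrow the question down to a smaller scope.

If you want to detect substrings, the best way is to use str_detect from the stringr package (part of the tidyverse).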
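As a quick illustration of what str_detect returns, here is a minimal sketch on a made-up vector of subjects (the sample strings are hypothetical and not taken from your data). It yields one TRUE/FALSE per element, and prefixing it with ! flips the result, which is all the "excluding" case needs.

```r
library(stringr)

# Hypothetical subjects, for illustration only
subjects <- c("Re: budget", "FWD: minutes", "fw: photos", "Lunch?", "")

# regex() with ignore_case = TRUE makes the match case-insensitive
pattern <- regex("FW|FWD|RE", ignore_case = TRUE)

has_prefix <- str_detect(subjects, pattern)   # TRUE where the pattern is found
no_prefix  <- !str_detect(subjects, pattern)  # the logical negation

print(has_prefix)
print(no_prefix)
```

Because the result is an ordinary logical vector, it can be combined with the other conditions using &.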

library(tidyverse)
library(lubridate)
df %>%
  mutate(Time = mdy_hms(Time), 
         # cond = Read == "T" & Box == "out" & ID == "", #from the answer https://stackoverflow.com/a/60068929/3888000
         cond = Read == "T" & Box == "out" & ID == "" & str_detect(Subject, regex('FW|FWD|RE', ignore_case=TRUE)), #including those subjects
         # cond = Read == "T" & Box == "out" & ID == "" & !str_detect(Subject, regex('FW|FWD|RE', ignore_case=TRUE)), #excluding those subjects
         grp = cumsum(!cond)) %>%
  filter(cond) %>%
  group_by(grp) %>%
  summarise(starttime = first(Time), 
            endtime = last(Time), 
            duration = difftime(endtime, starttime, units = "secs")) %>%
  select(-grp)

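The str_detect() call uses a regular expression (regex), which is a very useful tool to learn. The pattern here only uses the OR (|) operator, which keeps it easy to read, but the possibilities are endless.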
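Two caveats, offered as a sketch rather than a definitive fix. First, 'FW|FWD|RE' matches anywhere in the subject, so a subject such as "Hardware update" (a made-up example) would also count. Second, the dput output in the question shows that Read contains the level "T " with a trailing space on two rows, so Read == "T" is FALSE there. Assuming "starts with" means the subject begins with re, fw or fwd as a whole word, the variant below anchors the pattern and trims Read before comparing. The compact data frame is rebuilt by hand from the dput values.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Compact rebuild of the question's data from its dput output.
# Note the trailing space in "T " on rows 11 and 12.
df <- data.frame(
  Read = c("T", "T", "T", "T", "T", "F", "T", "T", "T", "T", "T ", "T ", "T", ""),
  Box  = c("out", "out", "out", "out", "out", "", "out", "out",
           "in", "in", "out", "out", "out", ""),
  ID   = c(rep("", 13), "hello"),
  Time = c("10/1/2019 9:00:01 AM", "10/1/2019 9:00:02 AM", "10/1/2019 9:00:03 AM",
           "10/1/2019 9:02:59 AM", "10/1/2019 9:03:00 AM", "10/1/2019 9:05:00 AM",
           "10/1/2019 9:06:00 AM", "10/1/2019 9:06:02 AM", "10/1/2019 9:07:00 AM",
           "10/1/2019 9:07:02 AM", "10/1/2019 9:07:04 AM", "10/1/2019 9:07:05 AM",
           "10/1/2019 9:07:06 AM", "10/1/2019 9:07:08 AM"),
  Subject = c("", "Re:", "Re:", "Re:", "", "", "Fwd:", "Fwd:",
              "", "", "", "Fw:", "Fw:", ""),
  stringsAsFactors = FALSE
)

loose    <- regex("FW|FWD|RE", ignore_case = TRUE)      # matches anywhere
anchored <- regex("^(re|fwd?)\\b", ignore_case = TRUE)  # only at the start, as a whole word

# The unanchored pattern also fires on the "re" inside "Hardware"
print(str_detect("Hardware update", loose))
print(str_detect("Hardware update", anchored))

result <- df %>%
  mutate(Time = mdy_hms(Time),
         # str_trim() removes the stray space so "T " is treated like "T"
         cond = str_trim(Read) == "T" & Box == "out" & ID == "" &
           str_detect(Subject, anchored),
         # every FALSE row bumps the counter, so each run of TRUE rows shares one id
         grp = cumsum(!cond)) %>%
  filter(cond) %>%
  group_by(grp) %>%
  summarise(starttime = first(Time),
            endtime = last(Time),
            duration = difftime(endtime, starttime, units = "secs")) %>%
  select(-grp)

print(result)
```

With these two changes the result has three rows with durations of 177, 2 and 1 seconds, which matches the desired output in the question. Putting ! in front of the str_detect() call gives the "excluding" variant.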