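Subsetting data using time intervals associated with groups, with dplyr and purrr functions

Time: 2018-01-16 09:06:53

Tags: r dplyr purrr

I'm new to the purrr package, but I'd like to use it for the example below instead of the apply functions. I have a data frame in long, tidy format that contains temperature data for multiple groups: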

df <- data.frame(stringsAsFactors=FALSE,
       Date.Time = c("5/29/2016 15:00", "7/20/2016 17:10", "6/2/2016 17:20",
                     "6/10/2016 17:30", "6/28/2016 17:40", "5/29/2016 17:50"),
           TempC = c(22.61, 22.235, 22.11, 22.36, 21.67, 21.54),
            Site = c("DH1", "DL1", "EH1", "EL2", "DH2", "DL2"))

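This dataset currently contains records that fall outside the period of interest. I need to use the time intervals generated below to extract the records for each group that fall within the provided intervals.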

intervals <- data.frame(stringsAsFactors=FALSE,
            Site = c("DL1", "DH1", "DH2", "DL2", "EL2", "EH1", "EH3", "EH2",
                     "DL3", "DH3"),
   full.interval = c("2016-05-29 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                     "2016-05-29 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                     "2016-05-30 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                     "2016-05-30 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                     "2016-05-31 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                     "2016-05-31 17:00:00 UTC--2016-06-28 16:40:00 UTC",
                     "2016-06-01 17:00:00 UTC--2016-06-28 15:20:00 UTC",
                     "2016-06-01 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                     "2016-06-04 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                     "2016-06-02 17:00:00 UTC--2016-06-28 14:00:00 UTC")
)

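I know I need some combination of purrr's map() and keep() functions along with dplyr's group_by(), but I don't know how to structure the code so that it maps across two data frames and multiple groups.

The desired output would be a new data frame containing these records: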

new.df <- data.frame(stringsAsFactors=FALSE,
Date.Time = c("6/2/2016 17:20","6/10/2016 17:30"),
               TempC = c(22.11, 22.36),
                Site = c("EH1", "EL2"))

Thanks in advance!

2 answers:

Answer 0: (score: 1)

This doesn't use purrr, but here is one way:

library(dplyr)
library(lubridate)

# add discrete start/stop columns to intervals
intervals <-
  intervals %>%
  mutate(starts = gsub('--.*$', '', full.interval) %>% ymd_hms,
         stops =  gsub('^.*--', '', full.interval) %>% ymd_hms)

# associate each row in DF with the interval for that site, and filter
df %>%
  merge(intervals, by='Site') %>%
  mutate(in_range = 
           mdy_hm(Date.Time) >= starts &
           mdy_hm(Date.Time) <= stops) %>%
  filter(in_range == TRUE)

Update: this also runs fine when df is much larger:

# make a big version of df (3.7 million rows)
df_long <- df[rep(1:6, length.out=3.7e6),]

# associate each row in DF with the interval for that site, and filter
beg_time <- Sys.time()
results <- df_long %>%
  merge(intervals, by='Site') %>%
  mutate(in_range = 
           mdy_hm(Date.Time) >= starts &
           mdy_hm(Date.Time) <= stops) %>%
  filter(in_range == TRUE)
print(Sys.time() - beg_time)

On my MacBook Pro laptop with 16 GB of RAM, this takes:

Time difference of 20.35184 secs
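
For readers who do want a purrr version, as the question asked, here is a minimal sketch rather than a tested recommendation. It splits df by Site and uses imap() to filter each piece against that site's interval. The names parse_ts, windows, parsed and purrr_result are illustrative and not from the original post, only the six intervals for sites present in df are repeated, and the sketch assumes each site has exactly one interval.

```r
library(dplyr)
library(purrr)

# Sample data from the question
df <- data.frame(stringsAsFactors = FALSE,
  Date.Time = c("5/29/2016 15:00", "7/20/2016 17:10", "6/2/2016 17:20",
                "6/10/2016 17:30", "6/28/2016 17:40", "5/29/2016 17:50"),
  TempC = c(22.61, 22.235, 22.11, 22.36, 21.67, 21.54),
  Site = c("DH1", "DL1", "EH1", "EL2", "DH2", "DL2"))

# Intervals for the six sites that appear in df (copied from the question)
intervals <- data.frame(stringsAsFactors = FALSE,
  Site = c("DL1", "DH1", "DH2", "DL2", "EL2", "EH1"),
  full.interval = c("2016-05-29 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                    "2016-05-29 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                    "2016-05-30 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                    "2016-05-30 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                    "2016-05-31 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                    "2016-05-31 17:00:00 UTC--2016-06-28 16:40:00 UTC"))

# Hypothetical helper: parse "YYYY-MM-DD HH:MM:SS UTC" (trailing text is ignored)
parse_ts <- function(x) as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# One row per site with parsed start/stop times
windows <- intervals %>%
  mutate(starts = parse_ts(sub("--.*$", "", full.interval)),
         stops  = parse_ts(sub("^.*--", "", full.interval)))

purrr_result <- df %>%
  mutate(parsed = as.POSIXct(Date.Time, format = "%m/%d/%Y %H:%M", tz = "UTC")) %>%
  split(.$Site) %>%                      # named list: one data frame per site
  imap(function(site_df, site) {         # site is the list name, i.e. the Site value
    w <- windows[windows$Site == site, ] # assumes at most one interval per site
    site_df[site_df$parsed >= w$starts & site_df$parsed <= w$stops, ]
  }) %>%
  bind_rows() %>%
  select(-parsed)

print(purrr_result)
```

On the sample data this should keep only the EH1 and EL2 rows. A site with no matching interval simply contributes zero rows.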

Answer 1: (score: 0)

Based on your comment above, this is how I would do it.

library(dplyr)
library(tidyr)
df <- df %>% mutate(Date.Time=as.POSIXct(Date.Time,format="%m/%d/%Y %H:%M",tz = "UTC"))
intervals <- intervals %>% 
  separate(full.interval, into=c('Start','End'),sep="--") %>%
  mutate(Start=as.POSIXct(Start,format="%Y-%m-%d %H:%M:%S",tz = "UTC"),
         End=as.POSIXct(End,format="%Y-%m-%d %H:%M:%S",tz = "UTC"))


# join each record to its site's interval, then keep records strictly inside it
output <- df %>% inner_join(intervals, by="Site") %>% filter(Date.Time>Start & Date.Time<End)

> output
            Date.Time TempC Site               Start                 End
1 2016-06-02 17:20:00 22.11  EH1 2016-05-31 17:00:00 2016-06-28 16:40:00
2 2016-06-10 17:30:00 22.36  EL2 2016-05-31 17:00:00 2016-06-28 14:00:00
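
As a possible alternative on newer installations, the join and the filter can be combined into a single non-equi join. This is a sketch that assumes dplyr 1.1.0 or later, where join_by() and its between() helper are available; that release postdates this thread. The names obs, bounds and joined are illustrative. Note that between() is inclusive at both ends by default, whereas the filter above uses strict inequalities; on the sample data the result is the same.

```r
library(dplyr)  # join_by() needs dplyr >= 1.1.0 (assumption about your setup)
library(tidyr)

# Sample data from the question
df <- data.frame(stringsAsFactors = FALSE,
  Date.Time = c("5/29/2016 15:00", "7/20/2016 17:10", "6/2/2016 17:20",
                "6/10/2016 17:30", "6/28/2016 17:40", "5/29/2016 17:50"),
  TempC = c(22.61, 22.235, 22.11, 22.36, 21.67, 21.54),
  Site = c("DH1", "DL1", "EH1", "EL2", "DH2", "DL2"))

# Intervals for the six sites that appear in df (copied from the question)
intervals <- data.frame(stringsAsFactors = FALSE,
  Site = c("DL1", "DH1", "DH2", "DL2", "EL2", "EH1"),
  full.interval = c("2016-05-29 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                    "2016-05-29 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                    "2016-05-30 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                    "2016-05-30 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                    "2016-05-31 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                    "2016-05-31 17:00:00 UTC--2016-06-28 16:40:00 UTC"))

# Parse the observation timestamps
obs <- df %>%
  mutate(Date.Time = as.POSIXct(Date.Time, format = "%m/%d/%Y %H:%M", tz = "UTC"))

# Split each interval string into parsed Start and End columns
bounds <- intervals %>%
  separate(full.interval, into = c("Start", "End"), sep = "--") %>%
  mutate(Start = as.POSIXct(Start, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
         End   = as.POSIXct(End,   format = "%Y-%m-%d %H:%M:%S", tz = "UTC"))

# Match on Site and require Start <= Date.Time <= End in one step
joined <- inner_join(obs, bounds,
                     by = join_by(Site, between(Date.Time, Start, End)))

print(joined)
```

Doing the range check inside the join avoids materialising rows that would be filtered out immediately, which may matter for data on the scale of millions of rows.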