R: aggregating per-second data to minutes more efficiently

Time: 2018-06-14 09:25:51

Tags: r data.table aggregate posixct

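I have a data.table, allData, containing data for roughly every (POSIXct) second from different nights. Some nights fall on the same date, however, because the data were collected from different people, so I have a column nightNo as an id for each distinct night.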

          timestamp  nightNo    data1     data2
2018-10-19 19:15:00        1        1         7
2018-10-19 19:15:01        1        2         8
2018-10-19 19:15:02        1        3         9
2018-10-19 18:10:22        2        4        10
2018-10-19 18:10:23        2        5        11 
2018-10-19 18:10:24        2        6        12

I want to aggregate the data to minutes (per night), and using this question I came up with the following code:

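library(data.table)
library(dplyr)

# Average data1 and data2 within each one-minute bin of timestamp
aggregate_minute <- function(df){
  df %>% 
    group_by(timestamp = cut(timestamp, breaks = "1 min")) %>%
    summarise(data1 = mean(data1), data2 = mean(data2)) %>%
    as.data.table()
}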

allData <- allData[, aggregate_minute(allData), by=nightNo]

But my data.table is very large, and this code is not fast enough. Is there a more efficient way to solve this?

2 answers:

Answer 0: (score: 2)

library(data.table)

allData <- data.table(timestamp = c(rep(Sys.time(), 3), rep(Sys.time() + 320, 3)), 
                     nightNo = rep(1:2, c(3, 3)),
                     data1 = 1:6,
                     data2  = 7:12)
                 timestamp nightNo data1 data2
1: 2018-06-14 10:43:11       1     1     7
2: 2018-06-14 10:43:11       1     2     8
3: 2018-06-14 10:43:11       1     3     9
4: 2018-06-14 10:48:31       2     4    10
5: 2018-06-14 10:48:31       2     5    11
6: 2018-06-14 10:48:31       2     6    12


allData[, .(data1 = mean(data1), data2 = mean(data2)), by = .(nightNo, timestamp = cut(timestamp, breaks= "1 min"))]
       nightNo           timestamp data1 data2
1:       1 2018-06-14 10:43:00     2     8
2:       2 2018-06-14 10:48:00     5    11
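
As a variation on the call above, here is a minimal sketch that avoids `cut()`, which returns a factor rather than a POSIXct column. It floors each timestamp by subtracting its seconds-within-the-minute remainder, and uses `.SD` with `.SDcols` so the column list is written once. The sample data and the name `minuteData` are assumptions made for illustration, and whether this is faster on a given table is worth measuring rather than assuming.

```r
library(data.table)

# Hypothetical sample data: night 1 crosses a minute boundary, night 2 does not
allData <- data.table(
  timestamp = as.POSIXct(c("2018-10-19 19:15:58", "2018-10-19 19:15:59",
                           "2018-10-19 19:16:00", "2018-10-19 18:10:22",
                           "2018-10-19 18:10:23", "2018-10-19 18:10:24"),
                         tz = "UTC"),
  nightNo = rep(1:2, each = 3),
  data1 = 1:6,
  data2 = 7:12
)

# Subtracting the remainder modulo 60 seconds floors each timestamp to the
# start of its minute; POSIXct minus a number stays POSIXct with the same tz.
minuteData <- allData[, lapply(.SD, mean),
                      by = .(nightNo,
                             timestamp = timestamp - as.numeric(timestamp) %% 60),
                      .SDcols = c("data1", "data2")]

print(minuteData)
```

Because the grouping column stays POSIXct, the result can be used in later time arithmetic without converting factor labels back to date-times.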

> system.time(replicate(500, allData[, aggregate_minute(allData), by=nightNo]))
    user  system elapsed 
    3.25    0.02    3.31 

> system.time(replicate(500, allData[, .(data1 = mean(data1), data2 = mean(data2)), by = .(nightNo, timestamp = cut(timestamp, breaks= "1 min"))]))
     user  system elapsed 
     1.02    0.04    1.06 

Answer 1: (score: 1)

You can use lubridate to "round" the dates and then use data.table to aggregate the columns.

library(data.table)  
library(lubridate)

Reproducible data:

text <- "timestamp  nightNo    data1     data2
'2018-10-19 19:15:00'        1        1         7
'2018-10-19 19:15:01'        1        2         8
'2018-10-19 19:15:02'        1        3         9
'2018-10-19 18:10:22'        2        4        10
'2018-10-19 18:10:23'        2        5        11 
'2018-10-19 18:10:24'        2        6        12"


allData <- read.table(text = text, header = TRUE, stringsAsFactors = FALSE)

Create the data.table:

setDT(allData)

Parse the timestamp and floor it to the minute:

allData[, timestamp := floor_date(ymd_hms(timestamp), "minutes")]
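
To make the difference between flooring and rounding concrete, here is a small sketch comparing lubridate's `floor_date()` and `round_date()`. The three timestamps are made up for illustration. For per-minute aggregation, flooring is the one that keeps every second of a minute in the same bin.

```r
library(lubridate)

# Hypothetical timestamps within the same minute (ymd_hms parses them as UTC)
ts <- ymd_hms(c("2018-10-19 19:15:29", "2018-10-19 19:15:31", "2018-10-19 19:15:59"))

# floor_date always returns the start of the containing minute
floored <- floor_date(ts, "minute")

# round_date returns the nearest minute, so :31 and :59 move to the next one
rounded <- round_date(ts, "minute")

print(data.frame(ts, floored, rounded))
```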

Change the type of the integer columns to numeric:

allData[, ':='(data1 = as.numeric(data1), 
               data2 = as.numeric(data2))]

Replace the data columns with their means by nightNo group:

allData[, ':='(data1 = mean(data1), 
               data2 = mean(data2)),
        by = nightNo]

The result is:

             timestamp nightNo data1 data2
1: 2018-10-19 19:15:00       1     2     8
2: 2018-10-19 19:15:00       1     2     8
3: 2018-10-19 19:15:00       1     2     8
4: 2018-10-19 18:10:00       2     5    11
5: 2018-10-19 18:10:00       2     5    11
6: 2018-10-19 18:10:00       2     5    11
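
The result above keeps all six rows, because `:=` updates columns in place, and it groups by nightNo only. If one row per night and per minute is wanted, as the question describes, a possible sketch is to group by both nightNo and the floored timestamp and summarise with `.()` instead of `:=`. The name `minuteData` is an assumption made for illustration.

```r
library(data.table)
library(lubridate)

# Same sample values as in the question
allData <- data.table(
  timestamp = c("2018-10-19 19:15:00", "2018-10-19 19:15:01", "2018-10-19 19:15:02",
                "2018-10-19 18:10:22", "2018-10-19 18:10:23", "2018-10-19 18:10:24"),
  nightNo = rep(1:2, each = 3),
  data1 = 1:6,
  data2 = 7:12
)

# Parse the strings and floor each timestamp to the start of its minute
allData[, timestamp := floor_date(ymd_hms(timestamp), "minute")]

# Summarising with .() returns one row per (nightNo, minute) group
minuteData <- allData[, .(data1 = mean(data1), data2 = mean(data2)),
                      by = .(nightNo, timestamp)]

print(minuteData)
```

Grouping on the floored timestamp as well matters once a night spans more than one minute; with nightNo alone, every minute of a night would share a single mean.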