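R: Aggregate by date and hour and place into a separate matrix

Time: 2014-11-11 12:41:57

Tags: r date aggregate

I want to take a data frame containing time-ordered data, aggregate it to the hourly level, and put the result into a separate data frame. This is best explained with an example: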

tradeData dataframe:

Time                     Amount  
2014-05-16 14:00:05       10  
2014-05-16 14:00:10       20  
2014-05-16 14:08:15       30  
2014-05-16 14:23:09       51  
2014-05-16 14:59:54       84  
2014-05-16 15:09:45       94  
2014-05-16 15:24:41       53  
2014-05-16 16:30:51       44

The data frame above contains the data I want to aggregate. Below is the data frame I want to insert it into, hourlyData:

Time                        Profit  
2014-05-16 00:00:00          100  
2014-05-16 01:00:00          200  
2014-05-16 02:00:00          250  
...  
2014-05-16 14:00:00           30  
2014-05-16 15:00:00          -50   
2014-05-16 16:00:00           67  
...  
2014-05-16 23:00:00           -8  

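I want to aggregate the data in the tradeData data frame and place it in the correct position in the hourlyData data frame, as shown below. Each trade is assigned to the hour mark that closes its interval, so trades between 14:00 and 15:00 are summed into the 15:00 row.
New hourlyData data frame: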

Time                        Profit   Amount
2014-05-16 00:00:00          100         0
2014-05-16 01:00:00          200         0
2014-05-16 02:00:00          250         0
...  
2014-05-16 14:00:00           30         0
2014-05-16 15:00:00          -50       195 (10+20+30+51+84)  
2014-05-16 16:00:00           67       147 (94+53)
2014-05-16 17:00:00           20        44
...  
2014-05-16 23:00:00           -8         0

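Using the solution provided by Akrun below, I get the right result in most cases. However, there seems to be a problem when a trade occurs in the last hour of the day, as shown below:

TradeData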

        Time            Amount
2014-08-15 22:09:07     11037.778
2014-08-15 23:01:33     13374.724
2014-08-20 23:25:40     133373.000

HourlyData

  Time                  Amount
2014-08-15 23:00:00     11037.778 (correct)    
2014-08-18 00:00:00         0 (incorrect)  
2014-08-21 00:00:00     133373 (correct)

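When aggregating into the hourlyData data frame, the formula appears to skip the second trade in the tradeData data frame. This seems to happen for trades in the last hour on a Friday because (I think) there is no row for Saturday 12:00 AM, which is Friday 11:00 PM + 1 hour. It works for trades in the last hour from Monday through Thursday.

Any ideas on how to adjust the algorithm? Please let me know if anything is unclear.

Thanks,

Mike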

1 Answer:

Answer 0 (score: 1)

Try

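library(dplyr)

# Sum the trades per hour: cut() floors each time to its hour, and +3600
# moves the sum to the hour mark that closes the interval
hourly_sums <- df %>%
  group_by(hour = as.POSIXct(cut(Time, breaks = 'hour')) + 3600) %>%
  summarise(Amount = sum(Amount))

# Keep every row of df2; hours without trades get NA, which is then set to 0
res <- left_join(df2, hourly_sums, by = c('Time' = 'hour'))
res$Amount[is.na(res$Amount)] <- 0
res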
#                     Time Profit Amount
#1 2014-05-16 00:00:00    100       0
#2 2014-05-16 01:00:00    200       0
#3 2014-05-16 02:00:00    250       0
#4 2014-05-16 14:00:00     30       0
#5 2014-05-16 15:00:00    -50     195
#6 2014-05-16 16:00:00     67     147
#7 2014-05-16 23:00:00     -8       0
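
For anyone wondering what the `cut()` and `+3600` steps contribute, here is a minimal sketch on three of the question's timestamps. The `UTC` time zone and the names `x`, `floored` and `closing` are assumptions made purely for illustration.

```r
# Three timestamps from the question; tz = "UTC" is an assumption so the
# example does not depend on the local time zone
x <- as.POSIXct(c("2014-05-16 14:00:05", "2014-05-16 14:59:54",
                  "2014-05-16 15:09:45"), tz = "UTC")

# cut() returns a factor whose levels are the start of each hour
floored <- cut(x, breaks = "hour")
class(floored)

# Convert the labels back to POSIXct and add one hour (3600 seconds) to get
# the hour mark that closes each interval
closing <- as.POSIXct(as.character(floored), tz = "UTC") + 3600
format(closing, "%Y-%m-%d %H:%M:%S")
```

The first two timestamps should end up at 15:00:00 and the third at 16:00:00, which is why the 14:xx trades are summed into the 15:00 row.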

Or using data.table

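 library(data.table)
 DT <- data.table(df)
 DT2 <- data.table(df2)
 # Sum Amount per hour, labelled by the hour mark that closes the interval
 DT1 <- DT[, list(Amount = sum(Amount)),
           by = list(Time = as.POSIXct(cut(Time, breaks = 'hour')) + 3600)]
 setkey(DT1, Time)
 # Join onto every row of DT2, then set hours without trades to 0
 DT1[DT2][is.na(Amount), Amount := 0][]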
 #                      Time Amount Profit
 #1: 2014-05-16 00:00:00      0    100
 #2: 2014-05-16 01:00:00      0    200
 #3: 2014-05-16 02:00:00      0    250
 #4: 2014-05-16 14:00:00      0     30
 #5: 2014-05-16 15:00:00    195    -50
 #6: 2014-05-16 16:00:00    147     67
 #7: 2014-05-16 23:00:00      0     -8
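
If neither package is available, the same idea can probably be expressed in base R. The following is only a sketch: it rebuilds the question's sample data by hand (with an assumed `UTC` time zone), swaps `cut()` for `trunc()`, and uses the illustrative names `trade`, `hourly` and `sums`, none of which come from the original post.

```r
# Sample data copied from the question; tz = "UTC" is an assumption
trade <- data.frame(
  Time = as.POSIXct(c("2014-05-16 14:00:05", "2014-05-16 14:00:10",
                      "2014-05-16 14:08:15", "2014-05-16 14:23:09",
                      "2014-05-16 14:59:54", "2014-05-16 15:09:45",
                      "2014-05-16 15:24:41", "2014-05-16 16:30:51"),
                    tz = "UTC"),
  Amount = c(10, 20, 30, 51, 84, 94, 53, 44)
)
hourly <- data.frame(
  Time = as.POSIXct(c("2014-05-16 14:00:00", "2014-05-16 15:00:00",
                      "2014-05-16 16:00:00", "2014-05-16 17:00:00"),
                    tz = "UTC"),
  Profit = c(30, -50, 67, 20)
)

# Floor each trade to its hour, then add 3600 seconds so it lands on the
# hour mark that closes the interval
trade$Time <- as.POSIXct(trunc(trade$Time, units = "hours")) + 3600

# Sum per hour, keep every row of hourly, and set hours without trades to 0
sums <- aggregate(Amount ~ Time, data = trade, FUN = sum)
res <- merge(hourly, sums, by = "Time", all.x = TRUE)
res$Amount[is.na(res$Amount)] <- 0
res
```

With this input the `Amount` column should read 0, 195, 147 and 44 for the 14:00 to 17:00 rows, matching the expected table in the question.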

Update

Based on the weekend information:

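 # Trades in the 23:00 hour on a Friday would land on Saturday 00:00, which has
 # no row in hourlyData, so move them to Monday 00:00 instead (23:00 + 49 hours).
 # Note: format(Time, '%a') == 'Fri' assumes an English locale.
 indx <- with(df, as.numeric(format(Time, '%H')) == 23 &
                  format(Time, '%a') == 'Fri')
 grp <- with(df, as.POSIXct(cut(Time, breaks = 'hour')))
 grp[indx] <- grp[indx] + 3600 * 49
 grp[!indx] <- grp[!indx] + 3600

 df$Time <- grp
 df %>%
    group_by(Time) %>%
    summarise(Amount = sum(Amount)) # in the example dataset, it is just 3 rows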
 #                 Time    Amount
 #1 2014-08-15 23:00:00  11037.78
 #2 2014-08-18 00:00:00  13374.72
 #3 2014-08-21 00:00:00 133373.00
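
A variation on the same idea, in case the weekday test needs to work regardless of locale: compute the closing hour mark first and then push any mark that falls on a Saturday forward by 48 hours. This is a rough sketch rather than a drop-in replacement. The helper name `closing_hour` is hypothetical, the `UTC` time zone is an assumption, and it presumes no trades happen over the weekend itself.

```r
# Hypothetical helper: map each trade time to the hour mark that closes its
# interval, moving any mark that falls on a Saturday forward by 48 hours
closing_hour <- function(time) {
  mark <- as.POSIXct(trunc(time, units = "hours")) + 3600
  is_sat <- as.POSIXlt(mark)$wday == 6   # wday: 0 = Sunday, ..., 6 = Saturday
  mark[is_sat] <- mark[is_sat] + 48 * 3600
  mark
}

# The three trades from the question's weekend example; tz = "UTC" is an
# assumption so the result does not depend on the local time zone
trades <- as.POSIXct(c("2014-08-15 22:09:07", "2014-08-15 23:01:33",
                       "2014-08-20 23:25:40"), tz = "UTC")
shifted <- closing_hour(trades)
format(shifted, "%Y-%m-%d %H:%M:%S")
```

For these three trades the marks should be 2014-08-15 23:00:00, 2014-08-18 00:00:00 and 2014-08-21 00:00:00, the same grouping times as in the output above.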

Data

 df <- structure(list(Time = structure(c(1400263205, 1400263210, 1400263695, 
 1400264589, 1400266794, 1400267385, 1400268281, 1400272251), class = c("POSIXct", 
 "POSIXt"), tzone = ""), Amount = c(10L, 20L, 30L, 51L, 84L, 94L, 
 53L, 44L)), .Names = c("Time", "Amount"), row.names = c(NA, -8L
 ), class = "data.frame")

 df2 <- structure(list(Time = structure(c(1400212800, 1400216400, 1400220000, 
 1400263200, 1400266800, 1400270400, 1400295600), class = c("POSIXct", 
 "POSIXt"), tzone = ""), Profit = c(100L, 200L, 250L, 30L, -50L, 
 67L, -8L)), .Names = c("Time", "Profit"), row.names = c(NA, -7L
 ), class = "data.frame")

newdata

 df <- structure(list(Time = structure(c(1408158000, 1408334400, 1408593600
 ), tzone = "", class = c("POSIXct", "POSIXt")), Amount = c(11037.778, 
 13374.724, 133373)), .Names = c("Time", "Amount"), row.names = c(NA, 
 -3L), class = "data.frame")