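I want to take a data frame of time-ordered data, aggregate it to the hourly level, and put the result into a separate data frame. It is best explained with an example: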
tradeData dataframe:
Time Amount
2014-05-16 14:00:05 10
2014-05-16 14:00:10 20
2014-05-16 14:08:15 30
2014-05-16 14:23:09 51
2014-05-16 14:59:54 84
2014-05-16 15:09:45 94
2014-05-16 15:24:41 53
2014-05-16 16:30:51 44
The data frame above contains the data I want to aggregate. Below is the data frame I want to insert it into:
hourlyData data frame:
Time Profit
2014-05-16 00:00:00 100
2014-05-16 01:00:00 200
2014-05-16 02:00:00 250
...
2014-05-16 14:00:00 30
2014-05-16 15:00:00 -50
2014-05-16 16:00:00 67
...
2014-05-16 23:00:00 -8
I want to aggregate the data in the tradeData data frame and place it in the correct position in the hourlyData data frame, as shown below:
New hourlyData data frame:
Time Profit Amount
2014-05-16 00:00:00 100 0
2014-05-16 01:00:00 200 0
2014-05-16 02:00:00 250 0
...
2014-05-16 14:00:00 30 0
2014-05-16 15:00:00 -50 195 (10+20+30+51+84)
2014-05-16 16:00:00 67 147 (94+53)
2014-05-16 17:00:00 20 44
...
2014-05-16 23:00:00 -8 0
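Using the solution provided by Akrun below, I was able to get the right result in most cases. However, there appears to be a problem when an event occurs in the last hour of the day, as shown below:
tradeData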
Time Amount
2014-08-15 22:09:07 11037.778
2014-08-15 23:01:33 13374.724
2014-08-20 23:25:40 133373.000
HourlyData
Time Amount
2014-08-15 23:00:00 11037.778 (correct)
2014-08-18 00:00:00 0 (incorrect)
2014-08-21 00:00:00 133373 (correct)
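When aggregating into the hourlyData data frame, the formula seems to skip the data from the second trade in the tradeData data frame. This seems to happen for trades that occur in the last hour of Friday because (I think) there is no row for Saturday 12 AM, i.e. Friday 11 PM + 1 hour. It works for trades in the last hour of Monday through Thursday.
Any ideas on how to adjust the algorithm? Please let me know if anything is unclear.
Thanks,
Mike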
Answer 0 (score: 1)
Try
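library(dplyr)
# Sum Amount per hour, labelling each bin by the end of its hour (start + 3600 s),
# then left-join the sums onto the hourly table
res <- left_join(df2,
         df %>%
           group_by(hour=as.POSIXct(cut(Time, breaks='hour'))+3600) %>%
           summarise(Amount=sum(Amount)),
         by=c('Time'='hour'))
# Hours without trades come back as NA; set them to 0
res$Amount[is.na(res$Amount)] <- 0
res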
# Time Profit Amount
#1 2014-05-16 00:00:00 100 0
#2 2014-05-16 01:00:00 200 0
#3 2014-05-16 02:00:00 250 0
#4 2014-05-16 14:00:00 30 0
#5 2014-05-16 15:00:00 -50 195
#6 2014-05-16 16:00:00 67 147
#7 2014-05-16 23:00:00 -8 0
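The `+3600` is what moves each trade to the row for the end of its hour, which is how the question's expected output is laid out (trades between 14:00 and 15:00 land on the 15:00 row). A minimal sketch of that labelling step follows. It uses `trunc()` instead of the `cut()` call above, and the sample timestamps and the UTC time zone are assumptions made only so the output is reproducible:

```r
# Hypothetical timestamps; UTC is assumed only for reproducibility
x <- as.POSIXct(c("2014-05-16 14:00:05", "2014-05-16 14:59:54",
                  "2014-05-16 15:09:45"), tz = "UTC")

# Floor each timestamp to the start of its hour
hour_start <- as.POSIXct(trunc(x, units = "hours"))

# Add one hour (3600 seconds) so the bin is labelled by its end
hour_end <- hour_start + 3600

format(hour_end, "%Y-%m-%d %H:%M:%S")
# "2014-05-16 15:00:00" "2014-05-16 15:00:00" "2014-05-16 16:00:00"
```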
Or using data.table
library(data.table)
DT <- data.table(df)
DT2 <- data.table(df2)
# Sum Amount per hour-end label
DT1 <- DT[, list(Amount=sum(Amount)),
          by=.(Time=as.POSIXct(cut(Time, breaks='hour'))+3600)]
setkey(DT1, Time)
# Join onto the hourly table and replace missing sums with 0
DT1[DT2][is.na(Amount), Amount:=0][]
# Time Amount Profit
#1: 2014-05-16 00:00:00 0 100
#2: 2014-05-16 01:00:00 0 200
#3: 2014-05-16 02:00:00 0 250
#4: 2014-05-16 14:00:00 0 30
#5: 2014-05-16 15:00:00 195 -50
#6: 2014-05-16 16:00:00 147 67
#7: 2014-05-16 23:00:00 0 -8
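If neither package is available, the same aggregate-then-join idea can probably be expressed in base R with `aggregate()` and `merge()`. This is only a sketch: the `trade` and `hourly` data frames below are hypothetical stand-ins for `df` and `df2`, built in UTC so the result does not depend on the local time zone, and `trunc()` replaces `cut()` for the hourly binning.

```r
# Hypothetical data mirroring the question's example (UTC assumed)
trade <- data.frame(
  Time = as.POSIXct(c("2014-05-16 14:00:05", "2014-05-16 14:00:10",
                      "2014-05-16 14:08:15", "2014-05-16 14:23:09",
                      "2014-05-16 14:59:54", "2014-05-16 15:09:45",
                      "2014-05-16 15:24:41", "2014-05-16 16:30:51"), tz = "UTC"),
  Amount = c(10, 20, 30, 51, 84, 94, 53, 44)
)
hourly <- data.frame(
  Time = as.POSIXct(c("2014-05-16 14:00:00", "2014-05-16 15:00:00",
                      "2014-05-16 16:00:00", "2014-05-16 17:00:00"), tz = "UTC"),
  Profit = c(30, -50, 67, 20)
)

# Label each trade by the end of its hour, then sum per label
trade$Time <- as.POSIXct(trunc(trade$Time, units = "hours")) + 3600
sums <- aggregate(Amount ~ Time, data = trade, FUN = sum)

# Left join onto the hourly table; hours without trades get 0
res <- merge(hourly, sums, by = "Time", all.x = TRUE)
res$Amount[is.na(res$Amount)] <- 0
res$Amount
# 0 195 147 44
```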
Based on the weekend information,
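# Flag trades in the last hour of Friday ('%a' is locale-dependent;
# a timestamp with exactly 0 seconds is not flagged)
indx <- with(df, as.numeric(format(Time, '%H'))==23 &
             as.numeric(format(Time, '%S'))>0 & format(Time, '%a')=='Fri')
grp <- with(df, as.POSIXct(cut(Time, breaks='hour')))
# Friday 23:00 + 49 hours = Monday 00:00; all other bins move to the end of their hour
grp[indx] <- grp[indx] +3600*49
grp[!indx] <- grp[!indx]+3600
df$Time <- grp
df %>%
  group_by(Time) %>%
  summarise(Amount=sum(Amount)) #in the example dataset, it is just 3 rows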
# Time Amount
#1 2014-08-15 23:00:00 11037.78
#2 2014-08-18 00:00:00 13374.72
#3 2014-08-21 00:00:00 133373.00
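The index above compares `format(Time, '%a')` with `'Fri'`, which only matches in an English locale, and it requires the seconds to be greater than 0, so a Friday trade at, say, 23:05:00 would not be shifted. A possible variant is sketched below. The helper name `hour_end_label` is hypothetical, it uses the `wday` field of `POSIXlt` (5 = Friday) instead of the weekday name, and it assumes the hourly table has no weekend rows and resumes at Monday 00:00, as the question's data suggests:

```r
# Hypothetical helper: label each timestamp by the end of its hour, rolling
# the Friday 23:00-23:59 bin forward to Monday 00:00 (assumes no weekend rows)
hour_end_label <- function(time) {
  start <- as.POSIXct(trunc(time, units = "hours"))
  lt <- as.POSIXlt(start)
  # wday counts from 0 = Sunday, so 5 = Friday
  friday_last_hour <- lt$wday == 5 & lt$hour == 23
  # 49 hours after Friday 23:00 is Monday 00:00; otherwise move 1 hour
  start + ifelse(friday_last_hour, 49, 1) * 3600
}

# The question's weekend example, in UTC for reproducibility
x <- as.POSIXct(c("2014-08-15 22:09:07", "2014-08-15 23:01:33",
                  "2014-08-20 23:25:40"), tz = "UTC")
format(hour_end_label(x), "%Y-%m-%d %H:%M:%S")
# "2014-08-15 23:00:00" "2014-08-18 00:00:00" "2014-08-21 00:00:00"
```

The shift is done in seconds, so in a local time zone with a daylight-saving change over that weekend the label could be off by an hour.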
# Data for the first example
df <- structure(list(Time = structure(c(1400263205, 1400263210, 1400263695,
1400264589, 1400266794, 1400267385, 1400268281, 1400272251), class = c("POSIXct",
"POSIXt"), tzone = ""), Amount = c(10L, 20L, 30L, 51L, 84L, 94L,
53L, 44L)), .Names = c("Time", "Amount"), row.names = c(NA, -8L
), class = "data.frame")
df2 <- structure(list(Time = structure(c(1400212800, 1400216400, 1400220000,
1400263200, 1400266800, 1400270400, 1400295600), class = c("POSIXct",
"POSIXt"), tzone = ""), Profit = c(100L, 200L, 250L, 30L, -50L,
67L, -8L)), .Names = c("Time", "Profit"), row.names = c(NA, -7L
), class = "data.frame")
# Data for the weekend example
df <- structure(list(Time = structure(c(1408158000, 1408334400, 1408593600
), tzone = "", class = c("POSIXct", "POSIXt")), Amount = c(11037.778,
13374.724, 133373)), .Names = c("Time", "Amount"), row.names = c(NA,
-3L), class = "data.frame")