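I want to convert the following data frame into a wide-format hourly time series, with zeros as filler wherever there is no value. Basically, I want to turn a data frame with start/end periods into an hourly time series: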
structure(list(AffectedAssetMask = structure(c(1L, 1L, 2L, 3L,
3L, 3L, 4L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("A_1", "A_2",
"A_3", "A_4", "A_5"), class = "factor"), EventStart = structure(c(1525737600,
1525723200, 1525982400, 1525694400, 1525651200, 1525636800, 1525723200,
1525662000, 1525658400, 1525654800, 1525651200, 1525662000, 1525662000
), class = c("POSIXct", "POSIXt"), tzone = "Europe/Rome"), EventStop = structure(c(1525748399,
1525737599, 1526155199, 1525723199, 1525694399, 1525651199, 1525809599,
1525665599, 1525661999, 1525658399, 1525654799, 1525665599, 1525665599
), class = c("POSIXct", "POSIXt"), tzone = "Europe/Rome"), UnavailableCapacity = c(150.95,
154.742, 115, 172.342, 198.163, 172.342, 115, 197, 281, 330,
340, 197, 197)), row.names = c(NA, -13L), class = "data.frame")
Output:
AffectedAssetMask EventStart EventStop UnavailableCapacity
A_1 2018-05-08 02:00:00 2018-05-08 04:59:59 150.950
A_1 2018-05-07 22:00:00 2018-05-08 01:59:59 154.742
A_2 2018-05-10 22:00:00 2018-05-12 21:59:59 115.000
A_3 2018-05-07 14:00:00 2018-05-07 21:59:59 172.342
A_3 2018-05-07 02:00:00 2018-05-07 13:59:59 198.163
A_3 2018-05-06 22:00:00 2018-05-07 01:59:59 172.342
A_4 2018-05-07 22:00:00 2018-05-08 21:59:59 115.000
A_5 2018-05-07 05:00:00 2018-05-07 05:59:59 197.000
A_5 2018-05-07 04:00:00 2018-05-07 04:59:59 281.000
A_5 2018-05-07 03:00:00 2018-05-07 03:59:59 330.000
A_5 2018-05-07 02:00:00 2018-05-07 02:59:59 340.000
A_5 2018-05-07 05:00:00 2018-05-07 05:59:59 197.000
A_5 2018-05-07 05:00:00 2018-05-07 05:59:59 197.000
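The last two entries are not a mistake on my part: the data sometimes contains duplicate entries with identical values, but I think I should be able to filter them out with "unique".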
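To illustrate the deduplication step, here is a minimal sketch on a toy subset of the A_5 rows. The name `df_toy` is made up for this example, and it assumes that duplicates are exact copies across all four columns.

```r
# Toy subset with one exact duplicate (values copied from the A_5 rows above)
df_toy <- data.frame(
  AffectedAssetMask   = c("A_5", "A_5", "A_5"),
  EventStart          = as.POSIXct(c("2018-05-07 04:00:00",
                                     "2018-05-07 05:00:00",
                                     "2018-05-07 05:00:00"), tz = "Europe/Rome"),
  EventStop           = as.POSIXct(c("2018-05-07 04:59:59",
                                     "2018-05-07 05:59:59",
                                     "2018-05-07 05:59:59"), tz = "Europe/Rome"),
  UnavailableCapacity = c(281, 197, 197)
)

# unique() on a data frame keeps the first occurrence of each identical row
df_unique <- unique(df_toy)
nrow(df_unique)
```

Rows that differ in any column (for example, the same period with a different capacity) would not be removed this way.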
More information about the data:
> str(df1)
'data.frame': 13 obs. of 4 variables:
$ AffectedAssetMask : Factor w/ 5 levels "A_1","A_2","A_3",..: 1 1 2 3 3 3 4 5 5 5 ...
$ EventStart : POSIXct, format: "2018-05-08 02:00:00" "2018-05-07 22:00:00" "2018-05-10 22:00:00" "2018-05-07 14:00:00" ...
$ EventStop : POSIXct, format: "2018-05-08 04:59:59" "2018-05-08 01:59:59" "2018-05-12 21:59:59" "2018-05-07 21:59:59" ...
$ UnavailableCapacity: num 151 155 115 172 198 ...
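I need the data in wide format, but instead of the EventStart/EventStop periods I need an hourly index in which every hour inside a period carries that period's value and every hour not covered by a period is zero.
Casting it to wide format: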
library(reshape2)
#cast long-format data into wide-format
dcast(df1, EventStart + EventStop ~ AffectedAssetMask, value.var = "UnavailableCapacity", fun.aggregate = mean)
The format is fine, but instead of EventStart/EventStop I need a continuous hourly time series, with zeros as "filler".
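One way to picture the step from periods to hours is to expand every event into one row per hour and only then cast to wide format. The following is a rough base-R sketch on a toy subset, not a tested solution for the full file; the names `events`, `hourly_long` and `wide` are made up for this example.

```r
# Toy events (two A_3 rows and one A_5 row from the data above)
events <- data.frame(
  AffectedAssetMask   = c("A_3", "A_3", "A_5"),
  EventStart          = as.POSIXct(c("2018-05-06 22:00:00",
                                     "2018-05-07 02:00:00",
                                     "2018-05-07 02:00:00"), tz = "Europe/Rome"),
  EventStop           = as.POSIXct(c("2018-05-07 01:59:59",
                                     "2018-05-07 13:59:59",
                                     "2018-05-07 02:59:59"), tz = "Europe/Rome"),
  UnavailableCapacity = c(172.342, 198.163, 340),
  stringsAsFactors    = FALSE
)

# One row per (event, hour): seq() steps hourly from EventStart up to EventStop
hourly_long <- do.call(rbind, lapply(seq_len(nrow(events)), function(i) {
  data.frame(
    Hour                = seq(events$EventStart[i], events$EventStop[i], by = "hour"),
    AffectedAssetMask   = events$AffectedAssetMask[i],
    UnavailableCapacity = events$UnavailableCapacity[i]
  )
}))

# Cast to wide: rows are hours, columns are assets, overlapping values are averaged
hour_label <- format(hourly_long$Hour, "%Y-%m-%d %H:%M:%S", tz = "Europe/Rome")
wide <- tapply(hourly_long$UnavailableCapacity,
               list(hour_label, hourly_long$AffectedAssetMask),
               mean)

# Asset/hour combinations without an event come back as NA; use zero as filler
wide[is.na(wide)] <- 0
```

This only yields hours in which at least one asset has an event, so hours with no event at all would still have to be added from a full hourly grid. With reshape2 loaded, `dcast(hourly_long, Hour ~ AffectedAssetMask, value.var = "UnavailableCapacity", fun.aggregate = mean, fill = 0)` should give a comparable table.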
I was able to create an empty data frame in the format I want:
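# start and end of the period of interest
startPeriod <- as.POSIXct(strptime("01/05/2018 00:00:00", "%d/%m/%Y %H:%M:%S"))
endPeriod <- as.POSIXct(strptime("30/05/2018 00:00:00", "%d/%m/%Y %H:%M:%S"))
# create an hourly sequence over the month (step of 3600 seconds)
dataseq <- seq(startPeriod, endPeriod, by = 3600)
# one column per asset; named "assets" because "list" would mask base::list
assets <- as.character(unique(df1$AffectedAssetMask))
# zero-filled frame: one row per hour, one column per asset
OverallDataFrame <- as.data.frame(matrix(0, length(dataseq), length(assets)))
names(OverallDataFrame) <- assets
# format() keeps the time part in the row names, also for midnight
rownames(OverallDataFrame) <- format(dataseq, "%Y-%m-%d %H:%M:%S")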
The target output table should look like this:
                    A_1 A_2     A_3 A_4 A_5
2018-05-06 18:00:00   0   0   0.000   0   0
2018-05-06 19:00:00   0   0   0.000   0   0
2018-05-06 20:00:00   0   0   0.000   0   0
2018-05-06 21:00:00   0   0   0.000   0   0
2018-05-06 22:00:00   0   0 172.342   0   0
2018-05-06 23:00:00   0   0 172.342   0   0
2018-05-07 00:00:00   0   0 172.342   0   0
2018-05-07 01:00:00   0   0 172.342   0   0
2018-05-07 02:00:00   0   0 198.163   0 340
2018-05-07 03:00:00   0   0 198.163   0 330
2018-05-07 04:00:00   0   0 198.163   0 281
2018-05-07 05:00:00   0   0 198.163   0 197
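The CSV file the data comes from is about 8-9 MB, and a for loop takes too long. Some people suggested using fuzzyjoin and fuzzy_left_join with the "match_fun" argument, together with another function that checks whether the index lies within the interval, but unfortunately I am a bit lost.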
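For reference, the interval check described here (does a given hour lie between EventStart and EventStop?) can be written without a loop in base R. This is only a sketch of the idea on a toy subset and a shortened hourly grid, not the fuzzyjoin approach itself; the names `hours`, `events`, `in_interval` and `wide` are made up for this example.

```r
# Shortened hourly grid for illustration (the real one would span the whole month)
hours <- seq(as.POSIXct("2018-05-06 20:00:00", tz = "Europe/Rome"),
             as.POSIXct("2018-05-07 04:00:00", tz = "Europe/Rome"),
             by = "hour")

# Toy events (A_3 and A_5 rows from the data above)
events <- data.frame(
  AffectedAssetMask   = c("A_3", "A_3", "A_5", "A_5"),
  EventStart          = as.POSIXct(c("2018-05-06 22:00:00", "2018-05-07 02:00:00",
                                     "2018-05-07 02:00:00", "2018-05-07 03:00:00"),
                                   tz = "Europe/Rome"),
  EventStop           = as.POSIXct(c("2018-05-07 01:59:59", "2018-05-07 13:59:59",
                                     "2018-05-07 02:59:59", "2018-05-07 03:59:59"),
                                   tz = "Europe/Rome"),
  UnavailableCapacity = c(172.342, 198.163, 340, 330),
  stringsAsFactors    = FALSE
)

# Zero-filled target: one row per hour, one column per asset
assets <- sort(unique(events$AffectedAssetMask))
wide <- matrix(0, nrow = length(hours), ncol = length(assets),
               dimnames = list(format(hours, "%Y-%m-%d %H:%M:%S"), assets))

# in_interval[i, j] is TRUE when hour i lies within event j
h <- as.numeric(hours)
in_interval <- outer(h, as.numeric(events$EventStart), `>=`) &
               outer(h, as.numeric(events$EventStop), `<=`)

# Turn the TRUE cells into (hour index, event index) pairs and fill the matrix;
# if two events of one asset cover the same hour, the later row overwrites the earlier
hit <- which(in_interval, arr.ind = TRUE)
asset_col <- match(events$AffectedAssetMask[hit[, "col"]], assets)
wide[cbind(hit[, "row"], asset_col)] <- events$UnavailableCapacity[hit[, "col"]]
```

The logical matrix has one cell per hour/event pair, so for a month of hours and a large event table it may need a lot of memory; processing the events in chunks would be one way around that.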
Thanks a lot!