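Question: I want to merge two data frames depending on whether a date/time in one data frame falls within an interval in the other.
traffic: date and time (POSIXct), frequency
mydata: interval, sum of frequencies
I now want to check whether each POSIXct time in traffic falls within an interval in mydata and, if it does, add its frequency to the SumFrequency column of mydata.
Two problems I ran into: 1. The traffic data frame has considerably more rows than mydata. I don't know how to tell R to loop over every observation in traffic and check it against a row of mydata.
Here is the data: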
DateTime <- c("2014-11-01 04:00:00", "2014-11-01 04:03:00", "2014-11-01 04:06:00", "2014-11-01 04:08:00", "2014-11-01 04:10:00", "2014-11-01 04:12:00", "2015-08-01 04:13:00", "2015-08-01 04:45:00", "2015-08-01 14:15:00", "2015-08-01 14:13:00")
DateTime <- as.POSIXct(DateTime)
Frequency <- c(1,2,3,5,12,1,2,2,1,1)
traffic <- data.frame(DateTime, Frequency)
library(lubridate)
DateTime1 <- c("2014-11-01 04:00:00", "2015-08-01 04:03:00", "2015-08-01 14:00:00")
DateTime2 <- c("2014-11-01 04:15:00", "2015-08-01 04:13:00", "2015-08-01 14:15:00")
DateTime1 <- as.POSIXct(DateTime1)
DateTime2 <- as.POSIXct(DateTime2)
mydata <- data.frame(DateTime1, DateTime2)
mydata$Interval <- as.interval(DateTime1, DateTime2)
mydata$SumFrequency <- NA
The expected result should look like this:
mydata$SumFrequency <- c(24, 2, 2)
head(mydata)
I tried int_overlaps from the lubridate package. Any hints on how to solve this are much appreciated!
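Answer 0 (score: 3)
Use the between function from data.table (here in its infix form, %between%) to filter the traffic data set by time:
library(data.table)
setDT(traffic)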
setDT(mydata)
mydata[,SumFrequency := as.numeric(SumFrequency)] # coerce logical to numeric for next step.
mydata[,SumFrequency := sum( traffic[ DateTime %between% c(DateTime1, DateTime2), Frequency] ), by=1:nrow(mydata)]
This gives:
DateTime1 DateTime2 Interval SumFrequency
1: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET 24
2: 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST 2
3: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST 2
If mydata has many rows, it is better to create an index column and use it in the by clause:
mydata[, idx := .I]
mydata[, SumFrequency := sum( traffic[DateTime %between% c(DateTime1, DateTime2),Frequency] ),by=idx]
which gives:
DateTime1 DateTime2 Interval SumFrequency idx
1: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET 24 1
2: 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST 2 2
3: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST 2 3
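As an alternative to filtering traffic once per row of mydata, the same sums can probably be obtained with a non-equi join. This is a different technique from the %between% approach above, and it assumes data.table 1.9.8 or later. The sketch rebuilds the question's sample data; tz = "UTC" and the helper name sums are assumptions made for illustration.

```r
library(data.table)

# Sample data from the question. tz = "UTC" is an assumption made here so the
# result does not depend on the local time zone.
traffic <- data.table(
  DateTime = as.POSIXct(c("2014-11-01 04:00:00", "2014-11-01 04:03:00",
                          "2014-11-01 04:06:00", "2014-11-01 04:08:00",
                          "2014-11-01 04:10:00", "2014-11-01 04:12:00",
                          "2015-08-01 04:13:00", "2015-08-01 04:45:00",
                          "2015-08-01 14:15:00", "2015-08-01 14:13:00"),
                        tz = "UTC"),
  Frequency = c(1, 2, 3, 5, 12, 1, 2, 2, 1, 1)
)
mydata <- data.table(
  DateTime1 = as.POSIXct(c("2014-11-01 04:00:00", "2015-08-01 04:03:00",
                           "2015-08-01 14:00:00"), tz = "UTC"),
  DateTime2 = as.POSIXct(c("2014-11-01 04:15:00", "2015-08-01 04:13:00",
                           "2015-08-01 14:15:00"), tz = "UTC")
)

# Non-equi join: for each row of mydata, match the traffic rows whose DateTime
# lies in [DateTime1, DateTime2] and sum their Frequency per mydata row.
sums <- traffic[mydata,
                on = .(DateTime >= DateTime1, DateTime <= DateTime2),
                .(SumFrequency = sum(Frequency)),
                by = .EACHI]

# by = .EACHI returns one row per row of mydata, in the same order.
mydata[, SumFrequency := sums$SumFrequency]
mydata
```

An interval that contains no traffic rows would get NA rather than 0 with this sketch.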
Answer 1 (score: 3)
A short solution with foverlaps from the data.table package:
library(data.table)
mydata <- data.table(DateTime1, DateTime2, key = c("DateTime1", "DateTime2"))
traffic <- data.table(start = DateTime, end = DateTime, Frequency, key = c("start","end"))
foverlaps(traffic, mydata, type="within", nomatch=0L)[, .(sumFreq = sum(Frequency)),
by = .(DateTime1, DateTime2)]
This gives:
DateTime1 DateTime2 sumFreq
1: 2014-11-01 04:00:00 2014-11-01 04:15:00 24
2: 2015-08-01 04:03:00 2015-08-01 04:13:00 2
3: 2015-08-01 14:00:00 2015-08-01 14:15:00 2
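Because nomatch = 0L drops every traffic row that falls outside all intervals, it can be worth checking which rows were discarded. The following is only a sketch using the question's sample data: it reruns the join with nomatch = NA and keeps the rows that found no interval. tz = "UTC" and the names joined and unmatched are assumptions made for illustration.

```r
library(data.table)

# Sample data from the question; tz = "UTC" is an assumption.
DateTime <- as.POSIXct(c("2014-11-01 04:00:00", "2014-11-01 04:03:00",
                         "2014-11-01 04:06:00", "2014-11-01 04:08:00",
                         "2014-11-01 04:10:00", "2014-11-01 04:12:00",
                         "2015-08-01 04:13:00", "2015-08-01 04:45:00",
                         "2015-08-01 14:15:00", "2015-08-01 14:13:00"),
                       tz = "UTC")
Frequency <- c(1, 2, 3, 5, 12, 1, 2, 2, 1, 1)

mydata <- data.table(
  DateTime1 = as.POSIXct(c("2014-11-01 04:00:00", "2015-08-01 04:03:00",
                           "2015-08-01 14:00:00"), tz = "UTC"),
  DateTime2 = as.POSIXct(c("2014-11-01 04:15:00", "2015-08-01 04:13:00",
                           "2015-08-01 14:15:00"), tz = "UTC"),
  key = c("DateTime1", "DateTime2")
)

# Each timestamp becomes a zero-width interval (start == end), because
# foverlaps needs an interval on both sides of the join.
traffic <- data.table(start = DateTime, end = DateTime, Frequency,
                      key = c("start", "end"))

# nomatch = NA keeps traffic rows without a matching interval; their
# DateTime1 and DateTime2 columns are NA.
joined <- foverlaps(traffic, mydata, type = "within", nomatch = NA)
unmatched <- joined[is.na(DateTime1), .(start, Frequency)]
unmatched
```

With the sample data, the only unmatched observation is the one at 2015-08-01 04:45:00.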
Answer 2 (score: 2)
I see two solutions:
data.frame and plyr
You can use the %within% function from lubridate together with a for loop, or with a plyr looping function such as dlply:
DateTime <- c("2014-11-01 04:00:00", "2014-11-01 04:03:00", "2014-11-01 04:06:00", "2014-11-01 04:08:00", "2014-11-01 04:10:00", "2014-11-01 04:12:00", "2015-08-01 04:13:00", "2015-08-01 04:45:00", "2015-08-01 14:15:00", "2015-08-01 14:13:00")
DateTime <- as.POSIXct(DateTime)
Frequency <- c(1,2,3,5,12,1,2,2,1,1)
traffic <- data.frame(DateTime, Frequency)
library(lubridate)
DateTime1 <- c("2014-11-01 04:00:00", "2015-08-01 04:03:00", "2015-08-01 14:00:00")
DateTime2 <- c("2014-11-01 04:15:00", "2015-08-01 04:13:00", "2015-08-01 14:15:00")
DateTime1 <- as.POSIXct(DateTime1)
DateTime2 <- as.POSIXct(DateTime2)
mydata <- data.frame(DateTime1, DateTime2)
mydata$Interval <- as.interval(DateTime1, DateTime2)
library(plyr)
# Create a group-by variable
mydata$NumInt <- 1:nrow(mydata)
mydata$SumFrequency <- dlply(mydata, .(NumInt),
function(row){
sum(
traffic[traffic$DateTime %within% row$Interval, "Frequency"]
)
})
mydata
#> DateTime1 DateTime2
#> 1 2014-11-01 04:00:00 2014-11-01 04:15:00
#> 2 2015-08-01 04:03:00 2015-08-01 04:13:00
#> 3 2015-08-01 14:00:00 2015-08-01 14:15:00
#> Interval NumInt SumFrequency
#> 1 2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET 1 24
#> 2 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST 2 2
#> 3 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST 3 2
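The same row-wise check can also be sketched in base R alone, without plyr or lubridate, by comparing each timestamp against the interval bounds directly. This swaps dlply and %within% for vapply and plain comparisons, and it returns an ordinary numeric vector. tz = "UTC" is an assumption made so the example does not depend on the local time zone.

```r
# Sample data from the question; tz = "UTC" is an assumption.
traffic <- data.frame(
  DateTime = as.POSIXct(c("2014-11-01 04:00:00", "2014-11-01 04:03:00",
                          "2014-11-01 04:06:00", "2014-11-01 04:08:00",
                          "2014-11-01 04:10:00", "2014-11-01 04:12:00",
                          "2015-08-01 04:13:00", "2015-08-01 04:45:00",
                          "2015-08-01 14:15:00", "2015-08-01 14:13:00"),
                        tz = "UTC"),
  Frequency = c(1, 2, 3, 5, 12, 1, 2, 2, 1, 1)
)
mydata <- data.frame(
  DateTime1 = as.POSIXct(c("2014-11-01 04:00:00", "2015-08-01 04:03:00",
                           "2015-08-01 14:00:00"), tz = "UTC"),
  DateTime2 = as.POSIXct(c("2014-11-01 04:15:00", "2015-08-01 04:13:00",
                           "2015-08-01 14:15:00"), tz = "UTC")
)

# For each row i of mydata, sum the frequencies of the traffic rows whose
# timestamp lies in the closed interval [DateTime1[i], DateTime2[i]].
mydata$SumFrequency <- vapply(seq_len(nrow(mydata)), function(i) {
  inside <- traffic$DateTime >= mydata$DateTime1[i] &
            traffic$DateTime <= mydata$DateTime2[i]
  sum(traffic$Frequency[inside])
}, numeric(1))

mydata
```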
data.table and the foverlaps function
data.table already implements overlap joins, which you can use here with a small trick. The function is foverlaps (I use data.table 1.9.6 below).
(See How to perform join over date ranges using data.table? and this presentation)
Note that you do not need lubridate.
DateTime <- c("2014-11-01 04:00:00", "2014-11-01 04:03:00", "2014-11-01 04:06:00", "2014-11-01 04:08:00", "2014-11-01 04:10:00", "2014-11-01 04:12:00", "2015-08-01 04:13:00", "2015-08-01 04:45:00", "2015-08-01 14:15:00", "2015-08-01 14:13:00")
DateTime <- as.POSIXct(DateTime)
Frequency <- c(1,2,3,5,12,1,2,2,1,1)
library(data.table)
traffic <- data.table(DateTime, Frequency)
DateTime1 <- c("2014-11-01 04:00:00", "2015-08-01 04:03:00", "2015-08-01 14:00:00")
DateTime2 <- c("2014-11-01 04:15:00", "2015-08-01 04:13:00", "2015-08-01 14:15:00")
mydata <- data.table(DateTime1 = as.POSIXct(DateTime1), DateTime2 = as.POSIXct(DateTime2))
# Use function `foverlaps` for overlapping joins
# Here's the trick : create a dummy variable to artificially have an interval
traffic[, dummy:=DateTime]
setkey(mydata, DateTime1, DateTime2)
# do the join
mydata2 <- foverlaps(traffic, mydata, by.x=c("DateTime", "dummy"), type ="within", nomatch=0L)[, dummy := NULL][]
mydata2
#> DateTime1 DateTime2 DateTime Frequency
#> 1: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:00:00 1
#> 2: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:03:00 2
#> 3: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:06:00 3
#> 4: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:08:00 5
#> 5: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:10:00 12
#> 6: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:12:00 1
#> 7: 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:13:00 2
#> 8: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:15:00 1
#> 9: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:13:00 1
# summarise with a sum by grouping by each line of mydata
setkeyv(mydata2, key(mydata))
mydata2[mydata, .(SumFrequency = sum(Frequency)), by = .EACHI]
#> DateTime1 DateTime2 SumFrequency
#> 1: 2014-11-01 04:00:00 2014-11-01 04:15:00 24
#> 2: 2015-08-01 04:03:00 2015-08-01 04:13:00 2
#> 3: 2015-08-01 14:00:00 2015-08-01 14:15:00 2
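Answer 3 (score: 1)
Regarding point 2, you could use aggregate, for instance:
aggData <- aggregate(traffic$Frequency~format(traffic$DateTime, "%Y%m%d %H:%M"), data=traffic, sum)
to sum all frequencies per one-minute interval.
As for point 1, wouldn't merge work?
merge(x = mydata, y = aggData, by = "DateTime", all.x = TRUE)
Outer merges are explained here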
Answer 4 (score: 1)
Using a for loop, we can do this:
for(i in 1:nrow(mydata)) {
mydata$SumFrequency[i] <- sum(traffic$Frequency[traffic$DateTime %within% mydata$Interval[i]])
}
> mydata
# DateTime1 DateTime2 Interval SumFrequency
#1 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET 24
#2 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST 2
#3 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST 2
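If the explicit loop becomes slow, one possible loop-free variant builds a logical matrix with outer that marks which traffic row falls in which interval. This is only a sketch: it replaces %within% with numeric comparisons, it allocates an nrow(mydata) by nrow(traffic) matrix (so it suits moderate sizes only), and tz = "UTC" and the names ts and inside are assumptions made for illustration.

```r
# Sample data from the question; tz = "UTC" is an assumption.
traffic <- data.frame(
  DateTime = as.POSIXct(c("2014-11-01 04:00:00", "2014-11-01 04:03:00",
                          "2014-11-01 04:06:00", "2014-11-01 04:08:00",
                          "2014-11-01 04:10:00", "2014-11-01 04:12:00",
                          "2015-08-01 04:13:00", "2015-08-01 04:45:00",
                          "2015-08-01 14:15:00", "2015-08-01 14:13:00"),
                        tz = "UTC"),
  Frequency = c(1, 2, 3, 5, 12, 1, 2, 2, 1, 1)
)
mydata <- data.frame(
  DateTime1 = as.POSIXct(c("2014-11-01 04:00:00", "2015-08-01 04:03:00",
                           "2015-08-01 14:00:00"), tz = "UTC"),
  DateTime2 = as.POSIXct(c("2014-11-01 04:15:00", "2015-08-01 04:13:00",
                           "2015-08-01 14:15:00"), tz = "UTC")
)

# Work on seconds since the epoch so the comparisons are plain numeric.
ts <- as.numeric(traffic$DateTime)

# inside[i, j] is TRUE when traffic row j lies in the closed interval of
# mydata row i.
inside <- outer(as.numeric(mydata$DateTime1), ts, "<=") &
          outer(as.numeric(mydata$DateTime2), ts, ">=")

# Sum the matching frequencies for each interval (one matrix row each).
mydata$SumFrequency <- apply(inside, 1, function(hit) sum(traffic$Frequency[hit]))

mydata
```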