My question is about time series data.
Suppose I have a file named P1 with the columns Time.Stamp and Value. The data look like this:
Time.Stamp        Value
01/01/2017 19:08  12
01/01/2017 19:08  24
01/01/2017 19:08  45
01/01/2017 19:08  56
01/01/2017 19:08  78
01/01/2017 19:08  76
01/01/2017 19:08  34
01/01/2017 19:09  65
01/01/2017 19:09  87
I also have a separate file named P2 with two columns, "Transaction from" and "transaction to". It contains the following:
Transaction from  transaction to
01/01/2017 19:00  01/01/2017 19:15
01/01/2017 19:15  01/01/2017 19:30
02/01/2017 08:45  02/01/2017 09:00
02/01/2017 09:00  02/01/2017 09:15
02/01/2017 09:15  02/01/2017 09:30
02/01/2017 09:30  02/01/2017 09:45
03/01/2017 18:00  03/01/2017 18:15
03/01/2017 18:15  03/01/2017 18:30
03/01/2017 23:45  04/01/2017 00:00
04/01/2017 00:15  04/01/2017 00:30
04/01/2017 01:45  04/01/2017 02:00
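Now I want to check, in R, whether each "Time.Stamp" in file P1 falls within an interval running from "Transaction from" to "transaction to" in file P2. If a "Time.Stamp" lies within the range given by the two P2 columns, the values associated with that Time.Stamp should be summed. The columns of P1 and P2 are not the same length: P1 is much longer than P2.
It would be very helpful if someone could suggest a solution in R.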
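Answer 0 (score: 1)
This may be a duplicate of "How to perform join over date ranges using data.table?". Assuming P1 and P2 are data frames and the date columns are already POSIXct, here is the lifesaver join that data.table provides: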
library(data.table)
setDT(P1)
setDT(P2)
# foverlaps() needs an interval on both sides, so turn each timestamp
# into a zero-length interval [Time.Stamp, dummy]
P1[ , dummy := Time.Stamp]
setkey(P2, Transaction.from, transaction.to)
dt <- foverlaps(
  P1,
  P2,
  by.x = c("Time.Stamp", "dummy"),
  # mult = "first"/mult = "last" will only choose first/last match
  nomatch = 0L
)[ , dummy := NULL]
# you can run ?data.table::foverlaps for the documentation
For a step-by-step explanation and other possible answers, see this great blog post.
After that, you can simply:
library(dplyr)
dt %>%
  group_by(Transaction.from) %>%
  mutate(total = sum(Value))  # per-interval total, repeated on every matched row
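Note that this solution may look long for the simple aggregation you asked for. It becomes very handy, though, if you need to merge the data frames and carry out more complex analysis.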
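As a rough end-to-end sketch of the approach above, the following builds two small tables shaped like P1 and P2, runs the same foverlaps() join, and then collapses the result to one total per interval using data.table's own grouping rather than dplyr. The toy values, the third timestamp, and the assumption that the dates are day/month/year are illustrative choices, not part of the original data.

```r
library(data.table)

# Toy inputs shaped like the question's files (dates assumed to be day/month/year)
fmt <- "%d/%m/%Y %H:%M"
P1 <- data.table(
  Time.Stamp = as.POSIXct(c("01/01/2017 19:08", "01/01/2017 19:09", "01/01/2017 19:20"),
                          format = fmt, tz = "UTC"),
  Value = c(12, 65, 5)
)
P2 <- data.table(
  Transaction.from = as.POSIXct(c("01/01/2017 19:00", "01/01/2017 19:15"),
                                format = fmt, tz = "UTC"),
  transaction.to   = as.POSIXct(c("01/01/2017 19:15", "01/01/2017 19:30"),
                                format = fmt, tz = "UTC")
)

# Each timestamp becomes a zero-length interval so it can be joined by overlap
P1[, dummy := Time.Stamp]
setkey(P2, Transaction.from, transaction.to)
matched <- foverlaps(P1, P2, by.x = c("Time.Stamp", "dummy"), nomatch = 0L)

# One row per interval, with the sum of the values that fell inside it
totals <- matched[, .(total = sum(Value)), by = .(Transaction.from, transaction.to)]
print(totals)
```

The overlap test is inclusive at both ends, so a timestamp that falls exactly on a shared boundary such as 19:15 would match both neighbouring intervals and be counted twice; whether that matters depends on the data.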
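Answer 1 (score: 1)
First, convert all the dates with as.POSIXct(x, format = "%d/%m/%Y %H:%M"). Then use the following function to check whether each element of p1$Time.Stamp falls within any of the periods from p2[,1] to p2[,2], and aggregate afterwards: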
isitthere <- function(x, from = p2$`Transaction from`, to = p2$`transaction to`) {
  any(x >= from & x <= to)
}
Apply the function to every element of p1$Time.Stamp:
index<-sapply(p1$Time.Stamp, isitthere,from=p2$`Transaction from`,to=p2$`transaction to`)
index
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Now sum:
sum(p1$Value[index])
[1] 477
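The sum above is a single grand total over all matching rows. If one total per P2 interval is wanted instead, a minimal base R sketch could loop over the rows of p2. The data below reproduce the question's P1 and the first two rows of P2; the `total` column name and the day/month/year format are assumptions.

```r
fmt <- "%d/%m/%Y %H:%M"

# P1 from the question
p1 <- data.frame(
  Time.Stamp = as.POSIXct(c(rep("01/01/2017 19:08", 7), rep("01/01/2017 19:09", 2)),
                          format = fmt, tz = "UTC"),
  Value = c(12, 24, 45, 56, 78, 76, 34, 65, 87)
)

# First two intervals of P2; check.names = FALSE keeps the spaces in the names
p2 <- data.frame(
  `Transaction from` = as.POSIXct(c("01/01/2017 19:00", "01/01/2017 19:15"),
                                  format = fmt, tz = "UTC"),
  `transaction to`   = as.POSIXct(c("01/01/2017 19:15", "01/01/2017 19:30"),
                                  format = fmt, tz = "UTC"),
  check.names = FALSE
)

# For each interval, sum the values whose timestamp lies inside it (inclusive)
p2$total <- vapply(seq_len(nrow(p2)), function(i) {
  inside <- p1$Time.Stamp >= p2$`Transaction from`[i] &
            p1$Time.Stamp <= p2$`transaction to`[i]
  sum(p1$Value[inside])
}, numeric(1))

print(p2)
```

This does one full pass over p1 per interval, which is fine for a short P2 but may become slow if both files are large.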
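Answer 2 (score: 0)
It is not clear to me what the aggregation should be grouped by, but assuming DF1 and DF2 are as defined in the Note at the end, the query below looks up, for each row of DF2, zero or more matching rows in DF1 and then sums Value over all rows that share the same Transaction.from and Transaction.to.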
library(sqldf)
sqldf("select [Transaction.from], [Transaction.to], sum(Value) as Value
from DF2
left join DF1 on [Time.Stamp] between [Transaction.from] and [Transaction.to]
group by [Transaction.from], [Transaction.to]")
giving:
Transaction.from Transaction.to Value
1 2017-01-01 19:00:00 2017-01-01 19:15:00 477
2 2017-01-01 19:15:00 2017-01-01 19:30:00 NA
3 2017-02-01 08:45:00 2017-02-01 09:00:00 NA
4 2017-02-01 09:00:00 2017-02-01 09:15:00 NA
5 2017-02-01 09:15:00 2017-02-01 09:30:00 NA
6 2017-02-01 09:30:00 2017-02-01 09:45:00 NA
7 2017-03-01 18:00:00 2017-03-01 18:15:00 NA
8 2017-03-01 18:15:00 2017-03-01 18:30:00 NA
9 2017-03-01 23:45:00 2017-04-01 00:00:00 NA
10 2017-04-01 00:15:00 2017-04-01 00:30:00 NA
11 2017-04-01 01:45:00 2017-04-01 02:00:00 NA
Note: the inputs used above, in reproducible form:
Lines1 <- "
Time.Stamp,Value
01/01/2017 19:08,12
01/01/2017 19:08,24
01/01/2017 19:08,45
01/01/2017 19:08,56
01/01/2017 19:08,78
01/01/2017 19:08,76
01/01/2017 19:08,34
01/01/2017 19:09,65
01/01/2017 19:09,87
"
DF1 <- read.csv(text = Lines1)
fmt <- "%m/%d/%Y %H:%M"
DF1 <- transform(DF1, Time.Stamp = as.POSIXct(Time.Stamp, format = fmt))
Lines2 <- "
Transaction.from,Transaction.to
01/01/2017 19:00,01/01/2017 19:15
01/01/2017 19:15,01/01/2017 19:30
02/01/2017 08:45,02/01/2017 09:00
02/01/2017 09:00,02/01/2017 09:15
02/01/2017 09:15,02/01/2017 09:30
02/01/2017 09:30,02/01/2017 09:45
03/01/2017 18:00,03/01/2017 18:15
03/01/2017 18:15,03/01/2017 18:30
03/01/2017 23:45,04/01/2017 00:00
04/01/2017 00:15,04/01/2017 00:30
04/01/2017 01:45,04/01/2017 02:00
"
DF2 <- read.csv(text = Lines2)
DF2 <- transform(DF2, Transaction.from = as.POSIXct(Transaction.from, format = fmt),
Transaction.to = as.POSIXct(Transaction.to, format = fmt))
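The Note parses the dates with "%m/%d/%Y", while answer 1 uses "%d/%m/%Y"; which one is right depends on how the files were actually written, and the sample rows alone cannot settle it. The short check below, on illustrative strings, shows how the same text lands on different days under the two formats and how a day value above 12 exposes a wrong month-first guess.

```r
x <- "02/01/2017 08:45"

# Same string, two readings
as_dmy <- as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")
as_mdy <- as.POSIXct(x, format = "%m/%d/%Y %H:%M", tz = "UTC")

format(as_dmy, "%Y-%m-%d")  # "2017-01-02" (2 January)
format(as_mdy, "%Y-%m-%d")  # "2017-02-01" (1 February)

# A day above 12 cannot be read as a month, so the month-first format yields NA
bad <- as.POSIXct("13/01/2017 08:45", format = "%m/%d/%Y %H:%M", tz = "UTC")
is.na(bad)  # TRUE
```

Counting the NA values after conversion on the full files is a quick way to spot a mismatched format before doing the interval join.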