Filter time series data that falls within the ranges given by another time series in R

Time: 2018-07-30 01:53:45

Tags: r

My question is about time series data. Suppose I have a file named P1 with the columns Time.Stamp and Value. The data looks like this:

Time.Stamp        Value
01/01/2017 19:08  12
01/01/2017 19:08  24
01/01/2017 19:08  45
01/01/2017 19:08  56
01/01/2017 19:08  78
01/01/2017 19:08  76
01/01/2017 19:08  34
01/01/2017 19:09  65
01/01/2017 19:09  87

I also have a separate file named P2 with two columns, "Transaction from" and "transaction to". It contains the following:

Transaction from  transaction to
01/01/2017 19:00  01/01/2017 19:15
01/01/2017 19:15  01/01/2017 19:30
02/01/2017 08:45  02/01/2017 09:00
02/01/2017 09:00  02/01/2017 09:15
02/01/2017 09:15  02/01/2017 09:30
02/01/2017 09:30  02/01/2017 09:45
03/01/2017 18:00  03/01/2017 18:15
03/01/2017 18:15  03/01/2017 18:30
03/01/2017 23:45  04/01/2017 00:00
04/01/2017 00:15  04/01/2017 00:30
04/01/2017 01:45  04/01/2017 02:00

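Now I want to find, in R, which "Time.Stamp" values in file P1 fall within the interval from "Transaction from" to "transaction to" in file P2. If a "Time.Stamp" lies within the range given by those two columns of P2, the value associated with that Time.Stamp should be summed. The columns of P1 and P2 are not of equal length: P1 is much longer than P2.

It would be very helpful if someone could suggest a solution in R.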

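3 Answers:

Answer 0: (score: 1)

This may be a duplicate of How to perform join over date ranges using data.table? Assuming P1 and P2 are data frames and the dates are already POSIXct, here is a lifesaver join provided by data.table: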

library(data.table)

setDT(P1)
setDT(P2)

P1[ , dummy := Time.Stamp]

setkey(P2, Transaction.from, transaction.to)

dt <- foverlaps(
  P1,
  P2,
  by.x = c("Time.Stamp", "dummy"),
  # mult = "first" / mult = "last" would keep only the first / last match
  nomatch = 0L
)[ , dummy := NULL]

# you can run ?data.table::foverlaps for the documentation 

For a step-by-step explanation and other possible answers, see this great blog post.

After that, you can simply:

library(dplyr)

dt %>%
  group_by(Transaction.from) %>%
  # the column in P1 is Value (capital V); mutate() keeps every row and adds the group total
  mutate(total = sum(Value))

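Note that this solution may look long for the simple aggregation you asked for. However, it is very handy if you need to merge the data frames and carry out more complex analysis.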
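For reference, here is a minimal self-contained sketch of the same foverlaps() approach that returns one row per matching interval instead of adding a column. It uses the nine P1 rows and the first three P2 rows from the question. The day/month/year format and the "UTC" time zone are assumptions, as is the name totals.

```r
library(data.table)

fmt <- "%d/%m/%Y %H:%M"  # assumption: dates are day/month/year

P1 <- data.table(
  Time.Stamp = as.POSIXct(
    c(rep("01/01/2017 19:08", 7), rep("01/01/2017 19:09", 2)),
    format = fmt, tz = "UTC"
  ),
  Value = c(12, 24, 45, 56, 78, 76, 34, 65, 87)
)

P2 <- data.table(
  Transaction.from = as.POSIXct(
    c("01/01/2017 19:00", "01/01/2017 19:15", "02/01/2017 08:45"),
    format = fmt, tz = "UTC"
  ),
  transaction.to = as.POSIXct(
    c("01/01/2017 19:15", "01/01/2017 19:30", "02/01/2017 09:00"),
    format = fmt, tz = "UTC"
  )
)

# foverlaps() needs an interval on both sides, so each timestamp
# becomes a zero-length interval [Time.Stamp, dummy]
P1[, dummy := Time.Stamp]
setkey(P2, Transaction.from, transaction.to)

joined <- foverlaps(P1, P2, by.x = c("Time.Stamp", "dummy"), nomatch = 0L)

# one row per interval that matched at least one timestamp
totals <- joined[, .(total = sum(Value)),
                 by = .(Transaction.from, transaction.to)]
print(totals)
```

Because nomatch = 0L drops intervals without any matching timestamp, only the 19:00 to 19:15 interval appears in totals here.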

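Answer 1: (score: 1)

First, convert all dates with as.POSIXct(x, format = "%d/%m/%Y %H:%M"). Then use the following function to check whether each element of p1$Time.Stamp falls within any of the periods defined by p2[,1] and p2[,2], and then aggregate: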

# TRUE if x lies inside at least one [from, to] interval (bounds inclusive)
isitthere <- function(x, from = p2$`Transaction from`, to = p2$`transaction to`) {
  any(x >= from & x <= to)
}

Apply the function to every element of p1$Time.Stamp:

index<-sapply(p1$Time.Stamp, isitthere,from=p2$`Transaction from`,to=p2$`transaction to`)
index
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Now sum:

sum(p1$Value[index])
[1] 477
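If a total per interval is wanted rather than one grand total, the same comparison can be run once per row of p2. The following is only a sketch in base R, using the nine P1 rows and the first two P2 rows from the question. The day/month/year format, the "UTC" time zone and the new column name total are assumptions.

```r
fmt <- "%d/%m/%Y %H:%M"  # assumption: dates are day/month/year

p1 <- data.frame(
  Time.Stamp = as.POSIXct(
    c(rep("01/01/2017 19:08", 7), rep("01/01/2017 19:09", 2)),
    format = fmt, tz = "UTC"
  ),
  Value = c(12, 24, 45, 56, 78, 76, 34, 65, 87)
)

p2 <- data.frame(
  `Transaction from` = as.POSIXct(
    c("01/01/2017 19:00", "01/01/2017 19:15"), format = fmt, tz = "UTC"
  ),
  `transaction to` = as.POSIXct(
    c("01/01/2017 19:15", "01/01/2017 19:30"), format = fmt, tz = "UTC"
  ),
  check.names = FALSE  # keep the column names with spaces
)

# for each interval (row of p2), sum the values whose timestamp lies inside it
p2$total <- vapply(seq_len(nrow(p2)), function(i) {
  inside <- p1$Time.Stamp >= p2$`Transaction from`[i] &
            p1$Time.Stamp <= p2$`transaction to`[i]
  sum(p1$Value[inside])
}, numeric(1))

print(p2)
```

Intervals without any matching timestamp get a total of 0, because sum() of an empty vector is 0. Since both bounds are inclusive, a timestamp of exactly 19:15 would be counted in both intervals; change one comparison to a strict one if that is not wanted.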

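Answer 2: (score: 0)

It is not clear to me what the values should be aggregated by, but assuming DF1 and DF2 are as defined in the Note at the end, then for each row in DF2 this looks up zero or more matching rows in DF1 and sums all the values for the rows that share the same Transaction.from and Transaction.to.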

library(sqldf)

sqldf("select [Transaction.from], [Transaction.to], sum(Value) as Value
       from DF2
       left join DF1 on [Time.Stamp] between [Transaction.from] and [Transaction.to]
       group by [Transaction.from], [Transaction.to]")

giving:

      Transaction.from      Transaction.to Value
1  2017-01-01 19:00:00 2017-01-01 19:15:00   477
2  2017-01-01 19:15:00 2017-01-01 19:30:00    NA
3  2017-02-01 08:45:00 2017-02-01 09:00:00    NA
4  2017-02-01 09:00:00 2017-02-01 09:15:00    NA
5  2017-02-01 09:15:00 2017-02-01 09:30:00    NA
6  2017-02-01 09:30:00 2017-02-01 09:45:00    NA
7  2017-03-01 18:00:00 2017-03-01 18:15:00    NA
8  2017-03-01 18:15:00 2017-03-01 18:30:00    NA
9  2017-03-01 23:45:00 2017-04-01 00:00:00    NA
10 2017-04-01 00:15:00 2017-04-01 00:30:00    NA
11 2017-04-01 01:45:00 2017-04-01 02:00:00    NA

Note

Lines1 <- "
Time.Stamp,Value
01/01/2017 19:08,12
01/01/2017 19:08,24
01/01/2017 19:08,45
01/01/2017 19:08,56
01/01/2017 19:08,78
01/01/2017 19:08,76
01/01/2017 19:08,34
01/01/2017 19:09,65
01/01/2017 19:09,87
"
DF1 <- read.csv(text = Lines1)
fmt <- "%m/%d/%Y %H:%M"
DF1 <- transform(DF1, Time.Stamp = as.POSIXct(Time.Stamp, format = fmt))

Lines2 <- "
Transaction.from,Transaction.to
01/01/2017 19:00,01/01/2017 19:15
01/01/2017 19:15,01/01/2017 19:30
02/01/2017 08:45,02/01/2017 09:00
02/01/2017 09:00,02/01/2017 09:15
02/01/2017 09:15,02/01/2017 09:30
02/01/2017 09:30,02/01/2017 09:45
03/01/2017 18:00,03/01/2017 18:15
03/01/2017 18:15,03/01/2017 18:30
03/01/2017 23:45,04/01/2017 00:00
04/01/2017 00:15,04/01/2017 00:30
04/01/2017 01:45,04/01/2017 02:00
"
DF2 <- read.csv(text = Lines2)
DF2 <- transform(DF2, Transaction.from = as.POSIXct(Transaction.from, format = fmt),
                      Transaction.to = as.POSIXct(Transaction.to, format = fmt))
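The format string matters for this data: the Note above parses the dates as month/day/year ("%m/%d/%Y"), while Answer 1 uses day/month/year ("%d/%m/%Y"). Which one is right depends on how the files were written, which the question does not say. A small sketch of the difference, using one value from P2 (the variable names and the "UTC" time zone are only for illustration):

```r
x <- "02/01/2017 08:45"

# same string, two interpretations
day_first   <- as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")
month_first <- as.POSIXct(x, format = "%m/%d/%Y %H:%M", tz = "UTC")

format(day_first,   "%Y-%m-%d %H:%M")  # "2017-01-02 08:45" (2 January)
format(month_first, "%Y-%m-%d %H:%M")  # "2017-02-01 08:45" (1 February)
```

For the sample rows in P1 (all on 01/01/2017) both formats give the same date, so the totals shown in the answers do not change, but the interval dates in the result table do.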