R: Conditionally match a datetime in one data frame to the nearest datetime field in a second data frame

Asked: 2015-03-16 04:15:46

Tags: r datetime merge dataframe time-series

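I have two data frames, df.events and df.activ.

df.activ has very granular, minute-level data and an order of magnitude more records (1,000,000+) than df.events, which has ~100,000 records, also at minute-level granularity. The two data frames share two fields, DateTime and Geo. Both DateTime columns are as.POSIXlt, in %Y-%m-%d %H:%M:%S format.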

df.activ <- read.table(text=
                          '"DateTime","Geo","Bin1","Bin2"
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,510,0,1
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:13:00,618,1,1
                        2014-07-01 00:13:00,510,0,1
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0',header=TRUE,sep=",")

df.events <- read.table(text=
                          '"Units","Geo","DateTime"
                        225,999,2014-07-01 00:09:00
                        40,510,2014-07-01 00:12:00
                        5,999,2014-07-01 00:28:00
                        115,999,2014-07-01 00:44:00
                        0,999,2014-07-01 00:47:00',header=TRUE,sep=",")

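My goal is to merge each row of df.activ with the nearest DateTime in df.events, provided the Geo value of that df.events row is 999.

If the Geo in df.events is not 999, I only want to merge that event when the Geo fields match (for example, the Geo = 510 case in the sample data).

I know for loops are not the idiomatic way to solve things in R, but conceptually I want a nested for loop that iterates over the DateTime field of df.activ and, for each row, pulls in the df.events record with the closest DateTime whose Geo is either 999 or equal to the Geo of that df.activ row.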

The following data frame is what I am after:

df.idealresults <- read.table(text=
                              'DateTime,Geo,Bin1,Bin2,events.DateTime,events.Units,Events.Geo
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,510,0,1,7/1/2014 0:12,40,510
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,618,1,1,7/1/2014 0:09,225,999
                              7/1/2014 0:13,510,0,1,7/1/2014 0:12,40,510
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999',header=TRUE,sep=',')

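So far I have been able to merge df.activ to the nearest DateTime in df.events. I did this with an na.locf-based approach inspired by the second half of the answer to this SO post. I have been struggling to work the Geo-matching logic into that approach; the nature of na.locf makes this hard to get right, because it relies on vectors to fill in the NAs that were bound in before the merge step.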

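2 Answers:

Answer 0: (score: 2)

Sometimes loops are hard to avoid, especially when you have conditions like yours. We can end up spending a lot of effort avoiding them when they may be the best we can do, or at least not far behind in performance and/or readability. That said, this does the trick: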

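# Convert both DateTime columns to POSIXct so they can be subtracted.
df.activ$DateTime <- as.POSIXct(df.activ$DateTime)
df.events$DateTime <- as.POSIXct(df.events$DateTime)

# Start from df.activ and add empty columns (5:7) for the matched event.
results <- df.activ
results$events.Units=NA
results$events.Geo=NA
results$events.Datetime=NA

for(i in seq_len(nrow(df.activ))) {
  # Event row indices, ordered from nearest to farthest in time.
  diffs <- order(abs(df.activ$DateTime[i] - df.events$DateTime))
  for(j in seq_along(diffs)) {
    # Take the nearest event that is a wildcard (Geo 999) ...
    if(df.events$Geo[diffs[j]] == 999) {
      results[i, 5:7] <- df.events[diffs[j],]
      break
    # ... or whose Geo equals this row's Geo (isTRUE guards against NA).
    } else if(isTRUE(df.events$Geo[diffs[j]] == df.activ$Geo[i])) {
      results[i, 5:7] <- df.events[diffs[j],]
      break
    }
  }
}

# events.Datetime was stored as seconds since the epoch; convert it back.
results$events.DateTime <- as.POSIXct(results$events.Datetime,origin = "1970-01-01")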

results
              DateTime Geo Bin1 Bin2 events.Units events.Geo events.Datetime     events.DateTime
1  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
2  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
3  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
4  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
5  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
6  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
7  2014-07-01 00:12:00 510    0    1           40        510      1404187920 2014-07-01 00:12:00
8  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
9  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
10 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
11 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
12 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
13 2014-07-01 00:13:00 618    1    1          225        999      1404187740 2014-07-01 00:09:00
14 2014-07-01 00:13:00 510    0    1           40        510      1404187920 2014-07-01 00:12:00
15 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
16 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
17 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
18 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
19 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
20 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
21 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
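
As a possible simplification of the loop above, the inner for can be replaced by filtering the eligible events and taking which.min of the absolute time gap. This is only a minimal sketch, not the code from the answer: the helper name nearest_event is hypothetical, the sample is a cut-down version of the question's data, and the timestamps are assumed to be in UTC.

```r
# Cut-down sample shaped like the question's data.
activ <- data.frame(
  DateTime = as.POSIXct(c("2014-07-01 00:11:00", "2014-07-01 00:12:00",
                          "2014-07-01 00:13:00"), tz = "UTC"),
  Geo = c(NA, 510, 618)
)
events <- data.frame(
  Units = c(225, 40, 5),
  Geo = c(999, 510, 999),
  DateTime = as.POSIXct(c("2014-07-01 00:09:00", "2014-07-01 00:12:00",
                          "2014-07-01 00:28:00"), tz = "UTC")
)

# Hypothetical helper: row index in `events` of the nearest eligible event,
# or NA if no event is eligible for this activity row.
nearest_event <- function(when, geo, events) {
  # Eligible: wildcard Geo (999), or a Geo equal to the activity row's Geo.
  # The is.na() guard keeps an NA activity Geo from matching anything.
  eligible <- which(events$Geo == 999 | (!is.na(geo) & events$Geo == geo))
  if (length(eligible) == 0) return(NA_integer_)
  gaps <- abs(as.numeric(difftime(events$DateTime[eligible], when,
                                  units = "secs")))
  eligible[which.min(gaps)]
}

# One event index per activity row.
idx <- vapply(seq_len(nrow(activ)), function(i) {
  nearest_event(activ$DateTime[i], activ$Geo[i], events)
}, integer(1))

# Attach the matched event columns to the activity rows.
result <- cbind(activ,
                events.Units = events$Units[idx],
                events.Geo = events$Geo[idx],
                events.DateTime = events$DateTime[idx])
```

This still loops over every row of df.activ, so it mainly improves readability; the work per row remains proportional to the number of events.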

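Answer 1: (score: 0)

I'm at work and this seems more or less solved, so I'll be brief. You could also do a full outer merge and then simply compute the difference between the dates. Sort by the absolute value of the date difference and use distinct to keep the closest match.

This is probably the algorithmically fastest way to merge, but it needs more RAM than the loop (your full merge will have n1 * n2 observations).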
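
The full-merge idea could look roughly like the following in base R. This is a sketch under assumptions: the answer mentions distinct (presumably from dplyr), whereas this version swaps in order plus duplicated to stay dependency-free. The row_id and gap columns are illustrative names, the event columns are pre-renamed to avoid name clashes, and the sample is a cut-down version of the question's data with timestamps assumed to be in UTC.

```r
activ <- data.frame(
  row_id = 1:3,  # illustrative key identifying each activity row
  DateTime = as.POSIXct(c("2014-07-01 00:11:00", "2014-07-01 00:12:00",
                          "2014-07-01 00:13:00"), tz = "UTC"),
  Geo = c(NA, 510, 618)
)
events <- data.frame(
  events.Units = c(225, 40, 5),
  events.Geo = c(999, 510, 999),
  events.DateTime = as.POSIXct(c("2014-07-01 00:09:00", "2014-07-01 00:12:00",
                                 "2014-07-01 00:28:00"), tz = "UTC")
)

# Cartesian product: by = NULL gives nrow(activ) * nrow(events) rows.
pairs <- merge(activ, events, by = NULL)

# Keep pairs whose event is a wildcard (999) or matches the activity Geo.
keep <- pairs$events.Geo == 999 |
  (!is.na(pairs$Geo) & pairs$events.Geo == pairs$Geo)
eligible <- pairs[which(keep), ]

# Absolute time gap in seconds between each activity row and candidate event.
eligible$gap <- abs(as.numeric(difftime(eligible$DateTime,
                                        eligible$events.DateTime,
                                        units = "secs")))

# Sort by activity row, then by gap, and keep the first (closest) pair per row.
eligible <- eligible[order(eligible$row_id, eligible$gap), ]
closest <- eligible[!duplicated(eligible$row_id), ]
```

With 1,000,000+ activity rows and ~100,000 events, the intermediate pairs table would have on the order of 10^11 rows, so in practice the product would likely need to be built in chunks, for example one time window at a time.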