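I have two data frames, df.events and df.activ.
df.activ holds very granular, minute-level data and has an order of magnitude more records (1,000,000+) than df.events, which has ~100,000 records and is also at minute-level granularity. The two data frames have two fields in common, DateTime and Geo. Both DateTime columns are POSIXlt (via as.POSIXlt), in %Y-%m-%d %H:%M:%S format.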
df.activ <- read.table(text=
'"DateTime","Geo","Bin1","Bin2"
2014-07-01 00:11:00,NA,0,0
2014-07-01 00:11:00,NA,0,0
2014-07-01 00:11:00,NA,0,0
2014-07-01 00:11:00,NA,0,0
2014-07-01 00:11:00,NA,0,0
2014-07-01 00:12:00,NA,0,0
2014-07-01 00:12:00,510,0,1
2014-07-01 00:12:00,NA,0,0
2014-07-01 00:12:00,NA,0,0
2014-07-01 00:12:00,NA,0,0
2014-07-01 00:12:00,NA,0,0
2014-07-01 00:12:00,NA,0,0
2014-07-01 00:13:00,618,1,1
2014-07-01 00:13:00,510,0,1
2014-07-01 00:13:00,NA,0,0
2014-07-01 00:13:00,NA,0,0
2014-07-01 00:13:00,NA,0,0
2014-07-01 00:13:00,NA,0,0
2014-07-01 00:13:00,NA,0,0
2014-07-01 00:13:00,NA,0,0
2014-07-01 00:13:00,NA,0,0',header=TRUE,sep=",")
df.events <- read.table(text=
'"Units","Geo","DateTime"
225,999,2014-07-01 00:09:00
40,510,2014-07-01 00:12:00
5,999,2014-07-01 00:28:00
115,999,2014-07-01 00:44:00
0,999,2014-07-01 00:47:00',header=TRUE,sep=",")
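My goal is to merge each row of df.activ with the row of df.events that has the nearest DateTime, if the Geo field of that df.events row is 999.
If the Geo of the df.events row is not 999, I only want to merge it when the Geo fields match (for example, the Geo = 510 case in the data frames provided).
I know that for loops are not the right way to do things in R, but conceptually I want a nested for loop: iterate over the DateTime field of df.activ and, for each row, pull in the record from df.events with the nearest DateTime, provided that event's Geo field is 999 or matches the Geo field in df.activ.
The data frame below is what I am after: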
df.idealresults <- read.table(text=
'DateTime,Geo,Bin1,Bin2,events.DateTime,events.Units,Events.Geo
7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:12,510,0,1,7/1/2014 0:12,40,510
7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:13,618,1,1,7/1/2014 0:09,225,999
7/1/2014 0:13,510,0,1,7/1/2014 0:12,40,510
7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999',header=TRUE,sep=',')
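So far I have been able to merge df.activ to the nearest DateTime in df.events. I did this with an approach based on na.locf, inspired by the second half of the answer to this SO post. I have been struggling to work the Geo matching logic into that approach; the nature of na.locf makes this hard to get right, because it relies on vectors to map the NAs that are bound in before the merge step.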
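For reference, the nearest-timestamp step on its own (ignoring Geo) can be sketched in base R. This is not the na.locf approach mentioned above; it swaps in findInterval on the midpoints between consecutive event times. The vectors ev_times and act_times are hypothetical toy data, the times are parsed as UTC by assumption, and the event times must be sorted ascending.

```r
# Hypothetical toy timestamps; tz = "UTC" is an assumption.
ev_times <- as.POSIXct(c("2014-07-01 00:09:00", "2014-07-01 00:28:00",
                         "2014-07-01 00:44:00"), tz = "UTC")
act_times <- as.POSIXct(c("2014-07-01 00:11:00", "2014-07-01 00:20:00",
                          "2014-07-01 00:40:00"), tz = "UTC")

ev_num <- as.numeric(ev_times)  # seconds since the epoch, sorted ascending

# Midpoints between consecutive events: a time below the first midpoint is
# nearest to event 1, a time between midpoints 1 and 2 is nearest to event 2, ...
mids <- (head(ev_num, -1) + tail(ev_num, -1)) / 2

# findInterval counts how many midpoints are <= each activity time.
nearest <- findInterval(as.numeric(act_times), mids) + 1

data.frame(act_times, nearest_event = ev_times[nearest])
```

A time that falls exactly on a midpoint is assigned to the later event here. Because this picks the nearest event without looking at Geo, it illustrates why the Geo condition is hard to bolt on afterwards.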
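Answer 0 (score: 2)
Sometimes loops are hard to avoid, especially when you have conditions like yours. Sometimes we end up spending a lot of effort avoiding them when they may be the best we can do, or at least not far behind in performance and/or readability. That said, this does the trick: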
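# Convert both DateTime columns to POSIXct so they can be subtracted
df.activ$DateTime <- as.POSIXct(df.activ$DateTime)
df.events$DateTime <- as.POSIXct(df.events$DateTime)
results <- df.activ
# Placeholder columns, in the same order as the columns of df.events
results$events.Units <- NA
results$events.Geo <- NA
results$events.Datetime <- NA
for (i in seq_len(nrow(df.activ))) {
  # Event rows ordered from nearest to farthest in time
  diffs <- order(abs(df.activ$DateTime[i] - df.events$DateTime))
  for (j in seq_along(diffs)) {
    if (df.events$Geo[diffs[j]] == 999) {
      # A Geo of 999 matches any activity row
      results[i, 5:7] <- df.events[diffs[j], ]
      break
    } else if (isTRUE(df.events$Geo[diffs[j]] == df.activ$Geo[i])) {
      # Otherwise the Geo values must be equal (isTRUE guards against NA)
      results[i, 5:7] <- df.events[diffs[j], ]
      break
    }
  }
}
# The loop stored the event time as a number of seconds; convert it back
results$events.DateTime <- as.POSIXct(results$events.Datetime, origin = "1970-01-01")
results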
              DateTime Geo Bin1 Bin2 events.Units events.Geo events.Datetime     events.DateTime
1  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
2  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
3  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
4  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
5  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
6  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
7  2014-07-01 00:12:00 510    0    1           40        510      1404187920 2014-07-01 00:12:00
8  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
9  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
10 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
11 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
12 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
13 2014-07-01 00:13:00 618    1    1          225        999      1404187740 2014-07-01 00:09:00
14 2014-07-01 00:13:00 510    0    1           40        510      1404187920 2014-07-01 00:12:00
15 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
16 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
17 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
18 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
19 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
20 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
21 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
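Answer 1 (score: 0)
I'm at work and this seems more or less solved, so I'll be brief. You could also do a full outer merge and then just look at the difference between the dates: sort by the absolute value of the date difference and use distinct to keep the closest match.
This is probably the fastest way, algorithmically, to do the merge, but it needs more RAM than a loop (the full merge will have n1 * n2 observations).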
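A minimal base R sketch of that idea follows, under some assumptions: act and ev are hypothetical three-row subsets of the question's data, the times are parsed as UTC, the id and gap columns are helpers introduced for illustration, and !duplicated stands in for distinct. The Geo rule from the question is applied as a filter on the cross join before picking the closest event.

```r
# Hypothetical toy subsets of the question's data; tz = "UTC" is an assumption.
act <- data.frame(
  DateTime = as.POSIXct(c("2014-07-01 00:11:00", "2014-07-01 00:12:00",
                          "2014-07-01 00:13:00"), tz = "UTC"),
  Geo = c(NA, 510, 618)
)
ev <- data.frame(
  events.Units = c(225, 40, 5),
  events.Geo = c(999, 510, 999),
  events.DateTime = as.POSIXct(c("2014-07-01 00:09:00", "2014-07-01 00:12:00",
                                 "2014-07-01 00:28:00"), tz = "UTC")
)

act$id <- seq_len(nrow(act))       # helper: remembers the original activity row
full <- merge(act, ev, by = NULL)  # Cartesian product: nrow(act) * nrow(ev) rows

# Keep pairs where the event Geo is 999 or equals the activity Geo.
# %in% TRUE turns NA comparisons into FALSE.
ok <- full$events.Geo == 999 | (full$Geo == full$events.Geo) %in% TRUE
full <- full[ok, ]

# Absolute time difference in seconds, then the closest event per activity row.
full$gap <- abs(as.numeric(difftime(full$DateTime, full$events.DateTime,
                                    units = "secs")))
full <- full[order(full$id, full$gap), ]
matched <- full[!duplicated(full$id), ]
matched
```

On data the size described in the question (1,000,000+ by ~100,000 rows) the cross join would be far too large to hold in memory at once, so in practice it would probably have to be done in chunks of df.activ.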