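I am trying to convert a working nested for loop to use apply, in the hope that it will run faster (from what I have read it should, though not always). The main data frame has about 150K rows to loop over, which is very time-consuming.
I wrote a for loop in R that checks whether date.time in df1 falls between the two date.times in df2; if it does and the codes in df1 and df2 match, the Location from df2 is copied into df1.
Here is a subset of sample data: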
df1<-structure(list(date.time = structure(c(1455922438, 1455922445,
1455922449, 1455922457, 1455922459, 1455922461), class = c("POSIXct",
"POSIXt"), tzone = ""), code = c(32221, 32222, 32221, 32222,
32222, 32221)), .Names = c("date.time", "code"), row.names = 50000:50005, class = "data.frame")
df2<-structure(list(Location = 11:12, Code = 32221:32222, t_in = structure(c(1455699600,
1455699600), class = c("POSIXct", "POSIXt"), tzone = ""), t_out = structure(c(1456401600,
1456401600), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("Location",
"Code", "t_in", "t_out"), class = "data.frame", row.names = 11:12)
The for loop works, but it takes a long time:
for (i in 1:nrow(df1)[1]){
  for (j in 1:nrow(df2)){
    ifelse(df1$code[i] == df2$Code[j]
           & df1$date.time [i] < df2$t_out [j]
           & df1$date.time [i] > df2$t_in [j],
           df1$Location [i] <- df2$Location [j],
           NA)
  }
}
This is as far as I have got:
ids <- as.numeric(df2$Location)
f <- function(x){
  a <- ids[ (df2$t_in < x) & (x < df2$t_out) ]
  if (length(a) == 0 ) NA else a
}
df1$Location <- lapply(df1$date.time, f)
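This returns two numbers for each row, because the date.time in df1 falls between t_in and t_out for both rows of df2. That is why the code in each data frame also has to match when the Location is copied across.
Any pointers are greatly appreciated.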
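For illustration, the attempt above can be made to check the code as well by iterating over row indices instead of over the date.time values. This is only a sketch on toy data shaped like the sample: the helper name find_location is hypothetical, the timestamps are pinned to UTC for reproducibility, and taking the first hit assumes at most one matching interval per row. It is still a row-by-row loop in disguise, so it is not guaranteed to be faster than the for loop.

```r
# Toy data shaped like df1/df2 from the question (UTC is an assumption).
df1 <- data.frame(
  date.time = as.POSIXct(c(1455922438, 1455922445, 1455922449),
                         origin = "1970-01-01", tz = "UTC"),
  code = c(32221, 32222, 32221)
)
df2 <- data.frame(
  Location = 11:12,
  Code = 32221:32222,
  t_in = as.POSIXct(c(1455699600, 1455699600),
                    origin = "1970-01-01", tz = "UTC"),
  t_out = as.POSIXct(c(1456401600, 1456401600),
                     origin = "1970-01-01", tz = "UTC")
)

# Hypothetical helper: Location for row i of df1, or NA if nothing matches.
find_location <- function(i) {
  hit <- df2$Location[df2$Code == df1$code[i] &
                      df2$t_in < df1$date.time[i] &
                      df1$date.time[i] < df2$t_out]
  # Assumes at most one matching interval; keep the first if there are more.
  if (length(hit) == 0) NA_integer_ else hit[1]
}

# vapply enforces one integer per row, so the result is a plain vector,
# not a list as with lapply.
df1$Location <- vapply(seq_len(nrow(df1)), find_location, integer(1))
```

Indexing by row keeps the POSIXct class on each date.time value and makes the matching code available inside the function.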
Answer 0: (score: 3)
The data.table package has overlapping range joins that can do this very quickly. The function you are looking for is foverlaps. Here is an example, with some cleanup before calling foverlaps:
require(data.table)
dt1 <- data.table(df1)
dt2 <- data.table(df2)
## need to create a range in dt 1 to find overlaps on
dt1[,start:=date.time]
dt1[,end:=date.time]
## clean up names to match each other
setnames(dt2,c("Location","Code","start","end"))
setnames(dt1,c("code"),c("Code"))
setkey(dt1,Code,start,end)
setkey(dt2,Code,start,end)
## use foverlaps with the additional matching variable Code
out <- foverlaps(dt1,dt2,type="any",
by.x=c("Code","start","end"),
by.y=c("Code","start","end"))
## date.time is carried over from dt1 unchanged, so no renaming is needed;
## just select the columns of interest
out <- out[,.(date.time,Code,Location)]
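As a possible alternative to foverlaps (a different technique, not the one used above), data.table also supports non-equi joins in the on= argument, which can add Location to the first table by reference. The sketch below assumes data.table 1.9.8 or later, uses toy data shaped like the sample with timestamps pinned to UTC, and stores both code columns as integers to avoid type coercion in the join.

```r
library(data.table)

# Toy data shaped like df1/df2 from the question (UTC is an assumption).
dt1 <- data.table(
  date.time = as.POSIXct(c(1455922438, 1455922445, 1455922449,
                           1455922457, 1455922459, 1455922461),
                         origin = "1970-01-01", tz = "UTC"),
  code = c(32221L, 32222L, 32221L, 32222L, 32222L, 32221L)
)
dt2 <- data.table(
  Location = 11:12,
  Code = 32221:32222,
  t_in = as.POSIXct(c(1455699600, 1455699600),
                    origin = "1970-01-01", tz = "UTC"),
  t_out = as.POSIXct(c(1456401600, 1456401600),
                     origin = "1970-01-01", tz = "UTC")
)

# Update join: for each row of dt2, find the dt1 rows with the same code
# whose date.time lies strictly between t_in and t_out, and copy Location
# into dt1 by reference. Rows of dt1 without a match keep NA.
dt1[dt2,
    on = .(code == Code, date.time > t_in, date.time < t_out),
    Location := i.Location]
```

Because the assignment happens by reference, dt1 keeps its original row order and no key or renaming step is needed.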
Answer 1: (score: 1)
I tried to build a "loop-free" version that does not depend on for or apply. See whether it is any faster:
trans <- which( outer(X=df1$code, Y=df2$Code,'==') &
outer(df1$date.time , df2$t_in, ">") &
outer(df1$date.time, df2$t_out , "<") , arr.ind=TRUE)
df1$Location [ trans[,1] ] <- df2$Location [ trans[,2] ]
df1
#------
date.time code Location
50000 2016-02-19 14:53:58 32221 11
50001 2016-02-19 14:54:05 32222 12
50002 2016-02-19 14:54:09 32221 11
50003 2016-02-19 14:54:17 32222 12
50004 2016-02-19 14:54:19 32222 12
50005 2016-02-19 14:54:21 32221 11
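The three calls to outer each build an i-by-j logical matrix that is TRUE wherever one of the three conditions holds. The matrices are combined with & so that a cell stays TRUE only when all three conditions are satisfied together. which( . , arr.ind=TRUE) then returns a matrix with the i values in the first column and the j values in the second column, so ordinary [<- assignment can be used with the corresponding vectors.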
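To see the mechanics in isolation, here is a minimal sketch with made-up numbers (the names x, lo, hi and label are illustrative only). Each call to outer compares every element of x with every interval bound, and which(..., arr.ind=TRUE) turns the TRUE cells into row/column index pairs.

```r
x  <- c(5, 15, 25)   # values to classify
lo <- c(0, 10)       # lower bounds of two intervals
hi <- c(10, 20)      # upper bounds of the same intervals

# 3 x 2 logical matrix: cell [i, j] is TRUE when lo[j] < x[i] < hi[j].
inside <- outer(x, lo, ">") & outer(x, hi, "<")

# One row per TRUE cell: column 1 is the index into x, column 2 the interval.
idx <- which(inside, arr.ind = TRUE)

# Assign the interval label to the matching positions; unmatched stay NA.
label <- rep(NA_character_, length(x))
label[idx[, 1]] <- c("a", "b")[idx[, 2]]
```

Note that each outer call allocates a full length(x) by length(lo) matrix, so with around 150K rows in the first data frame the memory cost grows with the number of rows in the second one.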