Question

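I am trying to convert a working nested for loop so that it uses apply instead. I am hoping this will make it faster (from what I have read it should, although not always). The main data frame has about 150K rows to loop over, which is very time-consuming.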

I have written a for loop in R that checks whether date.time in df1 falls between two date.times in df2 and, if the codes in df1 and df2 match, pastes the Location from df2 into df1.

Here is a subset of sample data:

df1<-structure(list(date.time = structure(c(1455922438, 1455922445, 
1455922449, 1455922457, 1455922459, 1455922461), class = c("POSIXct", 
"POSIXt"), tzone = ""), code = c(32221, 32222, 32221, 32222, 
32222, 32221)), .Names = c("date.time", "code"), row.names = 50000:50005, class = "data.frame")

df2<-structure(list(Location = 11:12, Code = 32221:32222, t_in = structure(c(1455699600, 
1455699600), class = c("POSIXct", "POSIXt"), tzone = ""), t_out = structure(c(1456401600, 
1456401600), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("Location", 
"Code", "t_in", "t_out"), class = "data.frame", row.names = 11:12)

The for loop works fine but takes a very long time:

for (i in 1:nrow(df1)[1]){
  for (j in 1:nrow(df2)){
    ifelse(df1$code[i] == df2$Code[j]
           & df1$date.time [i] < df2$t_out [j]
           & df1$date.time [i] > df2$t_in [j],
           df1$Location [i] <- df2$Location [j],
           NA)
  }
}

This is as far as I have got:

ids <- as.numeric(df2$Location)
f <- function(x){
  a <- ids[ (df2$t_in < x) & (x < df2$t_out)  ]
  if (length(a) == 0 ) NA else a
}   

df1$Location <- lapply(df1$date.time, f)

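This returns both locations, because every date.time in df1 falls between t_in and t_out for both rows of df2. That is why the codes in the two data frames also need to match when the location is pasted.

Any pointers are much appreciated.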
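
For reference, the missing check on code could be folded into the lapply attempt by passing the code alongside the time, for instance with mapply. The following is only a minimal sketch on toy data: obs, rng and find_loc are hypothetical names, and the times are plain numbers rather than POSIXct values. It is still a row-by-row lookup, so it is not guaranteed to be faster than the original loop.

```r
# Toy stand-ins for df1 and df2 (hypothetical values, plain numeric times)
obs <- data.frame(time = c(5, 15, 25), code = c(1, 2, 1))
rng <- data.frame(Location = c(11, 12), Code = c(1, 2),
                  t_in = c(0, 0), t_out = c(20, 20))

# Look up one observation: the code must match AND the time must lie
# strictly inside the (t_in, t_out) window
find_loc <- function(x, cd) {
  a <- rng$Location[rng$Code == cd & rng$t_in < x & x < rng$t_out]
  if (length(a) == 0) NA_real_ else a[1]  # keep the first match only
}

# mapply walks over time and code in parallel, one pair per row
obs$Location <- mapply(find_loc, obs$time, obs$code)
obs$Location
```

Taking a[1] assumes that at most one window per code should win; overlapping windows for the same code would need a different rule.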

Answer 1

The data.table package has overlapping range joins, which can do this very quickly. The function you are looking for is foverlaps. Here is an example that does some cleanup before calling foverlaps:

require(data.table)

dt1 <- data.table(df1)
dt2 <- data.table(df2)

## need to create a range in dt 1 to find overlaps on
dt1[,start:=date.time]
dt1[,end:=date.time]

## clean up names to match each other
setnames(dt2,c("Location","Code","start","end"))
setnames(dt1,c("code"),c("Code"))

setkey(dt1,Code,start,end)
setkey(dt2,Code,start,end)

## use foverlaps with the additional matching variable Code
out <- foverlaps(dt1,dt2,type="any",
                 by.x=c("Code","start","end"),
                 by.y=c("Code","start","end"))

## more renaming and selection of the same subset of columns
setnames(out,"i.start","date.time")
out <- out[,.(date.time,Code,Location)]

Answer 2

I tried to build a "loopless" version that does not rely on for or apply. See whether it is any faster:

trans <- which( outer(X=df1$code, Y=df2$Code,'==') & 
                outer(df1$date.time , df2$t_in, ">") & 
                outer(df1$date.time, df2$t_out , "<")  , arr.ind=TRUE)
df1$Location [ trans[,1] ] <- df2$Location [ trans[,2] ]
df1
#------
                date.time  code Location
50000 2016-02-19 14:53:58 32221       11
50001 2016-02-19 14:54:05 32222       12
50002 2016-02-19 14:54:09 32221       11
50003 2016-02-19 14:54:17 32222       12
50004 2016-02-19 14:54:19 32222       12
50005 2016-02-19 14:54:21 32221       11

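The three calls to outer each build an i-by-j logical matrix that is TRUE wherever its condition holds. AND-ing the three matrices gives TRUE only where all three conditions are satisfied jointly. which( . , arr.ind=TRUE) then returns a matrix with the i values in the first column and the j values in the second, so the corresponding vectors can be assigned using ordinary [<-.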
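
To make the mechanics easier to follow, here is a small sketch of the same outer / which pattern on toy vectors. All names and values (x_code, y_loc and so on) are made up for illustration and are not the question's data.

```r
# Toy inputs (hypothetical): three observations, two time windows
x_code <- c(1, 2, 1)
x_time <- c(5, 15, 25)
y_code <- c(1, 2)
y_in   <- c(0, 10)
y_out  <- c(10, 20)
y_loc  <- c("A", "B")

# Each outer() call gives a 3 x 2 logical matrix:
# rows index the observations (i), columns index the windows (j)
m_code <- outer(x_code, y_code, "==")
m_in   <- outer(x_time, y_in, ">")
m_out  <- outer(x_time, y_out, "<")

# TRUE only where all three conditions hold; arr.ind = TRUE returns
# the (row, col) position of every TRUE cell
hits <- which(m_code & m_in & m_out, arr.ind = TRUE)

# Assign by position: observations without a hit stay NA
loc <- rep(NA_character_, length(x_code))
loc[hits[, 1]] <- y_loc[hits[, 2]]
loc
```

Each intermediate matrix has length(x) * length(y) cells, so with around 150K rows in df1 the memory cost grows with the number of rows in df2.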

How to make a nested for loop more efficient and use it with apply
