Question

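I'm doing a SQL-style join of two large, fixed-size (lat, long) coordinate datasets by nearest-neighbor search. Currently I'm using dplyr and data.table to do this. How can I optimize and parallelize my code for absolute runtime?

Previous attempts included native Python, pandas, and multiprocessing, all of which ended up being very slow. My current solution, which uses data.table to construct a table of nearest neighbors and dplyr to join based on that table, is the fastest so far but is still too slow.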

library(dplyr)
library(data.table)
library(geosphere)

source <- data.table(lat = runif(1e3), long = runif(1e3)) %>% mutate(nrow = row_number())
dest <- data.table(lat = runif(5e4), long = runif(5e4)) %>% mutate(ind = row_number())
dest_mat <- as.matrix(dest[, c('long', 'lat')])
setDT(source)
# function that returns the index of the closest matching point in dest
mindist_ind <- function(source_lat, source_long) { return(which.min(distHaversine(c(source_long, source_lat), dest_mat))) }


nn_inds <- source[, j = list(ind = mindist_ind(lat, long)), by = 1:nrow(source)] # slowest line, gets index of nearest match in dest
nn_matches <- inner_join(nn_inds, dest, by = 'ind') # join final back to dest in order to get all nearest matches
sourcedest_matches <- inner_join(source, nn_matches, by = 'nrow') # join nearest matches to source by index

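The source file is about 89 million rows, and dest is about 50 thousand rows. Current timings for various source sizes are as follows:

1,000 rows -> 46 seconds
10,000 rows -> 270 seconds
100,000 rows -> 2,580 seconds
1,000,000 rows -> 17,172 seconds

While this is the fastest I've been able to get, the full 89-million-row source file would take an estimated 17-18 days to run, which is far too long. I'm running this on an r4.16xlarge AWS EC2 instance with 488 GB RAM, 32 cores, and 64 vCPUs. How can I optimize/parallelize this code to run faster?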

Answer 1

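I assume that the code you provided in your question is not actually what you want. Your code calculates the distance between pairwise rows of source and dest, recycling source to match the length of dest.

What you probably want, and what this answer provides, is to find the nearest point in dest for every point in source. (See my comments on your question.)

Calculating distance matrices is computationally intensive. Assuming that R packages are roughly equally efficient at calculating distance matrices, the only real way to speed this up is to parallelize the distance matrix calculation. It's unfortunate that the matrix with more rows is the set of reference points, because parallelization can only occur over subsets of the source points. (That is, you need to consider all dest points to find the dest point nearest to any given source point.)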

library(parallel)
library(sp)
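# non-parallel version
x2 <- copy(source)
# spDists works on coordinate matrices, so convert the data.table columns explicitly
temp <- spDists(as.matrix(x2[, .(long, lat)]), dest_mat, longlat = TRUE)
system.time({
  nn <- apply(temp, 1, which.min)  # index of the nearest dest row for each source row
  # assign one vector per new column; as.list() on a matrix would flatten it into scalars
  final2 <- x2[, c("long_dest", "lat_dest") := list(dest_mat[nn, "long"], dest_mat[nn, "lat"])]
})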

#parallel version

source_l <- split(source, rep(1:10,each=100))

cl <- makeCluster(getOption("cl.cores", 4))
clusterExport(cl, "dest_mat") #this is slow but I don't think there's a way to get around this

system.time(
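  v <- parLapply(cl=cl, X=source_l, fun=function(x){
    library(sp)
    library(data.table)
    # spDists works on coordinate matrices, so convert the data.table columns explicitly
    temp <- spDists(as.matrix(x[, .(long, lat)]), dest_mat, longlat = TRUE)
    nn <- apply(temp, 1, which.min)  # nearest dest row for each source row in this chunk
    # assign one vector per new column; as.list() on a matrix would flatten it into scalars
    x[, c("long_dest", "lat_dest") := list(dest_mat[nn, "long"], dest_mat[nn, "lat"])]
    x
  })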
)

stopCluster(cl)

final <- rbindlist(v)
identical(final[order(nrow)],final2)
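
The check above relies on the fact that each source row's nearest dest point depends only on that row and on all of dest, so splitting source into chunks cannot change the result. Here is a minimal package-free sketch of that property. The names haversine_to_all, nearest_dest, src_demo and dest_demo are made up for illustration, and the hand-written haversine formula stands in for the sp/geosphere calls, so treat it as a sketch rather than a drop-in replacement.

```r
# Haversine distance (metres) from one point to every row of a (long, lat) matrix.
# r = 6378137 is the default radius used by geosphere::distHaversine.
haversine_to_all <- function(long, lat, dest_mat, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (dest_mat[, "lat"] - lat) * to_rad
  dlong <- (dest_mat[, "long"] - long) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat * to_rad) * cos(dest_mat[, "lat"] * to_rad) * sin(dlong / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# For each row of src, the index of the nearest row in dest_mat.
nearest_dest <- function(src, dest_mat) {
  vapply(seq_len(nrow(src)), function(i) {
    which.min(haversine_to_all(src[i, "long"], src[i, "lat"], dest_mat))
  }, integer(1))
}

set.seed(1)
src_demo <- cbind(long = runif(200), lat = runif(200))    # toy stand-in for source
dest_demo <- cbind(long = runif(500), lat = runif(500))   # toy stand-in for dest

# All source rows at once.
full <- nearest_dest(src_demo, dest_demo)

# The same rows in 4 chunks; every chunk still sees ALL dest rows.
chunk_id <- rep(1:4, each = 50)
chunked <- unlist(lapply(split(seq_len(nrow(src_demo)), chunk_id), function(rows) {
  nearest_dest(src_demo[rows, , drop = FALSE], dest_demo)
}), use.names = FALSE)

identical(full, chunked)
```

Replacing lapply with parLapply (and exporting dest_demo to the workers) is the step that the parallel version above takes.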

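You'll need to experiment to see whether using more than 32 processes actually speeds things up. Hyperthreading can be a mixed bag, and it's not always easy to predict whether it will help. Unfortunately, there's no guarantee that you'll have enough RAM to run the optimal number of processes. This approach is not only slow but also memory intensive. If you get errors indicating that you've run out of memory, you'll need to reduce the number of processes or rent an EC2 machine with more memory.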
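
To get a feel for the memory side, here is a rough back-of-the-envelope sketch. It assumes the distance matrix is stored as doubles (8 bytes per entry) and that only some fraction of RAM is available for it. The function names and the usable_fraction parameter are hypothetical. The real footprint will be higher, since apply and the worker processes hold their own copies of data.

```r
# Size in GB of an n_source x n_dest matrix of doubles (8 bytes per entry).
dist_matrix_gb <- function(n_source, n_dest = 5e4) {
  n_source * n_dest * 8 / 1e9
}

# Largest chunk (in source rows) whose distance matrix fits in each worker's
# share of RAM. usable_fraction is an assumed safety margin, not a measured value.
max_chunk_rows <- function(ram_gb, n_workers, n_dest = 5e4, usable_fraction = 0.5) {
  floor(ram_gb * 1e9 * usable_fraction / n_workers / (n_dest * 8))
}

dist_matrix_gb(1e6)      # 400 GB for a single 1-million-row chunk
max_chunk_rows(488, 32)  # 19062 rows per chunk with 32 workers
max_chunk_rows(488, 64)  # 9531 rows per chunk with 64 workers
```

Under these assumptions, a chunk of a million source rows is already too big for one worker on this machine, so the chunks passed to parLapply would need to be on the order of ten thousand rows each.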

Finally, I'll note that which.min returns the index of the first minimum if there are ties. So the results will depend on the order of the rows in dest_mat.
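
A small illustration of that tie behavior, using a made-up distance vector d and a made-up reordering perm:

```r
# Distances from one source point to four dest points; points 2 and 3 tie.
d <- c(5, 2, 2, 7)

first_min <- which.min(d)         # 2: only the first of the tied minima
all_min <- which(d == min(d))     # 2 3: every tied index

# Reorder the dest points so that original point 3 comes first.
perm <- c(3, 1, 2, 4)
reordered_min <- perm[which.min(d[perm])]  # 3: a different original point wins
```

If ties matter for your join, which(d == min(d)) gives you all candidates so you can apply your own tie-breaking rule.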

Fastest way (in parallel) to join big data sets by nearest coordinates?
