Question

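I have two CSV files containing the coordinates of locations (11 million rows, three columns: "lid", "lat", "lon") and facilities (50k rows, columns "fid", "lat", "lon"). For each location, I need to compute the distance to the nearest facility.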

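I know how to do this with st_distance in R. However, st_distance takes a very long time because it first computes the full distance matrix, and both files are large. I tried splitting the locations file into several chunks and using future_map on 3 cores, but that took much longer than I expected. Is there a way to speed this up?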

Answer 1

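Have you considered using st_buffer first? That would limit the number of facilities you need to search to find the nearest one. For example, start with a 10-mile radius and check whether that captures all of your data. If that doesn't work, try the findNeighbors() function. See the documentation: https://www.rdocumentation.org/packages/fractal/versions/2.0-4/topics/findNeighbors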
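A related way to avoid searching every pair is sf's st_nearest_feature, which returns the index of the nearest feature directly; st_distance with by_element = TRUE then gives one distance per location instead of a full matrix. This is a different technique from the st_buffer idea above, not an implementation of it. The following is only a minimal sketch: the helper name nearest_facility_sf is hypothetical, and it assumes data frames with the columns from the question ("lid", "lat", "lon" and "fid", "lat", "lon") holding WGS84 coordinates in degrees.

```r
# Hypothetical helper (not part of sf): nearest facility for every location.
# Assumes loc has columns lid/lat/lon and fac has fid/lat/lon, in WGS84 degrees.
nearest_facility_sf <- function(loc, fac) {
  loc_sf <- sf::st_as_sf(loc, coords = c("lon", "lat"), crs = 4326)
  fac_sf <- sf::st_as_sf(fac, coords = c("lon", "lat"), crs = 4326)

  # Row index in fac_sf of the nearest facility, one per location
  nearest <- sf::st_nearest_feature(loc_sf, fac_sf)

  # Element-wise distances: one value per location, never the full matrix
  dist_m <- sf::st_distance(loc_sf, fac_sf[nearest, ], by_element = TRUE)

  data.frame(
    lid    = loc$lid,
    fid    = fac$fid[nearest],
    dist_m = as.numeric(dist_m)  # metres for a geographic CRS
  )
}
```

Calling nearest_facility_sf(locations, facilities) on the two data frames read from the CSV files should return one row per location. For 11 million rows it may still be worth processing the locations in chunks to keep memory use bounded.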

In future, it would also help if you provided a sample of your data.

Answer 2

I'm sure there must be a better way to do this, but here is how I would approach it. Hope it helps.

library(tidyverse)
library(furrr)


MILLION_1 <- 10^6
K_50 <- 10^4*5

# dummy data --------------------------------------------------------------------

d_1m <- 
  tibble(
    lid_1m = 1:MILLION_1,
    long_1m = abs(rnorm(MILLION_1) * 100),
    lat_1m = abs(rnorm(MILLION_1)) * 100
  )


d_50k <- 
  tibble(
    lid_50k= 1:K_50,
    long_50k = abs(rnorm(K_50) * 100),
    lat_50k = abs(rnorm(K_50) * 100)
  )


# distance calculation for each facility ------------------------------------------

# "multiprocess" is deprecated in the future package; use "multisession" instead
future::plan(future::multisession)

d_distance <- 
  # Pass one facility row (id, long, lat) at a time to the function
  future_pmap_dfr(d_50k, function(...) {
    d50_row <- tibble(...)
    # Pair this facility with every one of the 1 million locations
    d <- tidyr::crossing(d_1m, d50_row)

    d %>%
      mutate(
        # Euclidean distance, in the units of the coordinates
        distance = sqrt((long_1m - long_50k)^2 + (lat_1m - lat_50k)^2)
      ) %>%
      # Keep the location that is closest to this facility
      filter(distance == min(distance))
  })
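Note that the code above returns, for each facility, the closest location, whereas the question asks for the nearest facility for each location. A minimal base R sketch of that direction follows. It processes the locations in chunks so that only a chunk_size x n_facilities matrix exists at any time, and it uses the haversine formula because the real data are lat/lon pairs. The function names, the chunk_size default and the Earth radius of 6371 km are assumptions for illustration. This only bounds memory: it is still a brute-force comparison of every pair.

```r
# Great-circle (haversine) distance in km; all inputs in decimal degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Hypothetical helper: nearest facility for every location, chunk by chunk.
# Assumes loc has columns lid/lat/lon, fac has fid/lat/lon, and nrow(loc) > 0.
nearest_facility <- function(loc, fac, chunk_size = 1000) {
  n <- nrow(loc)
  best_idx  <- integer(n)
  best_dist <- numeric(n)
  for (s in seq(1, n, by = chunk_size)) {
    idx <- s:min(s + chunk_size - 1, n)
    # length(idx) x nrow(fac) distance matrix for this chunk only
    d <- outer(
      seq_along(idx), seq_len(nrow(fac)),
      function(i, j) {
        haversine_km(loc$lat[idx[i]], loc$lon[idx[i]], fac$lat[j], fac$lon[j])
      }
    )
    # Column of the smallest distance in each row (first one on ties)
    k <- max.col(-d, ties.method = "first")
    best_idx[idx]  <- k
    best_dist[idx] <- d[cbind(seq_along(idx), k)]
  }
  data.frame(lid = loc$lid, fid = fac$fid[best_idx], dist_km = best_dist)
}

# Tiny made-up example: three locations, two facilities
loc <- data.frame(lid = 1:3, lat = c(0, 10, 50), lon = c(0, 10, 8))
fac <- data.frame(fid = c("a", "b"), lat = c(0, 50), lon = c(1, 8),
                  stringsAsFactors = FALSE)
res <- nearest_facility(loc, fac, chunk_size = 2)
print(res)
```

With 50k facilities, a chunk of 1000 locations needs a matrix of 50 million doubles (about 400 MB) plus temporaries, so chunk_size would need tuning to the available memory.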

Fastest way to calculate the minimum distance from a large number of locations to facilities
