Question

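I have two sets of points stored in R as sf objects. Point object x contains 204,467 points and point object y contains 5,297.

In theory, I want to calculate the distance from every point in x to every point in y. I know this produces a beast of a matrix, but it can be done in about 40 minutes on an i7 desktop using st_distance(x, y, by_element = FALSE) from the sf package.

What I want to do is calculate the distance from every point in x to every point in y and then convert the result into a data.frame that holds all the variables of x and y for each pair of points. I want this for flexibility when aggregating with dplyr: for example, I want to find the number of points in y that lie within 10, 50, or 100 km of x and that meet a condition on x$year.

I successfully created the distance matrix, which contains about 1,083,061,699 cells. I know this is a very inefficient way to do it, but it gives flexibility when aggregating. Other suggestions are welcome.

The code below creates two sf point objects and measures the distances between them. Next, I want to convert the result into a data.frame with all the variables from x and y, but this is where I am stuck.

If my proposed workflow is not feasible, can someone suggest an alternative way to measure the distances to all points within a predefined radius and create a data.frame of the results with all the variables from x and y?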

# Create two sf point objects 
set.seed(123)
library(sf)


pts1 <- st_as_sf(x = data.frame(id=seq(1,204467,1),
                                year=sample(seq(from = 1990, to = 2018, by = 1), size = 204467, replace = TRUE),
                                xcoord=sample(seq(from = -180, to = 180, by = 1), size = 204467, replace = TRUE),
                                ycoord=sample(seq(from = -90, to = 90, by = 1), size = 204467, replace = TRUE)),
                 coords=c("xcoord","ycoord"),crs=4326)

pts2 <- st_as_sf(x = data.frame(id=seq(1,5297,1),
                                year=sample(seq(from = 1990, to = 2018, by = 1), size = 5297, replace = TRUE),
                                xcoord=sample(seq(from = -180, to = 180, by = 1), size = 5297, replace = TRUE),
                                ycoord=sample(seq(from = -90, to = 90, by = 1), size = 5297, replace = TRUE)),
                 coords=c("xcoord","ycoord"),crs=4326)

distmat <- st_distance(pts1,pts2,by_element = FALSE)

Answer 1

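I would consider a different approach. Once you have the distmat matrix, you can do the kinds of calculation you describe without needing a data.frame. You can use standard subsetting to find the points that meet your specified criteria.

For example, to find the combinations of points where pts1$year is greater than pts2$year, we can do: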

subset_points = outer(pts1$year, pts2$year, `>`)

Then, to find how many of those pairs are separated by more than 100 km, we can do:

library(units)
sum(distmat[subset_points] > (100 * as_units('km', 1)))
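If a data.frame of the matching pairs is still wanted, the same logical subsetting can be combined with which(..., arr.ind = TRUE), so that only the retained pairs are expanded rather than all 1e9 cells. The following is a minimal sketch on toy data, not the full workflow: x_attr, y_attr and d are hypothetical stand-ins for the attributes of pts1 and pts2 and for distmat, and d is assumed to hold plain numeric distances in metres (a units matrix could be converted with as.numeric() first).

```r
# Hypothetical toy stand-ins for the attributes of pts1/pts2 and for distmat
x_attr <- data.frame(id = 1:3, year = c(1995, 2005, 2015))
y_attr <- data.frame(id = 1:2, year = c(2000, 2010))
d <- matrix(c(50e3, 150e3, 20e3,    # distances to y point 1 (metres)
              300e3, 80e3, 90e3),   # distances to y point 2 (metres)
            nrow = 3, ncol = 2)

# Keep pairs closer than 100 km where the x year is greater than the y year
keep <- d < 100e3 & outer(x_attr$year, y_attr$year, `>`)

# Row/column indices of the retained cells: one row per (x, y) pair
idx <- which(keep, arr.ind = TRUE)

# Long data.frame with variables from both sides plus the distance
pairs <- data.frame(
  x_id   = x_attr$id[idx[, "row"]],
  x_year = x_attr$year[idx[, "row"]],
  y_id   = y_attr$id[idx[, "col"]],
  y_year = y_attr$year[idx[, "col"]],
  dist_m = d[idx]
)
print(pairs)
```

The resulting data.frame has one row per retained pair, so it can be passed to dplyr for grouping and counting. Its size depends on how selective the filter is, not on the size of the full matrix.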

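A note on memory usage

Whichever way you do this with sf or data.frame objects, you are likely to start bumping up against RAM limits, with 1e9 floating-point values in each matrix or in each column of a data.table. You might consider converting the distance matrix to a raster. The raster can then be stored on disk rather than in memory, and you can use the memory-safe functions in the raster package to work through the problem.

How we can use rasters to work from disk and save RAM

For very large matrices like this one, we can use memory-safe raster operations, for example: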
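To put rough numbers on that, here is a back-of-the-envelope estimate using the point counts from the question. It assumes 8 bytes per double and 4 bytes per logical, which is how R stores these types, and it ignores object overhead and any temporary copies made during computation.

```r
n_x <- 204467   # points in x (from the question)
n_y <- 5297     # points in y (from the question)

# Use doubles here: the product exceeds R's 32-bit integer range
n_cells <- n_x * n_y

# Approximate size of the distance matrix (doubles, 8 bytes each)
dist_gib <- n_cells * 8 / 1024^3

# Approximate size of a logical matrix such as subset_points (4 bytes each)
lgl_gib <- n_cells * 4 / 1024^3

cat(format(n_cells, big.mark = ","), "cells\n")
cat(sprintf("distance matrix: %.2f GiB\n", dist_gib))
cat(sprintf("logical matrix:  %.2f GiB\n", lgl_gib))
```

Holding both matrices at once therefore needs on the order of 12 GiB before any intermediate results, which is why working from disk can help.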

library(raster)

# convert our matrices to rasters, so we can work on them from disk
r = raster(matrix(as.numeric(distmat), length(pts1$id), length(pts2$id)))
s = raster(subset_points)
remove('distmat', 'subset_points')

# now create a raster equal to r, but with zeroes in the cells we wish to exclude from calculation
rs = overlay(r,s,fun=function(x,y){x*y}, filename='out1.tif')     

# flag cells with a value greater than a threshold (1e6 in this example; distances are in metres, so 1000 km)
Big_cells = reclassify(rs, matrix(c(-Inf, 1e6, 0, 1e6, Inf, 1), ncol=3, byrow=TRUE), filename='out.tiff', overwrite=TRUE)

# and finally count the cells
N = cellStats(Big_cells, sum)

Calculating all distances between two sets of points using st_distance
