I have two data sets: one with over 13 million rectangular polygons (each a set of 4 lat/lng points) and another with 10 thousand points indicating the price at that location.
For each polygon, I want to compute the mean price over all points that fall inside it.
The data look like this:

> polygons
id pol_lat pol_lng
1: 148 -4.250236,-4.250236,-4.254640,-4.254640 -49.94628,-49.94494,-49.94494,-49.94628
2: 149 -4.254640,-4.254640,-5.361601,-5.361601 -49.94494,-49.07906,-49.07906,-49.94494
3: 150 -5.361601,-5.361601,-5.212208,-5.212208 -49.07906,-49.04469,-49.04469,-49.07906
4: 151 -5.212208,-5.212208,-5.002878,-5.002878 -49.04469,-48.48664,-48.48664,-49.04469
5: 152 -5.002878,-5.002878,-5.080018,-5.080018 -48.48664,-48.43699,-48.43699,-48.48664
6: 153 -5.080018,-5.080018,-5.079819,-5.079819 -48.43699,-48.42480,-48.42480,-48.43699
7: 154 -5.079819,-5.079819,-5.155606,-5.155606 -48.42480,-47.53891,-47.53891,-48.42480
8: 155 -5.155606,-5.155606,-4.954156,-4.954156 -47.53891,-47.50354,-47.50354,-47.53891
9: 156 -4.954156,-4.954156,-3.675864,-3.675864 -47.50354,-45.39022,-45.39022,-47.50354
10: 157 -3.675864,-3.675864,-3.706356,-3.706356 -45.39022,-45.30724,-45.30724,-45.39022
11: 158 -3.706356,-3.706356,-3.705801,-3.705801 -45.30724,-45.30722,-45.30722,-45.30724
> points
longitude latitude price
1: -47.50308 -4.953936 3.0616
2: -47.50308 -4.953936 3.2070
3: -47.50308 -4.953936 3.0630
4: -47.50308 -4.953936 3.0603
5: -47.50308 -4.953936 3.0460
6: -47.50308 -4.953936 2.9900
7: -49.07035 -5.283658 3.3130
8: -49.08054 -5.347284 3.3900
9: -49.08054 -5.347284 3.3620
10: -49.21726 -5.338270 3.3900
11: -49.08050 -5.347255 3.4000
12: -49.08042 -5.347248 3.3220
13: -49.08190 -5.359508 3.3130
14: -49.08046 -5.347277 3.3560
Right now I am using sp::point.in.polygon to get the indices of all points that fall within a given polygon, and then take their mean price:

w <- lapply(1:nrow(polygons),
            function(tt) {
              # logical index of the points falling inside polygon tt
              ind <- sp::point.in.polygon(points$latitude, points$longitude,
                                          polygons$pol_lat[[tt]], polygons$pol_lng[[tt]]) > 0
              med <- mean(points$price[ind])
              return(med)
            }
)
> unlist(w)
[1] NaN 3.361857 3.313000 NaN NaN NaN NaN NaN 3.071317 NaN NaN
However, this is obviously slow. Any ideas on how to do this faster, perhaps using data.table or dplyr (or any other way)?
Answer 0 (score: 1)
If your "polygons" are always rectangles, as in your example, you can speed up identifying which points fall in each polygon by using the QuadTree spatial index implemented in the SearchTrees package.
Since the spatial index reduces the number of "comparisons" needed, it can give you a large speed boost, and the more points in the dataset, the bigger the gain.
For example:
library(SearchTrees)
library(magrittr)
# Create a "beefier" test dataset based on your data: 14000 pts
# over 45000 polygons
for (i in 1:10) points <- rbind(points, points + runif(length(points)))
for (i in 1:12) polygons <- rbind(polygons, polygons)
# Compute limits of the polygons
min_xs <- lapply(polygons$pol_lng, min) %>% unlist()
max_xs <- lapply(polygons$pol_lng, max) %>% unlist()
min_ys <- lapply(polygons$pol_lat, min) %>% unlist()
max_ys <- lapply(polygons$pol_lat, max) %>% unlist()
xlims <- cbind(min_xs, max_xs)
ylims <- cbind(min_ys, max_ys)
# Create the quadtree
tree <- SearchTrees::createTree(cbind(points$longitude, points$latitude))
# Extract averages, looping over the polygons ----
t1 <- Sys.time()
w <- lapply(1:nrow(polygons),
            function(tt) {
              # points whose coordinates fall inside the polygon's bounding box
              ind <- SearchTrees::rectLookup(tree,
                                             xlims = xlims[tt, ],
                                             ylims = ylims[tt, ])
              mean(points$price[ind])
            })
Sys.time() - t1
Time difference of 2.945789 secs
w1 <- unlist(w)
On my old laptop, this "naive" implementation is already more than 10 times faster than the original approach on the test data:
t1 <- Sys.time()
w <- lapply(1:nrow(polygons),
            function(tt) {
              ind <- sp::point.in.polygon(points$latitude, points$longitude,
                                          polygons$pol_lat[[tt]], polygons$pol_lng[[tt]]) > 0
              med <- mean(points$price[ind])
              return(med)
            }
)
Sys.time() - t1
w2 <- unlist(w)
Time difference of 40.36493 secs
with identical results:
> all.equal(w1, w2)
[1] TRUE
The overall speed gain will depend on how your points are "clustered" over the spatial extent, and on their position relative to the polygons.
Note that you can exploit this approach even when the polygons are not rectangles: first extract the points contained in each polygon's bounding box via the quadtree, then use a more standard method to find which of those points are actually "inside" the polygon.
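A minimal sketch of that two-step filter, assuming the tree, xlims and ylims objects built above, and reusing the question's latitude/longitude argument order for sp::point.in.polygon (w_exact is just an illustrative name):

# 1) quadtree lookup: candidate points inside the polygon's bounding box
# 2) exact (slower) point-in-polygon test, run only on those candidates
w_exact <- lapply(1:nrow(polygons),
                  function(tt) {
                    cand <- SearchTrees::rectLookup(tree,
                                                    xlims = xlims[tt, ],
                                                    ylims = ylims[tt, ])
                    if (length(cand) == 0) return(NaN)
                    inside <- sp::point.in.polygon(points$latitude[cand], points$longitude[cand],
                                                   polygons$pol_lat[[tt]], polygons$pol_lng[[tt]]) > 0
                    mean(points$price[cand][inside])
                  })

For rectangles the second step is redundant, but for arbitrary polygons it confines the expensive test to a small candidate set.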
Also note that the task is embarrassingly parallel, so you could easily improve performance further by processing the polygons with a foreach or parLapply approach, as sketched below.
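For instance, a minimal parLapply sketch, assuming the points, xlims and ylims objects from above (the 4-worker cluster size is arbitrary; the quadtree is rebuilt on each worker rather than exported, so nothing depends on the tree object serializing cleanly):

library(parallel)

cl <- makeCluster(4)
clusterExport(cl, c("points", "xlims", "ylims"))
# build the spatial index once per worker
clusterEvalQ(cl, {
  library(SearchTrees)
  tree <- createTree(cbind(points$longitude, points$latitude))
})
w_par <- parLapply(cl, 1:nrow(polygons),
                   function(tt) {
                     ind <- rectLookup(tree, xlims = xlims[tt, ], ylims = ylims[tt, ])
                     mean(points$price[ind])
                   })
stopCluster(cl)

On Unix-alikes, parallel::mclapply forks the master process instead, so it should be able to reuse the tree built in the main session directly.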
HTH!