识别R中指定距离内的点

时间:2014-01-07 21:48:17

标签: r

从10,000个带小数点坐标的站点列表中,我尝试根据这些站点之间计算的距离来识别彼此相距不超过100英尺的站点,并创建这些站点的子集。在最后的清单中,我希望得到彼此相距不超过100英尺的台站的名称,它们的纬度和经度以及它们之间的距离。

我在其他平台上找到了类似的问题,例如mathworks(使用rangesearch)或SQL或JAVA,但在R中没有。

有没有办法在R中这样做?我找到的最接近的答案是Listing number of obervations by location,其中列出了距离内的观测数量,但似乎答案不完整,无法确定彼此相距特定距离的电台。

基本上我想弄清楚哪些电台位于同一地点。

我真的很感激任何帮助。

3 个答案:

答案 0 :(得分:7)

这里只是一些随机的示例数据

set.seed(1234)
x= sample(1:100,50)
y= sample(1:100,50)
M=cbind(x,y)
plot(M)

enter image description here

您可以将距离计算为矩阵,以便可以轻松提取原始行。这可以使用带有which = T的arr.ind函数来完成,如下所示:

DM= as.matrix(dist(M))
neighbors=which(DM < 5, arr.ind=T)
neighbors= neighbors[neighbors[,1]!=neighbors[,2]]

因此,您可以识别出小于5个单位的欧几里德距离的点(在消除自我关系之后):

points(M[neighbors,], col="red" )

enter image description here

答案 1 :(得分:7)

两种方法。

第一个使用earth.dist(...)包中的fossil创建距离矩阵,然后利用data.tables来组合结果表。

第二个使用distHaversine(...)包中的geosphere来计算距离并在一个步骤中组装最终的共置表。后一种方法可能会或可能不会更快,但肯定会更有效,因为它从不存储全距离矩阵。此外,此方法可以在geosphere中使用其他距离度量,例如distVincentySphere(...)distVincentyEllipsoid(...)distMeeus(...)

请注意,实际距离略有不同,可能是因为earth.dist(...)distHaversine(...)对地球半径的估算略有不同。另请注意,此处的两种方法都依赖于ID的站号。如果电台有名称,则必须稍微修改代码。

第一种方法:使用earth.dist(...)

df = read.table(header=T,text="long lat
                1 -74.20139 39.82806
                2 -74.20194 39.82806 
                3 -74.20167 39.82806 
                4 -74.20197 39.82824 
                5 -74.20150 39.82814 
                6 -74.26472 39.66639 
                7 -74.17389 39.87111 
                8 -74.07224 39.97353 
                9 -74.07978 39.94554")              # your sample data
library(fossil)                                     # for earth.dist(...)
library(data.table)
sep.ft   <- 200                                     # critical separation (feet)
sep.km   <- sep.ft*0.0003048                        # critical separation (km)
m        <- as.matrix(earth.dist(df))               # distance matrix in km
coloc    <- data.table(which(m<sep.km, arr.ind=T))  # pairs of stations with dist<200 ft
setnames(coloc,c("row","col"),c("ST.1","ST.2"))     # rename columns to reflect station IDs
coloc    <- coloc[ST.1<ST.2,]                       # want only lower triagular part
coloc[,dist:=m[ST.1,ST.2]/0.0003048,by="ST.1,ST.2"] # append distances in feet
remove(m)                                           # don't need distance matrix anymore...
stations <- data.table(id=as.integer(rownames(df)),df)
setkey(stations,id)
setkey(coloc,ST.1)
coloc[stations,c("long.1","lat.1"):=list(long,lat),nomatch=0]
setkey(coloc,ST.2)
coloc[stations,c("long.2","lat.2"):=list(long,lat),nomatch=0]

产生这个:

coloc
#     ST.1 ST.2      dist    long.1    lat.1    long.2    lat.2
#  1:    1    2 154.13436 -74.20139 39.82806 -74.20194 39.82806
#  2:    1    3  78.46840 -74.20139 39.82806 -74.20167 39.82806
#  3:    2    3  75.66596 -74.20194 39.82806 -74.20167 39.82806
#  4:    1    4 175.31180 -74.20139 39.82806 -74.20197 39.82824
#  5:    2    4  66.22069 -74.20194 39.82806 -74.20197 39.82824
#  6:    3    4 106.69018 -74.20167 39.82806 -74.20197 39.82824
#  7:    1    5  42.45634 -74.20139 39.82806 -74.20150 39.82814
#  8:    2    5 126.71608 -74.20194 39.82806 -74.20150 39.82814
#  9:    3    5  55.87449 -74.20167 39.82806 -74.20150 39.82814
# 10:    4    5 136.67612 -74.20197 39.82824 -74.20150 39.82814

第二种方法:使用distHaversine(...)

library(data.table)
library(geosphere)
sep.ft   <- 200                       # critical separation (feet)
stations <- data.table(id=as.integer(rownames(df)),df)

d <- function(x){                     # distance between station[i] and all subsequent stations
  r.ft <- 6378137*3.28084             # radius of the earth, in feet
  if (x[1]==nrow(stations)) return()  # don't process last row
  ref <- stations[(x[1]+1):nrow(stations),]
  z <- distHaversine(ref[,2:3,with=F],x[2:3], r=r.ft)
  z <- data.table(ST.1=x[1], ST.2=ref$id, dist=z, long.1=x[2], lat.1=x[3], long.2=ref$long, lat.2=ref$lat)
  return(z[z$dist<sep.ft,])
}
coloc.2 = do.call(rbind,apply(stations,1,d))

产生这个:

coloc.2
#     ST.1 ST.2      dist    long.1    lat.1    long.2    lat.2
#  1:    1    2 154.26350 -74.20139 39.82806 -74.20194 39.82806
#  2:    1    3  78.53414 -74.20139 39.82806 -74.20167 39.82806
#  3:    1    4 175.45868 -74.20139 39.82806 -74.20197 39.82824
#  4:    1    5  42.49191 -74.20139 39.82806 -74.20150 39.82814
#  5:    2    3  75.72935 -74.20194 39.82806 -74.20167 39.82806
#  6:    2    4  66.27617 -74.20194 39.82806 -74.20197 39.82824
#  7:    2    5 126.82225 -74.20194 39.82806 -74.20150 39.82814
#  8:    3    4 106.77957 -74.20167 39.82806 -74.20197 39.82824
#  9:    3    5  55.92131 -74.20167 39.82806 -74.20150 39.82814
# 10:    4    5 136.79063 -74.20197 39.82824 -74.20150 39.82814

答案 2 :(得分:1)

我遇到了这个解决方案,该解决方案主要是使用k最近邻算法来查找距离内的所有点。它比使用dist函数要有效得多,该函数在大型网格上计算可能会很长,但是它只为您提供一个点的邻居,如果要在每个点上都这样做会很昂贵。主要优点是,在您实际上仅搜索一部分网格时,相邻优先项避免了计算整个网格的距离。

请注意,我没有考虑从经度/纬度到坐标X / Y的转换。

Result of finding neighbours within a circle

唯一的缺点是您必须确保选择足够多的邻居来寻找。这意味着在搜索邻居之前要对您的网格进行一点分析(即,平均而言,您希望在您的兴趣点周围一定半径内有多少邻居?)

if (!requireNamespace('FNN', quietly = TRUE)) install.packages('FNN')
knn_circle <- function(coordinates, vars = c('x', 'y'),
                   target = numeric(2), r = numeric(0), k = 10){

  # Find the row index of the target point
  target_row_number <- which(coordinates[[vars[1]]] == target[1] & 
    coordinates[[vars[2]]] == target[2])

  # Get k-nearest neighbours matrixes for all points in `coordinates`
  neighbours <- FNN::get.knn(data = coordinates[ , vars], k = k) 

  # Find col indexes of neighbours of target point that have a distance smaller
  # than `r`in nn.dist object
  neighbours_col_indexes <- which(neighbours$nn.dist[target_row_number, ] <= r)

  # Get the row indexes in `coordinates` of the neighbours from nn.index object
  neighbours_row_indexes <- neighbours$nn.index[target_row_number, 
    neighbours_col_indexes]

  # Uncomment to get also the target_point itself
  # neighbours_row_indexes <- c(target_row_number, neighbours_row_indexes)

  # Return the input data with only rows from the neighbours
  coordinates[neighbours_row_indexes, ]
}

这将返回输入网格,其中包括所有列以及仅位于输入目标点周围一定距离内的点。这是一个例子

test_grid <- expand.grid(
  x = runif(n = 100, max = 10),
  y = runif(n = 50, max = 10)
)
test_grid$z <- paste('station', row.names(test_grid))

# Input target point as vector
target_point <- unlist(test_grid[5, c('x','y')])

within_stations <- knn_circle(
  coordinates = test_grid,
  target = target_point,
  r = 2, k = 1000
)

最后,此代码使您可以使用that answer中的圆函数来可视化正在发生的事情。

circleFun <- function(center = c(0,0), r = 1, npoints = 100){
  tt <- seq(0,2*pi,length.out = npoints)
  xx <- center[1] + r * cos(tt)
  yy <- center[2] + r * sin(tt)
  return(data.frame(x = xx, y = yy))
}

if (!requireNamespace('ggplot2', quietly = True)) install.packages('ggplot2')
ggplot(mapping = aes(x = x, y = y)) +
  # Draw whole grid
  geom_point(data = test_grid, color = '#666666',
             size = 0.5, alpha = 0.5) +
  # Draw circle
  geom_point(data = circleFun(center = target_point, r = 2, npoints = 1000),
             color = '#333333', size = 0.5) +
  # Draw within circle grid
  geom_point(data = within_stations, color = 'darkred', size = 0.5) +
  # Alleviate theme
  theme(plot.background = element_blank(),
    panel.background = element_blank(),
    axis.ticks = element_blank()) +
  labs(x = '', y = '')

小心选择足够多的邻居。这是相同的结果图片,但带有k = 50,对于50x100均匀的网格和相当大的半径来说,这太少了。

The algorithm misses some points because of too few neighbours considered