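R: compute the minimum Euclidean distance between two datasets and label it automatically

Time: 2018-05-23 11:09:54

Tags: r euclidean-distance

I am working with Euclidean distance on a pair of datasets. First, my data.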

centers <- data.frame(x_ce = c(300,180,450,500),
                      y_ce = c(23,15,10,20),
                      center = c('a','b','c','d'))

points <- data.frame(point = c('p1','p2','p3','p4'),
                     x_p = c(160,600,400,245),
                     y_p = c(7,23,56,12))

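My goal is to find, for each point in points, the minimum distance to all the centers in centers, and to append the name of that center (the one at the smallest distance) to the points dataset, with the whole process automated.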

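So I started from the basics:

# Euclidean distance
sqrt(sum((x-y)^2))

I have in mind how it should work, but I cannot work out how to automate it.

  1. Choose one row of points, and all the rows of centers
  2. Compute the Euclidean distance between that row and each row of centers
  3. Choose the smallest distance
  4. Attach the label of the smallest distance
  5. Repeat for the second row, and so on until the end of points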

So I managed to do it manually, to lay out all the steps that need to be automated:

# 1.
x = (points[1,2:3])    # select the first of points
y1 = (centers[1,1:2])  # select the first center
y2 = (centers[2,1:2])  # select the second center
y3 = (centers[3,1:2])  # select the third center
y4 = (centers[4,1:2])  # select the fourth center
# 2.
# then the distances
distances <- data.frame(distance = c(
                            sqrt(sum((x-y1)^2)),
                            sqrt(sum((x-y2)^2)),
                            sqrt(sum((x-y3)^2)),
                            sqrt(sum((x-y4)^2))),
                        center = centers$center
                        )
# 3.
# then I choose the row with the smallest distance
d <- distances[which(distances$distance==min(distances$distance)),]
# 4.
# last, I put the label near the point
cbind(points[1,],d)
# 5.
# then I restart for the second point

The problem is that I cannot manage to do this automatically. Do you have any idea how to automate this procedure for each point in points? Also, am I reinventing the wheel, i.e. is there a faster procedure (as a function) that I do not know about?

2 answers:

Answer 0 (score: 2)

centers <- data.frame(x_ce = c(300,180,450,500),
                      y_ce = c(23,15,10,20),
                      center = c('a','b','c','d'))

points <- data.frame(point = c('p1','p2','p3','p4'),
                     x_p = c(160,600,400,245),
                     y_p = c(7,23,56,12))

library(tidyverse)

points %>%
  mutate(c = list(centers)) %>%
  unnest(c) %>%                      # create all possible combinations of points and centers as a dataframe (recent tidyr versions expect the column to be named)
  rowwise() %>%                      # for each combination
  mutate(d = sqrt(sum((c(x_p,y_p)-c(x_ce,y_ce))^2))) %>%   # calculate distance
  ungroup() %>%                                            # forget the grouping
  group_by(point, x_p, y_p) %>%                            # for each point
  summarise(closest_center = center[d == min(d)]) %>%      # keep the closest center
  ungroup()                                                # forget the grouping

# # A tibble: 4 x 4
#   point   x_p   y_p closest_center
#   <fct> <dbl> <dbl> <fct>         
# 1 p1      160     7 b             
# 2 p2      600    23 d             
# 3 p3      400    56 c             
# 4 p4      245    12 a
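
As a possible variant of the pipeline above, the all-combinations step can also be written with `cross_join()` and the per-point minimum with `slice_min()`. This is only a sketch, assuming dplyr 1.1.0 or later (where `cross_join()` is available); the name `nearest` is illustrative and not part of the original answer.

```r
library(dplyr)

centers <- data.frame(x_ce = c(300, 180, 450, 500),
                      y_ce = c(23, 15, 10, 20),
                      center = c("a", "b", "c", "d"))

points <- data.frame(point = c("p1", "p2", "p3", "p4"),
                     x_p = c(160, 600, 400, 245),
                     y_p = c(7, 23, 56, 12))

nearest <- points %>%
  cross_join(centers) %>%                             # every point paired with every center
  mutate(d = sqrt((x_p - x_ce)^2 + (y_p - y_ce)^2)) %>%  # vectorised distance, no rowwise() needed
  group_by(point, x_p, y_p) %>%                       # for each point
  slice_min(d, n = 1, with_ties = FALSE) %>%          # keep only the closest center
  ungroup() %>%
  select(point, x_p, y_p, closest_center = center)
```

Because the distance is computed column-wise rather than row by row, this form may scale better on larger inputs, although it still materialises every point/center pair.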

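Answer 1 (score: 1)

Using the dplyr package, you can use group_by to loop through each point, mutate to form a list of distances, set distance to the minimum of the list, and set center to the name of the center at the minimum distance. I have included two alternatives, for the cases of duplicate rows or duplicate point names.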

library(dplyr)
centers <- data.frame(x_ce = c(300,180,450,500),
                      y_ce = c(23,15,10,20),
                      center = c('a','b','c','d'))
points <- data.frame(point = c('p1','p2','p3','p4', "p4"),
                     x_p = c(160,600,400,245, 245),
                     y_p = c(7,23,56,12, 12))
#
#  If duplicate rows need to be removed
#
result1 <- points %>%
  group_by(point) %>%
  distinct() %>%
  mutate(lst = with(centers, list(sqrt( (x_p-x_ce)^2 + (y_p-y_ce)^2 ) ) ),
         distance = min(unlist(lst)),
         center = centers$center[which.min(unlist(lst))]) %>%
  select(-lst)

which gives the result

# A tibble: 4 x 5
# Groups:   point [4]
  point   x_p   y_p distance center
  <fct> <dbl> <dbl>    <dbl> <fct> 
1 p1      160     7     21.5 b     
2 p2      600    23    100.  d     
3 p3      400    56     67.9 c     
4 p4      245    12     56.1 a 

#
# Alternative if point names are not unique
#
points <- data.frame(point = c('p1','p2','p3','p4', "p4"),
                     x_p = c(160,600,400,245, 550),
                     y_p = c(7,23,56,12, 25))
result2 <- points %>%
  rowwise() %>%
  mutate(lst = with(centers, list(sqrt( (x_p-x_ce)^2 + (y_p-y_ce)^2 ) ) ),
         distance = min(unlist(lst)),
         center = centers$center[which.min(unlist(lst))]) %>%
  ungroup() %>%
  select(-lst)

with the result

# A tibble: 5 x 5
  point   x_p   y_p distance center
  <fct> <dbl> <dbl>    <dbl> <fct> 
1 p1      160     7     21.5 b     
2 p2      600    23    100.  d     
3 p3      400    56     67.9 c     
4 p4      245    12     56.1 a     
5 p4      550    25     50.2 d
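
For comparison, the same labelling can be sketched in base R without any extra packages, by building the full point-by-center distance matrix with `outer()` and taking the row-wise minimum. This is a minimal sketch using the four-point data from the question; the names `d`, `nearest` and `result` are illustrative and do not come from either answer.

```r
centers <- data.frame(x_ce = c(300, 180, 450, 500),
                      y_ce = c(23, 15, 10, 20),
                      center = c("a", "b", "c", "d"))

points <- data.frame(point = c("p1", "p2", "p3", "p4"),
                     x_p = c(160, 600, 400, 245),
                     y_p = c(7, 23, 56, 12))

# Distance matrix: one row per point, one column per center
d <- sqrt(outer(points$x_p, centers$x_ce, "-")^2 +
          outer(points$y_p, centers$y_ce, "-")^2)

# Column index of the nearest center for each point (row-wise minimum)
nearest <- apply(d, 1, which.min)

# Attach the minimum distance and the matching center label to each point
result <- cbind(points,
                distance = d[cbind(seq_len(nrow(d)), nearest)],
                center = centers$center[nearest])
```

Since every row is handled independently, duplicate point names need no special treatment here; the trade-off is that the whole distance matrix is held in memory at once.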