Question

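I have a data frame with N rows. For each row in a subset of rows, I want to find the closest row in the data set among those that belong to the same group.

For example: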

> df
# A tibble: 8,014 x 4
     A      B       C      Group
    <dbl>  <dbl>   <dbl>    <int>
 1  -0.396 -0.621 -0.759      1
 2  -0.451 -0.625 -0.924      1
 3  -0.589 -0.624 -1.26       1
 4  -0.506 -0.625 -1.09       1
 5  NA      1.59  -0.593      1
 6  -0.286  4.22  -0.0952     1
 7  NA      2.91  -0.0952     1
 8  NA      4.22  -0.924      1
 9  -0.175  1.52  -0.0952     1
10  NA      1.74   1.56       1
# ... with 8,004 more rows

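So, for example, I want to find the rows closest to rows 2 and 3, searching only among rows with Group == 1. I also need to do this efficiently, so a for loop is not really an option.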

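I would like to use the dist function, since it has the advantage of handling NAs correctly, but I do not need the whole distance matrix; computing it would be wasteful.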
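To make the NA handling concrete, here is a toy sketch (the vectors x and y are made up for illustration, not taken from df). As documented for Euclidean distance, dist skips coordinates where either value is NA and scales the squared sum up in proportion to the number of coordinates actually used:

```r
x <- c(1, 2, NA)
y <- c(4, 6, 3)

# Only the first two coordinates are usable, so dist() rescales the
# squared sum by 3/2 to compensate for the skipped third coordinate.
d <- dist(rbind(x, y))

# The same value computed by hand
manual <- sqrt(((1 - 4)^2 + (2 - 6)^2) * 3 / 2)

all.equal(as.numeric(d), manual)
```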

I tried this, but it fails, and it is wasteful as well:

res = Map(function(x,y) dist(as.matrix(rbind(x, y))), df[2:3, ] 
%>% group_by(Group), df %>% group_by(Group))

Answer 1

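Here is one approach, although it does build the full distance matrix for each group. Given what you are trying to do, I am not sure why that would be wasteful: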

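library(tidyverse)

# For each row of x, return the index of its nearest other row.
min_dist <- function(x) {
  d <- as.matrix(dist(x))  # dist() skips NA coordinates and rescales
  diag(d) <- NA            # ignore each row's zero distance to itself
  # which.min() drops NAs and takes the first index on ties
  apply(d, 1, function(r) if (all(is.na(r))) NA_integer_ else which.min(r)) %>%
    unname()
}

df %>%
  group_by(Group) %>%
  mutate(group_row = row_number()) %>%  # row index within each group
  nest() %>%
  # leave the helper index out of the distance calculation
  mutate(nearest_row = map(data, ~ min_dist(select(.x, -group_row)))) %>%
  unnest(cols = c(data, nearest_row)) %>%
  ungroup()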
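If only a few rows need neighbours, the full per-group matrix can be avoided. The following is a rough sketch, not a tested solution for the real df: na_euclid and nearest_in_group are hypothetical helper names, the column names A, B, C and Group are taken from the question, and the NA rule is written to mimic what dist does for Euclidean distance (skip NA coordinates, then rescale). It still loops over the queried rows, but each step is vectorised over the group, so the cost grows with (number of queried rows) x (group size) rather than with the square of the group size.

```r
# Hypothetical helper: distances from one query row q to every row of a
# numeric matrix m, skipping NA coordinates and rescaling like dist().
na_euclid <- function(m, q) {
  sq <- sweep(m, 2, q)^2                    # squared differences, NA where missing
  used <- rowSums(!is.na(sq))               # coordinates available per pair
  d <- sqrt(rowSums(sq, na.rm = TRUE) * ncol(m) / used)
  d[used == 0] <- NA                        # no shared coordinates: undefined
  d
}

# Hypothetical helper: for each queried row, the nearest other row in its group.
nearest_in_group <- function(df, query_rows, cols = c("A", "B", "C")) {
  vapply(query_rows, function(i) {
    same <- which(df$Group == df$Group[i])  # candidate rows in the same group
    d <- na_euclid(as.matrix(df[same, cols]), unlist(df[i, cols]))
    d[same == i] <- NA                      # a row is not its own neighbour
    if (all(is.na(d))) NA_integer_ else same[which.min(d)]
  }, integer(1))
}

# Toy data, made up for illustration
toy <- data.frame(
  A = c(0, 1, NA, 5, 0, 10),
  B = c(0, 1, 1, 5, 0, 10),
  C = c(0, 1, 1, 5, 0, 10),
  Group = c(1, 1, 1, 1, 2, 2)
)

nearest_in_group(toy, c(2, 3))  # row numbers in toy: 3 2
```

The returned values are row numbers in the original data frame rather than positions within the group, which makes it easy to index back into df.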

In R (base / dplyr), find the rows closest to each row of a data set
