Question

I have the following data frame df, and I want to add a column that gives, for each row, the distance to the nearest non-NA value.

df <- data.frame(x = 1:20)
df[c(1, 3, 4, 5, 11, 14, 15, 16), "x"] <-  NA

In other words, I am looking for the following values:

df$distance <- c(1, 0, 1, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 1, 0, 0, 0, 0)

How can I do this automatically?

Answer 1

Let x be the vector containing NA values. With

a <- which(!is.na(x))
b <- which(is.na(x))

your problem is to find min(abs(a - b[i])) for each b[i].

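This kind of task is hard to do both simply and efficiently in plain R code. Writing the loop in compiled code is usually the better option, unless a function in some package already does the job for us.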

Here are some naive but straightforward solutions.

If x is not too long, we can use outer:

distance <- numeric(length(x))
distance[is.na(x)] <- apply(abs(outer(a, b, "-")), 2L, min)
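To make the outer step concrete, here is a small sketch on a made-up five-element vector (the vector and the name gaps are illustrative, not from the question). Each column of the matrix holds the gaps from one NA position to every non-NA position, so the column minimum is the distance we want.

```r
# Illustrative input: NA at positions 1, 3 and 4
x <- c(NA, 2, NA, NA, 5)
a <- which(!is.na(x))   # non-NA positions: 2 5
b <- which(is.na(x))    # NA positions:     1 3 4

# Rows follow a, columns follow b
gaps <- abs(outer(a, b, "-"))

distance <- numeric(length(x))
distance[b] <- apply(gaps, 2L, min)   # smallest gap per NA position
distance
```

The matrix has length(a) * length(b) entries, which is why memory grows quickly for long vectors.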

If x is too long and the memory usage of outer becomes a problem, we can do this instead:

distance <- numeric(length(x))
distance[is.na(x)] <- sapply(b, function (bi) min(abs(bi - a)))

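Note that neither method is really efficient, given the algorithm: both compare every NA position against every non-NA position.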
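For longer vectors, one possible alternative is a two-pass scan that stays in vectorized base R. This is a sketch of a different technique, not the method above: it carries forward the index of the last non-NA value with cummax, carries backward the index of the next one with cummin, and takes the smaller gap. The function name nearest_non_na_distance is made up for illustration.

```r
# Hypothetical helper: distance from each position to the nearest non-NA position
nearest_non_na_distance <- function(x) {
  idx <- seq_along(x)
  ok <- !is.na(x)
  # Index of the closest non-NA at or before each position (-Inf if none)
  prev <- cummax(ifelse(ok, idx, -Inf))
  # Index of the closest non-NA at or after each position (Inf if none)
  nxt <- rev(cummin(rev(ifelse(ok, idx, Inf))))
  pmin(idx - prev, nxt - idx)
}

# The data from the question
x <- 1:20
x[c(1, 3, 4, 5, 11, 14, 15, 16)] <- NA
nearest_non_na_distance(x)
```

If x contains no non-NA value at all, this sketch returns Inf for every position.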

Answer 2

Here is another approach, using rle and rank:

library(dplyr)
library(magrittr)

df <- data.frame(x=seq(1, 20))
df[c("1", "3", "4", "5", "11", "14", "15", "16"), 1] <-  NA

rle.len <- df$x %>% is.na %>% rle %$% lengths

df %>% 
  mutate(na.seq=rle.len %>% seq_along %>% rep(rle.len)) %>% 
  group_by(na.seq) %>%
  mutate(distance=ifelse(is.na(x), pmin(rank(na.seq, ties.method = "first"),
                                        rank(na.seq, ties.method = "last")), 0))

    x na.seq distance
1  NA      1        1
2   2      2        0
3  NA      3        1
4  NA      3        2
5  NA      3        1
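One caveat seems worth checking before relying on this approach. The rank-based expression measures how far a row is from the nearest edge of its own NA run, which equals the distance to the nearest non-NA value only when the run has a non-NA neighbour on both sides. The question's data is unaffected, since its only boundary run has length 1. The sketch below uses base R (ave as a stand-in for the dplyr grouping) on a made-up vector that starts with three NAs, and compares the result with a brute-force distance.

```r
# Illustrative input: an NA run of length 3 touching the start of the vector
x <- c(NA, NA, NA, 4, 5)
idx <- seq_along(x)

# Run ids, built the same way as na.seq in the pipeline above
runs <- rle(is.na(x))
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Position within each run, counted from the left and from the right
from_left <- ave(idx, run_id, FUN = seq_along)
from_right <- ave(idx, run_id, FUN = function(i) rev(seq_along(i)))
edge_distance <- ifelse(is.na(x), pmin(from_left, from_right), 0)

# Brute-force distance to the nearest non-NA position, for comparison
non_na <- which(!is.na(x))
true_distance <- sapply(idx, function(i) min(abs(i - non_na)))

rbind(edge_distance, true_distance)
```

Here the two rows differ at position 1, so runs of two or more NAs at the very start or end of the column may need separate handling.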

Answer 3

You can use findInterval. First, find the indices of the NA and non-NA values, and initialize the distance column:

na <- which(is.na(df$x))
non_na <- which(!is.na(df$x))
df$distance2 <- 0

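Then use findInterval, with the midpoints between consecutive non-NA indices as breaks, to find which interval each NA index falls into. Use those intervals to pick out the corresponding non-NA index, compute the absolute difference from the NA index, and assign the result at the NA indices: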

df$distance2[na] <- abs(na - non_na[findInterval(na, (non_na[-length(non_na)] + non_na[-1]) / 2) + 1])

df
#     x distance distance2
# 1  NA        1         1
# 2   2        0         0
# 3  NA        1         1
# 4  NA        2         2
# 5  NA        1         1
# 6   6        0         0
# 7   7        0         0
# 8   8        0         0
# 9   9        0         0
# 10 10        0         0
# 11 NA        1         1
# 12 12        0         0
# 13 13        0         0
# 14 NA        1         1
# 15 NA        2         2
# 16 NA        1         1
# 17 17        0         0
# 18 18        0         0
# 19 19        0         0
# 20 20        0         0
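The one-liner is easier to follow when split into its intermediate steps. The sketch below does that on a made-up six-element vector; the names breaks, slot and nearest are illustrative only.

```r
# Illustrative input: non-NA values at positions 2 and 6
x <- c(NA, 2, NA, NA, NA, 6)
na <- which(is.na(x))        # 1 3 4 5
non_na <- which(!is.na(x))   # 2 6

# Midpoints between consecutive non-NA indices act as the breaks
breaks <- (non_na[-length(non_na)] + non_na[-1]) / 2

# findInterval counts the breaks at or below each NA index;
# adding 1 turns that count into a position within non_na
slot <- findInterval(na, breaks) + 1
nearest <- non_na[slot]      # nearest non-NA index for each NA index

distance <- numeric(length(x))
distance[na] <- abs(na - nearest)
distance
```

An NA index that sits exactly on a midpoint (position 4 here) is assigned to the non-NA index on its right, which gives the same distance as the one on its left.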

Answer 4

One way to do this is to use the distance() function from the raster package, after converting your matrix to a RasterLayer object with raster().

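The package is designed for maps, so an object created with raster() carries units, a resolution and so on. As a result, the distances returned by distance() can be very large: for me, a cell next to a non-NA cell had a distance of 15796.35. Just divide by that amount (and apply round(), because of rounding error) to get the answer.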

As an example, suppose I have an array object a1 that contains NAs:

> library(raster)
> a1 = array(
    c(
       c(1, 5, 6, NA, 1, 2, 5),
       c(3, 4, NA, NA, NA, 8, 1),
       c(5, 1, 7, NA, 2, 3, 7),
       c(8, 1, 1, 2, 3, 6, 2)
     ),
    c(7, 4)
  )
> r1 = raster(a1)
> d1 = distance(r1)
> as.matrix(d1)    

         [,1]     [,2]     [,3] [,4]
[1,]     0.00     0.00     0.00    0
[2,]     0.00     0.00     0.00    0
[3,]     0.00 15796.35     0.00    0
[4,] 15796.33 31592.66 15796.33    0
[5,]     0.00 15796.33     0.00    0
[6,]     0.00     0.00     0.00    0
[7,]     0.00     0.00     0.00    0

> round(
     as.matrix(d1) / 15796.35,
     0
  )

     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    0    0    0    0
[3,]    0    1    0    0
[4,]    1    2    1    0
[5,]    0    1    0    0
[6,]    0    0    0    0
[7,]    0    0    0    0

That is your answer. However, I don't know how efficient the code behind distance() is, so I can't say whether it will take a long time.

Edit: I tested this on an array object with 29,000 NAs, and it takes a long time. I suggest using it only for objects with a small number of NAs.

Distance to the nearest non-NA value in a data frame
