Question

As a silly toy example, suppose

x=4.5
w=c(1,2,4,6,7)

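I was wondering whether there is a simple R function that finds the index of the closest match to x in w. So if foo is that function, foo(w,x) would return 3. The function match is the right idea, but it seems to apply only to exact matches.

The existing solutions (e.g. which.min(abs(w - x)), which(abs(w-x)==min(abs(w-x))), etc.) are all O(n) rather than O(log n) (I am assuming that w is already sorted).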

Answer 1

You can use data.table to do a binary search:

library(data.table)
dt = data.table(w, val = w) # you'll see why val is needed in a sec
setattr(dt, "sorted", "w")  # let data.table know that w is sorted

Note that if the column w isn't already sorted, you have to use setkey(dt, w) instead of setattr(.).

# binary search and "roll" to the nearest neighbour
dt[J(x), roll = "nearest"]
#     w val
#1: 4.5   4

In the final expression, the val column contains the value you are looking for.

# or to get the index as Josh points out
# (and then you don't need the val column):
dt[J(x), .I, roll = "nearest", by = .EACHI]
#     w .I
#1: 4.5  3

# or to get the index alone
dt[J(x), roll = "nearest", which = TRUE]
#[1] 3

Answer 2

R>findInterval(4.5, c(1,2,4,5,6))
[1] 3

will do that with price-is-right matching (closest without going over).
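Because findInterval() never goes over, its result can differ from the nearest element when x lies closer to its right-hand neighbour. A minimal sketch using the w from the question; the query value 5.9 is an assumption picked only for illustration:

```r
w <- c(1, 2, 4, 6, 7)   # sorted vector from the question
x <- 5.9                # illustrative query value, not from the original post

lo <- findInterval(x, w)          # index of the largest element <= x
nearest <- which.min(abs(w - x))  # O(n) reference answer for comparison

lo       # 3, i.e. w[3] = 4 (closest without going over)
nearest  # 4, i.e. w[4] = 6 (actually closest)
```

So findInterval() finds the left bracket by binary search, but w[lo] still has to be compared with w[lo + 1] to pick the nearer of the two.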

Answer 3

See match.closest() from the MALDIquant package:

> library(MALDIquant)
> match.closest(x, w)
[1] 3

Answer 4

To do this on character vectors, Martin Morgan suggested this function on R-help:

bsearch7 <-
     function(val, tab, L=1L, H=length(tab))
{
     b <- cbind(L=rep(L, length(val)), H=rep(H, length(val)))
     i0 <- seq_along(val)
     repeat {
         updt <- M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L
         tabM <- tab[M]
         val0 <- val[i0]
         i <- tabM < val0
         updt[i] <- M[i] + 1L
         i <- tabM > val0
         updt[i] <- M[i] - 1L
         b[i0 + i * length(val)] <- updt
         i0 <- which(b[i0, "H"] >= b[i0, "L"])
         if (!length(i0)) break;
     }
     b[,"L"] - 1L
}

Answer 5

x = 4.5
w = c(1,2,4,6,7)

closestLoc = which.min(abs(w-x))
closestVal = w[which.min(abs(w-x))]

# On my phone- please pardon typos

If your vector is long, try a two-step approach:

x = 4.5
w = c(1,2,4,6,7)

sdev = sapply(w,function(v,x) abs(v-x), x = x)
closestLoc = which.min(sdev)

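For maddeningly long vectors (millions of rows!), you can parallelize. Warning: this will actually be slower unless the data is very, very, very large.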

require(doMC)
registerDoMC()

# .combine = c collects the per-element results into a numeric vector
closestLoc = which.min(foreach(i = w, .combine = c) %dopar% {
   abs(i-x)
})

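This example only gives you the basic idea of using parallel processing when you have huge data. Note that I do not recommend it for simple, fast functions like abs().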

Answer 6

NearestValueSearch = function(x, w){
  ## A simple binary search algo
  ## Assume the w vector is sorted so we can use binary search
  left = 1
  right = length(w)
  while(right - left > 1){
    middle = floor((left + right) / 2)
    if(x < w[middle]){
      right = middle
    }
    else{
      left = middle
    }
  }
  if(abs(x - w[right]) < abs(x - w[left])){
    return(right)
  }
  else{
    return(left)
  }
}


x = 4.5
w = c(1,2,4,6,7)
NearestValueSearch(x, w) # return 3

Answer 7

Based on @neal-fultz's answer, here is a simple function that uses findInterval():

get_closest_index <- function(x, vec){
  # vec must be sorted
  iv <- findInterval(x, vec)
  dist_left <- x - vec[ifelse(iv == 0, NA, iv)]
  dist_right <- vec[iv + 1] - x
  ifelse(! is.na(dist_left) & (is.na(dist_right) | dist_left < dist_right), iv, iv + 1)
}
values <- c(-15, -0.01, 3.1, 6, 10, 100)
grid <- c(-2, -0.1, 0.1, 3, 7)
get_closest_index(values, grid)
#> [1] 1 2 4 5 5 5

Created on 2020-05-29 by the reprex package (v0.3.0)

Answer 8

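You can always implement a custom binary search algorithm to find the closest value. Alternatively, you can leverage the standard implementation of libc bsearch(). You can use other binary search implementations as well, but that does not change the fact that you have to implement the comparison function carefully to find the closest element in the array. The issue with the standard binary search implementation is that it is meant for exact comparison. That means your improvised comparison function needs to do some kind of exactification to figure out whether an element in the array is close enough. To achieve that, the comparison function needs to be aware of other elements in the array, in particular the following aspects: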

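The position of the current element (the one being compared with the key).
Its distance from the key and how that compares with its neighbour (the previous or next element).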

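To provide this extra knowledge to the comparison function, the key needs to be packaged with additional information (not just the key value). Once the comparison function is aware of these aspects, it can figure out whether the element itself is the closest. When it knows that it is the closest, it returns "match".

The following C code finds the closest value.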

#include <stdio.h>
#include <stdlib.h>

struct key {
        int key_val;
        int *array_head;
        int array_size;
};

int compar(const void *k, const void *e) {
        struct key *key = (struct key*)k;
        int *elem = (int*)e;
        int *arr_first = key->array_head;
        int *arr_last = key->array_head + key->array_size -1;
        int kv = key->key_val;
        int dist_left;
        int dist_right;

        if (kv == *elem) {
                /* easy case: if both same, got to be closest */
                return 0;
        } else if (key->array_size == 1) {
                /* easy case: only element got to be closest */
                return 0;
        } else if (elem == arr_first) {
                /* element is the first in array */
                if (kv < *elem) {
                        /* if keyval is less than the first element then
                         * first elem is closest.
                         */
                        return 0;
                } else {
                        /* check distance between first and 2nd elem.
                         * if distance with first elem is smaller, it is closest.
                         */
                        dist_left = kv - *elem;
                        dist_right = *(elem+1) - kv;
                        return (dist_left <= dist_right) ? 0:1;
                }
        } else if (elem == arr_last) {
                /* element is the last in array */
                if (kv > *elem) {
                        /* if keyval is larger than the last element then
                         * last elem is closest.
                         */
                        return 0;
                } else {
                        /* check distance between last and last-but-one.
                         * if distance with last elem is smaller, it is closest.
                         */
                        dist_left = kv - *(elem-1);
                        dist_right = *elem - kv;
                        return (dist_right <= dist_left) ? 0:-1;
                }
        }

        /* condition for remaining cases (other cases are handled already):
         * - elem is neither first nor last in the array
         * - array has at least three elements.
         */

        if (kv < *elem) {
                /* keyval is smaller than elem */

                if (kv <= *(elem -1)) {
                        /* keyval is smaller than previous (of "elem") too.
                         * hence, elem cannot be closest.
                         */
                        return -1;
                } else {
                        /* check distance between elem and elem-prev.
                         * if distance with elem is smaller, it is closest.
                         */
                        dist_left = kv - *(elem -1);
                        dist_right = *elem - kv;
                        return (dist_right <= dist_left) ? 0:-1;
                }
        }

        /* remaining case: (keyval > *elem) */

        if (kv >= *(elem+1)) {
                /* keyval is larger than next (of "elem") too.
                 * hence, elem cannot be closest.
                 */
                return 1;
        }

        /* check distance between elem and elem-next.
         * if distance with elem is smaller, it is closest.
         */
        dist_right = *(elem+1) - kv;
        dist_left = kv - *elem;
        return (dist_left <= dist_right) ? 0:1;
}


int main(int argc, char **argv) {
        int arr[] = {10, 20, 30, 40, 50, 60, 70};
        int *found;
        struct key k;

        if (argc < 2) {
                return 1;
        }

        k.key_val = atoi(argv[1]);
        k.array_head = arr;
        k.array_size = sizeof(arr)/sizeof(int);

        found = (int*)bsearch(&k, arr, sizeof(arr)/sizeof(int), sizeof(int),
                compar);

        if(found) {
                printf("found closest: %d\n", *found);
        } else {
                printf("closest not found. absurd! \n");
        }

        return 0;
}

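Needless to say, bsearch() in the above example should never fail (unless the array size is zero).

If you implement your own custom binary search, you essentially have to embed the same comparison logic in the main body of the binary search code (instead of having this logic in the comparison function, as in the above example).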
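As a rough illustration of embedding the comparison in the search loop itself, here is a minimal R sketch. The function name nearest_index and the tie rule (the lower index wins) are assumptions made for this example, not part of the answer above; w is assumed sorted in non-decreasing order and x is a single number.

```r
# Hypothetical helper: index of the element of sorted `w` nearest to scalar `x`.
nearest_index <- function(x, w) {
  n <- length(w)
  stopifnot(n >= 1L)
  lo <- 1L
  hi <- n
  # Invariant: the nearest element always lies within w[lo..hi].
  while (lo < hi) {
    mid <- (lo + hi) %/% 2L   # lo <= mid < hi, so mid + 1 is a valid index
    # Closest-value logic embedded in the loop: instead of testing equality,
    # ask whether x is at or below the midpoint of w[mid] and w[mid + 1].
    if (x - w[mid] <= w[mid + 1L] - x) {
      hi <- mid               # w[mid] is at least as close as w[mid + 1]
    } else {
      lo <- mid + 1L          # w[mid + 1] is strictly closer
    }
  }
  lo
}

w <- c(1, 2, 4, 6, 7)
nearest_index(4.5, w)  # 3
nearest_index(5.9, w)  # 4
```

Each iteration halves the candidate range, so the search takes O(log n) comparisons, and values outside the range of w fall back to the first or last index.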

Find the closest value in a vector with binary search

8 answers: