How to speed up fuzzy index matching in R (possibly with Rcpp)?

Asked: 2016-02-26 05:17:59

Tags: r interpolation rcpp fuzzy-search

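I am trying to look up a value from a given lookup table (lk_tbl) using what I will call, for lack of a better word, fuzzy matching. The table looks like this: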

lk_tbl <- structure(list(num = c(1, 1.05, 1.1, 1.15, 1.2, 1.25, 1.3, 1.35, 
1.4), val = c(0.241970724519143, 0.229882140684233, 0.217852177032551, 
0.205936268719975, 0.194186054983213, 0.182649085389022, 0.171368592047807, 
0.16038332734192, 0.149727465635745)), .Names = c("num", "val"
), row.names = c(NA, -9L), class = "data.frame")

> lk_tbl
   num       val
1 1.00 0.2419707
2 1.05 0.2298821
3 1.10 0.2178522
4 1.15 0.2059363
5 1.20 0.1941861
6 1.25 0.1826491
7 1.30 0.1713686
8 1.35 0.1603833
9 1.40 0.1497275

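Basically, the table pairs each number with its associated value. Now, if I want to find the value associated with the number 1.22 (which is not in lk_tbl), I want to do a sort of interpolation: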

library(dplyr)  # provides %>% and filter()

fuzzy_lkup <- function(x) {
  matched_num <- lk_tbl %>% 
    filter(num == x)     # check for an exact match

  if(nrow(matched_num) == 1 ) { # if the exact match exists
    return(matched_num$val)
  } 
  else {
  return(lk_tbl %>% 
    filter( x < num + 0.05, x > num -0.05 ) %>%
    .[["val"]] %>%
    mean())

  }
}


> fuzzy_lkup(1) # it returns the matched value
[1] 0.2419707
> fuzzy_lkup(1.22) # it does the interpolation
[1] 0.1884176

# for the vector, I can use vapply like this.
> vapply(c(1.22, 1.18, 1.24), fuzzy_lkup,numeric(1))
[1] 0.1884176 0.2000612 0.1884176

Ultimately, I want to do this for a long vector against a huge lookup table.

What I have observed is that this process is very slow for a large lookup table, so my questions are:

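  1. How would you speed this up? (Vectorize the function?)
  2. How would you solve this with Rcpp? Is Rcpp even a good fit here? How would you import the lookup table, what data structure would you use, and how would you ultimately solve the problem?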

1 Answer:

Answer 0: (score: 0)

Based on your description, one potential approach is:

ff = function(x, num, val)
{
    i = findInterval(x, num)  #map input to the lookup-table

    #make the appropriate vectors to interpolate
    nums = c(rbind(num[i], x, num[i + 1L]))
    vals = c(rbind(val[i], NA, val[i + 1L]))

    #if 'mean' is needed; i.e. 'f(1.22) == f(1.24)' etc, the following could be used:
    #nums = seq_along(vals) 

    ans = approx(nums, vals, xout = nums)$y[seq(2L, length(nums), 3L)]

    return(cbind(x, ans))
}

An example:

ff(c(1.22, 1.18, 1.24, 1.05, 1.2, 1.22, 1.23, 1.24, 1.4, 1.5), lk_tbl$num, lk_tbl$val)
#         x       ans
# [1,] 1.22 0.1895713
# [2,] 1.18 0.1988861
# [3,] 1.24 0.1849565
# [4,] 1.05 0.2298821
# [5,] 1.20 0.1941861
# [6,] 1.22 0.1895713
# [7,] 1.23 0.1872639
# [8,] 1.24 0.1849565
# [9,] 1.40 0.1497275
#[10,] 1.50        NA
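Since ff() amounts to linear interpolation between the two neighbouring rows, the same numbers can probably be obtained by calling base R's approx() on the whole table at once. The following is only a minimal sketch under that assumption; the name ff_approx is hypothetical and not part of the original answer, and num is assumed to be sorted in increasing order.

```r
# Lookup table from the question
lk_tbl <- data.frame(
  num = c(1, 1.05, 1.1, 1.15, 1.2, 1.25, 1.3, 1.35, 1.4),
  val = c(0.241970724519143, 0.229882140684233, 0.217852177032551,
          0.205936268719975, 0.194186054983213, 0.182649085389022,
          0.171368592047807, 0.16038332734192, 0.149727465635745)
)

# Hypothetical helper (not from the original answer): linear interpolation
# over the whole table in one vectorized call.
# approx() defaults to rule = 1, so x outside range(num) yields NA.
ff_approx <- function(x, num, val) {
  approx(num, val, xout = x)$y
}

res <- ff_approx(c(1.22, 1.18, 1.24, 1.05, 1.4, 1.5), lk_tbl$num, lk_tbl$val)
print(cbind(x = c(1.22, 1.18, 1.24, 1.05, 1.4, 1.5), ans = res))
```

Exact table entries (1.05, 1.4) come back unchanged and values beyond the table (1.5) come back as NA, which is the same behaviour as in the example output above.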

As for the second question: thanks to R's C API, the above can also be ported to C quite conveniently:

ffC = inline::cfunction(sig = c(x = "numeric", num = "numeric", val = "numeric"), body = '
    SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));

    double *px = REAL(x), *pnum = REAL(num), *pval = REAL(val), *pans = REAL(ans);

    int n = LENGTH(num), flag;

    for(int i = 0, ind = 1; i < LENGTH(x); i++) {
        ind = findInterval(pnum, n, px[i], 0, 0, ind, &flag);

        /* ind == 0: x lies below the table, so there is nothing to interpolate */
        pans[i] = ind == 0 ? NA_REAL :
             ind == n ? (px[i] == pnum[n - 1] ? pval[n - 1] : NA_REAL) : 
             pval[ind - 1] + (pval[ind] - pval[ind - 1]) * 
                 ((px[i] - pnum[ind - 1]) / (pnum[ind] - pnum[ind - 1]));
    }

    UNPROTECT(1);
    return(ans);
', language = "C")

Benchmarking the two approaches:

NUM = seq(1, 100, 0.2)
set.seed(007)
VAL = runif(length(NUM))
X = sample(1:110, 1e5, TRUE) + sample(seq(0, 1, 0.01), 1e5, TRUE)

all.equal(ff(X, NUM, VAL)[, 2L], ffC(X, NUM, VAL))
#[1] TRUE
microbenchmark::microbenchmark(ff(X, NUM, VAL)[, 2L], ffC(X, NUM, VAL), times = 30)
#Unit: milliseconds
#                  expr        min         lq       mean     median         uq       max neval cld
# ff(X, NUM, VAL)[, 2L] 182.215633 222.755943 236.844409 225.315683 236.060114 366.74375    30   b
#      ffC(X, NUM, VAL)   6.927356   6.986864   7.375294   7.078041   7.198103  10.10846    30  a
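
If the averaging behaviour of the question's fuzzy_lkup() is wanted instead of linear interpolation (so that 1.22 and 1.24 give the same result), the lookup can also be vectorized with findInterval(). This is a rough sketch, not a drop-in replacement: fuzzy_lkup_vec is a hypothetical name, num is assumed to be sorted in increasing order, and the sketch averages the two neighbouring rows regardless of their spacing, whereas the original uses a fixed window of 0.05. Non-exact values outside the table return NA here.

```r
# Lookup table from the question
lk_tbl <- data.frame(
  num = c(1, 1.05, 1.1, 1.15, 1.2, 1.25, 1.3, 1.35, 1.4),
  val = c(0.241970724519143, 0.229882140684233, 0.217852177032551,
          0.205936268719975, 0.194186054983213, 0.182649085389022,
          0.171368592047807, 0.16038332734192, 0.149727465635745)
)

# Hypothetical helper (not from the original post): vectorized counterpart
# of fuzzy_lkup(). Exact matches return the table value; anything else
# returns the mean of the two rows that bracket x.
fuzzy_lkup_vec <- function(x, num, val) {
  lo <- findInterval(x, num)   # index of the largest num <= x (0 if none)
  lo[lo == 0L] <- NA           # x below the table: no bracket, result is NA
  hi <- lo + 1L                # next row; past the last row val[hi] is NA
  exact <- num[lo] == x        # TRUE where x is itself a table entry
  ifelse(exact, val[lo], (val[lo] + val[hi]) / 2)
}

out <- fuzzy_lkup_vec(c(1.22, 1.18, 1.24, 1, 0.9, 1.5), lk_tbl$num, lk_tbl$val)
print(out)
```

For the in-range inputs this should reproduce the vapply() results shown in the question, without a per-element filter() call over the data frame.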