I am trying to look up a value in a given lookup table (lk_tbl) by what is, for want of a better word, fuzzy matching, as follows:
lk_tbl <- structure(list(num = c(1, 1.05, 1.1, 1.15, 1.2, 1.25, 1.3, 1.35,
1.4), val = c(0.241970724519143, 0.229882140684233, 0.217852177032551,
0.205936268719975, 0.194186054983213, 0.182649085389022, 0.171368592047807,
0.16038332734192, 0.149727465635745)), .Names = c("num", "val"
), row.names = c(NA, -9L), class = "data.frame")
> lk_tbl
num val
1 1.00 0.2419707
2 1.05 0.2298821
3 1.10 0.2178522
4 1.15 0.2059363
5 1.20 0.1941861
6 1.25 0.1826491
7 1.30 0.1713686
8 1.35 0.1603833
9 1.40 0.1497275
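Basically, the table pairs numbers with their associated values. Now, if I want to find the value associated with the number 1.22
(which is not in lk_tbl), I want to do a sort of interpolation.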
library(dplyr)  # provides %>% and filter()
fuzzy_lkup <- function(x) {
  matched_num <- lk_tbl %>%
    filter(num == x)  # check for exact val
  if (nrow(matched_num) == 1) {  # if the exact match exists
    return(matched_num$val)
  } else {
    # otherwise average the values of the rows within 0.05 of x
    return(lk_tbl %>%
      filter(x < num + 0.05, x > num - 0.05) %>%
      .[["val"]] %>%
      mean())
  }
}
> fuzzy_lkup(1) # it returns the matched value
[1] 0.2419707
> fuzzy_lkup(1.22) # it does the interpolation
[1] 0.1884176
# for the vector, I can use vapply like this.
> vapply(c(1.22, 1.18, 1.24), fuzzy_lkup,numeric(1))
[1] 0.1884176 0.2000612 0.1884176
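Ultimately, I want to do this for a long vector against a huge lookup table.
What I have observed is that this process is very slow for large lookup tables, so my questions are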
Answer 0 (score: 0)
Based on your description, a potential approach could be:
ff = function(x, num, val)
{
    i = findInterval(x, num) #map input to the lookup-table

    #make the appropriate vectors to interpolate
    nums = c(rbind(num[i], x, num[i + 1L]))
    vals = c(rbind(val[i], NA, val[i + 1L]))

    #if 'mean' is needed; i.e. 'f(1.22) == f(1.24)' etc, the following could be used:
    #nums = seq_along(vals)

    ans = approx(nums, vals, xout = nums)$y[seq(2L, length(nums), 3L)]
    return(cbind(x, ans))
}
An example:
ff(c(1.22, 1.18, 1.24, 1.05, 1.2, 1.22, 1.23, 1.24, 1.4, 1.5), lk_tbl$num, lk_tbl$val)
# x ans
# [1,] 1.22 0.1895713
# [2,] 1.18 0.1988861
# [3,] 1.24 0.1849565
# [4,] 1.05 0.2298821
# [5,] 1.20 0.1941861
# [6,] 1.22 0.1895713
# [7,] 1.23 0.1872639
# [8,] 1.24 0.1849565
# [9,] 1.40 0.1497275
#[10,] 1.50 NA
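As a cross-check on the output above, base R's approx() can also be called on the lookup table directly. This is only a sketch, assuming plain linear interpolation between neighbouring rows is what is wanted and that NA is acceptable for inputs outside the table's range; the name lookup_linear is made up for illustration.

```r
# Lookup table from the question
lk_tbl <- data.frame(
  num = c(1, 1.05, 1.1, 1.15, 1.2, 1.25, 1.3, 1.35, 1.4),
  val = c(0.241970724519143, 0.229882140684233, 0.217852177032551,
          0.205936268719975, 0.194186054983213, 0.182649085389022,
          0.171368592047807, 0.16038332734192, 0.149727465635745)
)

# Hypothetical helper: linear interpolation of 'val' at the points 'x'.
# With approx()'s default rule = 1, points outside range(num) give NA.
lookup_linear <- function(x, num, val) {
  approx(num, val, xout = x)$y
}

res <- lookup_linear(c(1.22, 1.18, 1.05, 1.5), lk_tbl$num, lk_tbl$val)
print(res)
```

The first three results should agree with the corresponding rows of the ff() output, and the last one (1.5, beyond the end of the table) should be NA.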
To address the second question: thanks to R's C API, the above can also be conveniently ported to C:
ffC = inline::cfunction(sig = c(x = "numeric", num = "numeric", val = "numeric"), body = '
    SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));

    double *px = REAL(x), *pnum = REAL(num), *pval = REAL(val), *pans = REAL(ans);
    int n = LENGTH(num), flag;

    for(int i = 0, ind = 1; i < LENGTH(x); i++) {
        ind = findInterval(pnum, n, px[i], 0, 0, ind, &flag);
        pans[i] = ind == n ? (px[i] == pnum[n - 1] ? pval[n - 1] : NA_REAL) :
                  pval[ind - 1] + (pval[ind] - pval[ind - 1]) *
                  ((px[i] - pnum[ind - 1]) / (pnum[ind] - pnum[ind - 1]));
    }

    UNPROTECT(1);
    return(ans);
', language = "C")
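For readers who prefer to stay in R, the per-element formula used in the C loop can be written as a vectorized expression. The following is a rough sketch rather than a drop-in replacement: ff_vec is a hypothetical name, num is assumed to be sorted in increasing order, and inputs below the first table entry are mapped to NA (a case the C code above does not guard against).

```r
# Hypothetical vectorized counterpart of the interpolation formula in ffC.
# Assumes 'num' is sorted increasingly and has the same length as 'val'.
ff_vec <- function(x, num, val) {
  n <- length(num)
  i <- findInterval(x, num)       # index of the table entry at or below x
  i <- replace(i, i == 0L, NA)    # x below the table: no left neighbour

  # linear interpolation between rows i and i + 1;
  # for i == n, num[i + 1] is NA, so the result is NA
  ans <- val[i] + (val[i + 1L] - val[i]) *
    ((x - num[i]) / (num[i + 1L] - num[i]))

  # x exactly equal to the last table entry gets the last value
  ans[which(i == n & x == num[n])] <- val[n]
  ans
}

num <- c(1, 1.05, 1.1, 1.15, 1.2, 1.25, 1.3, 1.35, 1.4)
val <- c(0.241970724519143, 0.229882140684233, 0.217852177032551,
         0.205936268719975, 0.194186054983213, 0.182649085389022,
         0.171368592047807, 0.16038332734192, 0.149727465635745)

out <- ff_vec(c(1.22, 1.4, 1.5, 0.9), num, val)
print(out)
```

Here 1.22 is interpolated between the rows for 1.20 and 1.25, 1.4 returns the last table value, and both 1.5 and 0.9 fall outside the table and come back as NA.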
Benchmarking the two approaches:
NUM = seq(1, 100, 0.2)
set.seed(007)
VAL = runif(length(NUM))
X = sample(1:110, 1e5, TRUE) + sample(seq(0, 1, 0.01), 1e5, TRUE)
all.equal(ff(X, NUM, VAL)[, 2L], ffC(X, NUM, VAL))
#[1] TRUE
microbenchmark::microbenchmark(ff(X, NUM, VAL)[, 2L], ffC(X, NUM, VAL), times = 30)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# ff(X, NUM, VAL)[, 2L] 182.215633 222.755943 236.844409 225.315683 236.060114 366.74375 30 b
# ffC(X, NUM, VAL) 6.927356 6.986864 7.375294 7.078041 7.198103 10.10846 30 a