Question

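I have a 9k-row data.table dt (see the example below). I need to compare each rname of dt with each cname of a reference data.table dt.ref. By comparing, I mean computing the Levenshtein ratio.

Then I take the maximum ratio and get my output (see below).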

dt

nid | rname  | maxr
n1  | apple  | 0.5
n2  | pear   | 0.8
n3  | banana | 0.7
n4  | kiwi   | 0.6
... (9k rows)

dt.ref

cid | cname
c1  | apple
c2  | maple
c3  | peer
c4  | dear
c5  | bonobo
c6  | kiwis
... (75k rows)

Output

nid | rname  | maxr | maxLr | cid
n1  | apple  | 0.5  | 1     | c1
n2  | pear   | 0.8  | 0.75  | c3
n2  | pear   | 0.8  | 0.75  | c4
n3  | banana | 0.7  | 0.33  | c5
n4  | kiwi   | 0.6  | 0.8   | c6
...
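For reference, the maxLr values in the table can be reproduced on a couple of the example strings with the short sketch below. The ratio here is taken to be 1 minus the Levenshtein distance divided by the length of the longer string. `lev_ratio` is a hypothetical helper name, and base R `utils::adist` is swapped in for `stringdistmatrix` only so that the snippet needs no extra packages.

```r
# Levenshtein ratio: 1 - distance / length of the longer string.
# lev_ratio is an illustrative helper, not part of any package.
lev_ratio <- function(a, b) {
  d   <- utils::adist(a, b)                 # length(a) x length(b) edit distances
  len <- outer(nchar(a), nchar(b), pmax)    # longer length for each pair
  1 - d / len
}

r <- lev_ratio(c("pear", "kiwi"), c("peer", "dear", "kiwis"))
r[1, 1]  # "pear" vs "peer":  one substitution over 4 characters
r[2, 3]  # "kiwi" vs "kiwis": one insertion over 5 characters
```

"pear" ties between "peer" and "dear", which is why n2 appears twice in the output.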

To compute this output, I use the stringdistmatrix function inside a function written as follows (see Computing the Levenshtein ratio of each element of a data.table with each value of a reference table and merge with maximum ratio):

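    library(stringdist)
    library(matrixStats)

    f1 <- function(x, y) {
      # Levenshtein distance between every element of x and every element of y
      dis <- stringdistmatrix(x, y, method = "lv")
      # length of the longer string in each pair
      mat <- sapply(nchar(y), function(i) pmax(i, nchar(x)))
      r <- 1 - dis / mat
      m <- rowMaxs(r)
      # lapply always returns a list, even when every row has the same number of ties
      w <- lapply(seq_len(nrow(r)), function(i) which(r[i, ] == m[i]))
      list(m = m, w = w)
    }

    r <- f1(dt[[2]], dt.ref[[2]])
    dt[, maxLr := r$m]
    dtnew <- dt[rep(1:.N, lengths(r$w)), ]
    dtnew[, cid := dt.ref[unlist(r$w), 1]]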
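As a rough check on scale, the arithmetic below estimates the size of the full matrices for the table sizes quoted above (about 9k names against 75k reference names). It assumes that the distance and ratio matrices are stored as doubles (8 bytes per cell) and the length matrix as integers (4 bytes per cell), and it ignores any temporary copies R makes along the way, so the real peak is likely higher. `gib` is a hypothetical helper name.

```r
n_rows <- 9e3    # names in dt (approximate, from the question)
n_cols <- 75e3   # names in dt.ref (approximate, from the question)
cells  <- n_rows * n_cols

# Convert a byte count to GiB.
gib <- function(bytes) bytes / 1024^3

double_matrix  <- gib(cells * 8)   # one full matrix of doubles, about 5 GiB
integer_matrix <- gib(cells * 4)   # one full matrix of integers, about 2.5 GiB

# Two double matrices (distances, ratios) plus one integer matrix (lengths)
# held at the same time, before counting temporaries.
total <- 2 * double_matrix + integer_matrix
total
```

Under these assumptions the three matrices alone need more than 12 GiB.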

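However, with the 9k x 75k matrix I run into a memory problem that aborts the R session. Apart from splitting the 9k table, would either of the following be a way forward:

Writing the code differently, using only data.table instead of a matrix?

Sorting and splitting the reference table, and computing the Levenshtein ratio only on subsets of the 75k strings?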

Reducing memory usage with stringdistmatrix
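The idea of splitting the reference table might look roughly like the sketch below. It walks through the reference strings in chunks, keeps only the running maximum ratio and the positions that reach it, and lets each chunk's matrix go before the next one is built, so only about length(x) x chunk_size cells exist at a time. This is a sketch under assumptions, not a tested solution for the full data: `lv_matrix` and `max_lev_ratio_chunked` are hypothetical helper names, `chunk_size` is an arbitrary default, base R `adist` is used only as a fallback when stringdist is not installed, and the strings are assumed to be non-empty.

```r
# Distance matrix with rows = x and columns = y.
# Hypothetical helper: uses stringdist when available, base R adist otherwise.
lv_matrix <- function(x, y) {
  if (requireNamespace("stringdist", quietly = TRUE)) {
    stringdist::stringdistmatrix(x, y, method = "lv")
  } else {
    utils::adist(x, y)
  }
}

# Maximum Levenshtein ratio of each x against all y, computed chunk by chunk.
# Returns the same shape as f1: m = best ratio, w = list of positions in y.
max_lev_ratio_chunked <- function(x, y, chunk_size = 5000L) {
  nx   <- nchar(x)
  best <- rep(-Inf, length(x))
  hits <- vector("list", length(x))
  for (s in seq(1L, length(y), by = chunk_size)) {
    idx <- s:min(s + chunk_size - 1L, length(y))    # positions of this chunk in y
    d   <- lv_matrix(x, y[idx])
    len <- outer(nx, nchar(y[idx]), pmax)           # longer length for each pair
    r   <- 1 - d / len
    cm  <- apply(r, 1L, max)                        # best ratio within this chunk
    for (i in seq_along(x)) {
      if (cm[i] > best[i]) {
        # strictly better: discard earlier ties
        best[i]   <- cm[i]
        hits[[i]] <- idx[r[i, ] == cm[i]]
      } else if (cm[i] == best[i]) {
        # equal to the running best: keep earlier ties and add the new ones
        hits[[i]] <- c(hits[[i]], idx[r[i, ] == cm[i]])
      }
    }
  }
  list(m = best, w = hits)
}

# Small check on the example strings, with a deliberately tiny chunk size.
rname <- c("apple", "pear", "banana", "kiwi")
cname <- c("apple", "maple", "peer", "dear", "bonobo", "kiwis")
res <- max_lev_ratio_chunked(rname, cname, chunk_size = 2L)
```

Because the result has the same `m` and `w` elements as the output of f1, it could presumably be passed to the existing data.table code in place of `r`, for example `r <- max_lev_ratio_chunked(dt[[2]], dt.ref[[2]])`.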

0 answers: