Question

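I have two data frames. One is the target data frame (targetframe); the other serves as a library holding some key values (word.library). I need the following algorithm: it should look for approximate matches between word.library$mainword and targetframe$words. Once an approximate match has been found, the matching substring in targetframe$words should be replaced with word.library$keyID.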

Here are the two data frames mentioned above:

targetframe <- data.frame(words = c("This is sentence one with the important word",
                                    "This is sentence two with the inportant woord",
                                    "This is sentence three with crazy sayings"),
                          stringsAsFactors = FALSE)

word.library <- data.frame(mainword = c("important word",
                                        "crazy sayings"),
                           keyID = c("1001",
                                     "2001"),
                           stringsAsFactors = FALSE)

Here is my solution.

for (i in seq_len(nrow(word.library))) {
  # Locate the approximate match of the i-th library entry in every sentence
  positions <- aregexec(word.library$mainword[i], targetframe$words, max.distance = 0.1)
  res <- regmatches(targetframe$words, positions)
  res[lengths(res) == 0] <- "XXXX"  # deal with 0 length matches somehow
  # Replace the matched substring with the key ID, row by row (literal match, not regex)
  targetframe$words <- unname(Vectorize(gsub)(unlist(res), word.library$keyID[i],
                                              targetframe$words, fixed = TRUE))
}
targetframe$words

However, I am using a for loop, which is very inefficient (imagine I have two huge data frames). Does anyone know how to solve this more efficiently?
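One direction that might be worth trying is sketched below. It is only a sketch, not benchmarked: it keeps the loop over the library entries but replaces the per-row gsub call with a vectorized substr/paste0 replacement driven by the positions that aregexec already returns. The helper name replace_by_position is hypothetical, and the code assumes at most one replacement per sentence and per library entry.

```r
# Sample data from the question; stringsAsFactors = FALSE keeps columns as character.
targetframe <- data.frame(
  words = c("This is sentence one with the important word",
            "This is sentence two with the inportant woord",
            "This is sentence three with crazy sayings"),
  stringsAsFactors = FALSE
)
word.library <- data.frame(
  mainword = c("important word", "crazy sayings"),
  keyID = c("1001", "2001"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of the question): replace the approximate match
# of `pattern` in each element of `text` with `id`, using the match positions.
replace_by_position <- function(text, pattern, id, max.distance = 0.1) {
  m <- aregexec(pattern, text, max.distance = max.distance)
  # Start position and length of the overall match for every element of `text`
  start <- vapply(m, function(x) as.integer(x[1]), integer(1))
  len <- vapply(m, function(x) as.integer(attr(x, "match.length")[1]), integer(1))
  hit <- !is.na(start) & start > 0  # aregexec reports -1 when nothing matched
  if (any(hit)) {
    text[hit] <- paste0(
      substr(text[hit], 1, start[hit] - 1),                        # text before the match
      id,                                                          # replacement key
      substr(text[hit], start[hit] + len[hit], nchar(text[hit]))   # text after the match
    )
  }
  text
}

# Still one pass per library entry, but each pass is vectorized over all sentences.
words <- targetframe$words
for (i in seq_len(nrow(word.library))) {
  words <- replace_by_position(words, word.library$mainword[i], word.library$keyID[i])
}
targetframe$words <- words
targetframe$words
```

Because the replacement works on positions instead of re-searching for the matched text, no "XXXX" placeholder is needed for sentences without a match. Whether this is actually faster on two huge data frames would have to be measured, since aregexec itself is still called once per library entry.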

aregexec matching with two data frames

0 answers: