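I have a set of words: some are merged terms and others are simple words. I also have a separate word list that I will use as a dictionary, comparing it against my first list in order to "un-merge" certain words.
Here is an example: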
ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")
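My general procedure would be something like this:
Does this sound reasonable? If so, how can I implement it in R? Maybe this sounds routine, but at the moment I can't get any further than grep or an equivalent. Can anyone point me in the right direction?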
Answer 0 (score: 3)
I think the first step is to build every pairwise combination of the elements of ListB:
pairings <- expand.grid(ListB, ListB)
combos <- apply(pairings, 1, function(x) paste0(x[1], x[2]))
combos
# [1] "dodo" "minedo" "anddo" "thedo" "lowerdo" "owedo" "swimdo"
# [8] "domine" "minemine" "andmine" "themine" "lowermine" "owemine" "swimmine"
# [15] "doand" "mineand" "andand" "theand" "lowerand" "oweand" "swimand"
# [22] "dothe" "minethe" "andthe" "thethe" "lowerthe" "owethe" "swimthe"
# [29] "dolower" "minelower" "andlower" "thelower" "lowerlower" "owelower" "swimlower"
# [36] "doowe" "mineowe" "andowe" "theowe" "lowerowe" "oweowe" "swimowe"
# [43] "doswim" "mineswim" "andswim" "theswim" "lowerswim" "oweswim" "swimswim"
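As a side note, the same 49 combinations can probably be built without expand.grid and apply. The sketch below uses base R's outer with paste0; the name combos_alt is made up here for illustration and is not part of the answer. The pairings data frame is still needed for the later splitting step, so this is an alternative for building the strings only.

```r
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")

# outer() calls paste0 on every (first, second) pair of ListB elements;
# as.vector() flattens the resulting 7 x 7 matrix column by column,
# which gives the same order as the expand.grid version.
combos_alt <- as.vector(outer(ListB, ListB, paste0))

length(combos_alt)          # 7 * 7 = 49 combinations
"andthe" %in% combos_alt    # TRUE
```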
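You can then use str_extract from the stringr package to extract, for each element of ListA, the element of combos that it contains. Where no such element exists, the result is NA: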
library(stringr)
matches <- str_extract(ListA, paste(combos, collapse="|"))
matches
# [1] NA "andthe" "lowerswim" NA NA
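One thing to be aware of: the pattern passed to str_extract is not anchored, so it also matches a combination that appears inside a longer word. If only whole-word matches should count, a base R check with %in% may be enough. The sketch below assumes that requirement; exact_matches and the test word "grandtheft" are made up for illustration.

```r
ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")
combos <- as.vector(outer(ListB, ListB, paste0))

# Keep a word only if the whole word equals one of the combinations
exact_matches <- ifelse(ListA %in% combos, ListA, NA_character_)
exact_matches

# A hypothetical word that merely contains "andthe":
grepl(paste(combos, collapse = "|"), "grandtheft")  # unanchored regex matches
"grandtheft" %in% combos                            # exact comparison does not
```

For the example data both approaches agree; they differ only for words that contain a combination as a proper substring.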
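Finally, you want to split every word in ListA that matches a pair of elements from ListB, unless the word is itself already in ListB. I suppose there are many ways to do this, but I would use lapply and unlist: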
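newA <- unlist(lapply(seq_along(ListA), function(idx) {
  # Keep the word unchanged if it matched no combination or is already a dictionary word
  if (is.na(matches[idx]) || ListA[idx] %in% ListB) {
    return(ListA[idx])
  } else {
    # Otherwise return the two ListB words whose concatenation was matched
    return(as.vector(as.matrix(pairings[combos == matches[idx], ])))
  }
}))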
newA
# [1] "dopamine" "and" "the" "lower" "swim" "other" "different"
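For a larger dictionary, building every pair up front grows quadratically. A possible alternative, sketched here under the assumption that a merged word is exactly two dictionary words joined together, is to try each dictionary word as a prefix and check whether the remainder is also in the dictionary. The function name unmerge is hypothetical, and this is a different technique from the one in the answer above.

```r
ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")

# Hypothetical helper: split each word into two dictionary words when possible
unmerge <- function(words, dict) {
  unlist(lapply(words, function(w) {
    # A word that is already in the dictionary is left alone
    if (w %in% dict) return(w)
    # Try every dictionary word that is a prefix of w
    for (p in dict[startsWith(w, dict)]) {
      rest <- substring(w, nchar(p) + 1)
      if (rest %in% dict) return(c(p, rest))
    }
    # No split found: keep the word unchanged
    w
  }))
}

unmerge(ListA, ListB)
# [1] "dopamine"  "and"       "the"       "lower"     "swim"      "other"     "different"
```

If several splits are possible, this sketch returns the first one found in dictionary order; whether that is the right tie-breaking rule depends on the data.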