Question

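I have a set of words: some are merged terms and others are plain words. I also have a separate list of words that I will compare against my first list (using it as a dictionary) in order to "un-merge" certain words.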

Here is an example:

ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")

My general procedure would be something like this:

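1. Search for patterns from ListB that occur twice within a word in ListA, where the merged terms are contiguous (no spare letters left over in the word). So, for example, 'lowerswim' from ListA would match 'lower' and 'swim', not 'owe' and 'swim'.
2. For each selected word, check whether that word exists in ListB. If it does, keep it as it is in ListA. Otherwise, split the word into the two words from ListB that it matched.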

Does this sound sensible? If so, how can I implement it in R? Maybe it sounds fairly routine, but at the moment I am having trouble with:

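- Searching for words within words. I can match words against a list without any problem, but I'm not sure how to go further than that with grep or an equivalent.
- Stating that the words must be contiguous. I have been thinking about this for a while, but I can't seem to come up with anything that works.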

Can anyone point me in the right direction?

Answer 1

I think the first step would be to build all pairwise combinations of the elements of ListB:

pairings <- expand.grid(ListB, ListB)
combos <- apply(pairings, 1, function(x) paste0(x[1], x[2]))
combos
#  [1] "dodo"       "minedo"     "anddo"      "thedo"      "lowerdo"    "owedo"      "swimdo"    
#  [8] "domine"     "minemine"   "andmine"    "themine"    "lowermine"  "owemine"    "swimmine"  
# [15] "doand"      "mineand"    "andand"     "theand"     "lowerand"   "oweand"     "swimand"   
# [22] "dothe"      "minethe"    "andthe"     "thethe"     "lowerthe"   "owethe"     "swimthe"   
# [29] "dolower"    "minelower"  "andlower"   "thelower"   "lowerlower" "owelower"   "swimlower" 
# [36] "doowe"      "mineowe"    "andowe"     "theowe"     "lowerowe"   "oweowe"     "swimowe"   
# [43] "doswim"     "mineswim"   "andswim"    "theswim"    "lowerswim"  "oweswim"    "swimswim"
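
As a side note, the same vector can be built without the row-wise apply call, because paste0 is vectorised over its arguments. This is only a minimal sketch; it assumes the default column names Var1 and Var2 that expand.grid assigns to unnamed arguments, and it passes stringsAsFactors = FALSE so the columns stay as character vectors rather than factors.

```r
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")

# All ordered pairs of dictionary words; the first column varies fastest
pairings <- expand.grid(ListB, ListB, stringsAsFactors = FALSE)

# paste0 joins the two columns element by element in a single call
combos <- paste0(pairings$Var1, pairings$Var2)

length(combos)
# [1] 49
```

The ordering is the same as in the apply version, so the rest of the answer works unchanged with this combos vector.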

You can then use str_extract from the stringr package to extract, for each element of ListA, the element of combos that it contains, if such an element exists:

library(stringr)
matches <- str_extract(ListA, paste(combos, collapse="|"))
matches
# [1] NA          "andthe"    "lowerswim" NA          NA
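
One thing to be aware of: str_extract looks for the pattern anywhere inside the word, so a hypothetical entry such as "xandthey" would also be reported as "andthe", even though it has spare letters. If the whole word has to be exactly two dictionary words, base R's match() is one possible alternative. The sketch below makes that assumption; the argument names first and second are mine, not part of the original code.

```r
ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")

pairings <- expand.grid(first = ListB, second = ListB,
                        stringsAsFactors = FALSE)
combos <- paste0(pairings$first, pairings$second)

# match() compares whole strings: it gives the position in combos of each
# word of ListA, or NA when the word is not exactly a pair of ListB words
hit <- match(ListA, combos)
hit
# [1] NA 24 47 NA NA

# A word with spare letters is not matched (hypothetical example)
match("xandthey", combos)
# [1] NA
```

The two parts of a matched word can then be read from the corresponding row of pairings, for example pairings[hit[2], ].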

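Finally, you want to split the words in ListA that match a pair of elements from ListB, unless the word is already in ListB. I suppose there are many ways to do this, but I would use lapply and unlist: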

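newA <- unlist(lapply(seq_along(ListA), function(idx) {
  # Keep the word as-is if it matched no pair or is itself a dictionary word
  if (is.na(matches[idx]) || ListA[idx] %in% ListB) {
    return(ListA[idx])
  } else {
    # Otherwise return the two ListB words whose concatenation was matched
    return(as.vector(as.matrix(pairings[combos == matches[idx],])))
  }
}))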
newA
# [1] "dopamine"  "and"       "the"       "lower"     "swim"      "other"     "different"
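
If ListB grows large, building every pair becomes expensive, since the number of combinations is the square of the dictionary size. A different technique, not the approach above, is to try every cut position inside each word and check whether both halves are in the dictionary. The following is a rough base R sketch under the assumption that a merged word consists of exactly two dictionary words; split_word is a hypothetical helper name.

```r
ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")

# Hypothetical helper: returns the word unchanged, or its two dictionary parts
split_word <- function(w, dict) {
  # Dictionary words and words too short to cut are kept as they are
  if (w %in% dict || nchar(w) < 2) return(w)
  for (k in seq_len(nchar(w) - 1)) {
    left  <- substr(w, 1, k)
    right <- substr(w, k + 1, nchar(w))
    # Both halves must be dictionary words, so no spare letters are allowed
    if (left %in% dict && right %in% dict) return(c(left, right))
  }
  w  # no valid cut found
}

newA <- unlist(lapply(ListA, split_word, dict = ListB))
newA
# [1] "dopamine"  "and"       "the"       "lower"     "swim"      "other"     "different"
```

This returns the first valid cut it finds, so a word that can be split in more than one way would need an extra rule to choose between the candidates.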
