How can I output all possible variations of a word at a fixed distance value in R?

Time: 2019-03-26 08:12:58

Tags: r text-mining tidyverse stringr quanteda

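I have a word and want to output all of its possible variations (deletions, substitutions, insertions) at a fixed distance value into a vector in R.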

For example, the word "Cat" with a fixed distance value of 1 yields a vector with the elements "cot", "at", ...

1 Answer:

Answer 0: (score: 1)

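I'm going to assume that you want all actual words, not just every character string at an edit distance of 1, which would include non-words such as "zat".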

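We can do this with adist(), which computes the edit distance between your target word and every eligible English word taken from some word list. Here I used the English syllable dictionary from the quanteda package (you did tag this question as quanteda, after all), but it could just as well be a vector of English dictionary words from any other source.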
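As a quick illustration of what adist() returns, here is a minimal sketch. The candidate vector is made up for the example and is not taken from any dictionary.

```r
# Toy candidates (an assumption for illustration only)
target <- "cat"
candidates <- c("cot", "at", "cart", "dog")

# adist() returns a matrix with one row per element of its first argument,
# so the first row holds the distances from `target` to every candidate
d <- utils::adist(target, candidates)[1, ]

d                     # 1 1 1 3
candidates[d == 1]    # "cot" "at" "cart"
```

One substitution ("cot"), one deletion ("at"), and one insertion ("cart") each count as a distance of 1, while "dog" needs three substitutions.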

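To narrow things down, we first exclude every word whose length differs from that of the target word by more than your distance value, since such a word can never be within that edit distance.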

distfn <- function(word, distance = 1) {
  # select eligible words for efficiency
  eligible_y_words <- names(quanteda::data_int_syllables)
  wordlengths <- nchar(eligible_y_words)
  eligible_y_words <- eligible_y_words[wordlengths >= (nchar(word) - distance) &
    wordlengths <= (nchar(word) + distance)]
  # compute Levenshtein distance
  distances <- utils::adist(word, eligible_y_words)[1, ]
  # return only those for the requested distance value
  eligible_y_words[distances == distance]
}

distfn("cat", 1)
##  [1] "at"   "bat"  "ca"   "cab"  "cac"  "cad"  "cai"  "cal"  "cam"  "can" 
## [11] "cant" "cao"  "cap"  "caq"  "car"  "cart" "cas"  "cast" "cate" "cato"
## [21] "cats" "catt" "cau"  "caw"  "cay"  "chat" "coat" "cot"  "ct"   "cut" 
## [31] "dat"  "eat"  "fat"  "gat"  "hat"  "kat"  "lat"  "mat"  "nat"  "oat" 
## [41] "pat"  "rat"  "sat"  "scat" "tat"  "vat"  "wat"

To demonstrate how this works on a longer word, with a different distance value:

distfn("coffee", 1)
## [1] "caffee"  "coffeen" "coffees" "coffel"  "coffer"  "coffey"  "cuffee" 
## [8] "toffee"

distfn("coffee", 2)
##  [1] "caffey"   "calfee"   "chafee"   "chaffee"  "cofer"    "coffee's"
##  [7] "coffelt"  "coffers"  "coffin"   "cofide"   "cohee"    "coiffe"  
## [13] "coiffed"  "colee"    "colfer"   "combee"   "comfed"   "confer"  
## [19] "conlee"   "coppee"   "cottee"   "coulee"   "coutee"   "cuffe"   
## [25] "cuffed"   "diffee"   "duffee"   "hoffer"   "jaffee"   "joffe"   
## [31] "mcaffee"  "moffet"   "noffke"   "offen"    "offer"    "roffe"   
## [37] "scoffed"  "soffel"   "soffer"   "yoffie"

(Yes, according to the CMU pronunciation dictionary, these are all real words...)
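Since the word list can come from any source, the same idea can be sketched with the dictionary passed in as an argument. The function name `distfn_dict` and the toy dictionary below are assumptions made for illustration; they are not part of quanteda.

```r
# Hypothetical variant of distfn() that takes any character vector of words
distfn_dict <- function(word, dictionary, distance = 1) {
  dictionary <- unique(tolower(dictionary))
  # the edit distance is never smaller than the difference in length,
  # so words that are too short or too long can be dropped up front
  eligible <- dictionary[abs(nchar(dictionary) - nchar(word)) <= distance]
  if (length(eligible) == 0) {
    return(character())
  }
  # Levenshtein distance from `word` to each remaining candidate
  distances <- utils::adist(word, eligible)[1, ]
  eligible[distances == distance]
}

# Toy dictionary (made up for this example)
toy_dictionary <- c("cat", "cot", "at", "cart", "coat", "dog", "category")
distfn_dict("cat", toy_dictionary, 1)
# "cot" "at" "cart" "coat"
```

Here "category" is removed by the length filter before any distance is computed, and "cat" and "dog" are removed afterwards because their distances are 0 and 3.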

Edit: consider all permutations of letters, not just actual words

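This version works with permutations of letters that are at a fixed edit distance from the input word. I have not done it especially efficiently here: I form every permutation of letters within the eligible length range, compute their edit distances from the target word, and then select the matches. So it is a variation on the above, except that it uses the permuted strings instead of a dictionary.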

distfn2 <- function(word, distance = 1) {
  result <- character()

  # start with deletions
  for (i in max((nchar(word) - distance), 0):(nchar(word) - 1)) {
    result <- c(
      result,
      combn(unlist(strsplit(word, "", fixed = TRUE)), i,
        paste,
        collapse = "", simplify = TRUE
      )
    )
  }

  # now for changes and insertions
  for (i in (nchar(word)):(nchar(word) + distance)) {
    # all possible edits
    edits <- apply(expand.grid(rep(list(letters), i)),
      1, paste0,
      collapse = ""
    )
    # remove original word
    edits <- edits[edits != word]
    # get all distances, add to result
    distances <- utils::adist(word, edits)[1, ]
    result <- c(result, edits[distances == distance])
  }

  result
}

For the OP's example:

distfn2("cat", 1)
##   [1] "ca"   "ct"   "at"   "caa"  "cab"  "cac"  "cad"  "cae"  "caf"  "cag" 
##  [11] "cah"  "cai"  "caj"  "cak"  "cal"  "cam"  "can"  "cao"  "cap"  "caq" 
##  [21] "car"  "cas"  "aat"  "bat"  "dat"  "eat"  "fat"  "gat"  "hat"  "iat" 
##  [31] "jat"  "kat"  "lat"  "mat"  "nat"  "oat"  "pat"  "qat"  "rat"  "sat" 
##  [41] "tat"  "uat"  "vat"  "wat"  "xat"  "yat"  "zat"  "cbt"  "cct"  "cdt" 
##  [51] "cet"  "cft"  "cgt"  "cht"  "cit"  "cjt"  "ckt"  "clt"  "cmt"  "cnt" 
##  [61] "cot"  "cpt"  "cqt"  "crt"  "cst"  "ctt"  "cut"  "cvt"  "cwt"  "cxt" 
##  [71] "cyt"  "czt"  "cau"  "cav"  "caw"  "cax"  "cay"  "caz"  "cata" "catb"
##  [81] "catc" "catd" "cate" "catf" "catg" "cath" "cati" "catj" "catk" "catl"
##  [91] "catm" "catn" "cato" "catp" "catq" "catr" "cats" "caat" "cbat" "acat"
## [101] "bcat" "ccat" "dcat" "ecat" "fcat" "gcat" "hcat" "icat" "jcat" "kcat"
## [111] "lcat" "mcat" "ncat" "ocat" "pcat" "qcat" "rcat" "scat" "tcat" "ucat"
## [121] "vcat" "wcat" "xcat" "ycat" "zcat" "cdat" "ceat" "cfat" "cgat" "chat"
## [131] "ciat" "cjat" "ckat" "clat" "cmat" "cnat" "coat" "cpat" "cqat" "crat"
## [141] "csat" "ctat" "cuat" "cvat" "cwat" "cxat" "cyat" "czat" "cabt" "cact"
## [151] "cadt" "caet" "caft" "cagt" "caht" "cait" "cajt" "cakt" "calt" "camt"
## [161] "cant" "caot" "capt" "caqt" "cart" "cast" "catt" "caut" "cavt" "cawt"
## [171] "caxt" "cayt" "cazt" "catu" "catv" "catw" "catx" "caty" "catz"

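This also works with other edit distances, although it becomes very slow for longer words. Note that the deletion step keeps every shortened string without checking its distance, so for distances above 1 the result also contains strings that are closer than requested (such as "ca" in the output below), and it misses shorter strings that combine a deletion with a substitution (such as "xa").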

d2 <- distfn2("cat", 2)
set.seed(100)
c(head(d2, 50), sample(d2, 50), tail(d2, 50))
##   [1] "c"     "a"     "t"     "ca"    "ct"    "at"    "aaa"   "baa"  
##   [9] "daa"   "eaa"   "faa"   "gaa"   "haa"   "iaa"   "jaa"   "kaa"  
##  [17] "laa"   "maa"   "naa"   "oaa"   "paa"   "qaa"   "raa"   "saa"  
##  [25] "taa"   "uaa"   "vaa"   "waa"   "xaa"   "yaa"   "zaa"   "cba"  
##  [33] "aca"   "bca"   "cca"   "dca"   "eca"   "fca"   "gca"   "hca"  
##  [41] "ica"   "jca"   "kca"   "lca"   "mca"   "nca"   "oca"   "pca"  
##  [49] "qca"   "rca"   "cnts"  "cian"  "pcatb" "cqo"   "uawt"  "hazt" 
##  [57] "cpxat" "aaet"  "ckata" "caod"  "ncatl" "qcamt" "cdtp"  "qajt" 
##  [65] "bckat" "qcatr" "cqah"  "rcbt"  "cvbt"  "bbcat" "vcaz"  "ylcat"
##  [73] "cahz"  "jcgat" "mant"  "jatd"  "czlat" "cbamt" "cajta" "cafp" 
##  [81] "cizt"  "cmaut" "qwat"  "jcazt" "hdcat" "ucant" "hate"  "cajtl"
##  [89] "caaty" "cix"   "nmat"  "cajit" "cmnat" "caobt" "catoi" "ncau" 
##  [97] "ucoat" "ncamt" "jath"  "oats"  "chatz" "ciatz" "cjatz" "ckatz"
## [105] "clatz" "cmatz" "cnatz" "coatz" "cpatz" "cqatz" "cratz" "csatz"
## [113] "ctatz" "cuatz" "cvatz" "cwatz" "cxatz" "cyatz" "czatz" "cabtz"
## [121] "cactz" "cadtz" "caetz" "caftz" "cagtz" "cahtz" "caitz" "cajtz"
## [129] "caktz" "caltz" "camtz" "cantz" "caotz" "captz" "caqtz" "cartz"
## [137] "castz" "cattz" "cautz" "cavtz" "cawtz" "caxtz" "caytz" "caztz"
## [145] "catuz" "catvz" "catwz" "catxz" "catyz" "catzz"
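To see why this slows down so quickly, the number of strings that the second loop of distfn2() builds with expand.grid() can be counted directly. This is a minimal sketch; `candidate_count` is a hypothetical helper name, and it only counts candidates without generating them.

```r
# Hypothetical helper: how many strings distfn2() generates before filtering.
# The loop covers lengths nchar(word) up to nchar(word) + distance, and there
# are alphabet_size^len strings of each length.
candidate_count <- function(word, distance = 1, alphabet_size = length(letters)) {
  lens <- nchar(word):(nchar(word) + distance)
  sum(alphabet_size^lens)
}

candidate_count("cat", 1)     # 26^3 + 26^4       = 474552
candidate_count("cat", 2)     # 26^3 + 26^4 + 26^5 = 12355928
candidate_count("coffee", 1)  # 26^6 + 26^7       = 8340725952
```

A six-letter word at distance 1 already implies more than eight billion candidate strings, which is far more than expand.grid() and adist() can reasonably handle in memory.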

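This could be sped up by avoiding the brute-force formation of all permutations followed by adist(). Instead, the candidates could be built algorithmically from letters as changes or insertions with a known edit distance.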
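One possible reading of that suggestion is sketched below: generate only the strings that are one edit away, repeat that step once per unit of distance, and use adist() just to discard strings that ended up closer than requested. The names `edits1` and `edits_at` are hypothetical, and this is an illustrative sketch under those assumptions rather than the answer's own implementation.

```r
# Hypothetical helper: every string that is one deletion, substitution, or
# insertion away from `word` (same-letter substitutions return `word` itself).
edits1 <- function(word, alphabet = letters) {
  pos <- 0:nchar(word)
  left <- substring(word, 1, pos)      # for "cat": "", "c", "ca", "cat"
  right <- substring(word, pos + 1)    # for "cat": "cat", "at", "t", ""
  tail1 <- substring(right, 2)         # right-hand side minus its first character
  has_char <- nzchar(right)            # split points that have a character to edit

  # pair every letter with every split point (left/right/tail1 are recycled)
  ch <- rep(alphabet, each = length(pos))
  deletes <- paste0(left, tail1)[has_char]
  subs <- paste0(left, ch, tail1)[rep(has_char, times = length(alphabet))]
  inserts <- paste0(left, ch, right)
  unique(c(deletes, subs, inserts))
}

# Hypothetical helper: all strings at exactly `distance` edits from `word`
edits_at <- function(word, distance = 1, alphabet = letters) {
  stopifnot(distance >= 1)
  candidates <- word
  for (step in seq_len(distance)) {
    candidates <- unique(unlist(lapply(candidates, edits1, alphabet = alphabet)))
  }
  # successive edits can cancel each other out, so keep only exact matches
  d <- utils::adist(word, candidates)[1, ]
  candidates[d == distance]
}

length(edits_at("cat", 1))   # 179, the same count as the distfn2("cat", 1) output
```

Because only reachable strings are generated, the work grows with the number of edits rather than with 26 raised to the word length, so a call such as edits_at("coffee", 1) stays small. The final adist() filter also means that, for distances above 1, strings such as "ca" are excluded and strings such as "xa" are included.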