Question

babybag - baby bag
badshelter - bad shelter
themoderncornerstore - the modern corner store
hamptonfamilyguidebook - hampton family guide book

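Is there a way to use R to extract words from strings that contain no spaces or other delimiters? I have a list of URLs and I am trying to figure out which words they contain.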

input <- c("babybag", "badshelter", "themoderncornerstore", "hamptonfamilyguidebook")

Answer 1

Here is a naive approach that might give you some ideas. I used the hunspell library, but you could test the substrings against any dictionary.

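I start from the right, try every substring that ends at the current position, and keep the longest one found in the dictionary. Then I move the end position left by the length of that word and repeat. It is slow, so I hope you don't have 4 million of these. "hampton" is not in this dictionary, so the result for the last string is wrong: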

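# Requires the hunspell package
split_words <- function(x){
  words <- NULL
  j <- nchar(x)  # end position of the word currently being searched for
  while(j != 0){
    word <- NULL
    # grow the candidate to the left; the last (longest) dictionary hit wins
    for (i in j:1){
      candidate <- substr(x, i, j)
      # an empty list of misspelled words means the candidate is in the dictionary
      if(!length(hunspell::hunspell_find(candidate)[[1]])) word <- candidate
    }
    if(is.null(word)) return("")  # no dictionary word ends at position j
    words <- c(word, words)
    j <- j - nchar(word)
  }
  words
}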


input <- c("babybag", "badshelter", "themoderncornerstore", "hamptonfamilyguidebook")

lapply(input,split_words)
# [[1]]
# [1] "baby" "bag" 
# 
# [[2]]
# [1] "bad"     "shelter"
# 
# [[3]]
# [1] "the"    "modern" "corner" "store" 
# 
# [[4]]
# [1] "h"         "amp"       "ton"       "family"    "guidebook"
#

Here is a quick fix that lets you add words to the dictionary manually:

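# `additional` lists extra words to accept on top of the hunspell dictionary
split_words <- function(x, additional = c("hampton","otherwordstoadd")){
  words <- NULL
  j <- nchar(x)  # end position of the word currently being searched for
  while(j != 0){
    word <- NULL
    # grow the candidate to the left; the last (longest) dictionary hit wins
    for (i in j:1){
      candidate <- substr(x, i, j)
      # words passed via `ignore` are never reported as misspelled
      if(!length(hunspell::hunspell_find(candidate, ignore = additional)[[1]])) word <- candidate
    }
    if(is.null(word)) return("")  # no dictionary word ends at position j
    words <- c(word, words)
    j <- j - nchar(word)
  }
  words
}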


input <- c("babybag", "badshelter", "themoderncornerstore", "hamptonfamilyguidebook")

lapply(input,split_words)
# [[1]]
# [1] "baby" "bag" 
# 
# [[2]]
# [1] "bad"     "shelter"
# 
# [[3]]
# [1] "the"    "modern" "corner" "store" 
# 
# [[4]]
# [1] "hampton"   "family"    "guidebook"
#

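Beyond that, you can only cross your fingers and hope there are no ambiguous expressions. Note that "guidebook" comes out as a single word in my output, so your four examples already contain one edge case.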
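
To illustrate the remark that any dictionary will do, here is a minimal sketch of the same greedy right-to-left search run against a plain character vector instead of hunspell. The word list `dict` and the function name `split_words_dict` are made up for this example; a real word list would be far larger.

```r
# Hypothetical word list standing in for a real dictionary
dict <- c("baby", "bag", "bad", "shelter", "the", "modern", "corner",
          "store", "hampton", "family", "guide", "book", "guidebook")

# Greedy right-to-left split: for each end position j, keep the longest
# substring ending at j that appears in `dict`.
split_words_dict <- function(x, dict) {
  words <- character(0)
  j <- nchar(x)
  while (j > 0) {
    word <- NULL
    for (i in j:1) {
      candidate <- substr(x, i, j)
      # later iterations test longer candidates, so the longest match wins
      if (candidate %in% dict) word <- candidate
    }
    if (is.null(word)) return(character(0))  # no listed word ends at position j
    words <- c(word, words)
    j <- j - nchar(word)
  }
  words
}

split_words_dict("hamptonfamilyguidebook", dict)
# [1] "hampton"   "family"    "guidebook"
```

Because the longest match wins, "guidebook" is returned as one word even though "guide" and "book" are both in the list. Getting "guide" "book" instead would need a different tie-breaking rule, which is exactly the kind of ambiguity mentioned above.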
