Question

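I have some survey data where the item names are the survey text with the spaces removed. I would like to add the spaces back in. Obviously, this requires some knowledge of English.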

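Is there an R function that can correctly insert spaces back into a sentence after they have been removed?
Alternatively, are there text-processing functions that would help with this (e.g., by determining whether a sequence of letters is a word or a non-word)?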

Here is some sample data, but any function should work on arbitrary reasonable sentences:

x <- c("Shewrotehimalongletter,buthedidn'treadit.", 
      "Theshootersaysgoodbyetohislove.", 
      "WritingalistofrandomsentencesisharderthanIinitiallythoughtitwouldbe.", 
      "Letmehelpyouwithyourbaggage.", 
      "Pleasewaitoutsideofthehouse.", 
      "Iwantmoredetailedinformation.", 
      "Theskyisclear;thestarsaretwinkling.", 
      "Sometimes,allyouneedtodoiscompletelymakeanassofyourselfandlaughitofftorealisethatlifeisn’tsobadafterall.")

Source: http://www.randomwordgenerator.com/sentence.php

Answer 1

This is an answer, but more of a "there probably isn't a unique answer" answer.

The ScrabbleScore package includes the 2006 Tournament Word List, so I'll use that as an approximation of "English words" for my search.

library(ScrabbleScore)    
data("twl06")

We can check whether a word is "English" by looking it up in that list.

# Return the string itself if it is in the word list, otherwise the sentinel 1.
findword <- function(string) {
  if (string %in% twl06) return(string) else return(1)
}

Let's use a nicely ambiguous text, shall we? This one caused a bit of a stir when it was used as the hashtag for Susan Boyle's album party:

x <- c("susanalbumparty")

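We can check substrings for "English" words and progressively shorten the string as words are found. This can be done from either the start or the end, so I'll do both to demonstrate that the answer is not unique.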

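sentence_splitter <- function(x) {

  # Pass 1: work backwards from the end of the string.
  z <- y <- x
  words1 <- list()
  while(nchar(z) > 1) {
    # Drop leading characters until the remaining suffix is a word.
    while(findword(y) == 1 && nchar(y) > 1) {
      y <- substr(y, 2, nchar(y))
    }
    if (findword(y) != 1) words1 <- append(words1, y)
    # Cut the examined suffix off the end and continue with the rest.
    y <- z <- substr(z, 1, nchar(z) - nchar(y))
  }

  # Pass 2: work forwards from the start of the string.
  z <- y <- x
  words2 <- list()
  while(nchar(z) > 1) {
    # Drop trailing characters until the remaining prefix is a word.
    while(findword(y) == 1 && nchar(y) > 1) {
      y <- substr(y, 1, nchar(y) - 1)
    }
    if (findword(y) != 1) words2 <- append(words2, y)
    # Cut the examined prefix off the front and continue with the rest.
    y <- z <- substr(z, 1 + nchar(y), nchar(z))
  }

  # words1 was collected back to front, so reverse it before joining.
  return(list(paste(unlist(rev(words1)), collapse = " "),
              paste(unlist(words2), collapse = " ")))

}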

Result:

sentence_splitter("susanalbumparty")
#> [[1]]
#> [1] "us an album party"
#> 
#> [[2]]
#> [1] "us anal bump arty"

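Note: this finds the longest substring in each search direction (because I'm shortening the string). You could also find the shortest by extending the string instead. To do this properly, you would need to look at all "English" substrings and keep only the combinations that leave nothing but valid words.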
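As a rough illustration of that exhaustive approach, here is a minimal sketch that enumerates every split in which each piece is a dictionary word. The function name all_splits and the tiny dict vector are made up for this example (in practice you would pass twl06 or another word list). Unlike sentence_splitter, it does not skip unmatched characters, so the leading "s" is left off the input.

```r
# Toy dictionary: an assumption for illustration only; swap in twl06 or similar.
dict <- c("us", "an", "anal", "album", "bump", "party", "arty")

# Return every way to split `s` completely into dictionary words,
# one space-separated string per split (character(0) if none exists).
all_splits <- function(s, dict) {
  n <- nchar(s)
  if (n == 0) return("")                 # nothing left: one (empty) split
  out <- character(0)
  for (i in seq_len(n)) {
    word <- substr(s, 1, i)
    if (!(word %in% dict)) next          # prefix is not a word, try a longer one
    # Split the remainder recursively and prepend the current word.
    tails <- all_splits(substr(s, i + 1, n), dict)
    if (length(tails) > 0) out <- c(out, trimws(paste(word, tails)))
  }
  out
}

all_splits("usanalbumparty", dict)
#> [1] "us an album party" "us anal bump arty"
```

Plain recursion like this can take exponential time on long strings, so caching the results for each remainder would be worth adding before running it on full sentences.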

Lastly, you'll notice that 'susan' isn't matched because, by this definition, it isn't a "valid English word".

Hopefully that's enough to convince you that this isn't going to be simple.

Update: I tried this on some of your examples (once you tolower and remove punctuation, it's actually not too bad)... the last one is a doozy, but the rest seem okay.

unlist(lapply(sub("[[:punct:]]", "", tolower(x))[1:7], sentence_splitter))
#> "she wrote him along letter the did re adit"                                     
#> "shew rote him along letter but he did tread it"                                 
#> "the shooter says goodbye to his love"                                           
#> "the shooters ays goodbye to his love"                                           
#> "writing alist of random sentence sis harder ani initially though tit would be"  
#> "writing alist of randoms en ten es is harder than initially thought it would be"
#> "let me help you with your baggage"                                              
#> "let me help you withy our baggage"                                              
#> "please wait outside of the house"                                               
#> "please wait outside oft heh use"                                                
#> "want more detailed information"                                                 
#> "want more detailed information"                                                 
#> "the sky is clear the stars are twinkling"                                       
#> "the sky is clear the stars are twinkling"
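One detail about the call above: sub() replaces only the first match, so only the first punctuation mark in each sentence is removed before splitting, whereas gsub() replaces every match. The sketch below shows the difference on the first example sentence; the variable names are made up for illustration. The results listed above come from the sub() call as written, and swapping in gsub() may change them.

```r
s <- "Shewrotehimalongletter,buthedidn'treadit."

# sub() stops after the first match: only the comma is removed.
first_only <- sub("[[:punct:]]", "", s)
first_only
#> [1] "Shewrotehimalongletterbuthedidn'treadit."

# gsub() removes every match: comma, apostrophe and full stop.
all_punct <- gsub("[[:punct:]]", "", s)
all_punct
#> [1] "Shewrotehimalongletterbuthedidntreadit"
```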
