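How can I correctly insert spaces between words in R when the spaces have been removed from a sentence?

Time: 2016-08-09 01:15:50

Tags: r

I have some survey data in which the item names are the survey text with the spaces removed. I would like to add the spaces back. Obviously, this requires some knowledge of English.

  • Is there an R function that can correctly insert spaces into a sentence after they have been removed?
  • Alternatively, are there text-processing functions that would help with this (for example, by determining whether a sequence of letters is a word or a non-word)?

Here is some sample data, but any function should ideally work on arbitrary reasonable sentences: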

x <- c("Shewrotehimalongletter,buthedidn'treadit.", 
      "Theshootersaysgoodbyetohislove.", 
      "WritingalistofrandomsentencesisharderthanIinitiallythoughtitwouldbe.", 
      "Letmehelpyouwithyourbaggage.", 
      "Pleasewaitoutsideofthehouse.", 
      "Iwantmoredetailedinformation.", 
      "Theskyisclear;thestarsaretwinkling.", 
      "Sometimes,allyouneedtodoiscompletelymakeanassofyourselfandlaughitofftorealisethatlifeisn’tsobadafterall.")

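Source: http://www.randomwordgenerator.com/sentence.php

1 Answer:

Answer 0: (score: 5)

Here is an answer, though it is more of a "there probably isn't a unique answer" answer.

The ScrabbleScore package ships the 2006 Tournament Word List, so I will use that as an approximation of the "English words" I am searching for.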

library(ScrabbleScore)    
data("twl06")

We can check whether a word is "English" by looking it up in that list.

findword <- function(string) {
  if (string %in% twl06) return(string) else return(1)
}

Let's use a nicely ambiguous piece of text, shall we? This one caused a bit of a stir because it was used as the hashtag for Susan Boyle's album party.

x <- c("susanalbumparty")

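We can check substrings for "English" words and progressively shorten the string as words are found. This can be done from either the start or the end, so I will do both to demonstrate that the answer is not unique.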

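sentence_splitter <- function(x) {

  # Pass 1: work from the end of the string towards the start.
  z <- y <- x
  words1 <- list()
  while (nchar(z) > 1) {
    # Drop leading characters until the remaining suffix is a word.
    while (findword(y) == 1 && nchar(y) > 1) {
      y <- substr(y, 2, nchar(y))
    }
    if (findword(y) != 1) words1 <- append(words1, y)
    # Cut the matched (or unmatched single) suffix off and start again.
    y <- z <- substr(z, 1, nchar(z) - nchar(y))
  }

  # Pass 2: work from the start of the string towards the end.
  z <- y <- x
  words2 <- list()
  while (nchar(z) > 1) {
    # Drop trailing characters until the remaining prefix is a word.
    while (findword(y) == 1 && nchar(y) > 1) {
      y <- substr(y, 1, nchar(y) - 1)
    }
    if (findword(y) != 1) words2 <- append(words2, y)
    # Cut the matched (or unmatched single) prefix off and start again.
    y <- z <- substr(z, 1 + nchar(y), nchar(z))
  }

  # Pass 1 collected its words last-to-first, so reverse them.
  return(list(paste(unlist(rev(words1)), collapse = " "),
              paste(unlist(words2), collapse = " ")))

}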

Result:

sentence_splitter("susanalbumparty")
#> [[1]]
#> [1] "us an album party"
#> 
#> [[2]]
#> [1] "us anal bump arty"

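Note: this finds the longest substring in each search direction (because I am shortening the string). You could also find the shortest by extending the string instead. To do this properly, you would need to consider every way of splitting the text into "English" substrings so that only valid words remain.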
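As a rough illustration of "consider every split", here is a minimal recursive sketch. It is not part of the approach above: `all_splits` is a hypothetical helper, and `dict` is a tiny stand-in word list so that the example runs without ScrabbleScore (with the package loaded, `twl06` could presumably be passed instead). Unlike `sentence_splitter`, it insists that every letter belongs to a word, so the example input drops the leading "s".

```r
# Stand-in dictionary (hypothetical); not the real twl06 list.
dict <- c("us", "an", "anal", "album", "bump", "party", "arty")

# Return every way of splitting `s` into dictionary words,
# each split written as one space-separated string.
all_splits <- function(s, dict) {
  n <- nchar(s)
  if (n == 0) return("")              # nothing left: one (empty) split
  out <- character(0)
  for (k in seq_len(n)) {
    first <- substr(s, 1, k)
    if (first %in% dict) {
      # Split the remainder recursively; keep `first` only if the
      # remainder can itself be split completely.
      rest <- all_splits(substr(s, k + 1, n), dict)
      if (length(rest) > 0) out <- c(out, trimws(paste(first, rest)))
    }
  }
  out
}

all_splits("usanalbumparty", dict)
#> [1] "us an album party" "us anal bump arty"
```

With the full string "susanalbumparty" and this dictionary the function returns an empty vector, because no complete split exists. The number of candidate splits can grow quickly with a large dictionary, so this is only practical for short strings unless results are cached.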

Finally, you will notice that 'susan' is not matched because, by this definition, it is not a "valid English word".

Hopefully this is enough to convince you that this is not going to be simple.

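Update: trying this on some of your examples (once you apply tolower and remove the punctuation, it is actually not too bad)... the last one is a doozy, but the rest seem okay.

unlist(lapply(gsub("[[:punct:]]", "", tolower(x))[1:7], sentence_splitter))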
#> "she wrote him along letter the did re adit"                                     
#> "shew rote him along letter but he did tread it"                                 
#> "the shooter says goodbye to his love"                                           
#> "the shooters ays goodbye to his love"                                           
#> "writing alist of random sentence sis harder ani initially though tit would be"  
#> "writing alist of randoms en ten es is harder than initially thought it would be"
#> "let me help you with your baggage"                                              
#> "let me help you withy our baggage"                                              
#> "please wait outside of the house"                                               
#> "please wait outside oft heh use"                                                
#> "want more detailed information"                                                 
#> "want more detailed information"                                                 
#> "the sky is clear the stars are twinkling"                                       
#> "the sky is clear the stars are twinkling"
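One detail worth keeping in mind for the punctuation-stripping step: in base R, `sub()` replaces only the first match in each string, while `gsub()` replaces every match. A small sketch on a made-up string (the input `s` is hypothetical, not taken from the survey data):

```r
# Hypothetical input with three punctuation marks.
s <- "Didn'treadit,really."

first_only  <- sub("[[:punct:]]", "", tolower(s))   # removes the apostrophe only
all_removed <- gsub("[[:punct:]]", "", tolower(s))  # removes all punctuation

first_only
#> [1] "didntreadit,really."
all_removed
#> [1] "didntreaditreally"
```

The splitter tends to skip stray punctuation anyway, since a single character that is not a word gets dropped, but cleaning the input fully up front makes that behaviour explicit.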