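I have some survey data in which the item names are the survey question text with the spaces removed. I would like to add the spaces back. Obviously, this requires some knowledge of English.
Here is some sample data, but any function should work for any reasonable sentence: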
x <- c("Shewrotehimalongletter,buthedidn'treadit.",
"Theshootersaysgoodbyetohislove.",
"WritingalistofrandomsentencesisharderthanIinitiallythoughtitwouldbe.",
"Letmehelpyouwithyourbaggage.",
"Pleasewaitoutsideofthehouse.",
"Iwantmoredetailedinformation.",
"Theskyisclear;thestarsaretwinkling.",
"Sometimes,allyouneedtodoiscompletelymakeanassofyourselfandlaughitofftorealisethatlifeisn’tsobadafterall.")
Answer 0 (score: 5)
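This is an answer, but it is more of a "there may not be a unique answer" answer.
The ScrabbleScore package includes the 2006 tournament word list, so I use it as an approximation of the "English words" I am searching for.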
library(ScrabbleScore)
data("twl06")
We can check whether a word is "English" by looking it up in that list.
# Returns the string itself if it is in the word list, otherwise the sentinel 1.
findword <- function(string) {
  if (string %in% twl06) return(string) else return(1)
}
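As a side note, the lookup can also be written as a plain logical test, which avoids mixing a string and the number 1 in the return value. The following is only a minimal sketch: is_word and toy_dict are hypothetical names, and the tiny dictionary stands in for twl06 so the example runs without the package.

```r
# Toy stand-in for twl06 (hypothetical; the real list is far larger).
toy_dict <- c("us", "an", "anal", "album", "bump", "party", "arty")

# Hypothetical helper: TRUE where `string` is in `dict`, FALSE otherwise.
# Vectorised over `string`, because %in% is vectorised.
is_word <- function(string, dict) {
  tolower(string) %in% dict
}

is_word("album", toy_dict)          # TRUE
is_word("susan", toy_dict)          # FALSE
is_word(c("us", "xyz"), toy_dict)   # TRUE FALSE
```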
Let's use a nicely ambiguous piece of text, shall we? This one caused a bit of a stir because it was used as the hashtag for Susan Boyle's album party:
x <- c("susanalbumparty")
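We can check substrings for "English" words and progressively shorten the string whenever we find one. This can be done from the start or from the end, so I will do both to demonstrate that the answer is not unique.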
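sentence_splitter <- function(x) {
  # Pass 1: peel the longest word off the END of the remaining string.
  z <- y <- x
  words1 <- list()
  while (nchar(z) > 1) {
    # Drop leading characters until y is a word or a single character.
    while (findword(y) == 1 && nchar(y) > 1) {
      y <- substr(y, 2, nchar(y))
    }
    if (findword(y) != 1) words1 <- append(words1, y)
    # Cut the matched (or skipped) suffix off z and start again.
    y <- z <- substr(z, 1, nchar(z) - nchar(y))
  }
  # Pass 2: peel the longest word off the START of the remaining string.
  z <- y <- x
  words2 <- list()
  while (nchar(z) > 1) {
    # Drop trailing characters until y is a word or a single character.
    while (findword(y) == 1 && nchar(y) > 1) {
      y <- substr(y, 1, nchar(y) - 1)
    }
    if (findword(y) != 1) words2 <- append(words2, y)
    # Cut the matched (or skipped) prefix off z and start again.
    y <- z <- substr(z, 1 + nchar(y), nchar(z))
  }
  # words1 was collected back to front, so reverse it before pasting.
  return(list(paste(unlist(rev(words1)), collapse = " "),
              paste(unlist(words2), collapse = " ")))
}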
Result:
sentence_splitter("susanalbumparty")
#> [[1]]
#> [1] "us an album party"
#>
#> [[2]]
#> [1] "us anal bump arty"
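Note: this finds the longest substring in each search direction (because I am shortening the string). You could also find the shortest by growing the string instead. To do this properly, you would need to consider every "English" substring whose removal leaves only valid words.
Finally, you will notice that 'susan' is not matched because, by this definition, it is not a "valid English word".
Hopefully this is enough to convince you that this will not be straightforward.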
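The exhaustive approach mentioned in the note (keeping only splits where every character belongs to a valid word) could be sketched roughly as follows. This is not the code used in this answer: all_splits is a hypothetical helper, and toy_dict is a tiny made-up dictionary standing in for twl06 so the example is self-contained. It recurses over every prefix that is a word, so it can be slow on long strings with a large dictionary.

```r
# Toy stand-in for twl06 (hypothetical).
toy_dict <- c("us", "an", "anal", "album", "bump", "party", "arty")

# Return every way to split `s` entirely into words from `dict`,
# one space-separated string per split; character(0) if there is none.
all_splits <- function(s, dict) {
  n <- nchar(s)
  if (n == 0) return("")   # nothing left to split: one (empty) solution
  out <- character(0)
  for (i in seq_len(n)) {
    prefix <- substr(s, 1, i)
    if (prefix %in% dict) {
      # Split the remainder recursively; keep the prefix only if the
      # remainder can itself be fully split.
      rest <- all_splits(substr(s, i + 1, n), dict)
      if (length(rest) > 0) out <- c(out, trimws(paste(prefix, rest)))
    }
  }
  out
}

all_splits("usanalbumparty", toy_dict)
# "us an album party" "us anal bump arty"

all_splits("susanalbumparty", toy_dict)
# character(0): the leading "s" cannot be covered by any word
```

Unlike the greedy splitter, this version refuses to drop characters, which is why the full string yields no split unless a word such as "susan" is added to the dictionary.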
Update: I tried this on some of your examples (once you apply tolower and remove the punctuation, it is actually not too bad)... The last one is a doozy, but the rest seem okay:
unlist(lapply(sub("[[:punct:]]", "", tolower(x))[1:7], sentence_splitter))
#> "she wrote him along letter the did re adit"
#> "shew rote him along letter but he did tread it"
#> "the shooter says goodbye to his love"
#> "the shooters ays goodbye to his love"
#> "writing alist of random sentence sis harder ani initially though tit would be"
#> "writing alist of randoms en ten es is harder than initially thought it would be"
#> "let me help you with your baggage"
#> "let me help you withy our baggage"
#> "please wait outside of the house"
#> "please wait outside oft heh use"
#> "want more detailed information"
#> "want more detailed information"
#> "the sky is clear the stars are twinkling"
#> "the sky is clear the stars are twinkling"
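One detail of the preprocessing in the update: sub() replaces only the first match in each string, so any later punctuation stays in the input (the splitter simply skips characters that are not part of a word). A minimal sketch of a cleanup step that strips every punctuation mark is shown here; clean_text is a hypothetical name, and feeding its output to the splitter may give results that differ from those listed above.

```r
# Hypothetical helper: lower-case the text and drop ALL punctuation marks.
clean_text <- function(x) {
  gsub("[[:punct:]]", "", tolower(x))
}

clean_text("Theskyisclear;thestarsaretwinkling.")
# "theskyisclearthestarsaretwinkling"

# For comparison, sub() removes only the first punctuation mark:
sub("[[:punct:]]", "", tolower("Theskyisclear;thestarsaretwinkling."))
# "theskyisclearthestarsaretwinkling."
```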