For example, I have a text file with the following content:
I wantto separate those wordswhich arejoined.
How can I separate the joined words in this text so that I get the following output?
I want to separate those words which are joined.
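Basically, the idea is to detect the words in the text that make no sense and turn them into meaningful ones.
For example, the code should detect that "wantto" does not make any sense and, after processing, return "want to" as output.
It may also return some other meaningful combinations of words, but that is fine.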
Answer 0 (score: 3)
If you have aspell installed (see ?aspell), this might give you a hint:
> writeLines("I wantto separate those wordswhich arejoined.", "/tmp/test.txt")
> sp <- aspell('/tmp/test.txt')
> sp
arejoined
/tmp/test.txt:1:36
wantto
/tmp/test.txt:1:3
wordswhich
/tmp/test.txt:1:25
> sp[[5]]
[[1]]
[1] "want to" "want-to" "want" "wanton" "Watt" "watt" "wand" "went" "wont" "whatnot" "wants" "canto"
[13] "panto" "Wanda" "waned" "won't" "want's" "wanted" "NATO" "vanity" "wander" "winter" "wart" "natty"
[25] "vaunt" "wan" "ant" "walnut" "wasn't" "Witt" "wait" "wane" "wino"
[[2]]
[1] "words which" "words-which" "wordsmith" "Wordsworth" "words" "Woodstock" "word's" "woodsier"
[9] "Woods" "wards" "woods" "ward's" "woad's" "wood's" "wort's"
[[3]]
[1] "are joined" "are-joined" "rejoined" "adjoined" "enjoined" "rejoinder" "regained"
In any case, a task like this is always dictionary-based.
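In the output above, the first suggestion for each flagged word happens to be the split form. One possible way to use that, as a rough sketch only: replace each flagged word with its first suggestion. The helper name apply_first_suggestion is hypothetical, and the example values are copied by hand from the output above (in document order) so the snippet runs without aspell; with a real result you would presumably pass sp$Original and sp$Suggestions instead.

```r
# Hypothetical helper: replace each flagged word with its first suggestion.
# `originals` is a character vector of flagged words and `suggestions` a list
# of character vectors, shaped like sp$Original and sp$Suggestions.
apply_first_suggestion <- function(text, originals, suggestions) {
  for (i in seq_along(originals)) {
    candidates <- suggestions[[i]]
    if (length(candidates) > 0) {
      # fixed = TRUE treats the flagged word as a literal string, not a regex
      text <- sub(originals[i], candidates[1], text, fixed = TRUE)
    }
  }
  text
}

# Values copied from the aspell output above, truncated to three suggestions each
originals <- c("wantto", "wordswhich", "arejoined")
suggestions <- list(
  c("want to", "want-to", "want"),
  c("words which", "words-which", "wordsmith"),
  c("are joined", "are-joined", "rejoined")
)

result <- apply_first_suggestion(
  "I wantto separate those wordswhich arejoined.", originals, suggestions
)
print(result)
```

Taking the first suggestion blindly is an assumption that works for this sentence; for a genuine typo the top suggestion may be a single corrected word rather than a split.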
Answer 1 (score: 2)
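I have attached some quick and dirty code that can help you correct words that are two words joined together, without using aspell. The dictionary I used is big.txt from Peter Norvig's website, which should cover the commonly used words. You can use the correctSentence function to see the results.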
## big.txt is taken from Peter Norvig's basic spell checker data file
words <- scan("http://norvig.com/big.txt", what = character())
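# Split a joined word into its longest dictionary prefix and the remainder.
# If no prefix is found in `words`, the word is returned unchanged.
split_matches <- function(word) {
  num_char <- nchar(word)
  return_str <- character()
  start_pos <- 0
  end_pos <- num_char
  # Try prefixes from longest to shortest; seq_len() is safe for empty input
  for (i in seq_len(num_char)) {
    str <- substr(word, 1, num_char - i + 1)
    if (str %in% words) {
      return_str <- str
      start_pos <- nchar(return_str)
      break
    }
  }
  # Append whatever follows the matched prefix (not checked against `words`)
  return_str <- c(return_str, substr(word, start_pos + 1, end_pos))
  return_str
}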
# Split a sentence on spaces and try to split every word not in `words`.
correctSentence <- function(sentence) {
  list_of_words <- strsplit(sentence, " ")[[1]]
  output_str <- character()
  for (word in list_of_words) {
    if (word %in% words) {
      # Known word: keep it as is
      output_str <- c(output_str, word)
    } else {
      # Unknown word: try to split it into two parts
      output_str <- c(output_str, split_matches(word))
    }
  }
  paste(output_str, collapse = " ")
}
# test this with your sentence
correctSentence("I wantto separate those wordswhich arejoined")
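The code above splits a token into at most two parts and never checks the remainder against the dictionary. As a rough sketch of how this could be extended to more than two joined words, a small dynamic program can look for the split with the fewest dictionary words. This is not part of the original answer: segment_word and toy_dictionary are hypothetical names, and the tiny dictionary stands in for the words vector only so the example runs offline.

```r
# Hypothetical helper: split `word` into the fewest dictionary words.
# Returns the pieces, or `word` unchanged when no full segmentation exists.
segment_word <- function(word, dictionary) {
  n <- nchar(word)
  if (n == 0) return(word)
  # best[k + 1]: fewest pieces that exactly cover the first k characters
  best <- c(0, rep(Inf, n))
  # prev[k + 1]: offset where the last piece of that segmentation starts
  prev <- rep(NA_integer_, n + 1)
  for (end_pos in seq_len(n)) {
    for (start_pos in 0:(end_pos - 1)) {
      piece <- substr(word, start_pos + 1, end_pos)
      if (is.finite(best[start_pos + 1]) && piece %in% dictionary &&
          best[start_pos + 1] + 1 < best[end_pos + 1]) {
        best[end_pos + 1] <- best[start_pos + 1] + 1
        prev[end_pos + 1] <- start_pos
      }
    }
  }
  if (!is.finite(best[n + 1])) return(word)
  # Walk the back-pointers from the end to recover the pieces in order
  pieces <- character()
  end_pos <- n
  while (end_pos > 0) {
    start_pos <- prev[end_pos + 1]
    pieces <- c(substr(word, start_pos + 1, end_pos), pieces)
    end_pos <- start_pos
  }
  pieces
}

# Toy dictionary (an assumption for illustration; use `words` in practice)
toy_dictionary <- c("a", "are", "join", "joined", "re", "to", "want",
                    "which", "word", "words")

print(segment_word("wantto", toy_dictionary))
print(segment_word("wordswhicharejoined", toy_dictionary))
print(segment_word("xyzzy", toy_dictionary))
```

Preferring the fewest pieces is only one possible heuristic; with a large dictionary such as big.txt, weighting candidates by word frequency would likely give more sensible splits.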