How can I separate joined words in a given text in R?

Asked: 2014-06-05 11:08:10

Tags: r text-mining

For example, I have a text file with the following content:

I wantto separate those wordswhich arejoined.

How can I separate the words in this text so that I get the following as output?

I want to separate those words which are joined.

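Basically, the goal is to detect meaningless words in the text and make them meaningful.

For example, the code should detect that "wantto" has no meaning and, after processing, return "want to" as output.

It might return some other meaningful combination of words, but that is fine.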

2 Answers:

Answer 0: (score: 3)

If you have aspell installed (see ?aspell), this may give you a hint:

> writeLines("I wantto separate those wordswhich arejoined.", "/tmp/test.txt")
> sp <- aspell('/tmp/test.txt')
> sp
arejoined
  /tmp/test.txt:1:36

wantto
  /tmp/test.txt:1:3

wordswhich
  /tmp/test.txt:1:25
> sp[[5]]
[[1]]
 [1] "want to" "want-to" "want"    "wanton"  "Watt"    "watt"    "wand"    "went"    "wont"    "whatnot" "wants"   "canto"  
[13] "panto"   "Wanda"   "waned"   "won't"   "want's"  "wanted"  "NATO"    "vanity"  "wander"  "winter"  "wart"    "natty"  
[25] "vaunt"   "wan"     "ant"     "walnut"  "wasn't"  "Witt"    "wait"    "wane"    "wino"   

[[2]]
 [1] "words which" "words-which" "wordsmith"   "Wordsworth"  "words"       "Woodstock"   "word's"      "woodsier"   
 [9] "Woods"       "wards"       "woods"       "ward's"      "woad's"      "wood's"      "wort's"     

[[3]]
[1] "are joined" "are-joined" "rejoined"   "adjoined"   "enjoined"   "rejoinder"  "regained"  

In any case, a task like this is always dictionary-based.
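The suggestions above can also be applied back to the text programmatically. The following is only a minimal sketch: apply_first_suggestion is a hypothetical helper (not part of aspell), and the two input vectors are typed in by hand from the output shown above so that the example runs without aspell installed.

```r
# Hypothetical helper: replace each flagged word with its first suggestion.
# `originals` is a character vector of flagged words and `suggestions` is a
# list of candidate replacements in the same order.
apply_first_suggestion <- function(text, originals, suggestions) {
  for (i in seq_along(originals)) {
    candidates <- suggestions[[i]]
    if (length(candidates) > 0) {
      # fixed = TRUE treats the flagged word literally, not as a regex
      text <- gsub(originals[i], candidates[1], text, fixed = TRUE)
    }
  }
  text
}

# Values copied by hand from the aspell output above (truncated lists)
originals <- c("wantto", "wordswhich", "arejoined")
suggestions <- list(
  c("want to", "want-to", "want"),
  c("words which", "words-which", "wordsmith"),
  c("are joined", "are-joined", "rejoined")
)

fixed <- apply_first_suggestion(
  "I wantto separate those wordswhich arejoined.",
  originals, suggestions
)
print(fixed)
```

With a real aspell result, the same inputs would presumably come from sp$Original and sp$Suggestions (assuming those column names). Taking the first suggestion blindly is a simplification, and a literal gsub would also rewrite a flagged word where it occurs inside a longer word.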

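Answer 1: (score: 2)

I've attached some quick-and-dirty code that may help you correct misspellings in which two words are joined, without using aspell. The dictionary I use is big.txt from Peter Norvig's website, which should be sufficient for common words. You can use the correctSentence function to see the results.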

## big.txt is taken from Peter Norvig's basic spell checker data file
## quote = "" stops scan() from treating quotation marks in the text as field delimiters
words <- scan("http://norvig.com/big.txt", what = character(), quote = "")

## Split a joined word at its longest prefix found in the dictionary.
## The remainder is returned as-is; it is not checked against the dictionary.
split_matches <- function(word) {
  num_char <- nchar(word)
  return_str <- character()
  start_pos <- 0
  for (i in seq_len(num_char)) {
    ## Try prefixes from longest to shortest
    str <- substr(word, 1, num_char - i + 1)
    if (str %in% words) {
      return_str <- str
      start_pos <- nchar(return_str)
      break
    }
  }
  ## Append whatever follows the matched prefix (the whole word if nothing matched)
  c(return_str, substr(word, start_pos + 1, num_char))
}

## Keep dictionary words unchanged and split everything else
correctSentence <- function(sentence) {
  list_of_words <- strsplit(sentence, " ")[[1]]
  output_str <- character()
  for (word in list_of_words) {
    if (word %in% words) {
      output_str <- c(output_str, word)
    } else {
      output_str <- c(output_str, split_matches(word))
    }
  }
  paste(output_str, collapse = " ")
}

## Test this with your sentence
correctSentence("I wantto separate those wordswhich arejoined")
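The prefix search above splits a token into at most two parts and never checks the remainder against the dictionary. One possible extension, sketched below, is a small dynamic-programming segmentation that accepts a split only when every piece is a dictionary word. This is a different technique from the answer's code. The names segment_word and segment_sentence and the toy dict vector are assumptions made for illustration; in practice the dictionary could be the words vector loaded from big.txt.

```r
# Toy dictionary, assumed for illustration only
dict <- c("i", "want", "to", "separate", "those",
          "words", "which", "are", "joined")

# Split `word` into dictionary words; return it unchanged if no full split exists
segment_word <- function(word, dict) {
  n <- nchar(word)
  if (n == 0) return(word)
  lower <- tolower(word)
  # prev_cut[j + 1] holds the start offset of the last piece in a valid split
  # of the first j characters; NA means no valid split was found
  prev_cut <- rep(NA_integer_, n + 1)
  prev_cut[1] <- 0L  # the empty prefix is trivially valid
  for (j in seq_len(n)) {
    for (i in 0:(j - 1)) {
      if (!is.na(prev_cut[i + 1]) && substr(lower, i + 1, j) %in% dict) {
        prev_cut[j + 1] <- i
        break
      }
    }
  }
  if (is.na(prev_cut[n + 1])) return(word)
  # Walk the cut points backwards to recover the pieces in order
  pieces <- character()
  j <- n
  while (j > 0) {
    i <- prev_cut[j + 1]
    pieces <- c(substr(word, i + 1, j), pieces)
    j <- i
  }
  pieces
}

# Apply the segmentation to every space-separated token of a sentence
segment_sentence <- function(sentence, dict) {
  tokens <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  paste(unlist(lapply(tokens, segment_word, dict = dict)), collapse = " ")
}

result <- segment_sentence("I wantto separate those wordswhich arejoined", dict)
print(result)
```

The sketch returns the first valid split it finds and does not rank alternatives. With a large dictionary that contains many short words, some kind of scoring (for example by word frequency) would probably be needed to choose between competing splits.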