From R

Time: 2016-05-30 13:11:10

Tags: r tm topic-modeling

I have a set of documents:

documents = c("She had toast for breakfast",
 "The coffee this morning was excellent", 
 "For lunch let's all have pancakes", 
 "Later in the day, there will be more talks", 
 "The talks on the first day were great", 
 "The second day should have good presentations too")

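I want to remove stopwords from this set of documents. I have already removed punctuation and converted the text to lowercase, using: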

documents = tolower(documents) #make it lower case
documents = gsub('[[:punct:]]', '', documents) #remove punctuation

First I convert it to a Corpus object:

documents <- Corpus(VectorSource(documents))

Then I try to remove the stopwords:

documents = tm_map(documents, removeWords, stopwords('english')) #remove stopwords

But the last line causes the following error:

THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC() to debug.

This has already been asked here, but no answer was given. What does this error mean?

Edit

Yes, I am using the tm package.

Here is the output of sessionInfo():

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

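3 Answers:

Answer 0: (score: 10)

When I run into problems with tm, I often end up just editing the raw text directly.

For removing words this is a bit awkward, but you can paste tm's stopword list together into a single regular expression.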

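library(tm)  # provides stopwords()
# documents must still be the plain character vector here, not a Corpus
stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
documents = stringr::str_replace_all(documents, stopwords_regex, '')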

> documents
[1] "     toast  breakfast"             " coffee  morning  excellent"      
[3] " lunch lets   pancakes"            "later   day  will   talks"        
[5] " talks   first day  great"         " second day   good presentations "
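
The output above still contains the gaps left behind by the removed words. As a minimal sketch of the same word-boundary regex idea in base R, with the leftover whitespace cleaned up afterwards: the names `stop_list` and `docs` and the shortened stopword list are assumptions made for illustration only; in practice the full list would come from `stopwords('en')`.

```r
# Hypothetical, shortened stopword list (illustration only);
# tm's stopwords('en') would supply the full list.
stop_list <- c("she", "had", "for", "the", "this", "was")

docs <- c("she had toast for breakfast",
          "the coffee this morning was excellent")

# Build one alternation anchored on word boundaries: \b(she|had|...)\b
pattern <- paste0("\\b(", paste(stop_list, collapse = "|"), ")\\b")

cleaned <- gsub(pattern, "", docs, perl = TRUE)  # drop the stopwords
cleaned <- gsub("\\s+", " ", cleaned)            # collapse runs of whitespace
cleaned <- trimws(cleaned)                       # trim leading/trailing space

print(cleaned)
```

Grouping the alternatives inside one `\b( ... )\b` is equivalent to joining the words with `\\b|\\b`; it just keeps the pattern a little easier to read.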

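Answer 1: (score: 0)

Maybe try using the tm_map function for all of the transformations, including lowercasing and punctuation removal. It seems to work in my case.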

> documents = c("She had toast for breakfast",
+  "The coffee this morning was excellent", 
+  "For lunch let's all have pancakes", 
+  "Later in the day, there will be more talks", 
+  "The talks on the first day were great", 
+  "The second day should have good presentations too")
> library(tm)
Loading required package: NLP
> documents <- Corpus(VectorSource(documents))
> documents = tm_map(documents, content_transformer(tolower))
> documents = tm_map(documents, removePunctuation)
> documents = tm_map(documents, removeWords, stopwords("english"))
> documents
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 6

This yields

> documents[[1]]$content
[1] "  toast  breakfast"
> documents[[2]]$content
[1] " coffee  morning  excellent"
> documents[[3]]$content
[1] " lunch lets   pancakes"
> documents[[4]]$content
[1] "later   day  will   talks"
> documents[[5]]$content
[1] " talks   first day  great"
> documents[[6]]$content
[1] " second day   good presentations "

Answer 2: (score: 0)

You can remove stopwords with the quanteda package, but first make sure your text has been split into tokens, then use the following:

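library(quanteda)
x <- tokens(documents)  # tokenize first; documents is the character vector from the question
x <- tokens_select(x, stopwords("en"), selection = "remove")  # drop tokens that match the stopword list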
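
To illustrate what token-level removal does, here is a rough base R sketch that does not use quanteda at all: `strsplit` stands in for tokenization and `%in%` for the selection step. The names `stop_list`, `docs`, `toks` and `kept`, as well as the shortened stopword list, are assumptions made for this example.

```r
# Hypothetical, shortened stopword list (illustration only).
stop_list <- c("she", "had", "for", "the", "this", "was")

docs <- c("she had toast for breakfast",
          "the coffee this morning was excellent")

# Split each document on whitespace: one character vector of tokens per document.
toks <- strsplit(docs, "\\s+")

# Keep only the tokens that are not in the stopword list.
kept <- lapply(toks, function(tk) tk[!tk %in% stop_list])

print(kept)
```

Because whole tokens are compared rather than substrings, no word-boundary regex is needed and no stray spaces are left behind.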