Error when creating a TermDocumentMatrix in the tm package

Time: 2016-08-02 11:37:20

Tags: r text-mining tm

I am new to the tm package and have hit a roadblock while trying to apply the TermDocumentMatrix function.

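I used the following code, up to the point where the TermDocumentMatrix function failed. On inspection, the list of documents seems to be as it should be (sample output follows the code):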

myCorpus <- Corpus(VectorSource(posts$message))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)

removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

myCorpus <- tm_map(myCorpus, removeURL)

myStopwords <- c(stopwords("english"))
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

myCorpusCopy <- myCorpus 
myCorpus <- tm_map(myCorpus, stemDocument)

However, the problem appears after creating the copy used for stem completion. Printing the first five documents gives:

> for(i in 1:5) {
+   cat(paste("[[", i, "]] ", sep =""))
+   writeLines(myCorpus[[i]])
+ }
[[1]] syntel recruitment drive   week  freshers  newregistrationlink    passout graduates
qualification   graduatebebtechmcamemtech
syntel registration link  
limited referrals available 
comment  emailids  reference  future job upd
[[2]] dont miss  opportunity   get placed  one   best mnc companies   world ebay freshers  week  january 
qualification   graduate can apply
ebay registration link  
comment  emailids fast beacuse    referrals left
[[3]] recent passouts      eligible  apply  wipro  go   updated link  lastday reference drive jan  apply link  fresher referral
apply link 
go   link  apply asap
[[4]] robertbosch recruitment drive   week  freshers  newregistrationlink    passout graduates
qualification   graduatebebtechmcamemtech
robertbosch registration link  
limited referrals available 
comment  emailids  reference  future job upd
[[5]] mega job openings   year
mphasis recruitment  freshers january 
qualification   btech bsc bca  graduates mca mba  mtech post graduates
mphasis registration link  
comment  emailids  comment box  reference  future job updates   emailbox    

Any ideas on how to resolve this?

2 answers:

Answer 0 (score: 1)

I think you need to call

myCorpus <- Corpus(VectorSource(myCorpus))

again before using TermDocumentMatrix, so the last part of your code would be:

myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))
tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))

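If no error occurs up to the stemming of the documents, the instruction above should solve your problem.

Answer 1 (score: 0)

Otherwise, you can first try: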

myCorpus <- tm_map(myCorpus, PlainTextDocument)

before using

myCorpus <- Corpus(VectorSource(myCorpus))
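
A possibly related point, offered as a sketch rather than a confirmed diagnosis: in tm 0.6 and later, passing a plain function such as removeURL directly to tm_map can replace each document with a bare character string, which may be why converting back with PlainTextDocument or rebuilding the corpus helps. Wrapping the custom function in content_transformer keeps the documents intact. The example below assumes tm is installed and uses a small made-up posts data frame in place of the real data; VCorpus is used here instead of Corpus as an assumption, so the behaviour does not depend on the tm version.

```r
library(tm)

# Hypothetical stand-in for posts$message from the question
posts <- data.frame(
  message = c("Syntel recruitment drive http://example.com apply now",
              "eBay registration link for freshers 2016"),
  stringsAsFactors = FALSE
)

myCorpus <- VCorpus(VectorSource(posts$message))

# Base R functions are wrapped so each element stays a text document
myCorpus <- tm_map(myCorpus, content_transformer(tolower))

# tm's own transformations can be passed directly
myCorpus <- tm_map(myCorpus, removePunctuation)

# Same pattern as in the question: strips "http" plus the alphanumerics
# that follow it (punctuation has already been removed at this point)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))

tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))
```

If this runs without error on your data, the extra Corpus(VectorSource(...)) step may not be needed for the removeURL part of the pipeline.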