R: TermDocumentMatrix - creation

Date: 2016-05-07 13:10:37

Tags: r term-document-matrix mclapply

I am trying to fetch Twitter data and create a wordcloud, but my code throws an error when creating the TermDocumentMatrix. My code is as follows:

twitter_search_data <- searchTwitter(searchString = text_to_search
                                    ,n = 500)

twitter_search_text <- sapply(twitter_search_data
                             ,function(x) x$getText())

twitter_search_corpus <- Corpus(VectorSource(twitter_search_text))

twitter_search_corpus <- tm_map(twitter_search_corpus, stripWhitespace, lazy = TRUE)

twitter_search_corpus <- tm_map(twitter_search_corpus, content_transformer(tolower), lazy = TRUE)

twitter_search_corpus <- tm_map(twitter_search_corpus, PlainTextDocument,lazy = TRUE)    

twitter_search_corpus <- tm_map(twitter_search_corpus, removePunctuation, lazy = TRUE)

twitter_search_corpus <- tm_map(twitter_search_corpus, removeNumbers, lazy = TRUE)

twitter_search_corpus <- tm_map(twitter_search_corpus, removeWords, c("the", "this", "The", "This", stopwords('english')), lazy = TRUE)

twitter_search_corpus <- tm_map(twitter_search_corpus, stemDocument, lazy = TRUE)

# Create Document Term Matrix 
tdm <- as.matrix(TermDocumentMatrix(twitter_search_corpus
                                   ,control=list(wordLengths=c(3,Inf))
                                   ))

No errors occur before the TermDocumentMatrix is created. The error I get is as follows:

  

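Warning in mclapply(x$content[i], function(d) tm_reduce(d, x$lazy$maps)) :
  scheduled core 1 encountered error in user code, all values of the job will be affected
Warning in mclapply(unname(content(x)), termFreq, control) :
  scheduled core 1 encountered error in user code, all values of the job will be affected
Warning: Error in UseMethod: no applicable method for 'meta' applied to an object of class "try-error"
Stack trace (innermost first):
      74: FUN
      73: lapply
      72: setNames
      71: as.list.VCorpus
      70: as.list
      69: lapply
      68: meta.VCorpus
      67: meta
      66: TermDocumentMatrix.VCorpus
      65: TermDocumentMatrix
      64: as.matrix
      63: observeEventHandler
       1: runApp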

I have already added lazy = TRUE and content_transformer(tolower), but the error persists.

1 answer:

Answer 0: (score: 0)

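The problem seems to be the placement of

twitter_search_corpus <- tm_map(twitter_search_corpus, stripWhitespace, lazy = TRUE)

Removing punctuation, numbers, and words leaves extra whitespace in the text. The whitespace-stripping line above must therefore be the last statement before the TermDocumentMatrix is created.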
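
For illustration, the reordering could look roughly like the sketch below. It is not the asker's exact pipeline. The sample_text vector is a made-up stand-in for the searchTwitter() results. As my own simplifications, not something the answer suggests, it uses VCorpus and leaves out lazy = TRUE, the PlainTextDocument step, and stemDocument.

```r
library(tm)

# Hypothetical stand-in for the tweet text returned by searchTwitter()
sample_text <- c("The 3 quick foxes, jumping over this fence!",
                 "This is   another tweet: 42 words about cats & dogs.")

corpus <- VCorpus(VectorSource(sample_text))

# Lower-case first so the lower-case stopword list matches every occurrence
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Strip whitespace last, so the gaps left by the removals above are collapsed
corpus <- tm_map(corpus, stripWhitespace)

# Build the term-document matrix, keeping terms of 3 or more characters
tdm <- as.matrix(TermDocumentMatrix(corpus,
                                    control = list(wordLengths = c(3, Inf))))
```

With stripWhitespace as the final transformation, no runs of consecutive spaces should remain in the documents when the matrix is built. Whether that alone resolves the mclapply error in the question is not confirmed here.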