I am trying to do text mining on Twitter following a tutorial. My code is:
library(twitteR)
library(NLP)
library(tm)
library(wordcloud)
library(RColorBrewer)
mh370 <- searchTwitter("#PrayForMH370", since = "2014-03-08", until = "2014-03-20", n = 1000)
mh370_text = sapply(mh370, function(x) x$getText())
mh370_corpus = Corpus(VectorSource(mh370_text))
tdm = TermDocumentMatrix(mh370_corpus, control = list(
  removePunctuation = TRUE,
  stopwords = c("prayformh370", "prayformh", stopwords("english")),
  removeNumbers = TRUE,
  tolower = TRUE))
m = as.matrix(tdm)
# get word counts in decreasing order
word_freqs = sort(rowSums(m), decreasing = TRUE)
# create a data frame with words and their frequencies
dm = data.frame(word = names(word_freqs), freq = word_freqs)
wordcloud(dm$word, dm$freq, random.order = FALSE, colors = brewer.pal(8, "Dark2"))
When I run that last line, I get this error:
Error in strwidth(words[i], cex = size[i], ...) : invalid 'cex' value
In addition: Warning messages:
1: In max(freq) : no non-missing arguments to max; returning -Inf
2: In max(freq) : no non-missing arguments to max; returning -Inf
Any suggestions?
Answer 0 (score: 0)
As Vikram said, perhaps you should reduce the number of words in your plot by adding max.words to your wordcloud call:
wordcloud(dm$word, dm$freq, scale = c(8, 3), min.freq = 2, max.words = 120,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
I would also suggest using min.freq to plot only words that appear at least twice, and scale to control the size of the words. Adjust these until you get a nice-looking plot.
Answer 1 (score: 0)
You may also want to use removeSparseTerms. I ran into a similar problem a while ago and found a solution. I had to modify that solution, but removing sparse terms worked. The tm package provides that function.
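For reference, a minimal sketch of that idea using tm's removeSparseTerms, assuming the tdm from the question above; the 0.98 sparsity threshold is an example value I chose, not one from the original answer:

```r
library(tm)

# Assume tdm is the TermDocumentMatrix built in the question.
# Drop terms missing from more than 98% of documents; tune the
# `sparse` threshold (here an arbitrary 0.98) for your own data.
tdm_small <- removeSparseTerms(tdm, sparse = 0.98)

# Rebuild the frequency table from the smaller matrix before plotting.
m <- as.matrix(tdm_small)
word_freqs <- sort(rowSums(m), decreasing = TRUE)
dm <- data.frame(word = names(word_freqs), freq = word_freqs)
```

A smaller matrix also makes as.matrix far cheaper, since a raw term-document matrix from tweets is mostly zeros.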