Question

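I've been trying to follow a Udemy tutorial on text mining tweets with the tm package in R.

So far, many of the functions specified in the tutorial (and in the tm PDF on cran.org) result in a series of errors, and I'm not clear on how to resolve them. I'm coding in RStudio version 1.0.143 on macOS Sierra. The code and errors below come from my attempt to build a word cloud from a set of tweets: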

nyttweets <- searchTwitter("#NYT", n=1000)
nytlist <- sapply(nyttweets, function(x) x$getText())
nytcorpus <- Corpus(VectorSource(nytlist))

This is where I hit the first error:

nytcorpus <- tm_map(nytcorpus, tolower)
**Warning message:
In mclapply(content(x), FUN, ...) :
all scheduled cores encountered errors in user code**

I've seen a suggestion to do the following, which results in a different error:

nytcorpus <- tm_map(nytcorpus, tolower, mc.cores=1)
**Error in FUN(X[[1L]], ...) : invalid multibyte string 1**

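If I use 'lazy = TRUE' with tolower, and with the other functions I run after it, I don't get an error. However, when I finally try to construct the word cloud, I get a whole series of errors: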

library("twitteR")
library('wordcloud')
library('SnowballC')
library('tm')
nytcorpus <- tm_map(nytcorpus, tolower, lazy=TRUE)
nytcorpus <- tm_map(nytcorpus, removePunctuation, lazy=TRUE)
nytcorpus <- tm_map(nytcorpus, function(x) removeWords(x, stopwords()), 
lazy=TRUE)
nytcorpus <- tm_map(nytcorpus, PlainTextDocument)
wordcloud(nytcorpus, min.freq=4, scale=c(5,1), random.color=F, max.word=45, 
random.order=F)
**Warning messages:
1: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F,  :
'removewords' could not be fit on page. It will not be plotted.
2: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F,  :
"try-error" could not be fit on page. It will not be plotted.
3: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F,  :
applicable could not be fit on page. It will not be plotted.
4: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F,  :
object could not be fit on page. It will not be plotted.
5: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F,  :
usemethod("removewords", could not be fit on page. It will not be plotted.**

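I'm not sure why wordcloud is trying to plot function names and error text, such as 'removewords' or 'try-error', rather than words from the NYT tweets. I've seen suggestions to wrap the function in content_transformer, for example: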

nytcorpus <- tm_map(nytcorpus, content_transformer(tolower))

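However, I again get the error 'all scheduled cores encountered errors in user code'.

This is all very frustrating, and I'm not sure whether I should drop the tm package altogether, especially if there's something better out there. Any advice is much appreciated.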

Answer 1

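tm has recently been trying to improve its speed, and appears to be going through a major overhaul involving Rcpp, which the package was not originally built on. Perhaps the tutorial you're following is based on an older version of tm, which may be one reason you're running into problems.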
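For what it's worth, here is a minimal sketch of how the same cleaning steps tend to be written against more recent tm releases. It is not taken from the tutorial. The `tweets` vector and the `clean_tweets` helper are hypothetical stand-ins, and it rests on the assumption (not confirmed anywhere in the question) that the "invalid multibyte string" error comes from bytes in the tweets that are not valid in the session's encoding, so they are dropped before tm sees the text.

```r
# Hypothetical sample standing in for `nytlist`; the second element carries
# a stray Latin-1 byte (\xe9), which is not valid UTF-8.
tweets <- c("Breaking: NYT reports on the new budget!",
            "Caf\xe9 culture in #NYT today",
            "Read the latest from the NYT newsroom")

# Assumption: dropping every byte that cannot be converted to ASCII is
# acceptable for a word cloud. sub = "" removes those bytes instead of
# returning NA for the whole string.
clean_tweets <- function(x) {
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}

cleaned <- clean_tweets(tweets)

# The tm steps only run if the package is installed.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  corpus <- VCorpus(VectorSource(cleaned))
  # Base functions such as tolower() are wrapped in content_transformer();
  # tm's own transformations can be passed to tm_map() directly.
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  tdm <- TermDocumentMatrix(corpus)
  print(dim(tdm))
}
```

If the non-ASCII characters matter for your analysis, this conversion is too aggressive and a different encoding fix would be needed.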

I would try quanteda.

http://quanteda.io/

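The main reason is that it is orders of magnitude faster (although, as noted above, that may have changed recently). quanteda is built on stringi and data.table, both of which are heavily optimized in C++ and C. In effect, quanteda draws on some of the fastest R code available to date. In my experience it is also more stable, which comes down to the maturity of the packages it depends on.

As you will soon find out, speed really matters when building and analysing document-term matrices, especially if you create n-grams of various lengths. So it pays to use the fastest package you can find.

Justin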
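As a rough sketch of what that looks like, assuming a reasonably recent quanteda release (function names have changed between versions, so check them against the one you install): the `tweets` vector is a hypothetical stand-in for the text pulled from Twitter, and the block builds a document-feature matrix of unigrams and bigrams.

```r
# Hypothetical sample standing in for the tweet text.
tweets <- c("Breaking: NYT reports on the new budget!",
            "Opinion: what the new budget means for cities",
            "Read the latest from the NYT newsroom")

# The quanteda steps only run if the package is installed.
if (requireNamespace("quanteda", quietly = TRUE)) {
  library(quanteda)
  corp <- corpus(tweets)
  # Tokenise, dropping punctuation, then lowercase and remove stopwords.
  toks <- tokens(corp, remove_punct = TRUE)
  toks <- tokens_tolower(toks)
  toks <- tokens_remove(toks, stopwords("en"))
  # Unigrams and bigrams together; wider ranges of n grow the matrix quickly.
  toks_ng <- tokens_ngrams(toks, n = 1:2)
  dfmat <- dfm(toks_ng)
  print(topfeatures(dfmat, 10))
}
```

In recent versions the word cloud plot lives in the companion package quanteda.textplots (`textplot_wordcloud()`), as far as I know, rather than in quanteda itself.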
