I'm new to R and have a large dataset with 81k observations. If I try:
titles_matrix <- as.matrix(titulos_tdm)
I get
Error: cannot allocate vector of size 40.8 Gb
I also tried using big.matrix and got a similar error. If I try directly from the corpus:
wordcloud(clean, max.words = 200, random.color = TRUE, random.order = FALSE)
I get
Error in plot.new() : figure margins too large
What is the best way to build a wordcloud from a large amount of data?
I also tried adding gc(), but it didn't help.
Full code:
# libraries
library(tm)
library(wordcloud)

source <- VectorSource(perfil.df.publicacoes$titulo)
# Make a volatile corpus
corpus <- VCorpus(source)
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  corpus <- tm_map(corpus, removeWords, stopwords("portuguese"))
  return(corpus)
}
clean <- clean_corpus(corpus)
# Convert TDM to matrix
titulos_tdm <- TermDocumentMatrix(clean)
titulos_m <- as.matrix(titulos_tdm)
# Sum rows and frequency data frame
titulos_term_freq <- rowSums(titulos_m)
titulos_word_freqs <- data.frame(
  term = names(titulos_term_freq),
  num = titulos_term_freq
)
wordcloud(titulos_word_freqs$term, titulos_word_freqs$num, max.words = 50, colors = "red")
As I said, the code crashes at titulos_m <- as.matrix(titulos_tdm).
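One workaround I'm considering, but haven't verified on my data: tm stores a TermDocumentMatrix as a sparse slam::simple_triplet_matrix, so the row sums can be computed directly with slam::row_sums without ever allocating the 40.8 Gb dense matrix. The object names below mirror my code above.

```r
library(tm)
library(slam)
library(wordcloud)

# Term frequencies computed on the sparse TDM itself --
# as.matrix() is never called, so no dense allocation happens.
titulos_term_freq <- slam::row_sums(titulos_tdm)

# Keep only the most frequent terms before plotting,
# which should also sidestep the "figure margins too large" error
# caused by trying to draw too many words.
titulos_term_freq <- sort(titulos_term_freq, decreasing = TRUE)[1:200]

wordcloud(names(titulos_term_freq), titulos_term_freq,
          max.words = 50, colors = "red")
```

Would this be the right approach, or is there a better idiom for word clouds at this scale?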