R - Create a wordcloud from the most used categories

Date: 2017-12-11 11:55:12

Tags: r text-mining word-cloud

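I am trying to create a word cloud from the category tags that are used most often in a set of videos.

Everything runs fine, but when the document matrix is created, some categories are split into individual words. The affected categories are the ones that use the "&" symbol between words.

(For example: River & Lake, Sea & Islands, Beach & Cliffs, ...)

How can I keep these words together and build the word cloud correctly?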

library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

#load the text data into docs variable
docs <- Corpus(VectorSource(textos))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))

#Text Mining. 
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, stripWhitespace)

Screenshot: output of inspect(docs) showing the words.

# The term-document matrix is a table containing the frequency of the words.
# With TermDocumentMatrix(), row names are words and column names are documents.
# The function TermDocumentMatrix() from the tm package can be used as follows:

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)

Screenshot: after applying TermDocumentMatrix(), the categories containing the "&" symbol are split into individual words.

#plot the wordcloud

wordcloud(words = d$word, freq = d$freq, scale = c(3,.4), min.freq = 1,
          max.words=Inf, random.order=FALSE, rot.per=0.15, 
          colors=brewer.pal(6, "Dark2"))

Screenshot: the resulting word cloud showing the most used categories.

1 Answer:

Answer 0 (score: 0)

Your first screenshot shows that you can create a vector of terms like this:

docs = c("A & B", "A & B", "C", "C", "C", NA, "A & B", "A & B", "A & B", NA)

where each term still contains the &.

You can then skip the step that splits the terms on & and run this instead:

library(dplyr)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)

df_docs_counts = data.frame(docs, stringsAsFactors = FALSE) %>%  # create a data frame of terms
      na.omit() %>%                                              # exclude NAs
      count(docs, sort = TRUE)                                   # count occurrences of each term

wordcloud(df_docs_counts$docs, df_docs_counts$n)
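
If the raw strings hold several categories each, the same idea can be sketched in base R without dplyr: split only on the real delimiter, then count whole categories. This is a minimal sketch, not a definitive implementation. The sample `textos` vector and the "|" delimiter are assumptions for illustration (the question does not show the raw data), and `categories` and `counts` are hypothetical names.

```r
# Hypothetical sample input: each element lists one video's categories,
# separated by "|" (an assumed delimiter; adjust it to the real data).
textos <- c("River & Lake|Beach & Cliffs",
            "Sea & Islands|River & Lake",
            "River & Lake")

# Split on the delimiter only, so "River & Lake" stays a single term.
categories <- trimws(unlist(strsplit(textos, "|", fixed = TRUE)))

# Count each whole category; table() ignores NA values by default.
counts <- sort(table(categories), decreasing = TRUE)

# Same word/freq layout as the data frame `d` in the question.
d <- data.frame(word = names(counts),
                freq = as.vector(counts),
                stringsAsFactors = FALSE)

# Plot only if the wordcloud package is installed.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = d$word, freq = d$freq, min.freq = 1,
                       random.order = FALSE)
}
```

Because the categories never pass through TermDocumentMatrix(), nothing tokenizes them on whitespace, so multi-word categories reach wordcloud() intact.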