如何使用R

时间:2018-09-04 10:15:40

标签: r text sentiment-analysis word-cloud tidytext

我正在使用R进行情感分析,我想知道如何将wordcloud分为两部分,突出显示肯定和否定词。我对R很陌生,在线解决方案对我没有帮助。那是代码:

text <- readLines("product1.txt")

library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

docs <- Corpus(VectorSource(text))

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")

docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("don", "s", "t")) 
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

这是我想要实现的结果:

enter image description here

谢谢大家的帮助。

编辑:

docs <- structure(list(content = c("This product so far has not disappointed. My children love to use it and I like the ability to monitor control what content they see with ease.", 
"Great for beginner or experienced person. Bought as a gift and she loves it.", 
"Inexpensive tablet for him to use and learn on, step up from the NABI. He was thrilled with it, learn how to Skype on it already.", 
"I have had my Fire HD 8 two weeks now and I love it. This tablet is a great value.We are Prime Members and that is where this tablet SHINES. I love being able to easily access all of the Prime content as well as movies you can download and watch laterThis has a 1280/800 screen which has some really nice look to it its nice and crisp and very bright infact it is brighter then the ipad pro costing $900 base model. The build on this fire is INSANELY AWESOME running at only 7.7mm thick and the smooth glossy feel on the back it is really amazing to hold its like the futuristic tab in ur hands."
), meta = structure(list(language = "en"), class = "CorpusMeta"), 
    dmeta = structure(list(), .Names = character(0), row.names = c(NA, 
    6L), class = "data.frame")), class = c("SimpleCorpus", "Corpus"
))

2 个答案:

答案 0 :(得分:2)

the tutorial所示,要获得这样的结果,您应该有一个词典,即一个提供给您的单词是肯定还是否定的“字典”。有了这些信息,您就可以使用它为wordcloud着色。
我们可以在链接中评论这个美丽的例子:

library(janeaustenr)
library(dplyr)
library(stringr)

# here we tidy up the corpus, all the J.Austen books, having them cleaned and as result, a tibble with words.
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

library(wordcloud)
library(reshape2)

如上所述,您需要一个词典。该链接讨论了各种词典,在这种情况下,它使用的是bing

get_sentiments("bing")
# A tibble: 6,788 x 2
   word        sentiment
   <chr>       <chr>    
 1 2-faced     negative 
 2 2-faces     negative 
 3 a+          positive 
 4 abnormal    negative 
 5 abolish     negative 
 6 abominable  negative 
 7 abominably  negative 
 8 abominate   negative 
 9 abomination negative 
10 abort       negative 
# ... with 6,778 more rows

现在,将tidy_books(语料库)和bing(词典)的每个词连接起来,我们可以给每个词一个正值或负值:

library(wordcloud)
library(reshape2)

 tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

您将获得所需的输出。显然,您必须将其弯曲到我没有的数据上。

enter image description here

编辑

结合您的情况,我们可以这样做:

# take all the phrases
docs1 <-tibble(phrases =docs$content)

# add an id, from 1 to n
docs1$ID <- row.names(docs1)

# split all the words
tidy_docs <- docs1 %>% unnest_tokens(word, phrases)

#create now the cloud: a pair of warnings, because you do not have negative words and it is joining by word(correct)
tidy_docs %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

enter image description here

答案 1 :(得分:0)

考虑这种方法。

library(flipTextAnalysis)
text.to.analyze <- input.phrases

# Converting the text to a vector
text.to.analyze <- as.character(text.to.analyze)

# Extracting the words from the text
library(flipTextAnalysis)
options <- GetTextAnalysisOptions(phrases = '', 
                                 extra.stopwords.text = 'amp',
                                 replacements.text = '',
                                 do.stem = TRUE,
                                 do.spell = TRUE)
text.analysis.setup <- InitializeWordBag(text.to.analyze, min.frequency = 5.0, operations = options$operations, manual.replacements = options$replacement.matrix, stoplist = options$stopwords, alphabetical.sort = FALSE, phrases = options$phrases, print.type = switch("Word Frequencies", "Word Frequencies" = "frequencies", "Transformed Text" = "transformations")) 

# Sentiment analysis of the phrases 
phrase.sentiment = SaveNetSentimentScores(text.to.analyze, check.simple.suffixes = TRUE, blanks.as.missing = TRUE) 
phrase.sentiment[phrase.sentiment >= 1] = 1
phrase.sentiment[phrase.sentiment <= -1] = -1

# Sentiment analysis of the words
td <- as.matrix(AsTermMatrix(text.analysis.setup, min.frequency = 1.0, sparse = TRUE))
counts <- text.analysis.setup$final.counts 
phrase.word.sentiment <- sweep(td, 1, phrase.sentiment, "*")
phrase.word.sentiment[td == 0] <- NA # Setting missing values to Missing
word.mean <- apply(phrase.word.sentiment,2, FUN = mean, na.rm = TRUE)
word.sd <- apply(phrase.word.sentiment,2, FUN = sd, na.rm = TRUE)
word.n <- apply(!is.na(phrase.word.sentiment),2, FUN = sum, na.rm = TRUE)
word.se <- word.sd / sqrt(word.n)
word.z <- word.mean / word.se
word.z[word.n <= 3 || is.na(word.se)] <- 0        
words <- text.analysis.setup$final.tokens
x <- data.frame(word = words, 
      freq = counts, 
      "Sentiment" = word.mean,
      "Z-Score" = word.z,
      Length = nchar(words))
word.data <- x[order(counts, decreasing = TRUE), ]

# Working out the colors
n = nrow(word.data)
colors = rep("grey", n)
colors[word.data$Z.Score < -1.96] = "Red"
colors[word.data$Z.Score > 1.96] =  "Green"

# Creating the word cloud
library(wordcloud2)
wordcloud2(data = word.data[, -3], color = colors, size = 0.4)

enter image description here

我真的不喜欢特朗普,但这很好地说明了这一点。

另外,请参见下面的两个链接,以获取有关如何处理此类问题的其他想法。

http://rstudio-pubs-static.s3.amazonaws.com/71296_3f3ee76e8ef34410a1635926f740c473.html

https://www.analyticsvidhya.com/blog/2017/03/measuring-audience-sentiments-about-movies-using-twitter-and-text-analytics/