I am referring to a previously asked question: I want to run a sentiment analysis on German tweets and have been using the code from the stackoverflow thread I mentioned. However, I would like the analysis to produce actual sentiment scores, not just a sum of TRUE/FALSE values indicating whether a word is positive or negative. Any ideas on how to achieve this easily?
You can also find the word lists in the previous thread.

library(plyr)
library(stringr)

# SentiWS lines look like "Wort|POS<TAB>Wert<TAB>Flexion1,Flexion2,...";
# this helper strips the POS tag and the sentiment value, leaving only
# a flat, lower-cased vector of word forms.
readAndflattenSentiWS <- function(filename) {
  words <- readLines(filename, encoding = "UTF-8")
  words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words)
  words <- unlist(strsplit(words, ","))
  words <- tolower(words)
  return(words)
}

pos.words <- c(scan("Post3/positive-words.txt", what = 'character',
                    comment.char = ';', quiet = TRUE),
               readAndflattenSentiWS("Post3/SentiWS_v1.8c_Positive.txt"))
neg.words <- c(scan("Post3/negative-words.txt", what = 'character',
                    comment.char = ';', quiet = TRUE),
               readAndflattenSentiWS("Post3/SentiWS_v1.8c_Negative.txt"))
score.sentiment = function(sentences, pos.words, neg.words, .progress = 'none') {
  require(plyr)
  require(stringr)
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # match() returns the position of the matched term or NA
    # I don't just want a TRUE/FALSE! How can I do this?
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress = .progress)
  scores.df = data.frame(score = scores, text = sentences)
  return(scores.df)
}
sample <- c("ich liebe dich. du bist wunderbar",
            "Ich hasse dich, geh sterben!",
            "i love you. you are wonderful.",
            "i hate you, die.")
(test.sample <- score.sentiment(sample, pos.words, neg.words))
Answer 0 (score: 1)
"Any ideas on how to achieve this easily?"
Well, yes. I do the same thing with a lot of tweets. If you are seriously into sentiment analysis, you should have a look at the Text Mining (tm) package.
You will see that working with a document-term matrix makes life much easier. A word of warning, though: having read several journal articles, bag-of-words approaches typically classify only about 60% of sentiments correctly. If you are really interested in doing high-quality research, you should dig into Peter Norvig's excellent "Artificial Intelligence: A Modern Approach".
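For illustration, here is a minimal sketch of building a document-term matrix with tm; the corpus construction from a plain character vector and the preprocessing steps are assumptions, so adapt them to your data:

library(tm)

tweets <- c("ich liebe dich. du bist wunderbar",
            "Ich hasse dich, geh sterben!")
corpus <- VCorpus(VectorSource(tweets))           # one document per tweet
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
dtm <- DocumentTermMatrix(corpus)                 # rows = tweets, cols = terms
inspect(dtm)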
So this is definitely not a quick fix for sentiment scoring. I was at a similar point two months ago, however.
"However, I want an analysis that yields actual sentiment scores"
Having been there myself, you can transform your SentiWS files into a nice csv file like this one (here for the negative words):
NegBegriff NegWert
Abbau -0.058
Abbaus -0.058
Abbaues -0.058
Abbauen -0.058
Abbaue -0.058
Abbruch -0.0048
...
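A minimal import sketch, assuming the file was saved tab-separated with the header shown above (the file names here are made up; adjust `sep` to however you exported the data):

neg.words <- read.table("SentiWS_negative.csv", header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE, fileEncoding = "UTF-8")
pos.words <- read.table("SentiWS_positive.csv", header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE, fileEncoding = "UTF-8")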
With both word lists imported as nice data.frames like that, I used this code for the scoring:
## collect one sentiment score per tweet
tweets.list.sentiment <- c()

### for all your words in each tweet in a row
for (n in 1:length(words)) {
  ## get the position of the match in your SentiWS data.frames
  tweets.neg.position <- match(unlist(words[n]), neg.words$NegBegriff)
  tweets.pos.position <- match(unlist(words[n]), pos.words$PosBegriff)
  ## now use the positions to find the matching values and sum 'em up
  score.pos <- sum(pos.words$PosWert[tweets.pos.position], na.rm = TRUE)
  score.neg <- sum(neg.words$NegWert[tweets.neg.position], na.rm = TRUE)
  score <- score.pos + score.neg   # NegWert values are already negative
  ## now we have the sentiment for one tweet, push it to the list
  tweets.list.sentiment <- append(tweets.list.sentiment, score)
  ## and go again.
}
## look how beautiful!
summary(tweets.list.sentiment)

### caveat: this code is pretty ugly and not at all good use of R,
### but it works sufficiently well. I am reusing the approach from above,
### so I did not need to rewrite the rest. Up to you ;-)
Well, I hope it works. (It did for my example.)
The trick is getting SentiWS into a nice form, which you can do with Excel macros, GNU Emacs, sed, or whatever text manipulation you see fit; a few lines of R also work, as sketched below.
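A minimal sketch of that transformation in R, assuming the SentiWS line format "Wort|POS<TAB>Wert<TAB>Flexionen"; for brevity it keeps only the main word per line and drops the inflected forms, which the table above lists as separate rows:

lines  <- readLines("Post3/SentiWS_v1.8c_Negative.txt", encoding = "UTF-8")
words  <- sub("\\|.*$", "", lines)                                # main word
values <- as.numeric(sub("^[^\t]+\t([0-9.-]+).*$", "\\1", lines)) # its value
write.table(data.frame(NegBegriff = words, NegWert = values),
            "SentiWS_negative.csv", sep = "\t",
            row.names = FALSE, quote = FALSE, fileEncoding = "UTF-8")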
Answer 1 (score: 0)
As a starting point, this line:
words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words)
says "throw away the POS information and the sentiment value (leaving just your list of words)".
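To make that concrete, here is what the substitution does to one (assumed) SentiWS line:

x <- "Abbau|NN\t-0.058\tAbbaus,Abbaues,Abbauen,Abbaue"
sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", x)
# [1] "Abbau,Abbaus,Abbaues,Abbauen,Abbaue"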
To do what you want, you need to parse the data differently, and you need a different data structure. readAndflattenSentiWS is currently returning a vector, but you want it to return a lookup table from strings to numbers: using an env object feels about right, although if I also wanted the POS information then a data.frame starts to feel right.
After that, most of your main loop can stay roughly the same, but instead of just counting the positive and negative matches, you need to collect the matched values and sum them up.
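A minimal sketch of that idea, again assuming SentiWS lines look like "Wort|POS<TAB>Wert<TAB>Flexion1,Flexion2,..."; it uses a named numeric vector as the lookup table instead of an env, for brevity:

readSentiWS <- function(filename) {
  lines <- readLines(filename, encoding = "UTF-8")
  main  <- sub("\\|[A-Z]+\t.*$", "", lines)                         # main word
  value <- as.numeric(sub("^[^\t]+\t([0-9.-]+).*$", "\\1", lines))  # its score
  flex  <- sub("^[^\t]+\t[0-9.-]+\t?", "", lines)                   # inflections
  words  <- character(0)
  values <- numeric(0)
  for (i in seq_along(lines)) {
    # inflected forms inherit the main word's sentiment value
    forms  <- c(main[i], if (nzchar(flex[i])) unlist(strsplit(flex[i], ",")))
    words  <- c(words, tolower(forms))
    values <- c(values, rep(value[i], length(forms)))
  }
  setNames(values, words)   # lookup table: word -> sentiment value
}

senti <- c(readSentiWS("Post3/SentiWS_v1.8c_Positive.txt"),
           readSentiWS("Post3/SentiWS_v1.8c_Negative.txt"))

# inside score.sentiment(), replace the TRUE/FALSE counting with:
#   score = sum(senti[words], na.rm = TRUE)
# unmatched words index as NA and are dropped by na.rm.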