Twitter sentiment analysis for German with SentiWS and scores

Date: 2014-05-15 11:03:21

Tags: r sentiment-analysis

I am referring to a previously asked question: I want to do a sentiment analysis of German tweets and have been using the code from the Stack Overflow thread mentioned there. However, I would like an analysis that yields actual sentiment scores, not just a sum of TRUE/FALSE values indicating whether a word is positive or negative. Any ideas how to achieve this easily?

You can also find the word lists in the previous thread.

library(plyr)
library(stringr)

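# read a SentiWS file and flatten it into a plain vector of lower-case word
# forms, dropping the part-of-speech tags and sentiment weights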
readAndflattenSentiWS <- function(filename) { 
  words = readLines(filename, encoding="UTF-8")
  words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words)
  words <- unlist(strsplit(words, ","))
  words <- tolower(words)
  return(words)
}
pos.words <- c(scan("Post3/positive-words.txt",what='character', comment.char=';', quiet=T), 
               readAndflattenSentiWS("Post3/SentiWS_v1.8c_Positive.txt"))
neg.words <- c(scan("Post3/negative-words.txt",what='character', comment.char=';', quiet=T), 
               readAndflattenSentiWS("Post3/SentiWS_v1.8c_Negative.txt"))

score.sentiment = function(sentences, pos.words, neg.words, .progress='none') {
  require(plyr)
  require(stringr)
  scores = laply(sentences, function(sentence, pos.words, neg.words) 
  {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # match() returns the position of the matched term or NA
    # I don't just want a TRUE/FALSE! How can I do this?
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, 
  pos.words, neg.words, .progress=.progress )
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}

sample <- c("ich liebe dich. du bist wunderbar",
            "Ich hasse dich, geh sterben!", 
            "i love you. you are wonderful.",
            "i hate you, die.")
(test.sample <- score.sentiment(sample, 
                                pos.words, 
                                neg.words))

2 Answers:

Answer 1 (score: 1)

"Any ideas how to achieve this easily?"

Well, yes. I do the same thing with a lot of tweets. If you are serious about sentiment analysis, you should have a look at the Text Mining (tm) package.

You will see that working with a document-term matrix makes life much easier. However, I have to warn you: from the journal articles I have read, bag-of-words approaches typically classify only about 60% of sentiment correctly. If you are really interested in doing high-quality research, you should dig into Peter Norvig's excellent "Artificial Intelligence: A Modern Approach".

...so this is definitely not a quick fix for your sentiment problem. However, I was at a similar point two months ago.
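
(Not part of the original answer: a minimal sketch of what building a document-term matrix with tm looks like, reusing the sample vector from the question; the preprocessing steps are illustrative.)

library(tm)
## build a corpus from the tweet texts (here: the "sample" vector from the question)
corpus <- VCorpus(VectorSource(sample))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
## rows = documents (tweets), columns = terms, cells = term counts
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)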

"However, I would like an analysis that yields actual sentiment scores"

Since I have been there myself: you can reshape your SentiWS into a nice CSV file like this (for the negative words):

NegBegriff  NegWert
Abbau   -0.058
Abbaus  -0.058
Abbaues -0.058
Abbauen -0.058
Abbaue  -0.058
Abbruch -0.0048
...

You can then import it into R as a nice data.frame. I used this code:

### for all your words in each tweet in a row
for (n in 1:length(words)) {

  ## get the position of the match /in your sentiWS-file/
  tweets.neg.position <- match(unlist(words[n]), neg.words$NegBegriff)
  tweets.pos.position <- match(unlist(words[n]), pos.words$PosBegriff)

  ## now use the positions, to find the matching values and sum 'em up
  score.pos <- sum(pos.words$PosWert[tweets.pos.position], na.rm = T) 
  score.neg <- sum(neg.words$NegWert[tweets.neg.position], na.rm = T)
  score <- score.pos + score.neg

  ## now we have the sentiment for one tweet, push it to the list
  tweets.list.sentiment <- append(tweets.list.sentiment, score)
  ## and go again.
}

## look how beautiful!
summary(tweets.list.sentiment)

### caveat: This code is pretty ugly and not at all good use of R, 
### however it works sufficiently.  I am using approach from above, 
### thus I did not need to rewrite the latter.  Up to you ;- )
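
The loop above assumes that words already holds one vector of tokens per tweet and that tweets.list.sentiment already exists; a minimal setup sketch (my assumption, reusing the sample vector from the question):

## assumed setup for the loop above (illustrative, not from the original answer)
tweets <- tolower(gsub("[[:punct:]]", "", sample))  # crude cleaning of the tweet texts
words <- strsplit(tweets, "\\s+")                   # one vector of tokens per tweet
tweets.list.sentiment <- numeric(0)                 # collects one score per tweet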

Well, I hope it works. (It did for my example.)

The trick is getting SentiWS into a nice shape; that can be done with an Excel macro, GNU Emacs, sed, or whatever other text manipulation tool you see fit.
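
Not part of the original answer: a sketch of doing that reshaping directly in R instead, assuming the standard SentiWS_v1.8c line format "Baseform|POS<TAB>weight<TAB>inflection,inflection,..."; the column names follow the negative-word table shown above (use PosBegriff/PosWert for the positive file).

## Sketch: parse a raw SentiWS file into the two-column shape shown above.
readSentiWS <- function(filename) {
  lines <- readLines(filename, encoding = "UTF-8")
  fields <- strsplit(lines, "\t")
  do.call(rbind, lapply(fields, function(f) {
    baseform <- sub("\\|[A-Z]+$", "", f[1])   # drop the POS tag
    weight <- as.numeric(f[2])                # sentiment weight
    # inflected forms are optional and comma-separated
    forms <- if (length(f) >= 3) unlist(strsplit(f[3], ",")) else character(0)
    data.frame(NegBegriff = c(baseform, forms), NegWert = weight,
               stringsAsFactors = FALSE)
  }))
}

## hypothetical usage:
## neg.words <- readSentiWS("Post3/SentiWS_v1.8c_Negative.txt")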

Answer 2 (score: 0)

As a starting point, this line:

words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words)

says "throw away the POS information and the sentiment values" (leaving you with just your list of words).

To do what you want, you need to parse the data differently, and you need a different data structure. readAndflattenSentiWS currently returns a vector, but you need it to return a lookup table from string to number: an env object feels like a natural fit, although if you also want the POS information then a data.frame starts to feel right.
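
For illustration only, a minimal sketch of the env-based lookup table the answer alludes to; the function names are hypothetical:

## Sketch: build an environment that acts as a string -> score lookup table
## from two parallel vectors (illustrative, not from the original answer).
make.lookup <- function(terms, scores) {
  env <- new.env(hash = TRUE, size = length(terms))
  for (i in seq_along(terms)) assign(terms[i], scores[i], envir = env)
  env
}

## return the score for a word, or 0 if the word is not in the table
lookup.score <- function(word, env) {
  if (exists(word, envir = env, inherits = FALSE)) get(word, envir = env) else 0
}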

After that, most of your main loop can stay roughly the same, but you will need to collect the values and sum them up, rather than just counting the number of positive and negative matches.
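
For illustration (my assumption, not the answer's exact design): if pos.words and neg.words become data.frames with word and score columns, the inner part of score.sentiment could change roughly like this:

## Sketch: replace the TRUE/FALSE counting with a weighted sum (assumes
## columns "word" and "score"; SentiWS negative weights are already negative,
## so adding the two partial sums gives the overall score).
pos.matches <- match(words, pos.words$word)
neg.matches <- match(words, neg.words$word)
score <- sum(pos.words$score[pos.matches], na.rm = TRUE) +
         sum(neg.words$score[neg.matches], na.rm = TRUE)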