I am referring to a previously asked question: I want to run a sentiment analysis on German tweets and have been using the code from the stackoverflow thread I mentioned. However, I would like the analysis to produce actual sentiment scores, not just a sum of TRUE/FALSE values indicating whether a word is positive or negative. Any ideas on how to achieve this easily?
You can also find the word lists in the previous thread.

library(plyr)
library(stringr)

# SentiWS lines look like "Wort|POS<TAB>Wert<TAB>Flexion1,Flexion2,...";
# this helper strips the POS tag and the sentiment value, leaving only
# a flat, lower-cased vector of word forms.
readAndflattenSentiWS <- function(filename) {
  words <- readLines(filename, encoding = "UTF-8")
  words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words)
  words <- unlist(strsplit(words, ","))
  words <- tolower(words)
  return(words)
}

pos.words <- c(scan("Post3/positive-words.txt", what = 'character',
                    comment.char = ';', quiet = TRUE),
               readAndflattenSentiWS("Post3/SentiWS_v1.8c_Positive.txt"))
neg.words <- c(scan("Post3/negative-words.txt", what = 'character',
                    comment.char = ';', quiet = TRUE),
               readAndflattenSentiWS("Post3/SentiWS_v1.8c_Negative.txt"))
score.sentiment = function(sentences, pos.words, neg.words, .progress = 'none') {
  require(plyr)
  require(stringr)
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # match() returns the position of the matched term or NA
    # I don't just want a TRUE/FALSE! How can I do this?
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress = .progress)
  scores.df = data.frame(score = scores, text = sentences)
  return(scores.df)
}
sample <- c("ich liebe dich. du bist wunderbar",
            "Ich hasse dich, geh sterben!",
            "i love you. you are wonderful.",
            "i hate you, die.")
(test.sample <- score.sentiment(sample, pos.words, neg.words))
Answer 0 (score: 1)
"Any ideas on how to achieve this easily?"
Well, yes. I do the same thing with a lot of tweets. If you are seriously into sentiment analysis, you should have a look at the Text Mining (tm) package.
You will see that working with a document-term matrix makes life much easier. A word of warning, though: having read several journal articles, bag-of-words approaches typically classify only about 60% of sentiments correctly. If you are really interested in doing high-quality research, you should dig into Peter Norvig's excellent "Artificial Intelligence: A Modern Approach".
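For illustration, here is a minimal sketch of building a document-term matrix with tm; the corpus construction from a plain character vector and the preprocessing steps are assumptions, so adapt them to your data:

library(tm)

tweets <- c("ich liebe dich. du bist wunderbar",
            "Ich hasse dich, geh sterben!")
corpus <- VCorpus(VectorSource(tweets))           # one document per tweet
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
dtm <- DocumentTermMatrix(corpus)                 # rows = tweets, cols = terms
inspect(dtm)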
So this is definitely not a quick fix for sentiment scoring. I was at a similar point two months ago, however.
"However, I want an analysis that yields actual sentiment scores"
Having been there myself, you can transform your SentiWS files into a nice csv file like this one (here for the negative words):
NegBegriff NegWert
Abbau -0.058
Abbaus -0.058
Abbaues -0.058
Abbauen -0.058
Abbaue -0.058
Abbruch -0.0048
...
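A minimal import sketch, assuming the file was saved tab-separated with the header shown above (the file names here are made up; adjust `sep` to however you exported the data):

neg.words <- read.table("SentiWS_negative.csv", header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE, fileEncoding = "UTF-8")
pos.words <- read.table("SentiWS_positive.csv", header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE, fileEncoding = "UTF-8")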
With both word lists imported as nice data.frames like that, I used this code for the scoring:
## collect one sentiment score per tweet
tweets.list.sentiment <- c()

### for all your words in each tweet in a row
for (n in 1:length(words)) {
  ## get the position of the match in your SentiWS data.frames
  tweets.neg.position <- match(unlist(words[n]), neg.words$NegBegriff)
  tweets.pos.position <- match(unlist(words[n]), pos.words$PosBegriff)
  ## now use the positions to find the matching values and sum 'em up
  score.pos <- sum(pos.words$PosWert[tweets.pos.position], na.rm = TRUE)
  score.neg <- sum(neg.words$NegWert[tweets.neg.position], na.rm = TRUE)
  score <- score.pos + score.neg   # NegWert values are already negative
  ## now we have the sentiment for one tweet, push it to the list
  tweets.list.sentiment <- append(tweets.list.sentiment, score)
  ## and go again.
}
## look how beautiful!
summary(tweets.list.sentiment)

### caveat: this code is pretty ugly and not at all good use of R,
### but it works sufficiently well. I am reusing the approach from above,
### so I did not need to rewrite the rest. Up to you ;-)
Well, I hope it works. (It did for my example.)
The trick is getting SentiWS into a nice form, which you can do with Excel macros, GNU Emacs, sed, or whatever text manipulation you see fit; a few lines of R also work, as sketched below.
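A minimal sketch of that transformation in R, assuming the SentiWS line format "Wort|POS<TAB>Wert<TAB>Flexionen"; for brevity it keeps only the main word per line and drops the inflected forms, which the table above lists as separate rows:

lines  <- readLines("Post3/SentiWS_v1.8c_Negative.txt", encoding = "UTF-8")
words  <- sub("\\|.*$", "", lines)                                # main word
values <- as.numeric(sub("^[^\t]+\t([0-9.-]+).*$", "\\1", lines)) # its value
write.table(data.frame(NegBegriff = words, NegWert = values),
            "SentiWS_negative.csv", sep = "\t",
            row.names = FALSE, quote = FALSE, fileEncoding = "UTF-8")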
Answer 1 (score: 0)
As a starting point, this line:
words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words)
says "throw away the POS information and the sentiment value (leaving just your list of words)".
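To make that concrete, here is what the substitution does to one (assumed) SentiWS line:

x <- "Abbau|NN\t-0.058\tAbbaus,Abbaues,Abbauen,Abbaue"
sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", x)
# [1] "Abbau,Abbaus,Abbaues,Abbauen,Abbaue"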
To do what you want, you need to parse the data differently, and you need a different data structure. readAndflattenSentiWS is currently returning a vector, but you want it to return a lookup table from strings to numbers: using an env object feels about right, although if I also wanted the POS information then a data.frame starts to feel right.
After that, most of your main loop can stay roughly the same, but instead of just counting the positive and negative matches, you need to collect the matched values and sum them up.
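A minimal sketch of that idea, again assuming SentiWS lines look like "Wort|POS<TAB>Wert<TAB>Flexion1,Flexion2,..."; it uses a named numeric vector as the lookup table instead of an env, for brevity:

readSentiWS <- function(filename) {
  lines <- readLines(filename, encoding = "UTF-8")
  main  <- sub("\\|[A-Z]+\t.*$", "", lines)                         # main word
  value <- as.numeric(sub("^[^\t]+\t([0-9.-]+).*$", "\\1", lines))  # its score
  flex  <- sub("^[^\t]+\t[0-9.-]+\t?", "", lines)                   # inflections
  words  <- character(0)
  values <- numeric(0)
  for (i in seq_along(lines)) {
    # inflected forms inherit the main word's sentiment value
    forms  <- c(main[i], if (nzchar(flex[i])) unlist(strsplit(flex[i], ",")))
    words  <- c(words, tolower(forms))
    values <- c(values, rep(value[i], length(forms)))
  }
  setNames(values, words)   # lookup table: word -> sentiment value
}

senti <- c(readSentiWS("Post3/SentiWS_v1.8c_Positive.txt"),
           readSentiWS("Post3/SentiWS_v1.8c_Negative.txt"))

# inside score.sentiment(), replace the TRUE/FALSE counting with:
#   score = sum(senti[words], na.rm = TRUE)
# unmatched words index as NA and are dropped by na.rm.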