I have a comment from a student:
The course was interesting, but the professor was so boring.
and a sentiment data frame that contains all the sentiment words together with their polarity (positive and negative):
> sentiment_DF
word         positive.polarity  negative.polarity
interesting  1                  0
boring       0                  1
pretty       1                  0
...
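I am trying to write a function in R that determines the polarity of the sentiment words in a text. To do that, I first extract all the words from the text:
# split into words. str_split is in the stringr package
library(stringr)
word.list = str_split(sentence, '\\s+')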
# sometimes a list() is one level of hierarchy too much
words = unlist(word.list)
Then I check whether each word in the list exists in the sentiment data frame and determine its polarity. I tried this code:
library(data.table)
dt <- setDT(sentiment_DF)
dt <- melt(sentiment_DF, id.vars = "word")
dt[word == "b" & value > 0, variable]
The algorithm:
overall_sentiment <- 0
while there is sentiment_word in text do
polarity <- get_polarity(sentiment_word)
overall_sentiment <- overall_sentiment + polarity
end while
Can you help me?
Thanks
---- EDIT ----
The basic algorithm changes to the following version:
overall_sentiment <- 0
while there is sentiment_word in text do
polarity <- get_polarity(sentiment_word)
if booster_word in context(sentiment_word)
if negation_word in context(sentiment_word)
polarity <- polarity/3
else
polarity <- polarity*3
end if
end if
overall_sentiment <- overall_sentiment + polarity
end while
booster_word <- c("more", "very", "too", "much", "completely", "absolutely", "fully", "totally", "definitely", "extremely", "often", "frequently", "enough", "a lot")
negation_word <-c("never", "nothing", "no", "never", "not", "no more")
I wrote a function that extracts the context of a sentiment_word (the 3 words that precede a given word).
getContext <- function(text, look_for, pre = 3, post = pre) {
  # create vector of words (anything separated by a space)
  t_vec <- unlist(strsplit(text, '\\s'))
  # find position of matches
  matches <- which(t_vec == look_for)
  # return the words before each match, if there are any matches
  if (length(matches) > 0) {
    out <- list(before = lapply(matches, function(m) {
      # NA when fewer than `pre` words precede the match
      if (m - pre < 1) NA else t_vec[(m - pre):(m - 1)]
    }))
    return(out)
  } else {
    warning('No matches')
  }
}
Here is an example:
"the course was very interesting, but the professor was too boring."
"Stackoverflow is an interesting place with too interesting people"
First sentence:
"the course was *very interesting*, but the professor was *too boring*."
(1*3) + (-1*3) = 0
Second sentence:
"Stackoverflow is an *interesting* place with *too interesting* people"
1+(1*3) = 4
My question now is: how can I check in R whether the context of a sentiment word contains a booster_word?
Thanks
Answer 0 (score: 2)
Maybe this works for you:
### function to calculate the polarity of sentences
calcPolarity <- function(sentiment_DF, sentences) {
  # separate each sentence in words using regular expression
  # (it returns a list with the words of each sentence)
  sentencesSplitInWords <- regmatches(sentences, gregexpr("[[:word:]]+", sentences, perl = TRUE))
  # pre-allocate the polarity result vector with size = number of sentences
  polarity <- rep.int(0, length(sentencesSplitInWords))
  # seq_along() also behaves correctly when no sentences are passed
  for (i in seq_along(polarity)) {
    # get the i-th sentence words
    wordsOfASentence <- sentencesSplitInWords[[i]]
    # get the rows of sentiment_DF corresponding to the words in the sentence using match
    # N.B. if a word occurs twice, there will be two equal rows
    # (but I think it's correct since in this way you count its polarity twice)
    subDF <- sentiment_DF[match(wordsOfASentence, sentiment_DF$word, nomatch = 0), ]
    # calculate the total polarity of the sentence and store in the vector
    polarity[i] <- sum(subDF$positive.polarity) - sum(subDF$negative.polarity)
  }
  return(polarity)
}
Usage:
sentiment_DF <- data.frame(word=c('interesting','boring','pretty'),
positive.polarity=c(1,0,1),
negative.polarity=c(0,1,0))
sentences <- c("The course was interesting, but the professor was so boring.",
"stackoverflow is an interesting place with interesting people!")
result <- calcPolarity(sentiment_DF,sentences)
# > result
# [1] 0 2
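For the booster and negation rule from the edit, the context check can be written with `%in%`. What follows is only a minimal base R sketch, not a tested extension of the function above. `scoreWithBoosters` and `lexicon` are names made up for illustration, the word lists are shortened to single-word entries, and the context is assumed to be up to `pre` words before the sentiment word.

```r
# Hypothetical helper: polarity of one sentence with the booster/negation rule.
# lexicon: named numeric vector, +1 for positive words, -1 for negative words.
scoreWithBoosters <- function(sentence, lexicon, boosters, negations, pre = 3) {
  # lower-case and keep only runs of letters, so "interesting," matches "interesting"
  lowered <- tolower(sentence)
  words <- regmatches(lowered, gregexpr("[a-z]+", lowered))[[1]]
  total <- 0
  for (i in which(words %in% names(lexicon))) {
    polarity <- lexicon[[words[i]]]
    # context = up to `pre` words before the sentiment word (empty at position 1)
    context <- if (i > 1) words[max(1, i - pre):(i - 1)] else character(0)
    if (any(context %in% boosters)) {
      if (any(context %in% negations)) {
        polarity <- polarity / 3
      } else {
        polarity <- polarity * 3
      }
    }
    total <- total + polarity
  }
  total
}

lexicon   <- c(interesting = 1, boring = -1, pretty = 1)
boosters  <- c("more", "very", "too", "much", "extremely")
negations <- c("never", "nothing", "no", "not")

s1 <- scoreWithBoosters("the course was very interesting, but the professor was too boring.",
                        lexicon, boosters, negations)
s2 <- scoreWithBoosters("Stackoverflow is an interesting place with too interesting people",
                        lexicon, boosters, negations)
```

With these inputs `s1` and `s2` reproduce the two hand calculations from the question (0 and 4). Multi-word entries such as "a lot" or "no more" would need a different matching step, because this sketch compares single tokens only.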
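Answer 1 (score: 0)
You should first extract the words, probably with a regular expression, to make sure you do not end up with tokens like "interesting," (with the punctuation still attached). Store the words of the sentence in a variable called words_of_sentence. Then you can use: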
dt[word %in% words_of_sentence & value > 0, variable]
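To make the two steps concrete, here is a small self-contained sketch. It is an assumption-laden illustration rather than the exact code above: it uses base R data frames instead of data.table so that it runs without extra packages, builds the long-format table by hand (the name `long` is made up here), and extracts words with a simple letters-only pattern.

```r
# Same toy lexicon as in the question, in wide format
sentiment_DF <- data.frame(word = c("interesting", "boring", "pretty"),
                           positive.polarity = c(1, 0, 1),
                           negative.polarity = c(0, 1, 0),
                           stringsAsFactors = FALSE)

# Long format: one row per (word, polarity column), similar to melt(..., id.vars = "word")
long <- data.frame(word     = rep(sentiment_DF$word, times = 2),
                   variable = rep(c("positive.polarity", "negative.polarity"),
                                  each = nrow(sentiment_DF)),
                   value    = c(sentiment_DF$positive.polarity,
                                sentiment_DF$negative.polarity),
                   stringsAsFactors = FALSE)

sentence <- "The course was interesting, but the professor was so boring."
# Keep only runs of letters, so the comma and the full stop are dropped
lowered <- tolower(sentence)
words_of_sentence <- regmatches(lowered, gregexpr("[a-z]+", lowered))[[1]]

# Rows whose word occurs in the sentence and whose polarity flag is set
hits <- long[long$word %in% words_of_sentence & long$value > 0, c("word", "variable")]
```

Here `hits` pairs "interesting" with positive.polarity and "boring" with negative.polarity. Note that `%in%` only tells you which lexicon words occur; a word that appears twice in the sentence is still reported once.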