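How do I assign different scores for sentiment analysis in R?

Asked: 2015-08-07 14:58:54

Tags: r algorithm twitter sentiment-analysis

I have a file of Tweets that I want/need to run sentiment analysis on. I came across this process, which works well, but now I want to change the code so that it assigns different scores depending on the sentiment.

Here is the code: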

    score.sentiment = function(sentences, pos.words, neg.words, progress='none')
    {
        require(plyr)
        require(stringr)
        scores = laply(sentences, function(sentence, pos.words, neg.words)
        {
            sentence = gsub('[[:punct:]]', '', sentence)
            sentence = gsub('[[:cntrl]]', '', sentence)
            sentence = gsub('\\d+', '', sentence)
            sentence = tolower(sentence)
            word.list = str_split(sentence, '\\s+')
            words = unlist(word.list)
            pos.matches = match(words, pos.words)
            neg.matches = match(words, neg.words)
            pos.matches = !is.na(pos.matches)
            neg.matches = !is.na(neg.matches)
            score = sum(pos.matches) - sum(neg.matches)
            return(score)
        }, pos.words, neg.words, .progress=.progress)
        scores.df = data.frame(scores=scores, text=sentences)
        return(scores.df)
    }

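What I want to do now is use four lexicons:

super.words, pos.words, neg.words, terrible.words.

I want to assign a different score to each lexicon: super.words = +2, pos.words = +1, neg.words = -1, terrible.words = -2.

I know that pos.matches = !is.na(pos.matches) and neg.matches = !is.na(neg.matches) assign 1/0 for TRUE/FALSE, but I want to know how to assign these specific scores so that each tweet ends up with a score.

For the moment I am focusing only on the standard two lexicons, pos and neg. I have assigned scores to these in two data frames: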

posDF<-data.frame(words=pos, value=1, stringsAsFactors=F)

negDF<-data.frame(words=neg, value=-1, stringsAsFactors=F)

and tried to run the algorithm above with these, but it did not work.
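One way the two data frames could be used is as a single lookup table: stack them, find each tweet word's row with match(), and sum the corresponding values. The sketch below is only an illustration under assumptions: the toy contents of `pos` and `neg`, the combined `lexicon` table, and the helper `score_words` are all hypothetical and not part of the original code.

```r
# Hypothetical toy lexicons; the real `pos` and `neg` vectors are not shown here.
pos <- c("good", "happy", "strong")
neg <- c("evil", "nothing")

posDF <- data.frame(words = pos, value = 1, stringsAsFactors = FALSE)
negDF <- data.frame(words = neg, value = -1, stringsAsFactors = FALSE)

# Stack both lexicons into one word -> value lookup table.
lexicon <- rbind(posDF, negDF)

# Hypothetical helper: look up the value of every word and add them up.
score_words <- function(words, lexicon) {
  idx <- match(words, lexicon$words)      # NA for words in neither lexicon
  sum(lexicon$value[idx], na.rm = TRUE)   # unmatched words contribute nothing
}

words <- c("very", "good", "news", "very", "happy")
score_words(words, lexicon)
```

Because the weight lives in the `value` column, adding a +2 or -2 lexicon would only mean binding more rows onto `lexicon`; the scoring helper itself would not need to change.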

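I came across this page and this page, where someone wrote several 'for' loops, but the final result only gave an overall score of -1, 0 or 1.

Ultimately, I am looking for a result similar to this: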

table(analysis$score)

-5 -4 -3 -2 -1 0 1 2 3 4 5 6 19

3 8 49 164 603 2790 .................. etc.

So far, however, whenever I get a result that does not require me to "debug" the code, I get this:

< table of extent 0 >
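As a possible explanation (an assumption, not a diagnosis of the code above): table() returns an empty table when it is handed NULL or a zero-length vector, which can happen if the column being tabulated does not exist in the data frame. The data frame `analysis` and the column name `result` below are hypothetical and chosen only to show the effect.

```r
# Hypothetical result shaped like the output of score.sentiment():
# the score column is called `scores`.
analysis <- data.frame(scores = c(-1, 0, 0, 2),
                       text = c("a", "b", "c", "d"),
                       stringsAsFactors = FALSE)

# `result` is not a column of `analysis`, so `$` returns NULL.
missing_col <- analysis$result
print(table(missing_col))       # an empty table

# Tabulating the column that does exist gives counts per score value.
print(table(analysis$scores))
```

Checking names(analysis) and nrow(analysis) before calling table() is a quick way to rule this out.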

Here are some of the sample tweets I am working with:

tweets<-data.frame(words=c("@UKLabour @KarlTurnerMP #LabourManifesto Speaking as a carer, labours NHS plans are all good news, very happy. Making my day this!", "#LabourManifesto eggs and sweet things are looking evil", "@UKLabour @KarlTurnerMP Half way through the #LabourManifesto, this will definitely improve every-bodies lives if implemented fully.", "There is nothing \"long term\" about fossil fuels. #fracking #labourmanifesto https://twitter.com/stevetopple/status/587576796599595012", "Fair play Ed, very strong speech! Finally had the chance to watch it. #LabourManifesto wanna see the other manifestos nowwww") )

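Any help is much appreciated!

So, basically, I want to know whether there is a way to change this part of the original script:

    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)

so that I can assign my own specific scores (pos.words = +1, neg.words = -1)? Or do I have to include various if statements and for loops?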

2 answers:

Answer 0: (score: 0)

If you only want to apply custom scores when computing the total, you can change the line score=sum(pos.matches)-sum(neg.matches) to:

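score = 2*sum(super.pos.matches) + sum(pos.matches) - sum(neg.matches) - 2*sum(terrible.matches)

Here super.pos.matches and terrible.matches are assumed to be built the same way as pos.matches, from the two extra lexicons.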
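The weighting idea in this answer can be checked by hand on a single sentence. The sketch below uses hypothetical toy lexicons and `%in%`, which gives the same TRUE/FALSE result as !is.na(match(...)); the variable names n.super, n.pos, n.neg and n.terrible are made up for the example.

```r
# Hypothetical toy lexicons, for illustration only.
super.words    <- c("excellent", "brilliant")
pos.words      <- c("good", "happy")
neg.words      <- c("bad", "evil")
terrible.words <- c("awful", "disastrous")

words <- c("excellent", "speech", "good", "plans", "but", "awful", "delivery")

# %in% yields TRUE/FALSE per word; sum() counts the TRUEs.
n.super    <- sum(words %in% super.words)     # "excellent"
n.pos      <- sum(words %in% pos.words)       # "good"
n.neg      <- sum(words %in% neg.words)       # none
n.terrible <- sum(words %in% terrible.words)  # "awful"

# Weight each count, then add: 2*1 + 1*1 - 1*0 - 2*1
score <- 2 * n.super + 1 * n.pos - 1 * n.neg - 2 * n.terrible
score
```

Each count is reduced to a single number before it is weighted, so the weights never get mixed into the element-wise TRUE/FALSE vectors.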

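Answer 1: (score: 0)

If you are considering four lexicons, note first that in your function definition you are missing the "." before progress.

The following code should help: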

    # The two extra lexicons are passed in as arguments rather than read from the global environment.
    score.sentiment = function(sentences, super.pos.words, pos.words, neg.words, terrible.words, .progress='none')
    {
        require(plyr)
        require(stringr)
        scores = laply(sentences, function(sentence, super.pos.words, pos.words, neg.words, terrible.words)
        {
            # strip punctuation, control characters and digits, then lower-case
            sentence = gsub('[[:punct:]]', '', sentence)
            sentence = gsub('[[:cntrl:]]', '', sentence)
            sentence = gsub('\\d+', '', sentence)
            sentence = tolower(sentence)
            # split the sentence into words
            word.list = str_split(sentence, '\\s+')
            words = unlist(word.list)
            # TRUE where a word appears in the given lexicon
            super.pos.matches = !is.na(match(words, super.pos.words))
            pos.matches = !is.na(match(words, pos.words))
            neg.matches = !is.na(match(words, neg.words))
            terrible.matches = !is.na(match(words, terrible.words))
            # weight each count: +2, +1, -1, -2
            score = 2*sum(super.pos.matches) + sum(pos.matches) -
                    sum(neg.matches) - 2*sum(terrible.matches)
            return(score)
        }, super.pos.words, pos.words, neg.words, terrible.words, .progress=.progress)
        scores.df = data.frame(scores=scores, text=sentences)
        return(scores.df)
    }
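To try the four-lexicon idea without installing plyr or stringr, here is a minimal base-R sketch of the same technique. It is a swapped-in variant, not the answer's function: `score_weighted`, the `lexicons` list, the `weights` vector and the toy word lists are all hypothetical, and sapply()/strsplit() stand in for laply()/str_split().

```r
# Sketch: base-R variant with lexicons passed as a named list and
# a matching named vector of weights (both hypothetical).
score_weighted <- function(sentences, lexicons, weights) {
  scores <- sapply(sentences, function(sentence) {
    # Same cleaning steps as the plyr version.
    sentence <- gsub("[[:punct:]]", "", sentence)
    sentence <- gsub("[[:cntrl:]]", "", sentence)
    sentence <- gsub("\\d+", "", sentence)
    words <- unlist(strsplit(tolower(sentence), "\\s+"))
    # Count hits per lexicon, then weight the counts and add them up.
    hits <- sapply(lexicons, function(lex) sum(words %in% lex))
    sum(hits * weights[names(lexicons)])
  }, USE.NAMES = FALSE)
  data.frame(scores = scores, text = sentences, stringsAsFactors = FALSE)
}

# Hypothetical toy lexicons, for illustration only.
lexicons <- list(
  super    = c("excellent"),
  pos      = c("good", "happy", "strong", "improve", "fair"),
  neg      = c("nothing"),
  terrible = c("evil")
)
weights <- c(super = 2, pos = 1, neg = -1, terrible = -2)

tweets <- c(
  "#LabourManifesto eggs and sweet things are looking evil",
  "Fair play Ed, very strong speech! Finally had the chance to watch it."
)

analysis <- score_weighted(tweets, lexicons, weights)
table(analysis$scores)
```

Keeping the weights in a named vector means a fifth lexicon, or a different scale, only changes the two inputs and not the scoring function.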