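Trying to count words from different lists

Asked: 2017-12-06 16:07:59

Tags: python for-loop nltk sentiment-analysis

I'm trying to do sentiment analysis to score a large number of reviews in a DataFrame. I have one corpus of negative words and one of positive words. I want to add 1 for every positive word and subtract 1 for every negative word in a review. My code: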

     text['counts'] = 0
     for i in text.Reviews:
         if i in p:
             text['counts'] += 1
         elif i in n:
             text['counts'] +=-1

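I want the new column text.counts to give each review its own score, but so far I've only managed to get every row to show the total count (as if my DataFrame were one big review).

Thanks!

3 answers:

Answer 0: (score: 0)

Here each review gets its own count instead of one global count. :) I duplicated the inner if statement in both branches because I assume you don't want to run that check on every iteration, only when a word actually matches, which is slightly cheaper. :D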

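# One score per review, keyed by the review's index label
comments_count = {}
for comment_id, review in text.Reviews.items():
    for word in review.split():
        #   If the word is positive
        if word in p:
            #   If the comment_id key hasn't been added yet...
            if comment_id not in comments_count:
                comments_count[comment_id] = 0
            comments_count[comment_id] += 1
        #   If the word is negative
        elif word in n:
            #   If the comment_id key hasn't been added yet...
            if comment_id not in comments_count:
                comments_count[comment_id] = 0
            comments_count[comment_id] -= 1

# Reviews with no matching word never get a key, so default those to 0
text['commentsCount'] = [comments_count.get(idx, 0) for idx in text.index]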

Answer 1: (score: 0)

Is this what you're looking for?

In [28]: text  = pd.DataFrame( ['good and not bad', 'it is a terrible bad product', 'excellent product'], columns = ['reviews'])

In [29]: text  
Out[29]: 
                        reviews
0              good and not bad
1  it is a terrible bad product
2             excellent product

In [30]: n = set('bad worse terrible worse bad baddest'.split())
In [31]: p = set('good better excellent good best bestest good'.split())

In [32]: text['count'] = text['reviews'].apply(lambda review: sum(0 + ((word in p) and 1) + ((word in n) and -1) for word in review.split()))

In [33]: text
Out[33]: 
                        reviews  count
0              good and not bad      0
1  it is a terrible bad product     -2
2             excellent product      1
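
The one-line lambda above packs a lot in. Here is a sketch of the same scoring written as a named function, in case that is easier to read. The name score_review is made up for illustration, and it assumes whitespace-separated, already-lowercased words like the sample data. It only differs from the lambda if a word sits in both lists (the lambda nets that to 0, this counts it as positive).

```python
def score_review(review, positive, negative):
    """Return (# positive words) - (# negative words) for one review string."""
    score = 0
    for word in review.split():
        if word in positive:
            score += 1
        elif word in negative:
            score -= 1
    return score


# Same word lists as in the session above.
n = set('bad worse terrible worse bad baddest'.split())
p = set('good better excellent good best bestest good'.split())

reviews = ['good and not bad', 'it is a terrible bad product', 'excellent product']
scores = [score_review(r, p, n) for r in reviews]
print(scores)

# With the DataFrame from the session it would be applied row by row:
# text['count'] = text['reviews'].apply(score_review, args=(p, n))
```

The point is that the score is computed per review string, so nothing is ever added to the whole column at once.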

Answer 2: (score: 0)

TL;DR

from collections import Counter

import pandas as pd
from nltk import word_tokenize

positive_words = set(['good', 'awesome', 'excellent'])
negative_words = set(['bad', 'terrible'])

df = pd.DataFrame( ['good and not bad', 'it is a terrible bad product', 'excellent product'], columns = ['Reviews'])

df['Tokenized'] = df['Reviews'].apply(str.lower).apply(word_tokenize)
df['WordCount'] = df['Tokenized'].apply(lambda x: Counter(x))

df['Positive'] = df['WordCount'].apply(lambda x: sum(v for k,v in x.items() if k in positive_words))
df['Negative'] = df['WordCount'].apply(lambda x: sum(v for k,v in x.items() if k in negative_words))

Then:

>>> df['Sentiment'] = df['Positive'] - df['Negative']
>>> df[['Reviews', 'Sentiment']]
                        Reviews  Sentiment
0              good and not bad          0
1  it is a terrible bad product         -2
2             excellent product          1

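The code above loops over the word counts twice (once for the positive words, once for the negative ones). Here is an alternative that does it in a single pass: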

from collections import Counter

import pandas as pd
from nltk import word_tokenize

positive_words = set(['good', 'awesome', 'excellent'])
negative_words = set(['bad', 'terrible'])

df = pd.DataFrame( ['good and not bad', 'it is a terrible bad product', 'excellent product'], columns = ['Reviews'])

df['Tokenized'] = df['Reviews'].apply(str.lower).apply(word_tokenize)
df['WordCount'] = df['Tokenized'].apply(lambda x: Counter(x))

df['Sentiment'] = df['WordCount'].apply(lambda x: sum(v if k in positive_words else -v if k in negative_words else 0 for k,v in x.items()))
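
For what it's worth, the Counter step isn't tied to pandas or NLTK. Below is a minimal sketch of the single-pass scoring as a plain function, so it can be checked on a token list directly. The name sentiment_from_tokens is a hypothetical helper, and the tokens come from str.split rather than word_tokenize purely to keep the example dependency-free, so punctuation handling will differ from the NLTK version.

```python
from collections import Counter

positive_words = {'good', 'awesome', 'excellent'}
negative_words = {'bad', 'terrible'}


def sentiment_from_tokens(tokens):
    """Single pass over the word counts: +count for positive, -count for negative."""
    counts = Counter(tokens)
    return sum(
        v if k in positive_words else -v if k in negative_words else 0
        for k, v in counts.items()
    )


# Assumption: plain whitespace tokenization instead of nltk.word_tokenize.
tokens = 'it is a terrible bad product'.lower().split()
print(sentiment_from_tokens(tokens))
```

A function like this could be passed to df['Tokenized'].apply in place of the two-step WordCount and Sentiment columns.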