Question

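I am using the Vader SentimentAnalyzer to obtain polarity scores. Previously I used the probability scores for positive/negative/neutral, but I just realized that the "compound" score, which ranges from -1 (most negative) to 1 (most positive), would provide a single measure of polarity. I wonder how the "compound" score is computed. Is it calculated from the [pos, neu, neg] vector?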

Answer 1

The VADER algorithm outputs sentiment scores for 4 classes of sentiment (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L441):

neg: Negative
neu: Neutral
pos: Positive
compound: Compound (i.e. aggregated score)

Let's walk through the code. The first instance of compound is at https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L421, where it is computed as:

compound = normalize(sum_s)

The normalize() function is defined at https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L107:

def normalize(score, alpha=15):
    """
    Normalize the score to be between -1 and 1 using an alpha that
    approximates the max expected value
    """
    norm_score = score/math.sqrt((score*score) + alpha)
    return norm_score

So there is a hyperparameter, alpha.
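To get a feel for what alpha does, here is a minimal sketch that re-declares the normalize() function quoted above and evaluates it for a few raw sums and alpha values. The inputs are made up for illustration and are not taken from VADER.

```python
import math

def normalize(score, alpha=15):
    """Same formula as the normalize() quoted above."""
    return score / math.sqrt((score * score) + alpha)

# Illustrative raw sums (hypothetical values, not taken from the VADER lexicon)
raw_sums = [-20.0, -4.0, -1.0, 0.0, 1.0, 4.0, 20.0]

for alpha in (1, 15, 100):
    row = [round(normalize(s, alpha), 3) for s in raw_sums]
    print(f"alpha={alpha:>3}: {row}")
```

For any alpha > 0 the result stays strictly between -1 and 1, and a larger alpha pulls the same raw sum closer to 0. With alpha = 0 the formula reduces to score / |score|, so it can only return -1 or 1 and divides by zero for a raw sum of 0.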

As for sum_s, it is the sum of the sentiment argument passed to the score_valence() function (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L413).

If we trace this sentiment argument back, we see that it is computed when the polarity_scores() function is called, at https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L217:

def polarity_scores(self, text):
    """
    Return a float for sentiment strength based on the input text.
    Positive values are positive valence, negative value are negative
    valence.
    """
    sentitext = SentiText(text)
    #text, words_and_emoticons, is_cap_diff = self.preprocess(text)

    sentiments = []
    words_and_emoticons = sentitext.words_and_emoticons
    for item in words_and_emoticons:
        valence = 0
        i = words_and_emoticons.index(item)
        if (i < len(words_and_emoticons) - 1 and item.lower() == "kind" and \
            words_and_emoticons[i+1].lower() == "of") or \
            item.lower() in BOOSTER_DICT:
            sentiments.append(valence)
            continue

        sentiments = self.sentiment_valence(valence, sentitext, item, i, sentiments)

    sentiments = self._but_check(words_and_emoticons, sentiments)

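Looking at the polarity_scores function, what it does is iterate over the words and emoticons of the SentiText and check each one with the rule-based sentiment_valence() function to assign it a valence score (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L243); see Section 2.1.1 of http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf.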
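The shape of that loop can be sketched with a toy example. Everything here is hypothetical: the mini-lexicon, its values, the booster list and the function name are invented for illustration, and the sketch leaves out the real rules (booster amplification, negation, capitalization, the "but" check). It only shows that each token ends up with one valence, and that boosters and unknown words contribute 0.

```python
# Hypothetical mini-lexicon and booster list; the values are invented for illustration.
TOY_LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5}
TOY_BOOSTERS = {"very", "extremely"}

def toy_sentiments(text):
    """Assign one valence per token; boosters and unknown words get 0."""
    sentiments = []
    for token in text.lower().split():
        if token in TOY_BOOSTERS:
            # Mirrors the early `continue` in the excerpt: the booster itself scores 0.
            sentiments.append(0)
            continue
        sentiments.append(TOY_LEXICON.get(token, 0))
    return sentiments

print(toy_sentiments("The food was very good but the service was bad"))
```

The resulting list of per-token valences is what gets summed into sum_s later on.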

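So, going back to the compound score, we see that:

the compound score is a normalized score of sum_s, and
sum_s is the sum of valences computed from some heuristics and a sentiment lexicon (a.k.a. sentiment intensity), and
the normalized score is simply sum_s divided by the square root of its square plus an alpha parameter, which increases the denominator of the normalization function.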

Is that calculated from the [pos, neu, neg] vector?

Not really =)

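If we look at the score_valence function (https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L411), we see that the compound score is computed from sum_s before the pos, neg and neu scores are computed using _sift_sentiment_scores(), which computes the individual pos, neg and neu scores from the raw scores returned by sentiment_valence(), without the sum.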
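A simplified sketch may make the difference concrete. This is not the library's code: toy_score_valence is a hypothetical name, and the sketch assumes no punctuation emphasis or other adjustments. It only illustrates that compound comes from the normalized sum, while pos, neu and neg are proportions built from the individual scores.

```python
import math

def normalize(score, alpha=15):
    """Same formula as the normalize() quoted earlier."""
    return score / math.sqrt((score * score) + alpha)

def toy_score_valence(sentiments):
    """Simplified assumption: no punctuation emphasis or other adjustments."""
    if not sentiments:
        return {"neg": 0.0, "neu": 0.0, "pos": 0.0, "compound": 0.0}
    # compound: normalize the plain sum of all per-token valences
    compound = normalize(float(sum(sentiments)))
    # pos/neg/neu: sift the individual scores, then turn them into proportions
    pos_sum = sum(s for s in sentiments if s > 0)
    neg_sum = sum(s for s in sentiments if s < 0)
    neu_count = sum(1 for s in sentiments if s == 0)
    total = pos_sum + abs(neg_sum) + neu_count
    return {
        "neg": abs(neg_sum) / total,
        "neu": neu_count / total,
        "pos": pos_sum / total,
        "compound": compound,
    }

print(toy_score_valence([1, 0]))
print(toy_score_valence([3, 0, 0, 0]))
```

In this sketch both inputs give pos = 0.5 and neu = 0.5, yet their compound scores differ, because the proportions discard the magnitude of the sum. That is one way to see why compound cannot be recovered from the [pos, neu, neg] vector alone.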

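If we look at this alpha mathemagic in normalize(), it seems that the output of the normalization is rather unstable (if left unconstrained), depending on the value of alpha: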

alpha：