Question

The entropy formula for vocabulary richness is:

H = -100 * Σ (p_i * log p_i) / log N

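The probability p_i is calculated by dividing V_i by N, where N is the total number of tokens in the text and V_i is the number of times a particular type occurs (at least that's my understanding).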

So, if I have the string the, the, the, a, a, over, love, one, tree, there are 9 tokens but only 6 types.

V_the (as far as I understand) would be 3, so p_the would be calculated as 3/9 = 0.33. V_a would be 2, so p_a would be 2/9 = 0.22, and so on. H in this instance would be -100*((0.33*log0.33 + 0.22*log0.22 + 0.11*log0.11 + 0.11*log0.11 + 0.11*log0.11 + 0.11*log0.11)/log9)
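As a quick arithmetic check of the expression above, here is a minimal sketch in Python. It assumes natural logarithms (the base cancels in the ratio, so any base gives the same H) and hard-codes the counts from the example rather than deriving them from the text.

```python
import math

# Counts from the example: 'the' = 3, 'a' = 2, and four types occurring once each.
counts = [3, 2, 1, 1, 1, 1]
n_tokens = sum(counts)  # N = 9

# p_i = V_i / N for every type
probs = [c / n_tokens for c in counts]

# H = -100 * sum(p_i * log p_i) / log N
h = -100 * sum(p * math.log(p) for p in probs) / math.log(n_tokens)
print(round(h, 2))  # roughly 76.3
```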

While I can get the length of the string (the number of tokens) in Python:

>>> string = ['the', 'the', 'the', 'a', 'a', 'over', 'love', 'one', 'tree']
>>> len(string)
9

And the number of types:

>>> len(set(string))
6

I'm not entirely sure how to compute this formula in Python. Thanks.

Source: Dale, Moisl and Somers (p. 551). "Handbook of Natural Language Processing" (2000). https://books.google.at/books?id=VoOLvxyX0BUC&pg=PA551&lpg=PA551&dq=entropy+vocabulary+richness&source=bl&ots=wucWFF1Rn_&sig=Hms1qwhXlcOaPEXI84eDqxsTEdo&hl=en&sa=X&ved=0CC8Q6AEwAmoVChMIjvvQnvPVxwIVhJ5yCh35ZAb_#v=onepage&q&f=false

Answer 1

To compute the sigma (the sum), you can do this:

import math

def calculateEntropy(freqDict, total):
    # Shannon entropy (in bits) of the type frequencies.
    # freqDict maps each type to its count V_i; total is the token count N.
    entropy = 0.0
    for element in freqDict:
        p = float(freqDict[element]) / total  # p_i = V_i / N
        entropy -= p * math.log(p, 2)
    return entropy

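To get the token frequencies, you can use a simple dict with the tokens as keys and their occurrence counts as values. To get the full formula, you still have to compute 100*entropy/math.log(total,2), where total is the number of tokens N.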
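Putting the pieces together, here is a minimal sketch under a few assumptions. `vocabulary_richness` is a hypothetical helper name, not something from the book or from the function above. It builds the frequency dict with `collections.Counter` from the standard library, normalizes by the log of the token count N as in the question's formula, and returns 0.0 for fewer than two tokens (where log N is zero), which is a choice made for this sketch only.

```python
import math
from collections import Counter

def vocabulary_richness(tokens):
    """Normalized entropy in percent: -100 * sum(p_i * log p_i) / log N."""
    total = len(tokens)
    if total < 2:
        # log N is 0 (or undefined) here; returning 0.0 is an assumption of this sketch.
        return 0.0
    freq = Counter(tokens)  # maps each type to its count V_i
    entropy = -sum((c / total) * math.log(c / total, 2) for c in freq.values())
    return 100 * entropy / math.log(total, 2)

tokens = ['the', 'the', 'the', 'a', 'a', 'over', 'love', 'one', 'tree']
print(round(vocabulary_richness(tokens), 2))  # roughly 76.3
```

With every token distinct the value reaches 100, and it drops toward 0 as a single type dominates the text.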
