Question

I have a string that may contain up to x different elements, and I need to measure the diversity of those elements.

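To compute the ideal entropy (in bits) of a string, that is, the entropy of the most "diverse" possible string (one in which all x elements are different from each other), I use the code below: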

    import math
    ideal = 'abcefghijk'  # x = 10 elements, all distinct
    probid = [ float(ideal.count(c)) / len(ideal) for c in dict.fromkeys(list(ideal)) ]
    entropy_ideal = - sum([ p * math.log(p) / math.log(2.0) for p in probid ])
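
Since each of the x distinct elements in the ideal string has probability 1/x, the ideal entropy should reduce to log2(x). The following is only a minimal sketch to check that; the helper name `uniform_entropy_bits` is made up for illustration and is not part of the code above.

```python
import math

def uniform_entropy_bits(x):
    """Entropy in bits of x equally likely symbols (hypothetical helper)."""
    probs = [1.0 / x] * x
    return -sum(p * math.log2(p) for p in probs)

# For the 10-element ideal string, compare the summed form with log2(10).
ideal_entropy = uniform_entropy_bits(10)
closed_form = math.log2(10)
print(ideal_entropy, closed_form)
```

If the two values agree up to floating-point error, the ideal string does not need to be built at all: `math.log2(len(string))` gives the same reference value.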

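Then I compare a given string against that "ideal" diversity: I compute its entropy and divide it by the ideal value to obtain a diversity index for the distribution: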

    string = 'abccbbbbcc'
    prob = [ float(string.count(c)) / len(string) for c in dict.fromkeys(list(string)) ]
    entropy = - sum([ p * math.log(p) / math.log(2.0) for p in prob ])
    index = entropy/entropy_ideal
    print(index)
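
The two steps above could be wrapped into one reusable function. This is a sketch only: the names `shannon_entropy_bits` and `diversity_index` are hypothetical, and returning 0.0 for strings shorter than two characters is an assumption made to avoid dividing by log2(1) = 0.

```python
import math
from collections import Counter

def shannon_entropy_bits(s):
    """Shannon entropy (in bits) of the symbol frequencies in s."""
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def diversity_index(s):
    """Entropy of s divided by log2(len(s)), i.e. the entropy of an
    all-distinct string of the same length."""
    if len(s) < 2:
        return 0.0  # assumption: no meaningful ratio for length 0 or 1
    return shannon_entropy_bits(s) / math.log2(len(s))

print(diversity_index('abccbbbbcc'))
```

`collections.Counter` counts every symbol in a single pass, instead of calling `str.count` once per distinct character.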

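I need to classify this index as "diverse" or "not diverse", and I find this difficult because the strings have different lengths, so the values are not always the same.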
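
To make the classification step concrete, here is a rough sketch of what I have in mind. The cutoff of 0.5 is an arbitrary placeholder, not a value I have any justification for, and the function names are hypothetical; whether one cutoff can work across different lengths is exactly what I am unsure about.

```python
import math
from collections import Counter

DIVERSITY_CUTOFF = 0.5  # hypothetical threshold, chosen arbitrarily

def diversity_index(s):
    """Entropy of s in bits divided by log2(len(s))."""
    n = len(s)
    if n < 2:
        return 0.0  # assumption: treat very short strings as not diverse
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(s).values())
    return entropy / math.log2(n)

def classify(s, cutoff=DIVERSITY_CUTOFF):
    """Label s as 'diverse' or 'not diverse' against the given cutoff."""
    return "diverse" if diversity_index(s) >= cutoff else "not diverse"

for s in ("abccbbbbcc", "abcdefghij", "aaaaaaaaab"):
    print(s, round(diversity_index(s), 3), classify(s))
```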

Do you have any suggestions on how I could modify the code, or on an existing Python package that does what I need?

Update

For example, for

    string = 'ccca'
    ideal = 'abcd'

I get

    0.8112781244591328 # entropy of the string
    0.4056390622295664 # ratio to the ideal entropy

    string = 'caaaav'
    ideal = 'abcdef'

I get

    1.2516291673878228
    0.4841962570206112

But in my view, the second string is only slightly more diverse than the first (I would classify it as low diversity).

How to measure the diversity (entropy) of a distribution in Python?

0 answers: