I have a set of data containing an ID, a timestamp, and identifiers. I have to go through it, calculate the entropy, and save some other data links. At each step more identifiers are added to the identifiers dictionary, and I have to recompute the entropy and append it. I have a very large amount of data, and the program gets stuck because of the growing number of identifiers and the entropy computation after each step. I read the following solution, but it is about data consisting of numbers: Incremental entropy computation
I copied the two functions from that page, but the incremental entropy computation gives values different from the classical full entropy computation at each step. Here is my code:
Another problem is that when I print the sum of total_identifiers, it gives 12 instead of 14! (Since the amount of data is very large, I read the actual file line by line, write the results directly to disk, and store nothing in memory except the identifiers dictionary.)
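For context, a minimal sketch of the kind of pipeline described above, with a hypothetical three-column layout (ID, timestamp, identifiers) and in-memory stand-ins for the input and output files; the full recomputation of the entropy over the whole dictionary at every row is exactly the step that does not scale:

```python
import collections
import io
import math

def entropy(counts):
    s = sum(counts)
    return sum(-(c / s) * math.log(c / s, 2) for c in counts if c > 0)

# Hypothetical input rows: id,timestamp,identifiers (';'-separated here
# so the identifier list does not clash with the CSV comma).
rows = io.StringIO(
    "1,2008-01-06T02:13:38Z,foo;bar\n"
    "2,2008-01-06T02:12:13Z,bar;blup\n"
    "3,2008-01-06T02:13:55Z,foo;bar\n"
)

identifiers = collections.defaultdict(int)
out = io.StringIO()  # stands in for the on-disk results file
for line in rows:
    _id, _ts, idents = line.strip().split(",")
    for ident in idents.split(";"):
        identifiers[ident] += 1
    # Full recomputation over the whole dictionary at every row:
    # this cost grows with the number of identifiers seen so far.
    out.write(f"{_id},{entropy(identifiers.values()):.6f}\n")

print(out.getvalue())
```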
Answer 0 (score: 1)
The code above uses Theorem 4; it seems to me that you want to use Theorem 5 (from the next section of the paper).
Note, however, that if the number of identifiers really is the problem, then the incremental approach below will not help either: at some point the dictionary becomes too large.
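For reference, this is the update rule that the implementation below uses (my transcription, matching the code rather than the paper's exact statement): $m$ is the old total count, $r$ the total count change in the batch, $c_i$ the old count of label $i$, and $\delta_i$ its change.

```latex
H_{\text{new}}
  = \frac{m}{m+r}\left(H_{\text{old}} - \log_2\frac{m}{m+r}\right)
  - \sum_{i \in \text{changed}}
      \left[\frac{c_i+\delta_i}{m+r}\log_2\frac{c_i+\delta_i}{m+r}
          - \frac{c_i}{m+r}\log_2\frac{c_i}{m+r}\right]
```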
Below you can find a proof-of-concept Python implementation following the description in Updating Formulas and Algorithms for Computing Entropy and Gini Index from Time-Changing Data Streams.
import collections
import math
import random


def log2(p):
    return math.log(p, 2) if p > 0 else 0


CountChange = collections.namedtuple('CountChange', ('label', 'change'))


class EntropyHolder:
    """Keeps running counts and updates the entropy incrementally."""

    def __init__(self):
        self.counts_ = collections.defaultdict(int)
        self.entropy_ = 0
        self.sum_ = 0

    def update(self, count_changes):
        # r is the total change in the number of observations.
        r = sum([change for _, change in count_changes])
        residual = self._compute_residual(count_changes)
        # Rescale the old entropy to the new total, then correct
        # for the labels whose counts changed.
        self.entropy_ = self.sum_ * (self.entropy_ - log2(self.sum_ / (self.sum_ + r))) / (self.sum_ + r) - residual
        self._update_counts(count_changes)
        return self.entropy_

    def _compute_residual(self, count_changes):
        # Contribution of the changed labels, new minus old, both
        # expressed over the new total self.sum_ + r.
        r = sum([change for _, change in count_changes])
        residual = 0
        for label, change in count_changes:
            p_new = (self.counts_[label] + change) / (self.sum_ + r)
            p_old = self.counts_[label] / (self.sum_ + r)
            residual += p_new * log2(p_new) - p_old * log2(p_old)
        return residual

    def _update_counts(self, count_changes):
        for label, change in count_changes:
            self.sum_ += change
            self.counts_[label] += change

    def entropy(self):
        return self.entropy_


def naive_entropy(counts):
    s = sum(counts)
    return sum([-(r / s) * log2(r / s) for r in counts])


if __name__ == '__main__':
    print(naive_entropy([1, 1]))
    print(naive_entropy([1, 1, 1, 1]))

    entropy = EntropyHolder()
    freq = collections.defaultdict(int)
    for _ in range(100):
        index = random.randint(0, 5)
        entropy.update([CountChange(index, 1)])
        freq[index] += 1

    print(naive_entropy(freq.values()))
    print(entropy.entropy())
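As a sanity check (my addition, not part of the original answer), the incremental update can be compared against the naive recomputation over a longer random stream; the two should agree up to floating-point error. The `Inc` class below is a compact restatement of the same single-label update rule, written here only for this comparison:

```python
import collections
import math
import random

def log2(p):
    return math.log(p, 2) if p > 0 else 0

def naive_entropy(counts):
    s = sum(counts)
    return sum(-(c / s) * log2(c / s) for c in counts)

class Inc:
    """Incremental entropy for one label change at a time (sketch)."""

    def __init__(self):
        self.counts = collections.defaultdict(int)
        self.entropy = 0.0
        self.total = 0

    def add(self, label, change=1):
        r = change
        # Residual term for the single changed label, over the new total.
        p_new = (self.counts[label] + change) / (self.total + r)
        p_old = self.counts[label] / (self.total + r)
        residual = p_new * log2(p_new) - p_old * log2(p_old)
        self.entropy = (self.total * (self.entropy - log2(self.total / (self.total + r)))
                        / (self.total + r)) - residual
        self.total += r
        self.counts[label] += change

random.seed(0)
inc = Inc()
freq = collections.defaultdict(int)
for _ in range(1000):
    label = random.randint(0, 9)
    inc.add(label)
    freq[label] += 1

# Incremental and naive results should match to floating-point precision.
assert abs(inc.entropy - naive_entropy(freq.values())) < 1e-6
```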
Answer 1 (score: 0)
Thanks @blazs for the entropy_holder class. That solves the problem. The idea is to import entropy_holder.py (from https://gist.github.com/blazs/4fc78807a96976cc455f49fc0fb28738) and use it to keep the previous entropy, updating it at every step as new identifiers come in.
So the minimal working code looks like this:
import entropy_holder

input_data = [["1", "2008-01-06T02:13:38Z", "foo,bar"],
              ["2", "2008-01-06T02:12:13Z", "bar,blup"],
              ["3", "2008-01-06T02:13:55Z", "foo,bar"],
              ["4", "2008-01-06T02:12:28Z", "foo,xy"],
              ["5", "2008-01-06T02:12:44Z", "foo,bar"],
              ["6", "2008-01-06T02:13:00Z", "foo,bar"],
              ["7", "2008-01-06T02:13:00Z", "x,y"]]

# This object holds the current entropy and the counts of the identifiers.
entropy = entropy_holder.EntropyHolder()
for item in input_data:
    for identifier in item[2].split(","):
        entropy.update([entropy_holder.CountChange(identifier, 1)])

print(entropy.entropy())
The entropy computed with Blaz's incremental formulas is very close to the entropy computed by the classical method, without iterating over all the data again and again.