I have a set of data containing an ID, a timestamp, and identifiers. I have to go through it, calculate the entropy, and save some other data links. At each step more identifiers are added to the identifiers dictionary, and I have to recompute the entropy and append it. I have a very large amount of data, and the program gets stuck because of the growing number of identifiers and the entropy computation after each step. I read the following solution, but it is about data consisting of numbers: Incremental entropy computation
I copied the two functions from that page, but the incremental entropy computation gives values different from the classical full entropy computation at each step. Here is my code:
Another problem is that when I print the sum of total_identifiers, it gives 12 instead of 14! (Since the amount of data is very large, I read the actual file line by line, write the results directly to disk, and store nothing in memory except the identifiers dictionary.)
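For context, a minimal sketch of the kind of pipeline described above, with a hypothetical three-column layout (ID, timestamp, identifiers) and in-memory stand-ins for the input and output files; the full recomputation of the entropy over the whole dictionary at every row is exactly the step that does not scale:

```python
import collections
import io
import math

def entropy(counts):
    s = sum(counts)
    return sum(-(c / s) * math.log(c / s, 2) for c in counts if c > 0)

# Hypothetical input rows: id,timestamp,identifiers (';'-separated here
# so the identifier list does not clash with the CSV comma).
rows = io.StringIO(
    "1,2008-01-06T02:13:38Z,foo;bar\n"
    "2,2008-01-06T02:12:13Z,bar;blup\n"
    "3,2008-01-06T02:13:55Z,foo;bar\n"
)

identifiers = collections.defaultdict(int)
out = io.StringIO()  # stands in for the on-disk results file
for line in rows:
    _id, _ts, idents = line.strip().split(",")
    for ident in idents.split(";"):
        identifiers[ident] += 1
    # Full recomputation over the whole dictionary at every row:
    # this cost grows with the number of identifiers seen so far.
    out.write(f"{_id},{entropy(identifiers.values()):.6f}\n")

print(out.getvalue())
```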
Answer 0 (score: 1)
The code above uses Theorem 4; it seems to me that you want to use Theorem 5 (from the next section of the paper).
Note, however, that if the number of identifiers really is the problem, then the incremental approach below will not help either: at some point the dictionary becomes too large.
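For reference, this is the update rule that the implementation below uses (my transcription, matching the code rather than the paper's exact statement): $m$ is the old total count, $r$ the total count change in the batch, $c_i$ the old count of label $i$, and $\delta_i$ its change.

```latex
H_{\text{new}}
  = \frac{m}{m+r}\left(H_{\text{old}} - \log_2\frac{m}{m+r}\right)
  - \sum_{i \in \text{changed}}
      \left[\frac{c_i+\delta_i}{m+r}\log_2\frac{c_i+\delta_i}{m+r}
          - \frac{c_i}{m+r}\log_2\frac{c_i}{m+r}\right]
```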
Below you can find a proof-of-concept Python implementation following the description in Updating Formulas and Algorithms for Computing Entropy and Gini Index from Time-Changing Data Streams.
import collections
import math
import random


def log2(p):
    return math.log(p, 2) if p > 0 else 0


CountChange = collections.namedtuple('CountChange', ('label', 'change'))


class EntropyHolder:
    """Keeps running counts and updates the entropy incrementally."""

    def __init__(self):
        self.counts_ = collections.defaultdict(int)
        self.entropy_ = 0
        self.sum_ = 0

    def update(self, count_changes):
        # r is the total change in the number of observations.
        r = sum([change for _, change in count_changes])
        residual = self._compute_residual(count_changes)
        # Rescale the old entropy to the new total, then correct
        # for the labels whose counts changed.
        self.entropy_ = self.sum_ * (self.entropy_ - log2(self.sum_ / (self.sum_ + r))) / (self.sum_ + r) - residual
        self._update_counts(count_changes)
        return self.entropy_

    def _compute_residual(self, count_changes):
        # Contribution of the changed labels, new minus old, both
        # expressed over the new total self.sum_ + r.
        r = sum([change for _, change in count_changes])
        residual = 0
        for label, change in count_changes:
            p_new = (self.counts_[label] + change) / (self.sum_ + r)
            p_old = self.counts_[label] / (self.sum_ + r)
            residual += p_new * log2(p_new) - p_old * log2(p_old)
        return residual

    def _update_counts(self, count_changes):
        for label, change in count_changes:
            self.sum_ += change
            self.counts_[label] += change

    def entropy(self):
        return self.entropy_


def naive_entropy(counts):
    s = sum(counts)
    return sum([-(r / s) * log2(r / s) for r in counts])


if __name__ == '__main__':
    print(naive_entropy([1, 1]))
    print(naive_entropy([1, 1, 1, 1]))

    entropy = EntropyHolder()
    freq = collections.defaultdict(int)
    for _ in range(100):
        index = random.randint(0, 5)
        entropy.update([CountChange(index, 1)])
        freq[index] += 1

    print(naive_entropy(freq.values()))
    print(entropy.entropy())
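As a sanity check (my addition, not part of the original answer), the incremental update can be compared against the naive recomputation over a longer random stream; the two should agree up to floating-point error. The `Inc` class below is a compact restatement of the same single-label update rule, written here only for this comparison:

```python
import collections
import math
import random

def log2(p):
    return math.log(p, 2) if p > 0 else 0

def naive_entropy(counts):
    s = sum(counts)
    return sum(-(c / s) * log2(c / s) for c in counts)

class Inc:
    """Incremental entropy for one label change at a time (sketch)."""

    def __init__(self):
        self.counts = collections.defaultdict(int)
        self.entropy = 0.0
        self.total = 0

    def add(self, label, change=1):
        r = change
        # Residual term for the single changed label, over the new total.
        p_new = (self.counts[label] + change) / (self.total + r)
        p_old = self.counts[label] / (self.total + r)
        residual = p_new * log2(p_new) - p_old * log2(p_old)
        self.entropy = (self.total * (self.entropy - log2(self.total / (self.total + r)))
                        / (self.total + r)) - residual
        self.total += r
        self.counts[label] += change

random.seed(0)
inc = Inc()
freq = collections.defaultdict(int)
for _ in range(1000):
    label = random.randint(0, 9)
    inc.add(label)
    freq[label] += 1

# Incremental and naive results should match to floating-point precision.
assert abs(inc.entropy - naive_entropy(freq.values())) < 1e-6
```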
Answer 1 (score: 0)
Thanks @blazs for the entropy_holder class. That solves the problem. The idea is to import entropy_holder.py (from https://gist.github.com/blazs/4fc78807a96976cc455f49fc0fb28738) and use it to keep the previous entropy, updating it at every step as new identifiers come in.
So the minimal working code looks like this:
import entropy_holder

input_data = [["1", "2008-01-06T02:13:38Z", "foo,bar"],
              ["2", "2008-01-06T02:12:13Z", "bar,blup"],
              ["3", "2008-01-06T02:13:55Z", "foo,bar"],
              ["4", "2008-01-06T02:12:28Z", "foo,xy"],
              ["5", "2008-01-06T02:12:44Z", "foo,bar"],
              ["6", "2008-01-06T02:13:00Z", "foo,bar"],
              ["7", "2008-01-06T02:13:00Z", "x,y"]]

# This object holds the current entropy and the counts of the identifiers.
entropy = entropy_holder.EntropyHolder()
for item in input_data:
    for identifier in item[2].split(","):
        entropy.update([entropy_holder.CountChange(identifier, 1)])

print(entropy.entropy())
The entropy computed with Blaz's incremental formulas is very close to the entropy computed by the classical method, without iterating over all the data again and again.