Average number of bits needed to store one letter of British English using perfect compression, in Python

Time: 2013-06-21 14:50:49

Tags: python algorithm compression

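My assignment is worded as follows:

What is the average number of bits required to store one letter of British English if perfect compression is used?

The entropy of an experiment can be interpreted as the minimum number of bits required to store its outcome. So I tried writing a program that calculates the entropy contribution of each letter and then adds them all together to get the entropy over all letters.

This gives me 4.17 bits, but according to this link

Using a perfect compression algorithm, we only need 2 bits per character!

So how do I implement this perfect compression algorithm here?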

import math

letters = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
total = 0

def find_perc(s):
    # Relative frequency of each letter, in the same order as `letter` below
    perc = [0.082,0.015,0.028,0.043,0.127,0.022,0.02,0.061,0.07,0.002,0.008,0.04,0.024,0.067,0.075,0.019,0.001,0.060,0.063,0.091,0.028,0.01,0.023,0.001,0.02,0.001]
    letter = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
    pos = 0
    temp = s.upper()
    if temp in letter:
        # Index 0 ('A') is covered by the default pos = 0
        for x in range(1, len(letter)):
            if temp == letter[x]:
                pos = x
    return perc[pos]

def calc_ent(s):
    P = find_perc(s)
    # Calculates the entropy contribution of the current letter: P * log2(1/P)
    temp = P * (math.log(1 / P) / math.log(2))

    # Does the same thing just for binary entropy (I think)
    # temp = (-P*(math.log(P)/math.log(2)))-((1-P)*(math.log(1-P)/math.log(2)))
    return temp

# range(0, 25) covers indices 0..24, so 'Z' (index 25) is not included in this sum
for x in range(0, 25):
    total = total + calc_ent(letters[x])

print("The min bit is : %f" % total)
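For reference, the per-letter sum described above is the order-0 Shannon entropy, H = sum of p * log2(1/p) over all letters. Below is a more compact sketch of the same calculation over all 26 letters, using the frequency table from the question. The names `FREQ` and `order0_entropy` are made up for this illustration, and the renormalization step is an assumption added because rounded frequency tables do not always sum to exactly 1.

```python
import math

# Letter frequencies copied from the table in the question (A..Z).
FREQ = dict(zip(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    [0.082, 0.015, 0.028, 0.043, 0.127, 0.022, 0.02, 0.061, 0.07,
     0.002, 0.008, 0.04, 0.024, 0.067, 0.075, 0.019, 0.001, 0.060,
     0.063, 0.091, 0.028, 0.01, 0.023, 0.001, 0.02, 0.001],
))


def order0_entropy(freq):
    """Shannon entropy in bits per symbol, treating symbols as independent."""
    total = sum(freq.values())  # renormalize in case the table does not sum to 1
    return sum((p / total) * math.log2(total / p)
               for p in freq.values() if p > 0)


if __name__ == "__main__":
    print("Order-0 entropy: %f bits per letter" % order0_entropy(FREQ))
```

This figure only accounts for how often each letter occurs on its own; it says nothing about how predictable a letter is from the letters before it.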

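2 answers:

Answer 0: (score: 2)

There is no perfect compression, in the sense that it is not possible to compute the number of bits that "perfect compression" would produce. See Kolmogorov Complexity

You will not be able to implement, in a few lines of code, a compressor that approaches the limit of what computer programs have achieved in compressing English text, which is about one bit per character. Humans can do a little better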
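To get a feel for where a general-purpose compressor lands, here is a minimal sketch that measures the bits per character achieved by Python's standard `zlib` module. The `SAMPLE` text and the helper name `bits_per_char` are assumptions made for this illustration. The sample is very short, so the result is not representative of what strong text compressors achieve on large inputs.

```python
import zlib

# Hypothetical sample text; substitute a much longer English text for a
# more meaningful measurement.
SAMPLE = (
    "The entropy of a source tells us how many bits are needed, on average, "
    "to describe each symbol that the source produces. English text is far "
    "from random: after the letters th we usually expect an e, and after a q "
    "we almost always expect a u. A good compressor exploits this kind of "
    "structure, so the more context it can use, the fewer bits it has to "
    "spend on each character of the message."
)


def bits_per_char(text, level=9):
    """Compressed size in bits divided by the number of input bytes."""
    data = text.encode("utf-8")
    compressed = zlib.compress(data, level)
    return 8 * len(compressed) / len(data)


if __name__ == "__main__":
    print("zlib: %.2f bits per character" % bits_per_char(SAMPLE))
```

The measured value includes zlib's fixed header and checksum overhead, which matters on inputs this small.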

Answer 1: (score: 1)

The page you linked to links in turn to this page:

Refining the Estimated Entropy of English by Shannon Game Simulation

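If you read it carefully, the entropy there is not computed naively from the occurrence probability of each letter. Instead, it is computed as follows:

Subjects were shown the previous 100 characters of text and asked to guess the next character until they were successful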
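The guessing procedure quoted above can be imitated with a toy program. In this rough sketch a simple frequency-based predictor stands in for the human subject and sees only the previous `k` characters rather than 100. The names `SAMPLE`, `guess_histogram` and `entropy_bits` are made up for this illustration. The predictor is trained and tested on the same short sample, so its numbers are far too optimistic to be read as an estimate for English. In Shannon's approach, the entropy of the guess-count distribution is used as an upper-bound estimate of the entropy per character.

```python
import math
from collections import Counter, defaultdict

# Hypothetical sample text; a real experiment would use a large corpus and
# separate training and test data.
SAMPLE = (
    "the entropy of a source tells us how many bits are needed on average "
    "to describe each symbol that the source produces and the more context "
    "the predictor can use the fewer guesses it needs for the next letter"
)


def guess_histogram(text, k):
    """Count how many guesses a frequency-based predictor needs per character."""
    by_context = defaultdict(Counter)
    for i in range(k, len(text)):
        by_context[text[i - k:i]][text[i]] += 1
    # Fallback order for characters never seen after a given context.
    fallback = [c for c, _ in Counter(text).most_common()]

    histogram = Counter()
    for i in range(k, len(text)):
        seen = by_context[text[i - k:i]]
        ranked = [c for c, _ in seen.most_common()]
        ranked += [c for c in fallback if c not in seen]
        # The guess number is the 1-based rank of the true character.
        histogram[ranked.index(text[i]) + 1] += 1
    return histogram


def entropy_bits(histogram):
    """Entropy in bits of the distribution described by a histogram of counts."""
    total = sum(histogram.values())
    return sum((n / total) * math.log2(total / n) for n in histogram.values())


if __name__ == "__main__":
    for k in (0, 1, 2):
        hist = guess_histogram(SAMPLE, k)
        print("k=%d: %.2f bits per character" % (k, entropy_bits(hist)))
```

When the predictor usually succeeds on its first guess, the guess-count distribution is concentrated on 1 and its entropy is low.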

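So I don't think you got it wrong; the method you used is simply different. Using only the naive occurrence probabilities you cannot compress the information very well, but once you take context into account there is far more redundancy to exploit. For example, e has a probability of 0.127, but for the e in th_e it is probably more like 0.3.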
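The effect of context can be checked empirically. The sketch below estimates the entropy of the next character given the previous `k` characters, and the probability of a specific character after a specific context, from raw counts in a sample. `SAMPLE`, `conditional_entropy` and `next_char_prob` are names invented for this illustration. With a sample this small the conditional estimates are biased low, so treat the output as a demonstration of the trend rather than a measurement of English.

```python
import math
from collections import Counter, defaultdict

# Hypothetical sample text; use a large corpus for meaningful numbers.
SAMPLE = (
    "the entropy of a source tells us how many bits are needed on average "
    "to describe each symbol that the source produces and the more context "
    "the predictor can use the fewer bits it has to spend on the next letter"
)


def conditional_entropy(text, k):
    """Empirical entropy (bits) of the next character given the previous k."""
    by_context = defaultdict(Counter)
    for i in range(k, len(text)):
        by_context[text[i - k:i]][text[i]] += 1
    total = len(text) - k
    h = 0.0
    for counter in by_context.values():
        n_ctx = sum(counter.values())
        for n in counter.values():
            # p(context, char) * log2(1 / p(char | context))
            h += (n / total) * math.log2(n_ctx / n)
    return h


def next_char_prob(text, context, ch):
    """Empirical probability that `ch` follows `context` in `text`."""
    k = len(context)
    followers = Counter(text[i] for i in range(k, len(text))
                        if text[i - k:i] == context)
    n_ctx = sum(followers.values())
    return followers[ch] / n_ctx if n_ctx else 0.0


if __name__ == "__main__":
    for k in (0, 1, 2):
        print("k=%d: %.2f bits per character" % (k, conditional_entropy(SAMPLE, k)))
    print("P(e)      = %.3f" % next_char_prob(SAMPLE, "", "e"))
    print("P(e | th) = %.3f" % next_char_prob(SAMPLE, "th", "e"))
```

With `k = 0` this reduces to the per-letter entropy from the question; larger `k` corresponds to taking more context into account.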