How do I encode with actual bits in Python?

Time: 2018-02-01 17:40:54

Tags: python compression bits huffman-code encoder

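I built a Huffman encoder in Python, but because I store the bits (which represent the characters) as strings, the encoded text is larger than the original. How can I use actual bits to properly compress the text?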

1 answer:

Answer 0 (score: 1)

You can convert a str of 1s and 0s to an int type variable like this:

>>> int('10110001',2)
177

And you can convert ints back to strs of 1s and 0s like this:

>>> format(177,'b')
'10110001'

Also, note that you can write int literals in binary using a leading 0b, like this:

>>> foo = 0b10110001
>>> foo
177

Now, before you say "No, I asked for bits, not ints!" think about that for a second. An int variable isn't stored in the computer's hardware as a base-10 representation of the number; it's stored directly as bits.
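
To make the size difference concrete, here is a minimal sketch that compares a string of 1s and 0s with the same bits packed into bytes. It assumes Python 3, where int.to_bytes is available. The sample code value and the names num_bytes and packed are made up for illustration.

```python
# Hypothetical sample: 32 bits written as text, one character per bit.
code = '10110001' * 4
bitcode = int(code, 2)  # the same 32 bits held in a single int

# Files are written in whole bytes, so round the bit count up to a multiple of 8.
num_bytes = (len(code) + 7) // 8
packed = bitcode.to_bytes(num_bytes, byteorder='big')

print(len(code))    # 32 characters as a string
print(len(packed))  # 4 bytes once packed
```

The packed bytes object is what you would write to a file opened in binary mode.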


EDIT: Stefan Pochmann points out that this will drop leading zeros. Consider:

>>> code = '000010110001'
>>> bitcode = int(code, 2)
>>> format(bitcode, 'b')
'10110001'

So how do you keep the leading zeros? There are a few ways. Which one fits will likely depend on whether you convert each character's code to an int first and then combine them, or concatenate the strings of 1s and 0s and convert the whole thing to an int in one step. The latter will probably be much simpler. One approach that works well for the latter is to store the length of the code and then use it with this syntax:

>>> format(bitcode, '012b')
'000010110001'

where '012b' tells the format function to pad the left of the string with enough zeros to ensure a minimum length of 12. So you can use it in this way:

>>> code = '000010110001'
>>> code_length = len(code)
>>> bitcode = int(code, 2)
>>> format(bitcode, '0{}b'.format(code_length))
'000010110001'

Finally, if the {} and the second format call are unfamiliar to you, read up on string formatting.
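
Putting the pieces together, the following is one possible sketch of a round trip that keeps the leading zeros by storing the bit count next to the packed bytes. It assumes Python 3 (int.to_bytes and int.from_bytes). The helper names pack_bits and unpack_bits are hypothetical, not part of any library.

```python
def pack_bits(code):
    """Pack a string of '0'/'1' characters into (bit_count, bytes)."""
    bit_count = len(code)
    num_bytes = (bit_count + 7) // 8  # round up to whole bytes
    # int('', 2) raises ValueError, so treat an empty code as zero.
    value = int(code, 2) if code else 0
    return bit_count, value.to_bytes(num_bytes, byteorder='big')


def unpack_bits(bit_count, data):
    """Rebuild the original bit string, leading zeros included."""
    if bit_count == 0:
        return ''
    value = int.from_bytes(data, byteorder='big')
    # Pad on the left with zeros up to the stored length.
    return format(value, '0{}b'.format(bit_count))


bit_count, data = pack_bits('000010110001')
restored = unpack_bits(bit_count, data)
print(bit_count, data, restored)
```

A real encoder would also need to save bit_count (for example in a small file header) so the decoder knows how many padding zeros to restore.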