How to correctly tabulate unicode data

Date: 2013-12-18 10:20:53

Tags: python unicode

(I'm on Python 2.7)

I have this test:

# -*- coding: utf-8 -*-

import binascii

test_cases = [
    'aaaaa',    # Normal bytestring
    'ááááá',    # Normal bytestring, but with extended ascii. Since the file is utf-8 encoded, this is utf-8 encoded
    'ℕℤℚℝℂ',    # Encoded unicode. The editor has encoded this, and it is defined as string, so it is left encoded by python
    u'aaaaa',   # unicode object. The string itself is utf-8 encoded, as defined in the "coding" directive at the top of the file
    u'ááááá',   # unicode object. The string itself is utf-8 encoded, as defined in the "coding" directive at the top of the file
    u'ℕℤℚℝℂ',   # unicode object. The string itself is utf-8 encoded, as defined in the "coding" directive at the top of the file
]
FORMAT = '%-20s -> %2d %-20s %-30s %-30s'
for data in test_cases :
    try:
        hexlified = binascii.hexlify(data)
    except:
        hexlified = None
    print FORMAT % (data, len(data), type(data), hexlified, repr(data))

It produces this output:

aaaaa                ->  5 <type 'str'>         6161616161                     'aaaaa'                       
ááááá           -> 10 <type 'str'>         c3a1c3a1c3a1c3a1c3a1           '\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1\xc3\xa1'
ℕℤℚℝℂ      -> 15 <type 'str'>         e28495e284a4e2849ae2849de28482 '\xe2\x84\x95\xe2\x84\xa4\xe2\x84\x9a\xe2\x84\x9d\xe2\x84\x82'
aaaaa                ->  5 <type 'unicode'>     6161616161                     u'aaaaa'                      
ááááá                ->  5 <type 'unicode'>     None                           u'\xe1\xe1\xe1\xe1\xe1'       
ℕℤℚℝℂ                ->  5 <type 'unicode'>     None                           u'\u2115\u2124\u211a\u211d\u2102'

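As you can see, the columns are not aligned correctly for strings that contain non-ASCII characters. This is because the length of those strings in bytes is greater than the number of unicode characters. How can I tell print to count characters, rather than bytes, when padding the fields?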

1 answer:

Answer 0 (score: 3)

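When Python 2.7 sees 'ℕℤℚℝℂ', it reads "here are 15 arbitrary byte values". It does not know which characters they represent, nor which encoding they are in. You need to decode the byte string into a unicode string, specifying the encoding, before you can expect Python to count characters: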

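for data in test_cases:
    try:
        hexlified = binascii.hexlify(data)  # hex of the original value, taken before decoding
    except Exception:
        hexlified = None
    if isinstance(data, bytes):  # bytes is an alias for str in Python 2.7
        data = data.decode('utf-8')
    print FORMAT % (data, len(data), type(data), hexlified, repr(data))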
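
The effect of decoding on the length and on the padding can be checked in isolation. The following is a minimal sketch, not part of the original answer; the names raw, text and padded are illustrative, and the characters are written as \u escapes so the result does not depend on the source file encoding.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: byte length versus character length.

# The five characters from the question, encoded to UTF-8 bytes.
raw = u'\u2115\u2124\u211a\u211d\u2102'.encode('utf-8')

# Decoding turns the bytes back into a unicode string.
text = raw.decode('utf-8')

byte_count = len(raw)    # counts bytes: each of these characters takes 3 bytes
char_count = len(text)   # counts characters (code points)

# Padding a unicode string with %-20s pads to 20 characters, not 20 bytes.
padded = u'%-20s|' % text
```

Here byte_count is 15 and char_count is 5, which matches the lengths shown in the question's output, and padded is 21 characters long (20 for the field plus the trailing bar).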

Note that in Python 3, all string literals are unicode objects by default.
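
As a rough illustration of that note, here is a sketch that assumes Python 3, where no decoding step is needed because the literals are already unicode. The names test_cases and rows are illustrative, and str.format is used in place of the % operator purely as a matter of taste.

```python
# Sketch for Python 3: str literals are unicode, so padding counts characters.
test_cases = [
    'aaaaa',
    '\u00e1' * 5,                       # five "a with acute" characters
    '\u2115\u2124\u211a\u211d\u2102',   # the five double-struck letters
]

# Left-align each string in a 20-character field, then show its length.
rows = ['{:<20} -> {:2d}'.format(s, len(s)) for s in test_cases]

for row in rows:
    print(row)
```

Every row has the same number of characters. Note that padding counts code points, not terminal columns, so characters that a terminal renders wider or narrower than one cell can still look misaligned.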