In Python 2.7

Time: 2016-07-29 06:33:57

Tags: python python-2.7 unicode utf-8


with open(name, 'r') as content_file:
    content = content_file.read()
    for i in range(10):
        print content[i]


Regards, Lin

2 answers:

Answer 0 (score: 13)




utfbytes = "\xc2\xa9 \xc2\xae \xe2\x84\xa2"
print utfbytes, len(utfbytes)
for b in utfbytes:
    print b, repr(b)

uni = utfbytes.decode('utf-8')
print uni, len(uni)


© ® ™ 9
� '\xc2'
� '\xa9'
  ' '
� '\xc2'
� '\xae'
  ' '
� '\xe2'
� '\x84'
� '\xa2'
© ® ™ 5
Stack Overflow co-founder Joel Spolsky has written a good article on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

You should also take a look at the Unicode HOWTO article in the Python documentation, and at Ned Batchelder's Pragmatic Unicode article, a.k.a. "Unipain".
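Applied to the file-reading loop in the question, the decode step can be pushed into the file object itself, so that indexing the content yields characters rather than bytes. The following is only a minimal sketch: the helper name `first_chars` and the temporary demo file are made up for illustration, and it assumes the file really is UTF-8 encoded. It is written in Python 3 syntax; `io.open` also exists in Python 2.7 (in Python 3 it is the same function as the built-in `open`).

```python
import io
import os
import tempfile


def first_chars(path, n=10, encoding="utf-8"):
    """Return up to n characters (not bytes) from the start of a text file.

    Hypothetical helper for illustration; assumes the file is valid UTF-8.
    """
    # io.open decodes while reading, so content is text rather than raw bytes.
    with io.open(path, "r", encoding=encoding) as content_file:
        content = content_file.read()
    return list(content[:n])


# Demo: write the nine sample bytes from the answer to a temporary file.
fd, path = tempfile.mkstemp()
os.close(fd)
with io.open(path, "wb") as f:
    f.write(b"\xc2\xa9 \xc2\xae \xe2\x84\xa2")

chars = first_chars(path)
os.remove(path)

print(len(chars))  # 5 characters, although the file holds 9 bytes
```

Slicing with `content[:n]` also avoids the IndexError that `content[i]` would raise in the question's loop when the file holds fewer than ten characters.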


utfbytes = "\xc2\xa9 \xc2\xae \xe2\x84\xa2"
widths = (2, 1, 2, 1, 3)
start = 0
for w in widths:
    print "%d %d [%s]" % (start, w, utfbytes[start:start+w])
    start += w


0 2 [©]
2 1 [ ]
3 2 [®]
5 1 [ ]
6 3 [™]

FWIW, here's a Python 3 version of that code:

utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"
widths = (2, 1, 2, 1, 3)
start = 0
for w in widths:
    s = utfbytes[start:start+w]
    print("%d %d [%s]" % (start, w, s.decode()))
    start += w

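If we don't know the byte widths of the characters in a UTF-8 string, we need to do a little more work. Each UTF-8 sequence encodes its own width in its first byte, as described in the Wikipedia article on UTF-8.

The following Python 2 demo shows how to extract that width information; it produces the same output as the previous two snippets.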

# UTF-8 code widths
#width starting byte
#1 0xxxxxxx
#2 110xxxxx
#3 1110xxxx
#4 11110xxx
#C 10xxxxxx

def get_width(b):
    if b <= '\x7f':
        return 1
    elif '\x80' <= b <= '\xbf':
        #Continuation byte
        raise ValueError('Bad alignment: %r is a continuation byte' % b)
    elif '\xc0' <= b <= '\xdf':
        return 2
    elif '\xe0' <= b <= '\xef':
        return 3
    elif '\xf0' <= b <= '\xf7':
        return 4
    else:
        raise ValueError('%r is not a single byte' % b)

utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"
start = 0
while start < len(utfbytes):
    b = utfbytes[start]
    w = get_width(b)
    s = utfbytes[start:start+w]
    print "%d %d [%s]" % (start, w, s)
    start += w


For the curious, here's a Python 3 version of {3}, and a function that decodes a UTF-8 byte string manually.
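The linked code did not survive in this copy of the thread, so what follows is not the answerer's original. It is a rough Python 3 sketch of the same two ideas, assuming the lead-byte ranges used in the Python 2 demo above. In Python 3, indexing a `bytes` object yields an int, so the comparisons use integers instead of one-character strings. The name `decode_utf8` is made up here, and the function skips the stricter checks a real decoder performs (overlong forms, surrogates, code points above U+10FFFF).

```python
def get_width(b):
    """Return the sequence width implied by lead byte b (an int in 0..255)."""
    if b <= 0x7F:
        return 1
    elif 0x80 <= b <= 0xBF:
        # Continuation byte
        raise ValueError('Bad alignment: %r is a continuation byte' % b)
    elif 0xC0 <= b <= 0xDF:
        return 2
    elif 0xE0 <= b <= 0xEF:
        return 3
    elif 0xF0 <= b <= 0xF7:
        return 4
    else:
        raise ValueError('%r is not a valid lead byte' % b)


def decode_utf8(data):
    """Decode a UTF-8 bytes object by hand and return a str (sketch only)."""
    chars = []
    start = 0
    while start < len(data):
        lead = data[start]
        w = get_width(lead)
        seq = data[start:start + w]
        if len(seq) < w:
            raise ValueError('Truncated sequence at offset %d' % start)
        if w == 1:
            code = lead
        else:
            # Keep the payload bits of the lead byte: 5, 4 or 3 bits for w = 2, 3, 4.
            code = lead & (0xFF >> (w + 1))
            for c in seq[1:]:
                if not 0x80 <= c <= 0xBF:
                    raise ValueError('%r is not a continuation byte' % c)
                # Each continuation byte contributes its low 6 bits.
                code = (code << 6) | (c & 0x3F)
        chars.append(chr(code))
        start += w
    return ''.join(chars)


utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"
decoded = decode_utf8(utfbytes)
print(decoded == utfbytes.decode('utf-8'))  # True
```

In practice `bytes.decode('utf-8')` is the thing to use; the manual version is only there to show where the width and payload bits live.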




Answer 1 (score: 5)


s = unicode(your_object).encode('utf8')
print s