How do I display or convert UTF-8 strings to the correct symbols?

Time: 2015-06-02 04:29:12

Tags: python unicode encoding utf-8

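I have a list in which WhatsApp emoji are encoded as UTF-8 byte sequences. The table I use to decode the emoji is at http://apps.timwhitlock.info/emoji/tables/unicode

Using this table, I am trying to count how often each emoji is used, which I have managed to do with regular expressions. The problem is that I ended up with a dictionary whose keys are the UTF-8 byte sequences written out as strings and whose values are integers. The following code: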

print d_emo
for k, v in d_emo.items():
    print k.encode('utf8'), v

produces this output:

{'\\xF0\\x9F\\x98\\xA2': 2, '\\xF0\\x9F\\x98\\x82': 1, '\\xF0\\x9F\\x98\\x86': 2, '\\xF0\\x9F\\x98\\x89': 1, '\\xF0\\x9F\\x8D\\xB5': 2, '\\xF0\\x9F\\x8D\\xB0': 4, '\\xF0\\x9F\\x8D\\xAB': 2, '\\xF0\\x9F\\x8D\\xA9': 2, '\\xF0\\x9F\\x98\\x98': 1, '\\xE2\\x98\\xBA': 33, '\\xE2\\x98\\x95': 1}
\xF0\x9F\x98\xA2 2
\xF0\x9F\x98\x82 1
\xF0\x9F\x98\x86 2
\xF0\x9F\x98\x89 1
\xF0\x9F\x8D\xB5 2
\xF0\x9F\x8D\xB0 4
\xF0\x9F\x8D\xAB 2
\xF0\x9F\x8D\xA9 2
\xF0\x9F\x98\x98 1
\xE2\x98\xBA 33
\xE2\x98\x95 1

If I use this code:

for k, v in d_emo.items():
    print k.encode('utf-8').decode('unicode_escape'), v

I get:

ð¢ 2
ð 1
ð 2
ð 1
ðµ 2
ð° 4
ð« 2
ð© 2
ð 1
⺠33
â 1

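I should be getting smiley faces and so on. Any suggestions? This is on Python 2.7.

1 answer:

Answer 0: (score: 1)

The following decodes the Unicode characters correctly, but in Python 2.X you run into some limitations with characters outside the BMP (Basic Multilingual Plane, characters U+0000 to U+FFFF):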

import unicodedata as ud
D = {'\\xF0\\x9F\\x98\\xA2': 2, '\\xF0\\x9F\\x98\\x82': 1, '\\xF0\\x9F\\x98\\x86': 2, '\\xF0\\x9F\\x98\\x89': 1, '\\xF0\\x9F\\x8D\\xB5': 2, '\\xF0\\x9F\\x8D\\xB0': 4, '\\xF0\\x9F\\x8D\\xAB': 2, '\\xF0\\x9F\\x8D\\xA9': 2, '\\xF0\\x9F\\x98\\x98': 1, '\\xE2\\x98\\xBA': 33, '\\xE2\\x98\\x95': 1}
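for k,v in D.iteritems():
    # Interpret the literal \xNN escapes, map the resulting code points
    # back to bytes via Latin-1, then decode those bytes as UTF-8.
    k = k.decode('unicode-escape').encode('latin1').decode('utf8')
    try:
        n = ud.name(k)
    except (TypeError, ValueError):
        # ValueError: the character has no name in this Unicode database.
        # TypeError: narrow builds store non-BMP characters as two code units.
        n = 'no such name'
    print k,repr(k),n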

Output:

☺ u'\u263a' WHITE SMILING FACE
🍩 u'\U0001f369' no such name
☕ u'\u2615' HOT BEVERAGE
😂 u'\U0001f602' no such name
🍫 u'\U0001f36b' no such name
😢 u'\U0001f622' no such name
😉 u'\U0001f609' no such name
😘 u'\U0001f618' no such name
😆 u'\U0001f606' no such name
🍵 u'\U0001f375' no such name
🍰 u'\U0001f370' no such name
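
As far as I can tell, the "no such name" results come from the age of the bundled Unicode database rather than from the decoding itself: Python 2.7's unicodedata module is based on Unicode 5.2.0, while these emoji were added in Unicode 6.0. A minimal sketch for checking which database version an interpreter ships with, and for looking up names without a try/except (the helper name `name_or_default` is made up for this example):

```python
import unicodedata

# Version of the Unicode Character Database this interpreter was built with.
print(unicodedata.unidata_version)


def name_or_default(ch, default='no such name'):
    # name_or_default is a hypothetical helper name used only here.
    # unicodedata.name() returns `default` instead of raising ValueError
    # when the database has no name for the character.
    return unicodedata.name(ch, default)


print(name_or_default(u'\u263a'))      # a BMP character with a long-standing name
print(name_or_default(u'\U0001f369'))  # result depends on the database version
```

Note that on a narrow Python 2 build the second lookup would still raise TypeError, because the non-BMP character is stored as a surrogate pair.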

It works better in Python 3.X:

import unicodedata as ud
D = {b'\\xF0\\x9F\\x98\\xA2': 2, b'\\xF0\\x9F\\x98\\x82': 1, b'\\xF0\\x9F\\x98\\x86': 2, b'\\xF0\\x9F\\x98\\x89': 1, b'\\xF0\\x9F\\x8D\\xB5': 2, b'\\xF0\\x9F\\x8D\\xB0': 4, b'\\xF0\\x9F\\x8D\\xAB': 2, b'\\xF0\\x9F\\x8D\\xA9': 2, b'\\xF0\\x9F\\x98\\x98': 1, b'\\xE2\\x98\\xBA': 33, b'\\xE2\\x98\\x95': 1}
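for k,v in D.items():
    # Same three steps as above; the keys are bytes objects here.
    k = k.decode('unicode-escape').encode('latin1').decode('utf8')
    try:
        n = ud.name(k)
    except ValueError:
        # Raised when the character has no name in the Unicode database.
        n = 'no such name'
    print(k,ascii(k),n)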

Output (note that your font must support the characters):

😘 '\U0001f618' FACE THROWING A KISS
🍰 '\U0001f370' SHORTCAKE
😢 '\U0001f622' CRYING FACE
🍫 '\U0001f36b' CHOCOLATE BAR
🍵 '\U0001f375' TEACUP WITHOUT HANDLE
🍩 '\U0001f369' DOUGHNUT
😂 '\U0001f602' FACE WITH TEARS OF JOY
😉 '\U0001f609' WINKING FACE
☕ '\u2615' HOT BEVERAGE
😆 '\U0001f606' SMILING FACE WITH OPEN MOUTH AND TIGHTLY-CLOSED EYES
☺ '\u263a' WHITE SMILING FACE
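
To see why the three-step conversion works, and why stopping after `unicode_escape` gives characters like ð, it can help to run each step on its own. The sketch below is for Python 3 and uses keys from the question; the helper name `unescape_utf8` is made up for illustration and is not part of any library.

```python
import unicodedata


def unescape_utf8(escaped):
    """Decode literal backslash-x escape text into the character it encodes.

    unescape_utf8 is a hypothetical helper name used only for this example.
    """
    if isinstance(escaped, str):
        # The escape text is pure ASCII, so this conversion is lossless.
        escaped = escaped.encode('ascii')
    # Step 1: interpret the \xNN escapes. Each one becomes a single code
    # point in the range U+0000..U+00FF, not a byte of UTF-8.
    as_code_points = escaped.decode('unicode_escape')
    # Step 2: Latin-1 maps code points 0..255 to identical byte values,
    # which recovers the original UTF-8 byte sequence.
    as_bytes = as_code_points.encode('latin1')
    # Step 3: decode that byte sequence as UTF-8.
    return as_bytes.decode('utf8')


key = r'\xF0\x9F\x98\xA2'  # 16 literal characters, as stored in the dict
step1 = key.encode('ascii').decode('unicode_escape')
print(ascii(step1))        # four code points, starting with U+00F0 (the ð)
step2 = step1.encode('latin1')
print(step2)               # the raw UTF-8 bytes
char = step2.decode('utf8')
print(ascii(char), unicodedata.name(char))

# Rebuild the counts with real characters as keys.
d_emo = {r'\xF0\x9F\x98\xA2': 2, r'\xE2\x98\xBA': 33}
decoded = {unescape_utf8(k): v for k, v in d_emo.items()}
print(decoded)
```

Converting the keys once, right after counting, keeps the rest of the program working with real Unicode strings instead of escape text.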