Question

I'm working in Python 2 and I have a string containing emojis as well as other Unicode characters. I need to convert it to a list where each entry is a single character/emoji.

x = u'😘😘xyz😊😊'
char_list = [c for c in x]

The desired output is:

['😘', '😘', 'x', 'y', 'z', '😊', '😊']

The actual output is:

[u'\ud83d', u'\ude18', u'\ud83d', u'\ude18', u'x', u'y', u'z', u'\ud83d', u'\ude0a', u'\ud83d', u'\ude0a']

How can I get the desired output?

Answer 1

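First of all, in Python 2 you need to use Unicode strings (u'<...>') for Unicode characters to be treated as Unicode characters. And set the correct source encoding if you want to use the characters themselves rather than the \UXXXXXXXX representation in source code.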

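Now, as per "Python: getting correct string length when it contains surrogate pairs" and "Python returns length of 2 for single Unicode character string", in Python 2 "narrow" builds (those with sys.maxunicode==65535), characters outside the Basic Multilingual Plane (code points above U+FFFF) are represented as surrogate pairs, and this is not transparent to string functions. This was only fixed in 3.3 (PEP 393).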
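A quick way to see which kind of build you are on is to check sys.maxunicode and the length of a single non-BMP character. This is only a minimal sketch; the helper name is_narrow_build is made up for illustration.

```python
import sys

def is_narrow_build():
    """Return True if this interpreter stores unicode as UTF-16 code units."""
    return sys.maxunicode == 0xFFFF

# One character outside the Basic Multilingual Plane (U+1F618).
kiss = u'\U0001f618'

# On a narrow Python 2 build this prints "True 2";
# on a wide build or on Python 3.3+ it prints "False 1".
print("%s %d" % (is_narrow_build(), len(kiss)))
```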

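The simplest solution (other than migrating to 3.3+) is to compile a Python "wide" build from source, as outlined in the 3rd link. In a wide build, every Unicode character takes 4 bytes (so it is a potential memory hog), but if you need to handle wide Unicode characters routinely, this is probably an acceptable price.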

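The solution for a "narrow" build is to write a custom set of string functions (len, slice; maybe as a subclass of unicode) that detect surrogate pairs and handle them as a single character. I couldn't readily find an existing one (which is strange), but it's not too hard to write: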

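As per "UTF-16#U+10000 to U+10FFFF" on Wikipedia,
- the 1st character (high surrogate) is in the range 0xD800..0xDBFF
- the 2nd character (low surrogate) is in the range 0xDC00..0xDFFF
- these ranges are reserved and thus cannot occur as regular characters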
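To make the ranges concrete, here is a small sketch of the standard UTF-16 arithmetic that maps a code point to its surrogate pair and back. The function names are hypothetical; the emoji from the question, U+1F618, comes out as the \ud83d\ude18 pair seen in the actual output.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % code_point)
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print([hex(u) for u in to_surrogate_pair(0x1F618)])  # ['0xd83d', '0xde18']
```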

So here's the code to detect a surrogate pair:

def is_surrogate(s,i):
    if 0xD800 <= ord(s[i]) <= 0xDBFF:
        try:
            l = s[i+1]
        except IndexError:
            return False
        if 0xDC00 <= ord(l) <= 0xDFFF:
            return True
        else:
            raise ValueError("Illegal UTF-16 sequence: %r" % s[i:i+2])
    else:
        return False
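The same range check can back a surrogate-aware replacement for len. The sketch below is self-contained and uses a made-up name, char_len. Unlike the detection function above, it counts a lone high surrogate as one character instead of raising, which is an assumption you may want to change.

```python
def char_len(s):
    """Count characters in s, treating each surrogate pair as one character."""
    count = 0
    i = 0
    while i < len(s):
        # A high surrogate followed by a low surrogate is one character.
        if (0xD800 <= ord(s[i]) <= 0xDBFF and i + 1 < len(s)
                and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF):
            i += 2
        else:
            i += 1
        count += 1
    return count

# Surrogates written out explicitly, the way a narrow build stores them.
s = u'\ud83d\ude18xyz\ud83d\ude0a'
print(char_len(s))  # 5
```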

And a function that returns a simple slice:

def slice(s,start,end):
    l=len(s)
    i=0
    while i<start and i<l:
        if is_surrogate(s,i):
            start+=1
            end+=1
            i+=1
        i+=1
    while i<end and i<l:
        if is_surrogate(s,i):
            end+=1
            i+=1
        i+=1
    return s[start:end]

The price you pay here is performance, since these functions are much slower than the built-ins:

>>> import timeit
>>> ux=u"a"*5000+u"\U00100000"*30000+u"b"*50000
>>> timeit.timeit('slice(ux,10000,100000)','from __main__ import slice,ux',number=1000)
46.44128203392029    # total seconds for 1000 calls, i.e. ~46 msec per call
>>> timeit.timeit('ux[10000:100000]','from __main__ import slice,ux',number=1000000)
8.814016103744507    # total seconds for 1000000 calls, i.e. ~8.8 usec per call
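For the list the question actually asks for, the same pair detection can drive a simple splitter. This is a sketch under the same assumptions as above: split_chars is a made-up name, and lone surrogates are passed through as single entries rather than rejected.

```python
def split_chars(s):
    """Return the characters of s as a list, keeping surrogate pairs together."""
    chars = []
    i = 0
    while i < len(s):
        is_pair = (0xD800 <= ord(s[i]) <= 0xDBFF and i + 1 < len(s)
                   and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF)
        step = 2 if is_pair else 1
        chars.append(s[i:i + step])
        i += step
    return chars

# The question's string, with the surrogates spelled out as on a narrow build.
x = u'\ud83d\ude18\ud83d\ude18xyz\ud83d\ude0a\ud83d\ude0a'
char_list = split_chars(x)
print(len(char_list))  # 7
```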

Answer 2

I would use the uniseg library (pip install uniseg):

# -*- coding: utf-8 -*-
from uniseg import graphemecluster as gc

print list(gc.grapheme_clusters(u'😘😘xyz😊😊'))

which outputs [u'\U0001f618', u'\U0001f618', u'x', u'y', u'z', u'\U0001f60a', u'\U0001f60a'], and

[x.encode('utf-8') for x in gc.grapheme_clusters(u'😘😘xyz😊😊')]

will give a list of the characters as UTF-8 encoded strings.
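To show what the encoded entries look like without installing anything, here is a small sketch that encodes a hand-written list of characters. The list stands in for the output of the clustering call and is an assumption made for illustration, not something uniseg produced here.

```python
# Two characters written with \U escapes so the source file stays ASCII.
chars = [u'\U0001f618', u'x']

# Encoding each character yields byte strings (plain str in Python 2).
encoded = [c.encode('utf-8') for c in chars]

# The emoji becomes the four bytes F0 9F 98 98; 'x' stays a single byte.
print(repr(encoded[0]))
```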
