Question

I have the Unicode code point of an emoticon, U+1F498:

emoticon = u'\U0001f498'

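I want to get the UTF-16 code units of this character as decimal numbers, which according to this website are 55357 and 56472.

I tried print emoticon.encode("utf16"), but that didn't help at all because it printed some other characters as well.

Decoding from UTF-8 before encoding to UTF-16, as in print str(int("0001F498", 16)).decode("utf-8").encode("utf16"), did not help either.

How do I correctly get the UTF-16 decimal groups of a Unicode character?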

Answer 1

You can encode the character with the utf-16 encoding and then convert every 2 bytes of the encoded data to an integer with int.from_bytes (or struct.unpack in Python 2).

Python 3

def utf16_decimals(char, chunk_size=2):
    # encode the character as big-endian utf-16
    encoded_char = char.encode('utf-16-be')

    # convert every `chunk_size` bytes to an integer
    decimals = []
    for i in range(0, len(encoded_char), chunk_size):
        chunk = encoded_char[i:i+chunk_size]
        decimals.append(int.from_bytes(chunk, 'big'))

    return decimals

Python 2 + Python 3

import struct

def utf16_decimals(char):
    # encode the character as big-endian utf-16
    encoded_char = char.encode('utf-16-be')

    # convert every 2 bytes to an integer
    decimals = []
    for i in range(0, len(encoded_char), 2):
        chunk = encoded_char[i:i+2]
        decimals.append(struct.unpack('>H', chunk)[0])

    return decimals

Result:

>>> utf16_decimals(u'\U0001f498')
[55357, 56472]
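
The explicit 'utf-16-be' codec matters here. As far as I can tell, the "other characters" seen in the question come from the byte order mark (BOM) that the plain 'utf-16' codec prepends. A minimal Python 3 sketch of the difference (the variable names are only for illustration):

```python
import codecs

char = '\U0001f498'

with_bom = char.encode('utf-16')        # BOM + data in the platform's byte order
without_bom = char.encode('utf-16-be')  # data only, big-endian, no BOM

# The first two bytes of the plain 'utf-16' output are the BOM
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom, len(with_bom), len(without_bom))  # True 6 4
```

With the BOM out of the way, the 4 remaining bytes split cleanly into two 16-bit code units.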

Answer 2

On a Python 2 "narrow" build, it's simple:

>>> emoticon = u'\U0001f498'
>>> map(ord,emoticon)
[55357, 56472]
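
For comparison, on Python 3.3 and later (and on wide Python 2 builds) strings are indexed by code point rather than by UTF-16 code unit, so the same idea yields a single number. A minimal Python 3 sketch:

```python
emoticon = '\U0001f498'

# One element per code point, not per UTF-16 code unit
code_points = [ord(c) for c in emoticon]
print(code_points)  # [128152], i.e. 0x1F498
```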

This works in Python 2 (both narrow and wide builds) and Python 3:

from __future__ import print_function
import struct

emoticon = u'\U0001f498'
print(struct.unpack('<2H',emoticon.encode('utf-16le')))

Output:

(55357, 56472)

Here is a more general solution that prints the UTF-16 code units of a string of any length:

from __future__ import print_function,division
import struct

def utf16words(s):
    encoded = s.encode('utf-16le')
    num_words = len(encoded) // 2
    return struct.unpack('<{}H'.format(num_words),encoded)

emoticon = u'ABC\U0001f498'
print(utf16words(emoticon))

Output:

(65, 66, 67, 55357, 56472)
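
If you want to see where 55357 and 56472 come from, the surrogate pair can also be computed straight from the code point, without any encoding step. This is only a sketch of the standard UTF-16 arithmetic; the function name surrogate_pair is made up for illustration:

```python
def surrogate_pair(code_point):
    """Return the UTF-16 code units for a code point as a tuple of ints."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError('not a valid Unicode code point')
    if code_point < 0x10000:
        # Characters in the Basic Multilingual Plane need a single code unit
        return (code_point,)
    offset = code_point - 0x10000           # 20-bit value
    high = 0xD800 + (offset >> 10)          # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits -> low surrogate
    return (high, low)

print(surrogate_pair(0x1F498))  # (55357, 56472)
```

For U+1F498 the offset is 0xF498, whose top 10 bits are 61 and bottom 10 bits are 152, giving 55296 + 61 = 55357 and 56320 + 152 = 56472.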

How to get UTF-16 (decimal) in Python?
