Question

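I noticed that when you type emoji in a text message on a phone, some of them take up 1 character and some take up 2. Example: "♊" takes 1 character but "😁" takes 2.
In Python, I tried to get the length of the emoji and got: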

len("♊") # 3
len("😁") # 4
len(unicode("♊", "utf-8")) # 1 OH IT WORKS!
len(unicode("😁", "utf-8")) # 1 Oh wait, no it doesn't.

Any ideas? This site lists the emoji length in the Character.charCount() row: http://www.fileformat.info/info/unicode/char/1F601/index.htm

Answer 1

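Read up on sys.maxunicode:

An integer giving the value of the largest Unicode code point, i.e. 1114111 (0x10FFFF in hexadecimal).

Changed in version 3.3: Before PEP 393, sys.maxunicode used to be either 0xFFFF or 0x10FFFF, depending on the configuration option that specified whether Unicode characters were stored as UCS-2 or UCS-4.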
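As a minimal sketch of what the quoted documentation implies, the check below reports whether the running interpreter is a "narrow" (UCS-2) build. The helper name is_narrow_build is hypothetical and not part of any library; the emoji is written as an escape sequence so the result does not depend on the source file's encoding.

```python
import sys

def is_narrow_build():
    """Hypothetical helper: True when sys.maxunicode is 0xFFFF (UCS-2 build)."""
    return sys.maxunicode == 0xFFFF

# U+1F601 lies above the Basic Multilingual Plane.
grinning = u'\U0001f601'

# Narrow build: True and length 2. Wide build or Python 3.3+: False and length 1.
print(is_narrow_build(), len(grinning))
```

On Python 3.3 and later the first value should always be False, because PEP 393 removed the narrow/wide distinction.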

The following script should work in both Python 2 and Python 3:

# coding=utf-8

from __future__ import print_function
import sys, platform, unicodedata

print( platform.python_version(), 'maxunicode', hex(sys.maxunicode))
tab = '\t'
unistr = u'\u264a \U0001f601'                          ###   unistr = u'♊ 😁'
print ( len(unistr), tab, unistr, tab, repr( unistr))
for char in unistr:
    print (len(char), tab, char, tab, repr(char), tab, 
        unicodedata.category(char), tab, unicodedata.name(char,'private use'))

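The output shows the consequences of different sys.maxunicode values. For example, the 😁 character (Unicode code point 0x1f601, above the Basic Multilingual Plane) is converted to the corresponding surrogate pair (code points u'\ud83d' and u'\ude01') when sys.maxunicode is 0xFFFF: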

PS D:\PShell> [System.Console]::OutputEncoding = [System.Text.Encoding]::UTF8

PS D:\PShell> . py -3 D:\test\Python\Py\42783173.py
3.5.1 maxunicode 0x10ffff
3      ♊ 😁    '♊ 😁'
1      ♊      '♊'      So      GEMINI
1             ' '      Zs      SPACE
1      😁      '😁'      So      GRINNING FACE WITH SMILING EYES

PS D:\PShell> . py -2 D:\test\Python\Py\42783173.py
2.7.12 maxunicode 0xffff
4      ♊ 😁    u'\u264a \U0001f601'
1      ♊      u'\u264a'    So      GEMINI
1             u' '         Zs      SPACE
1      ��     u'\ud83d'    Cs      private use
1      ��     u'\ude01'    Cs      private use

Note: the output samples above come from the Unicode-aware PowerShell ISE console pane.
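If the goal is a count that matches the Character.charCount() value from the question regardless of Python version or build, one option is to count UTF-16 code units explicitly. The sketch below assumes that this is the quantity the phone is counting; the helper names utf16_length and surrogate_pair are hypothetical and used only for illustration.

```python
def utf16_length(text):
    """Count UTF-16 code units; each one is 2 bytes in UTF-16-LE (no BOM)."""
    return len(text.encode('utf-16-le')) // 2

def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not above the BMP')
    offset = code_point - 0x10000
    # The top 10 bits go to the high surrogate, the bottom 10 to the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

print(utf16_length(u'\u264a'))                           # 1
print(utf16_length(u'\U0001f601'))                       # 2
print([hex(unit) for unit in surrogate_pair(0x1F601)])   # ['0xd83d', '0xde01']
```

The pair computed by surrogate_pair matches the u'\ud83d' and u'\ude01' code units shown in the Python 2.7 output above.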

Getting the proper length of an emoji
