Question

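In Unicode, a character can have the Emoji property.

Is there a standard way in Python to determine whether a character is an emoji?

I know about unicodedata, but it doesn't seem to expose these additional character details.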

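Note: I am asking about the specific property named "Emoji" in the Unicode standard, as given in the link. I don't want an arbitrary list of pattern ranges, and I would prefer to use the standard library.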
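To illustrate the gap, here is a minimal sketch of what unicodedata does report for a character. The helper name `describe` is made up for this example; only `unicodedata.name` and `unicodedata.category` are real calls.

```python
import unicodedata

def describe(char):
    """Return the (name, general category) pair that unicodedata knows about."""
    return unicodedata.name(char), unicodedata.category(char)

# U+1F600 is reported as a generic "Symbol, other"; nothing says "Emoji".
grinning = describe('\U0001F600')

# '#' carries the Emoji property in emoji-data.txt, yet it is plain punctuation here,
# so the general category cannot stand in for the Emoji property.
number_sign = describe('#')
```

Neither value mentions the Emoji property, which is why the answers below fall back on the emoji-data.txt file.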

Answer 1

According to the question I linked, there is no built-in property to check, but you can build your own pattern from the page you provided:

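import re
import urllib.request as ur

# Download the data file and decode it as text instead of taking str() of the bytes
text = ur.urlopen('http://www.unicode.org/Public/emoji/5.0/emoji-data.txt').read().decode('utf-8')
# Each data line starts with a code point or a range such as 1F600..1F64F
ranges = re.findall(r'^([0-9A-F]+)(?:\.\.([0-9A-F]+))?', text, flags=re.MULTILINE)
# Escape the endpoints ('#' and '*' appear in the file) and join without separators.
# Note: this takes every data line, so it covers all properties listed in the file.
codes = ['-'.join(re.escape(chr(int(a, 16))) for a in pair if a) for pair in ranges]
emojiPattern = re.compile('[' + ''.join(codes) + ']')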

emojiPattern.match then checks a character against the code points listed on that page. If a newer version of the file is published, just change the URL.

Answer 2

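Here is the code I ended up writing to load the emoji information. The get_emoji function fetches the data file, parses it, and calls the enumeration callback. The rest of the code uses it to generate a JSON file with the information I need.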

#!/usr/bin/env python3
# Generates a list of emoji characters and names in JS format
import urllib.request
import unicodedata
import re, json

def get_emoji(on_emoji, attribute):
    '''
    Enumerates the Emoji characters that match an attribute from the Unicode standard (the Emoji list).

    @param on_emoji A callback that is called with each found character. Signature `on_emoji( code_point_value )`
    @param attribute  The attribute that is desired, such as `Emoji` or `Emoji_Presentation`
    '''
    with urllib.request.urlopen('http://www.unicode.org/Public/emoji/5.0/emoji-data.txt') as f:
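        # Fall back to UTF-8 when the server does not declare a charset
        content = f.read().decode(f.headers.get_content_charset() or 'utf-8')

        # Escape the dots so only a literal ".." separates the ends of a range
        cldr = re.compile(r'^([0-9A-F]+)(\.\.([0-9A-F]+))?([^;]*);([^#]*)#(.*)$')
        for line in content.splitlines():
            m = cldr.match(line)
            if m is None: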
                continue

            line_attribute = m.group(5).strip()
            if line_attribute != attribute:
                continue

            code_point = int(m.group(1),16)
            if m.group(3) is None:
                on_emoji(code_point)
            else:
                to_code_point = int(m.group(3),16)
                for i in range(code_point,to_code_point+1):
                    on_emoji(i)


# Dumps the values into a JSON format
def print_emoji(value):
    c = chr(value)
    try:
        obj = {
            'code': value,
            'name': unicodedata.name(c).lower(),
        }
        print(json.dumps(obj),',')
    except ValueError:
        # unicodedata.name() raises ValueError for characters it has no name for;
        # the Unicode DB is likely outdated in the installed Python
        pass

print( "module.exports = [" )
get_emoji(print_emoji, "Emoji_Presentation")
print( "]" )

This solved my original problem. To answer the question itself, just put the results into a dictionary and do lookups.
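The lookup idea can be sketched roughly as follows, using a set rather than a dictionary. The names `load_property`, `is_emoji`, and `SAMPLE` are invented for this illustration, and `SAMPLE` is a hand-abbreviated stand-in in the layout of emoji-data.txt, not the real file; in practice you would pass in the text downloaded by the code above.

```python
import re

# Abbreviated, hand-written sample in the emoji-data.txt layout (an assumption
# for this sketch; the real file is much longer).
SAMPLE = """\
# comment lines start with a hash
0023          ; Emoji                # ...
1F600         ; Emoji                # ...
1F601..1F610  ; Emoji                # ...
1F600         ; Emoji_Presentation   # ...
"""

# A code point, an optional "..end" range, then the property name after ';'
LINE = re.compile(r'^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)')

def load_property(text, prop):
    """Return the set of code points that carry the property `prop`."""
    points = set()
    for line in text.splitlines():
        m = LINE.match(line)
        if m is None or m.group(3) != prop:
            continue
        first = int(m.group(1), 16)
        last = int(m.group(2) or m.group(1), 16)
        points.update(range(first, last + 1))
    return points

def is_emoji(char, emoji_points):
    """Check a single character against a previously loaded set."""
    return ord(char) in emoji_points

emoji_points = load_property(SAMPLE, 'Emoji')
```

With the set built once, `is_emoji('\U0001F600', emoji_points)` is a constant-time membership test, and the same loader works for `Emoji_Presentation` or any other property named in the file.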

Answer 3

I have successfully used the following regular expression pattern before:

import re

emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               "]+", flags=re.UNICODE)
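A possible way to use such a pattern is sketched below. The helper names `strip_emoji` and `in_ranges` are made up for this example, and the ranges are the same hand-picked blocks as above, so this approximates the Emoji property rather than implementing it.

```python
import re

# Same hand-picked ranges as in the pattern above (not the Unicode Emoji property)
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]+"
)

def strip_emoji(text):
    """Remove every run of characters covered by the ranges above."""
    return emoji_pattern.sub("", text)

def in_ranges(char):
    """True if the single character falls inside one of the ranges."""
    return emoji_pattern.fullmatch(char) is not None
```

Keep in mind that a character such as U+2764 (HEAVY BLACK HEART) has the Emoji property but lies outside these blocks, so `in_ranges('\u2764')` is false.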

See also this question: removing emojis from a string in Python

How do I check the Emoji property of a character in Python?
