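I am trying to strip all emoji characters out of a string (like a sanitizer), but I cannot find a complete set of emoji values.
What is the complete set of emoji characters? Is it a set of UTF-16 values?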
Answer 0 (score: 7)
The Unicode standard's Unicode® Technical Report #51 includes a list of emoji (emoji-data.txt):
...
21A9 ; text ; L1 ; none ; j # V1.1 (↩) LEFTWARDS ARROW WITH HOOK
21AA ; text ; L1 ; none ; j # V1.1 (↪) RIGHTWARDS ARROW WITH HOOK
231A ; emoji ; L1 ; none ; j # V1.1 (⌚) WATCH
231B ; emoji ; L1 ; none ; j # V1.1 (⌛) HOURGLASS
...
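I believe you would want to remove every character listed in this document whose Default_Emoji_Style is emoji.
There is no way to identify the emoji characters in Unicode other than by referring to a definition list like this one. As the FAQ reference says, they are spread across different blocks.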
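A minimal sketch of that lookup-list approach, assuming the semicolon-separated layout quoted above (other releases of emoji-data.txt may be laid out differently, so the field positions here are an assumption). The names `SAMPLE`, `parse_default_emoji` and `strip_emoji` are made up for illustration, and only the four quoted lines are used as data:

```python
# Sample lines in the layout quoted above; a real run would read the full file.
SAMPLE = """\
21A9 ; text ; L1 ; none ; j # V1.1 (↩) LEFTWARDS ARROW WITH HOOK
21AA ; text ; L1 ; none ; j # V1.1 (↪) RIGHTWARDS ARROW WITH HOOK
231A ; emoji ; L1 ; none ; j # V1.1 (⌚) WATCH
231B ; emoji ; L1 ; none ; j # V1.1 (⌛) HOURGLASS
"""


def parse_default_emoji(lines):
    """Collect characters whose second field (Default_Emoji_Style) is 'emoji'."""
    emoji_chars = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "emoji":
            continue
        # Assumption: entries with several code points (sequences) are skipped;
        # this sketch only handles single code points.
        codes = fields[0].split()
        if len(codes) == 1:
            emoji_chars.add(chr(int(codes[0], 16)))
    return emoji_chars


def strip_emoji(text, emoji_chars):
    """Return text with every character found in emoji_chars removed."""
    return "".join(ch for ch in text if ch not in emoji_chars)


EMOJI = parse_default_emoji(SAMPLE.splitlines())
print(strip_emoji("time \u231a up \u21a9", EMOJI))
```

With this sample data the watch (U+231A) is removed, while the arrow (U+21A9) stays because its default style is text.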
Answer 1 (score: 3)
unicode-range: U+0080-02AF, U+0300-03FF, U+0600-06FF, U+0C00-0C7F, U+1DC0-1DFF, U+1E00-1EFF, U+2000-209F, U+20D0-214F, U+2190-23FF, U+2460-25FF, U+2600-27EF, U+2900-29FF, U+2B00-2BFF, U+2C60-2C7F, U+2E00-2E7F, U+3000-303F, U+A490-A4CF, U+E000-F8FF, U+FE00-FE0F, U+FE30-FE4F, U+1F000-1F02F, U+1F0A0-1F0FF, U+1F100-1F64F, U+1F680-1F6FF, U+1F910-1F96B, U+1F980-1F9E0;
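Answer 2 (score: 1)
The emoji ranges are updated with every new version of Unicode Emoji. The ranges below apply to version 13.0.
My gist has a more advanced version of this code.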
def is_contains_emoji(p_string_in_unicode):
    """
    Instead of searching all chars of a text in an emoji lookup dictionary, this function just
    checks whether any char in the text is in a unicode emoji range.
    It is much faster than a dictionary lookup for a large text.
    However, it only tells whether a text contains an emoji. It does not return the found emojis.
    """
    range_min = ord(u'\U0001F300')  # 127744
    range_max = ord(u'\U0001FAD6')  # 129750
    range_min_2 = 126980
    range_max_2 = 127569
    range_min_3 = 169
    range_max_3 = 174
    range_min_4 = 8205
    range_max_4 = 12953
    if p_string_in_unicode:
        for a_char in p_string_in_unicode:
            char_code = ord(a_char)
            if range_min <= char_code <= range_max:
                # or range_min_2 <= char_code <= range_max_2 or range_min_3 <= char_code <= range_max_3 or range_min_4 <= char_code <= range_max_4:
                return True
            elif range_min_2 <= char_code <= range_max_2:
                return True
            elif range_min_3 <= char_code <= range_max_3:
                return True
            elif range_min_4 <= char_code <= range_max_4:
                return True
        return False
    else:
        return False
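Since the function above only answers yes or no, here is a rough sketch of a variant that returns or removes the matching characters. It reuses the same four ranges; `EMOJI_RANGES`, `find_emoji_chars` and `remove_emoji_chars` are names invented for this example. Note that the fourth range (8205 to 12953) is very broad, so this sketch will also match ordinary symbols such as the arrow U+2192.

```python
# The same four ranges as in the function above, kept as (low, high) pairs.
EMOJI_RANGES = (
    (0x1F300, 0x1FAD6),  # 127744..129750
    (126980, 127569),
    (169, 174),
    (8205, 12953),
)


def in_emoji_range(ch):
    """True if the code point of ch lies inside any of the ranges."""
    code = ord(ch)
    return any(low <= code <= high for low, high in EMOJI_RANGES)


def find_emoji_chars(text):
    """Return the characters of text that fall inside the ranges, in order."""
    return [ch for ch in text if in_emoji_range(ch)]


def remove_emoji_chars(text):
    """Return text without the characters that fall inside the ranges."""
    return "".join(ch for ch in text if not in_emoji_range(ch))


print(find_emoji_chars("hi \U0001F600!"))
print(remove_emoji_chars("hi \U0001F600!"))
```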
Answer 3 (score: 0)
I have compiled a list based on Joe's and Doctor.Who's answers:
U+00A9, U+00AE, U+203C, U+2049, U+20E3, U+2122, U+2139, U+2194-2199, U+21A9-21AA, U+231A, U+231B, U+2328, U+23CF, U+23E9-23F3, U+23F8-23FA, U+24C2, U+25AA, U+25AB, U+25B6, U+25C0, U+25FB-25FE, U+2600-27EF, U+2934, U+2935, U+2B00-2BFF, U+3030, U+303D, U+3297, U+3299, U+1F000-1F02F, U+1F0A0-1F0FF, U+1F100-1F64F, U+1F680-1F6FF, U+1F910-1F96B, U+1F980-1F9E0
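One possible way to put such a list to use is to turn it into a regular-expression character class. This is only a sketch: `RANGES` is a shortened copy of the list above, and `build_emoji_pattern` is a hypothetical helper name.

```python
import re

# A shortened copy of the list above, for illustration only.
RANGES = "U+00A9, U+00AE, U+203C, U+2194-2199, U+2600-27EF, U+1F100-1F64F, U+1F680-1F6FF"


def build_emoji_pattern(range_list):
    """Turn 'U+XXXX' and 'U+XXXX-YYYY' entries into one compiled character class."""
    parts = []
    for entry in range_list.split(","):
        entry = entry.strip()
        if not entry:
            continue
        start, _, end = entry[2:].partition("-")  # entry[2:] drops the 'U+' prefix
        low = int(start, 16)
        high = int(end, 16) if end else low
        if low == high:
            parts.append("\\U%08X" % low)
        else:
            parts.append("\\U%08X-\\U%08X" % (low, high))
    return re.compile("[" + "".join(parts) + "]")


EMOJI_RE = build_emoji_pattern(RANGES)
print(EMOJI_RE.sub("", "ok \U0001F600 \u2603 done"))
```

Pasting in the full list instead of the shortened one should work the same way, since every entry follows the same two shapes.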
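Answer 4 (score: -1)
If you only deal with English characters and emoji, I think this is doable. First convert your string to UTF-16 code units, then check each one: any value of 0xD800 or above (for emoji, actually >= 0xD83C, the high surrogate of U+1F300) should be part of an emoji.
This is because "The Unicode standard permanently reserves the code point values between 0xD800 to 0xDFFF for UTF-16 encoding of the high and low surrogates", and of course English characters (and many other characters) do not fall in this range.
But since emoji code points start from U+1F300, their UTF-16 values do fall in this range.
If you don't want to work it out yourself, see the quick reference for emoji UTF-16 values.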
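To check the surrogate claim by hand, a small sketch like the following prints the UTF-16 code units of a character (`utf16_units` and `is_surrogate` are names chosen for this example):

```python
def utf16_units(ch):
    """Return the UTF-16 code units of a string as a list of ints."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True if a UTF-16 code unit lies in the reserved surrogate range."""
    return 0xD800 <= unit <= 0xDFFF


print([hex(u) for u in utf16_units("\U0001F300")])  # ['0xd83c', '0xdf00']
print([hex(u) for u in utf16_units("A")])           # ['0x41']
print([hex(u) for u in utf16_units("\u231A")])      # ['0x231a']
```

The last line shows the limit of this shortcut: emoji that sit below U+10000, such as WATCH (U+231A), are encoded as a single ordinary code unit and would not be caught by a surrogate check.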