Question

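For a given string, I am trying to count the occurrences of each word and each emoji. I have already done this (here) for emojis that consist of a single code point. The problem is that many current emojis are composed of several emojis.

For example, one emoji can be made up of four emojis joined together, and emojis with a human skin tone are made up of a base emoji plus a skin-tone modifier, and so on.

The problem boils down to splitting the string into the right pieces; counting them afterwards is easy.

There are some good questions that address the same problem, such as link1 and link2, but none of them works as a general solution (or the solutions are outdated, or I could not figure them out).

For example, if the string is hello ‍ emoji hello ‍‍‍, then I would like to get {'hello':2, 'emoji':1, '‍‍‍':1, '‍':1}.

My strings come from WhatsApp, and all of them are UTF-8 encoded.

I have made many failed attempts. Any help would be appreciated.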

Answer 1

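Use the third-party regex module, which supports matching grapheme clusters (sequences of Unicode code points that are rendered as a single character):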

>>> import regex
>>> s='‍‍‍'
>>> regex.findall(r'\X',s)
['\u200d\u200d\u200d', '']
>>> for c in regex.findall(r'\X',s):
...     print(c)
... 
‍‍‍

To count them:

>>> data = regex.findall(r'\X',s)
>>> from collections import Counter
>>> Counter(data)
Counter({'\u200d\u200d\u200d': 1, '': 1})
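To see why splitting by code point is not enough, here is a small standard-library sketch. The two sample strings are my own choice for illustration (a four-person ZWJ sequence and a base emoji with a skin-tone modifier), written with escapes so the code points are unambiguous; `code_points` is a hypothetical helper name.

```python
# A ZWJ sequence: man, ZWJ, woman, ZWJ, boy, ZWJ, boy (illustrative sample).
family = "\U0001F468\u200d\U0001F469\u200d\U0001F466\u200d\U0001F466"
# A base emoji followed by a skin-tone modifier (illustrative sample).
toned = "\U0001F645\U0001F3FD"


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Python strings are sequences of code points, so one visible emoji
# can have a length greater than 1.
print(len(family), code_points(family))  # 7 code points, three of them U+200D
print(len(toned), code_points(toned))    # 2 code points
```

Iterating over the string directly would therefore yield seven and two separate items, while \X keeps each sequence together as one match.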

Answer 2

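Many thanks to Mark Tolonen. Now, to count both the words and the emojis in a given string, I use emoji.UNICODE_EMOJI (from the emoji package) to decide what is an emoji and what is not, and then remove the emojis from the string.

This is not an ideal answer at the moment, but it works. I will edit it if that changes.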

from collections import Counter

import emoji
import regex


def split_count(text):
    total_emoji = []
    # Split into grapheme clusters so multi-code-point emojis stay whole
    data = regex.findall(r'\X', text)
    for word in data:
        if any(char in emoji.UNICODE_EMOJI for char in word):
            total_emoji.append(word)  # total_emoji is a list of all emojis

    # Remove the emojis from the given text
    for current in total_emoji:
        text = text.replace(current, '')

    # Count the remaining words together with the emojis
    return Counter(text.split() + total_emoji)


text_string = "here hello world hello‍‍‍"    
final_counter = split_count(text_string)

Output:

final_counter
Counter({'hello': 2,
         'here': 1,
         'world': 1,
         '\u200d\u200d\u200d': 1,
         '': 5,
         '': 1})
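One thing to watch with the replace step: if a short emoji also occurs inside a longer sequence, removing the short one first may break the longer one. A possible alternative is to rebuild the words in a single pass over the grapheme clusters. The sketch below is only that, a sketch: `count_words_and_emojis` and `looks_like_emoji` are hypothetical names, the code point ranges in the predicate are a rough assumption and not a complete emoji definition, and the input is assumed to be the list that regex.findall(r'\X', text) returns.

```python
from collections import Counter


def looks_like_emoji(cluster):
    # Placeholder predicate (assumption): a cluster counts as an emoji if any
    # code point falls in two common emoji ranges. Swap in a lookup from the
    # emoji package for real data.
    return any(0x1F300 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF
               for ch in cluster)


def count_words_and_emojis(clusters, is_emoji=looks_like_emoji):
    """Count words and emojis from a list of grapheme clusters."""
    counts = Counter()
    word = []  # clusters of the word currently being built

    def flush():
        # Finish the current word, if there is one.
        if word:
            counts["".join(word)] += 1
            word.clear()

    for cluster in clusters:
        if is_emoji(cluster):
            flush()               # an emoji ends the word before it
            counts[cluster] += 1
        elif cluster.isspace():
            flush()               # whitespace separates words
        else:
            word.append(cluster)
    flush()
    return counts


# For plain ASCII every character is its own cluster, so list() stands in
# for regex.findall(r'\X', ...) in this example.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F466\u200d\U0001F466"
clusters = list("here hello world hello") + [family]
print(count_words_and_emojis(clusters))
```

Because nothing is removed from the original string, an emoji attached to the end of a word (as in the example) still leaves the word intact.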

Answer 3

emoji.UNICODE_EMOJI is a dictionary with the following structure:

{'en': 
    {'\U0001F947': ':1st_place_medal:',
     '\U0001F948': ':2nd_place_medal:',
     '\U0001F949': ':3rd_place_medal:' 
... }
}

So you need to use emoji.UNICODE_EMOJI['en'] to make the code above work.
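If the code has to run against both layouts, a small helper can pick the right mapping. This is a sketch under assumptions: `emoji_lookup` is a hypothetical name, and the flat layout (emoji directly as keys) is inferred from the way the earlier answer uses the dictionary. It takes the dictionary as an argument, so the example below runs with plain dicts standing in for emoji.UNICODE_EMOJI.

```python
def emoji_lookup(unicode_emoji, language="en"):
    """Return the emoji -> name mapping for either layout.

    Assumed layouts: a flat dict {emoji: name}, or a nested dict
    {language: {emoji: name}} as shown above.
    """
    per_language = unicode_emoji.get(language)
    if isinstance(per_language, dict):
        return per_language   # nested layout
    return unicode_emoji      # flat layout


# Stand-ins for the two layouts (sample data, not the real table).
flat = {"\U0001F947": ":1st_place_medal:"}
nested = {"en": {"\U0001F947": ":1st_place_medal:"}}

print("\U0001F947" in emoji_lookup(flat))    # True
print("\U0001F947" in emoji_lookup(nested))  # True
```

In split_count you would then test `char in emoji_lookup(emoji.UNICODE_EMOJI)` instead of indexing the dictionary directly.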

Splitting and counting emojis and words in a given string in Python
