Question

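I am working with some Twitter data and want to filter the emoticons out of a list. The data itself is UTF-8 encoded. I read the file line by line, like these three example lines: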

['This', 'is', 'a', 'test', 'tweet', 'with', 'two', 'emoticons', '', '⚓️']
['This', 'is', 'another', 'tweet', 'with', 'a', 'emoticon', '']
['This', 'tweet', 'contains', 'no', 'emoticon']

I want to collect the emoticons of each line like this:

['', '⚓️']

and so on.

I did some research and found that Python has an 'emoji' package. I tried using it in my code:

import emoji

with open("file.txt", "r", encoding='utf-8') as f:
    for line in f:
        elements = []
        col = line.strip('\n')
        cols = col.split('\t')
        elements.append(cols)

        emoji_list = []
        data = re.findall(r'\X', elements)
        for word in data:
            if any(char in emoji.UNICODE_EMOJI for char in word):
                emoji_list.append(word)

First attempt

import emoji

with open("file.txt", "r", encoding='utf-8') as f:
    for line in f:
        elements = []
        col = line.strip('\n')
        cols = col.split('\t')
        elements.append(cols)

        emoji_list = []

        for c in elements:
            if c in emoji.UNICODE_EMOJI:
                emojilist.append(c)

Second attempt

I tried the examples given in "How to extract all the emojis from text?", but they did not work for me, and I am not sure what I am doing wrong.

I would really appreciate any help with extracting the emoticons. Thanks in advance! :)

Answer 1

Emojis live in several Unicode ranges, which this regular expression pattern approximates:

>>> import re
>>> emoji = re.compile('[\\u203C-\\u3299\\U0001F000-\\U0001F644]')

You can use it to filter your list:

>>> list(filter(emoji.match, ['This', 'is', 'a', 'test', 'tweet', 'with', 'two', 'emoticons', '', '⚓️']))
['', '⚓️']
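To connect this with the file format in the question, here is a minimal sketch that applies the same pattern to one tab-separated line. The name `extract_emoji_tokens` and the sample line are made up for illustration, and the sketch assumes each token is either plain text or starts with an emoji, since `match` only tests the beginning of a token.

```python
import re

# Approximate emoji ranges, copied from the pattern in this answer.
EMOJI_PATTERN = re.compile('[\\u203C-\\u3299\\U0001F000-\\U0001F644]')


def extract_emoji_tokens(line):
    """Split one tab-separated line and keep the tokens that start with an emoji.

    Hypothetical helper for illustration; not part of any library.
    """
    tokens = line.rstrip('\n').split('\t')
    return [token for token in tokens if EMOJI_PATTERN.match(token)]


# Sample line: plain words followed by an anchor (U+2693 plus variation selector U+FE0F).
sample = 'This\ttweet\thas\tan\tanchor\t\u2693\ufe0f\n'
print(extract_emoji_tokens(sample))
```

In a loop over an open file, the same function could be called once per line. Note that the split result is used directly as a flat list of strings, rather than being appended to another list first.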

N.B.: The pattern is an approximation and may capture some extra characters.
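A small check can make that caveat concrete. The sketch below uses three sample characters chosen for illustration: the anchor from the question, a Japanese hiragana letter that happens to fall inside the first range, and an emoji whose code point lies above the end of the second range.

```python
import re

# Same approximate pattern as above.
EMOJI_PATTERN = re.compile('[\\u203C-\\u3299\\U0001F000-\\U0001F644]')

# Illustrative samples, written as escapes so the code points are explicit.
samples = {
    'anchor U+2693': '\u2693',               # emoji, inside the first range
    'hiragana a U+3042': '\u3042',           # not an emoji, but inside the first range
    'thinking face U+1F914': '\U0001F914',   # emoji, above U+1F644
}

results = {name: bool(EMOJI_PATTERN.match(char)) for name, char in samples.items()}
for name, matched in results.items():
    print(name, matched)
```

So the pattern can produce both false positives and false negatives. If that matters for your data, the ranges would need adjusting.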
