Question

Consider the following list:

a_list = ['  me así, bla es se  ds ']

How can I extract all the emojis inside a_list into a new list?

new_lis = ['     ']

I tried to use a regular expression, but I don't have all the possible emoji encodings.

Answer 1

You can use the emoji library. To check whether a single code point is an emoji code point, test whether it is contained in emoji.UNICODE_EMOJI.

import emoji

def extract_emojis(text):
    # emoji.UNICODE_EMOJI was removed in emoji 2.0; newer releases expose emoji.EMOJI_DATA instead
    return ''.join(c for c in text if c in emoji.UNICODE_EMOJI)

Answer 2

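I think it is important to point out that the previous answer won't work with emojis that are built from several code points. One such emoji consists of 4 emojis, so using ... in emoji.UNICODE_EMOJI returns 4 separate emojis. The same is true for emojis with skin-tone modifiers.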
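To see why a per-code-point test splits such emojis, it can help to look at the code points directly. The following is a minimal sketch that uses only the standard library; the family sequence is written with escapes and is an illustrative choice, not taken from the question.

```python
import unicodedata

# A family emoji written as escapes: four person emojis joined by
# ZERO WIDTH JOINER (U+200D). It renders as one glyph where supported.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F466\u200D\U0001F466"

# Iterating over a str yields code points, not rendered characters.
for ch in family:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}")

print(len(family))  # 7 code points: 4 emojis + 3 joiners
```

Any approach that filters character by character sees the four person emojis individually and drops the joiners between them.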

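My solution uses the emoji and regex modules. The regex module supports matching grapheme clusters (sequences of Unicode code points that are rendered as a single character), so such an emoji can be counted as one.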

import emoji
import regex

def split_count(text):

    emoji_list = []
    data = regex.findall(r'\X', text)
    for word in data:
        if any(char in emoji.UNICODE_EMOJI for char in word):
            emoji_list.append(word)

    return emoji_list

Test (with more emojis that have skin tones):

line = ["  me así, se  ds  hello ‍ emoji hello ‍‍‍ how are  you today"]

counter = split_count(line[0])
print(' '.join(emoji for emoji in counter))

Output:

      ‍ ‍‍‍

Edit:

If you want to include flags, whose Unicode range runs from U+1F1E6 to U+1F1FF, add:

flags = regex.findall(u'[\U0001F1E6-\U0001F1FF]', text)

to the function above, and return emoji_list + flags.

For more information about flags, see this post.
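A flag is encoded as a pair of regional indicator symbols, so a character class over that range returns each half separately. A minimal sketch of matching whole flags with the standard re module follows; find_flags is a hypothetical helper name, not part of any library.

```python
import re

# Regional indicator symbols occupy U+1F1E6..U+1F1FF; a flag is two in a row.
FLAG_RE = re.compile("[\U0001F1E6-\U0001F1FF]{2}")

def find_flags(text):
    """Return each flag as one two-code-point string (hypothetical helper)."""
    return FLAG_RE.findall(text)

s = "Hello \U0001F1F7\U0001F1FA hello"
print(find_flags(s))                             # one item: both halves together
print(re.findall("[\U0001F1E6-\U0001F1FF]", s))  # two separate halves
```

Pairing from the left is a simplification: a stray single indicator directly in front of a flag would shift the pairs.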

Answer 3

If you don't want to use an external library, you can simply use re.findall() with a suitable regular expression to find the emojis:

In [74]: import re
In [75]: re.findall(r'[^\w\s,]', a_list[0])
Out[75]: ['', '', '', '', '', '']

The regular expression r'[^\w\s,]' is a negated character class that matches any character that is not a word character, whitespace, or a comma.

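As I mentioned in a comment, text usually consists of word characters and punctuation, which this approach handles easily; for other cases you can add the extra characters to the character class manually. Note that because you can specify a range of characters inside a character class, you can make it even shorter and more flexible.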
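A short sketch of how the negated class behaves, including the case where extra punctuation has to be added by hand. The sample text is made up for illustration and uses escapes for the emojis.

```python
import re

text = "\U0001F914 me así, bla es se \U0001F60C ds! (ok)"

# Everything that is not a word character, whitespace, or a comma.
print(re.findall(r"[^\w\s,]", text))
# The emojis are found, but so are '!', '(' and ')'.

# Adding the unwanted punctuation to the class filters it out.
print(re.findall(r"[^\w\s,!()]", text))
```

Accented letters such as í count as word characters in Python 3, so they are not reported.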

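Another solution is to use a character class that accepts emojis ([] without ^) instead of a negated character class that excludes the non-emoji characters. Since there are a lot of emojis with different unicode values, you just need to add their ranges to the character class. If you want to match more emojis, here is a good reference that contains all the standard emojis together with the respective ranges: http://apps.timwhitlock.info/emoji/tables/unicode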
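A minimal sketch of such a positive character class. The blocks listed here are a small, illustrative selection and not an exhaustive emoji definition.

```python
import re

# A few well-known blocks; this list is illustrative, not exhaustive.
EMOJI_CLASS = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F680-\U0001F6FF"  # Transport and Map Symbols
    "\u2600-\u26FF"          # Miscellaneous Symbols
    "\u2700-\u27BF"          # Dingbats
    "]"
)

text = "\U0001F60C relaxed, \U0001F495 hearts, \U0001F914 thinking"
print(EMOJI_CLASS.findall(text))
```

The last emoji, U+1F914, is not found because it lies in the Supplemental Symbols and Pictographs block (U+1F900 to U+1F9FF), which shows why the ranges have to be kept up to date.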

Answer 4

The top-rated answer does not always work. For example, flag emojis will not be found. Consider the string:

s = u'Hello \U0001f1f7\U0001f1fa hello'

A better way is:

import re
import emoji

# Build one big alternation from every known emoji (whitespace stripped from the keys)
emojis_list = map(lambda x: ''.join(x.split()), emoji.UNICODE_EMOJI.keys())
r = re.compile('|'.join(re.escape(p) for p in emojis_list))
print(' '.join(r.findall(s)))
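One caveat with a pattern assembled by joining alternatives: the regex engine takes the first alternative that matches, so the order of the list matters. The sketch below uses a tiny hand-written list as a hypothetical stand-in for the library's data and sorts it longest first.

```python
import re

# Hypothetical stand-in for the library's emoji list: a lone regional
# indicator, a single-code-point emoji, and a two-code-point flag.
known_emojis = ["\U0001F1F7", "\U0001F914", "\U0001F1F7\U0001F1FA"]

def build_pattern(emojis):
    # Longest first: alternation takes the first alternative that matches,
    # so a flag must come before its own first half.
    ordered = sorted(emojis, key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in ordered))

s = "Hello \U0001F1F7\U0001F1FA hello \U0001F914"
print(build_pattern(known_emojis).findall(s))

# Without sorting, only the first half of the flag is returned.
unsorted = re.compile("|".join(re.escape(e) for e in known_emojis))
print(unsorted.findall(s))
```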

Answer 5

The solution that gives what tumbleweed asked for is a mix of the top-rated answer and user594836's answer. This is the code that works for me in Python 3.6.

import emoji
import re

test_list=['  me así,bla es,se  ds ']

## Create the function to extract the emojis
def extract_emojis(a_list):
    emojis_list = map(lambda x: ''.join(x.split()), emoji.UNICODE_EMOJI.keys())
    r = re.compile('|'.join(re.escape(p) for p in emojis_list))
    aux=[' '.join(r.findall(s)) for s in a_list]
    return(aux)

## Execute the function
extract_emojis(test_list)

## the output
['     ']

Answer 6

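Step 1: Make sure your text is decoded as UTF-8: text.decode('utf-8')

Step 2: Find all the emojis in your text; to do so, split the text character by character: [str for str in decode]

Step 3: Save all the emojis in a list: [c for c in allchars if c in emoji.UNICODE_EMOJI]

Full example (Python 2):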

>>> import emoji
>>> text     = "  me así, bla es se  ds "
>>> decode   = text.decode('utf-8')
>>> allchars = [str for str in decode]
>>> list     = [c for c in allchars if c in emoji.UNICODE_EMOJI]
>>> print list
[u'\U0001f914', u'\U0001f648', u'\U0001f60c', u'\U0001f495', u'\U0001f46d', u'\U0001f459']

If you want to remove them from the text:

>>> filtred  = [str for str in decode.split() if not any(i in str for i in list)]
>>> clean_text = ' '.join(filtred)
>>> print clean_text
me así, bla es se ds

Answer 7

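Another way to do it with emoji is to use emoji.demojize to convert the emojis into their text representation.

For example, ? is converted to :grinning_face:, and so on.

Then find all the :.*: patterns and call emoji.emojize on them.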

# -*- coding: utf-8 -*-
import emoji
import re

text = """
Of course, too many emoji characters \
? like ?, #@^!*&#@^# ? helps ? people read ?aa?aaa?a #douchebag
"""

text = emoji.demojize(text)
text = re.findall(r'(:[^:]*:)', text)
list_emoji = [emoji.emojize(x) for x in text]
print(list_emoji)

This may be a roundabout way of doing it, but it is an example of how emoji.emojize and emoji.demojize can be used.
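One thing to watch with this approach is ordinary colons in the text. The sketch below works on a hand-written string of the kind emoji.demojize might produce (an assumption, the library is not called here) and compares the loose pattern with a stricter one that only allows word characters and hyphens between the colons.

```python
import re

# Hand-written example of demojized text that also contains a time.
demojized = "at 10:30 :thinking_face: ok"

loose = re.findall(r":[^:]*:", demojized)
strict = re.findall(r":[\w\-]+:", demojized)

print(loose)   # [':30 :'], swallows part of the time and misses the emoji
print(strict)  # [':thinking_face:']
```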

Answer 8

First, you need to install the package:

conda install -c conda-forge emoji

Now we can write the following code:

import emoji
import re
text= '? ? me así, bla es se ? ds ???'
text_de= emoji.demojize(text)

If we print text_de, the output is:

':thinking_face: :see-no-evil_monkey: me así, bla es se :relieved_face: ds :two_hearts::two_women_holding_hands::bikini:'

Now we can use a regular expression to find the emojis.

emojis_list_de= re.findall(r'(:[!_\-\w]+:)', text_de)
list_emoji= [emoji.emojize(x) for x in emojis_list_de]

If we print list_emoji, the output is:

['?', '?', '?', '?', '?', '?']

So we can use the join function:

[''.join(list_emoji)]
Output: ['??????']

If you want to remove the emojis, you can use the following code:

def remove_emoji(text):
    '''
    Remove all emojis from text.
    '''
    text = emoji.demojize(text)
    text = re.sub(r'(:[!_\-\w]+:)', '', text)
    return text

Answer 9

from emoji import UNICODE_EMOJI

EMOJI_SET = set()

# populate EMOJI_SET once, up front
def pop_emoji_dict():
    for emoji_char in UNICODE_EMOJI:
        EMOJI_SET.add(emoji_char)

pop_emoji_dict()

# check whether the string contains at least one emoji character
def is_emoji(s):
    for letter in s:
        if letter in EMOJI_SET:
            return True
    return False

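This is a better solution when working with large datasets, because you don't have to loop through all the emojis every time. I found that this gives me better results :)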

Answer 10

import emojis
new_list = emojis.get('?? me así, bla es se ds ???')
print(new_list)

Output >>> {'?', '?', '?', '?', '?', '?'}

Answer 11

Here is another option, using emoji.get_emoji_regexp() and re:

import re
import emoji

def extract_emojis(text):
    return re.findall(emoji.get_emoji_regexp(), text)

test_str = '? some ? various ? emojis ??‍? and ?? flags ?‍?‍?‍?'
emojis = extract_emojis(test_str)

This yields:

['?', '?', '?', '??\u200d?', '??', '?\u200d?\u200d?\u200d?']

Or, to see the grapheme clusters:

print(' '.join(emoji for emoji in emojis))

Answer 12

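OK, I had the same problem and worked out a solution that doesn't require you to import any libraries (such as emoji or re) and takes a single line of code. It returns all the emojis in a string:

def extract_emojis(sentence):
    return [word for word in sentence.split() if str(word.encode('unicode-escape'))[2] == '\\' ]

This let me create a lightweight solution, and I hope it helps you. Actually, what I needed was something that filters the emojis out of a string, which is the same as the code above with one small change.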

Here is an example of it in action:

a = '  me así, bla es se  ds '
b = extract_emojis(a)
b = ['', '', '', '']
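The check on the unicode-escape form really tests whether a word starts with a non-ASCII character, not whether it is an emoji. The sketch below repeats the function and shows two edge cases; the sample strings are made up for illustration.

```python
def extract_emojis(sentence):
    # Keep words whose unicode-escape form starts with a backslash,
    # i.e. words whose first character is escaped (non-ASCII).
    return [w for w in sentence.split()
            if str(w.encode("unicode-escape"))[2] == "\\"]

print(extract_emojis("\U0001F914 me así"))   # only the emoji word
print(extract_emojis("\U0001F914 él dijo"))  # 'él' is returned as well
print(extract_emojis("hola\U0001F914"))      # emoji glued to a word: missed
```

So this one-liner fits text where emojis stand alone between spaces and no word begins with an accented or other non-ASCII letter.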

Answer 13

This function expects a string, so convert the input list to a string:

a_list = '  me así, bla es se  ds '

# Import the necessary modules
from nltk.tokenize import regexp_tokenize

# Tokenize and print only emoji
# One character class; quotes and | inside [...] would be matched literally
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"

print(regexp_tokenize(a_list, emoji)) 

output :['', '', '', '', '']

Answer 14

All the Unicode emojis with their respective code points are here. They are 1F600 to 1F64F, so you can build all of them with a range-like iterator.
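A minimal sketch of building that range with chr(). Note the assumption: U+1F600 to U+1F64F is only the Emoticons block, so emojis from other blocks are not covered; extract_emoticons is a hypothetical helper name.

```python
# Build every code point in the Emoticons block, U+1F600..U+1F64F.
emoticons = {chr(cp) for cp in range(0x1F600, 0x1F64F + 1)}

def extract_emoticons(text):
    """Return the characters of `text` that fall in the Emoticons block."""
    return [ch for ch in text if ch in emoticons]

print(len(emoticons))  # 80
print(extract_emoticons("\U0001F60C ok \U0001F914"))
```

The second emoji in the sample, U+1F914, lies outside the block and is therefore skipped.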

How do I extract all emojis from text?
