Question

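I am trying to separate the emoji in a given text from the other characters, words, and emoji. I want to use the emoji later as features in text classification, so I have to handle each emoji in a sentence individually and treat it as a separate character.

Code: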

import re

text = "I am very #happy man but my wife is not "
print(text) #line a

reg = re.compile(u'['
    u'\U0001F300-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+', 
    re.UNICODE)

#padding the emoji with space at both the ends
new_text = reg.sub(' \1 ',text) 
print(new_text) #line b

# this is just to test if it can still identify the emoji in new_text
new_text2 = reg.sub('#\1#', new_text) 
print(new_text2) # line c

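Here is the actual output:

(I had to paste a screenshot, because copying the output from the terminal distorts the emoji, which are already distorted in lines b and c.)

Here is my expected output: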

I am very #happy man but my wife is not 
I am very #happy man but     my wife   is not     
I am very #happy man but ##  ##  my wife ##  is not  ##  ##

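Questions:

1) Why does the search and replace not work as expected? What are the emoji being replaced with (line b)? It is definitely not the Unicode of the original emoji, otherwise line c would print the emoji padded with # at both ends.

2) I am not sure I am right about this, but why is a group of emoji replaced with a single emoji/Unicode character (line b)?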

Answer 1

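There are several issues here.

There is no capturing group in the regex pattern, but in the replacement pattern you define a \1 backreference to Group 1. The most natural fix is to use a backreference to Group 0, i.e. the whole match, which is written \g<0>.
The \1 in the replacement is not actually parsed as a backreference, but as a character with octal value 1, because a backslash in a regular (non-raw) string literal forms an escape sequence. Here, it is an octal escape.
The + after the ] means that the regex engine must match one or more occurrences of text matching the character class, so you match sequences of emoji rather than each individual emoji.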
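The first two points can be checked in isolation. The following is a minimal sketch, not part of the original answer: the one-range pattern and the sample string are made up for illustration, and the emoji is written as an escape so the snippet does not depend on terminal encoding.

```python
import re

# In a regular (non-raw) string literal, '\1' is an octal escape,
# so it is the single control character U+0001, not a backreference.
octal_char = '\1'
raw_backref = r'\1'   # raw string: a backslash followed by the digit 1

# Hypothetical one-range pattern with no capturing group.
pattern = re.compile('[\u2600-\u26FF]')
sample = 'sun \u2600 here'

# Non-raw replacement: the emoji is replaced by U+0001 padded with spaces.
broken = pattern.sub(' \1 ', sample)

# Raw r'\1' is a real backreference, but Group 1 does not exist,
# so re.sub raises re.error instead of substituting.
raised = False
try:
    pattern.sub(r' \1 ', sample)
except re.error:
    raised = True

# \g<0> refers to the whole match and needs no capturing group.
fixed = pattern.sub(r' \g<0> ', sample)

print(repr(broken))
print(repr(fixed))
```

This is why line c in the question finds nothing to pad: after line b the text no longer contains any emoji, only the control character U+0001.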

Use the following instead (see the Python demo):

import re

text = "I am very #happy man but my wife is not "
print(text) #line a

reg = re.compile(u'['
    u'\U0001F300-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]', 
    re.UNICODE)

#padding the emoji with space at both ends
new_text = reg.sub(r' \g<0> ',text) 
print(new_text) #line b

# this is just to test if it can still identify the emojis in new_text
new_text2 = reg.sub(r'#\g<0>#', new_text) 
print(new_text2) # line c
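To see what removing the + changes, the sketch below runs both variants of the character class on a short sample. The sample string and the variable names are assumptions made for illustration, and the two emoji are written as escapes (U+1F61E and U+1F620) rather than literal characters.

```python
import re

# Same character class as above, kept as a string so both variants can be built.
EMOJI_CLASS = ('['
    '\U0001F300-\U0001F64F'
    '\U0001F680-\U0001F6FF'
    '\u2600-\u26FF\u2700-\u27BF]')

# Hypothetical sample with two adjacent emoji.
sample = 'not \U0001F61E\U0001F620 ok'

# With +, the run of two emoji is one match, so it is wrapped once.
grouped = re.compile(EMOJI_CLASS + '+').sub(r'#\g<0>#', sample)

# Without +, each emoji is its own match and is wrapped separately.
single = re.compile(EMOJI_CLASS).sub(r'#\g<0>#', sample)

# Padding each emoji with spaces and splitting on whitespace
# yields one token per emoji, usable as classification features.
tokens = re.compile(EMOJI_CLASS).sub(r' \g<0> ', sample).split()

print(grouped)
print(single)
print(tokens)
```

Calling split() with no arguments collapses the extra spaces the padding introduces, so adjacent emoji do not produce empty tokens.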

Python emoji search and replace not working as expected
