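I have a dataset of tweets. I am trying to remove all emojis and symbols from these tweets. However, my code fails to remove some emojis, such as ☠, ❤, ⭐, etc. How can I improve my attempt, or what other approach can I use to remove all of these emojis from the tweets? The tweets are stored in a pandas DataFrame.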
########## How I tried
import re

emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]")
# regex=True is required when passing a compiled pattern in recent pandas versions
cleanedData['text'] = cleanedData['text'].str.replace(emoji_pattern, '', regex=True)
cleanedData.head(5).to_dict()  # after removing emojis with the approach above
{'id': {0: 1042616899408945154, 1: 1042592536769044487, 2: 1042587702040903680, 3: 1042587263643930626, 4: 1042586780292276230}, 'month': {0: 9, 1: 9, 2: 9, 3: 9, 4: 9}, 'hour': {0: 3, 1: 1, 2: 1, 3: 1, 4: 1}, 'text': {0: ' are red, violets are blue, if you want to buy us , here is a CLUE Our eye & cheek palette is AL… ', 1: 'Is it too late now to say sorry ', 2: ' Oh no! Please email your order # to social & we can help . This is a newest offer!!', 3: " It's best applied with our buffer brush! \xa0", 4: ' DEAD '}, 'hasMedia': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0}, 'hasHashtag': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0}, 'followers_count': {0: 801745, 1: 801745, 2: 801745, 3: 801745, 4: 801745}, 'retweet_count': {0: 17, 1: 94, 2: 0, 3: 0, 4: 0}, 'favourite_count': {0: 181, 1: 408, 2: 0, 3: 0, 4: 14}, 'sentiments': {0: {'neg': 0.0, 'neu': 0.949, 'pos': 0.051, 'compound': 0.0772}, 1: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}, 2: {'neg': 0.1, 'neu': 0.634, 'pos': 0.266, 'compound': 0.5684}, 3: {'neg': 0.0, 'neu': 0.64, 'pos': 0.36, 'compound': 0.6696}, 4: {'neg': 0.834, 'neu': 0.166, 'pos': 0.0, 'compound': -0.7213}}}
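Answer 0 (score: 1)
Depending on what your dataset needs, you can try a broader regular expression, for example one that removes every non-ASCII character: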
cleanedData['text'] = cleanedData['text'].str.replace(r'[^\x00-\x7F]+', '', regex=True)
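To see what this pattern does outside of pandas, here is a minimal sketch using only the standard `re` module. The helper name `strip_non_ascii` and the sample string are made up for illustration. The sample shows that ☠ (U+2620), ❤ (U+2764) and ⭐ (U+2B50) are removed, but so is the accented letter in "café", so this approach may be too aggressive for non-English tweets.

```python
import re

# Matches one or more characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7F]+')

def strip_non_ascii(text):
    """Return text with every non-ASCII character removed (hypothetical helper)."""
    return NON_ASCII.sub('', text)

# Hypothetical sample: skull, accented e, heart, star.
sample = "DEAD \u2620 caf\u00e9 \u2764 \u2b50"
cleaned = strip_non_ascii(sample)
print(repr(cleaned))
```

The surrounding spaces are left in place, so a follow-up whitespace cleanup may be wanted.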
Answer 1 (score: 0)
Try this, which does not use a regular expression:
cleaned_text = u"\U0001F600 some words then symbol \U0001F6FF".encode('ascii', 'ignore').decode('utf8')
I am assuming the symbols are found in the tweet text.
Answer 2 (score: 0)
Try the 'demoji' package for Python.
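Answer 3 (score: 0)
I was able to remove all emojis from my tweet dataset with a regular expression, as follows:
import re

def deEmojify(text):
    # Character class covering the emoji blocks plus several symbol ranges.
    # Caution: \U000024C2-\U0001F251 is very broad and also matches CJK and
    # Hangul text; \U00010000-\U0010ffff matches every character outside the
    # Basic Multilingual Plane.
    regrex_pattern = re.compile(pattern="["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing through misc symbols and arrows (not Chinese characters)
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"  # zero-width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
        "]+", flags=re.UNICODE)
    return regrex_pattern.sub(r'', text)

df['text'] = df['text'].apply(deEmojify)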
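As a rough check of why the extra ranges matter, the sketch below compares the code points of ☠, ❤ and ⭐ against the four ranges from the question and against the `\u2600-\u2B55` range used above. The names `QUESTION_RANGES`, `EXTRA_RANGE` and `in_ranges` are introduced here for illustration only. All three characters sit in the Basic Multilingual Plane, below every range in the question's pattern, which is consistent with them surviving the original attempt.

```python
# Ranges from the pattern in the question (inclusive code point bounds).
QUESTION_RANGES = [
    (0x1F600, 0x1F64F),  # emoticons
    (0x1F300, 0x1F5FF),  # symbols & pictographs
    (0x1F680, 0x1F6FF),  # transport & map symbols
    (0x1F1E0, 0x1F1FF),  # flags
]
# One of the additional ranges in deEmojify.
EXTRA_RANGE = (0x2600, 0x2B55)

def in_ranges(ch, ranges):
    """Return True if the code point of ch lies in any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)

# Skull and crossbones, heavy black heart, white medium star.
for ch in "\u2620\u2764\u2b50":
    print(f"U+{ord(ch):04X}",
          "question ranges:", in_ranges(ch, QUESTION_RANGES),
          "extra range:", in_ranges(ch, [EXTRA_RANGE]))
```

If keeping non-Latin text matters, one option is to add only narrow ranges like this one to the original pattern rather than the very broad ones.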