I am trying to clean Twitter data in Python using regex, but I cannot remove \u2764, \ufe0f and \u2026. The Twitter data is in the file datas.txt, and this is the data:
“Berkat biznet aku bisa online terimakasih BiznetHome \u2764\ufe0f
Gangguan hari sabtu perbaikan nanti senin hari currently offline slow response \u2764\ufe0f Terima kasih TelkomCare masalah indihome sy sudah terselesaikan terima kasih fast response reusnya terus selalu tingka \u2026 TelkomCare Sudah beres fix Internet dandan
I tried three approaches:
First:
import re
with open('datas.txt', 'r') as f:
    mylist = [line for line in f]
emoji_pattern = re.compile(r'\\\\u\w+')
for i in mylist:
    print(emoji_pattern.sub(r'', i))
Second:
import re
f = open('datas.txt', 'r')
data = f.read()
emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    u"\U00002702-\U000027B0"
    u"\U000024C2-\U0001F251"
    u"\U0001f926-\U0001f937"
    u'\U00010000-\U0010ffff'
    u"\u200d"
    u"\u2640-\u2642"
    u"\u2600-\u2B55"
    u"\u23cf"
    u"\u23e9"
    u"\u231a"
    u"\u3030"
    u"\ufe0f"
    "]+", flags=re.UNICODE)
emoji_pattern.sub(r'', data)
Third:
f= open("datas.txt", "r", encoding="UTF-8")
datas = f.read()
data = datas.encode('ascii', 'ignore').decode("utf-8")
print(data)
But it still doesn't work.
Answer 0 (score: 0)
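Your text file contains non-ASCII Unicode code points written the way Python encodes Unicode literals in source code. You can do one of two things:
Remove the \uXXXX or \UXXXXXXXX sequences. This removes every Unicode code point written in Python literal format, which in principle (though not necessarily) will be the non-ASCII characters. For example, it can be done like this:
import re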
with open('datas.txt', 'r') as f:
    mylist = [line for line in f]
unicode_literal = re.compile(r'\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}')
for i in mylist:
    print(unicode_literal.sub(r'', i))
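As a rough illustration of the removal approach, here is a minimal self-contained sketch that works on an in-memory string instead of datas.txt. The helper name strip_unicode_literals and the whitespace tidy-up are assumptions added for illustration; they are not part of the answer itself. The sketch also shows why the first attempt in the question matches nothing: r'\\\\u\w+' requires two consecutive backslashes, whereas the data has only one before each u.

```python
import re

# One literal backslash, then u + 4 hex digits or U + 8 hex digits.
UNICODE_LITERAL = re.compile(r'\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}')

# Pattern from the question's first attempt: needs TWO literal backslashes.
QUESTION_PATTERN = re.compile(r'\\\\u\w+')


def strip_unicode_literals(text):
    # Hypothetical helper: drop the escape sequences, then collapse the
    # whitespace they leave behind.
    cleaned = UNICODE_LITERAL.sub('', text)
    return ' '.join(cleaned.split())


# Raw string, so the backslashes are real characters, as in the file.
sample = r'Berkat biznet aku bisa online terimakasih BiznetHome \u2764\ufe0f'

print(QUESTION_PATTERN.search(sample))  # None: no double backslash in the data
print(strip_unicode_literals(sample))
```

Collapsing whitespace is optional; drop the join/split line if the original spacing matters.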
Alternatively, decode the escape sequences into the characters they represent:
# Note: the file is read in binary mode
with open('datas.txt', 'rb') as f:
    mylist = [line for line in f]
for i in mylist:
    print(i.decode('unicode-escape'))
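Once the escapes are decoded into real characters, the third attempt from the question (encoding to ASCII with errors ignored) starts to work, because there are now actual non-ASCII characters to drop. The sketch below combines the two steps on an in-memory bytes value. The function name decode_and_strip is hypothetical, and the sketch assumes the file otherwise holds plain ASCII: the unicode-escape codec treats undecoded bytes as Latin-1, so genuine UTF-8 text would be misread.

```python
def decode_and_strip(raw):
    # Hypothetical helper. raw: bytes of ASCII text containing literal
    # \uXXXX escapes (assumption: no real UTF-8 bytes in the input).
    text = raw.decode('unicode-escape')  # escapes become real characters
    # Drop every character that cannot be represented in ASCII.
    return text.encode('ascii', 'ignore').decode('ascii')


# Raw bytes literal, so the backslashes are real bytes, as in the file.
sample = rb'TelkomCare \u2764\ufe0f Terima kasih'
print(decode_and_strip(sample))
```

If the goal is to keep other non-ASCII text and remove only emoji, an emoji character-class pattern like the one in the second attempt could be applied to the decoded string instead of the ASCII round trip.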