import re
data2 = ''
file = open('twitter.txt', 'r')
for i in file:
    thing = re.sub(r'[^\x00-\x7f]',r'', str(file[i]))
    print(str(thing))
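Hi, I'm new to Python. After scraping a large amount of data from Twitter with Python, I put the data into a text file. The text file ended up containing a lot of emoji and other non-ASCII characters that cannot be converted to a string. The code above is my attempt to remove the non-ASCII characters and convert the file to a string, but it gives me this error: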
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1607: character maps to <undefined>
How can I remove the non-ASCII characters and then convert the remaining text to a string?
Answer 0 (score: 1)
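# Option 1: keep only characters whose code point is below 128
def return_only_ascii(text):
    return ''.join([x for x in text if ord(x) < 128])

# Option 2 (Python 3.7+): same result using str.isascii()
def return_only_ascii(text):
    return ''.join([x for x in text if x.isascii()])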
>>> return_only_ascii('José')
'Jos'
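
Note that the UnicodeDecodeError in the question is raised while the file is being read, before any filtering runs. The 'charmap' codec in the message suggests the file was opened with the platform's default code page rather than the encoding it was written in. Here is a minimal sketch of one way around that, assuming the file is UTF-8 (a guess, since the question does not say). The function name read_ascii_only and its encoding parameter are made up for illustration:

```python
def read_ascii_only(path, encoding="utf-8"):
    """Read a text file and return its contents with non-ASCII characters dropped."""
    # Assumption: the file is UTF-8. errors="ignore" skips any bytes that
    # cannot be decoded instead of raising UnicodeDecodeError.
    with open(path, "r", encoding=encoding, errors="ignore") as f:
        text = f.read()
    # Encoding to ASCII with errors="ignore" drops every character whose
    # code point is above 127; decoding turns the bytes back into a str.
    return text.encode("ascii", errors="ignore").decode("ascii")
```

You would call it as read_ascii_only('twitter.txt') and get back a plain str. If the file turns out to use a different encoding, pass that name as the second argument.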