Python experts:
I have a sentence:
"this time air\u00e6\u00e3o was filled\u00e3o"
I want to remove the non-ASCII Unicode characters.
I can do that with the following function and code:
def removeNonAscii(s):
    # keep only characters whose code point is in the ASCII range (0-127)
    return "".join(filter(lambda x: ord(x)<128, s))
sentence = "this time air\u00e6\u00e3o was filled\u00e3o"
sentence = removeNonAscii(sentence)
print(sentence)
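This prints: "this time airo was filledo"
which is exactly what I want: the "\u00.." characters are removed.
But when I write the sentence to a file and then read it back in a loop: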
def removeNonAscii(s):
    return "".join(filter(lambda x: ord(x)<128, s))
hand = open('test.txt')
for sentence in hand:
    sentence = removeNonAscii(sentence)
    print(sentence)
it prints "this time air\u00e6\u00e3o was filled\u00a3o"
It does not work at all. What is happening here? If the function works, it should not
behave like that...
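Answer 0: (score: 2)
I have a feeling that the text in your file is not the actual non-ASCII characters but the
literal escape sequences for them: a backslash, the letter u, and four hex digits (\u00--
and so on). When your code runs, it reads each of those characters, sees that every one is ordinary ASCII, and so the filter leaves them in place.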
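A quick way to check this suspicion is to compare the two kinds of string directly. The following is only a sketch: the names in_source and in_file are made up for illustration, and the raw string stands in for text read from a file that holds the escape written out as plain characters.

```python
# In Python source, the escape is converted to a single character: 'æ'.
in_source = "air\u00e6"
# A raw string keeps the backslash, 'u', and four hex digits as six separate
# ASCII characters, which is what a file containing that text would yield.
in_file = r"air\u00e6"

print(len(in_source))  # 4
print(len(in_file))    # 9

# The ord(x) < 128 filter only has something to remove in the first case.
print(all(ord(ch) < 128 for ch in in_source))  # False
print(all(ord(ch) < 128 for ch in in_file))    # True
```

Printing repr(sentence) inside the reading loop should show the same thing: doubled backslashes in the output mean the file holds literal escapes rather than the characters themselves.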
If that is the case, use:
import re
def removeNonAscii(s):
    # remove a literal backslash, 'u', and the four characters that follow
    return re.sub(r'\\u\w{4}','',s)
It will strip out every instance of '\u----'.
Example:
>>> with open(r'C:\Users\...\file.txt','r') as f:
...     for line in f:
...         print(re.sub(r'\\u\w{4}','',line))
...
this time airo was filledo
where file.txt contains:
this time air\u00e6\u00e3o was filled\u00a3o
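If the input might contain either literal escapes or real non-ASCII characters, the two approaches from this thread can be combined. This is a minimal sketch, not a definitive implementation: remove_non_ascii and ESCAPE_RE are hypothetical names, and the pattern is deliberately narrowed to four hex digits (instead of \w{4}) on the assumption that you do not want to delete text such as "\users" from a Windows path.

```python
import re

# Hypothetical helper: a backslash, 'u', and exactly four hexadecimal digits.
ESCAPE_RE = re.compile(r'\\u[0-9a-fA-F]{4}')

def remove_non_ascii(s):
    """Drop literal \\uXXXX escapes, then any real non-ASCII characters."""
    without_escapes = ESCAPE_RE.sub('', s)
    return ''.join(ch for ch in without_escapes if ord(ch) < 128)

# Raw string: literal backslashes, as they would arrive from the file.
from_file = r"this time air\u00e6\u00e3o was filled\u00a3o"
# Normal string: Python has already turned the escapes into real characters.
from_source = "this time air\u00e6\u00e3o was filled\u00e3o"

print(remove_non_ascii(from_file))    # this time airo was filledo
print(remove_non_ascii(from_source))  # this time airo was filledo
```

Running the regex first and the ord filter second means the same function works whether the sentence comes from a string literal or from a file like the one above.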