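I have a scenario where log files sent for analysis contain some non-ASCII characters, which end up breaking one of the analysis tools that I have no control over. So I decided to clean the logs myself and came up with the function below. It works, except that I end up skipping the whole line whenever it contains such characters. I also tried going through the file character by character (see the commented-out code) so that only those characters are removed and the actual ASCII text is kept, but I could not get it to work. Why does the commented-out logic fail, and what would you suggest to solve this?
Sample line that fails: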
1:02:54.934/174573ENQÎNULSUBáyNULEOT/29/abcdefghijg
Function that reads the file and drops the offending lines:
def readlogfile(self, abs_file_name):
    """
    Reads and skip the non-ascii chars line from the attached log file and populate the list self.data_bytes
    abs_file_name file name should be absolute path
    """
    try:
        infile = open(abs_file_name, 'rb')
        for line in infile:
            try:
                line.decode('ascii')
                self._data_bytes.append(line)
            except UnicodeDecodeError as e:
                # print line + "Invalid line skipped in " + abs_file_name
                print line
                continue
        # while 1:  # code that didn't work to remove just the non-ascii chars
        #     char = infile.read(1)  # read characters from file
        #     if not char or ord(char) > 127 or ord(char) < 0:
        #         continue
        #     else:
        #         sys.stdout.write(char)
        #         #sys.stdout.write('{}'.format(ord(char)))
        #         #print "%s ord = %d" % (char, ord(char))
        #         self._data_bytes.append(char)
    finally:
        infile.close()
Answer 0 (score: 1)
decode takes a second argument that specifies how to handle bad characters. https://docs.python.org/2/library/stdtypes.html#string-methods
Try this:
>>> "1:02:54.934/174573ENQÎNULSUBáyNULEOT/29/abcdefghijg".decode("ascii", "ignore")
u'1:02:54.934/174573ENQNULSUByNULEOT/29/abcdefghijg'
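As a rough illustration of that second argument, the sketch below compares the default "strict" handler with "ignore" and "replace". It is written so that it should also run on Python 3, where the input has to be a bytes object. The byte values 0xCE and 0xE1 are assumptions standing in for the Î and á of the sample line (their Latin-1 encodings); the real log may use a different encoding.

```python
# Bytes resembling the sample line; \xce and \xe1 are assumed stand-ins
# for the two non-ASCII characters.
raw = b"1:02:54.934/174573ENQ\xceNULSUB\xe1yNULEOT/29/abcdefghijg"

# "strict" is the default handler and raises on the first non-ASCII byte.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "ignore" silently drops every byte that is not valid ASCII.
ignored = raw.decode("ascii", "ignore")

# "replace" keeps a visible marker (U+FFFD) where each bad byte was.
replaced = raw.decode("ascii", "replace")

print(strict_failed)
print(ignored)
```

"replace" may be the better choice when you want to see afterwards where data was dropped; "ignore" fits the goal of handing pure ASCII to the downstream tool.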
Your code can then be reduced to something like this:
def readlogfile(self, abs_file_name):
    """
    Reads the attached log file, drops the non-ascii chars from each line
    and populates the list self._data_bytes.
    abs_file_name should be an absolute path.
    """
    with open(abs_file_name, 'rb') as infile:
        for line in infile:
            # "ignore" discards undecodable bytes instead of raising
            self._data_bytes.append(line.decode("ascii", "ignore"))
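To try this approach without the surrounding class, here is a minimal standalone sketch. The function name read_ascii_lines and the temporary file contents are made up for illustration; only open() in binary mode and decode("ascii", "ignore") come from the answer above.

```python
import os
import tempfile


def read_ascii_lines(path):
    """Return the file's lines with every non-ASCII byte removed.

    Hypothetical helper, not part of the original code.
    """
    cleaned = []
    with open(path, "rb") as infile:
        for line in infile:
            cleaned.append(line.decode("ascii", "ignore"))
    return cleaned


# Build a small throwaway log with one clean and one dirty line.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"clean line\n")
    tmp.write(b"1:02:54.934/174573\xce\xe1y/29/abcdefghijg\n")
    tmp_path = tmp.name

try:
    lines = read_ascii_lines(tmp_path)
finally:
    os.remove(tmp_path)

print(lines)
```

Note that this keeps every line, including the previously skipped ones, and that ASCII control characters (anything below 128) pass through unchanged.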
Answer 1 (score: 0)
I think this is the right way to process the offending lines on a character-by-character basis:
import codecs


class MyClass(object):
    def __init__(self):
        self._data_bytes = []

    def readlogfile(self, abs_file_name):
        """
        Reads the attached log file, removes the non-ascii chars from each
        line and populates the list self._data_bytes.
        abs_file_name should be an absolute path.
        """
        # codecs.open yields unicode lines, so test them with encode()
        with codecs.open(abs_file_name, 'r', encoding='utf-8') as infile:
            for line in infile:
                try:
                    line.encode('ascii')
                except UnicodeError as e:
                    # rebuild the line from its ascii characters only
                    ascii_chars = []
                    for char in line:
                        try:
                            char.encode('ascii')
                        except UnicodeError as e2:
                            continue  # ignore non-ascii characters
                        else:
                            ascii_chars.append(char)
                    line = ''.join(ascii_chars)
                self._data_bytes.append(str(line))
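As for why the commented-out loop in the question does not work: at end of file, read(1) returns an empty string, and the `if not char ...: continue` branch then loops again instead of stopping, so the loop never terminates. The `ord(char) < 0` test can also never be true. A byte-level filter along the lines the question was attempting might look like the sketch below; strip_non_ascii is a hypothetical name, and io.BytesIO stands in for the real log file.

```python
import io


def strip_non_ascii(stream):
    """Read a binary stream one byte at a time, keeping only bytes 0-127."""
    kept = bytearray()
    while True:
        chunk = stream.read(1)
        if not chunk:
            # An empty read means end of file: break, do not `continue`.
            break
        if ord(chunk) < 128:
            kept.extend(chunk)
    return bytes(kept)


# In-memory stand-in for a log file opened with mode 'rb'.
sample = io.BytesIO(b"174573\xce\xe1y/29\n")
result = strip_non_ascii(sample)
print(result)
```

Reading one byte at a time is slow on large logs, so the line-based decode("ascii", "ignore") approach is probably preferable in practice; this version is mainly useful for seeing where the original loop went wrong.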