Question

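I have a scenario where the log files sent for analysis contain some non-ASCII characters, which end up breaking one of the analysis tools that I have no control over. So I decided to clean the logs myself and came up with the following function, which does the job except that I end up skipping the whole line when I see those characters. I tried going through that line character by character (see the commented-out code) so that only those characters are eliminated and the actual ASCII is kept, but I couldn't get it to work. Why does the commented-out logic fail, and what is a suggestion or solution to get around this problem?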

Sample line that fails:

1:02:54.934/174573ENQÎNULSUBáyNULEOT/29/abcdefghijg

Function that reads the file and drops the lines:

def readlogfile(self, abs_file_name):
    """
    Reads and skip the non-ascii chars line from the attached log file and populate the list self.data_bytes
    abs_file_name file name should be absolute path
    """
    try:
        infile = open(abs_file_name, 'rb')
        for line in infile:
            try:
                line.decode('ascii')
                self._data_bytes.append(line)
            except UnicodeDecodeError as e :
                # print line + "Invalid line skipped in " + abs_file_name
                print line
                continue
            # while 1: #code that didn't work to remove just the non-ascii chars
            #     char = infile.read(1)          # read characters from file
            #     if not char or ord(char) > 127 or ord(char) < 0:
            #         continue
            #     else:
            #         sys.stdout.write(char)
            #         #sys.stdout.write('{}'.format(ord(char)))
            #         #print "%s ord = %d" % (char, ord(char))
            #         self._data_bytes.append(char)
    finally:
        infile.close()

Answer 1

decode takes a second argument that says how to handle bad characters. https://docs.python.org/2/library/stdtypes.html#string-methods

Try this:

>>> "1:02:54.934/174573ENQÎNULSUBáyNULEOT/29/abcdefghijg".decode("ascii", "ignore")

u'1:02:54.934/174573ENQNULSUByNULEOT/29/abcdefghijg'
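For comparison, here is a small sketch of how the error handlers differ on the same kind of input. The byte string is a hypothetical stand-in for the log line (Î and á written out as UTF-8 bytes), not the actual file contents, and it is spelled with escapes so it runs the same on Python 2 and 3.

```python
# Hypothetical sample: the two accented characters as UTF-8 byte pairs.
raw = b"1:02:54.934/174573\xc3\x8e\xc3\xa1y/29/abcdefghijg\n"

# "ignore" silently drops every byte that is not valid ASCII.
ignored = raw.decode("ascii", "ignore")

# "replace" keeps a U+FFFD marker where each bad byte was.
replaced = raw.decode("ascii", "replace")

# The default handler is "strict", which is what raises in the question's code.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

"replace" can be handy while debugging, since it shows where bytes were lost; "ignore" is the one that matches what the question asks for.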

Your code can be reduced to something like this:

def readlogfile(self, abs_file_name):
    """
    Reads the attached log file, drops the non-ascii chars from each line and populates the list self._data_bytes
    abs_file_name file name should be absolute path
    """
    with open(abs_file_name, 'rb') as infile:
        for line in infile:
            # "ignore" drops undecodable bytes instead of raising
            self._data_bytes.append(line.decode("ascii", "ignore"))
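If the goal is to hand a cleaned file to the analysis tool rather than keep the lines in memory, the same decode call can be used in a small copy routine. This is only a sketch: `clean_log` and both path parameters are hypothetical names, not part of the original code.

```python
def clean_log(src_path, dst_path):
    """Copy src_path to dst_path, dropping non-ASCII bytes from each line."""
    with open(src_path, 'rb') as infile, open(dst_path, 'wb') as outfile:
        for line in infile:
            # Decode with "ignore" so bad bytes are dropped instead of raising.
            text = line.decode("ascii", "ignore")
            # Encode back to bytes; this cannot fail, only ASCII is left.
            outfile.write(text.encode("ascii"))
```

Opening both files in binary mode keeps the line endings exactly as they were in the source log.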

Answer 2

I think this is the right way to handle the offending lines on a character-by-character basis:

import codecs

class MyClass(object):
    def __init__(self):
        self._data_bytes = []

    def readlogfile(self, abs_file_name):
        """
        Reads the attached log file, removes the non-ascii chars from each
        line and populates the list self._data_bytes. abs_file_name should
        be an absolute path.
        """
        with codecs.open(abs_file_name, 'r', encoding='utf-8') as infile:
            for line in infile:
                try:
                    # Lines are unicode here, so test them with encode().
                    line.encode('ascii')
                except UnicodeError:
                    ascii_chars = []
                    for char in line:
                        try:
                            char.encode('ascii')
                        except UnicodeError:
                            continue  # ignore non-ascii characters
                        else:
                            ascii_chars.append(char)
                    line = ''.join(ascii_chars)
                self._data_bytes.append(str(line))
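On why the commented-out loop in the question did not work: as written it calls infile.read(1) on the file instead of walking the current line, and at end of file read(1) returns an empty string, so `not char` is true and `continue` spins forever instead of leaving the loop. A minimal sketch of the per-character idea applied to the line itself is below. `strip_non_ascii` is a hypothetical helper name, the sample bytes are made up, and going through bytearray is just a choice that lets the same code run on Python 2 and 3.

```python
def strip_non_ascii(line):
    """Return line (a byte string) with every byte above 127 removed."""
    # Iterating a bytearray yields integers on both Python 2 and 3.
    kept = bytearray(b for b in bytearray(line) if b < 128)
    return bytes(kept)

# Hypothetical sample: two accented characters encoded as UTF-8.
sample = b"174573\xc3\x8e\xc3\xa1y/29/abcdefghijg\n"
cleaned = strip_non_ascii(sample)
```

Note that control characters such as NUL or EOT are still ASCII, so they pass through this filter (and through decode("ascii", "ignore")) untouched. If the ENQ/NUL/SUB/EOT in the sample line are real control bytes and they also upset the tool, they would need a separate check.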

Removing non-ASCII characters from a file using Python
