Removing non-ASCII characters from a file using Python

Time: 2015-11-04 16:26:34

Tags: python

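I have a scenario where the log files sent for analysis contain some non-ASCII characters, which end up breaking one of the analysis tools that I have no control over. So I decided to clean up the logs myself and came up with the function below. It does the job, except that it skips the entire line whenever it sees one of those characters. I tried going through the line character by character (see the commented-out code) so that only the offending characters are dropped and the real ASCII ones are kept, but could not get it to work. Why does the commented-out logic fail, and what would be a suggestion or solution for this problem?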

Sample line that fails:

  

1:02:54.934/174573ENQÎNULSUBáyNULEOT/29/abcdefghijg

Function that reads and strips the lines:

def readlogfile(self, abs_file_name):
    """
    Reads and skip the non-ascii chars line from the attached log file and populate the list self.data_bytes
    abs_file_name file name should be absolute path
    """
    try:
        infile = open(abs_file_name, 'rb')
        for line in infile:
            try:
                line.decode('ascii')
                self._data_bytes.append(line)
            except UnicodeDecodeError as e :
                # print line + "Invalid line skipped in " + abs_file_name
                print line
                continue
            # while 1: #code that didn't work to remove just the non-ascii chars
            #     char = infile.read(1)          # read characters from file
            #     if not char or ord(char) > 127 or ord(char) < 0:
            #         continue
            #     else:
            #         sys.stdout.write(char)
            #         #sys.stdout.write('{}'.format(ord(char)))
            #         #print "%s ord = %d" % (char, ord(char))
            #         self._data_bytes.append(char)
    finally:
        infile.close()

2 answers:

Answer 0: (score: 1)

decode takes a second argument that tells it how to handle bad characters. https://docs.python.org/2/library/stdtypes.html#string-methods

Try this:

print "1:02:54.934/174573ENQÎNULSUBáyNULEOT/29/abcdefghijg".decode("ascii", "ignore")

u'1:02:54.934/174573ENQNULSUByNULEOT/29/abcdefghijg'
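For readers on Python 3, where str no longer has a decode method, the same idea can be sketched on a bytes object. This is only an illustration: the Latin-1 encoding used to build the sample bytes is an assumption, since the question does not say how the log file is encoded.

```python
# Python 3 sketch. Assumption: the sample line is Latin-1 encoded, so each
# accented character is a single byte >= 128.
raw = "1:02:54.934/174573ENQÎNULSUBáyNULEOT/29/abcdefghijg".encode("latin-1")

# errors="ignore" silently drops every byte that is not valid ASCII
# instead of raising UnicodeDecodeError.
cleaned = raw.decode("ascii", errors="ignore")
print(cleaned)
```

Note that real control characters such as NUL or EOT have byte values below 128, so they count as ASCII and would survive this step.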

Your code can be reduced to something like this:

def readlogfile(self, abs_file_name):
    """
    Reads and skip the non-ascii chars line from the attached log file and populate the list self.data_bytes
    abs_file_name file name should be absolute path
    """
    with open(abs_file_name, 'rb') as infile:
        while True:
            line = infile.readline()
            if not line:
                break
            self._data_bytes.append(line.decode("ascii", "ignore"))
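A rough Python 3 equivalent of the reduced function might look like the sketch below. The standalone function name read_ascii_lines is hypothetical and is used here only so the example runs without the surrounding class.

```python
def read_ascii_lines(abs_file_name):
    """Return the lines of the file with every non-ASCII byte dropped.

    Hypothetical helper for illustration; abs_file_name should be an
    absolute path, as in the original function.
    """
    cleaned = []
    # Binary mode: lines come back as bytes, so nothing is decoded implicitly.
    with open(abs_file_name, "rb") as infile:
        for line in infile:
            # Keep the line, minus any bytes that are not valid ASCII.
            cleaned.append(line.decode("ascii", errors="ignore"))
    return cleaned
```

Iterating over the file object directly does the same job as the readline loop, so the explicit end-of-file check is not needed.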

Answer 1: (score: 0)

I think this is the right way to handle the offending lines on a character-by-character basis:

import codecs

class MyClass(object):
    def __init__(self):
        self._data_bytes = []

    def readlogfile(self, abs_file_name):
        """
        Reads and skips the non-ascii chars line from the attached log file and
        populate the list self.data_bytes abs_file_name file name should be
        absolute path
        """
        with codecs.open(abs_file_name, 'r', encoding='utf-8') as infile:
            for line in infile:
                try:
                    line.decode('ascii')
                except UnicodeError as e:
                    ascii_chars = []
                    for char in line:
                        try:
                            char.decode('ascii')
                        except UnicodeError as e2:
                            continue  # ignore non-ascii characters
                        else:
                            ascii_chars.append(char)
                    line = ''.join(ascii_chars)
                self._data_bytes.append(str(line))
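As for why the commented-out loop in the question did not work, it appears to have two problems. It calls infile.read(1) inside a for loop that is already iterating over the same file, and Python 2 file iteration uses a read-ahead buffer, so mixing the two gives unreliable positions. It also runs continue when read(1) returns an empty string at end of file, so the while loop never terminates. A byte-level filter applied to each line avoids both issues. The sketch below targets Python 3, and the helper name strip_non_ascii is hypothetical.

```python
def strip_non_ascii(data):
    """Return a copy of the bytes object with every byte >= 128 removed.

    Hypothetical helper for illustration (Python 3, where iterating over
    bytes yields integers).
    """
    return bytes(b for b in data if b < 128)


# Example: the two high bytes are dropped, everything else is kept.
print(strip_non_ascii(b"ab\xcec\xe1d\n"))
```

Applied to each line read in binary mode, this keeps the ASCII content of a line instead of discarding the whole line.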