Continue reading a file despite UnicodeDecodeError

Asked: 2015-11-18 15:12:31

Tags: python utf-8 character-encoding ascii itertools

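I have a Python 3 program that monitors a log file. The log contains, among other things, chat messages written by users. It is created by a third-party application that I cannot change.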

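Today a user wrote a message containing characters that could not be decoded (the raw bytes are visible in the traceback below), and the program crashed with the following error: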

future: <Task finished coro=<updateConsoleLog() done, defined at /usr/local/src/bserver/logmonitor.py:48> exception=UnicodeDecodeError('utf-8',...
say "\xed\xa0\xbd\xed\xb1\x8c"\r\n', 7623, 7624, 'invalid continuation byte')>
Traceback (most recent call last):
File "/usr/lib/python3.4/asyncio/tasks.py", line 238, in _step
result = next(coro)
File "/usr/local/src/bserver/logmonitor.py", line 50, in updateConsoleLog
server_events = self.console.getUpdate()
File "/usr/local/src/bserver/console.py", line 79, in getUpdate
return self.read()
File "/usr/local/src/bserver/console.py", line 90, in read
for line in itertools.islice(log_file, log_no, None):
File "/usr/lib/python3.4/codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7623: invalid continuation byte
ERROR:asyncio:Task exception was never retrieved

Using 'file -i log.file' I determined that the log file is us-ascii. That should not be a problem, since ASCII is a subset of UTF-8 (as far as I know).

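Since this happens rarely, I don't mind losing what this user typed. Can I skip this line, or just the characters that cannot be decoded, and keep reading the rest of the file?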

I considered using try: ... except UnicodeDecodeError as ..., but that would mean I cannot read anything in the log file after the error.

Code

def read(self):
    log_no = self.last_log_no
    log_file = open(self.path, 'r')
    server_events = []
    starting_log_no = log_no
    for line in itertools.islice(log_file, log_no, None):  # ERROR is raised here
        server_events.append(line)
        print(line.replace('\n', '').replace('\r', ''))

        log_no += 1
        self.last_log_no = log_no
    if (starting_log_no < log_no):
        return server_events
    return False

Any help or advice would be greatly appreciated!

1 answer:

Answer 0 (score: 2)

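The byte string \xed\xa0\xbd\xed\xb1\x8c is not valid UTF-8. It is not us-ascii either, because us-ascii characters are only 7 bits wide, and \x8c, for example, is greater than 127.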
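This can be checked in an interpreter. A minimal sketch using the bytes from the traceback; the helper name can_decode is made up for illustration:

```python
# The bytes reported in the traceback.
raw = b"\xed\xa0\xbd\xed\xb1\x8c"

def can_decode(data, encoding):
    """Return True if `data` decodes cleanly with `encoding` (hypothetical helper)."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

print(can_decode(raw, "utf-8"))    # strict UTF-8 rejects this sequence
print(can_decode(raw, "ascii"))    # every byte is above 127
print(can_decode(raw, "latin-1"))  # latin-1 maps all 256 byte values
```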

Instead of ignoring the UnicodeDecodeError, try opening the file with an encoding that supports all 8 bits of a byte (e.g. latin-1):

log_file = open(self.path, 'r', encoding='latin-1')
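If the log is otherwise UTF-8 and only the occasional bad byte should be dropped or substituted, the built-in open() also takes an errors argument. The following is only a sketch under that assumption; read_lines_from is a hypothetical helper, not part of the program in the question:

```python
import itertools

def read_lines_from(path, start, errors="replace"):
    """Return the lines of `path` from index `start` on.

    Undecodable bytes are handled according to `errors`:
    "replace" substitutes U+FFFD, "ignore" drops them silently.
    """
    with open(path, "r", encoding="utf-8", errors=errors) as log_file:
        # islice skips the first `start` lines, as in the question's read().
        return list(itertools.islice(log_file, start, None))
```

With errors="replace" the bad bytes show up as U+FFFD replacement characters, so it stays visible that something was lost; errors="ignore" removes them without a trace. With latin-1, by contrast, decoding never fails, but any non-ASCII UTF-8 text in the log will come out as mojibake.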