Question

为了记录utf-8字符串，有什么我需要手工记录打印为我做的事情吗？

for line in unicodecsv.reader(cfile, encoding="utf-8"):
    for i in line:
        print "process_clusters: From CSV: %s" % i
        print "repr: %s" % repr(i)
        log.debug("process_clusters: From CSV: %s", i)

无论字符串是拉丁语还是俄语西里尔语，我的print语句都能正常工作。

process_clusters: From CSV: escuchan
repr: u'escuchan'
process_clusters: From CSV: говоритъ
repr: u'\u0433\u043e\u0432\u043e\u0440\u0438\u0442\u044a'

但是，log.debug不会让我传入同一个变量。我收到这个错误：

Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/logging/__init__.py", line 765, in emit
    self.stream.write(fs % msg.encode("UTF-8"))
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/codecs.py", line 686, in write
    return self.writer.write(data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 28: ordinal not in range(128)

我的日志，格式化程序和处理程序是：

log = logging.getLogger(__name__)
loglvl = getattr(logging, loglevel.upper()) # convert text log level to numeric
log.setLevel(loglvl) # set log level
handler = logging.FileHandler('inflection_finder.log', 'w', 'utf-8')
handler.setFormatter(logging.Formatter('[%(levelname)s] %(message)s'))
log.addHandler(handler)

我正在使用Python 2.6.7。

Answer 1

通过回溯读取，似乎日志模块在写入消息之前尝试对消息进行编码。该消息被假定为ASCII字符串，它不能是因为它包含UTF-8字符。如果在将消息传递给记录器之前将其转换为Unicode，则可能会有效。

    log.debug(u"process_clusters: From CSV: %s", i)

编辑，我注意到你的参数字符串已经解码为Unicode，所以我相应地更新了这个例子。

同样基于您的最新编辑，您可能希望在设置中使用Unicode字符串：

handler.setFormatter(logging.Formatter(u'[%(levelname)s] %(message)s'))
                                     --^--

Answer 2

Python2中的所有字符串与unicode都是一团糟...幸运的是在Python3中得到了纠正。但是假设转移到Python3不是一个选项，它就在这里。

我认为，在编码方面使用logging时有两种选择：

以二进制文件打开文件，并将所有字符串用作字节字符串。
以文本形式打开文件，并将所有字符串用作unicode-strings。

任何其他选择都注定要失败！

问题是您在打开文件时指定编码"utf-8"，但您的俄语文本是unicode字符串。

所以你可以做一件（只有一件）：

以二进制文件打开文件，即从"utf-8"构造中删除FileStream参数。
将所有相关文本转换为unicode字符串，其中包括log.debug的参数和logging.Formatter的参数。

为什么相同的utf-8字符串在打印和记录失败方面都很好？

2 个答案: