Question

How can I replace characters that cannot be decoded as UTF-8 with a space?

# -*- coding: utf-8 -*-
print unicode('\x97', errors='ignore') # print out nothing
print unicode('ABC\x97abc', errors='ignore') # print out ABCabc

How can I print ABC abc instead of ABCabc? Note that \x97 is only an example character. The undecodable characters come from unknown input.

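With errors='ignore', the undecodable character is dropped and nothing is printed in its place.
With errors='replace', it is replaced with a special character (U+FFFD, the Unicode replacement character) rather than a space.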
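To make the difference between the two built-in handlers concrete, here is a minimal sketch. It assumes Python 3, where bytes.decode takes the place of Python 2's unicode(...); the variable names are only illustrative.

```python
# Python 3 sketch (an assumption; the question itself uses Python 2).
raw = b'ABC\x97abc'  # \x97 on its own is not valid UTF-8

ignored = raw.decode('utf-8', errors='ignore')    # the bad byte is dropped
replaced = raw.decode('utf-8', errors='replace')  # the bad byte becomes U+FFFD

print(ascii(ignored))   # 'ABCabc'
print(ascii(replaced))  # 'ABC\ufffdabc'
```

Neither handler produces a space, which is why a custom handler or a post-processing step is needed.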

Answer 1

Take a look at codecs.register_error. You can use it to register a custom error handler:

https://docs.python.org/2/library/codecs.html#codecs.register_error

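import codecs
# The handler returns the replacement text and the position at which decoding resumes.
codecs.register_error('replace_with_space', lambda e: (u' ', e.start + 1))
print unicode('ABC\x97abc', encoding='utf-8', errors='replace_with_space')  # prints ABC abc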
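For readers on Python 3, the same idea can be sketched roughly as follows. The handler name space_on_error and the function name are hypothetical, and resuming at error.end instead of error.start + 1 is a choice made for this sketch: it should give one space per invalid sequence reported by the decoder rather than one per byte.

```python
import codecs

def replace_with_space(error):
    # This sketch only handles decoding errors; anything else is re-raised.
    if not isinstance(error, UnicodeDecodeError):
        raise error
    # Return the replacement text and the position at which decoding resumes.
    return (' ', error.end)

# 'space_on_error' is a hypothetical name chosen for this example.
codecs.register_error('space_on_error', replace_with_space)

text = b'ABC\x97abc'.decode('utf-8', errors='space_on_error')
print(text)  # ABC abc
```

Once registered, the handler name can be passed as the errors argument anywhere a codec accepts one in the same process.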

Answer 2

You can handle UnicodeDecodeError with a try-except statement:

def my_encoder(my_string):
    # Decode one byte at a time with the default codec (ASCII unless changed).
    for i in my_string:
        try:
            yield unicode(i)
        except UnicodeDecodeError:
            yield '\t'  # or any other whitespace character

Then join the pieces with str.join:

print ''.join(my_encoder(my_string))

Demo:

>>> print ''.join(my_encoder('this is a\x97n exam\x97ple'))
this is a   n exam  ple
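One caveat with decoding byte by byte: with the default ASCII codec, every byte of a valid multi-byte UTF-8 character also raises UnicodeDecodeError and is turned into whitespace. A possible alternative, sketched here for Python 3 with a hypothetical helper name, is to decode with errors='replace' and then swap the replacement character for a space. It assumes the input contains no genuine U+FFFD characters that should be kept.

```python
def decode_with_spaces(raw, encoding='utf-8'):
    # Undecodable bytes become U+FFFD first, then each U+FFFD becomes a space.
    decoded = raw.decode(encoding, errors='replace')
    return decoded.replace('\ufffd', ' ')

print(decode_with_spaces(b'this is a\x97n exam\x97ple'))  # this is a n exam ple
# Valid multi-byte sequences (here \xc3\xa9, the letter e with an acute accent) survive.
print(ascii(decode_with_spaces(b'caf\xc3\xa9\x97!')))     # 'caf\xe9 !'
```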
