Question

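Using Python 2.7, I grab some HTML from a website as a string and immediately decode it to unicode. Because I need to know later where any decoding errors occurred, I thought it best to use errors="replace" to prevent exceptions on non-ASCII characters: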

linkname = curlinkname.decode("utf-8", errors="replace")

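In most cases this replaces the problem characters with a placeholder. However, when I run the code I still get an exception from one line that contains a special character (ū):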

UnicodeEncodeError: 'charmap' codec can't encode character u'\u016b' in position 1: character maps to <undefined>

What is going on?

Answer 1

You need to install the chardet library first:

pip install chardet

Then use it to detect the encoding before decoding:

import chardet
# detect() returns a dict; its 'encoding' entry can be None when detection fails
code = chardet.detect(curlinkname)
linkname = curlinkname.decode(code['encoding'] or "utf-8", errors="replace")
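
Note that the traceback in the question is a UnicodeEncodeError from the 'charmap' codec, which suggests the failure may happen when the already-decoded text is later encoded (for example when printing to a Windows console), not inside decode(). The following is only a minimal sketch of that distinction: the sample string and the cp1252 code page are assumptions chosen for illustration, not details taken from the question.

```python
# Hypothetical sample standing in for the scraped bytes (valid UTF-8).
raw = u"k\u016bka".encode("utf-8")

# Decoding succeeds: the bytes are valid UTF-8, so nothing is replaced.
text = raw.decode("utf-8", "replace")

# Encoding to a legacy code page (cp1252 is an assumption here) is the
# step that can raise UnicodeEncodeError, because it has no mapping for U+016B.
try:
    text.encode("cp1252")
    failed = False
except UnicodeEncodeError:
    failed = True

# Passing an error handler to encode() substitutes '?' instead of raising.
safe = text.encode("cp1252", "replace")
```

If this is indeed the cause, the error handler passed to decode() never comes into play for ū, and the fix belongs at the point where the text is written out.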