Question

I am doing some text processing on text scraped from the web. I want to decode the raw text first

raw_html = raw_html.decode("iso-8859-1")

and later encode it as UTF-8, so that I don't run into encoding problems...

raw_html = raw_html.encode("UTF-8")

The problem is that, even though I know the page's encoding, I still get an error in the decoding step...

UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 302: ordinal not in range(128)

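I will be handling several languages, but not that many web pages (so I am happy to set the encoding by hand). I would like to convert all of them (English, French, Spanish, Portuguese) to a common base I can work with. What would you suggest?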

Answer 1

If raw_html.decode() gives you an *encoding* exception, then raw_html is already Unicode:

>>> u'é'.decode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

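This happens because Python 2, when asked to "decode" a Unicode value, first tries to encode it to a byte string (using the default ASCII codec) so that it has bytes to decode. Any non-ASCII character, such as u'\u20ac' in your traceback, makes that implicit encode fail.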
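
One way to avoid the double decode is to check the type before decoding. This is only a minimal sketch: the helper names to_text and to_utf8 are made up for illustration, and the default encoding is just the one from the question. It relies on bytes being an alias for str in Python 2, so the same check should work on both Python 2 and Python 3.

```python
def to_text(value, encoding="iso-8859-1"):
    """Return Unicode text, decoding only when given a byte string."""
    if isinstance(value, bytes):
        # Raw bytes from the network: decode with the known page encoding.
        return value.decode(encoding)
    # Already Unicode: decoding again is what triggers the implicit
    # ASCII encode (and the UnicodeEncodeError) on Python 2.
    return value


def to_utf8(text):
    """Encode Unicode text to UTF-8 bytes, e.g. just before writing it out."""
    return text.encode("utf-8")


raw_bytes = b"caf\xe9"          # 'café' encoded as ISO-8859-1
text = to_text(raw_bytes)       # u'caf\xe9'
same_text = to_text(text)       # unchanged, no error
utf8_bytes = to_utf8(text)      # b'caf\xc3\xa9'
```

The idea is to decode once at the point where the bytes come in, work with Unicode everywhere in between, and encode to UTF-8 only on output.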
