Question

From this website: http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31

<tr class="list03" onclick="showMen1(9);" style="cursor:pointer;">
<td id="e_9" class="qh_one">百度汇总</td>

I am scraping the text and trying to get 百度汇总.

But when I set r.encoding = 'utf-8', the result is �ٶȻ��

If I don't use utf-8, the result is °Ù¶È»ã×Ü

Answer 1

The server doesn't tell you anything useful in the response headers, but the HTML page itself contains:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

GB2312 is a variable-width encoding, like UTF-8. However, the page lies; it actually uses GBK, which is an extension of GB2312.

You can decode it with GBK:

>>> len(r.content.decode('gbk'))
44535
>>> u'百度汇总' in r.content.decode('gbk')
True
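Both garbled strings in the question are consistent with GBK bytes being decoded with the wrong codec. The following is a minimal sketch, assuming the bytes for 百度汇总 in the page are plain GBK; the byte string is built locally rather than fetched from the site, and the variable names are illustrative only.

```python
# Build the GBK bytes for the target text locally (no network access needed).
raw = '百度汇总'.encode('gbk')

# Decoding as Latin-1 maps every byte to one character, giving mojibake.
as_latin1 = raw.decode('latin-1')

# Decoding as UTF-8 hits invalid sequences; errors='replace' substitutes U+FFFD.
as_utf8 = raw.decode('utf-8', errors='replace')

# Decoding with the codec the page really uses recovers the text.
as_gbk = raw.decode('gbk')

print(repr(as_latin1))
print(repr(as_utf8))
print(repr(as_gbk))
```

The Latin-1 result matches the second garbled string in the question, which is what you would expect if requests falls back to ISO-8859-1 when the headers carry no charset.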

Decoding with gb2312 fails:

>>> r.content.decode('gb2312')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 26367-26368: illegal multibyte sequence

But since GBK is a superset of GB2312, it should be safe to use the former even when the latter is specified.
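The superset relationship can be checked locally. This is a small sketch, assuming the traditional character 們 as an example of a character that GBK defines and GB2312 does not; it is an illustrative choice, not necessarily the character that triggers the error on the page.

```python
# Text that exists in GB2312 encodes to identical bytes under both codecs.
common = '百度汇总'
same_bytes = common.encode('gb2312') == common.encode('gbk')

# An assumed example of a GBK-only character (traditional form of 们).
gbk_only = '們'
gbk_bytes = gbk_only.encode('gbk')

# Strict GB2312 decoding rejects the GBK-only byte sequence.
try:
    gbk_bytes.decode('gb2312')
    strict_gb2312_ok = True
except UnicodeDecodeError:
    strict_gb2312_ok = False

print(same_bytes, strict_gb2312_ok)
```

Because the shared characters encode to the same bytes, decoding a genuine GB2312 page as GBK loses nothing.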

If you are using requests, setting r.encoding to gb2312 also works, because r.text uses replace when handling decode errors:

content = str(self.content, encoding, errors='replace')

So the decoding errors that GB2312 would raise are masked for those codepoints that are only defined in GBK.
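The masking effect can be reproduced without requests by calling str() the same way on a locally built byte string. This is a sketch under the assumption that the content mixes GB2312 text with one GBK-only character (們, chosen for illustration).

```python
# Simulated response body: GB2312-compatible text plus one GBK-only character.
content = '百度汇总 們'.encode('gbk')

# Same call shape as the line quoted above from requests.
lenient = str(content, 'gb2312', errors='replace')

# The GB2312 part survives; the GBK-only part is replaced, not raised on.
print(repr(lenient))

# Decoding as GBK keeps every character.
exact = str(content, 'gbk', errors='replace')
print(repr(exact))
```

This runs without an exception, but it silently loses the GBK-only character, so setting r.encoding = 'gbk' is the safer choice.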

Note that BeautifulSoup can do the decoding all by itself; it will find the meta header:

>>> soup = BeautifulSoup(r.content)
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

The warning is caused by the GBK codepoints that are used even though the page claims to use GB2312.
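If you would rather avoid both the warning and the replacement characters, one option is to read the declared charset yourself and promote gb2312 to gbk before decoding. The helper below is a hypothetical sketch using only the standard library; decode_html and META_CHARSET are names invented here, not part of requests or BeautifulSoup, and the regular expression is a rough approximation of meta sniffing rather than a full HTML parser.

```python
import re

# Rough pattern for a charset declaration inside a <meta> tag (an assumption).
META_CHARSET = re.compile(rb'charset=["\']?([\w-]+)', re.IGNORECASE)

def decode_html(raw: bytes, default: str = 'utf-8') -> str:
    """Decode HTML bytes, treating a declared gb2312 as gbk."""
    match = META_CHARSET.search(raw[:2048])  # only inspect the start of the page
    declared = match.group(1).decode('ascii').lower() if match else default
    # GBK is a superset of GB2312, so the promotion is safe for honest pages too.
    codec = 'gbk' if declared == 'gb2312' else declared
    return raw.decode(codec, errors='replace')

# Locally built page that claims gb2312 but contains a GBK-only character.
page = ('<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />'
        '<td>百度汇总 們</td>').encode('gbk')
text = decode_html(page)
print('百度汇总' in text)
```

With requests you would pass r.content to such a helper; an unknown declared codec would raise LookupError, which this sketch does not handle.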
