Question

I have recently been learning web scraping with Python, and I want to scrape

a website that uses two charsets, utf-8 and gb2312.

I get this warning from BeautifulSoup:

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

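I googled the problem and I think it may be a decoding issue. The same code scrapes other websites without any trouble.

So, what should I do?

Here is my code: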

from urllib.request import urlopen
from bs4 import BeautifulSoup


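code_type = 'utf-8'  # encoding assumed for every page
html = urlopen("http://news.sina.com.cn/")
print(html)  # prints the HTTPResponse object, not the page body

# from_encoding forces utf-8, even if parts of the page are gb2312
bsObj = BeautifulSoup(html, "html.parser", from_encoding=code_type)

imglist = bsObj.find_all("img")
print(imglist)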

Answer 1

Detect the page's encoding before parsing it.

See the chardet repo: https://github.com/chardet/chardet

You should not set code_type to utf-8 for every page. Detect each page's actual encoding and set code_type to that.
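
A minimal sketch of that detection step, assuming the chardet package is installed and that the page has already been read into bytes. The helper name detect_encoding and its default argument are made up for illustration; only chardet.detect comes from the library.

```python
import chardet  # third-party package: pip install chardet


def detect_encoding(raw, default="utf-8"):
    """Guess the encoding of raw bytes (hypothetical helper, not part of chardet).

    chardet.detect returns a dict with 'encoding' and 'confidence' keys;
    'encoding' is None when chardet has no guess, so fall back to `default`.
    """
    guess = chardet.detect(raw)
    return guess["encoding"] or default


# Short samples are hard to classify, so feed chardet the whole page in practice.
sample = "编码检测需要足够长的文本才能比较可靠".encode("gb2312")
print(detect_encoding(sample))
```

With the question's code, this would mean reading the body first (raw = urlopen(url).read()) and then passing from_encoding=detect_encoding(raw) to BeautifulSoup together with raw.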

Encoding detection can sometimes fail. For that case, keep a dict that stores the encodings of known websites, and use it when parsing those special pages.
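
One way the dict fallback might look, as a sketch only. The table KNOWN_ENCODINGS, the host in it, the function choose_encoding and the 0.8 confidence threshold are all assumptions for illustration. The detect argument is any callable that returns a dict with 'encoding' and 'confidence' keys, such as chardet.detect.

```python
from urllib.parse import urlparse

# Hypothetical table: hosts whose encoding you have verified by hand.
KNOWN_ENCODINGS = {
    "legacy.example.com": "gb2312",
}


def choose_encoding(url, raw, detect=None, min_confidence=0.8, default="utf-8"):
    """Pick an encoding for a page: known-site table, then detector, then default."""
    host = urlparse(url).hostname
    if host in KNOWN_ENCODINGS:
        # A hand-verified entry wins over any automatic guess.
        return KNOWN_ENCODINGS[host]
    if detect is not None:
        guess = detect(raw)
        # Accept the guess only if the detector named an encoding confidently.
        if guess.get("encoding") and guess.get("confidence", 0.0) >= min_confidence:
            return guess["encoding"]
    return default


print(choose_encoding("http://legacy.example.com/news", b""))
```

Checking the table first keeps the special pages predictable; whether the table or the detector should win for a given site is a judgment call.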
