Question

My code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import lxml
html = urlopen("http://www.xyafc.edu.cn/xyacnews/cnews/")
news = BeautifulSoup(html,'lxml')
print(news.title.encode('utf8'))

Result:

b'<title>\xe6\xa0\xa1\xe5\x9b\xad\xe6\x96\xb0\xe9\x97\xbb</title>'

Website:

http://www.xyafc.edu.cn/xyacnews/cnews/

The page's character set is gb2312. I searched Google for answers, but none of them worked. How can I get the correct news.title?

Answer 1

First, if you want to change the encoding of the HTML, do it at the urlopen step. encode converts str >> bytes, which is why your print shows b'....'.
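One way to handle the encoding at the fetch step is to decode the raw response bytes with the page's charset before doing anything else. This is only a minimal sketch, assuming the page really is gb2312 as the question states. The helper name decode_page and the sample bytes are hypothetical stand-ins so the snippet runs offline; in the real script the bytes would come from urlopen(...).read().

```python
def decode_page(raw_bytes, charset="gb2312"):
    """Turn raw response bytes into str using the page's declared charset."""
    # errors="replace" substitutes a marker for any byte that is invalid
    # in the charset instead of raising UnicodeDecodeError.
    return raw_bytes.decode(charset, errors="replace")


# Stand-in for urlopen(...).read(): a title encoded as gb2312 (assumed sample).
sample = "<title>校园新闻</title>".encode("gb2312")

page = decode_page(sample)
print(type(page).__name__)  # str
print(page)
```

The resulting str can then be passed to BeautifulSoup in place of the raw response object.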

Just get rid of the encode call.
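To see why the encode call produces the b'....' output, the bytes from the question's result can be decoded back to text. A small sketch using only the standard library; the variable names raw and text are illustrative:

```python
# Bytes copied from the result shown in the question.
raw = b'<title>\xe6\xa0\xa1\xe5\x9b\xad\xe6\x96\xb0\xe9\x97\xbb</title>'

# decode() goes bytes -> str, the reverse of encode().
text = raw.decode('utf8')
print(text)

# Encoding the str again gives back exactly the bytes that were printed.
print(text.encode('utf8') == raw)  # True
```

The bytes are valid UTF-8, so the title was already parsed correctly; without the encode call, print receives the str directly.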
