Question

I'm new to the Python ecosystem and to web scraping in general. I'm trying to scrape content from a Chinese website.

from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.baidu.com/")
r.encoding = 'utf-8'

text = r.text

soup = BeautifulSoup(text.encode('utf-8','ignore'), 'html.parser')

print soup.prettify()

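The problem is that this code works for me but not for everyone, and I don't know enough about character encodings or the Python ecosystem to troubleshoot it. I'm running Python 2.7.10, but running the same block of code on another machine with Python 2.7.12 produces the following error: "UnicodeEncodeError: 'ascii' codec can't encode characters in position 369-377: ordinal not in range(128)"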

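So I suppose my questions are really these:

What is causing this error? How can I fix this code to make it more portable?

Thanks in advance for any guidance or pointers.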

Answer 1

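I don't think you need to specify an encoding for the request. r.text has already done the decoding for you (it is the response body as unicode), while r.content is the raw data (the undecoded bytes).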
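To make the difference between the two attributes concrete, here is a minimal sketch that uses plain bytes and strings as stand-ins, so it runs without any network access. The names `raw_bytes` and `decoded_text` are only illustrative substitutes for `r.content` and `r.text`; they are not part of requests.

```python
# -*- coding: utf-8 -*-
# Stand-in for r.content: the raw bytes a server might send (UTF-8 here).
raw_bytes = u"百度一下".encode("utf-8")

# Stand-in for r.text: the same bytes decoded with the correct encoding.
decoded_text = raw_bytes.decode("utf-8")

# Decoding with the wrong codec does not raise here; it silently
# produces garbage characters instead of the original text.
wrong_text = raw_bytes.decode("latin-1")

print(len(raw_bytes))     # 12 bytes (4 characters, 3 bytes each in UTF-8)
print(len(decoded_text))  # 4 characters
print(wrong_text == decoded_text)  # False
```

The point is that the bytes only become readable text once the right codec is applied, which is the job r.encoding controls.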

See the documentation:

 |  text
 |      Content of the response, in unicode.
 |      
 |      If Response.encoding is None, encoding will be guessed using
 |      ``chardet``.
 |      
 |      The encoding of the response content is determined based solely on HTTP
 |      headers, following RFC 2616 to the letter. If you can take advantage of
 |      non-HTTP knowledge to make a better guess at the encoding, you should
 |      set ``r.encoding`` appropriately before accessing this property.
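The rule in that docstring (trust the header's charset unless you know better) can be sketched roughly as follows. This is not how requests implements it; `charset_from_content_type` and `decode_body` are hypothetical helpers written for illustration, and the final UTF-8 fallback is my own assumption.

```python
# -*- coding: utf-8 -*-
from email.message import Message


def charset_from_content_type(content_type):
    """Hypothetical helper: return the charset named in a Content-Type
    header value, or None if the header does not name one."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(raw_bytes, content_type, known_encoding=None):
    """Prefer out-of-band knowledge, then the header, then UTF-8
    (the UTF-8 default is an assumption made for this sketch)."""
    encoding = (known_encoding
                or charset_from_content_type(content_type)
                or "utf-8")
    return raw_bytes.decode(encoding)


body = u"你好".encode("gbk")

# Case 1: the header declares the charset, so nothing else is needed.
text_from_header = decode_body(body, "text/html; charset=gbk")

# Case 2: the header is silent, so we supply what we know about the site,
# which is the role that setting r.encoding plays in requests.
text_from_override = decode_body(body, "text/html", known_encoding="gbk")
```

In both cases the result is the original two-character string; without the override in the second case the GBK bytes would be decoded with the wrong codec.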

So you only need to set the response's encoding, not the request's encoding.

So the code should look like this:

print r.encoding        # the encoding requests guessed from the HTTP headers
r.encoding = "utf-8"    # override it before accessing r.text
print r.text
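As for what causes the error itself: a likely explanation (an assumption, since the second machine's terminal settings aren't shown) is that printing unicode text forces an implicit encode to the output stream's codec, and on that machine the codec is ASCII. The sketch below reproduces the failure with an explicit `.encode("ascii")` and shows the more portable alternative of encoding to UTF-8 yourself before writing. `encode_for_output` is a hypothetical helper, and the `BytesIO` buffer stands in for a terminal or file.

```python
# -*- coding: utf-8 -*-
import io


def encode_for_output(text, encoding="utf-8"):
    """Hypothetical helper: convert unicode text to bytes explicitly,
    instead of relying on whatever codec the output stream defaults to."""
    return text.encode(encoding)


page_text = u"<title>百度一下，你就知道</title>"

# What an ASCII-only output stream effectively attempts, and why it fails.
try:
    page_text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Portable alternative: write explicitly encoded bytes.
buffer = io.BytesIO()
buffer.write(encode_for_output(page_text))
round_tripped = buffer.getvalue().decode("utf-8")

print(ascii_failed)                # True
print(round_tripped == page_text)  # True
```

In the Python 2 script from the question, the equivalent change would be `print soup.prettify().encode('utf-8')`, which hands `print` a byte string so no implicit ASCII encoding takes place.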

Scraping Asian-language websites with Python
