Question

I'm trying to get BeautifulSoup to work with a URL, like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://proxies.org")
soup = BeautifulSoup(html.encode("utf-8"), "html.parser")
print(soup.find_all('a'))

However, I get an error:

 File "c:\Python3\ProxyList.py", line 3, in <module>
    html = urlopen("http://proxies.org").encode("utf-8")
AttributeError: 'HTTPResponse' object has no attribute 'encode'

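Any idea why? Could it be related to the urlopen function? And why is utf-8 needed?

There seem to be some differences between Python 3 and BeautifulSoup4 compared with the examples I found, which now look outdated or wrong.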

Answer 1

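It doesn't work because urlopen returns an HTTPResponse object, and you are treating it as if it were the HTML itself. You need to call the .read() method on the response to get the HTML: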

from urllib.request import urlopen
from bs4 import BeautifulSoup

response = urlopen("http://proxies.org")
html = response.read()  # read() returns the body as bytes
soup = BeautifulSoup(html.decode("utf-8"), "html.parser")
print(soup.find_all('a'))

You probably also want html.decode("utf-8") rather than html.encode("utf-8"): read() returns bytes, and bytes are decoded into text, not encoded.
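To make the encode/decode distinction concrete, here is a minimal sketch that runs without network access. The helper name decode_body and the UTF-8 fallback are assumptions made for illustration, not part of the answer above. With a real response, the charset declared by the server can be read with response.headers.get_content_charset() and passed in as the second argument.

```python
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode an HTTP body to text (hypothetical helper).

    Falls back to UTF-8 when no charset is given; that fallback is an
    assumption, not something a server guarantees.
    """
    return raw.decode(charset or "utf-8", errors="replace")


# Stand-in for response.read(), which returns bytes rather than str.
raw = "<a href='/list'>proxies</a>".encode("utf-8")

text = decode_body(raw)
print(type(raw).__name__, "->", type(text).__name__)

# bytes has .decode() but no .encode(); str is the other way around.
print(hasattr(raw, "encode"), hasattr(raw, "decode"))
```

The decoded str can then be handed to BeautifulSoup exactly as in the answer's code.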

Answer 2

Check this out.

public getField(x, y) {
    return this.fields[x + this.verticalFields * y];
}

Answer 3

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://proxies.org")
soup = BeautifulSoup(html, "html.parser")
print(soup.find_all('a'))

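First, urlopen returns a file-like object.
BeautifulSoup accepts a file-like object and decodes it automatically, so you don't have to worry about the encoding yourself.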
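The file-like behaviour can be tried offline with a small sketch. Here io.BytesIO stands in for the HTTPResponse, and the markup and variable names are made up for illustration; this assumes bs4 is installed.

```python
import io

from bs4 import BeautifulSoup

# io.BytesIO plays the role of the file-like HTTPResponse so the sketch
# runs without network access; the markup is invented for illustration.
fake_response = io.BytesIO(
    b'<html><body><a href="/one">one</a> <a href="/two">two</a></body></html>'
)

# BeautifulSoup calls .read() on the object and decodes the bytes itself.
soup = BeautifulSoup(fake_response, "html.parser")
links = [a["href"] for a in soup.find_all("a")]
print(links)
```

No explicit .read() or .decode() call is needed in this variant, which is why the answer's code can pass the urlopen result straight in.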

From the documentation:

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))

soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.
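A short sketch of that conversion, assuming bs4 is installed. The sample markup is invented, and from_encoding is passed explicitly so the result does not depend on encoding detection.

```python
from bs4 import BeautifulSoup

# UTF-8 bytes containing a non-ASCII character and two HTML entities.
raw = "<p>caf\u00e9 &amp; cr&egrave;me</p>".encode("utf-8")

# from_encoding names the encoding explicitly; without it,
# Beautiful Soup tries to detect the encoding on its own.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

# The parsed text is a Unicode str with entities already resolved.
text = soup.p.get_text()
print(text)
```

The bytes go in undecoded and the text comes out as an ordinary str, so no manual encode or decode step is involved.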
