Question

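This is my code, and it is very simple. For some reason it fails with the error 'str' object has no attribute 'find_all'. Even when I remove text = str(html) and replace soup = BeautifulSoup(text, 'html.parser') with soup = BeautifulSoup(html, 'html.parser'), the same error occurs. What is going on?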

import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen('https://jalopnik.com/search?q=mazda&u=&zo=-07:00') as response:
    html = response.read()
text = str(html)
soup = BeautifulSoup(text, 'html.parser')
print(type(soup))
soup = soup.prettify()
print(soup.find_all('div'))  # the error is raised on this line

Answer 1

soup.prettify() returns a string, and because you assign that string back to soup, soup is a str rather than a BeautifulSoup object by the time you call soup.find_all().

From the pretty printing section of the BeautifulSoup documentation:

The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string.

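Don't replace the soup with the prettified string. BeautifulSoup doesn't need prettifying; that is only useful when you want to turn the soup back into a string, for example to save it to a file or to debug.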

soup = BeautifulSoup(text, 'html.parser')
print(soup.find_all('div'))

works fine.
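As a minimal sketch of the point above, the following uses a small inline HTML string (an assumption made here so the example runs without network access) and keeps the prettified output under a separate, hypothetical name, pretty, so that soup stays a BeautifulSoup object:

```python
from bs4 import BeautifulSoup

# Inline HTML stands in for the downloaded page (assumption for illustration).
html = "<div><p>Mazda</p></div>"
soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()     # a plain str, stored under its own name
divs = soup.find_all("div")  # soup is still a BeautifulSoup object

print(type(soup).__name__)          # BeautifulSoup
print(type(pretty).__name__)        # str
print(hasattr(pretty, "find_all"))  # False
print(len(divs))                    # 1
```

Calling pretty.find_all('div') here would raise the same AttributeError as in the question, because str has no such method.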

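You also don't want to use str(html) to decode a bytes object. Normally you would use html.decode('utf8') or similar; str(html) gives you the repr of the bytes object, a value that starts with b' and ends with '.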
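The difference can be seen with a short bytes literal; the sample value and the names as_repr and as_text are made up for illustration:

```python
# A short bytes value standing in for response.read() (assumption for illustration).
raw = b"<div>Mazda</div>"

as_repr = str(raw)            # repr of the bytes object, not decoded text
as_text = raw.decode("utf8")  # the actual text

print(as_repr)  # b'<div>Mazda</div>'
print(as_text)  # <div>Mazda</div>
```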

However, BeautifulSoup is perfectly capable of decoding bytes values itself. It can also read directly from the response:

import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen('https://jalopnik.com/search?q=mazda&u=&zo=-07:00') as response:
    soup = BeautifulSoup(response, 'html.parser')
print(soup.find_all('div'))
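To try both behaviours without a network request, here is a rough sketch that passes raw bytes and then a file-like object to BeautifulSoup. The io.BytesIO object is only a stand-in for the HTTP response, and the sample markup is an assumption:

```python
import io
from bs4 import BeautifulSoup

# ASCII-only sample bytes (assumption), so encoding detection is unambiguous.
raw = b"<div>Mazda</div>"

# 1. Raw bytes: BeautifulSoup decodes them itself.
soup_from_bytes = BeautifulSoup(raw, "html.parser")

# 2. File-like object: BytesIO stands in for the urlopen() response.
soup_from_file = BeautifulSoup(io.BytesIO(raw), "html.parser")

divs_bytes = soup_from_bytes.find_all("div")
divs_file = soup_from_file.find_all("div")

print(divs_bytes[0].get_text())  # Mazda
print(divs_file[0].get_text())   # Mazda
```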

'str' object has no attribute 'find_all' (Beautiful Soup)
