Question

from bs4 import BeautifulSoup

html = 'index.html'
soup = BeautifulSoup(open(html))
print len(soup.findAll('div'))

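where the file index.html is the source code of this shopping webpage.

My code reports finding only 1 div tag. Stranger still, findAll('a') returns a huge (and probably correct) number. span works too, but div does not.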

Answer 1

You are running into the differences between parsers that BeautifulSoup uses under the hood.

Choose html.parser or html5lib:

>>> from bs4 import BeautifulSoup
>>> html = 'index.html'
>>> soup = BeautifulSoup(open(html), 'html')
>>> len(soup.findAll('div'))
0
>>> soup = BeautifulSoup(open(html), 'lxml')
>>> len(soup.findAll('div'))
0
>>> soup = BeautifulSoup(open(html), 'html.parser')
>>> len(soup.findAll('div'))
774
>>> soup = BeautifulSoup(open(html), 'html5lib')
>>> len(soup.findAll('div'))
774

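Note that if you don't specify a parser, BeautifulSoup picks one automatically. From the documentation:

If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.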
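
To see which parsers agree on your own markup, a small comparison helper can be useful. The following is only a sketch: SAMPLE_HTML and count_tags_by_parser are hypothetical names made up for illustration, and the sample markup stands in for index.html, which is not reproduced here. Parsers that are not installed are reported as None rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for index.html; replace with open('index.html').read().
SAMPLE_HTML = (
    "<html><body>"
    "<div><div>a</div></div><div>b</div><span>c</span>"
    "</body></html>"
)


def count_tags_by_parser(markup, tag, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number_of_tags}; None marks a parser that is not installed."""
    counts = {}
    for name in parsers:
        try:
            # Naming the parser explicitly avoids the automatic selection described above.
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            counts[name] = None
            continue
        counts[name] = len(soup.find_all(tag))
    return counts


if __name__ == "__main__":
    for name, count in count_tags_by_parser(SAMPLE_HTML, "div").items():
        print(name, count)
```

On well-formed markup like the sample, the installed parsers should normally agree. The differences show up on messy real-world pages, which is why passing the parser name explicitly keeps results reproducible across machines.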
