Question

I'm having a problem with the bs4 package.

I have an HTML document like this:

data = """<html><head></head><body>
<p> this is tab </p>
<img src="image.jpg">
</body></html>
"""

Here is my code:

from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html5lib')
soup.find_all("a")

When I run it, bs4 gets stuck in a loop and never returns anything, possibly because the a tag does not exist in some of the HTML data.

Thanks a lot.
1. Yes, the example above works fine. In my case, however, data is a variable holding a multi-line HTML string:

from bs4 import BeautifulSoup
data = open("file.htm").read()
soup = BeautifulSoup(data, 'html5lib')
soup.find_all("a")

3. Please test with my file: file.htm
I am using beautifulsoup4 == 4.4.1 and Python 3.5.1.
Thanks again.

Answer 1

Try the built-in html.parser; it can even handle invalid HTML.

from bs4 import BeautifulSoup

data = """<html><head></head><body>
<p> this is tab </p>
<img src="image.jpg">
</body></html>
"""

soup = BeautifulSoup(data, 'html.parser')
soup.find_all("a")

Answer 2

I don't see why your program would hang when using find_all. If the HTML page is large, it may take a while, but it probably should not hang.

Here are some things you can try:

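If you are downloading the web page before parsing it, the download may be what is hanging. Use pdb to find exactly where your program gets stuck: add the line import pdb; pdb.set_trace() at the beginning of your code and step through from there.
Make sure html5lib is installed by running pip freeze | grep html5lib; if it is missing, install it with pip install html5lib.
In a similar SO question, someone mentioned that they fixed the problem by upgrading BeautifulSoup; try pip install --upgrade beautifulsoup4.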
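
As a lighter alternative to stepping through with pdb, one rough way to narrow down the slow stage is to time the read, the parse, and the find_all call separately. This is only a sketch under assumptions: timed and diagnose are hypothetical helper names (not part of BeautifulSoup), and the html.parser default is just a placeholder choice.

```python
import time


def timed(label, func, *args, **kwargs):
    """Call func, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print("{}: {:.3f}s".format(label, elapsed))
    return result, elapsed


def diagnose(path, parser="html.parser"):
    """Time each stage separately (hypothetical helper, not a bs4 API)."""
    # Imported here so that timed() above depends only on the standard library.
    from bs4 import BeautifulSoup

    with open(path, encoding="utf-8", errors="replace") as f:
        data, _ = timed("read", f.read)
    soup, _ = timed("parse", BeautifulSoup, data, parser)
    links, _ = timed("find_all", soup.find_all, "a")
    return links
```

Calling diagnose("file.htm") prints one line per completed stage, so if the program stalls, the last label printed tells you which stage finished and therefore which one is still running.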

The BeautifulSoup documentation recommends specific parsers for certain Python versions:

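If you can, I recommend you install and use lxml for speed. If you're using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it's essential that you install lxml or html5lib: Python's built-in HTML parser is just not very good in older versions.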
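
To act on that recommendation without hard-coding a parser, a minimal sketch could pick whichever parser is actually installed. pick_parser and PREFERRED are hypothetical names, and the preference order (lxml first, the built-in html.parser as the last resort) is an assumption for illustration, so reorder it to taste.

```python
from importlib.util import find_spec

# Assumed preference order; html.parser ships with Python, so it is the fallback.
PREFERRED = ("lxml", "html5lib", "html.parser")


def pick_parser(preferred=PREFERRED):
    """Return the first parser name whose module can be found."""
    for name in preferred:
        # find_spec returns None when a top-level module is not installed.
        if find_spec(name) is not None:
            return name
    raise RuntimeError("no HTML parser available")
```

The returned name can be passed straight through, for example BeautifulSoup(data, pick_parser()). Keep in mind that different parsers can build different trees from invalid HTML, so results may vary between machines with different parsers installed.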

BeautifulSoup hangs when using find
