
Parsing an unbalanced HTML file with Beautiful Soup 4

Time: 2017-01-23 18:24:13

Tags: python html beautifulsoup

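I am parsing partial HTML files whose tags are not balanced.

Suppose the first line is missing from one of these partial files. Can Beautiful Soup still parse the rest of the file, and can I still extract the contents of the different tags?

Thank you very much for your help.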


1 answer:

Answer 0 (score: 0)

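Yes. Use one of the more capable parsers, lxml or html5lib (html5lib is more robust, but slower). Each parser repairs the broken markup in its own way, so the results will differ: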

from bs4 import BeautifulSoup
# lxml: the stray leading text ends up wrapped in <p> inside <body>
soup = BeautifulSoup(open('foo.html'), 'lxml')
#<html><body><p>Example Domain   <!-- <====missing tag in this line -->
#<meta charset="utf-8"/>

# html5lib: adds an empty <head> and puts the stray text directly in <body>
soup = BeautifulSoup(open('foo.html'), 'html5lib')
#<html><head></head><body>Example Domain   <!-- <====missing tag in this line -->
#
#<meta charset="utf-8"/>
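
To address the second half of the question, here is a minimal sketch of extracting tag contents from such a fragment. The fragment text is a hypothetical stand-in for foo.html with its first line cut off, and the built-in 'html.parser' is used only so the example runs without installing lxml or html5lib; that choice is an assumption, not part of the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the opening <html><head><title> line is missing,
# so the text starts mid-element and ends with closing tags that have no opener.
fragment = """Example Domain</title>
<meta charset="utf-8"/>
</head>
<body>
<div><h1>Example Domain</h1>
<p>This domain is for use in illustrative examples.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
"""

# html.parser ignores closing tags that were never opened,
# so the well-formed part of the fragment is still parsed normally.
soup = BeautifulSoup(fragment, "html.parser")

# Extract the inner text and attributes of the tags that survived.
heading = soup.find("h1").get_text(strip=True)
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
links = [a["href"] for a in soup.find_all("a", href=True)]

print(heading)
print(paragraphs)
print(links)
```

Tags that are complete inside the fragment can be queried as usual. Only the content around the missing tag is placed differently in the tree, depending on the parser.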